Non-destructive identification of the geographical origin of red jujube by near-infrared spectroscopy and fuzzy clustering methods

ABSTRACT The quality of red jujube is closely associated with its place of origin. To identify the geographical origin of red jujube quickly and easily, the near-infrared reflectance (NIR) spectra of red jujube samples were classified using several fuzzy clustering methods in combination with principal component analysis (PCA) and linear discriminant analysis (LDA). Firstly, four varieties of red jujube samples were collected from four representative producing areas in four provinces of China: Gansu, Henan, Shanxi and Xinjiang. Each variety corresponded to one producing area and comprised 60 samples, giving 240 samples in total. The NIR spectra of the samples were acquired using a NIR-M-R2 portable near-infrared spectrometer, and the raw spectra were preprocessed by Savitzky-Golay (SG) filtering. Secondly, PCA and LDA were used for dimension reduction and feature extraction, respectively. Finally, the red jujube samples were classified by fuzzy C-means (FCM) clustering, Gustafson-Kessel (GK) clustering and possibilistic fuzzy C-means (PFCM) clustering. GK clustering achieved the highest clustering accuracy, 98.8%. The experimental results indicate that GK clustering combined with near-infrared spectroscopy is effective for identifying the place of origin of red jujube.


Introduction
Red jujube is part of China's long cultural heritage and carries rich cultural significance. It is valued for its high nutritional content and has certain therapeutic and health care effects, such as anti-oxidation, anti-tumor, anti-fatigue, hypoglycemic, lipid-lowering, liver-protective and immune-regulating effects. [1] The use of red jujube in China can be traced back thousands of years. As early as the Spring and Autumn Period and the Warring States Period, red jujube was regarded as a precious tonic food and was widely used. In ancient medical theory, red jujube is considered to nourish and strengthen the body and to treat some diseases. Therefore, red jujube is also known as "the first fruit of China." In addition to its medical value, red jujube plays an important role in Chinese culture. In traditional festivals and on important occasions, jujube is often used to prepare traditional foods such as jujube cake and jujube paste cake. Red jujube is regarded as a symbol of harvest, happiness and auspiciousness, and it is often given as a gift to relatives and friends to wish them health and a better future.
In recent years, as living standards have improved, people's expectations for the quality and quantity of red jujube have increased, which has led to unprecedented consumer demand, drastic changes in product prices and increased risks. However, due to the constraints of resources, environment and planting technology, the demand for a great variety of red jujube can sometimes lead to a shortage of supply. Foods come in a wide variety, have complex components and may contain many kinds of adulterants, and most of them are close in appearance, composition or physicochemical properties; nevertheless, red jujube samples from different production areas often have distinct taste and nutritive value. [2] The usual detection methods for determining the quality and type of red jujube are cumbersome and inefficient. Driven by economic interests, some merchants sell counterfeit products, and consumers pay a high price without getting good-quality jujube. Therefore, how to quickly and easily identify the red jujube variety is of great significance. [3] The traditional jujube identification method relies on manual grading, which results in high costs and low quality, [4] and the accurate classification of fruit accessions in processing plants and during postharvest operations is a challenge that has been widely studied. [5] Moreover, the process consumes considerable manpower and material resources.
In recent years, researchers in China and abroad have actively studied the identification of red jujube varieties. Classifying by appearance, Al-Saif et al. established an artificial neural network (ANN) classifier based on the color and morphological attributes of single Indian jujube fruits to identify different varieties of Indian jujube, with an overall classification accuracy of 98.39%. [3] Meng et al. established a deep convolutional neural network model that could recognize jujube varieties in their natural state by learning contrast differences from jujube images. To further facilitate the research, they built a dataset of 20 jujube varieties in natural settings. These jujube images varied greatly in angle, background and illumination conditions, and the average accuracy of the proposed model on this dataset reached 84.16%. [6] In addition to classification by physical appearance, some researchers have distinguished jujube types according to chemical composition. The use of SNP markers for parental verification has achieved remarkable results. [7] Hyperspectral imaging (HSI) technology was used to collect the hyperspectral data and spatial feature images of jujube samples as the original data, and a new method was proposed to select the feature wavelengths and simplify the model. Compared with the models before optimization, the new model reached an accuracy of 96.68% in identifying jujube varieties on the same training and test sets. [8]
[13][14][15] For example, four kinds of tea samples were scanned by near-infrared spectroscopy (NIRS), and the spectra were then processed by principal component analysis (PCA) and linear discriminant analysis (LDA). Finally, they were classified by possibilistic fuzzy discriminant C-means clustering with an accuracy of 98.84%. [12] [18][19][20][21][22][23][24][25][26][27] For instance, dried Hami jujube was scanned with a visible and NIR spectrometer to collect diffuse reflection spectra for detecting starch-head and mildewed fruit. [28] The NIR spectra of four varieties of apple samples were collected with an Antaris II FT-NIR spectrophotometer. [29] Visible and NIR spectroscopy was used to detect maltodextrin and soybean protein isolate (SPI). [30] NIRS was coupled with pattern recognition methods to classify the adulterants in ginger powder (GP). [31]
Fuzzy pattern recognition (FPR) is an artificial intelligence technology related to fuzzy reasoning and fuzzy logic, and it is a method for dealing with fuzzy information. Compared with traditional binary logic, fuzzy logic handles uncertain or fuzzy information better and can make more reasonable inferences and judgments from it. At present, FPR is applied extensively in classification, image processing, control systems, decision support, and so on. For example, fuzzy uncorrelated discriminant transformation (FUDT) coupled with a portable NIR spectrometer was used to build a classification system for identifying the geographical origin of milk. [32] Li et al. designed a fire alarm model based on a fuzzy inference system, which could effectively reduce the false alarm rate of the system. [33] Wen et al. proposed an integrated evaluation approach for the safety action of coal and gas outburst (CFRD) based on the regression relation method and the FPR model. [34] Ding et al. used a fuzzy recognition algorithm to judge the integrated evaluation rank of dam safety. [35] Chen et al. designed a single IT2 FLS based on the Nagar Bardini (NB) structure for FPR, and the IT2 FLS obtained better generalization ability on fuzzy recognition problems. [36]
Fuzzy C-means (FCM) clustering is a fuzzy clustering algorithm based on fuzzy set theory. Compared with the traditional C-means clustering algorithm, FCM does not strictly allocate each data point to one cluster but uses the membership degree to represent the degree to which each data point belongs to each cluster. Barzegaran et al. used k-means and FCM for data clustering to obtain the international roughness index (IRI) range based on pavement conditions. [37] The advantage of FCM is that it can handle very large and high-dimensional data sets, and its clustering results are more interpretable because they express the degree to which data points belong to each cluster. However, FCM also has some disadvantages, such as sensitivity to noisy data and outliers, the need to define the number of clusters in advance, and different clustering results for different initial cluster centers. [38,39] Compared with FCM, the possibilistic fuzzy C-means (PFCM) clustering algorithm is more robust to noisy data and outliers and can better deal with uncertainty. [39] However, the computational complexity of PFCM is higher than that of FCM, and two parameters (weighting the membership degree and the possibility measure) need to be adjusted. Therefore, the parameter values need to be carefully selected in practical applications.
In this study, the raw NIR data of red jujube were collected by a portable NIR spectrometer. After the spectral data were preprocessed by SG filtering, features were extracted by PCA and LDA. Finally, fuzzy C-means (FCM) clustering, Gustafson-Kessel (GK) clustering and possibilistic fuzzy C-means (PFCM) clustering were each used to classify the sample data, and the classification results were compared and analyzed.

Sample preparation
In this study, experimental samples of four varieties of red jujube were selected, originating from four provinces of China: Gansu, Henan, Shanxi, and Xinjiang. The red jujubes from these four regions are common and popular varieties in China. After picking, the red jujubes underwent a series of treatments: selection, cleaning, sterilization, mixed drying, natural cooling, secondary drying, pit removal and packaging. Each step had specific operation and parameter requirements to ensure the quality of the experimental samples. Each production area was represented by one variety, and 60 samples were selected for each variety, giving 240 samples in total. The samples of each variety were then randomly divided into 39 training samples and 21 test samples. In addition, the selected samples were required to have approximately the same size (length: 3-5 cm, width: 2-3 cm), weight (10-20 g) and time of maturity (September and October). The experimenters also ensured that the surface of each red jujube was clean and free from obvious defects. Unifying these appearance factors helps to eliminate the influence of non-target differences between samples on the experimental results, so that the ability of near-infrared spectroscopy to recognize red jujube can be evaluated.
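The per-variety random split described above can be sketched as follows. This is a minimal illustration in Python rather than the authors' Matlab code; the function name `split_per_variety`, the use of integer sample indices and the fixed seed are assumptions made only so that the sketch is reproducible.

```python
import random

VARIETIES = ["Gansu", "Henan", "Shanxi", "Xinjiang"]

def split_per_variety(samples_per_variety=60, n_train=39, seed=0):
    """Randomly split the sample indices of each variety into training and test sets."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for reproducibility
    split = {}
    for variety in VARIETIES:
        indices = list(range(samples_per_variety))
        rng.shuffle(indices)
        split[variety] = {
            "train": sorted(indices[:n_train]),
            "test": sorted(indices[n_train:]),
        }
    return split

split = split_per_variety()
n_train_total = sum(len(s["train"]) for s in split.values())
n_test_total = sum(len(s["test"]) for s in split.values())
print(n_train_total, n_test_total)  # 4 x 39 training and 4 x 21 test samples
```

Splitting within each variety, rather than over the pooled 240 samples, keeps the class proportions identical in the training and test sets.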

Spectral acquisition
The spectrometer used for spectral acquisition was the NIR-M-R2, a portable near-infrared spectrometer (Shenzhen Spectrum Research Interconnection Technology Co., Ltd.). The spectrometer scanned the surface of each red jujube to collect the NIR spectral data. The wavelength range of the spectrometer was 900-1700 nm (11100-5880 cm⁻¹); the wavelength precision was ±1 nm; the signal-to-noise ratio was 6000:1; the slit size was 1.8 × 0.025 mm; and the optical resolution was 10 nm. During NIR collection, a temperature of around 25°C and a relative humidity of 50%-60% were recommended. Before spectral data are collected, the spectrometer should be preheated for 1 hour. Each sample was scanned along the equatorial direction, which reduces instability during scanning and yields more accurate spectra; each spectrum was 228-dimensional. In addition, all the algorithms in this study were implemented in Matlab R2020b (The MathWorks, Natick, MA, USA).
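The two ways of stating the spectral range (wavelength in nm and wavenumber in cm⁻¹) are related by a simple reciprocal conversion. The short sketch below, with a hypothetical helper name, shows that the wavenumber limits quoted above are the rounded counterparts of 900 nm and 1700 nm.

```python
def wavelength_nm_to_wavenumber_cm1(wavelength_nm):
    """Convert a wavelength in nm to a wavenumber in cm^-1 (1 cm = 1e7 nm)."""
    return 1e7 / wavelength_nm

for nm in (900, 1700):
    # Prints the wavenumber that corresponds to each end of the wavelength range.
    print(nm, "nm ->", round(wavelength_nm_to_wavenumber_cm1(nm)), "cm^-1")
```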

Data analysis methods
Principal Component Analysis: The original near-infrared spectra have 228 dimensions and contain some redundant information. If the raw spectral data are not processed properly, later classification becomes more difficult and the identification accuracy is reduced. Therefore, to obtain the most effective information from the 228-dimensional data, it is necessary to reduce the dimension of the spectra and to find the eigenvectors that directly reflect the differences between near-infrared spectra. Principal component analysis (PCA) is a dimension reduction method that transforms sample data into a new feature space while retaining most of the original information. PCA retains as much of the information in the near-infrared spectra as possible by selecting the eigenvectors with the largest eigenvalues. Therefore, in this study, PCA was used to reduce the spectral dimension.
The covariance matrix describes the linear relationship between data features. For a given data set X, where each column represents a feature, the elements of the covariance matrix C can be calculated by the following equation:

$$C_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij}-\mu_j\right)\left(x_{ik}-\mu_k\right) \tag{1}$$

where n is the number of samples, $x_{ij}$ is the value of the jth feature of the ith sample, and $\mu_j$ is the mean of the jth column of X. The eigenvectors and eigenvalues of PCA are computed by eigen-decomposition of the covariance matrix C. PCA completes the data transformation by projecting the original data set X onto the selected eigenvectors.
Linear Discriminant Analysis: Linear discriminant analysis (LDA) is a classical linear feature extraction method. [40,41] The basic idea of LDA is to project the data onto a straight line or a hyperplane so that the projections of samples from the same class are as close as possible, while the projections of samples from different classes are as far apart as possible. LDA is a supervised learning method that requires the label information of known samples. [40] Specifically, LDA first calculates the mean vector of each class and the overall mean vector, then calculates the within-class scatter matrix and the between-class scatter matrix, and finally obtains the projection vectors through matrix operations; these vectors are used to map high-dimensional data to a low-dimensional space. [43][44][45][46][47] LDA aims to solve the following optimization problem:

$$W = \arg\max_{W}\frac{\left|W^{T} S_b W\right|}{\left|W^{T} S_w W\right|} \tag{2}$$

where W is a matrix formed by eigenvectors, $S_w$ is the within-class scatter matrix and $S_b$ is the between-class scatter matrix. To solve equation (2), the eigen-decomposition $S_w^{-1} S_b = V D V^{T}$ is calculated, where V is the eigenvector matrix and D is the diagonal matrix whose diagonal elements are the eigenvalues. The original data set X is projected onto the selected eigenvectors to obtain the projected data.
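The LDA steps listed above (class means, overall mean, within-class and between-class scatter matrices) can be made concrete with a small sketch. The code below is plain Python on a hypothetical two-class, two-feature toy data set; it is not the authors' implementation, and it evaluates the one-dimensional Fisher ratio $w^{T} S_b w / w^{T} S_w w$ for candidate directions instead of performing the full eigen-decomposition.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def add_scaled(A, B, scale=1.0):
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter_matrices(classes):
    """Return (S_w, S_b) for a dict {label: list of feature vectors}."""
    dim = len(next(iter(classes.values()))[0])
    all_rows = [r for rows in classes.values() for r in rows]
    overall = mean_vector(all_rows)
    S_w = [[0.0] * dim for _ in range(dim)]
    S_b = [[0.0] * dim for _ in range(dim)]
    for rows in classes.values():
        mu = mean_vector(rows)
        for r in rows:
            d = [x - m for x, m in zip(r, mu)]
            S_w = add_scaled(S_w, outer(d, d))  # spread of samples around their class mean
        d = [m - o for m, o in zip(mu, overall)]
        S_b = add_scaled(S_b, outer(d, d), scale=len(rows))  # spread of class means
    return S_w, S_b

def fisher_ratio(w, S_w, S_b):
    """Between-class over within-class scatter along direction w."""
    def quad(S):
        return sum(w[i] * S[i][j] * w[j] for i in range(len(w)) for j in range(len(w)))
    return quad(S_b) / quad(S_w)

# Hypothetical toy data: two classes, two features.
toy = {
    "A": [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]],
    "B": [[6.0, 2.0], [7.0, 1.0], [8.0, 3.0]],
}
S_w, S_b = scatter_matrices(toy)
for direction in ([1.0, 0.0], [0.0, 1.0], [2.0, -1.0]):
    print(direction, fisher_ratio(direction, S_w, S_b))
```

For two classes the maximizing direction is proportional to $S_w^{-1}(m_B - m_A)$, which for this toy data is the direction (2, -1); it scores higher than either coordinate axis.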
Fuzzy C-means Clustering: FCM is a well-known fuzzy clustering method. Its key feature is that the data are partitioned by fuzzy sets, so that the membership degree of each data point to each cluster center lies in [0, 1]. Given c cluster centers, FCM minimizes the following objective function [48]:

$$J_{FCM} = \sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{m}\,\lVert x_k - v_i\rVert^{2}$$

where n is the number of data points, $u_{ik}$ is the fuzzy membership value of the kth data point $x_k$ to the ith cluster center $v_i$, and m is the fuzzy factor (usually greater than 1). The final fuzzy membership values and cluster centers are obtained after fuzzy clustering.
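The FCM objective is usually minimized by alternating two updates: memberships from distances, then centers from memberships. The sketch below uses the standard update formulas, which are not written out in this paper, so it should be read as an assumption about the procedure rather than the authors' Matlab code. The stopping rule (largest change of any center coordinate below `tol`) is also an assumption; the toy data are hypothetical.

```python
def sq_dist(x, v):
    return sum((a - b) ** 2 for a, b in zip(x, v))

def update_memberships(X, V, m=2.0):
    """U[i][k]: membership of point k in cluster i (standard FCM update)."""
    U = [[0.0] * len(X) for _ in V]
    for k, x in enumerate(X):
        d2 = [sq_dist(x, v) for v in V]
        zeros = [i for i, d in enumerate(d2) if d == 0.0]
        if zeros:
            # The point coincides with a center: share full membership among such centers.
            for i in zeros:
                U[i][k] = 1.0 / len(zeros)
            continue
        for i in range(len(V)):
            U[i][k] = 1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(len(V)))
    return U

def update_centers(X, U, m=2.0):
    """Each center is the mean of the data weighted by u_ik^m."""
    dim = len(X[0])
    V = []
    for u_i in U:
        w = [u ** m for u in u_i]
        total = sum(w)
        V.append([sum(wk * x[j] for wk, x in zip(w, X)) / total for j in range(dim)])
    return V

def fcm(X, V0, m=2.0, max_iter=100, tol=1e-5):
    V = [list(v) for v in V0]
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        U = update_memberships(X, V, m)
        V_new = update_centers(X, U, m)
        shift = max(abs(a - b) for v, vn in zip(V, V_new) for a, b in zip(v, vn))
        V = V_new
        if shift < tol:  # assumed convergence criterion
            break
    return update_memberships(X, V, m), V, n_iter

# Hypothetical one-dimensional toy data with two obvious groups.
X = [[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]]
V0 = [[1.0], [4.0]]
U, V, n_iter = fcm(X, V0)
print(V, n_iter)
```

Because the memberships of each point are normalized over the clusters, every column of `U` sums to 1, which is the constraint that the possibilistic variants relax.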
Gustafson-Kessel Clustering: The Gustafson-Kessel (GK) clustering algorithm implements clustering by assigning data points to different prototypes (i.e., cluster centers). The advantage of the GK algorithm is that it adopts an adaptive distance measure based on the fuzzy covariance matrix of each cluster, and it performs well on hyperellipsoidal data. By computing the covariance matrices, GK clustering can adaptively partition data sets with varying geometric structures. However, the GK algorithm requires appropriate initial values and the number of clusters to be given in advance.
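The adaptive distance is described above only in words. In the standard formulation of the GK algorithm (not reproduced in this paper, so the notation here is an assumption), each cluster i has a fuzzy covariance matrix $F_i$ that induces its own norm:

$$
\begin{aligned}
F_i &= \frac{\sum_{k=1}^{n} u_{ik}^{m}\,(x_k - v_i)(x_k - v_i)^{T}}{\sum_{k=1}^{n} u_{ik}^{m}},\\
A_i &= \left[\rho_i \det(F_i)\right]^{1/p} F_i^{-1},\\
D_{ik}^{2} &= (x_k - v_i)^{T} A_i\,(x_k - v_i),
\end{aligned}
$$

where p is the number of features, $\rho_i$ is the cluster volume (commonly set to 1), and $u_{ik}$, $v_i$ and m have the same meaning as in FCM. Replacing the Euclidean distance of FCM with $D_{ik}^{2}$ lets each cluster stretch along its own principal directions, which is why GK suits hyperellipsoidal clusters.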
Possibilistic Fuzzy C-means Clustering: FCM is subject to a probabilistic constraint: the membership values of one sample across all clusters must sum to 1. Memberships obtained under this constraint differ from the intuitive notion of membership and do not really represent the degree to which a sample belongs to a class. An outlier or noise point, even though it contributes nothing to clustering, may still receive a large membership degree, resulting in clustering errors. [49] Possibilistic C-means (PCM) clustering uses the typicality of each sample as its clustering result and removes the probabilistic constraint, so it can distinguish noise and overcome this defect of FCM. However, PCM is very sensitive to the initial cluster centers and often produces coincident clusters. [49] To overcome the shortcomings of FCM and PCM, PFCM combines the two and provides both the membership degree and the typicality. The objective function of PFCM is as follows [39]:

$$J_{PFCM} = \sum_{i=1}^{c}\sum_{k=1}^{n}\left(a\,u_{ik}^{m} + b\,t_{ik}^{q}\right)\lVert x_k - v_i\rVert^{2} + \sum_{i=1}^{c}\eta_i\sum_{k=1}^{n}\left(1 - t_{ik}\right)^{q}$$

where $t_{ik}$ is the typicality value of the kth data point $x_k$ with respect to the ith cluster center $v_i$, q is the weighting exponent of the typicality, and $\eta_i$ can be calculated as follows [49]:

$$\eta_i = K\,\frac{\sum_{k=1}^{n} u_{ik}^{m}\,\lVert x_k - v_i\rVert^{2}}{\sum_{k=1}^{n} u_{ik}^{m}}$$

Generally, K = 1. The parameters a and b characterize the influence of the membership value and the typicality value, respectively. If a is larger than b, the calculation of the cluster centers is influenced more by the membership values, and the algorithm is less sensitive to the cluster centers. Conversely, if b is larger than a, the calculation of the cluster centers is affected more by the typicality values, and PFCM has stronger resistance to noise.
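The difference between membership and typicality for an outlier can be illustrated numerically. The sketch below compares the standard FCM membership with the typicality update commonly given for PFCM, $t_{ik} = 1 / \left(1 + (b\,d_{ik}^{2}/\eta_i)^{1/(q-1)}\right)$. That formula, the exponent name `q`, the centers and the values of `eta` are assumptions for illustration and are not taken from this paper.

```python
def sq_dist(x, v):
    return sum((a - b) ** 2 for a, b in zip(x, v))

def fcm_memberships(x, V, m=2.0):
    """FCM memberships of one point; they always sum to 1 over the clusters."""
    d2 = [sq_dist(x, v) for v in V]
    return [1.0 / sum((d2[i] / d2[j]) ** (1.0 / (m - 1.0)) for j in range(len(V)))
            for i in range(len(V))]

def pfcm_typicalities(x, V, eta, b=1.0, q=2.0):
    """Typicality of one point for each cluster; no sum-to-one constraint."""
    return [1.0 / (1.0 + (b * sq_dist(x, v) / e) ** (1.0 / (q - 1.0)))
            for v, e in zip(V, eta)]

V = [[0.0, 0.0], [4.0, 0.0]]   # hypothetical cluster centers
eta = [1.0, 1.0]               # hypothetical penalty parameters
inlier = [0.5, 0.0]            # close to the first center
outlier = [2.0, 50.0]          # far from, and equidistant to, both centers

for name, point in (("inlier", inlier), ("outlier", outlier)):
    print(name, fcm_memberships(point, V), pfcm_typicalities(point, V, eta))
```

The outlier receives a membership of 0.5 in each cluster because memberships must sum to 1, whereas its typicality is close to 0 for both clusters, so it has little influence on the typicality-weighted part of the center update.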

Spectral analysis
[52] The raw near-infrared spectra are shown in Figure 1a. As illustrated in Figure 1a, the near-infrared spectra of the red jujube samples displayed two prominent peaks, one located at 1180 nm and the other at 1430 nm. The peak at 1180 nm is generated by the first and second overtones of the C-H stretching vibration, which is associated with the presence of protein-like compounds. [53] Additionally, beyond 1270 nm, there was a notable change in the absorbance of all jujube samples, mainly due to the absorption of O-H and water. [54] In the region of 900-1270 nm, the absorbance of the jujube samples was low. Above 1270 nm, the absorbance began to increase sharply and reached a peak at 1420 nm. In the region of 1400-1450 nm, the absorbance was quite high. Above 1450 nm, the absorbance began to decrease; it dropped sharply at 1630 nm and reached its valley at 1680 nm. This is because the absorption of O-H is related to the absorption of water. The spectra show obvious baseline drift, since the peak heights of the curves differ considerably.

Spectral pretreatment
The original spectral data of red jujube are easily affected by the physical properties of the sample, so the 228-dimensional data contain some noise and unnecessary information. [48] To eliminate this noise and unnecessary information, a Savitzky-Golay (SG) smoothing filter was applied to the NIR spectra using the MATLAB function y = sgolayfilt(x, k, f). The SG smoothing filter can reduce noise while retaining most of the information in the spectra. Moreover, compared with other pretreatment methods, the SG method is more flexible and more widely applicable. [55] In the experiment, the polynomial order was set to 3 and the frame size to 53. Figure 1b shows the NIR spectra after SG smoothing. The spectra become very smooth, and the peaks and troughs are more obvious. Figures 1c and 1d show the raw mean NIR spectra and the mean NIR spectra after SG smoothing, respectively. The mean NIR spectra clearly reveal the differences among the red jujube varieties.
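To show what an SG smoothing filter with polynomial order 3 and frame size 53 computes, the sketch below builds the filter weights from the closed-form expression for quadratic/cubic SG smoothing and applies them to the interior points of a spectrum. It is an illustration in plain Python, not a reimplementation of MATLAB's `sgolayfilt`: the function names are hypothetical, and the first and last 26 points are simply copied unchanged here, whereas `sgolayfilt` treats the edges with its own polynomial fits.

```python
def sg_cubic_coefficients(frame_size):
    """Smoothing weights of a Savitzky-Golay filter for polynomial order 2 or 3."""
    if frame_size % 2 == 0 or frame_size < 5:
        raise ValueError("frame_size must be odd and at least 5")
    m = frame_size // 2
    denom = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [3.0 * (3 * m * m + 3 * m - 1 - 5 * i * i) / denom for i in range(-m, m + 1)]

def sg_smooth(signal, frame_size=53):
    """Smooth interior points; edge points are copied unchanged (a simplification)."""
    c = sg_cubic_coefficients(frame_size)
    m = frame_size // 2
    out = list(signal)
    for j in range(m, len(signal) - m):
        out[j] = sum(ci * signal[j + i] for ci, i in zip(c, range(-m, m + 1)))
    return out

# A 5-point frame reproduces the classic weights (-3, 12, 17, 12, -3) / 35.
print([round(35 * w, 6) for w in sg_cubic_coefficients(5)])

# Hypothetical 228-point "spectrum": a cubic trend, which the filter leaves unchanged.
trend = [(0.01 * k) ** 3 - 0.2 * k for k in range(228)]
smoothed = sg_smooth(trend, frame_size=53)
```

Because the weights fit a local cubic by least squares, any cubic trend passes through the interior of the filter unchanged while higher-frequency noise is averaged out; this is the sense in which SG smoothing retains most of the spectral information.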
If the data set X has p variables, the cumulative contribution rate of the first m (m ≤ p) PCs is as follows:

$$\text{cumulative contribution rate} = \frac{\sum_{i=1}^{m}\lambda_i}{\sum_{i=1}^{p}\lambda_i}\times 100\%$$

where $\lambda_i$ is the ith largest eigenvalue. Generally, m is chosen so that the cumulative contribution rate exceeds 85%, and the first, second, ..., mth principal components correspond to the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$. The classification accuracy is defined as (number of correctly classified test samples / number of all test samples) × 100%.
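Both quantities defined above are simple ratios, and a short sketch makes the arithmetic explicit. The eigenvalues in the example are hypothetical; the sample counts used for the accuracy come from the results reported in this paper (84 test samples, with one and five misclassifications).

```python
def components_needed(eigenvalues, threshold=0.85):
    """Smallest m whose cumulative contribution rate exceeds the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, lam in enumerate(ordered, start=1):
        running += lam
        if running / total > threshold:
            return m, running / total
    return len(ordered), 1.0

def accuracy_percent(n_correct, n_test):
    """Classification accuracy as a percentage of the test samples."""
    return 100.0 * n_correct / n_test

eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]  # hypothetical eigenvalues
print(components_needed(eigenvalues))

print(round(accuracy_percent(83, 84), 1))      # one misclassified test sample
print(round(accuracy_percent(84 - 5, 84), 1))  # five misclassified test samples
```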
LDA: Although the classification effect of PCA is obvious, some red jujube samples cannot be well identified. After PCA reduced the spectral data to 10 dimensions, the samples of each variety were randomly divided into 21 test samples and 39 training samples. LDA extracted discriminant information from the data, and the test samples were then projected onto the LDA feature space.
Figure 3 illustrates the LDA score plot of the three discriminant vectors, in which the red jujube samples can be distinguished well. Then, the mean value of each variety of training samples was computed and served as the initial clustering centers of the subsequent FCM and PFCM clustering. The initial clustering centers were:

Classification with FCM
The number of clustering centers was 4, and the initial clustering centers were those described above. The remaining experimental conditions were as follows: the weighting exponent m of the fuzzy partition matrix was 2, the maximum number of iterations was 100, and the allowable error at the end of the iteration was 0.00001. After FCM clustering, the final clustering centers were:

$$V_{FCM} = \begin{bmatrix} -0.0104 & 0.0099 & 0.0087 \\ -0.0101 & -0.0080 & -0.0116 \\ 0.0032 & -0.0163 & 0.0071 \\ 0.0379 & 0.0250 & -0.0006 \end{bmatrix}$$

The fuzzy membership diagram of FCM is shown in Figure 4. The horizontal axis represents the kth sample, while the vertical axis denotes the fuzzy membership value. Since four varieties of red jujube were used in the experiment, there are four subplots, each corresponding to one red jujube variety. If the membership value $u_{ik}$ exceeds the 0.5 threshold, the sample $x_k$ is assigned to the ith class of red jujube; alternatively, if $u_{ik}$ is the largest of the membership values of the kth sample over all classes, the kth sample is assigned to the ith class.
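Two steps of this procedure are easy to express in code: taking the class means of the training samples as initial cluster centers, and assigning each test sample to the cluster with the largest membership. The sketch below uses hypothetical two-dimensional data and a hypothetical membership matrix; the function names are illustrative only.

```python
def class_means(train_X, train_y):
    """Mean feature vector of each class, usable as initial cluster centers."""
    sums, counts = {}, {}
    for x, y in zip(train_X, train_y):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def assign_by_max_membership(U):
    """U[i][k] is the membership of sample k in cluster i; return the best cluster per sample."""
    n_samples = len(U[0])
    return [max(range(len(U)), key=lambda i: U[i][k]) for k in range(n_samples)]

# Hypothetical training data for two origins.
train_X = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
train_y = ["Gansu", "Gansu", "Henan", "Henan"]
print(class_means(train_X, train_y))

# Hypothetical membership matrix: 2 clusters x 3 samples.
U = [[0.9, 0.2, 0.6],
     [0.1, 0.8, 0.4]]
print(assign_by_max_membership(U))
```

With four clusters, the largest membership of a sample can fall below 0.5, in which case the threshold rule assigns no class; the maximum-membership rule always yields a label, so the two rules agree only when some membership exceeds 0.5.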

Classification with PFCM
In this experiment, the number of clusters was 4, and the membership value and the typicality value were given the same influence, that is, a = 1 and b = 1. The initial clustering centers were $V^{(0)}$; the maximum number of iterations was 100; and the allowable error at the end of the iteration was 0.00001. After PFCM ran to termination, the cluster centers were: In total, five samples were misclassified, and the clustering accuracy was 94.0%.

Clustering results
Table 1 shows the classification results, including the number of misclassifications, the number of iterations to convergence and the clustering accuracy. The clustering accuracy of GK was the highest among the three fuzzy clustering methods, with a value of 98.8%. Both the fuzzy membership degree and the typicality value of PFCM could be applied to classify samples, and the clustering accuracy obtained from the typicality value of PFCM was lower than that of FCM, GK and the fuzzy membership degree of PFCM. FCM needed fewer iterations to converge than GK and PFCM, which means that FCM converged faster. K-nearest neighbor (KNN) was also run for classification with the parameter K = 1, and its classification accuracy was 72.5%, which was lower than that of FCM, GK and PFCM.
Table 2 displays the clustering accuracies of FCM, GK and PFCM under the preprocessing methods of standard normal variate (SNV), multiplicative scatter correction (MSC), SG, SNV+MSC, SNV+SG, and MSC+SG. FCM achieved its highest accuracy using SG and its lowest accuracy using MSC or SNV+MSC. Among the three fuzzy clustering algorithms, GK had the highest accuracy with every preprocessing method. When the preprocessing method was SNV+SG, the accuracy of GK was 100%. For all three fuzzy clustering algorithms, the accuracies were high when the preprocessing method was SG, SNV+SG or MSC+SG, and low when it was SNV, MSC or SNV+MSC.
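For readers unfamiliar with the scatter-correction methods compared in Table 2, the sketch below gives common textbook forms of SNV and MSC in plain Python. The paper does not state which variants were used, so the details (sample standard deviation in SNV, the mean spectrum as the MSC reference) are assumptions, and the spectra are hypothetical.

```python
import math

def snv(spectrum):
    """Standard normal variate: center one spectrum and scale it by its own standard deviation."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]

def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum of the set."""
    n_points = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_points)]
    ref_mean = sum(ref) / n_points
    ref_var = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_points
        # Least-squares fit s ~ intercept + slope * ref, then undo offset and scaling.
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / ref_var
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected

# Hypothetical spectra: the same shape with different offsets and scalings.
base = [0.2, 0.4, 0.9, 0.5]
spectra = [base, [2.0 * v + 0.1 for v in base], [0.5 * v - 0.05 for v in base]]
print(snv(spectra[0]))
print(msc(spectra))
```

Both methods remove additive and multiplicative differences between spectra; SNV does so for each spectrum on its own, whereas MSC needs a reference spectrum estimated from the whole set.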

Discussion
In this experiment, the NIR spectral data of four varieties of jujube were collected by a NIR-M-R2 portable NIR spectrometer. The SG smoothing filter was used to preprocess the near-infrared spectral data. PCA and LDA were used to further process the spectral data and extract characteristic information. Finally, the fuzzy memberships of FCM, GK and PFCM were used to classify the jujube samples. The experiments showed that the clustering accuracy for jujube varieties differed considerably between fuzzy clustering methods. The classification accuracy was only 61.9% when using the typicality of PFCM, and less than 90% when using the fuzzy membership of GK/PFCM. In contrast, when using GK as the clustering algorithm, the clustering accuracy can reach more than 90%. The quality of jujube is closely related to its place of origin. The soil and climate of different places of origin provide different nutrients for red jujube, and this difference is reflected in the elemental composition of red jujube. For example, according to Lang et al.'s study, the red jujube samples collected from the Xinjiang region had the highest average Na and Ge content, whereas the red jujube samples collected from the Hebei region had the highest average Ca, Ba, and Ti content. [49] By measuring five different varieties of jujube, Gao et al. found statistically significant differences in the measured parameters between the investigated jujube tree cultivars. These results indicated that cultivar was the main factor affecting the physical and chemical properties of jujube. [50] In conclusion, the soil and climate in different regions affect the types of red jujube, and different varieties of red jujube have different proportions of chemical components, so they can be distinguished according to their molecular composition. The near-infrared spectrum can be used to detect the molecular substances related to the absorption bands in the spectrum and thereby distinguish different kinds of jujube.
It is important to note that traditional methods require high-precision experimental instruments and trained technicians, and they consume much time. Moreover, chemical analysis methods can cause chemical pollution. In contrast, NIR spectroscopy combined with the three clustering algorithms used in the experiment can achieve nondestructive, pollution-free classification of red jujube samples, with the GK clustering algorithm achieving the highest classification accuracy. This method can provide jujube producers and supply chain managers with a convenient tool for quickly and accurately identifying and classifying jujube varieties, thereby improving product quality, reducing market risks and increasing competitiveness.
The FCM algorithm is widely used in pattern recognition because of its simple design and easy implementation. However, it has some problems: (1) it has poor robustness to noise and outliers, which easily leads to inaccurate classification; (2) it is sensitive to initialization and sometimes falls into a local optimum. To solve the problem of noise sensitivity in FCM, PFCM was designed based on a possibilistic partition. PFCM produces both fuzzy membership and typicality values, so that data containing noise can be clustered correctly. Based on a fuzzy covariance matrix and a non-Euclidean distance, GK clustering can deal with data sets whose clusters have different geometric shapes. The data distribution of the NIR spectra of red jujube is complex, so GK clustering can achieve higher clustering accuracy than FCM and PFCM.

Conclusion
To identify red jujube varieties rapidly and nondestructively, this study proposed a classification method that combines fuzzy clustering methods with near-infrared spectroscopy.
Four varieties of red jujube were scanned by a NIR-M-R2 portable spectrometer to acquire the NIR spectra, which were processed successively by SG filtering, PCA and LDA so that they could be clustered correctly. Finally, FCM, GK and PFCM were used to cluster the red jujube samples. Compared with the other fuzzy clustering methods, the GK algorithm showed the highest accuracy in identifying the red jujube samples. The PFCM algorithm was found to be effective in classifying the NIR spectra of red jujube by providing both fuzzy membership degrees and typicality values. According to the experimental results, the combination of the GK clustering algorithm and near-infrared spectroscopy played a significant role in identifying the geographical origin of red jujube samples. This study further demonstrated the effectiveness and feasibility of near-infrared spectroscopy for the rapid identification of jujube varieties. In the future, this approach could not only provide a convenient variety identification tool for jujube producers, dealers and consumers, but also help to ensure the quality and authenticity of products. It also provides a rapid, efficient and accurate identification method for food varieties, which helps researchers to understand the geographical origin traceability, variety characteristics and adaptability of red jujube.
The No.1-No.21 samples, the No.22-No.42 samples, the No.43-No.62 samples, and the No.64-No.84 samples were classified as Gansu, Henan, Shanxi and Xinjiang, respectively. The No.63 sample was misclassified as Xinjiang; in fact, it belonged to Shanxi. As a result, 83 test samples were correctly clustered, and the clustering accuracy rate was 98.8%.

Table 1. Classification results of FCM, GK and PFCM.