Hyperspectral image classification of wolfberry with different geographical origins based on three-dimensional convolutional neural network

ABSTRACT A hyperspectral image is a three-dimensional (3D) hypercube with spectral and spatial continuity. Traditional hyperspectral imaging (HSI) processing focuses mainly on spectral information. This paper proposes a new hybrid convolutional neural network (New-Hybrid-CNN) algorithm that uses HSI spectral-spatial joint information, and applies it to classify the origin of Chinese wolfberry from Ningxia, Qinghai, Gansu, and Xinjiang. The method has three steps: (1) a region of interest (ROI) is selected from the raw HSI data as input; (2) spectral-spatial joint information is extracted from the hyperspectral stack using a homogeneous 3D convolution architecture with 3 × 3 × 3 convolution kernels; (3) depthwise separable convolution (DSC) is then used to learn spatial information. The algorithm combines the advantages of 3D convolution and DSC: it effectively extracts deep spectral-spatial joint information while making the architecture more lightweight. A 3D convolutional neural network (3D-CNN), a hybrid spectral convolutional neural network (HybridSN), and a support vector machine (SVM) were established for comparison with the proposed method. The proposed algorithm made full use of the HSI information, reduced the number of network parameters and the training time, and improved the classification accuracy. The classification accuracy of wolfberry origin reached more than 99%. Therefore, the New-Hybrid-CNN classifier combined with HSI has potential for wolfberry origin classification and food detection.


INTRODUCTION
Wolfberry contains polysaccharides, carotenoids, flavonoids, and other ingredients. Moreover, it has a wide array of biological and pharmacological activities. [1] Wolfberry is a kind of Chinese food and famous Chinese medicine, and it has been considered to have anti-aging, anti-oxidative, and cancer prevention properties for thousands of years. [2] The medicinal function and price of wolfberry are closely related to its geographical origin, cultivation methods, and environmental factors [3,4] (including sunshine intensity, temperature, precipitation, and soil).
According to most traditional methods, the geographical origins of wolfberries can be identified by their color, shape, and taste. However, this subjective detection method is easily affected by personal emotions. [5] In addition, some researchers use chemical analysis techniques to classify and identify the origin of wolfberry. These research methods take a long time and require skilled analysts to do

Sample preparation
The wolfberry samples were purchased from local farmers in four main producing areas, including Zhongning County (approximately 37 Mongolia, Xinjiang, China). All of them meet the standard "Wolfberry" (GB/T 18672). The wolfberries were stored in plastic bags at room temperature. As shown in Figure 1, about 150 g of wolfberries of uniform appearance and size were collected from each producing area as samples.

Hyperspectral imaging system and image acquisition
A line-scan HSI system operating in reflection mode was used to capture hyperspectral images of wolfberry. The system consists of a spectrograph (ImSpector N17E, Spectral Imaging Ltd., Oulu, Finland), a CCD hyperspectral camera (Zelos-258GV, Kappa optronics GmbH, Germany), a lighting unit composed of four 35 W halogen lamps, a conveyor stage (PSA200-11-X, Zolix., Ltd., Beijing, China), and a computer (V10E, Isuzu Optical Corp., Taiwan, China) running the spectral cube data acquisition software. The system collected a total of 256 spectral bands in the wavelength range of 900-1700 nm. Image acquisition was carried out in a dark environment to avoid light interference. The wolfberry sample from one origin was spread evenly on a piece of black cardboard, which was then placed on the moving platform running at 16.8 mm/s. The distance from the lens to the sample was 380 mm, and the exposure time was set to 20 ms. Images were collected with a spectral resolution of approximately 3.2 nm. ENVI 5.3 (ITT Visual Information Solutions, Boulder, CO, USA) and Python 3.8 were used for image processing.
Before image processing, the raw hyperspectral images were corrected using white and dark reference images. The white reference image ($I_{white}$) was acquired using a white Teflon tile with nearly 100% reflectance, and the dark reference image ($I_{dark}$) was obtained by completely covering the camera lens with an opaque cap. The calibrated image ($I_c$) was then calculated from the original hyperspectral image ($I_{raw}$) following Equation (1):
$$I_c = \frac{I_{raw} - I_{dark}}{I_{white} - I_{dark}} \quad (1)$$
The denominator eliminates the spatial inhomogeneity of the light source, and the subtraction in the numerator reduces the systematic noise. As a whole, the percentage generated by the division simplifies subsequent calculations. More importantly, the spectrum from the calibrated image can be interpreted in terms of the molecular bonds or components to which features are assigned. [28]
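The white/dark reference correction described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic arrays, not the authors' code; the function name `calibrate` and the small `eps` guard against division by zero are assumptions added for the example.

```python
import numpy as np

def calibrate(raw, white, dark, eps=1e-8):
    """Relative reflectance: (raw - dark) / (white - dark), element-wise."""
    raw = np.asarray(raw, dtype=np.float64)
    white = np.asarray(white, dtype=np.float64)
    dark = np.asarray(dark, dtype=np.float64)
    denom = white - dark
    # Assumption: clamp near-zero denominators so dead pixels do not produce inf/nan.
    denom = np.where(np.abs(denom) < eps, eps, denom)
    return (raw - dark) / denom

# Synthetic cube (rows, columns, bands) with constant reference levels.
dark = np.full((4, 5, 6), 10.0)
white = np.full((4, 5, 6), 210.0)
raw = np.full((4, 5, 6), 110.0)
calibrated = calibrate(raw, white, dark)
print(calibrated[0, 0, 0])
```

With these synthetic values every pixel sits halfway between the dark and white references, so the calibrated reflectance is 0.5 everywhere.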

Data collection
The original NIR-HSI data obtained by the hyperspectral imaging system contained 256 spectral dimensions. The bands at the beginning and end of the range were affected by instrument noise. [29] Therefore, to avoid noise, we removed these bands (900-998 nm and 1622-1700 nm) and kept 190 spectral bands (998-1622 nm) as the raw spectral data for each type of wolfberry.
One image was acquired of the wolfberry samples from each origin. After collecting and correcting the four wolfberry images, the ENVI 5.3 analysis software was used to manually crop a region of interest (ROI) from each corrected hyperspectral image. The 50 × 50 area in the middle of the hyperspectral image was selected to remove the background and insignificant pixels. The ROI therefore included 50 × 50 pixels and 190 spectral bands (as shown in Figure 2, taking a sample of Gansu wolfberry as an example).
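As a rough illustration of the band removal and ROI selection described above, the sketch below masks bands by wavelength and crops the central 50 × 50 window. The ROI in the study was cropped manually in ENVI, so this is only a stand-in; the evenly spaced wavelength table, the image size, and the helper names `select_bands` and `crop_center` are assumptions.

```python
import numpy as np

def select_bands(cube, wavelengths, lo_nm, hi_nm):
    """Keep only the bands whose wavelength lies in [lo_nm, hi_nm]."""
    mask = (wavelengths >= lo_nm) & (wavelengths <= hi_nm)
    return cube[:, :, mask], wavelengths[mask]

def crop_center(cube, size):
    """Crop a size x size spatial window from the middle of the image."""
    rows, cols = cube.shape[:2]
    r0 = (rows - size) // 2
    c0 = (cols - size) // 2
    return cube[r0:r0 + size, c0:c0 + size, :]

# Assumption: 256 evenly spaced bands over 900-1700 nm; the real band table
# comes from the instrument and may differ.
wavelengths = np.linspace(900.0, 1700.0, 256)
cube = np.random.default_rng(0).random((120, 160, 256), dtype=np.float32)

kept, kept_wl = select_bands(cube, wavelengths, 998.0, 1622.0)
roi = crop_center(kept, 50)
print(roi.shape)
```

The number of bands retained depends on the actual wavelength table, so an evenly spaced assumption need not reproduce the 190 bands reported above.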

Classification methods
In this section, we adopted four different classification methods to classify the origin of wolfberry: SVM, 3D-CNN, HybridSN, and New-Hybrid-CNN. The flow of the experiment is shown in Figure 3.

New-Hybrid-CNN architecture
In this paper, a New-Hybrid-CNN classification framework is established to classify the origin of Chinese wolfberry. First, the wolfberry HSI data cube is divided into small overlapping 3D patches, where the label of the central pixel determines the ground-truth label. Each overlapping 3D patch has size S × S × B: it covers an S × S spatial range and all B spectral bands, centered on the target pixel. This patch-based classification makes full use of the pixel information, dividing each HSI cube into 2500 small 3D patches of 11 × 11 × 190. There are therefore 2500 samples of wolfberry from each origin. The samples from the different origins are randomly divided into a training set (70%) and a test set (30%). Then 20% of the training set is randomly selected as the validation set to assess how well the algorithm generalizes.
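The patch construction and data split described above can be sketched as follows. Getting one 11 × 11 patch per pixel of a 50 × 50 ROI requires some form of border handling; zero padding is assumed here because the text does not specify it. The function names, the random seed, and the choice to remove the validation samples from the training set are likewise assumptions.

```python
import numpy as np

def extract_patches(cube, window):
    """Return one window x window x B patch centred on every pixel of the cube."""
    margin = window // 2
    # Assumption: zero padding at the borders so that edge pixels also get a patch.
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)), mode="constant")
    rows, cols, bands = cube.shape
    patches = np.empty((rows * cols, window, window, bands), dtype=cube.dtype)
    idx = 0
    for r in range(rows):
        for c in range(cols):
            patches[idx] = padded[r:r + window, c:c + window, :]
            idx += 1
    return patches

def split_indices(n, test_frac=0.3, val_frac=0.2, seed=0):
    """Random 70/30 train/test split, then 20% of the training part as validation."""
    order = np.random.default_rng(seed).permutation(n)
    n_test = int(round(n * test_frac))
    test, train_all = order[:n_test], order[n_test:]
    n_val = int(round(train_all.size * val_frac))
    val, train = train_all[:n_val], train_all[n_val:]
    return train, val, test

# Small synthetic cube (8 bands instead of 190 to keep memory low).
cube = np.random.default_rng(1).random((50, 50, 8)).astype(np.float32)
patches = extract_patches(cube, window=11)
train_idx, val_idx, test_idx = split_indices(4 * 2500)
print(patches.shape, train_idx.size, val_idx.size, test_idx.size)
```

Each ROI yields 2500 patches, and the centre of every patch is the pixel that supplies its label.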
These small 3D patches are input into the classification algorithm, which first applies the homogeneous 3D convolution architecture with 3D convolution filters of size 3 × 3 × 3. The output feature maps from the 3D convolution are then reshaped and fed to a depthwise separable 2D convolution to learn spatial information. The output of the depthwise separable 2D convolutional layer is flattened and passed to the fully connected layer. After the fully connected layer, a softmax classifier for multiclass tasks outputs the classification results. The architecture is shown in Figure 4.
3D-CNN Block: Usually, a one-dimensional convolution operation extracts spectral features and a 2D convolution operation extracts spatial features, whereas a 3D convolution operation can extract spectral and spatial features simultaneously from high-dimensional information. [30] 2D-CNN has made outstanding achievements in computer vision and image classification. The traditional 2D-CNN can be expressed as:
$$v_{ij}^{xy} = f\left(b_{ij} + \sum_{m} \sum_{h=0}^{H_i-1} \sum_{w=0}^{W_i-1} k_{ijm}^{hw}\, v_{(i-1)m}^{(x+h)(y+w)}\right)$$
where $f$ is the activation function, $i$ represents the i-th layer in the network, and $j$ is the j-th feature map of the layer. $v_{ij}^{xy}$ represents the value at $(x, y)$ in the j-th feature map of the i-th layer, $b_{ij}$ is the offset, and $m$ indexes all feature maps in the $(i-1)$-th layer that are connected to the current layer. $k_{ijm}^{hw}$ represents the value of the convolution kernel at $(h, w)$. $H_i$ and $W_i$ represent the height and width of the corresponding convolution kernel of the layer.
If the original hyperspectral data is used directly for 2D-CNN, each input band needs to be convolved two-dimensionally. A set of convolution kernels is therefore required for each input band, leading to a massive number of network parameters. In addition, this can lead to increased computation and severe over-fitting problems. [31] Hyperspectral images are 3D hypercubes with spectral and spatial continuity. Combining spectral and spatial information is becoming increasingly popular because it makes full use of the original data. Using 3D convolution to extract features is a simple method for hyperspectral image classification, which can simultaneously process 3D regions with joint spatial-spectral information. [32] The 3D convolution operation convolves the input data in the spatial and spectral dimensions and outputs 3D data, which preserves the spectral information of the hyperspectral input. The formula for the 3D convolution operation is:
$$v_{ij}^{xyz} = f\left(b_{ij} + \sum_{m} \sum_{h=0}^{H_i-1} \sum_{w=0}^{W_i-1} \sum_{s=0}^{S_i-1} k_{ijm}^{hws}\, v_{(i-1)m}^{(x+h)(y+w)(z+s)}\right)$$
where $S_i$ represents the size of the 3D convolution kernel in the spectral dimension, $i$ represents the number of feature blocks on the network, and $j$ represents the number of convolution kernels in this layer. Correspondingly, the output of the layer (i-th layer) includes i × j 3D feature values.
Depthwise Separable Convolution Block: Figure 5(b) shows that the DSC separates the convolution channel by channel and consists of two steps: depthwise convolution and pointwise convolution. Depthwise convolution applies one filter to each input channel, and pointwise convolution uses a 1 × 1 convolution kernel for channel fusion. Pointwise convolution helps preserve spatial information, thereby optimizing the performance of convolutional networks. DSC is a transformed form of ordinary 2D convolution. [33] Compared with the ordinary convolution operation, the DSC is lighter (fewer trainable weight parameters) and computationally cheaper.
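The two DSC steps, depthwise convolution followed by pointwise convolution, can be written out directly in NumPy. This is a sketch for illustration only (no bias, no activation, stride 1, no padding), not the layer implementation used in the study; the function names and the toy tensor sizes are assumptions.

```python
import numpy as np

def depthwise_conv2d(x, dw):
    """x: (H, W, C); dw: (kh, kw, C). One spatial filter per input channel."""
    kh, kw, _ = dw.shape
    # windows has shape (H-kh+1, W-kw+1, C, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(x, (kh, kw), axis=(0, 1))
    return np.einsum("xychw,hwc->xyc", windows, dw)

def pointwise_conv2d(x, pw):
    """x: (H, W, C); pw: (C, F). A 1 x 1 convolution that mixes channels."""
    return x @ pw

C, F = 6, 4                     # toy channel counts, chosen for the example
x = np.ones((5, 5, C))
dw = np.ones((3, 3, C))         # depthwise filters
pw = np.ones((C, F))            # pointwise (1 x 1) filters
out = pointwise_conv2d(depthwise_conv2d(x, dw), pw)

separable_weights = dw.size + pw.size   # 3*3*C + C*F
ordinary_weights = 3 * 3 * C * F        # a standard 3 x 3 convolution
print(out.shape, separable_weights, ordinary_weights)
```

For this toy case the separable form needs 78 weights against 216 for an ordinary 3 × 3 convolution with the same input and output channels, which is the saving the paragraph above refers to.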
Parameter setting: Compared with the image-level classification algorithm, the spatial size of the input data used in the pixel-level classification algorithm is relatively small. Using a convolution kernel with a smaller spatial extent can avoid excessive loss of input information. [31] Studies of 2D-CNN have found that a convolution kernel of size 3 × 3 usually produces better results. Some researchers have used 3D convolution kernels of size 3 × 3 × 3 for spatiotemporal feature learning [34] and on common hyperspectral data sets. [21] In the New-Hybrid-CNN, all 3D convolution filters are 3 × 3 × 3 with a step size of one in the spatial and spectral dimensions.
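To make the effect of a 3 × 3 × 3 filter with a step size of one concrete, the sketch below applies a single 3D kernel to one 11 × 11 × 190 patch. It assumes no padding, which the text does not state, and uses an all-ones patch and kernel so the result is easy to check by hand; `conv3d_valid` is a hypothetical helper name.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Single-channel 3D convolution (cross-correlation), stride 1, no padding."""
    # windows has shape (H-kh+1, W-kw+1, B-kb+1, kh, kw, kb)
    windows = np.lib.stride_tricks.sliding_window_view(volume, kernel.shape)
    return np.einsum("xyzhws,hws->xyz", windows, kernel)

patch = np.ones((11, 11, 190), dtype=np.float32)   # one input patch
kernel = np.ones((3, 3, 3), dtype=np.float32)      # one 3 x 3 x 3 filter
out = conv3d_valid(patch, kernel)
print(out.shape, out[0, 0, 0])
```

Under the no-padding assumption each such layer shrinks every dimension by two, so the patch becomes 9 × 9 × 188 after the first 3D convolution, and each output value here is the sum of 27 ones.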
As for the network structures of 3D-CNN, HybridSN, and New-Hybrid-CNN in this study, the first three 3D convolution layers in all three classification algorithms used the homogeneous 3D convolution architecture: all the 3D convolution filters were 3 × 3 × 3, the step size was 1 × 1 × 1, and the numbers of filters were 8, 16, and 32. The fourth layer of the 3D-CNN consisted of 64 3D convolution kernels of size 3 × 3 × 3. The fourth layer of the HybridSN consisted of 64 ordinary 2D convolution kernels of size 3 × 3. The fourth layer of the proposed New-Hybrid-CNN was a depthwise separable convolution, carried out in two steps. The first step was the depthwise convolution, in which each convolution kernel convolved only one channel of the input layer (each filter has size 3 × 3). The second step used 1 × 1 convolution kernels to perform pointwise convolution, and the number of filters was 64. The fully connected layer and classifier were the same in all three classification algorithms.
Studies have shown that appropriately increasing the size of the input data can help improve classification performance. However, a large data window will introduce additional noise. [35] In addition, an overly large input significantly increases the amount of computation in network training and prediction and lengthens the training time. In the experiment, the input data window size was W × W, and the step size was 1. The performance of 9 × 9, 11 × 11, 13 × 13, and 15 × 15 windows was evaluated, and the results showed that the optimal window size was 11 × 11. Table 1 shows the type of each layer of the New-Hybrid-CNN structure, the dimensionality of the output mapping, and the number of parameters. Among them, separable_Conv2d represents the DSC, including the depthwise convolution and the pointwise convolution. For the established wolfberry data set, the total number of trainable parameters for New-Hybrid-CNN is 596180. All weights were randomly initialized and trained using the backpropagation algorithm with the Adam optimizer and the softmax loss. The mini-batch size was 128. Since the training set was relatively small, the batch size was set to 100. The rectified linear unit (ReLU) was used as the activation function. To prevent the model from over-fitting, the dropout method was used to temporarily discard units from the network with a certain probability. The dropout rate was set to 0.3. Based on the changes in accuracy and loss during training, the learning rate was chosen to be 0.001.
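The parameter totals can be checked with simple arithmetic. The sketch below counts weights and biases layer by layer for the three networks described above. Two details are assumptions not stated in the text: the convolutions use no padding, and the fully connected part is one hidden layer of 256 units followed by a 4-way softmax layer. They were chosen because, together with the stated filter sizes, they reproduce the reported total of 596180 for New-Hybrid-CNN.

```python
def conv3d_params(k, c_in, c_out):
    return k ** 3 * c_in * c_out + c_out          # weights + biases

def conv2d_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out

def separable_conv2d_params(k, c_in, c_out):
    # depthwise (one k x k filter per channel) + pointwise (1 x 1) + biases
    return k * k * c_in + c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

def count_params(variant, window=11, bands=190, hidden=256, classes=4):
    """Total trainable parameters; `hidden` and no padding are assumptions."""
    spatial, spectral, channels, total = window, bands, 1, 0
    for filters in (8, 16, 32):                   # three shared 3D conv layers
        total += conv3d_params(3, channels, filters)
        spatial, spectral, channels = spatial - 2, spectral - 2, filters
    if variant == "3d-cnn":
        total += conv3d_params(3, channels, 64)
        spatial, spectral = spatial - 2, spectral - 2
        flat = spatial * spatial * spectral * 64
    else:
        merged = spectral * channels              # reshape 3D maps to 2D channels
        if variant == "hybridsn":
            total += conv2d_params(3, merged, 64)
        elif variant == "new-hybrid-cnn":
            total += separable_conv2d_params(3, merged, 64)
        else:
            raise ValueError(f"unknown variant: {variant}")
        spatial -= 2
        flat = spatial * spatial * 64
    total += dense_params(flat, hidden) + dense_params(hidden, classes)
    return total

new_total = count_params("new-hybrid-cnn")
ratio_hybridsn = new_total / count_params("hybridsn")
ratio_3dcnn = new_total / count_params("3d-cnn")
print(new_total, round(ratio_hybridsn, 3), round(ratio_3dcnn, 3))
```

Under these assumptions the New-Hybrid-CNN count is 596180, roughly 17% of the HybridSN count and about 2.2% of the 3D-CNN count. Passing `window=9`, `13`, or `15` shows how the total grows with the input window, since the flattened feature size feeding the fully connected layer depends on it.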

Support vector machine
SVM is a traditional supervised learning algorithm for classification and regression analysis, widely used in pattern recognition for spectral data analysis. [36] For linearly separable problems, SVM seeks a hyperplane that maximizes the distance between samples of different categories. For linearly inseparable problems, SVM maps the original data to a higher-dimensional space in which the samples become linearly separable, and then searches for a set of hyperplanes in the new space that maximizes the sample spacing. [13] Based on structural risk minimization, it introduces kernel functions to effectively process nonlinear or linearly inseparable data. An appropriate kernel function can improve the performance of the model. The radial basis function (RBF) has been widely used in the field of spectral analysis.
In the experiment, the 3D (50 × 50 × 190) cubes of the four wolfberry origins were first mapped to 2D (2500 × 190) matrices, which were used as the input data. Each row of the matrix (representing one pixel) was defined as a sample, so there were 2500 samples of wolfberry from each origin. Using the Python scikit-learn library, the grid search method was used to determine the optimal parameters of the SVM (penalty coefficient "c" and kernel width "g").
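The mapping from cubes to a pixel-by-band sample matrix, and the RBF kernel that the parameter "g" controls, can be sketched with NumPy. The data are synthetic, the helper names are hypothetical, and the kernel form exp(-g * ||a - b||^2) is an assumption about how "g" enters; it matches the common LIBSVM and scikit-learn convention.

```python
import numpy as np

def cubes_to_samples(cubes):
    """Stack (H, W, B) cubes, one per origin, into X (pixels, B) and labels y."""
    X = np.concatenate([c.reshape(-1, c.shape[-1]) for c in cubes], axis=0)
    y = np.concatenate([np.full(c.shape[0] * c.shape[1], label)
                        for label, c in enumerate(cubes)])
    return X, y

def rbf_kernel(A, B, g):
    """K[i, j] = exp(-g * ||A[i] - B[j]||^2); the role of g is an assumption."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-g * np.maximum(sq, 0.0))   # clip tiny negatives from rounding

rng = np.random.default_rng(0)
cubes = [rng.random((50, 50, 190)) for _ in range(4)]   # synthetic ROIs, four origins
X, y = cubes_to_samples(cubes)
K = rbf_kernel(X[:5], X[:5], g=0.01)
print(X.shape, np.bincount(y), K.shape)
```

In scikit-learn, a matrix like `X` with labels `y` could then be passed to `sklearn.svm.SVC(kernel="rbf")` wrapped in `sklearn.model_selection.GridSearchCV` to search over `C` and `gamma`; the search grid itself is not given in the text.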

Evaluation metrics
The overall accuracy (OA) and the kappa coefficient (Kappa) were used to judge the classification performance on the HSI data. OA measures the proportion of correctly classified samples among all test samples; Kappa is an agreement measure that weights the accuracy by the row and column totals of the confusion matrix.
OA is calculated as
$$OA = \frac{1}{N} \sum_{i=1}^{n} a_{ii}$$
where $N$ is the number of all samples, $n$ is the number of categories, and $a_{ii}$ is the i-th diagonal element of the corresponding confusion matrix. Kappa is a measure of the accuracy of classification. The expression is
$$Kappa = \frac{N \sum_{i=1}^{n} a_{ii} - \sum_{i=1}^{n} a_{i+}\, a_{+i}}{N^2 - \sum_{i=1}^{n} a_{i+}\, a_{+i}}$$
where $a_{i+}$ is the sum of the i-th row and $a_{+i}$ is the sum of the i-th column. In short, the larger the values of OA and Kappa, the better the performance of the classification algorithm.
Figure 6 shows the original spectra of wolfberries from the four origins in the ROI. Wolfberries from different geographical origins have similar spectral curves, with absorption peaks at approximately 1000 nm, 1200 nm, and 1465 nm. The absorption peak near 1000 nm is due to the second overtone of the N-H bond vibration in protein or amino acids. [37] The absorption peak near 1200 nm arises from the second overtone of the C-H stretching vibration in protein, starch, or lipid, [38,39] and the absorption peak near 1465 nm lies in a region sensitive to water absorption. [40] Further analysis of the average spectral curves of wolfberry from the different origins, shown in Figure 7, indicates that the trends of the four average spectral curves are generally similar, but their reflectance values differ. These differences are caused by factors such as region and climate, indicating that the origin of wolfberry can be classified using reflectance spectra. However, the average spectral curves of Qinghai wolfberry and Xinjiang wolfberry overlap. Therefore, machine learning methods are used to further identify the origin of wolfberry.

Classification results
In the experiment, the New-Hybrid-CNN, HybridSN, 3D-CNN, and SVM were programmed in Python 3.8 and implemented with the open-source TensorFlow and Keras frameworks. The operating platform was a PC with an Intel(R) Core(TM) i5-5200U CPU at 2.20 GHz and 4 GB RAM. All the classification algorithms were built on the full spectrum (998-1622 nm) and used the same settings (window size, training samples, testing samples) for a fair comparison. Since the hyperspectral data of wolfberry from each origin was divided into 2500 samples, the total number of wolfberry samples from the four origins was 10000. The numbers of samples in the training set, validation set, and test set for the different classification methods are shown in Table 2.
First, for the SVM using the full spectrum (998-1622 nm), the optimal parameters (c, g) were (1000, 0.01). With the SVM discriminant model built on these optimal parameters, the OA over the four origins is 95.36%, and the Kappa coefficient is 93.82%. Confusion matrices are used to compare the classification performance of the four classification algorithms on the test set, as shown in Figure 8. They show that the SVM has the worst performance. Among the individual origins, the SVM achieves its highest correct recognition rate for Ningxia wolfberry, at 99.7%. The other three classification algorithms
Compared with the three classification algorithms based on spectral-spatial joint information, the SVM has the lowest classification accuracy. This shows that classification methods combining spectral-spatial joint information perform better than the classification method using spectral information alone.
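OA and Kappa values such as those reported above are derived from a confusion matrix: OA is the trace divided by the total count, and Kappa compares that observed agreement with the agreement expected from the row and column totals. The sketch below uses a small made-up 2 × 2 matrix for illustration, not the study's results; the function names are hypothetical.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    cm = np.asarray(cm, dtype=np.float64)
    return np.trace(cm) / cm.sum()

def kappa(cm):
    """(N * sum(a_ii) - sum(row_i * col_i)) / (N^2 - sum(row_i * col_i))."""
    cm = np.asarray(cm, dtype=np.float64)
    n = cm.sum()
    observed = np.trace(cm)
    chance = (cm.sum(axis=1) * cm.sum(axis=0)).sum()
    return (n * observed - chance) / (n * n - chance)

# Made-up confusion matrix: rows are true classes, columns are predictions.
cm = np.array([[45, 5],
               [10, 40]])
oa = overall_accuracy(cm)
k = kappa(cm)
print(oa, k)
```

For this toy matrix OA is 0.85 and Kappa is 0.7, showing that Kappa is lower than OA whenever part of the agreement can be explained by the class totals alone.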

Comparison of 3D-CNN, HybridSN and New-Hybrid-CNN
Next, the accuracy of the proposed algorithm was compared with that of the 3D-CNN and the HybridSN. The classification results are shown in Table 3. With all other variables held the same, the OA and Kappa of the New-Hybrid-CNN classification algorithm are higher than those of the other two classification algorithms. Figure 9 summarizes the number of parameters and the training time of these three classification methods. It clearly shows that the proposed New-Hybrid-CNN has the fewest parameters, about 16% of those of the HybridSN method and 2.2% of those of the 3D-CNN. This reduction in parameters effectively lowers the computational load. In addition, its training time is the shortest, about 47 seconds less than that of HybridSN and about 1153 seconds less than that of 3D-CNN. To further study the stability of the three classification methods, the accuracy and loss curves of the training set and the validation set of the three networks during training are compared. As shown in Figure 10, convergence begins when the epoch reaches about 40. After 100 iterations, the accuracy and loss of the model stabilize, and the ideal classification accuracy is considered to have been achieved. The accuracy and loss curves of the training set and the validation set in Figure 10(a) largely coincide with those in Figure 10(b) and Figure 10(c), and there is no large-scale oscillation. Therefore, the proposed New-Hybrid-CNN is stable. Moreover, it requires fewer parameters and less training time while achieving the highest classification accuracy, which gives it great potential for practical application. Overall, the New-Hybrid-CNN classification algorithm shows the best classification performance.

Effect of input size of the sample
To find the best input size for the training samples, we tested the proposed classification algorithm with input data of different sizes. The window sizes were set to 9 × 9, 11 × 11, 13 × 13, and 15 × 15, and the test results are shown in Table 4. Figure 11 shows the effect of window size on the OA and Kappa coefficients. Table 4 shows that as the data window becomes larger, the number of parameters and the training time of the algorithm gradually increase. Moreover, Figure 11 shows that an appropriate increase in the input data window size helps improve the classification performance. However, beyond a window size of 11 × 11, the OA and Kappa curves gradually flatten, which means that further increasing the window size does not significantly improve the accuracy. In addition, an overly large input window substantially increases the amount of computation for network training, resulting in a longer training time. Based on the above experimental results, the window size of 11 × 11 achieves the best trade-off.

CONCLUSION
In this study, hyperspectral imaging in the spectral region of 998-1622 nm was applied to identify the geographical origins of wolfberry using four types of classifiers: New-Hybrid-CNN, HybridSN, 3D-CNN, and SVM. The SVM classifier was based on the spectral information of the wolfberry samples, whereas the New-Hybrid-CNN, HybridSN, and 3D-CNN were built on the spectral-spatial information of the samples. The experimental results revealed that the classifiers using spectral-spatial information were more effective than the SVM, which focuses on spectral features. In addition, the proposed New-Hybrid-CNN classifier combined the advantages of homogeneous 3D convolution and DSC. Compared with the 3D-CNN and HybridSN, it had a lighter model, improved computational efficiency, and the highest classification accuracy. In future work, the New-Hybrid-CNN classification algorithm proposed in this paper could be extended and applied to other samples, and it may help to solve similar problems in food classification and agricultural applications.
Figure 11. The influence of window size on OA and Kappa.

Disclosure statement
No potential conflict of interest was reported by the author(s).