Discrimination of Boletaceae mushrooms based on data fusion of FT-IR and ICP–AES combined with SVM

ABSTRACT In this study, Fourier transform infrared (FT-IR) spectroscopy and inductively coupled plasma atomic emission spectrometry (ICP–AES) were used, individually and in combination through data fusion, to discriminate five species of Boletaceae mushrooms with the aid of a support vector machine (SVM). First, the original FT-IR spectra of 230 samples of different species were preprocessed by second derivative (2D), Savitzky–Golay filter (15:1), and standard normal variate. Second, the FT-IR and ICP–AES datasets were concatenated, and this low-level data fusion strategy was used to classify the mushroom species. Third, the latent variables of the element concentrations and the FT-IR spectra were extracted by partial least squares discriminant analysis, and the two sets of latent variables were fused into a new matrix (mid-level data fusion). Finally, the classification models were established by SVM. Compared with either single technique, the mid-level data fusion strategy gave better results: the classification accuracies for the calibration and test sets were 100.00% and 98.68%, respectively. The results demonstrated that mid-level data fusion of FT-IR and ICP–AES provides a stronger synergistic effect for discriminating different species of Boletaceae mushrooms, which could benefit the further authentication and quality control of edible mushrooms.


Introduction
In China, mushrooms have been used for hundreds of years as edible and medicinal resources, and the rate of mushroom consumption is relatively high. [1,2] Apart from their characteristic taste, the fruiting bodies of mushrooms are considered sources of organic nutrients including digestible proteins, carbohydrates, fibre, and certain vitamins, as well as minerals and antioxidants. [3][4][5][6] Yunnan Province is known for its abundant wild edible mushroom resources. More than 880 species of edible mushrooms have been identified in Yunnan Province, accounting for 40% of the total number of edible species in the world. [7] However, wild-grown edible mushrooms are difficult to distinguish because of the large number of species and their high similarity. In the marketplace, adulteration of sliced edible mushrooms has become common. Li et al. [8] established a standard DNA barcode for edible boletes using samples of common boletes that merchants in the markets regarded as four "species". The results showed that these 4 "species" in fact represented 12 distinct species. Dentinger et al. [9] used DNA sequencing to identify mushrooms within a commercial packet of dried Chinese boletes purchased in London. Surprisingly, they found three new species of boletes. Moreover, some edible mushrooms are easily confused with toxic ones; [10] an effective method for the discrimination of wild-grown edible mushrooms is therefore necessary.
Besides the traditional method of morphological observation, many techniques including Fourier transform infrared (FT-IR) spectroscopy, [11] ultraviolet (UV) spectroscopy, [12] near-infrared (NIR) spectroscopy, [13] liquid chromatography-mass spectrometry, [14] high performance liquid chromatography, [15] and inductively coupled plasma-mass spectrometry [16] have been applied widely for qualitative and quantitative analysis of edible mushrooms. Among these, FT-IR gives a systematic description of the chemical components of specimens, is simple, rapid, and cost-effective, and is widely used in the analysis of edible mushrooms. [17] Inductively coupled plasma atomic emission spectrometry (ICP-AES) is a sensitive and rapid technique for measuring multiple elements simultaneously, and it has been used to study the elements in fungi and in food assessment. [18] A dataset obtained from a single instrument may be too limited to support a comprehensive analysis. For instance, previous research showed that FT-IR has limitations, such as failing to confirm the major constituents and their contents. [19] Data fusion based on complementary techniques is one of the most reliable, effective, and stable methods; it can provide more accurate information about samples and obtain better results (classifications with a lower error rate and predictions with less uncertainty) than a single instrument, [20] and it has been widely used in previous studies. For example, Wang et al. [21] used an approach based on NIR spectroscopy, ultraviolet-visible spectroscopy, and chemometric algorithms to discriminate five varieties of green tea successfully. Márquez et al. [22] exploited the synergy of the information obtained from Fourier transform Raman and NIR spectroscopy to solve the problem of hazelnut adulteration.
By combining the fused data of different techniques with multivariate statistical analysis, these examples showed great convenience and capability for the quality control of food. Data fusion is therefore a powerful tool for improving classification performance.
There are three strategies for data fusion: low-, mid-, and high-level data fusion. Low-level and mid-level data fusion are used frequently, and mid-level data fusion is one of the most popular methods at present. [20] In previous studies, low-level and mid-level data fusion have often been used to distinguish foods and assess their quality. For example, Silvestri et al. [23] used a low-level data fusion strategy to describe the geographical variability of wines and obtained a good result. Dankowska [24] showed a synergistic effect in the detection of cocoa butter adulteration with cocoa butter equivalents by using mid-level data fusion of fluorescence and UV spectroscopies. Sun et al. [25] established a method based on data fusion of NIR and mid-infrared spectra to distinguish unofficial rhubarbs, and the result indicated that the data fusion strategies could improve the classification accuracy.
In this work, the main goal is to establish an effective classification model, based on data fusion, for the discrimination of five species of Boletaceae mushrooms collected from Yunnan Province in China. FT-IR spectroscopy and ICP-AES were used because the two techniques provide different chemical information about the samples: FT-IR focuses on functional groups and ICP-AES on mineral elements. Therefore, the data fusion of these two techniques should make better use of the obtained information and exploit the synergy between the individual datasets.

Sample collection and initial preparation
A total of 230 fresh fruiting bodies of wild-grown Boletaceae mushrooms (Boletus speciosus, Leccinum rugosiceps, Boletus umbriniporus, Boletus tomentipes, Boletus edulis) were collected from Yunnan Province in southwestern China. The sample details are given in Table 1. All samples were authenticated by Dr. Honggao Liu from the College of Agronomy and Biotechnology, Yunnan Agricultural University, Kunming, China. The fresh samples were first cleaned with a soft brush, dried in a drying oven at 55°C for 24 h to constant weight, and crushed with a high-speed blender (FW-100, Shanghai China) in the laboratory. The sample powder was then sifted through an 80-mesh stainless steel sieve. Finally, the powdered samples were stored in Ziploc bags under dry conditions at room temperature for further analysis.

Spectra acquisition and spectral data processing
To obtain the FT-IR spectral data, the powder of each sample (1.0 ± 0.2 mg) was blended uniformly with potassium bromide (KBr) powder (100.0 ± 20.0 mg), milled in an agate mortar, and pressed into a tablet for testing. After the instrument had been preheated for 30 min, each spectrum was recorded on an FT-IR spectrometer (Perkin Elmer, USA) equipped with a deuterated triglycine sulfate detector. The scanning range was 4000-400 cm⁻¹, and 16 scans per sample were accumulated at a resolution of 4 cm⁻¹. Before spectral acquisition, a background spectrum of a pure, dried KBr tablet was recorded so that the interfering contributions of atmospheric CO₂ and H₂O were subtracted automatically. Throughout the experiment, the temperature (25°C) and relative humidity (30%) were kept constant. Each sample was measured in triplicate, and the mean spectrum was used for further analysis.
FT-IR spectra not only provide useful chemical information but also contain considerable interference caused by external factors such as light scattering and background disturbance. [26] Therefore, to reduce this interference, different preprocessing methods were used: second derivative (2D), Savitzky-Golay (SG) filter (15:1), and standard normal variate (SNV). 2D is commonly applied to resolve overlapping peaks and remove baseline shifts. [27] The SG filter smooths the acquired data and helps with precise peak extraction. [28] SNV eliminates the noise caused by light scattering from particles of different shapes and sizes. [29] The preprocessed FT-IR data were then used to establish the discrimination model.
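The preprocessing chain described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure run in the original software: the 15-point window is taken from the text, but the polynomial order (2), the order of operations (derivative first, then SNV), and all function names are assumptions made for this example.

```python
import numpy as np

def savgol_second_derivative(spectra, window=15, polyorder=2, spacing=1.0):
    """Savitzky-Golay second derivative of each row (one spectrum per row).

    polyorder=2 is an assumption; the text only gives the window as "15:1".
    """
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Least-squares polynomial fit inside the window: columns are offsets**0, **1, ...
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 2 of the pseudo-inverse yields the x**2 coefficient; times 2! gives d2y/dx2.
    kernel = 2.0 * np.linalg.pinv(design)[2] / spacing ** 2
    # The kernel is symmetric, so convolution equals correlation here.
    # mode="valid" drops (window - 1) edge points instead of extrapolating.
    return np.array([np.convolve(row, kernel, mode="valid") for row in spectra])

def snv(spectra):
    """Standard normal variate: scale each spectrum by its own mean and std."""
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def preprocess(spectra):
    """Assumed order: smoothed second derivative, then SNV."""
    return snv(savgol_second_derivative(spectra))
```

Because the derivative is taken over a sliding window, each preprocessed spectrum is 14 points shorter than the raw one in this sketch; commercial packages typically pad the edges instead.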

ICP-AES analysis
The mineral elements in the Boletaceae mushroom samples were determined using ICP-AES (ICPE-9000, Shimadzu, Japan). First, 300 mg of each dried sample powder was accurately weighed and placed in a polytetrafluoroethylene (PTFE) pressure vessel containing 6 mL nitric acid solution (65% HNO₃) and 2 mL deionized water. The PTFE vessels were then closed and the mixture was digested in an automatic closed microwave digestion system (Ethos One, Milestone, Italy). Finally, the digest was filtered, diluted to 25 mL with deionized water, and subjected to instrumental analysis. A standard reference material (GBW07605, tea leaves, produced by the Institute of Geophysical and Geochemical Exploration, Beijing, China) was used to verify precision and accuracy. During the experiment, ultrapure water was used for the preparation of solutions, and all test tubes and equipment were cleaned by soaking in HNO₃ (10%) and then rinsed with H₂O₂ to prevent contamination. The contents of the mineral elements were used to establish the classification model.
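For readers who want to relate instrument readings to the contents reported per kilogram of dry powder, the usual conversion from solution concentration to sample content can be sketched as follows. The 300 mg sample mass and 25 mL final volume come from the text; the blank correction and the function name are assumptions added for illustration, and the paper does not describe its own calculation.

```python
def element_content_mg_per_kg(conc_mg_per_l, blank_mg_per_l=0.0,
                              final_volume_ml=25.0, sample_mass_mg=300.0):
    """Convert a measured solution concentration (mg/L) to content in the dry powder (mg/kg).

    blank_mg_per_l is a hypothetical reagent-blank reading; set it to 0 to ignore it.
    """
    volume_l = final_volume_ml / 1000.0   # 25 mL -> 0.025 L
    mass_kg = sample_mass_mg / 1.0e6      # 300 mg -> 0.0003 kg
    return (conc_mg_per_l - blank_mg_per_l) * volume_l / mass_kg
```

Under these assumptions, a content of 5000 mg kg⁻¹ corresponds to about 60 mg L⁻¹ in the diluted digest.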

Data fusion and statistical analysis
In this paper, the main aim of data fusion is to find out whether the existing data contain information that could be useful for classification. Low-level and mid-level data fusion were used to build discrimination models, and the classifying capacity of the two models was then compared. In low-level data fusion, the data from different instruments are simply concatenated into a new matrix that contains a large number of variables. [30] Mid-level data fusion, also called feature-level data fusion, selects from each data source separately the relevant variables that represent the information of the samples and combines them into a new matrix. [31] The new matrix is then used for further study.
For the low-level fusion data matrix, the preprocessed FT-IR data and the mineral element contents were simply concatenated. Before multivariate statistical analysis, the low-level fusion data matrix of FT-IR and ICP-AES was normalized. The drawback of low-level data fusion is that the number of variables is much larger than the number of samples. Therefore, for the mid-level fusion data, the optimal latent variables were extracted from the data of each instrument. The most common approach to reducing the data and selecting the most valuable latent variables is partial least squares discriminant analysis (PLS-DA), because of its ease of use, speed, and relatively good performance. [31] To establish the mathematical models, the fusion data matrices and the individual instrument data matrices were imported into the appropriate software for multivariate statistical analysis. Support vector machine (SVM) is a supervised pattern recognition method that can identify unknown samples and can analyze highly collinear and noisy data. [32] Recently, several researchers have used SVM to classify foods and medicines. For example, Bougrini et al. [33] showed that an electronic nose and a voltammetric electronic tongue combined with SVM could be used as an effective method to detect adulteration in argan oil. Devos et al. [34] used an SVM method based on spectral data to discriminate olive oils from different regions, and the proposed method is easy to use. In this paper, SVM was applied to the data matrices of the individual instruments and to the fused data matrices for the discrimination of five species of Boletaceae mushrooms. The best classification strategy was established by comparing the identification results.
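The low-level fusion step amounts to a column-wise concatenation followed by normalization. The sketch below assumes column autoscaling (zero mean, unit variance), because the text does not say which normalization was applied; the function names are illustrative only.

```python
import numpy as np

def autoscale(matrix):
    """Centre each column and scale it to unit variance (assumed normalization)."""
    matrix = np.asarray(matrix, dtype=float)
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (matrix - mean) / std

def low_level_fusion(ftir, elements):
    """Concatenate FT-IR variables and element contents sample by sample."""
    ftir = np.asarray(ftir, dtype=float)
    elements = np.asarray(elements, dtype=float)
    if ftir.shape[0] != elements.shape[0]:
        raise ValueError("both blocks must contain the same samples in the same order")
    return autoscale(np.hstack([ftir, elements]))
```

Scaling after concatenation puts absorbance-derived variables and element contents, which differ by orders of magnitude, on a comparable footing, but the many FT-IR columns still outnumber the 18 element columns.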
The following software was used: OMNIC (Version 8.2, Thermo Fisher Scientific Inc., USA) to analyze the FT-IR spectra; SIMCA-P+ (Version 13.0, Umetrics, Umeå, Sweden) to preprocess the FT-IR data and extract latent variables by PLS-DA; and MATLAB (version R2014a, MathWorks, USA) to carry out SVM.

Results and discussion
Chemical assignments of absorption bands in FT-IR spectra of Boletaceae mushroom samples
The raw and preprocessed FT-IR spectra of the different species of wild-grown Boletaceae mushrooms are presented in Fig. 1. As a rapid and simple analytical technique, infrared spectroscopy has been used widely for mushrooms in recent years, [35] and the chemical assignments of characteristic infrared absorption bands have been described. [36,37] The spectra are similar in appearance, and several spectral characteristics can be observed. As can be seen from Fig. 1a, the peak at 3390 cm⁻¹ is mainly attributed to the O-H stretching vibration, which may arise from water, triterpenes, polysaccharides, and sterols. The peak at 2934 cm⁻¹ corresponds to a stretching vibration of the -CH₃ group. The peaks at 1641 and 1554 cm⁻¹ are assigned to C=O, C=N, and N-H vibrations, which may arise from proteins. The O-H bending vibration appears at 1464 cm⁻¹ and belongs to polysaccharides. The absorptions around 1082 and 1037 cm⁻¹ are caused by stretching of C-O and C-C groups. Peaks in the region of 900-400 cm⁻¹ mainly belong to polysaccharides, such as beta-D-glucan and the pyranose form of glucose. Taken together, the characteristic absorption peaks are attributable to proteins, polysaccharides, and amino acids.
In this paper, various preprocessing methods were tried to optimize the FT-IR data. Finally, 2D + SNV + SG (15:1) was used; the raw and preprocessed FT-IR spectra are shown in Figs. 1a and 1b, respectively. After preprocessing, the resolution of the spectra was clearly improved and the noise was efficiently reduced. Therefore, 2D + SNV + SG (15:1) attenuated the spectral noise and improved the classification.

Elemental analysis by ICP-AES
As shown in Table 2, the contents of 18 elements (Ba, Ca, Cd, Co, Cr, Cu, Fe, K, Li, Mg, Mn, Na, Ni, P, Rb, Sr, V, Zn) were determined by ICP-AES. The results show that K, P, and Mg are abundant elements in all fruiting bodies. In particular, the average content of K is more than 5000 mg kg⁻¹, followed by P and Mg. K, P, and Fe play an important role in human health, and their contents in the mushrooms can contribute to the daily human requirement. However, excess intake of heavy metals poses potential hazards to human health. [38] For example, high concentrations of toxic Cd lead to renal, pulmonary, hepatic, skeletal, and reproductive diseases. [39] Element concentration is therefore a key criterion in the quality assessment of edible mushrooms.

Feature extraction
Feature extraction is useful for reducing dimensionality while keeping the relevant information when the data volume is large. To extract feature information for classification, PLS-DA was used to select the latent variables of the FT-IR spectra and the element concentrations. The parameters R²Y and Q² were used to evaluate the efficiency. A higher R²Y indicates a larger cumulative contribution of the latent variables, that is, that they represent more of the information in the samples. A maximum in Q² suggests good performance in predicting class membership. [40] As shown in Fig. 2, when Q² for the FT-IR spectra and the element concentrations reached its maximum, R²Y was 0.9482 and 0.4418, respectively. These points were taken to give the best prediction of the species of the mushroom samples. Therefore, the first 17 latent variables of the FT-IR spectra and the first 8 latent variables of the element concentrations were selected to establish the mid-level fusion data matrix.
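To make the mid-level construction concrete, the sketch below extracts PLS score vectors for each data block against a one-hot class matrix and concatenates them, giving 17 + 8 = 25 fused variables. It is a plain NumPy PLS2 with deflation written for illustration; the original work used SIMCA-P+, whose scaling and cross-validation settings are not reproduced here, and every function name is an assumption.

```python
import numpy as np

def one_hot(labels):
    """Dummy class matrix used as the PLS-DA response."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    return (labels[:, None] == classes[None, :]).astype(float)

def pls_scores(X, Y, n_components):
    """Return the first n_components PLS score vectors (latent variables) of X."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    scores = []
    for _component in range(n_components):
        # Weight vector: dominant left singular vector of the X-Y cross-covariance.
        u, _s, _vt = np.linalg.svd(X.T @ Y, full_matrices=False)
        t = X @ u[:, 0]
        tt = t @ t
        # Deflate both blocks so that the next score is orthogonal to this one.
        X = X - np.outer(t, X.T @ t / tt)
        Y = Y - np.outer(t, Y.T @ t / tt)
        scores.append(t)
    return np.column_stack(scores)

def mid_level_fusion(ftir, elements, labels, n_ftir=17, n_elem=8):
    """Concatenate the latent variables of the two blocks into one matrix."""
    Y = one_hot(labels)
    return np.hstack([pls_scores(ftir, Y, n_ftir), pls_scores(elements, Y, n_elem)])
```

This sketch only returns scores for the samples it was fitted on. Projecting new samples would require storing the weights and loadings from the fitted data, which is omitted here for brevity.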

Classification results by SVM
In this study, the data matrices of the FT-IR spectra, the element concentrations, the low-level data fusion, and the mid-level data fusion were used to establish classification models. The 230 samples were divided by the Kennard-Stone algorithm [41] into a calibration set of 154 samples (two thirds of the samples) and a test set of 76 samples (one third of the samples). The accuracies for the calibration and test sets were calculated by SVM. The closer the accuracy is to 100.00%, the better the classification model performs. Finally, the best classification method was selected by comparing the results of the individual techniques and of data fusion.
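The Kennard-Stone selection mentioned above can be sketched as follows: start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away. The function name and the use of Euclidean distance on the raw feature matrix are assumptions for this illustration; the text does not state whether the split was made on the whole set or within each species.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) sample indices chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    if not 2 <= n_select <= len(X):
        raise ValueError("n_select must be between 2 and the number of samples")
    # Squared Euclidean distances via the Gram matrix (ranking is unchanged by the square).
    sq = (X ** 2).sum(axis=1)
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Seed with the two most distant samples.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

Calling `kennard_stone(X, 154)` on a 230-row matrix would return 154 calibration indices and 76 test indices, matching the set sizes reported above.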
To construct an optimal separating hyperplane, SVM maps the original data nonlinearly into a higher-dimensional feature space, so the choice of SVM parameters is important. [42] Because the grid-search method is easy to understand, it is widely used in optimization problems. In this paper, grid search was used to select the kernel parameter (g) and the penalty parameter (c). The kernel parameter is closely related to the classification accuracy, and the penalty parameter controls the error term. [43,44] The optimal SVM parameters were determined by grid search separately for the FT-IR spectra, element concentration, low-level fusion, and mid-level fusion data matrices. As shown in Table 3, the penalty parameter obtained with the mid-level data fusion strategy is 0.57435, which is much lower than for the other methods, suggesting that the mid-level data fusion strategy is suitable for SVM analysis. Moreover, the sevenfold cross-validation accuracy reflects the performance of the established model and the accuracy of the calibration set. The cross-validation accuracies of the four data matrices were 96.75%, 88.96%, 95.45%, and 100.00%, respectively, which shows that the classification models built with the different techniques were stable. In particular, the mid-level data fusion model is the most robust, with the highest cross-validation accuracy (100.00%). Fig. 3 shows that the cross-validation accuracy is 100.00% at the optimal parameters c = 0.57435 and g = 0.0625. The model established by mid-level data fusion therefore appears more reliable than those of the other strategies.
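The grid search over c and g can be sketched generically as below. The scoring callable `cv_accuracy` is hypothetical: in practice it would train a kernel SVM with the given (c, g) and return the sevenfold cross-validation accuracy on the calibration set. The base-2 exponent grid and the tie-breaking rule (prefer the smaller c) are also assumptions; the paper does not state its grid, although the reported optimum c = 0.57435 is numerically close to 2^-0.8 and g = 0.0625 equals 2^-4.

```python
import math
from itertools import product

def exponent_grid(start, stop, step):
    """Evenly spaced base-2 exponents from start to stop inclusive."""
    n_steps = int(round((stop - start) / step))
    return [start + i * step for i in range(n_steps + 1)]

def grid_search(cv_accuracy, log2_c, log2_g):
    """Return (best_accuracy, c, g) over all combinations of the two exponent lists."""
    best = None
    for exp_c, exp_g in product(log2_c, log2_g):
        c, g = 2.0 ** exp_c, 2.0 ** exp_g
        acc = cv_accuracy(c, g)
        # Keep the higher accuracy; on ties prefer the smaller penalty parameter.
        if best is None or acc > best[0] or (acc == best[0] and c < best[1]):
            best = (acc, c, g)
    return best

def toy_accuracy(c, g):
    """Stand-in scorer for demonstration only; NOT a real cross-validated SVM."""
    return 1.0 - abs(math.log2(c) + 0.8) - abs(math.log2(g) + 4.0)

# Example with the stand-in scorer on an assumed grid of exponents -8 ... 8 in steps of 0.8.
grid = exponent_grid(-8.0, 8.0, 0.8)
best_acc, best_c, best_g = grid_search(toy_accuracy, grid, grid)
```

Preferring the smaller c among equally accurate settings favours a wider margin, which is one common way to limit overfitting when several grid points reach the same cross-validation accuracy.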
Furthermore, one of the most important criteria for evaluating a classification model is the prediction accuracy for the test set. In Fig. 4, the mushroom species A, B, C, D, and E are associated with the y class values 1, 2, 3, 4, and 5. Fig. 4 shows the SVM classification models: 9, 20, 4, and 1 samples were misclassified by the FT-IR spectra (Fig. 4a), element concentration (Fig. 4b), low-level data fusion (Fig. 4c), and mid-level data fusion (Fig. 4d) strategies, respectively. As can be seen in Fig. 4b, about 70% of the samples were classified correctly. This result shows that the element concentration is related to the mushroom species, in agreement with the research by Yin et al. [45] The model established by mid-level data fusion has the smallest discrimination error: only one B. speciosus sample was wrongly classified as B. edulis. To evaluate the SVM classification models based on single instruments and on data fusion, the identification results are listed in Table 4. The classification models established with low-level and mid-level fusion data were better than those based on FT-IR spectra or element concentration alone, which indicates that the data fusion strategy captures more chemical information for classification. The test-set accuracies of low-level and mid-level data fusion were 94.74% and 98.68%, respectively. Both data fusion strategies could clearly identify the different species of mushroom samples effectively. However, taking the cross-validation accuracy into account as well, mid-level data fusion provided the more acceptable results for classifying the different species based on SVM. Because low-level and mid-level data fusion already gave satisfactory results in this study, high-level data fusion was not investigated further. Previous studies have indicated that elevation affects the content of chemical substances. [46] Consequently, geographical conditions and climatic characteristics can affect the identification results. In the present study, the samples were collected from different origins with complex surroundings and in different years. These factors may explain why one sample was misclassified by mid-level data fusion. Overall, data fusion is a promising way to overcome these difficulties in the quality control of edible mushrooms.
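The reported test-set accuracies follow directly from the misclassification counts and the test-set size of 76. The short check below reproduces that arithmetic; the dictionary keys and the function name are illustrative only.

```python
def accuracy_percent(n_total, n_wrong):
    """Percentage of correctly classified samples, rounded to two decimals."""
    return round(100.0 * (n_total - n_wrong) / n_total, 2)

TEST_SET_SIZE = 76  # one third of the 230 samples
misclassified = {
    "FT-IR": 9,
    "ICP-AES": 20,
    "low-level fusion": 4,
    "mid-level fusion": 1,
}
accuracies = {name: accuracy_percent(TEST_SET_SIZE, wrong)
              for name, wrong in misclassified.items()}
```

With these counts, the two fusion strategies give 94.74% and 98.68%, as stated in the text, and the element-concentration model gives 73.68%, consistent with the "about 70%" quoted for Fig. 4b.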

Conclusion
In this study, FT-IR and ICP-AES were used, individually and fused, to discriminate various species of Boletaceae mushrooms. SVM pattern recognition was applied to construct the discriminant models for identifying the different species of edible mushrooms, and the classification capacities of the individual and data fusion strategies were then compared. The identification performance of the data fusion strategies was better than that of either individual instrument, possibly because data fusion captures more chemical information that is useful for classification. When low-level and mid-level data fusion were compared, mid-level data fusion proved to be the better classification strategy, with SVM classification accuracies of 100.00% and 98.68% for the calibration and test sets, respectively. These results suggest that low-level data fusion retains both effective information and noise, whereas mid-level data fusion not only extracts useful features but also eliminates unnecessary variables by PLS-DA. In conclusion, the study demonstrated that FT-IR coupled with ICP-AES, using mid-level data fusion and SVM, is an effective and accurate method for discriminating wild-grown Boletaceae mushrooms for quality control. The approach may therefore also be suitable for evaluating similar foods.