Hyperspectral estimation of petroleum hydrocarbon content in soil using ensemble learning method and LASSO feature extraction

ABSTRACT Rapid and accurate estimation of soil petroleum hydrocarbon content is crucial for analyzing the degree of soil pollution and evaluating pollution status. Surface soil samples and hyperspectral measurements from the reservoir area in Lenghu Town, Qinghai Province, China, were taken as the research objects, and the correlation between different spectral transformations of the original data and the petroleum hydrocarbon content in soil was analyzed. To improve the estimation accuracy, we propose combining the least absolute shrinkage and selection operator (LASSO) with two ensemble learning methods, extremely randomized trees (ERT) and gradient boosting decision tree (GBDT), to construct a hyperspectral estimation model. The results show that the LASSO algorithm not only handles the spectral multicollinearity problem effectively but also greatly reduces the number of soil hyperspectral variables and the computational complexity. Compared with traditional machine learning, ERT and GBDT perform better; in particular, the LASSO-GBDT model achieves the highest estimation accuracy.


Introduction
Soil is a key component of the living environment [1]. From a production perspective, soil provides fertility for green plants; from the perspective of environmental protection, it is a receptor of pollutants. Because soil pollution is concealed, latent, irreversible, and chronic, it can further cause and aggravate pollution of water, the atmosphere, and living organisms, and it is a major form of environmental pollution affecting human health [2,3]. Petroleum may leak into the soil during extraction, transportation, use, and storage [4]. Large quantities of petroleum hydrocarbons change the physicochemical properties of soil and lead to fertility reduction, salinization, and desertification [5]. In addition, they endanger life and health through plants and the food chain. More importantly, pollution can damage the structure and function of the ecosystem. Rapid and accurate estimation of petroleum hydrocarbons in soil is therefore not only useful for analyzing and reporting soil pollution status but also of practical value for protecting and managing the soil ecological environment [6,7].
Traditional detection of petroleum hydrocarbon pollution in soil requires field collection of soil samples followed by laboratory analysis [8,9]. Rapid and large-scale monitoring is not feasible because of the high cost of detection and the lack of timeliness and portability [10]. In recent years, hyperspectral remote sensing has been applied to the estimation of soil component content on account of its rapidity and accuracy [11,12]. Hyperspectral remote sensing records soil reflectance spectra in a large number of narrow, closely spaced wavebands [13][14][15]. By using the characteristics of reflectance spectra to estimate physicochemical properties quantitatively, hyperspectral remote sensing has clear advantages over traditional detection techniques, especially for the efficient acquisition of information on soil pollution over large areas [16][17][18]. Over the past few years, the feasibility of hyperspectral remote sensing for monitoring petroleum hydrocarbon pollution in soil has been widely demonstrated [7,19]. Hauser et al. established a partial least squares regression (PLSR) model to predict the total petroleum hydrocarbon (TPH) content of samples using near-infrared spectroscopy (NIRS) with spectral data collected from the training set; the correlation coefficient between the predictions for the validation set and the measured laboratory data was relatively high, indicating that the NIRS method can predict soil TPH content effectively [20]. Scafutto et al. designed experiments in which soil was impregnated with oil at different concentrations and established estimation models of petroleum hydrocarbon content in soil using ground hyperspectral data. The comparison between the predicted results and the measured data showed that hyperspectral remote sensing enables effective monitoring even at low oil concentrations [21]. Yin et al. demonstrated that a stepwise multiple linear regression (SMLR) model built on hyperspectral data estimated TPH content better than a univariate regression model, indicating that the SMLR model can be a satisfactory solution for the rapid quantitative assay of TPH content in soil [22]. Because the spectral information of the various petroleum hydrocarbon components overlaps, Chakraborty et al. established a multiple linear regression (MLR) model with wavelet transform based on hyperspectral data for predicting petroleum hydrocarbons in soils with a complex pollutant composition [23].
With the rapid development of machine learning theory and technology, estimation models that relate hyperspectral features to soil component content using machine learning algorithms have been studied widely [24][25][26][27]. Such models can adaptively simulate the relationship between input and output data and continuously mine soil spectral information, showing an excellent estimation effect [28]. At present, artificial neural network (ANN), classification and regression tree (CART), support vector machine (SVM), and extreme learning machine (ELM) are widely applied to the estimation of soil component content [29][30][31][32]. Solving a machine learning problem can be regarded as searching the hypothesis space for a learning model with strong generalization ability and high robustness [28,33,34]. Since the hypothesis space is artificial, the actual target hypothesis is not in the hypothesis space in most applications of machine learning. The search therefore suffers from problems such as strong sensitivity to training samples, high computational complexity, and over-fitting. Ensemble learning is a hot topic in machine learning [35]. By combining multiple learning models of the same or different types, an ensemble learning model usually achieves better performance than a single learning model [36,37]. In terms of how the learners are combined, the commonly used ensemble learning methods are bagging (parallel) and boosting (serial) ensemble learning [38]. In recent years, various ensemble learning algorithms, such as adaptive boosting (AdaBoost), random forest (RF), and Stacking, have been applied widely to the spectral quantitative estimation of soil component content [39,40]. Compared with the traditional methods, machine learning models built within an ensemble framework perform better because of their strong learning ability, high stability, and strong generalization ability.
Hyperspectral remote sensing data are large and informative because the soil spectral information is recorded in continuous, closely spaced bands. However, the number of bands is so large that it leads to collinearity and redundant information, which affect the modeling and prediction results [41]. The LASSO algorithm is widely used because it can efficiently extract key variables from high-dimensional variables, solve the multicollinearity problem among variables, and improve the accuracy of the model [42]. The algorithm is a shrinkage estimation method that retains a subset of the variables. A refined model is built by constructing a penalty function and shrinking the variable coefficients through the value of the penalty parameter. The variable coefficients keep converging toward 0 as the penalty parameter increases. When the penalty parameter is large enough, some of the variable coefficients are compressed to exactly 0, and the model thereby reduces the number of original variables [43][44][45].
The surface soil samples collected from reservoir areas in Lenghu Town, Haixi Prefecture, Qinghai Province, China, were used as experimental data to analyze spectral reflectance and its change characteristics. The LASSO regression algorithm, used for selecting the feature bands of the hyperspectral data, was combined with the ensemble learning algorithms ERT and GBDT to build a hyperspectral estimation model of petroleum hydrocarbons in soil. The prediction performance of the models was evaluated by the coefficient of determination (R²), root-mean-square error (RMSE), and residual prediction deviation (RPD). The main purposes of this study were to: (1) analyze the correlation between different transformation forms of the soil spectrum and petroleum hydrocarbon content and confirm the sensitive band range; and (2) construct an estimation model of petroleum hydrocarbons in soil with ensemble learning methods based on the LASSO algorithm, which can improve the estimation performance of the ensemble learning model. The study can provide an efficient new method for estimating petroleum hydrocarbon content in soil.

Study area and sample collection
The study area is an important petroleum industry base located in Lenghu Town, Mangya City, Haixi Mongolian and Tibetan Autonomous Prefecture, Qinghai Province, on the northern margin of the Qaidam Basin. The terrain is high in the northwest and low in the southeast, and the area is one of the extremely arid regions of China, with large areas of Yadan landforms. The soil sampling area is located on the west side of the S210 provincial road in the southeast of Lenghu Town. First, 45 sampling sites were selected around the oil wells, each at least 150 m from the highway. Then, a 50 × 50 m sampling grid was set at every sampling site, and the five-point sampling method was applied to collect five surface soil samples from a depth of 5-15 cm within the grid. Finally, the five samples were mixed evenly and put into a sealed plastic bag. After the soil sample was sealed, the sampling site was numbered, and GPS positioning was used to record the longitude and latitude of the site at its center point. Thus, a total of 45 soil samples were collected (Figure 1).

Petroleum hydrocarbon content and spectral measurement
After straw, gravel, and other impurities were removed, the samples were air-dried and passed through a 0.25 mm mesh sieve (Figure 2). The qualified samples were divided into two parts: one for the determination of petroleum hydrocarbons in soil and the other for indoor hyperspectral measurements. After extraction, purification, concentration, and dilution to constant volume, the petroleum hydrocarbon (C10-C14) content was measured by gas chromatography. The results of the statistical analysis of the petroleum hydrocarbon content of the 45 samples are given in Table 1. The maximum and mean values of petroleum hydrocarbon content were 15.41 and 3.75 g/kg, respectively. According to the China National Environmental Quality Standards for Soils, the standard value for petroleum hydrocarbons in soil pollution evaluation is 5 g/kg for industrial land. The pollution level of some soil samples in the study area exceeds this standard value. To balance the content levels of the training and validation sets, the 45 samples were divided into 15 groups according to their content level. One sample was selected randomly from each group as a validation sample, and the remaining samples were used for training, giving 15 validation samples and 30 training samples.
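The grouping-based split described above can be sketched as follows. This is a minimal illustration that assumes the 15 groups are formed by sorting the samples by content and cutting the sorted list into equal-size runs; the text does not state how the groups were formed. The function name `stratified_split`, the random seed, and the synthetic content values are hypothetical.

```python
import random

def stratified_split(contents, n_groups=15, seed=0):
    """Sort samples by content, cut into n_groups runs, draw one validation sample per run."""
    rng = random.Random(seed)
    order = sorted(range(len(contents)), key=lambda i: contents[i])
    size = len(order) // n_groups
    validation = []
    for g in range(n_groups):
        start = g * size
        # The last group absorbs any remainder when the count is not divisible.
        end = (g + 1) * size if g < n_groups - 1 else len(order)
        validation.append(rng.choice(order[start:end]))
    chosen = set(validation)
    training = [i for i in range(len(contents)) if i not in chosen]
    return training, sorted(validation)

# Synthetic petroleum hydrocarbon contents in g/kg (illustration only, not the study data).
data_rng = random.Random(1)
contents = [round(data_rng.uniform(0.1, 15.0), 2) for _ in range(45)]

train_idx, val_idx = stratified_split(contents)
print(len(train_idx), len(val_idx))  # 30 15
```

Because every group contributes exactly one validation sample, the validation set spans the full content range rather than clustering at low or high values.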
The ASD FieldSpec 4 spectroradiometer was used to perform spectral measurements on the collected soil samples in a dark room to obtain continuous spectral curves within 350-2500 nm. Each of the 45 samples was put into a black glass container, and a 50-watt halogen lamp was placed vertically over the soil sample as an indoor light source. To improve the accuracy of the spectral measurements, a white reference calibration was performed before each measurement, and the mean of 10 spectral measurements was taken as the reflectance spectrum of the soil sample (Figure 3). The spectral reflectance data in the bands of 350-399 nm and 2451-2500 nm were eliminated because of their low signal-to-noise ratio. The spectroradiometer uses different detector elements for different spectral ranges: 1050 nm is the interface between the visible and short-wave infrared sensors, and 1800 nm is the interface between the short-wave infrared and long-wave infrared sensors. The reflectance curve therefore shows breakpoint jumps at 1050 nm and 1800 nm, and these breakpoints were corrected with the instrument software. Since the sampling interval of the ASD FieldSpec 4 spectroradiometer is 1 nm, the spectral resolution is high and the number of bands is large, so information may overlap between adjacent bands. Thus, the spectral data were resampled with a wavelength sampling interval of 5 nm. After denoising and resampling, the original spectral reflectance was transformed by first-order differential, reciprocal logarithmic, and continuum removal transformations, among others. Different forms of spectral transformation help to locate peaks accurately and quickly and thus to pick out the sensitive bands.
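As a rough illustration of the resampling and two of the transformations mentioned above, the sketch below averages each run of five 1 nm bands and then computes the first-order differential and the reciprocal logarithm. The averaging scheme is an assumption (the resampling procedure is not described in the text), although it does turn the 2051 bands of 400-2450 nm into 410 bands, the count reported later. The function names and the synthetic spectrum are hypothetical, and continuum removal is omitted.

```python
import math

def block_average(wavelengths, reflectance, width=5):
    """Average each run of `width` consecutive bands; an incomplete trailing run is dropped."""
    n = len(wavelengths) // width
    wl = [sum(wavelengths[i * width:(i + 1) * width]) / width for i in range(n)]
    rf = [sum(reflectance[i * width:(i + 1) * width]) / width for i in range(n)]
    return wl, rf

def first_derivative(wavelengths, values):
    """Central-difference derivative at the interior bands (FD-R when applied to reflectance)."""
    return [
        (values[i + 1] - values[i - 1]) / (wavelengths[i + 1] - wavelengths[i - 1])
        for i in range(1, len(values) - 1)
    ]

def log_reciprocal(reflectance):
    """lg(1/R); requires strictly positive reflectance."""
    return [math.log10(1.0 / r) for r in reflectance]

# Synthetic spectrum on 400-2450 nm at 1 nm spacing (illustration only).
wavelengths = list(range(400, 2451))
reflectance = [0.2 + 1e-4 * (w - 400) for w in wavelengths]

wl5, r5 = block_average(wavelengths, reflectance)
fd_r = first_derivative(wl5, r5)        # FD-R
lg_r = log_reciprocal(r5)               # lg(1/R)
fd_lg_r = first_derivative(wl5, lg_r)   # FD-(lg1/R)
print(len(wl5), len(fd_r))  # 410 408
```

The central difference loses one band at each end of the spectrum, which is why the derivative spectra are two bands shorter than the resampled reflectance.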

LASSO algorithm
The LASSO algorithm, first proposed by Robert Tibshirani in 1996 [46], is essentially a shrinkage estimation method that retains the advantages of subset selection and provides a biased estimator for data with complex collinearity. Based on linear regression, the sum of the absolute values of the regression coefficients is constrained below a certain threshold by adding a norm (penalty) function, and the coefficients of variables that contribute little are compressed to 0 and eliminated when the objective function is optimized. Therefore, all remaining variables are the required feature variables [42,47].
For a data set $D = \{(X_j, y_j)\}$, $j = 1, 2, \ldots, n$, $X_j = (x_{j1}, \ldots, x_{jm})$ represents the explanatory variables, $y_j$ represents the explained variable, and $\beta = (\beta_1, \ldots, \beta_m)$ denotes the regression coefficients. Given a simple linear regression model, the optimized objective function is expressed as:
$$\min_{\beta} \sum_{j=1}^{n} \Big( y_j - \sum_{i=1}^{m} \beta_i x_{ji} \Big)^2$$
When the number of independent variables is large or there is a multicollinearity problem between variables, the coefficient matrix is unstable and cumbersome to solve, and the model becomes more complicated and prone to over-fitting. To solve these problems, the LASSO algorithm adds a penalty term to the loss function to constrain the complexity of the model, so that models with both low empirical risk and low complexity are selected. The optimization objective function after adding the penalty function is:
$$\min_{\beta} \sum_{j=1}^{n} \Big( y_j - \sum_{i=1}^{m} \beta_i x_{ji} \Big)^2 + \lambda \sum_{i=1}^{m} |\beta_i|$$
The second term represents the penalty on the coefficients, and $\lambda$ is the adjustment coefficient that controls the degree of compression of each variable. Changing $\lambda$ adjusts the variables and can compress the coefficients of unimportant variables to 0. A smaller $\lambda$ means a weaker penalty and more retained variables, while a larger $\lambda$ means a stronger penalty and fewer retained variables. Therefore, the algorithm can extract feature variables effectively.
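To make the shrinkage mechanism concrete, the following is a minimal coordinate-descent sketch that minimises the squared-error loss plus an L1 penalty with weight `lam` and no intercept, which assumes centred data. It is an illustration and not the implementation used in the study; the function names and the toy data are hypothetical. Each coefficient update applies soft-thresholding with threshold `lam / 2`, which is the step that sets weak coefficients exactly to zero.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values inside [-t, t] become exactly 0."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise sum_j (y_j - sum_i b_i x_ji)^2 + lam * sum_i |b_i| by coordinate descent."""
    n, m = len(X), len(X[0])
    beta = [0.0] * m
    resid = list(y)  # residual y - X @ beta; beta starts at zero
    for _ in range(n_iter):
        for k in range(m):
            z = sum(X[j][k] ** 2 for j in range(n))
            if z == 0.0:
                continue
            # Correlation of column k with the residual that excludes column k.
            rho = sum(X[j][k] * (resid[j] + beta[k] * X[j][k]) for j in range(n))
            new = soft_threshold(rho, lam / 2.0) / z
            delta = new - beta[k]
            if delta != 0.0:
                for j in range(n):
                    resid[j] -= delta * X[j][k]
                beta[k] = new
    return beta

# Toy centred data: column 0 drives y, column 1 is nearly collinear with column 0,
# column 2 is unrelated to y.
x0 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x1 = [-1.9, -1.1, 0.1, 0.9, 1.8]
x2 = [1.0, -2.0, 2.0, -2.0, 1.0]
X = [list(row) for row in zip(x0, x1, x2)]
y = [2.0 * v for v in x0]

beta = lasso_cd(X, y, lam=2.0)
print(beta)
```

In this toy case the nearly collinear column and the unrelated column both receive a coefficient of exactly zero, while the coefficient of the informative column is shrunk slightly below its least-squares value of 2.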

ERT model
The ERT method is an ensemble learning method under the parallel integration framework, proposed by Pierre Geurts and colleagues in 2006 after extensive experimental studies [48]. The method alleviates the problem of similarity among the decision trees in a random forest [49]. In the extreme random tree, each base learner is trained on the whole data set, which makes full use of the training samples and reduces the final prediction bias to a certain extent [50]. To ensure structural differences between the decision trees, the extreme random tree introduces greater randomness in node division: when a node is split, the split points of the candidate features are selected at random. Thus, the variance of the ensemble is reduced, and the generalization performance is improved. The prediction results of all base learners are recorded, and the final result is produced by voting (or, for regression, by averaging) [51].
According to the 'error-ambiguity decomposition' theory, let the generalization error of base learner $i$ be $E_i$ and its weight in the ensemble be $w_i$. The weighted generalization error of the base learners is:
$$\bar{E} = \sum_{i} w_i E_i$$
Assuming that the ambiguity (bifurcation value) of base learner $i$ is $A_i$, the weighted ambiguity of the base learners is:
$$\bar{A} = \sum_{i} w_i A_i$$
The generalization error of the extreme random tree is then expressed as:
$$E = \bar{E} - \bar{A}$$
Therefore, improving the diversity of the base learners and the prediction accuracy of the individual learners can further improve the prediction accuracy of the extreme random tree model.
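The decomposition can be checked numerically for a single input under squared error. In the sketch below, the error of a learner is its squared deviation from the target and its ambiguity is its squared deviation from the weighted ensemble prediction; the function name and the numbers are hypothetical, and the weights are assumed to sum to 1.

```python
def error_ambiguity(preds, weights, target):
    """Return (ensemble error, weighted mean error, weighted mean ambiguity) for one input."""
    ensemble = sum(w * h for w, h in zip(weights, preds))
    e_bar = sum(w * (h - target) ** 2 for w, h in zip(weights, preds))
    a_bar = sum(w * (h - ensemble) ** 2 for w, h in zip(weights, preds))
    e_ens = (ensemble - target) ** 2
    return e_ens, e_bar, a_bar

# Three base learners predicting a target of 3.5 (illustration only).
preds = [3.0, 4.0, 5.0]
weights = [0.25, 0.5, 0.25]
e_ens, e_bar, a_bar = error_ambiguity(preds, weights, target=3.5)
print(e_ens, e_bar, a_bar)  # 0.25 0.75 0.5
```

The ensemble error (0.25) equals the weighted mean error (0.75) minus the weighted mean ambiguity (0.5), so for a fixed mean error, more disagreement among the base learners lowers the ensemble error.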

GBDT model
The GBDT is an algorithm that combines decision trees with the boosting idea, proposed by Professor Friedman of Stanford University in 2001 [52]. It is an additive model in which multiple weak learners are combined, with weights, into a strong learner, and it can be expressed as [53]:
$$F(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
where $F(x)$ is the model objective function; $T$ is the number of decision trees constructed during boosting; $h_t(x)$ is the decision tree serving as the $t$-th weak learner; and $\alpha_t$ is the weight of the $t$-th tree. The type of decision tree determines what kind of problem the gradient boosting tree can solve. For a classification problem, the decision tree is a binary classification tree; for a regression problem, the decision tree is a binary regression tree.
The GBDT adopts a forward stagewise algorithm [54], and $F_0(x)$ is the initial value of the model, which is a constant. At step $m$, the model is expressed as:
$$F_m(x) = F_{m-1}(x) + \alpha_m h_m(x)$$
where $F_{m-1}(x)$ is the current model, and the new classification and regression tree $h_m(x)$ is obtained by minimizing the loss function:
$$h_m = \arg\min_{h} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + h(x_i)\big)$$
where $N$ is the number of samples and $L$ is the loss function. The gradient boosting decision tree solves for the optimal model by gradient descent, and the descent direction is defined as the negative gradient of the loss function evaluated at $F_{m-1}(x)$.
$\alpha_m$ is acquired by line search:
$$\alpha_m = \arg\min_{\alpha} \sum_{i=1}^{N} L\big(y_i, F_{m-1}(x_i) + \alpha h_m(x_i)\big)$$
Regularization of the gradient-boosted decision trees can be carried out by setting the learning rate:
$$F_m(x) = F_{m-1}(x) + v \, \alpha_m h_m(x)$$
where $v$ represents the learning rate. A lower learning rate means that more decision trees are needed; the final error is usually smaller, but the training time increases.
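The forward stagewise procedure can be illustrated with a deliberately small sketch: one input feature, squared-error loss (whose negative gradient is the ordinary residual), and depth-one regression trees as weak learners. All function names and the toy data are hypothetical, and a practical model would use deeper trees and many features.

```python
def fit_stump(x, r):
    """Least-squares depth-one regression tree: returns (threshold, left_mean, right_mean)."""
    xs = sorted(set(x))
    if len(xs) < 2:
        raise ValueError("need at least two distinct feature values")
    best = None
    for a, b in zip(xs, xs[1:]):
        thr = (a + b) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= thr]
        right = [ri for xi, ri in zip(x, r) if xi > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1:]

def stump_predict(stump, xi):
    thr, lm, rm = stump
    return lm if xi <= thr else rm

def gbdt_fit(x, y, n_trees=50, learning_rate=0.1):
    f0 = sum(y) / len(y)                 # constant initial model F_0
    current = [f0] * len(y)
    trees = []
    for _ in range(n_trees):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual.
        resid = [yi - fi for yi, fi in zip(y, current)]
        stump = fit_stump(x, resid)
        h = [stump_predict(stump, xi) for xi in x]
        den = sum(hi * hi for hi in h)
        # Line search for squared loss has a closed form; with least-squares leaves it is about 1.
        alpha = sum(ri * hi for ri, hi in zip(resid, h)) / den if den > 0 else 0.0
        current = [fi + learning_rate * alpha * hi for fi, hi in zip(current, h)]
        trees.append((alpha, stump))
    return f0, learning_rate, trees

def gbdt_predict(model, xi):
    f0, learning_rate, trees = model
    return f0 + sum(learning_rate * a * stump_predict(s, xi) for a, s in trees)

def mse(model, x, y):
    return sum((yi - gbdt_predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy step-like data (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
mse_initial = mse(gbdt_fit(x, y, n_trees=0), x, y)
mse_final = mse(gbdt_fit(x, y, n_trees=50, learning_rate=0.1), x, y)
print(mse_final < mse_initial)  # True
```

With a learning rate of 0.1, each tree removes only a tenth of the residual it fits, which is why many trees are needed before the training error becomes small.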

Analysis of soil sample spectral characteristic
The spectral curves of the soil samples measured indoors, after treatment, are shown in Figure 4. It can be seen from Figure 4(a) that the original reflectance of the collected soil samples is between 0 and 0.7, and the spectral curves of the samples fluctuate in a similar way. In the visible band, the reflectance increases sharply with increasing wavelength. The spectral reflectance of soil in the near-infrared region is generally higher than that in the visible region. Two obvious troughs around 1400 nm and 1900 nm are mainly caused by absorption by residual water in the soil and water vapor in the air. There is a slight depression at 2200 nm, which is mainly caused by clay minerals in the soil. Figures 4(b)-(f) show the spectral curves of the original reflectance after the first-order differential transformation (FD-R), reciprocal logarithmic transformation (lg1/R), reciprocal logarithmic first-order differential transformation (FD-(lg1/R)), continuum removal transformation (CR), and continuum removal first-order differential transformation (FD-(CR)), respectively. According to these results, the various spectral transformations amplify the variations in the original spectrum significantly, which is consistent with previous research [55].

Spectral correlation coefficient analysis
The Pearson correlation coefficient between petroleum hydrocarbon content in soil and soil spectral reflectance was calculated, and a correlation graph was drawn. According to the correlation coefficients of the original spectrum in Figure 5(a), the petroleum hydrocarbon content in soil is negatively correlated with the spectral reflectance. Compared with the original spectral reflectance, the spectral data after transformation show an improved correlation with petroleum hydrocarbon content (Figure 5(a)-(f)). Among the transformations, the first-order differential transformation has a significant effect. After the first-order differential transformation, the correlation coefficient between petroleum hydrocarbon content in soil and spectral reflectance fluctuates between positive and negative values, and the number of peaks increases. In addition, the peak values of the correlation coefficient increase significantly, which is conducive to improving the prediction accuracy of the spectral estimation model. A total of 410 spectral bands were obtained after spectral denoising and resampling at 5 nm resolution. However, the bands are still very dense. Therefore, the bands with high correlation were selected as the training data of the ensemble learning model according to the calculated correlation coefficients. In this study, a statistical significance test was used to judge the degree of correlation between petroleum hydrocarbon content and the spectral reflectance of the samples, with a two-sided significance level α = 0.01, for which the corresponding critical value is 0.38. The statistical results show that the correlation between spectral reflectance and petroleum hydrocarbon content is most significant after the first-order differential transformation, for which 112 bands pass the significance test (Table 2). Compared with the other transformations, the correlation coefficient curve of FD-R has the largest peak magnitude, −0.724, at a wavelength of about 995 nm.
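The critical value of 0.38 can be recovered from the usual t-test for a Pearson correlation, where the critical coefficient equals t divided by the square root of t squared plus the degrees of freedom (n − 2). The sketch below assumes a tabulated two-sided t quantile of about 2.695 for 43 degrees of freedom at α = 0.01, and then screens bands by that threshold. The function names and the toy data are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

def critical_r(t_crit, n):
    """Critical |r| for n samples, given the two-sided Student t quantile for n - 2 df."""
    return t_crit / math.sqrt(t_crit ** 2 + (n - 2))

def significant_bands(bands, content, threshold):
    """Indices of bands whose |r| with the content reaches the threshold."""
    return [k for k, band in enumerate(bands) if abs(pearson(band, content)) >= threshold]

# Assumed table value: two-sided alpha = 0.01, 43 degrees of freedom.
T_CRIT_DF43 = 2.695
r_crit = critical_r(T_CRIT_DF43, n=45)
print(round(r_crit, 2))  # 0.38

# Toy screening example with two bands (illustration only; the threshold is reused
# for demonstration even though it was derived for 45 samples).
content = [1.0, 2.0, 3.0, 4.0, 5.0]
bands = [
    [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly negatively correlated with content
    [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with content
]
selected = significant_bands(bands, content, r_crit)
print(selected)  # [0]
```

Screening on the absolute value keeps strongly negative bands, such as the −0.724 peak of the FD-R curve, as well as positive ones.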

Feature bands extraction based on LASSO algorithm
After the correlation coefficient significance test, the number of bands is reduced from the 410 obtained after denoising and resampling to 112. Nevertheless, because hyperspectral measurement data are continuous and high-dimensional, data redundancy may still exist among these bands. Therefore, the LASSO algorithm is used to reduce the dimension of the spectral band information and extract feature bands as training sample data for the ensemble learning model. The penalty coefficient λ, an important parameter of the model, reduces model complexity by controlling the size of the penalty and thus the overall degree of shrinkage, which ultimately improves the performance of the model. In this study, the 112 sensitive bands after first-order differential transformation were taken as input samples; a grid search was used with the minimum RMSE as the fitness value of the optimization objective function, and the optimal value of λ was searched by 10-fold cross-validation. As shown in Figure 6, when λ is 0.09, the RMSE reaches its minimum value of 1.063, and the number of feature bands is reduced from 112 to 14, effectively compressing the number of feature bands. As shown in Figure 7, the 14 feature bands extracted by the LASSO algorithm are mainly distributed in the near-infrared and shortwave-infrared spectral ranges.
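The search for the penalty coefficient can be sketched as a grid search scored by 10-fold cross-validated RMSE. To stay short, the example below uses a one-variable LASSO without intercept, which has a closed-form soft-thresholding solution, as a stand-in for the full 112-band model. The grid, the contiguous unshuffled folds, the pooled RMSE, the function names, and the synthetic data are all assumptions made for illustration.

```python
import math

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size (no shuffling)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def fit_lasso_1d(x, y, lam):
    """Closed-form minimiser of sum (y - b x)^2 + lam |b| for a single variable."""
    rho = sum(xi * yi for xi, yi in zip(x, y))
    z = sum(xi * xi for xi in x)
    if rho > lam / 2.0:
        return (rho - lam / 2.0) / z
    if rho < -lam / 2.0:
        return (rho + lam / 2.0) / z
    return 0.0

def predict_1d(b, xi):
    return b * xi

def cv_rmse(x, y, lam, fit, predict, k=10):
    """RMSE pooled over all held-out predictions of a k-fold cross-validation."""
    sq_err = 0.0
    for fold in kfold_indices(len(x), k):
        held = set(fold)
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train], lam)
        sq_err += sum((y[i] - predict(model, x[i])) ** 2 for i in fold)
    return math.sqrt(sq_err / len(x))

# Synthetic one-band data (illustration only).
x = [i / 10.0 - 1.0 for i in range(20)]
y = [1.5 * xi + 0.1 * math.sin(7 * i) for i, xi in enumerate(x)]

grid = [round(0.01 * i, 2) for i in range(1, 101)]  # candidate lambda values 0.01 .. 1.00
scores = {lam: cv_rmse(x, y, lam, fit_lasso_1d, predict_1d, k=10) for lam in grid}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

Passing the fit and predict functions as arguments keeps the cross-validation loop independent of the model, so the same loop could score a multi-band LASSO for each candidate penalty.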

Ensemble learning model of petroleum hydrocarbon content in soil
The training samples contain the input and output variables of the models, with the 14 feature spectral bands selected by the LASSO algorithm as the input variables and the petroleum hydrocarbon content as the output variable. The ERT and GBDT ensemble learning models are used for the hyperspectral estimation of petroleum hydrocarbons in soil. Ensemble learning forms a strong learner by training several weak learners and combining them according to a combination strategy. In terms of how the learners are combined, ensemble learning methods include bagging (parallel) and boosting (serial) ensemble learning, of which ERT and GBDT, respectively, are typical representatives. The base learner of both models is the regression tree, which performs well in regression estimation problems. When an ensemble learning model is used for regression prediction, its prediction accuracy and generalization performance can be improved effectively by tuning its key internal parameters. In the ERT model, the depth and the number of regression trees are the key parameters; in the GBDT model, the learning rate and the number of regression trees are the important parameters. A grid search with cross-validation is applied to optimize the parameters of the two models, with the mean absolute percentage error (MAPE), obtained by iterative calculation, as the fitness value of the optimization objective function. The results show that MAPE reaches its minimum (1) in the ERT model when the depth of the regression trees is 5 and the number of regression trees is 20, and (2) in the GBDT model when the learning rate is 0.8 and the number of regression trees is 3 (Figure 8).
The ERT and GBDT models were used for the hyperspectral estimation of petroleum hydrocarbon content in soil with the parameters obtained from the optimization. For comparison, the prediction results of the ensemble learning models without LASSO dimension reduction and of the traditional machine learning SVM model were also calculated (Table 3). The evaluation indexes of the training and validation sets were calculated for each model. The indexes include R², RMSE, and RPD. Among them, R² reflects the stability of the model: the closer R² is to 1, the more stable the model is, whereas an R² close to 0 indicates that the predicted values are not correlated with the measured values.
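For reference, the three indexes can be computed as in the sketch below. The definitions are the commonly used ones and are an assumption about what was applied here: R² as one minus the ratio of residual to total sum of squares, and RPD as the sample standard deviation of the measured values divided by the RMSE. The function names and the toy numbers are hypothetical.

```python
import math

def r2_score(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

def rmse(measured, predicted):
    """Root-mean-square error."""
    return math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured))

def rpd(measured, predicted):
    """Sample standard deviation of the measured values divided by the RMSE."""
    mean = sum(measured) / len(measured)
    sd = math.sqrt(sum((m - mean) ** 2 for m in measured) / (len(measured) - 1))
    return sd / rmse(measured, predicted)

# Toy measured and predicted contents in g/kg (illustration only).
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 2.5, 3.5, 5.0]
print(round(r2_score(measured, predicted), 3))  # 0.9
print(round(rmse(measured, predicted), 3))      # 0.447
print(round(rpd(measured, predicted), 3))       # 3.536
```

A larger RPD means the prediction error is small relative to the natural spread of the measured values.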
As shown in the table, the modeling and prediction accuracy of both the SVM machine learning model and the ensemble learning models, ERT and GBDT, improve after feature extraction by the LASSO algorithm. The SVM model shows the greatest improvement, with Rp² increasing from 0.557 to 0.649 for the training data set and RPD increasing from 1.051 to 1.450 for the validation data set. According to the indexes, the accuracy of the ERT and GBDT ensemble learning models is better than that of the traditional machine learning model in both the training and validation sets, and the evaluation indexes of the LASSO-GBDT model are superior to those of the other models. To further compare the fitting and prediction effects of the different models, the predicted values were plotted against the measured values (Figure 9). In the figure, the horizontal axis is the measured value of the soil sample, the vertical axis is the value predicted by the model, the pink points are the training samples, and the blue points are the validation samples. The closer a sample point is to the 1:1 line, the closer the predicted value is to the measured value. Figure 9 shows that the evaluation indexes of the ERT and GBDT models are higher than those of the SVM model and that their training and validation samples lie closer to the 1:1 line. That is, the ensemble learning models are superior to the traditional machine learning model in fitting ability and stability, and the hyperspectral estimation model of petroleum hydrocarbons based on ensemble theory performs better.

Spectral feature summary
Hyperspectral technology can provide rich spectral information and is often used to capture subtle changes in the content of substances in soil. However, studies have shown that when the content of the detected element is relatively low in the soil, the difference in the original spectral reflectance caused by a change in that content is small [56]. After mathematical transformation of the spectral data, the absorption and reflection characteristics of the spectral curve are clearly enhanced, and the correlation coefficient between the spectral reflectance and the petroleum hydrocarbon content in soil is also improved. Among the transformations, the first-order differential transformation performed best, which is consistent with the results of Ou et al. [57]. Therefore, before the estimation model of soil petroleum hydrocarbon content is constructed, it is necessary to enhance the spectral response through mathematical transformation and to determine the best form of spectral transformation. In addition, the spectral band information obtained from soil hyperspectral measurement is redundant to a certain extent. Some studies have shown that introducing high-dimensional variables complicates the calculation, degrades model fitting, and slows computation [58]. Thus, it is crucial to select appropriate spectral bands for constructing the estimation model of petroleum    [59]. The feature bands extracted in this paper are consistent with the above findings. However, instead of directly using the successive projections algorithm (SPA), stable competitive adaptive reweighted sampling (CARS), kernel principal component analysis (KPCA), or other algorithms to extract feature spectra [60][61][62], this study combined a Pearson correlation coefficient significance threshold with the LASSO regression algorithm to extract feature bands.
By adding the constraint that the sum of the absolute values of the coefficients of all independent variables is less than a threshold value, the LASSO algorithm drives the coefficients of independent variables that are weakly correlated with the estimation target toward zero when solving the multiple regression model, which improves the representativeness of the selected variables and keeps the model interpretable and concise. Compared with using a band extraction algorithm alone on the hyperspectral data, combining the Pearson correlation coefficient threshold with the LASSO algorithm ensured the reliability of the extraction results and greatly reduced the computation.

Ensemble learning modelling summary
The traditional solution process of machine learning can be viewed as searching the hypothesis space for an appropriate learning model with strong generalization ability and high robustness. However, this search is difficult. Many studies have shown that ensemble learning is a combinatorial optimization learning method that can combine multiple simple models into a combined model with better performance. Liu et al. found that the estimation accuracy of extreme gradient boosting (XGBoost) is superior to that of other methods in predicting soil properties quantitatively [63]. Clingensmith et al. showed that RF is more accurate than PLSR in predicting soil organic carbon [64]. This is consistent with the results of this study, which show that the ensemble learning model is superior to the traditional machine learning model in fitting ability and stability and that the petroleum hydrocarbon hyperspectral estimation model based on ensemble theory performs better. According to the prediction results of the ERT and GBDT models, the prediction indexes of the GBDT model, built on the boosting serial learning principle, are higher than those of the ERT model, built on the bagging parallel learning principle. According to the soil sample content data (Table 1), the large standard deviation of the petroleum hydrocarbon content indicates that the data are highly dispersed and affected by noise. The boosting ensemble learning model represented by GBDT makes the base learner in each round pay more attention to the samples that were poorly fitted in the previous round of training. Moreover, the boosting ensemble learning model has relatively good tolerance to noise. Therefore, it generally performs better than the bagging ensemble learning model on small-sample data sets.
In addition, it can be seen from the fitting effect diagram that the fitting effect of the training set is better than that of the validation set, and the distribution of sample points of the predicted validation set is relatively scattered, indicating that both the traditional machine learning model and the ensemble machine learning model have high requirements on the reliability and validity of the training sample data.

Conclusions
This study explores the method and feasibility of estimating petroleum hydrocarbons in soil from hyperspectral data. Using surface soil samples and soil hyperspectral measurements from Lenghu Town, Qinghai Province, China, the correlation between soil spectra and petroleum hydrocarbon content was analyzed, and a method combining the LASSO algorithm and ensemble learning was proposed to construct the estimation model. The results show that: 1) Pre-processing of hyperspectral measurement data, such as spectral denoising and resampling, can effectively reduce the influence of the measurement environment and noise. After pre-processing, the first-order differential transformation of the original spectral reflectance effectively highlights the peaks and valleys of the correlation curves; 2) The LASSO algorithm effectively solves the problem of information redundancy among hyperspectral band variables and improves the representativeness and integrity of the feature data. The convergence of the model is sped up, and model overfitting is avoided, thus ensuring the interpretability and simplicity of the model; 3) The established ensemble learning models have higher stability, better learning effect, and better generalization performance than the traditional machine learning model. For sample data with high dispersion, the GBDT model built on the boosting serial learning principle has higher prediction accuracy and stronger fault tolerance than the ERT model built on the bagging parallel learning principle.

Disclosure statement
No potential conflict of interest was reported by the author(s).