Comparison of machine learning algorithms for soil salinity predictions in three dryland oases located in Xinjiang Uyghur Autonomous Region (XJUAR) of China

ABSTRACT Machine learning approaches have been applied to a wide range of prediction problems. However, there has been limited guidance regarding which, if any, machine learning models and covariate sets might be optimal for predicting soil salinity across different oases in the Xinjiang Uyghur Autonomous Region (XJUAR) of China. This study compared five machine learning algorithms, Least Absolute Shrinkage and Selection Operator (LASSO), Multivariate Adaptive Regression Splines (MARS), Classification and Regression Trees (CART), Random Forest tree ensembles (RF), and Stochastic Gradient Treeboost (SGT), for predicting soil salinity in three geographically distinct areas (the Qitai, Kuqa, and Yutian oases). A total of 21 data sets from the three oases were used to evaluate the performance of the algorithms and to screen the optimal variables. The results show that the following indices are important indicators for the quantitative assessment of soil salinity: EEVI, CSRI, EVI2, GDVI, SAIO, and SIT. The comparison shows that SGT is the most suitable algorithm for predicting soil salinity in arid areas. This study provides a comprehensive comparison of machine learning techniques for soil salinity prediction and may assist in the modelling and variable selection of digital soil mapping in the XJUAR of China.


Introduction
Globally, soil salinization has affected approximately 831 million hectares of land (Butcher, Wick, DeSutter, Chatterjee, & Harmon, 2016; Martinez-Beltran & Manzur, 2005), 23.32% of which (193.8 × 10⁶ hm²) is located in Asia (FAO, 2015), and soil salinization is predicted to impact 50% of all arable land by 2050 (Wang, Vinocur, & Altman, 2003). The Xinjiang Uyghur Autonomous Region (XJUAR) in northwestern China, the largest arid region in the country, is also one of the main areas affected by soil salinization in Asia (Wang, Chen, Luo, & Han, 2015) (Figure 1). In 2014, according to a report by the Xinjiang Institute of Ecology and Geography at the Chinese Academy of Sciences, salinized farmland accounted for 37.72% of the irrigated land in the XJUAR, representing an increase of 6% since 2006 (Tian, Mai, & Zhao, 2016). Moreover, salinized farmland in the southern XJUAR accounted for almost half (49.6%) of the irrigated land, seriously restricting the livelihoods of farmers and herdsmen.
Data mining can be defined as an automated or semiautomated process designed to uncover patterns from large digital datasets using trained models, where the resulting patterns may then be applied to new data for the purpose of prediction (Witten, Frank, & Hall, 2011). In soil science, numerous machine learning algorithms are available in the subfield of pedometrics for the development of predictive or digital soil maps, for instance, random forests (RF) (Grimm, Behrens, Märker, & Elsenbeer, 2008), multivariate adaptive regression splines (MARS) (Nawar, Buddenbaum, Hill, & Kozak, 2014), stochastic gradient treeboost (SGT) (Angileri et al., 2016), support vector machines (SVM) (Heung et al., 2016), artificial neural networks (ANN) (Heung et al., 2016), partial least squares regression (PLSR) (Nawar et al., 2014), classification and regression trees (CART) (Youssef, Pourghasemi, Pourtaghi, & Al-Katheeri, 2016), and less commonly used learners such as the least absolute shrinkage and selection operator (LASSO) (Zandler, Brenning, & Samimi, 2015). A review of this literature shows that some of these algorithms, namely LASSO, MARS, CART, RF and SGT, share several advantages: they have few user-defined parameters, they allow the importance of the independent variables to be estimated, they are computationally efficient (which is very important for large datasets), and they can handle numerical, ordinal, or discrete predictors. Furthermore, each of the five algorithms uses a different strategy for extracting useful information. LASSO and ridge regression are relatively recent approaches that use mathematically similar shrinkage penalties. These penalties push less important coefficients closer to zero in the case of ridge regression, or effectively set them to zero when the lasso technique is used (Tibshirani, 1996). Thus, the lasso performs variable subset selection and therefore produces sparse models that can be applied more easily in a predictive context.
MARS is a relatively new technique that combines classical linear regression, the mathematical construction of splines, binary recursive partitioning and brute-force search to produce a model capable of predicting the value of a target variable (categorical or continuous) from a set of independent variables (Friedman, 1991). The MARS algorithm works by partitioning the ranges of the explanatory variables into regions and by generating, for each of these regions, a linear regression equation. CART is perhaps the most commonly used learner in the digital soil mapping literature. A tree consists of nodes and leaves, where each node is a partition of the training dataset that aims to maximize the within-node homogeneity and the between-node heterogeneity based on node-splitting rules generated from a set of predictor variables (a type of if-then statement). The RF learner is conceptually similar to tree-based learners (CART) and shares the same advantages; however, multiple decision trees are trained and the results are based on the predictions from an ensemble of the individual trees (Breiman, 2001). For the RF learner, each tree is trained from a randomized bootstrap sample of the entire training set, and the subset of predictors used for the node-splitting rules is also randomly selected. The SGT method combines regression trees and a boosting technique to improve the predictive performance of multiple single models (Friedman, 2002). Boosting is a forward, stage-wise procedure in which a subset of the data is randomly selected at each iteration to fit a new tree model that minimizes the loss function. This random subsampling makes the procedure a stochastic gradient boosting procedure, which can improve model performance and reduce the risk of overfitting.
Although these five machine learning algorithms have been developed and applied for various purposes, research applying them to soil salinization remains scarce. Taghizadeh-Mehrjardi et al. (2014) selected a regression tree analysis, which is not sensitive to missing data, to infer soil salinity attributes from nonparametric data (i.e., no assumptions regarding variable distribution). In a study by Muller and Van Niekerk (2016), relationships between image features and electrical conductivity measurements of 30 soil samples were studied using a regression analysis and classification and regression tree (CART) modelling. Vermeulen and Van Niekerk (2017) evaluated the extent to which DEM derivatives (only terrain variables were used as input) and machine learning algorithms (k-nearest neighbour, support vector machines, decision trees (DT) and random forests) can be used to predict the location and extent of salt-affected areas (where there are only two classes: salt-affected and unaffected). Among these machine learning algorithms, DT held the greatest potential for monitoring salt accumulation in irrigated areas, particularly for simulating subsurface conditions. However, few studies have compared machine learning algorithms in terms of the prediction of continuous soil salinity for more than one study area in dryland regions (such as in the XJUAR). To address this knowledge gap, our research compared the soil salinity predictions of multiple machine learning algorithms for multiple study areas using soil observations in the XJUAR. Specifically, we compared five algorithms (LASSO, MARS, CART, RF, and SGT) to infer soil salinity values in three geographical areas distributed to the south and north of the Tianshan Mountains in the XJUAR of China (specifically, the Qitai Oasis, Kuqa Oasis, and Yutian Oasis). Each study area represents a certain type of arid landscape with different characteristic salinity-environment relationships in the soil.
Among the algorithms, LASSO, MARS and SGT are not commonly used in soil salinity prediction, and LASSO and SGT, as far as we know, have never been used to predict soil salinity.
This study has two main purposes: (i) to identify sensitive variables suitable for predicting soil salinity in the XJUAR; (ii) to evaluate and compare the efficiency of the five algorithms in predicting soil salinity in these three oases.

Study area
The Qitai Oasis is located in the northern piedmont of the Tianshan Mountains, just south of the Junggar Basin in the XJUAR of China (Figure 1). It is centred at 89.60°E longitude and 44.05°N latitude. The soil types primarily include Haplic Gypsisols, Cumulic Anthrosols, Calcaric Fluvisols, Gypsic Solonchaks, and Gleyic Phaeozems. The vegetation types include temperate semi-shrub and semi-dwarf shrub, temperate salinized dwarf semi-shrub, temperate rosette dwarf grass/semi-shrub steppe, annual crops, and drought-resistant economic crops. The natural vegetation consists of Achnatherum splendens (Trin.) Nevski, Alhagi sparsifolia Shap., Kalidium foliatum (Pall.) Moq, Halocnemum strobilaceum (Pall.) Bieb, and Salsola brachiata Pall. The elevation ranges from 568 to 978 m. The mean annual precipitation is 184.8 mm, and the majority of precipitation occurs between June and August; the average annual evaporation is 2141 mm, and the mean annual temperature ranges from approximately 5.1 to 6.1°C. The salt-affected land in the Qitai Oasis accounts for 31% of the total agricultural area (Zhang, Tashpolat, Ding, Tian, & Mamat, 2009). The salt type in this area is mainly sulfate, followed by chloride-sulfate, with a small proportion of chloride.
The Kuqa Oasis is located in the northwestern part of the Tarim Basin (Figure 1). This study area is centred at 82.50°E and 41.38°N and consists of a low-lying alluvial fan plain, a low-elevation alluvial floodplain, and a mid-elevation alluvial fan plain, with low-elevation fixed grass shrub and low-elevation semi-fixed grass shrub. The elevation in this area ranges from 892 to 1100 m, decreasing from northwest to southeast. The soil types primarily include Cumulic Anthrosols, Salic Fluvisols, Gypsic/Calcic Solonchaks, Cambic Arenosols, Calcic Vertisols and Calcaric Phaeozems. The vegetation cover is lower over the salt-affected land and is dominated by desert species, such as Phragmites australis, Tamarix ramosissima, Alhagi sparsifolia, Karelina caspica and Kalidium gracile (Jiang, Ding, Tashpolat, Zhao, & Zhang, 2008). The area has an extremely arid desert climate, with a mean annual precipitation of 51.6 mm, a mean potential annual evapotranspiration of 2356 mm and a mean annual temperature of 11.3°C. In this region, more than 50% of the cultivated land exhibits salinization, with 30% exhibiting serious salinization. The salt types in this area are mainly chloride and sulfate, accompanied by a small proportion of chloride-sulfate.
The three oases were selected as test areas mainly for the following reasons. First, Xinjiang covers three climatic zones: the middle temperate zone, the warm temperate zone, and the plateau climate zone (Shi et al., 2014). Among them, the annual accumulated temperature in the middle temperate zone ranges from 1600 to 3400°C with a growth period of 100 to 171 days, whereas the accumulated temperature in the warm temperate zone ranges from 3400 to 4500°C with a growth period of 171 to 218 days. The Qitai Oasis is in the middle temperate zone, while the Kuqa Oasis and Yutian Oasis belong to the warm temperate zone. Second, the differences in crop types and farming systems lead to different water resource allocation patterns in the three oases. The Qitai Oasis is mainly planted with grain crops, such as wheat and corn. The Kuqa and Yutian oases are mainly planted with cash crops, such as cotton. Third, the saline soil types are different among the three oases. According to Zhang, Xiong, Tian, & Luan (2011a), the proportion of sulfate in the surface soil of the Qitai Oasis is more than 65%, followed by chloride-sulfate (32%) and chloride (less than 2%). Chloride and sulfate soils are the main saline soil types in the Kuqa Oasis, accompanied by a small proportion of sulfate-chloride soils (Zhang, Tashpolat, & Ding, 2007). Analysis of the saline soil types in the Yutian Oasis showed that the chloride-sulfate and sulfate-chloride types were the main types, followed by the chloride and sulfate-chloride type (Gong, Liu, & Tashpolat, 2015a). Therefore, these three target areas broadly represent the soil environments of the arid regions of Xinjiang.

Landsat OLI and preparation
The satellite imagery used in this study to establish the models consisted of Landsat OLI images of the Qitai Oasis, the Kuqa Oasis and the Yutian Oasis (see Table 1 for details). All of the satellite images for the three study areas were atmospherically corrected using the FLAASH (Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes) model in the Environment for Visualizing Images (ENVI 5.1) software to minimize and normalize the additive and multiplicative effects of the atmosphere and the sun illumination geometry on the imagery. The FLAASH reflectance was rescaled to the range 0 to 1.

Soil sampling and analysis
The sampling locations were randomly selected (relatively evenly distributed) based on soil scientists' knowledge of the Qitai Oasis, including soil type, landscape, geomorphology and accessibility (these factors were also considered for the Kuqa Oasis and the Yutian Oasis). See Table 1 for the specific sampling period of each oasis. Variations in field conditions, such as soil salinity levels, vegetation types and land use types, were fully examined by comparisons with laboratory results and existing geographic maps. The sampling point selection in the field assumed that the soil properties and vegetation species were relatively consistent, the environmental factors were similar, and the heterogeneity was relatively minimal. Finally, 101 soil sampling (0-10 cm) locations were available. The distribution of the sample plots is shown in Figure 1.
A total of 189 soil samples (0-10 cm) were collected at the Kuqa Oasis. The sites spanned the full range of local geographic landforms, soil conditions, land use types and vegetation types. The distribution of the sample plots is shown in Figure 1.
Surface soil samples (0-10 cm) were collected from five local landscape types in the Yutian Oasis. These samples were located in farmland that had been cultivated for 12 years with irrigation water (mainly groundwater), a desert-oasis ecotone covered with salinized dwarf grass near the farmland, a flood plain near the end of the river characterized by high levels of vegetation coverage and a high groundwater table, a site near the Keriya River bank that was adjacent to the farmland, a barren land site with very low vegetation cover, and a desert site with low and sparse plant cover. A total of 100 sample points (Figure 1), covering the main local landforms, were obtained in this area. The soil salt content (g/kg) was analysed in a laboratory. The composite soil samples were air dried, ground, and sieved using a 2-mm sieve mesh. The soil salt content was determined by the residue drying method. A 5:1 soil:water mixture was extracted by the residue drying method and was used to determine the major salt ions in the soil. Na⁺, K⁺, Ca²⁺, and Mg²⁺ cations were determined by atomic absorption spectrometry (AA-6800, Daojin, Japan), and Cl⁻, CO₃²⁻, HCO₃⁻, and SO₄²⁻ ions were determined by ion chromatography (IC-2000, Diane, America). These measurements were successively performed at the State Key Laboratory of Desert and Oasis Ecology in the Xinjiang Institute of Ecology and Geography at the Chinese Academy of Sciences.

Environmental covariates
In this study, the environmental covariates for soil salinity prediction were selected based on the SCORPAN formula (Mulder, De. Bruin, Schaepman, & Mayrc, 2016). These covariates included bands, a climate factor (land surface temperature), vegetation indices, salinity and soil-related indices, and soil moisture indices. See Table 2 for details.
Covariate selection using an inferior eliminated mechanism (IEM)
Variable reduction has been previously shown to result in slight error reductions (Svetnik et al., 2003) through the removal of potentially irrelevant predictor variables. This process allows the algorithm to progressively increase the accuracy of the prediction by reducing the chance of obtaining outliers since weak learners also produce weak outliers. In this study, all covariates were divided into three groups: Landsat OLI-derived covariate sets, DEM-derived covariate sets and full covariate sets. Variable reduction was tested in all three groups to examine whether a smaller set of predictors (optimal dataset) would lead to improvements in the five algorithms, based on the following procedure, which was mainly adopted (with only small differences) from Svetnik et al. (2003) and Heung, Bulmer, and Schmidt (2014):
(1) The machine learning algorithms were initially applied to the three variable groups; the variable importance, based on the mean decrease in accuracy, was used to rank the predictor variables.
(2) Using the variable rankings, the least important predictors were removed. In the studies of Svetnik et al. (2003) and Heung et al. (2014), the three least important predictors were removed each time. This study considered that approach acceptable in the initial stage of variable selection. However, once the number of variables has been reduced to a threshold, judged from changes in the root mean square error (RMSE) and R², important variables could also be deleted; thus, only one variable at a time was deleted in this study.
(3) The training data were then partitioned into five cross-validation (CV) segments, and the error rates for each of the 5 CV partitions were aggregated into a mean error rate. A total of 10 replicates of the 5-fold CV were performed.
(4) Steps 2 and 3 were repeated until a balance was achieved between the number of predictors and the mean error.
(5) Steps 2 to 4 were applied to all five machine learning algorithms for the three variable groups at each of the three study sites.
(6) Then, the OCG was calculated from all covariate groups for each oasis optimized by Steps 1 through 4.
Figure 2 shows the iterative process and the accuracy trajectories, from the last 40 variables down to the last two variables of the full covariate sets, using SGT at the three oases. According to the trajectories of R² and RMSE, the study concluded that, across all 40 tests, the prediction accuracy of SGT for the specific period examined was highest when the optimal data sets of the three oases contained seven, seven and two variables, respectively.
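The elimination loop in Steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `backward_eliminate`, the callbacks `rank_importance` and `cv_error`, and the numeric weights in the toy example are all hypothetical. In the study, the ranking would come from each algorithm's variable importance and the error from the mean of 10 replicates of 5-fold CV.

```python
def backward_eliminate(variables, rank_importance, cv_error, min_vars=2):
    """Drop the least important variable one at a time; keep the best subset seen.

    rank_importance(subset) -> subset ordered from most to least important
    cv_error(subset)        -> mean cross-validated error for that subset
    Both callbacks are placeholders for a real model and a real CV routine.
    """
    current = list(variables)
    best_subset, best_error = list(current), cv_error(current)
    trajectory = [(len(current), best_error)]
    while len(current) > min_vars:
        ranked = rank_importance(current)
        current = ranked[:-1]              # remove only the weakest variable
        error = cv_error(current)
        trajectory.append((len(current), error))
        if error <= best_error:            # on ties, prefer the smaller subset
            best_subset, best_error = list(current), error
    return best_subset, best_error, trajectory


# Toy example with made-up importance weights (not values from the study).
weights = {"EEVI": 3.0, "CSRI": 2.0, "EVI2": 1.0, "B1": 0.0, "B2": 0.0}

def toy_rank(subset):
    return sorted(subset, key=lambda v: -weights[v])

def toy_cv_error(subset):
    # Informative variables lower the error; every extra variable adds noise.
    return 10.0 - sum(weights[v] for v in subset) + 0.2 * len(subset)

subset, error, trajectory = backward_eliminate(list(weights), toy_rank, toy_cv_error)
print(subset, round(error, 2))
```

The returned trajectory plays the role of the curves in Figure 2: the error is recorded at every subset size, and the subset with the lowest error is retained.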

Machine learning approaches and parameter initialization
Four machine learning approaches were performed using the following R packages: "glmnet" for LASSO (Friedman, Hastie, & Tibshirani, 2010), "caret" for CART (Kuhn, Leeuw, & Zeileis, 2008), "randomForest" for RF (Liaw & Wiener, 2002), and "gbm" for SGT (Ridgeway, 2015). MARS was run in Matlab 2014a. Tibshirani (1996) developed LASSO, a penalized likelihood approach, for linear regression. LASSO is a combination of ridge regression and subset selection developed to improve the ordinary least squares (OLS) technique by shrinking the coefficient values and setting several values equal to zero. As a result, LASSO simultaneously achieves variable selection and regression modelling (Yan & Yao, 2015). A great advantage of LASSO is that it produces simpler models than ridge regression. In LASSO, three parameters need to be set: the type of loss function in the regression (least squares), the number of points (200) and the number of steps (5000); the default values were adopted.
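The difference between the two penalties can be illustrated with a short sketch. It assumes the textbook special case of orthonormal predictors, in which the lasso estimate is a soft-thresholded version of the OLS coefficient and the ridge estimate is a proportionally shrunken one; the exact scaling of the penalty value depends on how the objective is written. The function names and the coefficient values are hypothetical, and this is not how "glmnet" computes its solution path.

```python
def soft_threshold(beta_ols, lam):
    """Lasso estimate of one coefficient, assuming orthonormal predictors."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0                      # small coefficients are set exactly to zero
    return magnitude if beta_ols > 0 else -magnitude


def ridge_shrink(beta_ols, lam):
    """Ridge estimate of one coefficient, assuming orthonormal predictors."""
    return beta_ols / (1.0 + lam)       # shrunk towards zero but never exactly zero


ols_coefficients = [2.0, -0.3, 0.8, 0.05]   # illustrative values only
penalty = 0.5
lasso = [soft_threshold(b, penalty) for b in ols_coefficients]
ridge = [ridge_shrink(b, penalty) for b in ols_coefficients]
print("lasso:", lasso)
print("ridge:", ridge)
```

In this toy case the lasso drops two of the four coefficients, which is the variable-selection behaviour described above, whereas ridge keeps all four with reduced magnitude.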
MARS is a relatively new approach and is typically known as a nonparametric method that estimates complex nonlinear relationships among independent and dependent variables (Friedman, 1991). For soil salinity prediction, this algorithm has so far been applied only to visible and near-infrared reflectance spectroscopy data (Nawar et al., 2014). The MARS approach was executed using MATLAB software. During the process, input variables were divided into intervals (subsets), and basis functions were fitted to each interval. Each basis function represents information about an independent variable over a specific range, and the initial and last points of that range are called knots. A knot represents a point where the function behaviour changes. Therefore, the knot and basis function parameters play an important role in obtaining optimum results in MARS. More detailed information about MARS can be found in Cheng and (2014). Ultimately, the initial value of the knot was set to 3, and the maximum number of basis functions was set to 15.
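The role of knots and basis functions can be made concrete with a small sketch. It assumes the standard piecewise-linear hinge form of MARS basis functions, max(0, x − t) and max(0, t − x) for a knot t; the knot position, coefficients and intercept below are invented for illustration and are not fitted values from this study.

```python
def hinge_pair(x, knot):
    """Return the two mirrored hinge basis functions at a knot."""
    return max(0.0, x - knot), max(0.0, knot - x)


def mars_predict(x, intercept, terms):
    """Evaluate a one-variable additive MARS-style model.

    terms: list of (coefficient, knot, direction), where direction is
    +1 for max(0, x - knot) and -1 for max(0, knot - x).
    """
    total = intercept
    for coefficient, knot, direction in terms:
        total += coefficient * max(0.0, direction * (x - knot))
    return total


# Hypothetical model: the slope changes at the knot x = 0.3.
terms = [(2.0, 0.3, +1), (-1.0, 0.3, -1)]
for x in (0.1, 0.3, 0.5):
    print(x, hinge_pair(x, 0.3), mars_predict(x, 5.0, terms))
```

Each hinge is zero on one side of its knot, so the fitted response is linear within each interval and changes slope only at the knots.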
CART models are very attractive due to the interpretability of node splits (i.e., rules), the avoidance of parametric assumptions (i.e., distribution and independent residuals) and their ability to handle noisy data. The regression trees were trained on the training datasets using the common least-squares criterion, and the maximum depth was set to 5.
The RF learner is conceptually similar to tree-based learners and shares the same advantages; however, multiple decision trees are trained, and the results are based on predictions from an ensemble of individual trees (Breiman, 2001). More detailed information can be found in Heung et al. (2016). To construct the relationships for RF, only two parameters are defined by the user before running the RF algorithm: the minimum size of the terminal nodes and the number of variables randomly selected for each node, which are defined by m (commonly defined as the square root (sqrt) of the number of independent variables). We set this to 2*sqrt after repeated calculations and comparisons. Furthermore, the number of proximal cases and bootstrap sample sizes were changed to AUTO in the options. The default value of ntree (200) was demonstrated to be insufficient for yielding stable results (Grimm et al., 2008) in the three oases after testing. Therefore, we set the ntree value to 1000 for all tests with the RF model.
Overall, SGT (a tree-based method) was an improvement over CART and generally resulted in accurate and robust predictions (with low overfitting effects) of spatial heterogeneity and outliers (Brenning, 2005;Friedman, 2002). This powerful algorithm allows for calculations without a priori assumptions regarding the variables, which might influence the predicted values; this provides more flexibility than traditional generalized linear or additive models (Friedman, 2002). So far, the SGT has been implemented mainly for quantifying the spatial distribution of plants and animals (Mullet, Gage, Morton, & Huettmann, 2016), water erosion susceptibility (Angileri et al., 2016), topsoil carbon stocks (Schillaci et al., 2017) and oil pollution (Fox et al., 2016). However, little information is available concerning soil salinity modelling using the SGT in dryland areas. The SGT fits a simple parameterization function (i.e., base learner) for pseudo-residuals by the least squares method (after comparing multiple loss functions, the least squares was finally chosen) using sequential iterations to construct additive regression models. To avoid overfitting, a subsample fraction was set to 0.75 (Angileri et al., 2016). The number of trees was set to 1000, and the maximum number of nodes per tree was set to 6 (Schillaci et al., 2017).
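The fitting procedure described above (least-squares pseudo-residuals, sequential additive trees, a 0.75 subsample fraction) can be sketched in plain Python. This is a simplified illustration rather than the "gbm" implementation: it uses one predictor, single-split stumps instead of trees with up to six nodes, 200 iterations instead of 1000, and a learning rate of 0.1 that is an assumption, since the shrinkage value is not reported here. All function names and the toy data are hypothetical.

```python
import random


def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single predictor."""
    best = None
    order = sorted(set(xs))
    for left, right in zip(order, order[1:]):
        split = (left + right) / 2.0
        lo = [r for x, r in zip(xs, residuals) if x <= split]
        hi = [r for x, r in zip(xs, residuals) if x > split]
        mean_lo, mean_hi = sum(lo) / len(lo), sum(hi) / len(hi)
        sse = sum((r - mean_lo) ** 2 for r in lo) + sum((r - mean_hi) ** 2 for r in hi)
        if best is None or sse < best[0]:
            best = (sse, split, mean_lo, mean_hi)
    if best is None:                      # all x identical: constant stump
        mean_all = sum(residuals) / len(residuals)
        return float("inf"), mean_all, mean_all
    return best[1:]


def fit_sgt(xs, ys, n_trees=200, learning_rate=0.1, subsample=0.75, seed=0):
    """Stochastic gradient boosting with squared-error loss."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)              # initial prediction: the mean
    preds = [base] * len(ys)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), k)           # random subsample each round
        sub_x = [xs[i] for i in idx]
        sub_r = [ys[i] - preds[i] for i in idx]       # pseudo-residuals
        split, lo, hi = fit_stump(sub_x, sub_r)
        stumps.append((split, lo, hi))
        preds = [p + learning_rate * (lo if x <= split else hi)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_sgt(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lo if x <= split else hi
                                      for split, lo, hi in stumps)


# Toy data: a step in the response at x = 1.0 (values are illustrative).
xs = [i / 10 for i in range(20)]
ys = [1.0 if x < 1.0 else 5.0 for x in xs]
model = fit_sgt(xs, ys)
print(round(predict_sgt(model, 0.2), 2), round(predict_sgt(model, 1.5), 2))
```

Because every stump is fitted to a different random 75% of the data and contributes only a small step, no single subsample dominates the final model, which is the mechanism credited above with limiting overfitting.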

Model validation
For each algorithm, a 5-fold cross validation (CV) was used to generate optimized parameters (Heung et al., 2014). The advantage of this method is that it exhibits reliable performance and is unbiased for smaller data sets, although the process requires much more computational effort than simple train-and-test (i.e., hold-out) procedures (Taghizadeh-Mehrjardi, Nabiollahi, & Kerry, 2016; Zhao, Popescu, & Meng, 2011). The training dataset was randomly partitioned into five subsets; in each round, four of the five subsets (80% of the observations) were used for model training, and the fifth subset (20% of the observations) was used for model validation, so that each subset was used for validation once. Based on a study of machine learning presented by Heung et al. (2016), this process was repeated 10 times for all five algorithms (LASSO, MARS, CART, RF and SGT) at each of the three study areas. Three validation measurements were used to quantify the model performance of the simulations: coefficient of determination (R²), root mean square error (RMSE) and relative root mean square error (RRMSE). The predictions were considered increasingly optimal as the RMSE and RRMSE values decreased and as the R² value increased.
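The validation scheme and the three measures can be sketched as follows. The formulas are assumptions, because the paper does not state them: R² is taken here as 1 − SSE/SST (some studies use the squared correlation instead), and RRMSE as RMSE divided by the mean of the observations. The function names are hypothetical.

```python
import math
import random


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def rrmse(obs, pred):
    # Assumed definition: RMSE relative to the mean observed value.
    return rmse(obs, pred) / (sum(obs) / len(obs))


def r_squared(obs, pred):
    # Assumed definition: 1 - SSE/SST.
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` rounds of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train, test


# 101 samples, as at the Qitai Oasis: 10 repeats x 5 folds = 50 splits.
splits = list(repeated_kfold(101))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In use, a model would be fitted on each `train` index set, scored on the matching `test` set with the three functions, and the scores averaged over all 50 splits.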

Descriptive statistics of soil salinity
Summary statistics of soil salinity for the three oases are presented in Table 3. Field samples included all salinity levels (i.e., non-saline soil (<7 g/kg), low-salinity soil (7-9 g/kg), moderate-salinity soil (9-13 g/kg), high-salinity soil (13-16 g/kg) and saline soil (>16 g/kg)) (Soil Survey Staff of Xinjiang, 1996) in all three oases. A coefficient of variation (CoV) equal to 0.66 indicated moderate variability in the soil salt content in the surface soil of the Qitai Oasis (a CoV lower than 0.1 indicates low variability, whereas a CoV higher than 1.0 indicates great variability). In the Kuqa Oasis, approximately 50% of the samples were from non-salinized land due to the relatively large agricultural area in this oasis, and 37.57% of the samples were from extremely salt-affected land. This sampling scheme was in accordance with the local land use/cover conditions. The CoV equalled 1.23, indicating great variability in the soil salt content in the Kuqa Oasis. This result is consistent with that of Gao, Ding, Ha, and Zhang (2010) for the Kuqa Oasis. In the Yutian Oasis, 52 of the 100 samples belonged to the category of extreme salinization. This result was similar to that of Nurmemet et al. (2015), who found a total area of 79,763.8 ha of salinized soil within the study area (41.43%), indicating that soil salinity had already become one of the major threats to local agriculture and local communities.
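The salinity classes and the CoV thresholds quoted above translate directly into code. In this sketch the handling of values that fall exactly on a class boundary (7, 9, 13 g/kg) is an assumption, as is the use of the sample standard deviation for the CoV; the function names and sample values are hypothetical.

```python
import statistics


def salinity_class(salt_g_per_kg):
    """Classify soil salt content (g/kg) using the thresholds cited in the text."""
    if salt_g_per_kg < 7:
        return "non-saline"
    if salt_g_per_kg < 9:       # boundary handling is an assumption
        return "low-salinity"
    if salt_g_per_kg < 13:
        return "moderate-salinity"
    if salt_g_per_kg <= 16:
        return "high-salinity"
    return "saline"


def coefficient_of_variation(values):
    # Sample standard deviation divided by the mean (assumed definition).
    return statistics.stdev(values) / statistics.mean(values)


def variability_label(cov):
    if cov < 0.1:
        return "low"
    if cov > 1.0:
        return "great"
    return "moderate"


samples = [3.2, 8.1, 12.5, 15.0, 40.7]      # illustrative values only
print([salinity_class(s) for s in samples])
print(variability_label(coefficient_of_variation(samples)))
```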

Comparison of prediction accuracy of machine learning
Tables 4-7 show the predicted results of the five algorithms for the 21 datasets of the three oases. Because there are too many datasets to discuss one by one, the study uses the average values of R², RMSE and RRMSE to express the overall performance of each algorithm. The aforementioned results show that SGT has the highest prediction accuracy, followed by RF, and the difference between the two is small. CART and MARS have similar prediction accuracy. The prediction accuracy of LASSO is the worst among all of the algorithms.
SGT and RF show stronger information mining capabilities (compared to those of MARS, CART, and LASSO) when the environmental datasets used for modelling are more complex. As can be seen from Tables 4 to 7, the Landsat-based indices (dataset 1) carry more information than the digital elevation model (DEM) derivatives (dataset 2). The former has an average R² of 0.36 (0.19-0.58), and the latter has an average R² of 0.23 (0.07-0.39). After the aforementioned two datasets are simultaneously input (dataset 3 with whole environmental covariates) into the algorithm, the average prediction accuracy of the soil salinity increases (Tables 5-7). The maximum value of R² is 0.63, the minimum value is 0.19, and the average value is 0.41. In addition, the dates of the Landsat images and soil sampling are not consistent, which might also reduce the accuracy.
The study also found that no single algorithm outperformed all of the others on every one of the 21 datasets. In terms of R², the percentages of the 21 datasets in which CART, LASSO, MARS, RF, and SGT achieved a higher prediction accuracy than one other algorithm are 23.81%, 33.33%, 42.80%, 0% and 9.52%, respectively. The percentages in which they outperformed two other algorithms are 28.57%, 14.28%, 14.28%, 28.57% and 0%, respectively. The percentages in which they outperformed three other algorithms are 0%, 14.28%, 19.04%, 42.85% and 19.04%, respectively. The percentages in which they outperformed all four other algorithms are 0%, 0%, 4.76%, 19.04% and 60.90%, respectively.
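One way to obtain such percentages is sketched below. It assumes the categories are mutually exclusive, that is, each dataset is counted once per algorithm according to the exact number of other algorithms it outperforms on R²; this reading is an interpretation, and the function name and R² values are hypothetical.

```python
def beat_count_percentages(scores_by_dataset):
    """For each algorithm, the percentage of datasets in which it has a
    higher score than exactly 0, 1, 2, ... other algorithms.

    scores_by_dataset: list of dicts mapping algorithm name -> R² value.
    """
    algorithms = list(scores_by_dataset[0])
    n_datasets = len(scores_by_dataset)
    counts = {a: [0] * len(algorithms) for a in algorithms}
    for scores in scores_by_dataset:
        for a in algorithms:
            beaten = sum(scores[a] > scores[b] for b in algorithms if b != a)
            counts[a][beaten] += 1
    return {a: [100.0 * c / n_datasets for c in row] for a, row in counts.items()}


# Two made-up datasets (not values from Tables 4-7).
toy_scores = [
    {"CART": 0.30, "LASSO": 0.20, "MARS": 0.40, "RF": 0.50, "SGT": 0.60},
    {"CART": 0.40, "LASSO": 0.20, "MARS": 0.30, "RF": 0.60, "SGT": 0.50},
]
percentages = beat_count_percentages(toy_scores)
for name, row in percentages.items():
    print(name, row)
```

Each row lists, from left to right, the share of datasets in which the algorithm outperformed zero, one, two, three and four of the others.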

Stability of machine learning
The accuracy ranks of the five algorithms were used to investigate the stability of their performance in the three oases (Figure 3). The research evaluated the stability of the five algorithms in the three oases from the following aspects: R², RMSE and RRMSE. Tables 4-7 show that each oasis has seven datasets (three Landsat-based index datasets, three whole environmental covariate datasets, and one DEM-derived dataset). The R², RMSE and RRMSE values of the aforementioned seven datasets were standardized by Z-score, one by one, in each oasis:
$$Z = \frac{O - \bar{O}}{\sigma}$$
where $O$ is a validation measurement in this paper, $\bar{O}$ is the mean value, and $\sigma$ is the standard deviation. The number of standardized values involved in each dataset was five. Thus, in each dataset, each algorithm has a $Z_{R^2}$ value calculated from the five R² values, a $Z_{RMSE}$ value calculated from the five RMSE values and a $Z_{RRMSE}$ value calculated from the five RRMSE values. Taking the Qitai Oasis as an example, the $Z_{R^2}$ and $Z_{RMSE}$ formulas of CART are as follows:
$$Z_{R^2}^{CART} = \sum_{n=1}^{7} Z_{R^2}^{CART}(S_n), \qquad Z_{RMSE}^{CART} = \sum_{n=1}^{7} Z_{RMSE}^{CART}(S_n)$$
where $S$ represents the dataset, $n$ represents the nth dataset, and the maximum value of $n$ is 7. The remaining algorithms use the same method to calculate their $Z_{R^2}$, $Z_{RMSE}$ and $Z_{RRMSE}$ values. The aforementioned process was repeated for the Kuqa Oasis and the Yutian Oasis, and the $Z_{R^2}$, $Z_{RMSE}$ and $Z_{RRMSE}$ values of the five algorithms for the three oases were obtained. The larger the $Z_{R^2}$ value and the smaller the $Z_{RMSE}$ and $Z_{RRMSE}$ values, the better the performance of the algorithm. As can be seen from Figure 3, the $Z_{R^2}$, $Z_{RMSE}$ and $Z_{RRMSE}$ values of SGT rank first for the Kuqa and Yutian oases, and second for the Qitai Oasis. This indicates that SGT performed best in terms of both prediction accuracy and stability. RF's performance in the three oases was inferior only to that of SGT, followed by MARS and CART. The comprehensive ranking of LASSO was the worst among all of the algorithms.
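The standardization and summation can be sketched as follows, assuming the Z-score is computed across the five algorithms within each dataset and then summed over the datasets of one oasis, as the Figure 3 caption describes. Whether the population or the sample standard deviation was used is not stated; the population form is assumed here. The function names and R² values are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize a list of values: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)      # population SD (assumption)
    return [(v - mean) / sd for v in values]


def summed_z(metric_by_dataset):
    """Sum each algorithm's Z-score over all datasets of one oasis.

    metric_by_dataset: list of dicts mapping algorithm name -> metric value.
    """
    totals = {}
    for scores in metric_by_dataset:
        names = list(scores)
        for name, z in zip(names, z_scores([scores[n] for n in names])):
            totals[name] = totals.get(name, 0.0) + z
    return totals


# Two made-up datasets of R² values (an oasis in the study has seven).
r2_by_dataset = [
    {"LASSO": 0.20, "MARS": 0.30, "CART": 0.28, "RF": 0.45, "SGT": 0.50},
    {"LASSO": 0.15, "MARS": 0.35, "CART": 0.30, "RF": 0.50, "SGT": 0.55},
]
z_r2 = summed_z(r2_by_dataset)
ranking = sorted(z_r2, key=z_r2.get, reverse=True)
print(ranking)
```

For R² a larger summed value ranks higher; for RMSE and RRMSE the same functions apply, with the ranking order reversed.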

Mapping of soil salinity in three oases
The prediction results of the five algorithms in the three oases are shown in Figures 4-6. The southern part of the Qitai Oasis is mainly farmland, and the soil salinity is low. With the decrease in altitude from south to north, the groundwater level gradually rises, and soil salinization is common (Zhang et al., 2011a). Saline soil in this area mainly occurs in the oasis-desert ecotone and is dot-shaped in the irrigated farmland. In the northern semi-fixed dune area, the groundwater level declines and the soil salinization level is lower than that of the oasis-desert ecotone, but still higher than that of the farmland. Compared against these findings, SGT's prediction results are the most consistent with the actual situation, followed by those of RF. The spatial distribution pattern of the soil salinity is not clearly shown by CART's prediction results. Although the results of LASSO can reflect the distribution pattern of soil salinity, they differ considerably from the actual survey in two respects: the range of soil salinity, and the salinity content of the saline-alkali soil in the oasis, which is predicted at the same level as that of the semi-fixed dunes. The results of MARS show that the soil salinity tends to the same level in the desert area and in farmland, which is not consistent with reality.
The soil salinization of the Kuqa Oasis is mainly distributed in the oasis-desert ecotone outside the irrigation area (Ding & Yu, 2014). The results of SGT and RF are more consistent with our understanding of the distribution of saline soil in this area. The prediction results of CART obviously show binarization; at the same time, there is no textural information in areas with a similar range of soil salt content, and the predicted severity of the soil salinity in the Kuqa Oasis is not accurate. Extreme outliers are found in the prediction results of LASSO and MARS, but the distribution pattern of soil salinity can be distinguished.
In the Yutian Oasis, the saline soil mainly occurs in bare land around the irrigation area (the middle of the study area) and the buffer zone on both sides of the river (the east side of the study area) (Hu, Tashpolat, Yu, & Zhang, 2017). The prediction results of the five algorithms are consistent with the survey only in terms of pattern distribution. In terms of range and textural information, the prediction results of SGT and RF are closer to the actual situation. However, negative values and outliers were found in the prediction results of CART, LASSO and MARS. In addition, the range of the prediction results of these three algorithms fluctuates greatly over time. In summary, combined with the results of R², RMSE and RRMSE, SGT is considered the preferred algorithm for soil salinity prediction in arid areas, followed by RF.
Figure 3. Accuracy ranks of the five algorithms used to investigate the stability of their performance in three oases. The R² and RMSE values of the seven datasets are standardized by Z-score one-by-one for each oasis. The number of standardized values involved in each dataset was five. Then, each algorithm has a Z_R² value calculated based on five R² values and a Z_RMSE value calculated based on five RMSE values in each dataset. Finally, seven Z_R² values or seven Z_RMSE values were added to represent the performance of each algorithm in each oasis.
Figure 7 shows the environmental variables with a frequency of occurrence greater than 1, counting all 21 optimal datasets from the three oases. After iteration with the five different algorithms, a small number of variables appear in all five optimal datasets at the same time, but their importance differs. Within the same area, the variables of the optimal dataset change between seasons. Tables 4-7 contain a large amount of data, and patterns of change are not easy to see. Therefore, the frequency of occurrence of the variables and their importance (%) in the various datasets were used to determine which variables contributed more to the prediction accuracy of soil salinity in the arid areas. The specific approach was to sum the frequency of occurrence of each variable to characterize its importance in the prediction of soil salinity in an arid oasis. In addition, the range of relative importance (%) of each variable (frequency > 1) was calculated to better understand its contribution to the modelling process. Although these contributions originated from different algorithms and datasets, this study assumed that a variable has the potential to identify saline soils as long as it appears in the optimal datasets. Figure 7 shows that the top five environmental variables in the Qitai Oasis were ENDVI (nine times), EEVI (five times), CSRI (four times), B1 (four times) and EVI2 (two times). The importance values (%) of ENDVI and CSRI are higher than those of EEVI and B1. In the Kuqa Oasis, EVI2 (six times), GDVI (six times), NDII (six times), SAIO (six times) and EEVI (four times) were the top five environmental variables. The overall contribution of EVI and GDVI is relatively high and relatively stable among the various datasets. The results for the Yutian Oasis show that EEVI, CSRI, EVI2, ENDVI and SAIO play an important role in the establishment of soil salinity prediction models. The relative contribution of EEVI and CSRI in multiple optimal datasets is 100%. When all the variables (frequency of occurrence > 1) from the three regions were added together, the order of frequency from high to low (frequency of occurrence > 10 times) was as follows: EEVI, EVI2, ENDVI, CSRI, SAIO and GDVI.
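The frequency-and-importance summary described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name, the data layout (one dictionary of variable importances per optimal dataset) and the importance values are hypothetical and do not reproduce the contents of Tables 4-7 or Figure 7.

```python
from collections import defaultdict


def summarize_variables(optimal_datasets, min_frequency=2):
    """Count how often each variable occurs and report its importance range.

    optimal_datasets: list of dicts {variable name: relative importance in %}.
    Only variables occurring at least min_frequency times are kept
    (min_frequency=2 corresponds to "frequency > 1").
    """
    importances = defaultdict(list)
    for dataset in optimal_datasets:
        for variable, importance in dataset.items():
            importances[variable].append(importance)

    summary = {
        variable: {
            "frequency": len(values),
            "min_importance": min(values),
            "max_importance": max(values),
        }
        for variable, values in importances.items()
        if len(values) >= min_frequency
    }
    # Sort from the most to the least frequent variable.
    return dict(
        sorted(summary.items(), key=lambda kv: kv[1]["frequency"], reverse=True)
    )


# Hypothetical optimal datasets (illustrative values only).
toy_datasets = [
    {"EEVI": 100.0, "CSRI": 62.5},
    {"EEVI": 80.0, "B1": 10.0},
    {"EEVI": 45.0, "CSRI": 100.0, "EVI2": 30.0},
]

summary = summarize_variables(toy_datasets)
for variable, stats in summary.items():
    print(variable, stats)
```

Reporting the minimum and maximum importance alongside the count makes it visible when a frequently selected variable contributes little in some datasets.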

Performance of machine learning
From the aforementioned results (Tables 4-7), and referring to the average R², RMSE and RRMSE values, the performances of SGT and RF in the 21 datasets are clearly better than those of CART, MARS and LASSO. Comparing the prediction results of datasets 3 and 1, we found that the prediction accuracy (R²) of RF and SGT increased by 32.94% and 39.03%, respectively, on average. The modelling accuracy of CART, LASSO and MARS changed by −0.19%, 9.08% and −1.38%, respectively. The reason is that dataset 3 contains relatively higher information dimensions, including vegetation type and vitality, surface reflectance, surface texture (at different scales), terrain variables that indirectly represent hydrological changes, parent materials and so on. Furthermore, the results also show that CART, LASSO and MARS are clearly inadequate in mining useful information from the complex variable datasets in this study; that is, their data utilization rate is lower than that of both RF and SGT. Across all 21 datasets, SGT and RF outperform three or four of the other algorithms far more often than the remaining three algorithms do. The reasons for the aforementioned performance differences may be as follows. The relationships between soil property variations and the underlying environmental variables can be very complex, and an assumption of linearity is often difficult to meet. RF and SGT have more power to model highly nonlinear and high-dimensional relationships than CART, LASSO and MARS. For RF and SGT, as Breiman (2001) stated, weaker standalone models tend to be more effective when combined. By aggregating multiple models, the instability of a single-tree model is minimized, which leads to an improvement in consistency (Breiman, 1996).
For example, although DEM-derived variables have relatively weak explanatory power for the spatial variability of soil salinity (average R² value = 0.23), the dataset still contains a certain amount of information (Table 4). When ensemble-learning methods were introduced, as in RF and SGT, consistency increased drastically. Heung et al. (2016) stated that when the relationship between environmental variables and soil properties is more complex, the introduction of a random variable selection technique is also an effective means of improving the consistency between the predicted results and the measured data. In contrast, CART uses only a single tree to learn the complex relationship between the spatial variability of soil salinity and a large number of environmental variables. Its lower R² values and higher RMSE and RRMSE values imply that CART is incapable of addressing such complex relationships. Strobl and Augustin (2009) stated that CART is known to be very unstable; small changes in the learning sample can produce completely different trees. MARS essentially builds flexible models by fitting piecewise linear regressions. That is, the nonlinearity of the model is approximated through the use of separate regression slopes over distinct intervals in the predictor-variable space. However, this study found that, in most cases, the optimal datasets produced by the MARS iteration contain only one variable. We believe that, even with a nonlinear modelling approach, it is difficult to explain the spatial variability of soil salinity using a low information dimension or sparse variables. LASSO is ultimately a regression procedure that builds a linear model. It cannot, on its own, discover nonlinearities or interactions. This explains why this algorithm, on the whole, is far inferior to SGT and RF.
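To make the piecewise-linear idea behind MARS more concrete, the short Python sketch below builds a one-variable model from a mirrored pair of hinge functions, which give a different regression slope on each side of a knot. The knot position and coefficients are hypothetical values chosen for illustration; they are not fitted to any data from this study.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot) if direction is +1,
    max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def mars_like(x, knot=2.0, intercept=1.0, right_slope=0.5, left_coef=-0.8):
    """Toy piecewise-linear model made of a mirrored pair of hinges.

    All parameter values are hypothetical and only illustrate the model form.
    """
    return (
        intercept
        + right_slope * hinge(x, knot, +1)
        + left_coef * hinge(x, knot, -1)
    )


for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(x, mars_like(x))
```

With a single predictor, such a model can only bend along that one variable, which is consistent with the observation above that one-variable MARS models struggle to explain the spatial variability of soil salinity.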
Compared to RF, the modelling method of SGT is more suitable for spatial prediction of soil salinity in arid regions. Of the 21 datasets from the three regions (Tables 4-7), 19.04% showed RF with a higher accuracy than that of CART, LASSO, MARS, and SGT. However, 61.90% of the datasets showed that the prediction accuracy of SGT was higher than that of the remaining four algorithms, and the performance improved by 225.10%. In addition, although the Z_R² values of RF and SGT for the Qitai Oasis were similar, the Z_R² values of SGT for the Kuqa and Yutian oases were much higher than those of RF (Figure 3). The same holds for the Z_RMSE and Z_RRMSE rankings in the Kuqa and Yutian oases. These results indicate that SGT is more suitable for spatial prediction of soil salinity in arid areas from the perspective of both the accuracy and the stability of the model. Until now, only Vermeulen and Van Niekerk (2017) have used different combinations of geomorphometric covariates for predicting soil salinity with the aid of RF. RF achieved a kappa of 0.28 for Vaalharts and a kappa of 0.5 for the Breede River. In our study, using only topographically derived variables, RF predicted soil salinity in the three oases with R² values of 0.30, 0.11, and 0.30, respectively. The difference between the two studies is that the former used RF for classification, whereas the latter used it for quantitative prediction. Other algorithms have not been covered in soil salinity prediction research. Therefore, our research results are not easy to compare with the results obtained by other authors.
There are two reasons. First, in reviewing the existing research on soil salinization modelling, we noted that the results were diverse because of differences in sampling depth, selection of variables, number of observations, prediction techniques, prediction accuracy, verification methods (a linear fit with no validation, a training set/verification set with certain proportions, and a spatial leave-one-field-out cross-validation have been used), and geographical environments (plains, arid lands, coastlands, inland locations and river valleys). Second, no comparative studies of RF and SGT in soil salinity prediction were found. Therefore, we quote previous research results in other fields to illustrate the reliability of this study. Naghibi and Pourghasemi (2015) used SGT, CART, and RF to study the potential distribution of groundwater springs in Afghanistan. The results showed that SGT had the highest prediction accuracy, followed by CART and RF. Youssef et al. (2016) assessed landslide hazard in Saudi Arabia based on generalized linear models (GLM), CART, SGT, and RF. In terms of the AUC (area under the curve), for which a greater value means higher precision, SGT had the highest value of 0.958, followed by GLM at 0.821, CART at 0.816, and finally RF at 0.783. Yang et al. (2016) compared the spatial distribution of soil organic matter predicted using SGT and RF in the high vegetation coverage area of the northeastern Qinghai-Tibet Plateau. The results showed that the prediction accuracy of SGT was slightly higher than that of RF.
On the whole, MARS shows better predictive accuracy than CART and LASSO. Referring to the R², RMSE and RRMSE values of the 21 datasets (Tables 4-7 and Figure 3), the CART and MARS predictions are similar in terms of accuracy. Comparing the distribution characteristics of the R² values of the 21 datasets, the probability that the prediction accuracy of CART and MARS is higher than that of three and four of the other algorithms is 0% and 14.28% and 0% and 4.76%, respectively. The results show that MARS performs better than CART. In terms of the Z_R² ranking, the total scores of MARS, CART, and LASSO in the three regions are −3.44, −9.10, and −11.96, respectively. The corresponding Z_RMSE totals are 5.34, 8.44, and 10.57, and the corresponding Z_RRMSE totals are 6.24, 6.52, and 8.31. These results imply that the overall ability of MARS is relatively strong (it has higher prediction accuracy and better stability), followed by CART, with LASSO the worst. Our literature search found no studies in a related field that compared MARS, CART, and LASSO at the same time. Here, we quote the results of previous studies in different fields to illustrate the credibility of the aforementioned analysis. Gretchen and Tracey (2002) compared five modelling techniques for predicting forest characteristics in the United States, and obtained better results using MARS than when using GAM, ANN, simple linear models (LM) and CART. Álvaro, Schnabel, and Contador (2009) showed better performance for MARS in predicting gullying, with areas under the ROC curve of 0.98 and 0.97 for the validation datasets, while CART presented values of 0.96 and 0.66. The results of Gregory, Jamieson, Bezanson, and Hansena (2013) indicated that the MARS models outperformed LASSO for predicting E. coli particle attachment and virulence marker occurrence.

Excellent indices of soil salinity modelling in arid area
Several important soil salinity-sensitive variables were found by comparing the frequency of occurrence in all oases (Figure 7). The sensitivity of these variables to soil salinity in each oasis shows the following characteristics. The indices able to identify soil salinization with a frequency greater than 1 were as follows: CSRI, ENDVI, EEVI, B1, EVI2, SAIO, SIT, EVI, NDVI and B6. The frequency of occurrence of CSRI, EEVI, EVI2, GDVI, SAIO and SIT is at least 2 in each oasis. Only EEVI appears at least four times in each oasis: five times/four times/eight times. Most of the aforementioned 10 variables show a certain degree of environmental bias with respect to soil salinity; i.e. their frequency is higher in only one or two of the oases. For example, the frequencies of occurrence of ENDVI, GDVI, EVI2, CSRI and B1 in the Qitai, Kuqa and Yutian oases were nine times/one time/five times, two times/six times/two times, two times/six times/eight times, four times/two times/eight times and four times/four times/one time, respectively. We also found that the variables with the highest frequency rank generally maintained a relatively high contribution in each preferred dataset, although this was not always the case. In the frequency rankings of the three oases, the distribution of the contribution values of each variable shows a certain fluctuation. This also indicates that the relationship between soil salinity and the aforementioned variables changes when the geographical environment changes. Therefore, although vegetation and salinity spectral indices have shown satisfactory results in monitoring salinity throughout the world, there is notably no universal spectral index that gives a satisfactory result under different environmental conditions (Allbed et al., 2014).
In summary, we suggest that the following variables be considered in soil salinity mapping and in sampling design for unknown fields in arid areas: EEVI, EVI2, ENDVI, CSRI, GDVI, SAIO and SIT.
The aforementioned indices work well in the identification of saline soils in arid areas, and we speculate that this may be related to the complexity of their formulas, their adjustment factors, and the number of bands involved in constructing each environmental index model. Examples include CSRI, EEVI, ENDVI and SAIO. Some formula constructs are more complex still, such as that of EVI2. Such indices can carry more information than a simple environmental index model such as the normalized difference vegetation index (NDVI). Adrianv and Gaiusr (2009) indicated that EVI2 has several advantages over NDVI, including the ability to resolve leaf area index differences for vegetation with different background soil reflectance. The spectral reflectance of a surface results from a mixture of green vegetation and soil "background" reflectance. Vegetation cover in arid areas is relatively sparse. The larger the bare soil area, the greater the influence of the soil background on vegetation information extraction. Although EVI2 uses the same information as NDVI, the additional weight on the red reflectance in the denominator of 2.5*(B5 - B4)/(B5 + 2.4*B4 + 1) allows EVI2 to be less sensitive to soil darkening (Adrianv & Gaiusr, 2009). In areas covered by vegetation, several studies have shown that the vitality of vegetation (which can be indirectly reflected by the vegetation index) can mitigate the extent to which soil is affected by salinization. In areas with high proportions of bare soil, Peng et al. (2018) showed that the reflectance of soil in an oasis of Xinjiang increased with increasing electrical conductivity from the coastal to the SWIR1 band. This was the basis for constructing a soil salinity index on bare land. Among all soil salinity indices, the performances of SAIO and SIT were outstanding in the study area.
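The contrast between the two indices can be made concrete with a small Python sketch that evaluates the EVI2 expression quoted above alongside the standard NDVI ratio, (B5 - B4)/(B5 + B4), where B5 and B4 are the Landsat OLI near-infrared and red bands. The function names and the reflectance values are hypothetical and serve only to show the arithmetic.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR (B5) and red (B4)
    surface reflectance. Undefined when nir + red equals zero."""
    return (nir - red) / (nir + red)


def evi2(nir, red):
    """Two-band enhanced vegetation index, as given in the text:
    2.5 * (B5 - B4) / (B5 + 2.4 * B4 + 1)."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)


# Hypothetical reflectance pairs (NIR, red), roughly spanning sparse to
# denser vegetation cover; values are illustrative only.
samples = [(0.25, 0.20), (0.30, 0.15), (0.40, 0.10)]
for nir, red in samples:
    print(f"NIR={nir:.2f} red={red:.2f} "
          f"NDVI={ndvi(nir, red):.3f} EVI2={evi2(nir, red):.3f}")
```

Because the EVI2 denominator includes the constant 1 and the extra weight on the red band, it does not collapse towards zero for dark surfaces in the way the NDVI denominator does.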

Conclusion
This study compared five machine learning techniques for mapping soil salinity with environmental covariates representing topography, climate, soil and vegetation (derived from Landsat OLI and Advanced Spaceborne Thermal Emission and Reflection Radiometer Global Digital Elevation Model V2 data) in three oases. The key findings are summarized as follows:
(1) After a test of 21 datasets from three oases (based on the analysis of the frequency of variables), the following indices are considered to be important indicators for the identification of saline soils in arid areas and for quantitative assessment of soil salinity: CSRI, EEVI, EVI2, GDVI, SAIO and SIT.
(2) We evaluated the performance of the five algorithms from two aspects: 1) the R², RMSE and RRMSE values between the predicted and measured values and 2) the consistency between the pattern of the soil salinity prediction map and the actual survey results. The results show that SGT is the most suitable algorithm for predicting soil salinity in arid areas, followed by RF, with only a small performance gap relative to SGT. Negative values and outliers, binarization, an unreasonable range of values, and instability during multi-period predictions (large fluctuations in the range of soil salinity values) appear to varying degrees in the MARS, CART, and LASSO prediction maps. SGT and RF effectively avoid these phenomena, particularly the former.
Nevertheless, there are several limitations to this manuscript that should be pointed out when building salinity assessment models. Across the multitude of fields that comprise large regions, variations in management, pests, disease, climate and other soil properties can have a far greater influence on soil salinity, thus limiting the utility of remote sensing for salinity assessment.
In addition, crop rotation/fallow practices may also lead to significant changes in spectral reflectance and vegetation indices even though salinity may not change accordingly. For this reason, and based on the achievements of other authors, we suggest a fusion processing approach for salinity mapping, that is, a modelling approach based on multiyear maxima, integrals or means rather than a single-date image, to minimize the aforementioned challenges. Meanwhile, soil type, landform and vegetation type may also need to be considered as covariates for more accurate soil salinity prediction.
In the future, based on the conclusions of this study, we will use SGT to predict soil salinity over multiple periods, and then summarize the spatial variation in soil salinity during the past decades. Finally, this research is expected to provide practical assistance for the rational use of land resources and ecological environment management.