Consequences of spatial structure in soil–geomorphic data on the results of machine learning models

Abstract. In this paper, we examined the degree to which inherent spatial structure in soil properties influences the outcomes of machine learning (ML) approaches to predicting soil spatial variability. We compared the performance of four ML algorithms (support vector machine, artificial neural network, random forest, and random forest for spatial data) against two non-ML algorithms (ordinary least squares regression and spatial filtering regression). None of the ML algorithms consistently produced residuals that were both lower in magnitude and less autocorrelated over space compared with those of the non-ML approaches. We recommend the use of random forest when a soil variable of interest is weakly autocorrelated (Moran's I < 0.1) and spatial filtering regression when it is relatively strongly autocorrelated (Moran's I > 0.4). Overall, this work opens the door to a more consistent selection of model algorithms through the establishment of threshold criteria for spatial autocorrelation of input variables.


Introduction
Over the past two decades, aided by the advance of geographic information systems (GIS), geospatial scientists have increasingly employed machine learning (ML) to predict and map the spatial distribution of soil properties using spatially explicit environmental data (Brungard et al. 2015; Heung et al. 2016; Ma et al. 2019; Khaledian and Miller 2020; Wadoux et al. 2020; Kaya et al. 2023). Such application to soil-geomorphic data, however, can be problematic given that ML models are generally not effective for analysing geospatial data as they do not account for the spatial autocorrelation (SAC) present in either the raw input variables or the residuals (Schratz et al. 2019; Wadoux et al. 2020; Heuvelink and Webster 2022). This critical issue has generally been underrepresented in the geosciences literature because, fundamentally, ML algorithms are not conditioned to follow any statistical assumptions and, practically, the focus of ML is primarily on enhancing prediction accuracy. It has not been until very recently that efforts began to be made to extend existing ML algorithms to incorporate spatial structure, including spatial ensemble learning (Jiang et al. 2017), geographical random forest (RF; Georganos et al. 2021), RF spatial interpolation (Sekulić et al. 2020), RF for spatial data (RFsp; Hengl et al. 2018), convolutional neural networks (Wadoux et al. 2019), and other hybrid approaches (Kanevski et al. 2009; Behrens et al. 2018a; Sergeev et al. 2019; Nikparvar and Thill 2021; van der Westhuizen et al. 2022).
Despite the steady rise in the application of ML to soil-landform data, what is lacking is knowledge of when to use one method over another (whether ML or non-ML). The criterion for choosing one method over another is often simply which method yields significant correlations in the dataset of interest, giving rise to potential concerns of inadvertent p-hacking (Head et al. 2015) as well as difficulty in comparing conclusions between studies or regions that utilize different statistical approaches. These difficulties are likely exacerbated by differences in the underlying spatial structure of the soil-geomorphic properties that serve as input data for these models. In particular, the outcomes of these different studies are commonly assessed by the magnitude of the residuals (e.g. root mean square error, RMSE) and interpreted using various resulting variable importance (VI) metrics. However, spatial structure in the input variables (a condition ubiquitous in soil-geomorphic data) can result in a corresponding autocorrelation in the residuals, leading to modelling inconsistencies, inflations of Type I error rates, erroneous statistical inferences, and bias in both model diagnostics (e.g. coefficient of determination and F-statistic) and partial regression coefficients (Legendre 1993; Anselin 2002; Griffith 2003).
The level of SAC remaining in the residuals continues to be a challenging and important topic in spatial modelling of natural resources (Burrough 2001; McBratney et al. 2003; Lark 2012; Angelini and Heuvelink 2018; Gaspard et al. 2019). Spatial variables such as soil properties and landform attributes derived by GIS-based terrain analysis often possess varying degrees of inherent SAC due to spatially diffusive phenomena, including sediment transport, movement of moisture, and mass flow. Predictions made by ML models can be biased when, because of the SAC in the input data, the residuals are not independent over space (Bel et al. 2005; Ferraciolli et al. 2019; Wadoux 2019; Nikparvar and Thill 2021). For example, a recent study predicting topsoil properties using an artificial neural network (ANN) found that the nonspatial model did not result in accurate prediction estimates due to the presence of SAC in the residuals (Sergeev et al. 2019). Such findings indicate that ML approaches should include a rigorous post hoc examination of the residuals not only in terms of their magnitude, but also of their SAC (Kühn and Dormann 2012; Behrens et al. 2018a; Nussbaum et al. 2018; Sinha et al. 2019). Of course, this is not limited to ML models, as several studies have recently demonstrated; SAC in the input data is strongly and positively correlated with the SAC in residuals produced by ordinary least squares (OLS) (Kim et al. 2016, 2019; Miralha and Kim 2018; Kim 2021). However, to date, it is uncertain whether this correlation can be generalized to ML methods used to analyze soil and landform data.
The predictive importance of variables in ML is conceived and measured differently than in non-ML approaches (Grömping 2009). As such, landform variables found to be important in explaining soil spatial variability when using a non-ML model such as OLS will not necessarily have the same level of importance when applying ML methods, and vice versa. Such uncertainty can undermine the ability to infer key geomorphological processes relevant to the evolution of soil properties operating in the system under consideration. Unfortunately, few comparative soil-landform studies have been conducted to quantify the influence of SAC in the input data on ML model performance and VI across a range of modelling approaches.
Here, we report the outcomes of various modelling approaches (two non-ML and four ML), each applied to a suite of soil and landform data that were collected from four widely divergent geomorphological systems in different regions of the world (Supplementary Table S1). The non-ML techniques are OLS and spatial filtering (SF) regression, whereas the ML algorithms include support vector machine (SVM), ANN, RF, and RFsp. We quantified the magnitude and SAC of the residuals from the six methods using RMSE and Moran's I, respectively, for a total of 39 soil attributes from the four study sites (i.e. response variables; Supplementary Table S2) and examined whether the ML approaches, on average, resulted in less RMSE and residual SAC than their non-ML counterparts. Potential systematic variation in RMSE and residual Moran's I as a function of the degree of SAC in the response variables was also evaluated. Finally, we compared the relative importance of the predictor variables to soil spatial variability across methods by averaging their rank orders yielded from all soil attributes in each study site. Our goal was ultimately to develop and evaluate criteria that could be used to guide the selection of models in future soil-geomorphic studies based on the level of SAC in the input data.

Study sites and data
The soil and landform data analyzed in this paper had previously been acquired in four disparate geographic regions: (1) a mixed temperate forest composed of beech (Fagus spp.), maple (Acer spp.), and oak (Quercus spp.) in southern Kentucky, USA; (2) a temperate beech-dominated forest in the southernmost part of the Czech Republic; (3) an agricultural field in central Kentucky, USA; and (4) a coastal dune in South Korea. Supplementary Table S1 provides a brief introduction to these sites. Detailed explanations of the study sites and data collection are available in Supplementary Information II of this paper.
At each site, different groups of researchers pursued specific objectives for their study, resulting in distinct sampling designs and types of soil properties (Supplementary Table S2). However, the same six terrain attributes were consistently calculated from digital elevation models constructed from the Shuttle Radar Topography Mission images available at each site (Supplementary Table S3); these included elevation (m), slope aspect (°), slope angle (°), upslope area (or flow accumulation; m²), plan curvature (° m⁻¹), and profile curvature (° m⁻¹). These attributes were treated as predictor variables in the models tested in this study. Except for upslope area, all terrain parameters were estimated using the grid-based algorithm developed by Zevenbergen and Thorne (1987). This method employs a nine-term polynomial fit to the interior grid point of a moving 3 × 3 square grid network. We extracted upslope area values using the D8 algorithm of O'Callaghan and Mark (1984), embedded in TauDEM Version 5 of the ArcGIS Toolbox. These digital terrain analyses were conducted using ArcMap 10.1 (Environmental Systems Research Institute, Redlands, CA).
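To make the window-based calculation concrete, the slope-angle component of such a grid-based scheme can be sketched with central differences over a 3 × 3 elevation window. This is a simplified Python illustration, not the full nine-term Zevenbergen-Thorne polynomial (which additionally yields plan and profile curvature) nor the ArcMap implementation used in the study; the cell size and elevations are hypothetical.

```python
import math

def slope_angle_deg(z, L):
    """z: 3 x 3 row-major elevation window (m); L: cell size (m).
    Returns the slope angle of the centre cell in degrees."""
    dz_dx = (z[1][2] - z[1][0]) / (2 * L)   # east minus west neighbour
    dz_dy = (z[0][1] - z[2][1]) / (2 * L)   # north minus south neighbour
    return math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))

# A plane dipping 1 m per 10 m eastward: slope of about 5.71 degrees
z = [[0.0, 1.0, 2.0],
     [0.0, 1.0, 2.0],
     [0.0, 1.0, 2.0]]
slope = slope_angle_deg(z, 10.0)
```

The same moving-window logic, applied with the full polynomial fit, produces the curvature attributes listed above.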

Non-ML approaches
Ordinary least squares estimates unbiased parameters of linear regression models by minimizing the error sum of squares. It rests on three assumptions: normality of errors, homoscedasticity, and independence of errors. OLS obtains the coefficients by multiplying an inverse of the predictor matrix by the vector of the response variable; when the predictor matrix is non-square, which is practically common, the Moore-Penrose inverse is used (Montgomery et al. 2012).
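The mechanics can be sketched in a few lines. This is an illustrative Python sketch, not the R implementation used in this study; for a full-rank predictor matrix, solving the normal equations (X'X)b = X'y is equivalent to applying the Moore-Penrose pseudoinverse of X to y.

```python
def ols_fit(X, y):
    """Return coefficients b minimizing ||Xb - y||^2 (X: list of rows)."""
    n, p = len(X), len(X[0])
    # Build the normal-equation system: X'X (p x p) and X'y (p)
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting on the augmented system
    A = [row[:] + [xty[i]] for i, row in enumerate(xtx)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Toy data generated exactly by y = 1 + 2x (intercept column included)
X = [[1.0, x] for x in [0.0, 1.0, 2.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
b = ols_fit(X, y)   # recovers intercept 1 and slope 2
```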
Spatial filtering regression selects statistically significant spatial eigenvectors (spatial filters) from the eigendecomposition of a spatial weight matrix of the input data (Griffith 1996, 2000). The standardized global Moran's I of the residuals (errors) of each candidate model serves as the selection criterion: spatial randomness in the errors is achieved by lowering the residual Moran's I below a predefined threshold (Getis and Griffith 2002; Tiefelsdorf and Griffith 2007). We considered spatial filtering the standard model for maximizing the reduction of SAC in the residuals.
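The selection criterion can be illustrated with a toy example. In this Python sketch the candidate vectors are hardcoded stand-ins for spatial filters; real filters are eigenvectors obtained from the eigendecomposition of a (typically doubly-centered) spatial weight matrix, and selection proceeds until residual Moran's I falls below the threshold.

```python
def morans_i(x, w):
    """Global Moran's I of values x under an N x N weight matrix w."""
    n, xbar = len(x), sum(x) / len(x)
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * (x[i] - xbar) * (x[j] - xbar)
              for i in range(n) for j in range(n))
    return (n / s0) * num / sum((xi - xbar) ** 2 for xi in x)

def filter_out(r, e):
    """Remove an orthonormal candidate filter e from residuals r."""
    c = sum(ri * ei for ri, ei in zip(r, e))
    return [ri - c * ei for ri, ei in zip(r, e)]

# Rook-style weights for four sites on a line: neighbours 0-1, 1-2, 2-3
w = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
r = [2.0, -1.0, 1.0, -2.0]                         # autocorrelated residuals
candidates = [[0.5, -0.5, 0.5, -0.5], [0.5, 0.5, -0.5, -0.5]]
# Greedy step: keep the filter that brings residual Moran's I closest to zero
best = min(candidates, key=lambda e: abs(morans_i(filter_out(r, e), w)))
```

Here the first candidate shrinks |I| of the residuals, while the second would make the negative autocorrelation worse, so the criterion picks the first.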

ML approaches
Support vector machines find data points (support vectors) that maximize a separating hyperplane's margin for classification and regression tasks. The margin is the distance between the support vectors and the separating hyperplane; maximizing it helps avoid overfitting. For regression, the ν-support vector regression method harnesses a constraint (ν) that automatically minimizes tolerated errors as well as controls the fraction of support vectors (Schölkopf et al. 2000). Kernels are often used in SVM to transform nonlinear input variables so that they become linearly separable (Drucker et al. 1997; Abu-Mostafa et al. 2012). We considered three types of kernels (linear, polynomial, and radial basis) for hyperparameter tuning. Mathematical details for maximizing the margin can be found elsewhere (Schölkopf and Smola 2001; Abu-Mostafa et al. 2012).
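The three candidate kernels can be written compactly. This Python sketch only shows the kernel functions themselves; the degree, coef0, and gamma defaults are illustrative, not the tuned hyperparameter values from the study.

```python
import math

def linear_kernel(u, v):
    """Plain dot product: k(u, v) = u . v."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """k(u, v) = (u . v + coef0)^degree."""
    return (linear_kernel(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=0.5):
    """Radial basis: k(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

Each kernel replaces the dot product in the SVM optimization, implicitly mapping the inputs into a higher-dimensional space where a linear separator can exist.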
An ANN is a recursive non-linear transformation algorithm. A network consists of input, hidden, and output processing modules called layers. Complexity increases by adding hidden layers, each of which comprises multiple transforming neurons. Any ANN with a simple structure can be represented as a directed acyclic graph. Each neuron in a hidden layer transforms the weighted sum of values from the previous layer and then passes the result to the next layer. The output layer finally makes predictions by computing the weighted sum of the last hidden outputs. Weights are randomly assigned and then adjusted using gradient descent to minimize the loss function. We employed a single-hidden-layer neural network model with sigmoid transformation functions. The hyperparameter, the number of neurons in the hidden layer, was tuned using integers between two and sixteen.
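The forward pass of such a single-hidden-layer regression network can be sketched directly from this description (an illustrative Python sketch, not the nnet implementation used in the study; training by gradient descent is omitted and the weights are taken as given).

```python
import math

def sigmoid(z):
    """Logistic transformation used in each hidden neuron."""
    return 1.0 / (1.0 + math.exp(-z))

def ann_predict(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: input x -> sigmoid hidden layer -> linear output.
    w_hidden[j] holds the input weights of hidden neuron j."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(wj, x)) + bj)
              for wj, bj in zip(w_hidden, b_hidden)]
    # The output neuron computes the weighted sum of hidden activations
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```

Gradient descent would repeatedly run this pass, compare the output with the observed value, and adjust every weight in the direction that lowers the loss.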
Random forest is an ensemble of decision trees built from random samples and random selections of predictors. The algorithm typically uses three hyperparameters: the number of chosen variables, the maximum number of nodes, and the number of trees. The predefined set of decision trees is called a forest, and the final prediction is calculated as the average of the predictions from the forest fitted by the first two hyperparameters (Currit 2002; Grekousis 2019). This procedure is known to reduce out-of-sample errors (Breiman 2001). We tuned two hyperparameters, the number of variables and the maximum number of terminal nodes, by leveraging the total number of variables and the dataset size (Supplementary Table S4).
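The core recipe, bootstrap resampling, a random predictor subset per tree, and averaging, can be sketched with single-split "stump" trees. This is a deliberately simplified Python sketch (not the ranger implementation used here); real trees split recursively and choose split points by minimizing impurity rather than splitting at the mean.

```python
import random

def stump_fit(X, y, feat_idx):
    """Fit a one-split 'tree' on one randomly chosen allowed feature."""
    j = random.choice(feat_idx)
    split = sum(row[j] for row in X) / len(X)           # split at the mean
    left  = [yi for row, yi in zip(X, y) if row[j] <= split]
    right = [yi for row, yi in zip(X, y) if row[j] >  split]
    mean = lambda v: sum(v) / len(v) if v else sum(y) / len(y)
    return j, split, mean(left), mean(right)

def forest_fit(X, y, n_trees=25, mtry=1, seed=42):
    """Grow n_trees stumps, each on a bootstrap sample and mtry features."""
    random.seed(seed)
    n, trees = len(X), []
    for _ in range(n_trees):
        boot = [random.randrange(n) for _ in range(n)]  # bootstrap sample
        Xb, yb = [X[i] for i in boot], [y[i] for i in boot]
        feats = random.sample(range(len(X[0])), mtry)   # random predictors
        trees.append(stump_fit(Xb, yb, feats))
    return trees

def forest_predict(trees, x):
    preds = [(lo if x[j] <= s else hi) for j, s, lo, hi in trees]
    return sum(preds) / len(preds)                      # average over forest

# Toy usage: a monotonic relationship is recovered in the right direction
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]
trees = forest_fit(X, y)
low, high = forest_predict(trees, [0.0]), forest_predict(trees, [3.0])
```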
Random forest can be spatially extended to lower the SAC in cross-validated residuals by incorporating proximity (e.g. Euclidean distance) or geographical connection (e.g. stream network distance or ruggedness; Behrens et al. 2018b) into an RF model. Covariates are preprocessed by a principal component analysis to reduce their collinearity. RFsp avoids overfitting because RF, by nature, uses only a small subset of variables at each split. In this way, RFsp not only improves spatial prediction accuracy but also reduces the SAC in the errors.
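The distinctive RFsp step is the construction of spatial covariates: each sample is described by its Euclidean distances to every observation location ("buffer distances"), yielding an N × N block that is appended to the ordinary predictors. A minimal Python sketch, assuming planar coordinates:

```python
import math

def buffer_distance_matrix(coords):
    """coords: list of (x, y) observation locations.
    Returns the N x N matrix of pairwise Euclidean distances."""
    return [[math.hypot(xi - xj, yi - yj) for (xj, yj) in coords]
            for (xi, yi) in coords]

# Three hypothetical sampling locations on a plane
coords = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = buffer_distance_matrix(coords)
# Row i of D can be appended to sample i's predictor vector (optionally
# after PCA) before fitting a standard random forest.
```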

Comparison of model performance
We compared the six algorithms described above (i.e. OLS, SF, SVM, ANN, RF, and RFsp) with respect to the magnitude and autocorrelation of the residuals they produced for a total of 39 soil attributes (i.e. 39 response variables) from the four study sites. Here, magnitude and autocorrelation were represented by RMSE and residual Moran's I, respectively. RMSE was estimated as:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}

where \hat{y}_i and y_i are the estimated and the observed values, respectively, and n is the number of observations. RMSE was also used as a criterion in the hyperparameter tuning and VI computation described below. A global Moran's I value was calculated as:

I = \frac{n}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

where x_i and x_j are the observations at the ith and jth coordinates, \bar{x} is the mean of the observations, and w_{ij} is the (i, j)th element of a matrix of spatial weights. In this study, we measured Moran's I for all soil variables as well as the residuals resulting from the non-ML and ML models.
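Both evaluation criteria are straightforward to compute. The following plain-Python sketch is illustrative (the study itself used the spdep and mlr packages in R); the chain-style weight matrix in the example is hypothetical.

```python
import math

def rmse(y_hat, y):
    """Root-mean-square error between estimates y_hat and observations y."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(y_hat, y)) / len(y))

def morans_i(x, w):
    """Global Moran's I of values x under an N x N spatial weight matrix w."""
    n = len(x)
    xbar = sum(x) / n
    s0 = sum(sum(row) for row in w)                  # sum of all weights
    num = sum(w[i][j] * (x[i] - xbar) * (x[j] - xbar)
              for i in range(n) for j in range(n))
    den = sum((xi - xbar) ** 2 for xi in x)
    return (n / s0) * (num / den)

# Rook-style weights for four sites on a line (neighbours 0-1, 1-2, 2-3):
# a perfectly alternating variable yields strong negative autocorrelation.
w = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
```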
For the ML models, we applied the grid search approach, which compares combinations of multiple candidate hyperparameter values, to find the optimal hyperparameters using cross-validation (Feurer and Hutter 2019). We defined moderate-sized hyperparameter grids (12-540 combinations) to select the model with the best prediction accuracy (Table 1). For model evaluation, ten-fold cross-validation was applied. The cross-validation split the original dataset into training, validation, and testing datasets to evaluate model accuracy. After the best set of hyperparameters was found, it was applied to fit the entire dataset, and the VI of each input variable was evaluated.
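The grid search procedure can be sketched as follows. This is a simplified Python illustration (the study used the mlr package in R, with ten folds and nested train/validation/test splits); here the folds are contiguous, k is small, and `fit`/`predict` are caller-supplied stand-ins for any of the ML models.

```python
import math

def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds (simplified sketch)."""
    size = math.ceil(n / k)
    return [list(range(i, min(i + size, n))) for i in range(0, n, size)]

def grid_search(fit, predict, grid, X, y, k=5):
    """Return the hyperparameter dict with the lowest mean CV RMSE."""
    def cv_rmse(params):
        errs = []
        for test_idx in kfold_indices(len(X), k):
            train_idx = [i for i in range(len(X)) if i not in test_idx]
            model = fit([X[i] for i in train_idx],
                        [y[i] for i in train_idx], params)
            sq = [(predict(model, X[i]) - y[i]) ** 2 for i in test_idx]
            errs.append(math.sqrt(sum(sq) / len(sq)))
        return sum(errs) / len(errs)
    return min(grid, key=cv_rmse)

# Toy usage: a 'model' that always predicts the constant params["c"];
# with all observations equal to 5, the grid search picks c = 5.
fit = lambda X, y, p: p
predict = lambda m, x: m["c"]
best = grid_search(fit, predict, [{"c": 0.0}, {"c": 5.0}],
                   [[0.0]] * 10, [5.0] * 10, k=5)
```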
Variable importance is a measure of the contribution of a variable to prediction errors (Breiman 2001). We used permutation VI, which relies on permuting each input variable and then evaluating prediction errors with the permuted data. Final VI values were obtained by calculating the ratio or difference between the error from the permuted data and that from the original data (Kuhn and Johnson 2013; Fisher et al. 2019; Molnar 2022). In this research, we compared the permutation VI rank of each variable to see whether the same input variable contributes differently to the prediction accuracy of the studied models; this was possible because we used the same six predictor variables in all the models. For SF, which furnishes additional spatial filters in the final model, we ranked only the VIs of the common predictor variables. However, RFsp employs the principal components (PCs) of predictor variables and a buffer distance matrix (size N × N, where N is the total number of observations). Given that these PCs no longer coincide with the original predictor variables, we decided to exclude the VI measures derived from the RFsp models from our comparative analysis. All procedures were performed with R 4.1.2 (R Core Team 2021). The mlr package version 2.19.0 was used to tune hyperparameters and cross-validate models (Bischl et al. 2016). The spdep version 1.2-2 and spatialreg version 1.2-1 packages were utilized for spatial filtering and Moran's I analyses (Bivand et al. 2013), respectively. For each ML model, the R package ranger version 0.13.1 was used for RF (Wright and Ziegler 2017), nnet version 7.3-16 for ANN (Venables and Ripley 2002), and e1071 (https://cran.r-project.org/web/packages/e1071/index.html, accessed March 1, 2022) for SVM (Chang and Lin 2011). Code and datasets used in this study can be found in the online repository: https://github.com/biogeokim/ml_soil_sac_prediction.
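The permutation VI procedure described above can be sketched generically (a Python illustration using the "difference" version of the score; `predict` stands in for any fitted model, and the toy data are hypothetical).

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Permutation VI sketch: RMSE increase after shuffling each column."""
    rng = random.Random(seed)

    def rmse_of(rows):
        return (sum((predict(r) - t) ** 2 for r, t in zip(rows, y))
                / len(y)) ** 0.5

    base = rmse_of(X)
    importances = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)                         # break the feature-target link
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        importances.append(rmse_of(X_perm) - base)   # difference version
    return importances

# Toy usage: the model uses only feature 0, so permuting the constant,
# unused feature 1 cannot change the error and its importance is zero.
predict = lambda r: r[0]
X = [[float(i), 7.0] for i in range(8)]
y = [float(i) for i in range(8)]
imp = permutation_importance(predict, X, y, n_features=2)
```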

Magnitude and SAC of the residuals
Overall, the ML algorithms employed in this research did not always exhibit better model performance than their non-ML counterparts (Figure 1). In terms of the magnitude of the residuals, RF stood out among the ML methods in producing a significantly lower mean RMSE than the non-ML methods (Figure 1(a); Supplementary Table S5). Mean RMSE values of the other three ML methods (SVM, ANN, and RFsp) were greater than that of SF. There was no significant difference in the mean RMSE between OLS and SVM (p ≈ 1.0). Furthermore, RFsp showed the greatest range of RMSE values, some of which were larger than the maximum RMSE of OLS.
Spatial filtering regression outperformed the other approaches tested, with residual Moran's I values closest to zero (Figure 1(b)). Although the mean Moran's I of the residuals resulting from RFsp was also close to zero, approximately 23% of these I values (nine out of 39) were lower than −0.1, indicating a potential overcorrection effect for residual autocorrelation by this method (Kim 2021; M. Bini personal communication). There was no significant difference in the mean residual Moran's I between OLS and SVM (p = 0.971; Supplementary Table S5). In general, these findings were consistent even when the magnitude and SAC of the residuals were analyzed separately for each study site (Supplementary Figures S1 and S2).

Variation in the magnitude and SAC of the residuals
The RMSE values produced by all methods, except RF, had a significantly negative correlation with the Moran's I of the response variables (Figure 2). This relationship was strongest for the two spatially explicit approaches, SF and RFsp. With the exception of these two methods, the Moran's I values of the residuals were significantly and positively correlated with the Moran's I of the response variables (Figure 3). These correlations were also found when variations in the magnitude and SAC of the residuals were examined separately for each study site (Supplementary Figure S3). Differences (or shifts) between OLS and SF and between RF and RFsp in both the RMSE and Moran's I values of the residuals were significantly and positively correlated with the degree of SAC in the response variables (Figure 4). These shifts were calculated as the difference between either OLS and SF (OLS-SF) or RF and RFsp (RF-RFsp); larger shifts indicate greater improvement of model performance by spatially explicit models compared to their non-spatial counterparts.

Relative importance of the predictor variables
In general, we observed inconsistencies in the relative importance of the predictor variables to soil spatial variability across non-ML and ML approaches and across study sites (Figure 5). That is, no predictor variable was found to have a pervasive pedogeomorphological control on soil properties regardless of the modelling methods or landscapes considered. Surface elevation, for example, seemed to be most influential at the mixed forest I site, but its importance decreased at the mixed forest II site, especially when OLS or SF was used. Plan curvature was least important in the coastal dune, whereas it was the second or third most important predictor at the mixed forest II site. Despite these substantial variabilities, both non-ML algorithms exhibited highly consistent average rankings of the landform attributes within each study site. In other words, the rank orders produced by OLS remained almost always the same as the orders yielded by SF (with the exception of plan curvature in the agricultural field); this consistency was not observed among the ML approaches.

Effects of spatial autocorrelation on modelling results
Our results demonstrate that ML algorithms, despite having gained much popularity in various disciplines over the last two decades, are sensitive to SAC in the input variables when used for the spatial modelling of soil properties. Prediction accuracy has been the central focus and justification of using ML approaches (Wadoux et al. 2020), but three of the four ML methods employed in this study (SVM, ANN, and RFsp) showed larger or similar mean RMSEs compared to the non-ML OLS and SF models (Figure 1(a); Supplementary Figure S1). On average, residuals of the ML models retained more of the SAC inherent in the input variables when compared to SF (Figure 1(b); Supplementary Figure S2). The magnitude and SAC of the residuals were comparatively low for RF and RFsp, respectively; however, despite having the lowest mean RMSE, residual autocorrelation was still present in RF, a finding similar to the description of RF by Georganos et al. (2021). This may be the result of RF overfitting and/or the SAC inherent in the observations of the training dataset, the latter potentially affecting RF internal cross-validation (Makungwe et al. 2021). Although RFsp reduced the level of residual SAC, the usefulness of this method appeared to be substantially diminished by the unstable variation (i.e. wide range) of the RMSE values it yielded.
This work confirms that both the magnitude and spatial structure of the residuals resulting from several ML algorithms widely used in the recent geosciences literature are potentially predictable based on knowledge of the SAC present in the input data. The significant and linear relationships shown in Figures 2 and 3 suggest that the original spatial structure of the soil properties can likely serve as a direct predictor of the autocorrelation that will remain in the residuals after the application of such ML methods as SVM, ANN, and RF. In addition, the Moran's I of the soil data can also be used to predict how much improvement (i.e. decreases in RMSE or residual SAC) a non-spatial model, such as OLS or RF, will undergo if the SAC of the response variable is properly accounted for by SF or RFsp, respectively (Figure 4). These findings are consistent with previous research investigating applications of non-ML techniques to spatial modelling of soil properties (Kim et al. 2016, 2019), water quality (Miralha and Kim 2018), species diversity (Kim 2021), and socio-economic phenomena (Kim and Song 2021; Song and Kim in press).
Our findings point to the necessity of taking SAC into account in both non-ML and ML approaches, as SAC in the residuals increases linearly with increasing levels of SAC in the response variables. Thus, care should be taken to examine the spatial structure of the response variables when using both non-ML and ML methods. When a soil variable has a low amount of inherent SAC (i.e. Moran's I < 0.1 in this research), we recommend the use of RF, as it yielded the lowest RMSE among the approaches tested in this study (Figure 2) and most of the Moran's I values of its residuals were very close to zero (Figure 3). If a soil variable is strongly autocorrelated (especially when Moran's I > 0.4), however, SF appears to be a better technique because the residual SAC it produces is closer to zero than that of RF, with little difference in the RMSE values between the two methods. These findings open the possibility of selecting model types through the establishment of threshold criteria for easily measurable data characteristics, such as the SAC of input variables.
Our results also demonstrate that efforts to identify important factors of soil spatial variability across landscapes can lead to substantially different (even reversed) findings between non-ML and ML approaches (Figure 5). Accordingly, we caution that some GIS-based predictor variables considered important in ML models may prove to be less important in the context of non-ML approaches. Moreover, the above-mentioned RF overfitting issue suggests that the relative importance of predictor variables may be influenced by spatially specific effects of the predictors on each prediction (Shukla et al. 2018; Gupta et al. 2021). These discussions become all the more important when such models are directly used to understand possible physical and chemical processes that shape the patterns of soil-landform couplings in a system of interest.

Limitations and future opportunities
Each ML approach presents inherent limitations when it comes to both the reduction of residual SAC and the attainment of predictive accuracy. SVM employs regularization to prevent overfitting, which can yield lower predictive accuracy in small datasets. Also, the higher level of residual SAC in the SVM results is possibly due to the relatively lower degree of nonlinearity of SVM compared to that of the other ML techniques. Moreover, the decision-making around kernel selection complicates SVM's model selection process, making it more challenging than that of the other tested ML models. We demonstrated that ANN reduced residual SAC compared to SVM and RF, despite its predictive accuracy remaining on par with that of SVM; however, these results may stem from the small datasets used in this study to train the ANN. RF showed lower RMSEs than the other ML methods due to the randomization of predictor selection and resampling. Additionally, its residual SAC was directly proportional to the SAC in the response variables, potentially offsetting the enhanced predictive performance of RF when dealing with highly spatially autocorrelated response variables. Despite the efficacy of RFsp in mitigating residual SAC compared to conventional RF, it introduces increased computational complexity through the addition of a distance matrix (N by N, where N is the number of observations) to the input data matrix. The unique nature of this distance matrix within RFsp makes the comparison of predictive accuracy difficult using conventional train-test split procedures, since the distance matrix is unique to each set of input data. The limitations of each ML approach therefore require careful consideration, and the objective of the soil-geomorphic modelling, whether to reduce residual SAC or to enhance the predictive accuracy of soil properties, should guide the selection of the most appropriate model.
Although some of the differences in residual SAC presented in this paper are attributable to the use of insufficient predictor data (e.g. the use of only six landform variables extracted by GIS-based terrain analysis), the high spatial density of the samples (Supplementary Information II), or the complexity and uncertainty of the data associated with non-stationarity (Kim 2021), our findings suggest that further work is needed to understand the comparative usefulness of ML approaches over conventional non-ML methods. Future studies can address the issue of limited predictor variables using, for example, multi-hierarchical relief attributes, especially when scale-specific dependencies exist between the representativeness of the soil samples and the explanatory power of the variables used (Behrens et al. 2018c; Guo et al. 2019; Möller et al. 2022). In addition, when multi-collinearity is expected to impair model interpretability and lead to misidentification of relevant predictors (Dormann et al. 2013; cf. Supplementary Table S6), future soil-landform modellers are recommended to test VI employing downstream procedures such as recursive feature elimination (Guyon et al. 2002; Svetnik et al. 2004).

Conclusions
Spatial autocorrelation exists nearly ubiquitously in soil-geomorphic data and can provide useful information that researchers can incorporate into the modelling of natural resources. Our findings demonstrate that this inherent spatial structure in soil-geomorphic data often significantly affects the outcomes of common ML algorithms in terms of the magnitude and autocorrelation of the residuals and the predictive importance of landform variables to soil spatial variability. We conclude that the level of SAC in the input variables can be used to guide the selection of appropriate non-ML or ML approaches, making model choices more consistent across divergent pedogeomorphological systems.

Figure 1.
Figure 1. Box-and-whisker plots of (a) the root-mean-square error (RMSE) and (b) the Moran's I values of the residuals produced by different modelling approaches. The orange and sky blue colours indicate non-ML and ML methods, respectively. The numbers in parentheses are the mean RMSE or Moran's I values calculated from the total of 39 soil variables (OLS, ordinary least squares; SF, spatial filtering; SVM, support vector machine; ANN, artificial neural network; RF, random forest; RFsp, spatial extension of random forest). See Supplementary Figures S1 and S2 for the results obtained by analysing the magnitude and spatial autocorrelation of the residuals separately for each study site.

Figure 2.
Figure 2. Relationships between spatial autocorrelation (SAC) of each response (soil) variable and RMSE. All SAC is represented by global Moran's I. Each plot aggregates the findings of all individual study sites (Supplementary Figure S3). The orange and sky blue colours indicate non-ML and ML methods, respectively. Linear regression models denoted by *** are significant at the level of p < 0.001. See Supplementary Figure S3 for the results obtained by examining variations in the magnitude of the residuals separately for each study site.

Figure 3.
Figure 3. Relationships between SAC of each response (soil) variable and SAC of the model residuals. All SAC is represented by global Moran's I. Each plot aggregates the findings of all individual study sites (Supplementary Figure S3). The orange and sky blue colours indicate non-ML and ML methods, respectively. Linear regression models denoted by *** and * are significant at the levels of p < 0.001 and p < 0.05, respectively. See Supplementary Figure S3 for the results obtained by examining variations in the SAC of the residuals separately for each study site.

Figure 4.
Figure 4. (a) Relationships between SAC of each response (soil) variable and the reduction in RMSE after incorporating SAC into OLS and RF. (b) Relationships between the SAC of each response variable and the reduction in the SAC of the model residuals after incorporating SAC into OLS and RF. All SAC is represented by global Moran's I. Each plot aggregates the findings of all individual study sites. Linear regression models denoted by *** are significant at the level of p < 0.001.

Figure 5.
Figure 5. Relative predictive importance of the landform variables in five modelling approaches (RFsp not applicable). At each individual study site, the average ranking of each landform variable was calculated from the multiple soil variables tested (elev, elevation; aspe, slope aspect; slop, slope angle; plan, plan curvature; upar, upslope area; prof, profile curvature).

Table 1.
List of hyperparameters in the grid search for machine learning algorithms.