Prediction of gully erosion susceptibility mapping using novel ensemble machine learning algorithms

Abstract Spatial modelling of gully erosion at regional level is very relevant for local authorities to establish successful counter-measures and to change land-use planning. This work is exploring and researching the potential of a genetic algorithm-extreme gradient boosting (GE-XGBoost) hybrid computer education solution for spatial mapping of the susceptibility of gully erosion. The new machine learning approach is to combine the extreme gradient boosting machine (XGBoost) and the genetic algorithm (GA). The GA metaheuristic is being used to improve the efficiency of the XGBoost classification approach. A GIS database has been developed that contains recorded instances of gully erosion incidents and 18 conditioning variables. These parameters are used as predictive variables used to assess the condition of non-erosion or erosion in a given region within the Kohpayeh-Sagzi River Watershed research area in Iran. Exploratory results indicate that the proposed GE-XGBoost model is superior to the other benchmark solution with the desired predictive precision (89.56%). Therefore, the newly built model may be a promising method for large-scale mapping of gully erosion susceptibility.


Introduction
Effective and sustainable utilization of natural resources are indeed necessary for proper understanding of environmental processes (Fistikoglu and Harmancioglu 2002;Zhang et al. 2019;Feng et al. 2020;Tian et al. 2020). The sustainability of our environment and societies greatly depends on the optimal uses of various kinds of natural resources (Cheng et al. 2016;Keesstra et al. 2016;Lu, Tian, et al. 2019;Lu, Liao, et al. 2019;He et al. 2020;Yang et al. 2020;Zhang et al. 2020;). Among the several types of natural resources (He, Chen, et al. 2018;He, Shen, et al. 2018), soil is the most vital and one of the key issues for the sustainability as it is support basic human needs and livelihood on the earth surface. In the era of twenty-first century, the sustainability of soil not only depends on management strategies adopted by land planners and farmers but also impacted through marketing, subsidies and several rules taken in political decisions ). In the most effective manner, soil is primarily a land related phenomenon and has significant links to the several functions of United Nations sustainable development goals (SDGs). The United Nations has proposed 17 SDGs that can challenge national governments as well as stakeholders, and SDGs can be categorized into three groups i.e. environment, social and economic accordingly Visser et al. 2019). The proper functioning of soil is greatly emphasis on ecosystem services and links to the achievement of SDGs. In the effort to achieve several SDGs, the most in different environmental conditions including geological formations, topography, soil characteristics and climatic conditions (Poesen et al. 1998). Notably, this type of soil erosion is also governed by human-caused factors including land use alteration as well as exploitation of natural resource such as water, soil and plants (Ayele et al. 2015;Slimane et al. 2016).
By the reason of several complicated interaction among the triggering factors and the particular conditions of local environment, the formation and expansion of the gullies cannot be limited to the simple relationship between the slope and the area of the watershed located in the gully head-cut and physiological characteristics (Gayen et al. 2019;Pal and Chakrabortty 2019). The initiation of ephemeral gullies and its further development is largely dependent on several factors such as slope gradient, drainage area, topographic indices, precipitation, surface runoff, land use characteristics and soil types (Amare et al. 2019). Consequently, in different regions, rate and type of affection of parameters on the formation of the gully may vary (Ayele et al. 2015). As pointed out by previous studies (Pimentel 2006;Bouaziz et al. 2011;Shit et al. 2015), an average amount of 30-40 Ton/Ha/Y is removed due to erosion in Asia, Africa and South America. Particularly in Iran, this amount can be up to 50% with soil losses estimated to be 15 Ton/Ha/Y. Thus, in order to assessment of gully erosion and associated risk, various models have been proposed. These models can be classified into the categories of experimental and mechanical, dynamic and static and quantitative and qualitative models (Gitas et al. 2009). These models have taken into account a wide range of factors affecting soil erosion such as natural and human factors. The main criteria for selecting these models include goal, available data, time and cost (Helming et al. 2005). In recent decades, the analysis of soil erosion hazards by the above models and their integration with satellite data and advanced data modelling tools are employed to identify the physical hazards of soil erosion at regional scales. These models can provide information about erosion-prone locations and their trends, as well as erosion analysis scenarios. In addition, GIS techniques, with its major advantages of greater capacity to process and analyse big data size in digital layers in vector format and raster, can be used to determine the type and value of factors affecting the prediction of prone-erosion through different models, and finally, the results will be presented in the form of a digital layer (Droogers and Kite 2002).
In recent years, the greater advancement in computer technology (Zhu 2020;Cao, Zhao, et al. 2020;Cao, Wang, et al. 2020;Lv and Qiao 2020;Chen and Li 2020) along with gradually development of GIS knowledge (Zhu, Wang, Chen, et al. 2019;Zhu et al. 2020) and several platforms (Han et al. 2019;Yan et al. 2020), a wide variety of models have been proposed for solving scientific problems such as gully erosion susceptibility mapping (GESM). Therefore, based on the methodological approaches, the methods used for natural hazards susceptibility assessment has been classified into quantitative, qualitative and most recently machine learning approaches by several research group of people. Consequently, the aforementioned methods have also been widely used in assessment of gully erosion. These methods can be briefly reviewed as follows: i. Qualitative methods include techniques such as analytical hierarchy process (AHP) employed in ) and TOPSIS which is utilized in (Arabameri, Pradhan, Rezaei, et al. 2019). ii. Quantitative models includes bivariate and multivariate statistical models: information value (Arabameri, Pradhan, and Rezaei 2019a), frequency ratio (Rahmati, Haghizadeh, et al. 2016;Meliho et al. 2018), probability models (Rahmati et al. 2017), weights-of-evidence (Gayen et al. 2019), logistic regression (Arabameri, Pradhan, et al. 2018) and linear discriminant analysis (Arabameri and Pourghasemi 2019). iii. Artificial intelligence models employs advanced data-driven methods including artificial neural network (Sarkar and Mishra 2018), multivariate adaptive regression spline (Conoscenti et al. 2018). decision tree ensembles (Arabameri, Yamani, Pradhan, et al. 2019;Hosseinalizadeh et al. 2019);(Tien Bui, Shirzadi, et al. 2019) and flexible discriminant analysis (Gayen et al. 2019).
It can be seen from the literature that the predictive capabilities of artificial intelligence models are often better than those of conventional qualitative/quantitative methods (Arabameri and Pourghasemi 2019;Gayen et al. 2019). This fact is explained by the superiority in nonlinear and multivariate data modelling of the former approaches. Although various artificial intelligence models have been put forward for gully erosion, other alternative models should be investigated. It is due to the complex nature of this hazard which involves many influencing factors and the unique environmental conditions of each study area. Evidences of this fact are well demonstrated in previous studies on susceptibility mapping of gully erosion (Arabameri, Pradhan, et al. 2018;Conoscenti et al. 2018;Sarkar and Mishra 2018;Tien Bui, Shirzadi, et al. 2019) as well as other phenomena (Siahkamari et al. 2018;Jaafari et al. 2019;Kavzoglu et al. 2019;Tien Bui, Hoang, et al. 2019).
In this research study, the Kohpayeh-Sagzi watershed is chosen for assessment of GESM. The Kohpayeh-Sagzi watershed locates in Esfahan Province (central part of Iran). In this region, human pressure on land and soil resulting from population growth along with climatic condition is arid and semi-arid types, over-grazing and human induced deforestation and mismanagement of land use have led to serious soil erosion and related problems. The present research work is carried out by using 18 gully conditioning factors, all of these variables have been selected based on local geo-environmental conditions i.e. topography, hydrology, climatological, soil and environmental related conditions. The modelling of gully erosion susceptibility (GES) was carried out by using a novel ensemble of extreme gradient boosting machine (XGBoost) and evolutionary genetic optimization machine learning models and its comparison with three standalone models such as support SVM, RF and LR. Extensive literature studies have been shown that there are till now no research work which is based on the ensemble approach of genetic algorithm-extreme gradient boosting (GE-XGBoost), thus GESM using novel ensemble of GE-XGBoost is the novelty of this research study. XGBoost is an efficient algorithm that has the capacity to find out missing values and enhanced the prediction performance result. On the other hand, the major advantage of genetic algorithm (GA) is the multi objective optimization and search from a population of points and not from single points. The stand alone machine learning models used in this study has been previously also widely used by several researchers in different fields and the models give optimal prediction performance result due to their unique advantages. The present research article is prepared into the following subsequent sections: In the first section, we already described about the basic information of gully erosion and associated methods used in GESM. The second section represent about the details of the study area and data used for further progress of our research work. Subsequently, next section explained in details about the back ground of research methods used in this study. The machine learning models used for GESM is presented in the fourth section. The subsequent section deals with investigational result and discussion. The concluding section of the article summarizes the research findings with several closing remarks.

Description of the study area
Kohpayeh-Sagzi watershed is situated between longitudes 51 34 0 02 00 to 53 00 0 56 00 E and latitudes 32 10 0 59 00 to 33 12 0 50 00 N. The study area is within the Esfahan Province (the central part of Iran; Figure 1). The maximum and minimum elevations of Kohpayeh-Sagzi watershed are 1443 and 3322 m a.s.l., respectively. The area is 1736 m above sea level in average (a.s.l.). In that study area, the mean annual precipitation and temperature is 192.8 mm and 12.9 C. The majority of this area has a flat topography whereas north, north-west and north-east parts have covered by undulating rugged mountain.
The highest slope in the study area is 69.56 with a mean of 3.4 . The depth of soil is varied from 6.8 to 165 m. Poor rangeland and agriculture-dry farming are the major land uses practices in the study area. Pebble fan shaped debris from alluvial  deposits of upper maternal materials with total slope of 3 to 8%, coarse limestone fan shaped debris along with argillaceous limestone, limestone shale, limestone marl and old alluvial deposits with slope 5% are among the most important geomorphologic units. Quaternary lithotypes including pediment fan of low level and deposition in valley terrace, salt flat, sand dunes and sand sheet and clay flat are widely observed. In addition, aridisols is the dominant soil type in the region.

Data used
As shown in previous studies, the interaction of several geo-environmental factors like topographical, hydrological, characteristics and properties of soil, present LULC and human interaction causes gulling in prone area (Nyssen et al. 2002;Poesen et al. 2003;Nyssen et al. 2006;El Maaoui et al. 2012). Based on literature review of related works (Conoscenti et al. 2018;Arabameri, Yamani, Pradhan, et al. 2019); (Tien ; physiographic characteristic in study area, multi collinearity test, availability of data, and scale of research, 18 conditioning factors are selected as influencing factors of gully erosion in this research area. These factors include elevation, slope, PC, CI, SPI, TWI, TPI, distance to stream, drainage density, distance to road, rainfall, NDVI, soil type, soil depth, soil EC, geomorphology, lithology and LULC (Figure 2(a-r)).
These conditioning factors were extracted from various resources including: i. A 12.5 m spatial resolution of ALOS PALSAR DEM is collected from Alaska Satellite System and used for extraction and calculation of topographical and hydrological factors consisting of elevation, slope, PC, CI, TWI, SPI, drainage density, TPI and distance to stream. Equation 1 is used for calculation of TPI (Guisan et al. 1999;De Reu et al. 2013): where, E pixel represent the respective cell elevation and E surrounding represents the neighbor pixels mean elevation. ii. The Geological Society of Iran (GSI) produced geological map of Iran with the scale of 1:100,000 (http://www.gsi.ir/) and here we used this map to generate the lithology map. iii. The topographical map with the scale of 1:50,000 were collected from the secondary sources (www.ngo-org.ir). The road network map was digitized from Google Earth satellite images. iv. The NDVI map was prepared by using LANDSAT-8 OLI satellite image, which was collected from USGS earth explorer website (https://earthexplorer.usgs.gov/) with spatial resolution of 30 meters. v. The mean annual rainfall map was prepared by using 30 years (1986 to 2016) rainfall statistics of several synoptic stations. vi. Land-use types have a have a main role in runoff and infiltration . The LULC map for present study area was collected from the secondary source (https://www.scwmri.ac.ir). Various LULC units of present study area are shown in Table 1. vii. The following (http://esfahan.areeo.ac.ir/) website was used for collecting various soil characteristics maps and prepared soil type, soil depth and soil EC maps for this research study area. viii. The geomorphology layer is mapped using Landsat 8 OLI satellite images for 2016, the RGB853 false color, lines of geomorphic surfaces and Google Earth images. To distinguish geomorphic units from geomorphologic knowledge, land area processes, slope map obtained from ALOS-PALSAR DEM and geological map were used. The boundaries of the units were determined based on field surveys, and with the help of GPS. All steps of geomorphology mapping were performed in ArcGIS10.5 environment. The detail descriptions of geomorphology units for this study area are presented in Table 2. The details about the data sources are presented in Table 3.
Additionally, here we used ENVIv4.8, ArcGISV10.5, Google Earth pro 7.8 and Arc HydroV10.4 software to prepare several conditioning factors which are used for GESM. Alongside, several statistical and validation analysis was carried out by using the SPSSv24 and Microsoft Excel 2016 packages. Beside this, enormous readers can follow the previous works of ( Tien Bui, Shirzadi, et al. 2019) for more details of the gully erosion affecting factors.

Extreme gradient boosting machine
The XGBoost was introduced by Chen and Guestrin (2016) is selected to be used in this study because it represents the state-of-the-art within the machine learning community. This algorithm relies on classification trees (Breiman et al. 1984) and the gradient boosting framework (Friedman 2001). XGBoost is a popular machine learning system of scalable used for boosting the performance of classification trees (Li 2010). A classification tree typically establishes a set of rules to categorize each gully erosion instance as a function of a set of predisposing factors in a graph structure. The graph is built as a single tree whose leaves are assigned with a score which expresses how likely is a gully to fall in a given factor class, be it a categorical (e.g. lithotypes) or ordinal (e.g. reclassified slope steepness). Notably, in the framework of XGBoost, the loss function used to train the ensemble model is combined with a procedure known as regularization which penalizes the complexity of the trees. This regularization method can potentially improve the performance of gully erosion model since over fitting phenomenon is alleviated. Over fitting occurs when a new dataset of gully erosion used for prediction may not be explained as well as the one stored in the training set. Thus, regularization contributes to the limitation of over fitting and to the improvement of the gully erosion prediction model flexibility.
XGBoost combines the results of different tree models by taking their weighted averages (Gusain et al. 2018). An iterative process is employed by using weak prediction models to learn the overall prediction model at every step by correcting the misclassification in the preceding iteration. The XGBoost model is constructed by optimizing the following objective function: The final part of the alluvial plain of the Zayandehrud river consists of alluvial deposits with a slope of 0 to 1% C The stabilized sand dunes with a slope of 2 to 15 % D Low elevation hills along with limestone, marl, shale and sandstone with moderate to high rock outcrops with a steep slope of over 35% E The relatively high hills include limestone and limestone shale with marl and sandstone bedding with high rocky outcrops and averaging slope of 18% F High hills include of grey shale, limestone, marl and limestone shale with a steep slope of 15% G Alluvial plain with very deep soils and heavy to very heavy texture with slope of 0-1% H Alluvial plain consisting of alluvial deposits with limestone and marl with slope of 1 to 4% I Hill slope plain consisting of alluvial deposits with limestone and sandstone with slope of 1 to 5% J Flood plain composed of alluvial and flood deposits, along with limestone, salt and sandstone and clay and salt slabs with slope of 0-3% K The flood plain consisting of alluvial and flood deposits along with limestone, salt, limestone marl and sandstone with clay and salt slabs with slope of 0 to 3 % L The flood plain consists of alluvial and flood deposits, along with gypsum, salt, limestone, limestone marl, sandstone and clay M The low lands consist of very fine alluvial deposits, with clay and salt slabs with slope of 0 to 0.5% N old Plateau consists of limestone, salt, grey and red marls, argillaceous limestone, sandstone, conglomerate and limestone marl O High and very high mountains along with limestone, marl and limestone shale, with a steep slope of 33% P Very high mountains with limestone and conglomerate stones along with rocky outcrops with a steep slope of 43 % Q Low-elevation mountains of limestone, shale and limestone marl, along with limestone sandstones with slope of 26% R Relatively high to low mountains consisting of metamorphic and pyrogenic rocks along with limestone, marl and shale with slope of 26% S High mountains of igneous rocks and volcanic lavas along with metamorphic rocks with slope of 45% T Coarse limestone fan shaped debris along with argillaceous limestone, limestone shale, limestone marl and old alluvial deposits with slope 5% U Coarse limestone fan shaped debris along with argillaceous limestone, limestone shale, conglomerate, sandstone, calcareous marl with a slope of 2 to 8 % where, P n i¼1 lðy i , y i Þ represents the loss function of root mean square error used for fitting the model on the training data. P K k¼1 Xðf k Þ refers to the regularization term which helps to avoid overfitting. K denotes to the number of individual trees. f k is a tree within the ensemble. y i and y i are an actual and predicted class output, respectively.
The training process of XGBoost aims at minimizing of the aforementioned objective function. To do so, a weak learner is added to the ensemble iteratively. Notably, while the level of model complexities grows as the learning of trees progresses, the regularization term prevents the over fitting problems by regulating the number of leaf nodes in the tree (Chang et al. 2018;Gusain et al. 2018). The detail about the model construction phases of XGBoost was find out in the previous works of (Chen and Guestrin 2016; Gusain et al. 2018).

GA for global optimization
GA (Goldberg 1989) is a stochastic approach and widely used by several research group of people. The algorithm for this model is based on the idea of organisms' natural selection and genetic processes. In this model for better identification of individuals, several phenomena of reproduction, crossover and mutation in nature are simulating and compete with each other among the group of solutions candidates (Kilinc and Caicedo 2019). Individuals with better fitness combine their genetic features to generate new candidate solutions. Here, the whole process is repeated in several times until the criterion of stopping is satisfied. The operation of the GA is demonstrated in Table 4. More details about GA approach has been found in recent established works (Amjad et al. 2018;Mirjalili et al. 2020). In this study, the GA is applied to optimize the hyper-parameters of the aforementioned XGBoost model used for gully erosion prediction.

Random forest (RF)
The RF, proposed in (Breiman 2001), is a type of decision tree ensemble. This method is used to an overall classifier by combining various individual trees. Its structure can be boiled down to few important steps. From a gully dataset made of gully erosion/no-gully, erosion instances associated to a set of conditioning factors. RF starts by sampling with replacement one subset of the data upon which a classification tree will be applied. Notably, the initial number of predictors of individual trees is kept constant and the tree is allowed to grow to its largest extent. Each leaf of a tree is made of a predictor's binary split which best fits the data. This operation ensures that each tree in the forest can learn or differentiate substantially from the others. This fact helps to reduce the tendency of decision trees to over fit the data. The aggregation among all trees is then computed via a majority vote which is achieved by combining the learning outcomes of all the weak tree based learners (Deng et al. 2019). Successful implementation of the RF model has been demonstrated in previous works on spatial mapping of natural hazards Arabameri, Yamani, Pradhan, et al. 2019;Dang et al. 2019). Vapnik (1998) proposed the algorithm of support vector machine, is a classification model which has its root from the statistical learning theory. The main task of SVM algorithm at hand is to develop a decision surface that separates the data instances into two classes: No-gully and gully erosion. The generalized from of a SVM model in decision surface is to set of training data points fx k , y k g N k¼1 along with the input data of x k 2 R n and a group of class labels represent by y k 2 fÀ1, þ 1g: Notably, nonlinear mapping function and kernel techniques is used to deal with nonlinear classification problems, through increases the data structure within the SVM model (Hamel 2011). Accordingly, a hyper-plane is constructed in the high dimensional space and used for data classification. Several research paper related to this work indicates that SVM is very much appropriate; thus, here we have selected SVM is a benchmark method (Tehrany et al. 2015;Ngoc Thach et al. 2018;Abedini et al. 2019).

Logistic regression (LR)
The LR is a very popular statistical method that employs a logistic function to model a binary dependent variable. This machine learning method establishes a division boundary that classified the data structures into two classes of binary values i.e. 0 and 1, which indicates negative and positive class, respectively (Agresti 2003). The advantages of the LR are straightforward learning phase and transparent model structure. To identify the model structure of a LR model, the loss function of log likelihood and fast iterative methods such as the stochastic gradient descent can be used (Gormley et al. 2016). Although being simple, this classifier has been shown to achieve acceptable prediction accuracy in various recent studies (

Gully erosion database
The understanding of spatial distribution and forms of gullies is most important prerequisite for the preparation of GESM, and it is show by the gully erosion inventory map (GEIM; Conoscenti et al. 2014). To prepare the GEIM various resources were used in this research work: i. Gully erosion points were identified by the published map from the (http://esfahan.areeo.ac.ir/) website. ii. Interpretation and verified them from Google Earth images satellite image. iii. To identify exact location of gullies, extensive field surveys were carried out by using a GPS (Global Positioning System).
A total of 529 gully locations were recognized throughout this study area. Thereafter, this total gully head cut points were randomly classified into two groups of 70/30 ratio for training (370 gullies) and validating (159 gullies) purposes. Also, an equal amount of gully absence locations were created by random point generation tool in ArcGIS platform to support the calibration and validation procedures (Kornejady et al. 2017). Figure 3 demonstrates the field photograph in this field of research.
As mentioned earlier, 18 variables namely elevation, slope, PC, CI, SPI, TWI, TPI, distance to stream, drainage density, distance to road, rainfall, NDVI, soil type, soil depth, soil EC, geomorphology, LULC and lithology are employed as erosion influencing factors. These factors are integrated into a GIS database which is illustrated in Figure 4.

Multicollinearity (MC) diagnosis and feature selection
MC is defined as the linear relationship among the two or more variables (Alin 2010). Basically, in a regression model, MC analysis has been carried out when two or more variables are correlated among each other Chen and Li 2020). The result of traditional statistical method in feature selection procedure has ambiguity and affected by extreme values because of the presence of large dataset Chen, Li, et al. 2021). Thus, in GESM, analysing the collinearity among the gully related conditioning factors is very important because the presence of collinearity decreases the model's predictive performance . Variance inflation factor (VIF) and tolerance (TOL) are commonly used statistical techniques for collinearity analysis among the independent variables by many researchers (Band, Janizadeh, Chandra Pal, Saha, Chakrabortty, Shokri, et al. 2020;Roy et al. 2020;Chen, Zhao, et al. 2020) and are employed in the present study. The threshold value of TOL is < 0.1 and VIF is > 10 can indicates collinearity present among the variables (Oh et al. 2017). Therefore, first we have extracting the values of gully and non-gully points from several conditioning factors used in this study and finally VIF and TOL values were calculated by using SPSS 16 package. The following equations were used to calculate TOL and VIF (Arabameri, Karimi-Sangchini, et al. 2020;Band, Janizadeh, Chandra Pal, Saha, Chakrabortty, Melesse, et al. 2020).
where, R 2 j indicate the regression of coefficient.

The GA optimized XGBoost model
The hyper-parameters to be set appropriately in the construction phase of the XGBoost model. These hyper-parameters include: i. Maximum tree depth (MTD): this parameter strongly affects the model accuracy as well as complexity. Generally, trees with shallow depths may to under fit the data; meanwhile, large values of MTD can lead to over fitted models. ii. Learning rate (g): This parameter used to prevent the over fitting problems through the step size shrinkage phase. iii. Minimum split loss (c): This parameter governs the data partition on a leaf node of the tree. iv. Regularization coefficient (a): The parameter a affects the model regularization.
Therefore, it helps to alleviate over fitting. v. The subsample ratio (r): The parameter r affects the number of data columns when each decision tree is constructed. vi. The number of decision trees (TN): This is the maximum number of individual decision trees in the ensemble.
Thus, the proper selection of the above hyper-parameters is very much necessary for the analysis of training and prediction outcome of the XGBoost model. In addition, due to the presence of infinite number of candidate solutions, it is a challenging task to identifying a suitable set of the hyper-parameters in XGBoost model. Because an optimization problem has been formulated during the hyper-parameter selection task (Sayed et al. 2018;Xiong et al. 2018;Deng et al. 2019;Lu, Tian, et al. 2019;Lu, Liao, et al. 2019), this study employs the GA metaheuristic to optimize the model construction phase of the XGBoost.
The general concept of the GA optimized XGBoost is presented in Figure 5(a). At the first generation (g ¼ 0), all of the model hyper-parameters are initialized randomly. The GA gradually evaluates the fitness of each population member to identify robust candidates during the searching of optimization within the model. After the optimization process terminates, the optimized XGBoost model used for gully erosion prediction can be constructed as illustrated in Figure 5(b). Moreover, an objective function is necessary to described for exhibit the suitability of each solution. The objective function (OF) of the EG-XGBoost algorithm is described as follows: where, Y A indicates actual output, Y P indicates predicted output and N refers to the number of training data samples.

Validation and accuracy assessment
The validation and accuracy assessment in modelling is crucial for its proper evaluation in management studies (Chao et al. 2018;Wang et al. 2020). Without validation, the evaluation of machine learning models output result has less significance in reality. In the current study, several statistical indexes namely classification accuracy (CAR), sensitivity, specificity, receiver operating characteristics (ROC) curve, Kappa coefficient and many were used to validation and accuracy assessment of the model's result. The classification accuracy (CAR) is computed as follows: where, R C represents the sample numbers being classified correctly and R A denote the total number of samples. Beside CAR, another validation performance tools i.e. the ROC curve Nguyen et al. 2019) and Cohen's kappa coefficient (McHugh 2012) are also employed to express the overall performance of a classifier. Furthermore, the specificity, sensitivity and ROC-area under curve (AUC) are also widely employed for model assessment and the respective equations are as follows: where, TP refer to the true positive; TN denotes the true negative; FP represents the false positive and FN is the false negative (van Erkel and Pattynama 1998).

Experimental results
The results of the multicollinearity analysis on the relevance of gully erosion conditioning factors are reported in Table 5. As stated previously, collinearity among the predictors are investigated by VIF and TOL. It is worth reminding that TOL values is < 0.1 and VIF is > 10 indicate collinearity problems among the variables (Oh et al. 2017). As can be seen from the analysis outcomes, TOL and VIF values of the employed variables do not exceed such thresholds. Moreover, the relevancy of conditioning factors is presented by using the Information Gain Ratio. Based on the results, highest Information Gain Ratio occupied by soil depth, followed by soil type TWI, lithology and NDVI. Influencing factors of geomorphology, elevation, plan curvature, slope, TPI and drainage density feature moderate values of Information Gain Ratio. Distance to stream and SPI has comparatively low relevancy. Notably, convergence index has a null value of Information Gain Ratio. Thus, this variable is excluded from the construction phase of the gully erosion prediction model.
Additionally, the performance of the newly constructed EG-XGBoost is benchmarked with those of the RF, SVM and LR. These benchmark models are constructed in Python. The grid search method (Hoang and Bui 2018) is used to identify the hyper-parameters within the SVM model. In addition, the tuning parameters of RF and LR are set based on the authors' experience and trial-and-error processes.
The model performances are presented in Tables 6 and 7 for training and testing phases, respectively. From the output result, it was noticed that the newly developed GE-XGBoost model has attained the most preferred accuracy for training (CAR ¼ 95.69%) and testing (CAR ¼ 89.56%) data predictions, followed by RF (testing CAR ¼ 87.03%), SVM (testing CAR ¼ 83.54%) and LR (testing CAR ¼ 77.22%). Moreover, the ROCs of the prediction models are illustrated in Figure 6. Based on the overall results of training and testing dataset, it is stated that GE-XGBoost is the best fit model for gully erosion prediction analysis.
In the next step of the current research, the proposed GE-XGBoost and other benchmark model are utilized to calculate the GES within this study area. The calculated susceptibility is then transformed to a grid format and opened in ArcGIS 10.4 package. By using the aforementioned MLA models, the GES maps (Figure 7) are obtained and each map classified into five categories for better understanding the variation. These five groups of classes are found from very high to very low. It is noted that the thresholds for categorizing these measured indices into the five classes

Discussion
Among the several types of water induced soil erosion, gully erosion is the most destructive form, in terms of its negative impact on humankind throughout the world (Arabameri, Asadi Nalivan, et al. 2020). The water induced gully erosion is responsible for several natural causes but recently, human interventions on the environment and associated changes of the equilibrium of ecosystem also influences the formation and development of gullies (Chaplot et al. 2005;Band, Janizadeh, Chandra Pal, Saha, Chakrabortty, Shokri, et al. 2020). This type of erosion is responsible for destroying the environment in two ways: first, it causes extensive land degradation through surface and sub-surface soil erosion and associated result i.e. sedimentation into the downstream area, degradation of agricultural land and adverse effect on human's life; second, the gradual reduction of groundwater recharge as the rate of surface runoff increases (Kou et al. 2016). Extensive literature study has shown that gullies exist in most semiarid and arid landscapes around the world, with substantial morphological activities and dynamics (Arabameri, Yamani, Pradhan, et al. 2019;Gayen et al. 2019).  Figure 6. ROCs of the prediction models.
In comparison, semi-arid climate and precipitation regimes promote soil erosion through low vegetation cover and recurring heavy rainfall events. Therefore, the adverse effect of climatic conditions is very much significant for large scale land degradation in the form of gully formation and development. Keeping in view the adverse damages of our environment due to gully erosion, several research works has been conducted to find out the responsible factors of gully erosion and prepared the map of susceptible areas. Hence, it is indeed necessary to manage the adverse effect of gully induced soil erosion through sustainable measurement techniques. The present study area of Kohpayeh-Sagzi watershed, Iran having several problems related to gully erosion; therefore, it is the main obstacle for sustainable land uses practices and associated land management strategies. Thus, it is very much necessary to use suitable models for optimal prediction of GES and apply several suitable measurement techniques for soil and water management. In this regard, here we used machine learning algorithm (MLA) of XGBoost, GA, RF, SVM and LR for modelling GES.
Given the shortcomings and limitations of each of the stand alone models Feng et al. 2020;Qian et al. 2020;Qu et al. 2020;Shi et al. 2020), scientists have proposed and developed integrated methods to overcome their disadvantages and increase their efficiency (Cao, Dong, et al. 2020;Peng et al. 2020;Liu et al. 2021). In this research, A novel ensemble of GE-XGBoost was proposed for better prediction analysis of gully erosion than the stand-alone model. In this study, 18 conditioning factors was used which is responsible for gully erosion based on several geo-environmental conditions. To accurately represent the output result here, we used multicollinearity test of VIF and TOL techniques and information gain ratio (IGR) to proper understand the relationship among the variables. The result of VIF, TOL and IGR indicates that there is no multi-collinearity problem among the variables as all the variable's value is within the threshold value of multicollinearity. In VIF, among the 18 variables, rainfall is the highest value of 3.237 and distance to road is the low value of 1.265. In the case of TOL, distance to road and rainfall consist of the highest and lowest value of 0.911 and 0.309, respectively. On the other side, the result of IGR is shown that soil depth has high priority with the value of 0.1779 and followed by soil type (0.1683), TWI (0.1653) and lithology (0.1549) and zero value has been found in convergence index factor. Thus, in the current study area soil play an important role for the occurrences and further development of gullies, as the soil texture determine the rate of infiltration, associated surface runoff and soil resistance (Deng et al. 2015). The hydrological process of a terrain is quantify through TWI and determine the erosive power of water and greatly influences on gully erosion . Whereas, the characteristics of lithology determine the water percolation rate and significantly influences on gully initiation and extension (Band, Janizadeh, Chandra Pal, Saha, Chakrabortty, Shokri, et al. 2020). The same result i.e. soil characteristics, hydrological and lithological properties has been highlighted by others established research work. Alongside, elevation (0.1103), slope (0.1094), drainage density (0.1074), rainfall (0.0968) and LULC (0.0859) also play significant role for development of gullies and associated land degradation problems.
Several statistical indices have been used for prediction performance of the model's output. In this study, we have carried out goodness-of-fit and prediction performance of several models used for gully erosion modelling. The goodness-of-fit among the four models of GE-XGBoost, RF, SVM and LR, statistical index of true positive, true negative, positive predictive value, negative predictive value, sensitivity, specificity, classification accuracy, MSE and kappa with their highest value of 346, 364, 93.26, 98.11, 98.02, 93.57, 95.69, 0.061 and 0.914, respectively, is shows GE-XGBoost is the best fit model, in the training phase. In the case of false positive and false negative with their highest value of 96 and 63 indicates LR is the goodness-of-fit model. The result of predictive performance in all the models have shown high accuracy, but the ensemble of GE-XGBoost is the top most optimal with the AUC value is 0.969 and followed by RF (AUC ¼ 0.933), SVM (AUC ¼ 0.886) and LR (AUC ¼ 0.870) in the testing phase. The other statistical indices used in this study for validation and accuracy assessment has also shows that GE-XGBoost is the best prediction model for GES flowed by RF, SVM and LR. However, false positive and true positive has shown that LR is the best model with their value of 44 and 28, respectively, followed by SVM, RF and GE-XGBoost. Finally, it is concluded that assessment of gully erosion in proper way always faced uncertainties due to lack of knowledge regarding physical condition, complex function of the model and spatial inconsistency (Arabameri, Pradhan, and Rezaei 2019b). It is also very much difficult to proper validation of the model in an accurate way. It has also been stated that the framework of the model, the input data structure are accurate and proper validation techniques is used than the uncertainty of the prediction result will be reduced (Rojas et al. 2010). Thus, the novel ensemble of GE-XGBoost with their proper validation performance resolved the problems related to gully erosion in various regions of the world.
The worldwide soil erosion phenomena and associated loss of soil fertility has been gradually decreasing the sustainability of environment and national economy (Lal 2003). Therefore, measurement of soil sustainability is an important goal among the farmers and decision makers (Novara et al. 2021). Several kinds of soil erosion control strategies should be implemented in various scale and regions to optimal solution for this type of devastating problems. Various literature studies have been shown that cover crops have positive impact in the control of soil erosion and has been considered for soil erosion management in several regions across the world (L opez-Vicente et al. 2020; Rodrigo-Comino et al. 2020;Novara et al. 2021). In general, cover crops protected the soil surface from direct falling of raindrop and reduced the runoff velocity and associated erosional activities. Alongside, vegetation cover is also improved the soil quality and play a key role to reduced soil losses through plant cover and root system (Van Hall et al. 2017). Hence, in this study area also vegetation coverage, construction of small check dams and proper land use planning should be implemented to mitigate and control the soil erosion problem.

Conclusion
The present research work has been carried out by using a new methodology of GIS technology, remote sensing and MLA for spatial mapping of gully erosion hazard. GIS technology and remote sensing are employed to detect erosion locations and construct a database containing erosion influencing factors. The GIS database established in the present study takes into account the historical cases of gully erosion events and 18 conditioning factors. Elevation, slope, PC, CI, SPI, TWI, TPI, distance to stream, drainage density, distance to road, rainfall, NDVI, soil type, soil depth, soil EC, geomorphology, lithology and LULC are utilized as predicting variables used to determine the status of non-erosion or erosion for a certain area within the study region.
A novel machine learning ensemble method that combined the XGBoost and GA, named as GE-XGBoost, is developed in the study. The division of gully from nongully was carried by construction of decision boundary through XGBoost. In addition, GA is used to optimize the performance of the XGBoost method. Experimental outcomes demonstrate that the hybrid method can deliver predictive performance which is superior to those of RF, SVM and LR. Therefore, the proposed GE-XGBoost can be an effective model to assist local authorities in establishing effective countermeasures against gully erosion and proper planning of land-use.
The novel ensemble model of GE-XGBoost which is used in this study area can also be extended in future for predicting gully erosion in several parts of the world and spatial modelling of other natural hazards such as landslide or flash flood. Hence, every research study has some limitations, thus this study also not included the hydrological modelling for gully system analysis and we need emphasize the same for future studies. Moreover, the proposed method's performance used in this work can be improving enhancing with other advanced metaheuristic optimization algorithms for an appropriate potential research direction.

Data availability statement
The data that support the findings of this study are available on request from the authors.