Assessment of the effects of training data selection on the landslide susceptibility mapping: a comparison between support vector machine (SVM), logistic regression (LR) and artificial neural networks (ANN)

ABSTRACT Landslides are a natural hazard that causes substantial economic damage and human losses every year. Numerous researchers have studied landslide susceptibility mapping (LSM), each attempting to improve the accuracy of the final outputs. However, few studies have been published on the effects of training data selection on LSM. Thus, this study assesses the effects of random selection of training landslides on the accuracy of support vector machine (SVM), logistic regression (LR) and artificial neural network (ANN) models for LSM in a catchment at the Dodangeh watershed, Mazandaran province, Iran. An inventory of 160 landslide locations was collected by the Geological Survey of Iran for this investigation. Different methods were implemented to define the landslide locations, such as inventory reports, satellite images and field surveys. Moreover, 14 landslide conditioning factors were considered in the analysis of landslide susceptibility. These factors include curvature, plan curvature, profile curvature, altitude, slope angle, slope aspect, distance to faults, distance to stream, topographic wetness index, stream power index, terrain roughness index, sediment transport index, lithology and land use. The results show that the random selection of landslide training data affected the parameter estimations of the SVM, LR and ANN algorithms. The results also show that the selection of training samples had an effect on the accuracy of the susceptibility model because landslide conditioning factors vary according to the geographic locations in the study area. The LR model was found to be less sensitive than the SVM and ANN models to the selection of training samples. Validation results showed that SVM and LR models outperformed the ANN model for all scenarios. The average overall accuracies of the LR, SVM and ANN models are 81.42%, 79.82% and 70.2%, respectively.


Introduction
Landslides result in significant economic damage and human losses annually. This phenomenon is described as a wide range of ground movements, which includes rock falls, deep slope failures and shallow debris flows (Varnes 1978). A quick look at the economic damage and human losses would be enough to encourage managers and planners to undertake comprehensive projects on landslides. Modelling landslides through appropriate models could provide reliable landslide susceptibility maps, which could offer appropriate information to managers and land-use planners for development plans and land-use management.
CONTACT Biswajeet Pradhan biswajeet24@gmail.com; Biswajeet.Pradhan@uts.edu.au
The number of landslide locations in the training data-set and the sampling strategy are important factors that significantly affect the quality of landslide susceptibility maps. Although the number of samples used to train a susceptibility model depends on the availability of landslide inventories in the given area, the selection of training and testing data-sets can affect the model prediction capability. Dhakal et al. (2000) noted that different randomly generated groups of landslide and non-landslide samples affect the identification of critical factors, such as geology, and the model prediction capability. They found that the stratified random sampling method achieved better accuracy than random sampling without stratification. In another paper, Remondo et al. (2003) discussed random sampling strategies as being more suitable and logical than a spatial split of the study area, and spatial validation as being more logical than temporal validation. They argued this by noting that the entire landslide inventory (analysis and validation sets) corresponded to the same period and conditions, whereas temporal validation compares the analysis population with a population corresponding to a later period. In addition, Hjort and Marmion (2008) analysed the effects of sample size on the accuracy of geomorphological models. They analysed sample groups of different sizes, ranging from 20 to 600 samples, based on the generalized linear model, generalized additive model, generalized boosting method and artificial neural network (ANN) in two data settings, i.e. independent and split-sample approaches. They found that accuracy increased with the number of samples in all models, and that a robust level of prediction was reached with 200 observations. Overall, they suggested that a few hundred observations are needed in geomorphological modelling. Heckmann et al. (2014) investigated the effects of sample size on the accuracy of the logistic regression (LR) model for predicting the spatial distribution of debris flows. They discussed two main problems of having an inadequate sample size for training landslide susceptibility models. First, the sample has to be large enough to cover the variability of the landslide conditioning factors within the study area. Second, the sample must not be too large, because this will likely violate the assumption of independent observations owing to spatial autocorrelation.
Sampling strategy was also found to affect landslide susceptibility models in terms of both prediction accuracy and generalization capability. Three main types of sampling strategy are identified in the literature: scarp, seed cell and point. Yilmaz (2010) suggested that the scarp-sampling method could achieve higher accuracy than other sampling methods for ANN-based landslide susceptibility mapping (LSM). Paulin et al. (2010) analysed the effects of pixel size on the cartographic representation of different landslide types. In general, they found that as pixel size increased, the landslides lost cartographic representation. However, they also suggested that different models could achieve similar results under different pixel sizes. Baeza et al. (2010) studied the effect of sample size and type on the accuracy of landslide susceptibility maps. Their results showed that using at least half of the samples for training a discriminant analysis is necessary. In addition, they found that sample type had no significant influence on model performance. More recently, Hussin et al. (2016) developed landslide susceptibility models with different sampling strategies (i.e. scarp centroid, points populating the scarp, entire scarp polygon). They showed that the highest success rates were obtained when sampling shallow landslides as 50-m gridpoints and debris flow scarps as polygons.
The literature review revealed that several studies have employed different models for LSM, but the spatial variation of the landslide inventory has been considered less frequently. Given the spatial variations of landslide conditioning factors, the differences in the quality of landslide samples and the spatial distribution of landslide locations, training data selection is expected to affect model performance. Therefore, this study assesses the effects of random training landslide selection on the accuracy of the SVM, LR and ANN models. The models were selected based on previous studies (Pradhan and Lee 2010a; Ballabio and Sterlacchini 2012; Devkota et al. 2013; Pourghasemi et al. 2013a; Nourani et al. 2014; Shahabi et al. 2014; Umar et al. 2014) that have used these models successfully in different areas for LSM. To carry out this task, five training and validation data-set scenarios were generated randomly. Then, SVM, LR and ANN models were trained and built for each scenario to evaluate the effects of the training samples in terms of their spatial distribution. The number of training samples was kept at 70% of the whole landslide inventory data.

Study area
The study area is situated in the Dodangeh watershed, Sari County, Mazandaran Province, Iran. It is located at approximately 36° 2' 44.56" N latitude and 53° 14' 34.78" E longitude, and covers an approximate area of 1754 km² (Figure 1) with a population of 8469 (2012 census). The maximum temperature in the area ranges between 29 °C and 31 °C between June and August, and the mean precipitation varies between 35 and 105 mm/month. According to the Emberger climagram, the climate in the area is cold and wet (altitude of 800-2000 m) and cold and semi-arid (altitude of 2000-2800 m). The study area is covered mostly by vegetation, including medium density forest, mixed forest, mixed agriculture, orchard and mixed orchard. The study area contains various lithological formations, including Tertiary, Cretaceous and Quaternary formations. The M23msl (marl, calcareous sandstone and siltstone, silty marl/sandy limestone, mudstone) and PIQcs (conglomerate, sandstone, siltstone, silty marl) units cover about 74.9% of the study area. The Quaternary deposits cover 5.11% of the study area and include Qal (recent loose alluvium in the river channels), Q2 (young alluvial fans and terraces, river terraces; mainly cultivated) and Qsw (swamp).

Data used
Landslide inventory data are considered the most vital factor in predicting future incidents (Tien Bui et al. 2012). For this investigation, 160 landslide locations were detected in the Dodangeh watershed. Different methods were implemented to define the landslide locations, such as inventory reports, satellite images and field surveys. Most landslides are shallow rotational, with a few being translational. Fourteen conditioning factors were considered in the analysis of landslide susceptibility based on the literature review and data availability (Sangchini et al. 2016; Pradhan et al. 2017). These factors include curvature, plan curvature, profile curvature, altitude, slope angle, slope aspect, distance to faults, distance to stream, topographic wetness index (TWI), stream power index (SPI), terrain roughness index (TRI), sediment transport index (STI), lithology and land use. These factors were transformed into a vector spatial database by employing ArcMap 10.3. A 10-m digital elevation model (DEM) was derived from a 1:25,000-scale topographic map. From this DEM, the slope degree, slope aspect, altitude, curvature, SPI, TWI, TRI and STI were generated.

Slope
Slope, which can be defined as the angle formed between any section of the surface and a horizontal datum, has considerable influence on slope stability. Many studies have shown that slope is a main conditioning factor in LSM (Dai and Lee 2002; Ayalew et al. 2004; Gómez and Kavzoglu 2005). Slope angle was produced employing the 10-m DEM. This layer was classified into (1)

Aspect
Aspect-related factors, including exposure to sunlight, winds and precipitation, are vital factors in triggering landslides (Dai et al. 2001). In this work, the DEM was employed to compute the aspect classes. In the next stage, aspect was categorized into nine groups: eight directions and flat (Figure 2(b)).
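As an illustration of the nine-group scheme, the following minimal Python sketch bins an aspect angle into flat plus eight compass directions. The 45° sector width, the sector boundaries and the use of a negative value to flag flat cells are assumptions made for this example; the class limits actually used in the study are not stated in the text.

```python
# Hypothetical helper: sector limits and the "negative means flat"
# convention are assumptions, not taken from the study.
DIRECTIONS = ("N", "NE", "E", "SE", "S", "SW", "W", "NW")

def aspect_class(aspect_deg):
    """Return 'Flat' or one of eight compass directions for an aspect in degrees."""
    if aspect_deg < 0:
        return "Flat"
    # Shift by half a sector so that North spans 337.5-22.5 degrees.
    sector = int(((aspect_deg + 22.5) % 360) // 45)
    return DIRECTIONS[sector]

print([aspect_class(a) for a in (-1, 0, 100, 200, 350)])
```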

Altitude
Altitude is affected by several geomorphologic and geological processes, and thus, it is an influential conditioning factor in landslide analyses (Gritzner et al. 2001; Dai and Lee 2002; Ayalew and Yamagishi 2005). In this investigation, altitude was prepared using the Dodangeh watershed 10-m DEM, where the watershed altitude varied between 210 and 1240 m (Figure 2(c)).

Curvature (plan and profile)
Plan curvature is defined as the curvature of the contour line formed by the intersection of a horizontal plane with the surface. The convergence and divergence of water during flow, which affect slope erosion processes, can be determined using plan curvature (Ercanoglu and Gokceoglu 2002; Oh and Pradhan 2011). This layer was generated by the ArcGIS spatial analysis tool (Figure 2(e)). Profile curvature represents the flow acceleration and erosion/deposition rate, and affects the variation of flow velocity down the slope (Talebi et al. 2007). This layer was also generated by the ArcGIS spatial analysis tool (Figure 2(f)).

Sediment transport index (STI)
STI represents the slope failure and deposition processes (Moore and Wilson 1992). There is a direct relationship between the STI and the water accumulation at the bottom of the catchment as well as the amount of erosion (Pourghasemi et al. 2013a). STI can be calculated as

$$\mathrm{STI} = \left(\frac{A_S}{22.1}\right)^{0.6} \left(\frac{\sin\beta}{0.0896}\right)^{1.3}, \qquad (1)$$

where $\beta$ denotes the slope at a certain pixel and $A_S$ refers to the upstream area. The STI values (0-550) were divided into five classes using the quantile algorithm, as shown in Figure 2(g).
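Equation (1) can be evaluated per pixel as in the minimal Python sketch below. The function name and the choice of degrees as the input unit for slope are assumptions for illustration. The area constant defaults to 22.1 as printed in Equation (1); it is exposed as a parameter because 22.13 is the value commonly quoted elsewhere for this index.

```python
import math

def sediment_transport_index(upstream_area, slope_deg, area_const=22.1):
    """Evaluate STI = (A_S / area_const)^0.6 * (sin(beta) / 0.0896)^1.3.

    upstream_area: upstream area A_S of the pixel
    slope_deg: slope beta in degrees (assumed input unit)
    """
    beta = math.radians(slope_deg)
    return (upstream_area / area_const) ** 0.6 * (math.sin(beta) / 0.0896) ** 1.3

# Illustrative values only, not taken from the study area.
print(sediment_transport_index(upstream_area=150.0, slope_deg=25.0))
```

By construction, the index equals 1 when $A_S$ equals the area constant and $\sin\beta = 0.0896$, and it is zero on flat terrain.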

Stream power index (SPI)
SPI represents the erosive power of flowing water, based on the assumption that discharge is proportional to the specific catchment area. This factor was determined by the formula introduced by Moore and Wilson (1992):

$$\mathrm{SPI} = A_S \tan\beta,$$

where $A_S$ is the specific catchment area and $\beta$ is the local slope gradient computed in degrees. In this study, SPI was reclassified using the quantile algorithm into five categories, namely (1) 0-219, (2) 220-985, (3) 986-2520, (4) 2530-6020 and (5) 6030-14,000 (Figure 2(j)).
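The SPI computation and its reclassification can be sketched in Python as follows. The sketch assumes the commonly used form of the index, SPI = A_S tan β, and uses the class limits listed above; the function names are hypothetical, and values that fall in the small gaps between the listed ranges (e.g. 2520-2530) are assigned to the higher class, which is a choice made here and not one stated in the text.

```python
import bisect
import math

# Upper limits of classes 1-4 as listed in the text; class 5 is everything above.
SPI_UPPER_LIMITS = [219, 985, 2520, 6020]

def stream_power_index(specific_catchment_area, slope_deg):
    """Assumed form SPI = A_S * tan(beta), with beta given in degrees."""
    return specific_catchment_area * math.tan(math.radians(slope_deg))

def spi_class(spi_value):
    """Map an SPI value to one of the five categories (1 to 5)."""
    return bisect.bisect_left(SPI_UPPER_LIMITS, spi_value) + 1

# Illustrative values only.
value = stream_power_index(specific_catchment_area=1200.0, slope_deg=30.0)
print(value, spi_class(value))
```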

Buffer zones (distance to fault and stream)
The landslides occurred mainly along the fault and decreased sharply with distance from it. A buffer map of the seismogenic fault with a 5-km buffer interval is shown in Figure 2(k). Moreover, the distance to the stream was derived in two steps. First, the streams were delineated using flow accumulation and converted into a vector format. Second, a Euclidean distance analysis was conducted to produce the distance to the stream factor. In addition, the distance to fault was derived using the same distance analysis and existing topographic maps of the study area (Figure 2(k,l)).

Land use
Land use is affected by human activities and alterations in the environment. Its importance in LSM has been noted in several publications (e.g. Restrepo et al. 2003). In this research, the Landsat Enhanced Thematic Mapper (2006) image was used to produce the land-use layer by employing a supervised classification algorithm with an accuracy of 86%. The land-use classes include agriculture, dry farming, high density forest, medium density forest, mixed forest/orchard, mixed agriculture/orchard, orchard, mixed orchard/agriculture, sandy/dune and urban (Figure 2(m), Table 1).

Lithology
Lithology can be regarded as one of the most vital conditioning factors in LSM because the strength and permeability of rocks and soils are influenced directly by lithological characteristics (Kavzoglu et al. 2014). Thus, several studies have considered this factor in their analyses (Dai and Lee 2002; Ayalew and Yamagishi 2005). The lithology map (Figure 2(n)) was generated by the Geological Survey of Iran at 1:100,000 scale (Table 2).

Methods for landslide susceptibility zonation
In the current work, LSMs were produced employing GIS-based SVM, LR and ANN models. The procedure was completed in three steps. First, five random landslide inventory scenarios were generated (Figure 3). Then, three approaches (i.e. SVM, LR and ANN) were used to model the correlation between the landslide conditioning factors and landslide occurrence in each scenario using Weka 3.8 software. Finally, the accuracy of all approaches in the different scenarios was evaluated by the ROC curve and confusion matrix to assess the effect of sample selection on model accuracy. Figure 4 shows the general methodology of the proposed method.

Support vector machine (SVM)
SVM has been employed widely in different classification and regression problems because of its effectiveness in working with linearly non-separable and high-dimensional data-sets (Kavzoglu et al. 2014; Mountrakis et al. 2011). Statistical learning theory is the theoretical basis of SVM (Cortes and Vapnik 1995). Ballabio and Sterlacchini (2012) mentioned two kinds of errors in landslide modelling. The first is the assumption of data linearity in classic models. The second is model overfitting (Ballabio and Sterlacchini 2012), which reduces model generality and its applicability in larger regions. These two errors can reduce model performance. The SVM seeks to find an optimal separating hyperplane that differentiates the two classes (Hong et al. 2016b).
In the linearly separable case, the hyperplane can be determined by

$$y_i (w \cdot x_i + b) \geq 1 - \delta_i,$$

where $w$ refers to the coefficient vector that determines the orientation of the hyperplane in the feature space, $b$ represents the offset of the hyperplane from the origin and $\delta_i$ are the positive slack variables (Cortes and Vapnik 1995). Finding the optimal hyperplane amounts to solving the optimization problem

$$\text{Minimize} \quad \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{n} \alpha_i, \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C,$$

where $\alpha_i$ are the Lagrange multipliers and $C$ refers to the penalty. In this work, the SVM model was implemented with the RBF kernel as follows (Pourghasemi et al. 2013b):

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right),$$

where $K(x_i, x_j)$ denotes the kernel function; $\gamma$ is the gamma term in the kernel function for the RBF, sigmoid and polynomial kernels; $d$ denotes the degree term; $r$ is the bias term; and $\gamma$, $d$ and $r$ are the parameters that should be optimized in the modelling process.
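To make the role of the gamma term concrete, the following minimal Python sketch evaluates an RBF kernel in its standard form, K(x_i, x_j) = exp(-γ ||x_i - x_j||²), for two feature vectors. The function name and the sample vectors are assumptions for illustration; the study itself ran the SVM in Weka 3.8, and this sketch does not reproduce that implementation.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

# Two hypothetical, already normalised conditioning-factor vectors.
pixel_a = [0.20, 0.55, 0.10]
pixel_b = [0.25, 0.40, 0.30]

# A larger gamma makes the similarity decay faster with distance.
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(pixel_a, pixel_b, gamma))
```

The kernel equals 1 for identical vectors and approaches 0 as the vectors move apart, which is why the choice of gamma, together with the penalty C, controls how local the fitted decision boundary becomes.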

Logistic regression (LR)
LR is one of the most widely employed statistical models in LSM. It describes the relationship between a categorical dependent variable and several independent factors, which could be categorical, continuous or binary variables (Hong et al. 2016a). The benefit of the algorithm is that the variables do not require a normal distribution (Pradhan and Lee 2010c). The dependent variable in LR is coded as 0 or 1, denoting landslide absence and presence. The model output varies between 0 and 1 and represents landslide susceptibility. LR is based on the logistic function, determined as

$$P = \frac{1}{1 + e^{-z}},$$

where $P$ denotes the probability related to a certain observation, and $z$ is defined as

$$z = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n,$$

where $b_0$ denotes the intercept of the algorithm, $b_i$ is the coefficient representing the contribution of the independent variable $X_i$, and $n$ denotes the number of conditioning factors (Atkinson and Massari 1998).
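The logistic function described here takes the standard form P = 1 / (1 + e^(-z)) with z = b_0 + Σ b_i X_i. A minimal Python sketch is given below; the function name, the coefficients and the factor values are hypothetical and are not the estimates reported in Table 3.

```python
import math

def landslide_probability(intercept, coefficients, factors):
    """Logistic model: P = 1 / (1 + exp(-z)), z = b_0 + sum(b_i * X_i)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, factors))
    # Numerically stable evaluation for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical intercept, coefficients and factor values for one pixel.
print(landslide_probability(-1.2, [0.04, -0.002, 0.3], [25.0, 800.0, 1.0]))
```

Because the output is bounded between 0 and 1, it can be read directly as a susceptibility value for the pixel.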

Artificial neural network (ANN)
The ANN is a computational procedure with the capability to acquire, represent and compute a mapping from one multivariate data space to another (Garrett 1994). The goal of the ANN model is to build a method for predicting outputs from input factors that have not been used in the modelling process (Lee et al. 2003). The most prominently employed neural network method is the back-propagation learning algorithm (Lee et al. 2003). The standard ANN model comprises three layers, namely the input layer (i.e. landslide conditioning factors), hidden layers and output layer (i.e. landslide susceptibility). The ANN assigns a weight to each input factor, multiplies each input by its weight, sums the products and applies a non-linear transfer function to produce the result (Lee et al. 2003). This study employed two neural network architectures, namely the multilayer perceptron (MLP) and radial basis function (RBF), both of which are robust classifiers (Yilmaz and Kaynar 2011). The main difference between RBF and MLP is that MLP is more general, while RBF performs more localized learning and is associated with the input data (Yilmaz and Kaynar 2011). The MLP can separate non-linear data (Lee et al. 2003), extract useful information from imprecise data, and produce acceptable results even when the inputs are not perfect (Ermini et al. 2005; Kanungo et al. 2006).
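The weighted-sum-and-transfer step can be illustrated with a minimal forward pass through a single hidden layer. The sigmoid transfer function, the layer sizes and all weights below are assumptions for illustration only; the text does not state the architecture or weights obtained in Weka, and training by back-propagation is not shown.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Multiply inputs by weights, sum the products, apply a sigmoid transfer."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: input layer -> one hidden layer -> single output neuron."""
    hidden = [neuron_output(inputs, w, b)
              for w, b in zip(hidden_weights, hidden_biases)]
    return neuron_output(hidden, output_weights, output_bias)

# Hypothetical example: three normalised factors, two hidden neurons.
susceptibility = mlp_forward(
    inputs=[0.4, 0.1, 0.8],
    hidden_weights=[[0.5, -0.2, 0.3], [-0.4, 0.9, 0.1]],
    hidden_biases=[0.0, 0.1],
    output_weights=[1.2, -0.7],
    output_bias=0.05,
)
print(susceptibility)
```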

Validation of the susceptibility map
The quality of a landslide susceptibility model is usually verified by using landslide inventory data that were not used to build the model. Of the 160 landslides identified, 112 (≈70%) locations were used to build the landslide susceptibility maps, while the remaining 48 (≈30%) cases were used for model validation. The landslide susceptibility models were evaluated using both the confusion matrix and receiver operating characteristics (ROC).
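The random 70/30 partition repeated over five scenarios can be sketched in Python as follows. The function name, the seed and the use of integer identifiers for the 160 landslides are assumptions for illustration; the random algorithm actually used to generate the scenarios is not specified in the text.

```python
import random

def make_scenarios(landslide_ids, n_scenarios=5, train_fraction=0.7, seed=0):
    """Return a list of (training_ids, validation_ids) pairs, one per scenario."""
    rng = random.Random(seed)  # fixed seed so the scenarios are reproducible
    ids = list(landslide_ids)
    n_train = round(len(ids) * train_fraction)
    scenarios = []
    for _ in range(n_scenarios):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        scenarios.append((sorted(shuffled[:n_train]), sorted(shuffled[n_train:])))
    return scenarios

scenarios = make_scenarios(range(160))
for number, (train, valid) in enumerate(scenarios, start=1):
    print(number, len(train), len(valid))
```

With 160 locations and a training fraction of 0.7, each scenario contains 112 training and 48 validation landslides, and only the membership of the two sets changes between scenarios.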

Confusion matrix
The confusion matrix contains information on model classification accuracy (Provost and Kohavi 1998) and thus on model performance. The matrix columns display the instances in an estimated class while the rows show the instances in an actual class (Sammut and Webb 2011). This matrix helps experts to determine whether the system is confusing the two classes (i.e. mislabelling one as the other) or correctly classifying them (Stehman 1997). This matrix was also employed to evaluate algorithm performance in other investigations.
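A two-class confusion matrix with rows as actual classes and columns as estimated classes, together with the overall accuracy derived from it, can be computed as in the minimal Python sketch below. The labels (0 for non-landslide, 1 for landslide) and the sample data are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """2x2 matrix: rows are actual classes, columns are estimated classes."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def overall_accuracy(matrix):
    """Share of correctly classified instances (the matrix diagonal)."""
    total = sum(sum(row) for row in matrix)
    return (matrix[0][0] + matrix[1][1]) / total

# Hypothetical validation labels: 1 = landslide, 0 = non-landslide.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
matrix = confusion_matrix(actual, predicted)
print(matrix, overall_accuracy(matrix))
```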

Receiver operating characteristic (ROC)
The ROC curve is a graphical plot that illustrates the performance of a binary classifier. This curve has been used in many studies to evaluate algorithm performance in LSM and groundwater potential mapping (Tien Bui et al. 2012; Ozdemir and Altural 2013; Hong et al. 2016a; Naghibi et al. 2017), where the true positive rate is plotted against the false positive rate at several threshold settings (Huang et al. 2002; Naghibi and Moradi Dashtpagerdi 2016; Mousavi et al. 2017). The area under this curve (AUC) measures the accuracy of the employed model in LSM. The AUC ranges from 0.5 to 1, where a value of 1 indicates perfect performance and a value of 0.5 denotes weak model performance. To apply the ROC curve in this study, 47 non-landslide locations were generated by a random algorithm and then employed to calculate the area under the curve using the ArcGIS software.
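The area under the ROC curve can also be computed without drawing the curve, as the share of landslide/non-landslide pairs in which the landslide location receives the higher susceptibility score (ties counted as one half). The minimal Python sketch below uses this pairwise formulation; it is an illustrative alternative to the ArcGIS procedure used in the study, and the scores are hypothetical.

```python
def roc_auc(landslide_scores, non_landslide_scores):
    """AUC as the probability that a landslide outranks a non-landslide."""
    wins = 0.0
    for pos in landslide_scores:
        for neg in non_landslide_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(landslide_scores) * len(non_landslide_scores))

# Hypothetical susceptibility scores at validation locations.
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

In this toy example, eight of the nine pairs are ranked correctly, so the AUC is 8/9.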

Results and discussion
In this study, five training data selection scenarios were generated randomly and analysed to evaluate their effects on the accuracy of SVM, LR and ANN models for LSM. Subsequently, the landslide susceptibility maps were classified into five classes (i.e. very low, low, moderate, high and very high) using the quantile classification scheme (Hong et al. 2016a, 2016b). Figure 5 shows the maps produced by the SVM model for each scenario. In general, the spatial distribution of the landslide susceptibility zones shows significant effects of training data selection. The SVM model fits the best hyperplane that can separate landslides from non-landslides effectively, although fitting the hyperplane can be difficult when the predictors (landslide factors) are not separable. This indicates that the selection of training samples can affect the SVM accuracy because landslide conditioning factors vary according to the geographic locations in the study area. SVM also works well with high-dimensional data, and in the current study, 14 landslide conditioning factors were used. Categorical factors such as land use and lithology can have lesser effects on SVM accuracy. The reason is that when selecting different landslide training data subsets, the number of selected landslides may not vary from one class to another. However, for continuous factors such as slope and altitude, a greater effect can be observed because the values vary continuously.
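Quantile classification assigns roughly the same number of pixels to each class by cutting the susceptibility values at their quintiles. A minimal Python sketch is given below; the function name is hypothetical, and `statistics.quantiles` with its default method is used here as a stand-in for the GIS quantile scheme, whose exact interpolation rule may differ.

```python
import bisect
import statistics

CLASS_LABELS = ("very low", "low", "moderate", "high", "very high")

def quantile_classes(values, labels=CLASS_LABELS):
    """Assign each value to a class using equal-count (quantile) break points."""
    # n classes need n - 1 cut points.
    cuts = statistics.quantiles(values, n=len(labels))
    return [labels[bisect.bisect_left(cuts, v)] for v in values]

# Hypothetical susceptibility index values for 100 pixels.
classes = quantile_classes([float(v) for v in range(1, 101)])
print({label: classes.count(label) for label in CLASS_LABELS})
```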
In contrast, the LR model was found to be less sensitive to training data selection. Figure 6 shows that the results of the LR model are similar across scenarios in terms of spatial distribution, but the distribution of the susceptibility zones differs from that produced by the SVM model. Both methods attempted to fit a linear equation to the landslide inventory data, and could generally place the landslide locations in the high and very high susceptibility classes. Similarly, the susceptibility maps produced by the ANN model had spatial distributions of susceptibility zones nearly similar to those produced by the LR model (Figure 7).
The random selection of landslide training data affected the parameter estimations of the SVM, LR and ANN algorithms. Table 3 shows the estimated parameters of the landslide conditioning factors in each scenario. In general, it can be observed that both the coefficient value and the positive/negative correlation of the factors with the dependent variable (landslide presence/absence) were affected. For example, the TWI factor coefficient was −0.083 in the first scenario and 0.028 in the second scenario. Similarly, the TRI coefficient estimated by the LR varied from one scenario to another. The coefficients of the slope and altitude factors had consistent negative and positive signs. The results suggested that estimating the importance of a factor requires further analysis (such as optimization, parameter fine-tuning, new kernel function, etc.) and cannot rely only on the coefficients estimated by the SVM, LR or ANN models. For this purpose, researchers often use the frequency ratio and other factor optimization methods, such as random forest and the chi-square test. Different coefficient values may produce similar accuracy, but the final outputs can be affected and the importance of the factors may not be reliable.
The validation results of the three models using the ROC curves and confusion matrix are shown in Tables 4 and 5. For all scenarios, the LR outperformed the SVM and ANN models. The average overall accuracies of the LR, SVM and ANN models are 81.42%, 79.82% and 70.20%, respectively. The ROC curves showed that the average success rate of the LR algorithm is higher than that of the SVM algorithm by ≈6%, and than that of the ANN model by 8%. In addition, the prediction rate of the LR model is higher than those of the SVM and ANN models by an average of 10%. This quantitative assessment indicates that the LR and ANN models are less sensitive to training data selection than the SVM model in the study area. The advantages of the LR model include that the final model output can be interpreted as a probability and not an exact prediction value, as in the case of other models. The LR has fewer parameters to fine-tune, while the SVM model requires optimizing the kernel function and the penalty and gamma parameters. The ANN model requires finding the best network architecture that can predict the output classes accurately. In addition, training the LR model is more efficient than training the SVM and ANN, especially with large-scale data.
Further model evaluation was made by producing landslide density graphs, as shown in Figure 8. The number of landslides in each susceptibility class was calculated and plotted in a two-dimensional chart, where its horizontal axis showed the susceptibility class and the number of landslides was presented in the vertical axis. The results showed that the LR model predicted the highest landslide percentage (42%) in the very high class. In addition, the SVM and ANN models predicted 13% and 38% of landslides in the very high susceptibility class.
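The landslide share per susceptibility class that underlies such a density graph can be tallied as in the minimal Python sketch below. The function name and the sample data are hypothetical; the input is assumed to be the susceptibility class read from the map at each landslide location.

```python
CLASSES = ("very low", "low", "moderate", "high", "very high")

def class_shares(landslide_classes):
    """Percentage of landslides that fall in each susceptibility class."""
    total = len(landslide_classes)
    return {c: 100.0 * landslide_classes.count(c) / total for c in CLASSES}

# Hypothetical classes sampled at four landslide locations.
shares = class_shares(["very high", "very high", "high", "low"])
print(shares)
```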
Previous LSM works using the LR, SVM and ANN methods reported several findings related to the current work. Xu et al. (2012) showed that the training data and kernel function selections had significant effects on SVM modelling, and that optimization is required for accurate LSM. Pham et al. (2017) found that the SVM model outperformed the LR model. Meanwhile, Wang et al. (2016) found that the LR model outperformed the ANN model, and that selecting appropriate landslide samples for training the models can improve model accuracy. The SVM, LR and ANN are widely used for LSM, and are concluded to have good predictive capabilities with certain limitations. Previous works also suggested optimizing the training data-set as well as the model parameters for better LSM. In addition, the integration of these models into a hybrid model was also found to improve the base model accuracy (Pham et al. 2017).
One drawback of the SVM model is the considerable time required for its application, while LR requires less time. To improve the SVM model performance, different researchers have suggested creating ensemble models by combining its results with other statistical methods. For instance, SVM was combined with frequency ratio in Tehrany et al. (2015) and has been reported to have high efficacy. The high efficacy of the LR method could be related to its nonparametric nature (Nandi and Shakoor 2010), which allows the model to work with data of skewed distribution, a phenomenon that can also be observed in the natural environment and in landslide conditioning factors. The current work suggests careful pre-processing of the input data before the actual modelling. Outliers, highly correlated factors and noisy data should be handled to reduce the effects of training data selection on the final product. However, the results suggested that a comparison of several models may be a better solution for accurate modelling because domain knowledge differs from one area to another.

Conclusions
Landslide susceptibility assessment methods have progressed significantly over the past decade; however, the effect of random training data selection is not well investigated. This study contributes to the literature by investigating the effects on the accuracy and parameter estimates of the SVM, LR and ANN models under five training data-set scenarios for LSM. In these scenarios, the data-sets were created using random sampling algorithms. Overall, the results showed that the LR and ANN models had better performances than the SVM model in terms of sensitivity to training samples. The findings indicated that both the coefficient values and the positive/negative correlations of the factors with the dependent variable (landslide presence/absence) were affected.
The main challenge in landslide susceptibility assessment is having large-scale and accurate landslide inventory data. With a limited number of landslide locations, models such as ANN can be difficult to build and optimize. Training models with a small amount of training data often leads to models with less generalization capacity. Hence, producing relatively acceptable landslide susceptibility maps with limited data through appropriate landslide location selection, as well as optimizing model hyperparameters and architecture, is highly suggested. Future works should focus on improving model generalization so that models built with large data-sets can be used in data-scarce environments.

Disclosure statement
No potential conflict of interest was reported by the authors.