Evaluation of landslide susceptibility mapping by evidential belief function, logistic regression and support vector machine models

Abstract The main purpose of this study was to produce landslide susceptibility maps using evidential belief function (EBF), logistic regression (LR) and support vector machine (SVM) models and to compare their results for the region surrounding Yongin, South Korea. We compiled a landslide inventory map of 82 landslides based on reports and aerial photographs and confirmed these data through extensive field surveys. All landslides were randomly separated into two data sets of 41 landslide data points each; half were selected to establish the model, and the remaining half were used for validation. We divided 18 landslide conditioning factors into the following four categories: topography factors, hydrology factors, soil map and forest map; these were considered for landslide susceptibility mapping. The relationships between landslide occurrence and landslide conditioning factors were analyzed using the EBF, LR and SVM models. The three models were then validated using the area under the curve (AUC) method. According to the validation results, the prediction accuracy of the LR model (AUC = 94.59%) was higher than those of the EBF model (AUC = 92.25%) and the SVM model (AUC = 81.78%); the LR model also had the highest training accuracy.


Introduction
Landslides are the most common type of natural geological disaster, causing great damage in terms of material costs and loss of human life. Landslides are frequently responsible for enormous property damage and incur both direct and indirect costs, including actual damage and future damage (Fang 2012). In general, landslides occur in mountainous areas where precipitation, bedrock, forest, soil conditions and steep slopes can promote failure. South Korea has a large mountainous area, covering approximately 70% of the total land area of the country and consisting of granite or gneiss lithology; the area is also characterized by inclines vulnerable to landslides (Kim and Song 2015). Additionally, low soil strength due to weathering, unstable slopes and high rainfall during the rainy season greatly influence slope stability and landslide susceptibility. Landslide risk increases during the rainy season and with climate change-induced increases in rainfall frequency (Jeong et al. 2015; Kim and Song 2015).
Several methods are used to directly or indirectly minimize the impacts and losses attributable to landslides. The first step must be to identify zones vulnerable to landslides within a given area. The most common method is a geographic information system (GIS)-based approach to assess the environmental conditions where landslides have occurred to determine conditions that indicate risk. GIS-based approaches can incorporate a variety of methods; heuristic, statistical, probability and deterministic approaches were well documented by van Westen et al. (2006).
Each model has strengths and weaknesses; in general, their behavior depends on the characteristics of the study area where they are applied. Therefore, comparisons of various models for landslide susceptibility assessment are highly desired. This study analyzed landslide susceptibility using the EBF, LR and SVM models. Various factors affecting landslides, including topography, hydrology, soil and forest cover, were examined using the System for Automated Geoscientific Analyses (SAGA) GIS Module. Landslide events were accurately detected from digital aerial photographs to determine the locations of past landslides. The relative contributions of these factors were then evaluated using the EBF, LR and SVM models with landslide occurrences as training data, and the results of the models were validated using the remaining landslide occurrences.

Study area and materials
South Korea is located in East Asia, in the temperate monsoon zone. The climate is generally hot and humid, with abundant rainfall during the summer. The annual average precipitation ranges between 1,100 and 1,500 mm, with 70% of the annual average rainfall occurring during the summer (June-September). Debris flows and shallow landslides also commonly occur during the summer (Hong et al. 2017).
Yongin is a major city located in Gyeonggi Province, South Korea (Figure 1). The city comprises rural, urban and corporate communities. It is covered by hills, which account for 60% of the surface and mainly consist of Precambrian porphyroblastic and quartzofeldspathic gneisses. The city is expanding into the hills due to the creation of new towns and housing developments.
The study area suffered extensive landslide damage following heavy rain on 27 July 2011. The landslides began at around 13:00; precipitation intensity was 78 and 68 mm/h at 10:00 and 12:00, respectively, indicating high-intensity precipitation during that month. Within the study area, the cumulative rainfall until 13:00 on 27 July was 385 mm. Heavy rainfall caused shallow landslides, and subsequent debris flows collapsed houses and parts of buildings, resulting in the loss of life and property (Hong et al. 2017).
In this study, we interpreted landslides in the Yongin region using web-based aerial photographs and survey data and then confirmed these data in the field. To detect current locations of past landslides, we used digital aerial photography with a ground resolution of 0.5 m. From visual interpretations of photographs taken before and after the landslide, we detected 82 landslide locations in the study area. These were plotted on the aerial photographs, and DEM data were added for spatial processing (Figure 1). Topographic and hydrologic factors were constructed from DEMs using terrain analysis from the SAGA-GIS module (Table 1). The soil and forest factors were extracted from soil and forest maps.

Landslide inventory
Landslide inventory maps are necessary to assess the relationship between landslide distribution and predisposing factors. We constructed a landslide inventory map with 82 landslide locations that were identified by interpretation of aerial photographs supplied without ground control points (GCPs). The aerial photographs were freely obtained at portal sites such as http://map.daum.net/ and http://www.skymaps.co.kr/ in Korea. Photographs taken before and after landslide events (Figure 2a-c) were selected from each region of landslide occurrence, and five GCPs were applied to each photograph from digital topographic features using ArcMap 10.2. Figure 2d shows a landslide photographed during the field survey. The spatial distribution of landslides was determined using remote sensing (RS) and GIS spatial analysis methods; the results show that landslides were mainly distributed at elevations between 100 and 300 m.

Landslide-related environmental factors
Environmental factors are important in determining the locations of landslide susceptibility areas. We considered 18 conditioning factors related to landslide occurrence. The spatial distribution of these factors is typically difficult to determine. We were able to obtain digital topographic, soil, forest and geological maps. The scale of all maps was 1:5,000, except for one geological map, which had a scale of 1:25,000. Landslide-related topographic, hydrologic, soil, forest and geologic factors used in this study are listed in Table 1. Topographic and hydrologic factors were prepared using a DEM derived from digital topographic maps with 5 m contours using ArcMap 10.2. The locations of landslides (Figure 1) and environmental factors (supplement A) were denoted by 5 m × 5 m pixels; the study area had a total of 1,918,400 cells, with 1,760 columns and 1,090 rows.

Topographic and hydrologic factors
Topographic and hydrologic factors are related to the cause of landslides. In this study, we considered slope, plan curvature, aspect, convexity, topographic position index (TPI), terrain ruggedness index (TRI), mid-slope position (MSP) and landforms. The extracted hydrologic factors were slope length (SL), stream power index (SPI) and topographic wetness index (TWI). Slope indicates the hill's degree of steepness, and aspect is the steepest downhill direction of the slope. Plan curvature is the curvature measured perpendicular to the slope direction and affects the divergence and convergence of flow on the surface. Terrain surface convexity is the positive surface curvature and represents the percentage of convex-upward cells (Iwahashi and Pike 2007). TPI is the difference between the elevation of each cell and the mean elevation for a neighborhood of cells (Guisan et al. 1999). Positive values typically represent features higher than surrounding features, values near zero are flat areas and negative values represent lower features. TRI is computed from the differences in elevation between a cell and its neighboring cells, which are squared to make them positive. MSP assigns 0 to the mid-slope position and 1 to the maximum vertical distance from the mid-slope in the peak or valley direction. Landform classification was derived from ranges of TPI values (Weiss 2001). SL is based on the specific catchment area and slope, with the specific catchment area originally serving as a replacement for slope length. SPI represents the erosive forces of water flow (Moore et al. 1991). TWI describes the effect of topography on the location and size of saturated areas of runoff generation (Moore et al. 1991). In general, lower TPI values and higher SL and SPI values represent higher landslide susceptibility.
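The index definitions above can be illustrated with a minimal Python sketch. It is not the SAGA implementation used in the study: the 3 × 3 neighborhood for TPI, the sample elevation grid and the function names are all assumptions, and the SPI and TWI expressions are the commonly cited forms attributed to Moore et al. (1991), with a denoting specific catchment area and beta the slope angle.

```python
import math

def tpi(dem, row, col):
    """Cell elevation minus the mean elevation of its 8 neighbours.

    Assumes a 3 x 3 window and an interior cell; positive values mean the
    cell stands higher than its surroundings.
    """
    neighbours = [
        dem[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
        if (r, c) != (row, col)
    ]
    return dem[row][col] - sum(neighbours) / len(neighbours)

def spi(specific_catchment_area, slope_deg):
    """Stream power index in the common form a * tan(beta)."""
    return specific_catchment_area * math.tan(math.radians(slope_deg))

def twi(specific_catchment_area, slope_deg):
    """Topographic wetness index in the common form ln(a / tan(beta))."""
    return math.log(specific_catchment_area / math.tan(math.radians(slope_deg)))

# Hypothetical 3 x 3 elevation grid (metres), for illustration only.
dem = [
    [100.0, 101.0, 102.0],
    [101.0, 105.0, 103.0],
    [102.0, 103.0, 104.0],
]
print(tpi(dem, 1, 1))      # centre cell is higher than its neighbours
print(spi(10.0, 45.0))
print(twi(10.0, 45.0))
```

Flat cells (slope of 0) would make the TWI expression undefined, so a real workflow would need to handle them explicitly.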

Soil factors
The soil factors that we examined from soil maps for landslide susceptibility included soil thickness, land use, soil material and topography. Each factor was divided into several classes. Soil thickness was divided into the following four classes: very shallow (<20 cm), shallow (20-50 cm), moderate (50-100 cm) and deep (>100 cm). The maximum thickness and average unit weight (Kim et al. 2005) of the soil in the study area were 300 cm and 1.347 g/cm³, respectively. Land-use factors were grouped into grasses, forest, paddy, farm and orchard. Soil materials were classified into gneiss, river, acidic residuum and granite residuum. Topographic factors were classified based on the existing landscape type as mountainous, valley, hill, alluvial fan, piedmont slope, diluvium and fluvial plain areas.

Forest factors
Slope stability can be affected by vegetation and plant distribution, which influence the hydrological properties of hills such as water flow; this, in turn, is associated with triggering landslides (Schwarz et al. 2010). Plant root strength also greatly influences the occurrence of slope failure (Schmidt et al. 2001). Therefore, forest factors such as tree diameter, density and age are closely related to the strength of soil-root bonds. Thus, forests with high tree density have a high capacity to maintain water pressure under heavy rainfall. In general, older trees with larger diameters tend to have stronger roots, contributing to increased slope stability. Tree diameters were divided into the following four classes: very small (<6 cm), small (7-18 cm), medium (19-30 cm) and large (>30 cm). More than 50% of the trees in the study area were of medium diameter. Forest density was classified into the following three classes: sparse (<50% cover), moderate (51-70%) and dense (>71%). Tree ages were grouped into the following six classes: <10, 11-20, 21-30, 31-40, 41-50 and 51-60 years. Trees <10 years of age had very small diameters, and regions where this age class was dominant had low forest density. Trees 11-30 years of age were also small in diameter and dominated areas with mixed cover and moderate forest density.

Application of EBF, LR and SVM for landslide susceptibility mapping
Using the detected landslide locations and the factors influencing landslide occurrence, we applied data-driven probability, statistical and data-mining models. The landslide locations were divided into a training set and a validation set, each containing 50% of the total number of landslides, and the spatial relationships between an event site and each landslide-related factor were determined.
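The 50/50 partition described above can be sketched as a simple random split. This is an illustration only: the study does not state how the randomization was performed, so the seed, the function name `split_inventory` and the use of integer identifiers for the 82 landslides are assumptions.

```python
import random

def split_inventory(landslide_ids, train_fraction=0.5, seed=42):
    """Randomly split landslide identifiers into training and validation sets."""
    ids = list(landslide_ids)
    rng = random.Random(seed)  # fixed seed (assumed) so the split is repeatable
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# 82 landslides identified by hypothetical integer IDs 0..81.
training, validation = split_inventory(range(82))
print(len(training), len(validation))
```

Fixing the seed keeps the two sets of 41 landslides identical across runs, which makes the model comparison reproducible.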

EBF model
The EBF model used in this study was mainly based on the Dempster-Shafer theory of evidence algorithms (Shafer 1976; Dempster 2008). The Dempster-Shafer theory is a generalization of the Bayesian subjective probability theory, which relates to the effect of confidence in a problem on the probability of related problems. The flexibility of the EBF model, which is due to the acceptance of uncertainty and the incorporation of many sources of belief, is its greatest advantage. For this reason, it has already been applied in the construction of landslide susceptibility maps (Lee and Park 2013). The EBF model consists of the following four basic mathematical functions: Bel (degree of belief), Dis (degree of disbelief), Unc (degree of uncertainty) and Pls (degree of plausibility); values of these functions range from 0 to 1 (Althuwaynee …). For each factor type or range $n$, $\mathrm{Bel}_n$ is the lower degree of belief, $\mathrm{Dis}_n$ is the degree of disbelief and $\mathrm{Unc}_n$ is the degree of uncertainty. Table 2 shows the spatial relationships of landslide occurrence with landslide conditioning factors obtained using the EBF model. An integrated belief function map was produced from the sum of the EBF values calculated for each landslide conditioning factor. The resulting EBF landslide susceptibility index (LSI) map (Figure 3) was classified into four zones using the equal area classification method for ease of visual interpretation: low, medium, high and very high landslide susceptibility zones, covering 70, 15, 10 and 5% of the study area, respectively.
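The relationship among the four EBF functions can be sketched in a few lines of Python. The sketch relies only on the standard Dempster-Shafer identities (Bel + Unc + Dis = 1 and Pls = 1 - Dis) and on the summation of belief values mentioned above; it does not reproduce how Bel and Dis were estimated from the data. The function names and the Dis value of 0.1 are hypothetical; the Bel values are those reported in this paper for slope, plan curvature and aspect.

```python
def complete_ebf(bel, dis):
    """Derive Unc and Pls from Bel and Dis using Bel + Unc + Dis = 1."""
    if not (0.0 <= bel <= 1.0 and 0.0 <= dis <= 1.0 and bel + dis <= 1.0 + 1e-12):
        raise ValueError("Bel and Dis must lie in [0, 1] and sum to at most 1")
    unc = 1.0 - bel - dis   # ignorance left after belief and disbelief
    pls = 1.0 - dis         # plausibility = Bel + Unc
    return {"Bel": bel, "Dis": dis, "Unc": unc, "Pls": pls}

def integrated_belief(bel_values):
    """Sum the per-factor Bel values of one cell (the integration used here)."""
    return sum(bel_values)

# Bel = 0.546 is reported for slopes above 27.23; Dis = 0.1 is hypothetical.
print(complete_ebf(0.546, 0.1))

# Bel values reported for slope, concave plan curvature and southeast aspect.
print(integrated_belief([0.546, 0.477, 0.288]))
```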

LR model
In LR, one of the most widely used statistical analyses, the probability of a result is related to a set of potential predictive variables (Tu 1996). LR establishes a quantitative relationship between a dependent variable and one or more independent variables; the value of the dependent variable is transformed to the logarithm of its corresponding probability ratio and fit to the logistic curve. In landslide prediction, the goal of LR is to find the best-fitting model for the spatial relationship between a set of factors that influence landslides and the presence or absence of landslides; the predicted result is either landslide or non-landslide (Ayalew and Yamagishi 2005). The dependent variable is therefore a binary variable indicating the presence or absence of a landslide, and the relationship between landslide events and causal factors can be expressed as:

$$p = \frac{1}{1 + e^{-z}}$$

where $p$ is the estimated occurrence probability of a landslide and varies from 0 to 1 on an S-shaped curve, and $z$ is the weighted linear combination of the independent variables and varies from $-\infty$ to $+\infty$. $z$ can be expressed as a constant plus a weighted sum of the independent variables (Tien Bui et al. 2016):

$$z = \ln\left(\frac{p}{1-p}\right) = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$

where $p/(1-p)$ is the likelihood ratio, $a$ is the intercept of the model, $x_i$ represents the independent variables and $b_i$ denotes the respective coefficients. The intercept of the model in this study was found to be -30.867. The coefficient for each factor is presented in Table 2.
The model coefficients and probability of landslide occurrence (Equation 7) were calculated using the statistical analysis software SPSS. Through the LR model, the spatial relationships between each landslide-conditioning factor and landslide occurrence were assessed (Table 2). In this analysis, slope, convexity, TPI, TRI, MSP, SL, TWI, SPI and soil thickness were treated as continuous data, whereas plan curvature, slope aspect, landforms, land use, soil material, topography, tree diameter, forest density and tree age were treated as categorical data. The LR LSI map (Figure 4) was prepared in the same manner as the EBF LSI map.
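As a minimal sketch of the logistic relationship described above, the following Python function maps a weighted linear combination z of factor values to a probability between 0 and 1. Only the intercept (-30.867) comes from the text; the factor names and coefficient values are hypothetical placeholders and are not the values in Table 2, which were fitted in SPSS.

```python
import math

INTERCEPT = -30.867  # intercept reported in the text

def landslide_probability(factor_values, coefficients, intercept=INTERCEPT):
    """Return p = 1 / (1 + exp(-z)) with z = a + sum(b_i * x_i)."""
    z = intercept + sum(coefficients[name] * value
                        for name, value in factor_values.items())
    # Numerically stable evaluation of the logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

# Hypothetical coefficients and cell values, for illustration only.
coefficients = {"slope": 0.9, "spi": 1.5, "twi": -0.4}
cell = {"slope": 30.0, "spi": 2.0, "twi": 5.0}
print(landslide_probability(cell, coefficients))
```

A positive coefficient raises the estimated probability as the factor increases, and a negative coefficient lowers it, which is how the signs in Table 2 can be read.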

SVM model
An SVM is a supervised machine learning classification technique based on statistical learning theory and the structural risk minimization principle (Vapnik 1998). Using the landslide training dataset, the SVM maps the input data onto a high-dimensional feature space, and an optimal hyperplane with a maximum margin is determined to separate the two classes: landslide and non-landslide. Kernel functions convert the original data into linearly separable data in high-dimensional feature space (Kavzoglu and Colkesen 2009).
We used the Environment for Visualizing Images software (ENVI 5.0; Exelis Visual Information Solutions, Boulder, CO, USA) for SVM model application. The default kernel of the ENVI software is the radial basis function (RBF) kernel, which is known to provide more accurate prediction results in most classifications, especially in nonlinear environments. In landslide susceptibility mapping, the RBF kernel is a popular choice for SVM classification (Tien Bui et al. 2012); as applied in this study, this kernel was expressed as

$$K(X_i, X_j) = \exp\left(-\gamma \lVert X_i - X_j \rVert^2\right)$$

where $K(X_i, X_j)$ is the kernel function and $\gamma$ is a kernel parameter. The $\gamma$ value for landslide-related environmental factors was set to 1.000 in the ENVI software. The probability values from the SVM model range from 0 to 1 in ENVI, representing low and high LSIs, respectively. The SVM LSI map (Figure 5) was prepared in the same manner as the EBF LSI map.
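A minimal sketch of the RBF kernel in its standard form, exp(-gamma * ||x_i - x_j||^2), is given below with the kernel parameter (gamma) set to 1 as in this study. It illustrates only the kernel evaluation, not the ENVI classifier; the function name and the two sample feature vectors are hypothetical.

```python
import math

def rbf_kernel(x_i, x_j, gamma=1.0):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

# Two hypothetical, already-scaled feature vectors (e.g. slope, TWI).
cell_a = (0.2, 0.7)
cell_b = (0.5, 0.3)
print(rbf_kernel(cell_a, cell_a))  # identical inputs give similarity 1.0
print(rbf_kernel(cell_a, cell_b))
```

Because the kernel depends on squared distances, factors with large numeric ranges dominate unless the inputs are rescaled, which is one reason gamma usually needs tuning.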
A sensitivity analysis was conducted for each factor by excluding that factor in turn from the SVM LSI map of all factors. The normalized weight for each factor (Table 2) was derived from the resulting AUC values (Table 3) and has a range of 0 (low landslide susceptibility) to 1 (high susceptibility).
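One way to obtain weights between 0 and 1 from a leave-one-factor-out analysis is min-max scaling of the AUC values, sketched below. The exact normalization used in the study is not given in the text, so the linear formula here is an assumption; it is chosen because it assigns 1 to the factor whose exclusion gives the lowest AUC and 0 to the factor whose exclusion gives the highest. The soil thickness and plan curvature values are those reported in this paper; the slope value is hypothetical.

```python
def normalized_weights(auc_without_factor):
    """Min-max scale AUC values so the lowest AUC maps to weight 1."""
    lowest = min(auc_without_factor.values())
    highest = max(auc_without_factor.values())
    if highest == lowest:
        raise ValueError("all AUC values are equal; weights are undefined")
    return {
        factor: (highest - auc) / (highest - lowest)
        for factor, auc in auc_without_factor.items()
    }

auc_without_factor = {
    "soil thickness": 81.58,   # reported: lowest AUC when excluded
    "plan curvature": 84.63,   # reported: highest AUC when excluded
    "slope": 83.00,            # hypothetical value for illustration
}
print(normalized_weights(auc_without_factor))
```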

Landslide susceptibility map validation and comparison
A landslide susceptibility map should be able to effectively predict future landslide areas and should be validated by combining existing landslide location data with new locations as landslides occur. Therefore, we confirmed the results of the landslide susceptibility analysis using validation data. Landslide susceptibility maps produced by the three models were validated by comparing the susceptibility map with the training data and also by comparing the susceptibility map with the validation data. To achieve this, the 82 landslides were randomly separated into the following two data sets: 50% for modeling and 50% for validation.
The AUC was used to quantitatively compare the results of the three models (Figure 6); the model with the highest AUC was considered the best model. The AUC was obtained by comparing the validation data with the landslide susceptibility maps. Based on Figure 6, the LR model had the highest AUC value (0.9459), followed by EBF (0.9225) and SVM (0.8178). The LR model also had the highest training accuracy, of 94.59%, followed by the EBF (92.25%) and SVM (81.78%) models.
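The AUC comparison can be illustrated with a short Python sketch that computes the area under a receiver operating characteristic curve as the fraction of (landslide, non-landslide) cell pairs in which the landslide cell receives the higher susceptibility score. Whether the study used this ROC formulation or cumulative success-rate and prediction-rate curves is not stated, so this formulation is an assumption, and the scores below are hypothetical.

```python
def auc(landslide_scores, non_landslide_scores):
    """Pairwise (rank-based) AUC; ties count as half a correct ordering."""
    correct = 0.0
    for pos in landslide_scores:
        for neg in non_landslide_scores:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(landslide_scores) * len(non_landslide_scores))

# Hypothetical susceptibility index values at validation cells.
landslide_scores = [0.9, 0.8, 0.4]
non_landslide_scores = [0.7, 0.3, 0.2, 0.1]
print(auc(landslide_scores, non_landslide_scores))
```

An AUC of 0.5 corresponds to a map that ranks cells no better than chance, and 1.0 to a map that ranks every landslide cell above every non-landslide cell.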

Results and discussion
The AUC of the LR model was higher than that of the EBF and SVM models by about 2.3 and 12.8 percentage points, respectively. The AUC of the EBF model was higher than that of the SVM model by about 10.5 percentage points. Therefore, the LR and EBF models were less sensitive to the training data than the SVM for the study area.
The LR is designed to find the best conditional probability for the data (Kavzoglu et al. 2014). Therefore, the coefficient for each factor can be calculated to optimize probabilistic interpretation. Factors with positive relationships with landslide susceptibility were slope, convexity and SPI; negative factors included TPI, TRI, MSP, SL and TWI.
In the EBF model, belief (Bel) was considered the function most representative of the correlation between landslide occurrence and each landslide conditioning factor. At a slope of >27.23°, the Bel value was 0.546, indicating a very high probability of landslide occurrence. Plan curvature was characterized by a high Bel value for concave shapes (0.477). High Bel values for aspect were related to southeast- and east-facing slopes (0.288 and 0.238, respectively), indicating that these categories have a positive spatial association with landslide occurrence. The probability of landslide occurrence was highest at convexity values between 34.87 and 36.23 (7th class), TPI values between 3.58 and 7.62 (9th class), TRI values between 1.74 and 7.74 (6th class), MSP values between 0.14 and 0.27 (2nd class), SL values between 5.54 and 552.80 (6th and 7th classes), TWI values between 4.88 and 5.34 (2nd class) and SPI values between 1.10 and 2.31 (5th and 6th classes). The remaining factors had highest probabilities as follows: landforms, 4th class; soil thickness, 3rd class (50-100 cm); land use, grasses; soil material, gneiss residuum; topography, hilly area; tree diameter, medium; forest density, dense; and tree age, 4th class (31-40 years).
The sensitivity analysis of each factor required calculation of the AUC ratio of the SVM LSI maps, which excluded each factor in turn from the SVM LSI map of all factors. The AUC ratio was lowest at 81.58% when soil thickness was excluded and highest at 84.63% when plan curvature was excluded (Table 3). Thus, soil thickness had the greatest effect (normalized weight: 1) in the SVM model. In contrast, plan curvature had the smallest effect (normalized weight: 0).
In landslide susceptibility modeling, the main advantage of the LR model is that its linear classification allows the use of binary dependent variables for a range of independent variable types (i.e., scale, nominal and ordinal). The EBF model allows susceptibility zone mapping and uncertainty modeling. Furthermore, the quantitative relationships between landslide occurrence and the classes of each factor are clearly presented as a series of mass functions: Bel, disbelief (Dis) and uncertainty (Unc). Although the SVM model confers the benefit of modeling nonlinear decision boundaries through kernels, the application of the RBF kernel and a gamma value of 1 produced lower performance in this study than the LR and EBF models, which have fewer parameters for fine-tuning. The SVM model requires that the kernel function and gamma parameter be optimized. The training of the LR and EBF models is typically more efficient than that of the SVM model, especially when large amounts of data are used (Kalantar et al. 2018). Nevertheless, all models exhibited reasonably good accuracy in landslide susceptibility mapping in the current study. Therefore, any of these models may be applied to spatial prediction of landslide risk within the study area.

Conclusions
Landslide inventory maps were constructed using high-resolution digital aerial photography data due to the difficulty of distinguishing landslides of similar shapes in the study area using satellite images or panchromatic photography. By contrast, landslides were easily visually interpreted using high-resolution aerial photographs taken during a high-vegetation season. The use of aerial photography could also save the time and cost associated with field surveys to identify damage caused by natural disasters.
During the rainy season, especially in Korea, debris flows occur in many high-slope areas due to intensive daily rainfall. Therefore, it is necessary to analyze landslide susceptibility using the relationships between factors related to landslide events and the locations of previous landslides. In this study, 18 factors were selected to characterize topography, hydrology and soil and forest effects to analyze landslide risk using EBF, LR and SVM models. With these three models, landslide susceptibility maps of the Yongin region were produced and areas were classified into low, medium, high and very high landslide susceptibility zones. Validation of the landslide susceptibility results showed that the LR and EBF models had higher success rates and training accuracy than the SVM model. The landslide susceptibility map produced by the LR model performed best in landslide susceptibility mapping in this study.
The factors affecting landslides and their ranges (or classes) were used to produce landslide susceptibility maps from the three models; such maps provide useful information for decision makers and land-use planners in urban regions. The results obtained in this study contribute to the geohazard mapping and land-use planning literature.

Disclosure statement
No potential conflict of interest was reported by the authors.