Landslide susceptibility assessment in the Nantian area of China: a comparison of frequency ratio model and support vector machine

Abstract Accurate and efficient landslide susceptibility assessment (LSA) with an appropriate model is important for landslide prediction and prevention. This article compares the frequency ratio (FR) model with the support vector machine (SVM) for mapping the landslide susceptibility of the Nantian area in the southeastern hilly area of China. First, 70 recorded landslides are identified through field investigation and the records of the land and resources department; 50% of the landslide grid cells are used to train the models and the remaining 50% are used to test them. Ten environmental factors are used in the modeling of LSA: elevation, slope, aspect, plan curvature, profile curvature, relief amplitude, lithology, distance to river, Normalized Difference Built-up Index (NDBI) and Normalized Difference Vegetation Index (NDVI). Then the landslide susceptibility maps of the Nantian area are produced by the FR and SVM models, respectively. Finally, the accuracies and efficiencies of the two models are evaluated and compared. The results show that both models capture the landslide susceptibility distribution characteristics of the Nantian area well, and that the FR model has a higher prediction rate and is considerably more efficient than the SVM for LSA.


Introduction
Landslides are destructive geological disasters that frequently cause serious problems around the world, so it is important to predict the regions where they are most likely to occur. Landslide susceptibility assessment (LSA) is considered an important and useful tool for predicting the spatial probability of landslide occurrence under given geo-environmental conditions. LSA is therefore valuable for landslide disaster risk reduction (Srivastava et al. 2010; Mahalingam et al. 2016; Persichillo et al. 2016).
LSA is generally carried out in a Geographical Information System (GIS), which supports the collection, storage, manipulation, analysis, and display of large spatial data sets (Ayalew and Yamagishi 2005; Akgun et al. 2012; Miller and Degg 2012; Pourghasemi et al. 2013). The LSA process includes preparing data sources, analyzing landslide-related environmental factors, and modeling to calculate the landslide susceptibility index (LSI). An accurate and efficient LSI calculation model is very important for producing the landslide susceptibility map (LSM). In the past decades, many models have been proposed for LSA, and they can be broadly divided into qualitative and quantitative models (Devkota et al. 2013; Wu et al. 2016).
The qualitative models depend on the experience and knowledge of the researchers. They mainly include the landslide inventory model and the heuristic model (Ruff and Czurda 2008; Pourghasemi et al. 2012; Zhu et al. 2014; Althuwaynee et al. 2014a). The main limitations of these models are their subjectivity and relatively low prediction accuracy (Bui et al. 2012). In contrast, the quantitative models are built on mathematical modeling of the correlations between recorded landslides and related environmental factors. There are two main types of quantitative models: deterministic models and data-driven models (Marjanović et al. 2011; Bui et al. 2012). The deterministic models are not appropriate for a large area, because sufficient and detailed hydrological and soil mechanical parameters are difficult to obtain over a large area. The data-driven models are more effective and popular than the other models for LSA in a large area (Godt et al. 2008; Sorbino et al. 2010). Many data-driven models have been reported in past decades, for example, the frequency ratio (FR) model (Vijith and Madhu 2008; Wang et al. 2016), weight-of-evidence model (Lee and Choi 2004; Pradhan et al. 2010), information value model (Sharma et al. 2015), logistic regression model (Ayalew and Yamagishi 2005; Nourani et al. 2014; Althuwaynee et al. 2014b), artificial neural network (ANN) (Choi et al. 2012; Dou et al. 2015; Bui et al. 2016; Huang et al. 2017b), support vector machine (SVM) (Marjanović et al. 2011; Kavzoglu et al. 2014; Huang et al. 2016a, 2017a), fuzzy logic (Pourghasemi et al. 2016) and decision tree (Park and Lee 2014; Althuwaynee et al. 2014b).
These data-driven models have their inherent advantages and disadvantages. Hence, it is meaningful to select an accurate and efficient data-driven model for LSA in a large area. Two typical data-driven models, the SVM and FR models, are selected as examples in this study. The literature shows that the SVM model has many advantages, including nonlinear modeling, excellent generalization performance and a global optimum, and it has gained special attention in LSA (Yao et al. 2008; Kavzoglu et al. 2014). However, the SVM model suffers from difficulties in selecting model parameters and determining reasonable non-landslide samples, and from low modeling efficiency (Huang et al. 2016b, 2017c). On the other hand, the FR model can be readily understood and has been widely used for LSA (Vijith and Madhu 2008; Li et al. 2016; Wang et al. 2016). The FR model does not require the selection of model parameters or the determination of reasonable non-landslide samples. It has been compared with many other models (Pradhan and Lee 2010a; Reis et al. 2012; Park et al. 2013; Ramesh and Anbazhagan 2015; Vakhshoori and Zare 2016; Wu et al. 2016; Chen et al. 2016b).
Unfortunately, limited attention has been paid to comparing the accuracies and efficiencies of the FR and SVM models. Hence, this study aims to compare these two typical data-driven models for LSA.
The Nantian area, in the southeastern hilly area of China, frequently suffers landslides due to rainstorms and unreasonable anthropogenic activities. Landslides have caused considerable economic losses and casualties in the Nantian area in the past decades. No LSA has yet been carried out in this area, so such an assessment is needed. The accuracies and efficiencies of the LSMs of the Nantian area produced by the FR and SVM models are also assessed and compared.

Methods
The main objective of this study is to carry out LSA for the Nantian area using the FR and SVM models, respectively. First, the landslide inventory and other data sources are obtained, and ten landslide-related environmental factors are extracted from these data sources. Second, the FR values and correlation coefficients of these environmental factors are calculated. Third, the FR and SVM models are used to calculate the LSI on the basis of the landslide inventory, the selected environmental factors and other related data, and the LSMs of the Nantian area are produced in GIS software. Finally, the prediction efficiencies and accuracies of the two models are assessed and compared.

Frequency ratio model
The FR is the ratio of the area where landslides have occurred to the total area, and is also the ratio of landslide occurrence probability to the non-occurrence probability for a given attribute (Vijith and Madhu 2008; Reis et al. 2012; Chen et al. 2016b). Suppose that $L$ and $F$ represent the recorded landslide grid cells and a landslide-related environmental factor, respectively, and that $F$ is subdivided into classes $F_i$. The FR of the class $F_i$ is expressed as Equation (1):

$$FR_i = \frac{PL_i}{PF_i} \qquad (1)$$

where $PL_i$ is the percentage of landslides in the $F_i$ area and $PF_i$ is the percentage of the $F_i$ area. A value of $FR_i = 1$ is the average value; a value greater than 1 means a higher correlation between landslides and the environmental factor class, and a value smaller than 1 means a lower correlation (Pradhan and Lee 2010a; Park et al. 2013; Vakhshoori and Zare 2016).

Suppose there are $m$ landslide-related environmental factors $F^{(j)}$ $(j = 1, 2, 3, \ldots, m)$. The frequency ratios $FR^{(j)}_i$ $(i = 1, 2, 3, \ldots, n;\ j = 1, 2, 3, \ldots, m)$ of the different classes of each factor are calculated with Equation (1). If the class of $F^{(j)}$ at a certain location is $F^{(j)}_i$, the frequency ratio of this environmental factor at the location, $FR^{(j)}$, is $FR^{(j)}_i$. Hence, the LSI at this location is the sum of the frequency ratio values of all the landslide-related environmental factors at this location, as shown in Equation (2) (Pradhan and Lee 2010b; Wang et al. 2016):

$$LSI = \sum_{j=1}^{m} FR^{(j)} \qquad (2)$$

The LSI represents the relative probability of landslide occurrence: the greater the LSI value, the higher the probability of landslide occurrence, and the smaller the value, the lower the probability.
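As a minimal sketch of the frequency ratio and LSI calculations described above, the following Python snippet computes the FR of each class from per-cell class labels and landslide flags, then sums the FR values over factors for one cell. The function names, the factor names and the ten-cell toy grid are hypothetical and are not taken from Table 1.

```python
from collections import Counter

def frequency_ratios(cell_classes, landslide_flags):
    """FR per class: (% of landslide cells in the class) / (% of all cells in the class)."""
    total_cells = len(cell_classes)
    total_slides = sum(landslide_flags)
    area = Counter(cell_classes)
    slides = Counter(c for c, hit in zip(cell_classes, landslide_flags) if hit)
    return {c: (slides[c] / total_slides) / (area[c] / total_cells) for c in area}

def lsi(cell, fr_tables):
    """Sum the FR of the class that each factor takes at this cell."""
    return sum(fr_tables[factor][cls] for factor, cls in cell.items())

# Toy grid of 10 cells (hypothetical data, not from the study).
slope = ["gentle"] * 5 + ["steep"] * 5
litho = ["A", "B"] * 5
landslide = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

fr_tables = {
    "slope": frequency_ratios(slope, landslide),
    "litho": frequency_ratios(litho, landslide),
}
example_lsi = lsi({"slope": "steep", "litho": "A"}, fr_tables)
print(fr_tables["slope"], example_lsi)
```

In this toy case the steep class holds 75% of the landslide cells on 50% of the area, so its FR exceeds 1, while both lithology classes sit at the average value of 1.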

Support vector machine
SVM (Cortes and Vapnik 1995) is a typical model for multi-variable classification and assessment. SVM addresses the problem of estimating a function from a given data set $\{X_i, F(X_i)\}_{i=1}^{n}$, where $X_i$ is the input vector, $F(X_i)$ is the desired value, and $n$ denotes the total number of data patterns. The regression function is approximated as:

$$f(X) = \omega \cdot \phi(X) + b$$

where $b$ is a scalar threshold, $\omega$ is the weight vector, and $\phi(X)$ is the high-dimensional feature space obtained from the input space $X$ by nonlinear mapping. To prevent over-fitting and improve forecast accuracy, a regularized functional that sums the empirical risk and a complexity term $\|\omega\|^2/2$ is minimized. The coefficients $\omega$ and $b$ are estimated by minimizing this regularized risk function, which transforms the regression problem into the constrained form:

$$\min \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

$$\text{s.t.} \quad F(X_i) - \omega \cdot \phi(X_i) - b \le \varepsilon + \xi_i, \quad \omega \cdot \phi(X_i) + b - F(X_i) \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$

where the constant $C$ stands for the penalty degree of the samples with error exceeding $\varepsilon$, and the two positive slack variables $\xi_i$ and $\xi_i^*$ represent the distance from the actual values to the corresponding boundary values of the $\varepsilon$-tube. A dual problem can then be derived by using the optimization method to maximize the function:

$$\max \; -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(X_i, X_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} F(X_i)(\alpha_i - \alpha_i^*)$$

$$\text{s.t.} \quad \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C$$

where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers. The SVM model for function fitting obtained from this maximization is then given as Equation (8):

$$f(X) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(X_i, X) + b \qquad (8)$$

The radial basis function $K(X_i, X_j) = \phi(X_i) \cdot \phi(X_j)$ is used as the kernel function of SVM in this study (Feizizadeh et al. 2017).
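The fitted regression function can be illustrated with a short Python sketch that evaluates a radial basis kernel and the kernel expansion of the trained model. The sketch assumes the common Gaussian form K(x, z) = exp(-gamma * ||x - z||^2) for the radial basis kernel. The function names, support vectors, dual coefficients and threshold are made-up illustrative values, not results of this study.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian radial basis kernel exp(-gamma * ||x - z||^2) (assumed form)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Kernel expansion f(x) = sum_i (alpha_i - alpha_i*) K(X_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i* for support vector i.
    """
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + b

# Hypothetical trained model with two support vectors in a 2-D input space.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
dual_coefs = [0.5, -0.5]
b = 0.5
gamma = 0.16

prediction = svr_predict((0.0, 0.0), support_vectors, dual_coefs, b, gamma)
print(prediction)
```

Only the support vectors (samples with a non-zero coefficient difference) contribute to the sum, which is why the trained model can be stored compactly.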

Introduction of Nantian area
The Nantian area is located in the southeastern hilly area of Zhejiang Province, China (Su et al. 2015), as shown in Figure 1. It covers about 912 km² with elevation ranging from 136 m to 1421 m. The main geomorphic types of the Nantian area are hills and mountains. The study area is affected by frequent geological structure activities and intensive rock weathering. It lies in the subtropical maritime monsoon climate zone, so sunshine and rainfall are abundant. The Nantian area is heavily affected by typhoons in spring and summer every year, which trigger frequent heavy rainfall events. Many landslides have occurred in this area due to its varied topography, complex geological structure, and complicated climate.

Landslide inventory of Nantian area
An accurate and reliable landslide inventory is very important for LSA (Zhu et al. 2014). According to field investigation and the local land and resources departments, a total of 70 landslides had been identified in the study area by 2015 (Figure 1). These landslides are mainly shallow soil landslides. Their main features are small scale, high frequency, occurrence in groups and wide distribution. These shallow soil landslides are sensitive to heavy rainfall, unreasonable human engineering activities, the river network and other environmental factors. The total area of landslides in the study area is about 16.8 × 10⁴ m². The average landslide area is about 2,400 m², ranging from 600 m² to 2 × 10⁴ m², and the depth of the sliding mass ranges from 1 m to 6 m.

Landslide-related environmental factors
The landslide inventory map, Landsat TM 8 image (July 03, 2013, path/row 119/41), Digital Elevation Model (DEM) and geologic maps are used as data sources of environmental factors. Ten landslide-related environmental factors are extracted from these data sources for LSA. These environmental factors can be classified into four types (Huang et al. 2017c): the topographic and geomorphologic factors (elevation, slope, aspect, profile curvature, plan curvature, relief amplitude); the land cover factors (Normalized Difference Built-up Index (NDBI), Normalized Difference Vegetation Index (NDVI)); the hydrological environmental factor (distance to river); and the lithology factor (lithology). The landslide inventory map and these environmental factors are processed, analyzed, and stored in the GIS software. Meanwhile, they are

Frequency ratio and correlation analysis of these environmental factors
The FR model is a simple tool to calculate the probabilistic relationships between recorded landslides and environmental factors. In this study, each continuous environmental factor (input variable) is first divided into several subclasses. Then the frequency ratios of the subclasses of the input variable are calculated as shown in Table 1.
It is necessary to calculate the correlation coefficients between the ten environmental factors after the frequency ratio analysis. The results show that these correlation
3.4.1.2. Slope. The slope is a very important factor for assessing slope stability and landslide susceptibility (Jebur et al. 2014). The slope in the Nantian area ranges from 0° to 72.53° as shown in Figure 2(b). The slope is classified into eight subclasses (0-7.68°, 7.68-15.36°, 15.36-22.19°, 22.19-28.45°, 28.45-34.70°, 34.70-40.96°, 40.96-48.36°, 48.36-72.54°) (Nandi and Shakoor 2010). Table 1 shows that about 72.33% of the study area has a slope ranging from 15.36° to 48.36°, suggesting that the Nantian area is mountainous. Meanwhile, the FR values in Table 1 show that landslides are more likely to occur on slopes between 7.68° and 34.70°.
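Assigning each grid cell to one of the eight slope subclasses can be sketched with a binary search over the class boundaries listed above. The function name is hypothetical, and the rule that a value lying exactly on a boundary falls into the higher class is an assumption, since the text does not say how boundary values are handled.

```python
from bisect import bisect_right

# Upper bounds (degrees) of slope classes 1-7; class 8 covers everything above.
SLOPE_BREAKS = [7.68, 15.36, 22.19, 28.45, 34.70, 40.96, 48.36]

def slope_class(slope_deg):
    """Return the 1-based slope subclass index (1..8) for a slope in degrees."""
    return bisect_right(SLOPE_BREAKS, slope_deg) + 1

for value in (0.0, 10.0, 30.0, 72.53):
    print(value, slope_class(value))
```

The same pattern applies to any continuous factor once its class boundaries are fixed.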
3.4.1.3. Aspect. Aspect is also considered an important factor for LSA, because aspect-related factors such as exposure to sunlight, degree of saturation, and discontinuities can affect landslide occurrence (Yalcin et al. 2011). The aspect map is divided into nine subclasses as shown in Figure 2(c) (Tsangaratos and Benardos 2014). The frequency ratios of landslide grid cells with aspects of southeast, south, southwest and west are >1, indicating that grid cells with these aspects have higher probabilities of landslide occurrence than those with other aspects (Table 1).
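A nine-class aspect scheme is commonly built from one flat class plus eight 45° compass sectors. The sketch below assumes that layout and the convention, used by some GIS packages, that flat cells carry a negative aspect value. The text does not list the nine subclasses, so both the class names and the sector limits are assumptions.

```python
ASPECT_NAMES = [
    "north", "northeast", "east", "southeast",
    "south", "southwest", "west", "northwest",
]

def aspect_class(aspect_deg):
    """Map an aspect in degrees (clockwise from north) to one of nine classes.

    Negative values are treated as flat cells (assumed convention).
    """
    if aspect_deg < 0:
        return "flat"
    # Shift by half a sector so that north spans 337.5-22.5 degrees.
    sector = int(((aspect_deg % 360) + 22.5) // 45) % 8
    return ASPECT_NAMES[sector]

for value in (-1, 0, 135, 180, 350):
    print(value, aspect_class(value))
```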

Profile curvature.
Profile curvature can be calculated as the slope of the slope in GIS software, describing the change rate of slope (He et al. 2012). The surface can be determined as upwardly concave when the profile curvature value is relatively large, and can be determined as upwardly convex when the profile curvature value is relatively low. The profile curvature values in the Nantian area shown in Figure 2(d) fall in the range from 0 to 47.14. For the range of (4.07-10.91), the corresponding frequency ratios shown in Table 1 are greater than 1. Hence, landslides are more likely to occur in the grid cells with profile curvature values ranging from 4.07 to 10.91.
3.4.1.5. Plan curvature. The plan curvature describes the change of slope in the aspect direction and affects landslide occurrence (Atkinson and Massari 2011). The surface can be regarded as upwardly convex when the plan curvature value is relatively large, and as upwardly concave when the value is relatively small. The plan curvature values in the Nantian area range from 0 to 81.47 as shown in Figure 2(e). Table 1 shows that about 80.30% of the landslide grid cells occur in areas with plan curvature ranging from 0 to 50.48.
3.4.1.6. Relief amplitude. Relief amplitude mainly describes the topographic characteristics of the study area. It is obtained by calculating the difference of elevation within a given area (Bui et al. 2012). The larger the relief amplitude, the steeper the slope. Figure 2(f) shows that the relief amplitude of the Nantian area ranges from 0 to 608.78 m. Table 1 shows that the frequency ratios of relief amplitude classes ranging from 93.11 m to 250.67 m are greater than 1.
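One way to compute the difference of elevation in a given area is a moving window over the DEM grid, taking the maximum minus the minimum elevation in each window. The sketch below assumes a square window clipped at the grid edges. The window radius, the function name and the 3 × 3 toy DEM are hypothetical, as the text does not state the neighbourhood size used.

```python
def relief_amplitude(dem, radius=1):
    """Max minus min elevation in a (2*radius+1)-cell square window around each cell."""
    rows, cols = len(dem), len(dem[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Collect the window values, clipping the window at the grid edges.
            window = [
                dem[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = max(window) - min(window)
    return out

# Toy DEM in metres (hypothetical values).
dem = [
    [100, 110, 120],
    [105, 150, 125],
    [110, 115, 130],
]
relief = relief_amplitude(dem)
print(relief)
```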

Land cover factors
The normalized difference built-up index (NDBI) and the normalized difference vegetation index (NDVI) are adopted to reflect the land cover characteristics of the Nantian area.
NDBI quantitatively reflects the distribution of buildings, and NDVI gives a quantitative estimate of vegetation growth and biomass in the study area. The land cover factors can affect the hydrological and soil mechanical characteristics of slope materials, and hence the probability of landslide occurrence (Marjanović et al. 2011). The NDBI and NDVI are derived from the Landsat TM 8 image as shown in Figure 3. Their resolutions are both 30 m and their values range between 0 and 225. It can be seen from Table 1 that the frequency ratios of landslide occurrences are greater than 1 when the NDBI values are greater than 113, and when the NDBI ranges from 70 to 192.
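Both indices are normalized band differences. The sketch below uses the standard definitions, NDVI = (NIR - Red) / (NIR + Red) and NDBI = (SWIR - NIR) / (SWIR + NIR), which yield values between -1 and 1. The text does not give the band formulas or say how the indices were rescaled to the reported value range, so the definitions, function names and reflectance values here are assumptions for illustration.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when both bands are zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total

def ndvi(nir, red):
    """Vegetation index from near-infrared and red reflectance."""
    return normalized_difference(nir, red)

def ndbi(swir, nir):
    """Built-up index from shortwave-infrared and near-infrared reflectance."""
    return normalized_difference(swir, nir)

# Hypothetical reflectances for a vegetated pixel.
print(ndvi(nir=0.5, red=0.1), ndbi(swir=0.3, nir=0.5))
```

Dense vegetation gives a high NDVI and a negative NDBI, while built-up surfaces tend toward the opposite pattern.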
3.4.3. Hydrological environmental factor and lithology factor
3.4.3.1. Distance to river. Rivers and their branches are widely distributed in the Nantian area. The river is considered a basic factor for landslide occurrence since it abrades the slope base and saturates the underwater section of the material forming the slope. Grid cells near the river network are likely to have higher water content than grid cells far from the river (He et al. 2012). The distance to river is obtained through buffer analysis in the GIS software and is classified into four subclasses as shown in Figure 4. Table 1 shows that the frequency ratios of the landslide grid cells within 300 m of the river network are >1.
3.4.3.2. Lithology factor. The lithology affects the probability of landslide occurrence by affecting the shear strength and permeability of rocks and soils (Van Westen et al. 2008). Three main types of lithological units are distributed in the Nantian area: volcanic sedimentary rock formation; pyroclastic rock and volcanic lava rock formation; and intrusive rock and sub-volcanic rock formation (Figure 4). Different lithological units have different frequency ratios as shown in Table 1. A total of 62.63% of the landslide grid cells occur in the area with volcanic sedimentary rock formation, which has a frequency ratio of 1.53.

LSA of Nantian area using FR model
The relationships between landslides and environmental factors are determined as shown in Table 1. Then the LSI values of the Nantian area are calculated according to Equation (2). In this study, the LSI values of the Nantian area range from 3.611 to 16.832, and a higher LSI suggests a higher susceptibility to landslide occurrence. Finally, the natural breaks method is used to classify the calculated LSI values into five levels: very high (9.618%), high (20.183%), moderate (29.333%), low (27.472%), and very low landslide susceptibility (13.393%).
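The natural breaks method is usually understood as the Fisher-Jenks optimization, which chooses class boundaries that minimize the sum of squared deviations within classes. The sketch below is a small dynamic-programming version of that idea for illustration. The function names and sample values are hypothetical, and the GIS software used in the study may implement the method differently.

```python
from bisect import bisect_left

def natural_breaks(values, k):
    """Upper bound of each of k classes minimizing within-class squared deviation."""
    data = sorted(values)
    n = len(data)
    # Prefix sums give the squared deviation of data[i:j] in constant time.
    s1, s2 = [0.0], [0.0]
    for v in data:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting the first j values into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    cut[c][j] = i
    # Walk back through the stored cut positions to recover the class bounds.
    bounds, j = [], n
    for c in range(k, 0, -1):
        bounds.append(data[j - 1])
        j = cut[c][j]
    return bounds[::-1]

def classify(value, upper_bounds):
    """0-based class index of a value (0 = lowest class)."""
    return min(bisect_left(upper_bounds, value), len(upper_bounds) - 1)

sample = [1, 2, 3, 10, 11, 12, 20, 21, 22]
breaks = natural_breaks(sample, 3)
print(breaks, classify(11, breaks))
```

This version runs in O(k n²) time, so for a full raster it would normally be applied to a sample of the LSI values rather than to every grid cell.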
The final LSM produced using the FR model is shown in Figure 5(a). It can be seen from Table 2 that the high and very high susceptible areas cover 29.801% of the whole study area, while 76.137% of the landslide grid cells are located in these areas. The frequency ratios of the very low, low, moderate, high and very high landslide susceptibility levels increase significantly from 0.085 to 4.844. The sum of the frequency ratios of the high and very high susceptible levels accounts for about 87.880% of the total frequency ratios. Hence, Table 2 and Figure 5 indicate that the LSM produced by the FR model is accurate and reliable.

LSA of Nantian area using SVM model
The SVM model is also used in this study to do LSA for comparison. First, all the landslide-related environmental factors are normalized into [0, 1] as input data of the SVM, and the LSI values (1 for recorded landslide grid cells, 0 for non-landslide grid cells) are taken as output data of the SVM. Second, the selected 792 non-landslide grid cells and 792 recorded landslide grid cells are divided into training data sets (50% of the non-landslide and landslide grid cells) and test data sets (the remaining 50% of the grid cells) (Bui et al. 2012). The SVM and FR models are both verified on the test data sets. Based on cross-validation, the optimum values of the parameters $C$, $\varepsilon$ and $\gamma$ of the SVM model are determined as 4, 0.09, and 0.16, respectively. Finally, the LSI values are calculated by the fully trained SVM model and are then classified into five levels using the natural breaks method: very high (10.634%), high (23.149%), moderate (28.943%), low (24.918%), and very low (12.357%).
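The data preparation steps above, scaling each factor into [0, 1] and splitting the samples in half, can be sketched as follows. Min-max scaling and a seeded random shuffle are assumptions, since the text does not state the normalization formula or how the halves were drawn. The function names are hypothetical.

```python
import random

def min_max_normalize(column):
    """Scale a list of values linearly into [0, 1]; a constant column maps to 0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def split_half(samples, seed=0):
    """Shuffle the samples reproducibly and return (training half, test half)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Elevation extremes reported for the study area plus their midpoint.
scaled = min_max_normalize([136.0, 1421.0, 778.5])

# Indices standing in for the 792 landslide grid cells.
train_ids, test_ids = split_half(range(792))
print(scaled, len(train_ids), len(test_ids))
```

Applying the same split separately to the landslide and non-landslide cells keeps both classes balanced in the training and test sets.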
The produced LSM is shown in Figure 5(b) and the frequency ratios are shown in Table 2. It can be seen from Table 2 that the frequency ratios of the very high, high, moderate, low and very low susceptibility levels are 3.681, 1.653, 0.554, 0.238 and 0.051, respectively. About 39.141% of the landslide grid cells fall into areas with very high LSI values, which cover about 10.634% of the whole Nantian area. The sum of frequency ratios of high and very high susceptible levels accounts for about 86.353% of the total frequency ratios. Hence, the results also show that the LSM produced by SVM model is accurate and reasonable.
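As a worked check of how the frequency ratio of a susceptibility level follows from Equation (1), the very high level of the SVM map contains 39.141% of the landslide grid cells on 10.634% of the area, which reproduces the value reported above:

```latex
FR_{\text{very high}} = \frac{PL_{\text{very high}}}{PF_{\text{very high}}}
                      = \frac{39.141\%}{10.634\%} \approx 3.681
```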

Distribution characteristics of Landslide susceptibility in Nantian area
Two accurate LSMs produced by FR and SVM models are shown in Figure 5(a, b), respectively. It can be seen from Figure 5 and Table 2 that, the distribution characteristics of LSMs produced by the two models are similar on the whole.
The high and very high susceptibility indexes are mainly distributed in areas with elevations ranging from 276.951 m to 695.363 m, slopes ranging from 7.68° to 34.70°, aspects of southeast, south, southwest and west, complex terrain, accumulations of volcanic sedimentary rocks, short distances to rivers, and heavy human engineering activities. There are several reasons for this distribution. First, the Quaternary accumulation layer forms relatively easily in areas with elevations ranging from 276.951 m to 695.363 m and slopes ranging from 7.68° to 34.70°. Second, complex terrain is an important factor inducing slope instability. Third, the mechanical strength of the volcanic sedimentary rocks is relatively low, which is not conducive to slope stability. Fourth, slopes close to the river network are more vulnerable to runoff washing and erosion. Meanwhile, the NDBI map of the Nantian area shows that the recorded landslides are more likely to occur in areas with more human engineering activities.
The low and very low susceptibility indexes are mainly distributed in areas with elevation greater than 700 m or smaller than 276.951 m, a very small slope or a slope >40°, aspects of north, northeast and east, lithological units of pyroclastic rock, volcanic lava rock, intrusive rock and sub-volcanic rock, long distances from rivers, and few human engineering activities. One reason is that there is little thick colluvium in regions with very high elevation or slope >40°. In addition, the lithological units in these areas have a relatively higher shear strength than the volcanic sedimentary rock, and slopes far from rivers are less exposed to runoff washing and erosion. Moreover, the natural landscape is well preserved where human engineering activities are few, which contributes to slope stability.

Prediction accuracies of FR and SVM models
The prediction rate curves of the FR and SVM models are adopted to assess the fitting degree between the calculated LSI values and the testing datasets (Chung and Fabbri 2008). First, the LSI values of all the grid cells are sorted in descending order and then divided into 20 equally sized intervals in the GIS software. Second, the percentage of landslide grid cells of the testing dataset in each equally sized interval is calculated to evaluate the prediction rates of the FR and SVM models as shown in Figure 6 (Huang et al. 2017c). It can be seen from Figure 6
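The two steps used to build a prediction rate curve can be sketched in Python as follows: sort the grid cells by descending LSI, cut them into equally sized intervals, and accumulate the share of test landslide cells captured up to each interval. Reporting the cumulative percentage is an assumption about how the curve is drawn. The function name and the 20-cell toy data are hypothetical.

```python
def prediction_rate_curve(lsi_values, is_test_landslide, n_intervals=20):
    """Cumulative % of test landslide cells captured after each LSI interval."""
    # Cell indices ordered from the highest to the lowest LSI.
    order = sorted(range(len(lsi_values)), key=lambda i: lsi_values[i], reverse=True)
    total = sum(is_test_landslide)
    n = len(order)
    curve, captured, start = [], 0, 0
    for k in range(1, n_intervals + 1):
        end = k * n // n_intervals  # integer bound of the k-th interval
        captured += sum(is_test_landslide[i] for i in order[start:end])
        curve.append(100.0 * captured / total)
        start = end
    return curve

# Toy data: 20 cells with LSI 20..1; test landslides sit in cells 0, 1, 2 and 10.
lsi_values = list(range(20, 0, -1))
flags = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
curve = prediction_rate_curve(lsi_values, flags, n_intervals=4)
print(curve)
```

A model that ranks landslide-prone cells well produces a curve that rises steeply in the first intervals, as in this toy case where the top quarter of the cells captures three of the four test landslides.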

Prediction efficiencies of the two models
The modeling processes of FR model and SVM model are shown in Section 4. It can be seen that the FR model can deal with LSA more simply and efficiently than the

Modeling comparisons of the FR and SVM models
The modeling comparisons show that the FR model is more accurate and considerably more efficient than the SVM model for LSA in the Nantian area. The FR model has also been used in many other study areas, and the related literature shows that it is more accurate than many other statistical and machine learning models for LSA (Lee and Sambath 2006; Pradhan and Lee 2010a; Ramesh and Anbazhagan 2015; Ding et al. 2016; Vakhshoori and Zare 2016; Chen et al. 2016a). Meanwhile, some other studies show that machine learning models are more accurate than the FR model for LSA (Yilmaz 2009; Akgun et al. 2012; Park et al. 2013; Youssef et al. 2015; Chen et al. 2016b). Nevertheless, the modeling comparisons in this study suggest that the FR model is a good choice for researchers and practitioners doing LSA in other areas. One reason is that the frequency ratio values effectively reflect the relationships between the landslides and the environmental factors. Hence, the connections between the environmental factors and the landslide susceptibility map can be established well using the FR model (Pradhan and Lee 2010a; Park et al. 2013). Another reason is that the input-output variables and calculation processes of the FR model are simple and readily understood: the FR model calculates the LSI values just by summing the frequency ratios of the subclasses of the related environmental factors. In addition, the FR model is friendly to end users and is an acceptably simple tool for LSA in a GIS environment; it avoids the modeling difficulties of some other statistical and machine learning models (Yilmaz 2009).
In contrast, although the SVM model can capture the nonlinear relationships between the recorded landslides and related environmental factors, it is hard to process the large number of input-output variables in the statistical package to produce reliable LSI values. Moreover, there are many difficulties in the modeling processes of SVM: reasonable non-landslide grid cells have to be selected, optimal model parameters are needed, the input-output variables have to be prepared, and the algorithm complexity is very high (Yao et al. 2008; Kavzoglu et al. 2014; Huang et al. 2017c). It is these difficulties that reduce the prediction rate of the SVM model for LSA in the Nantian area.

Conclusions
LSA is very valuable for strict land-use planning and disaster risk reduction in landslide-prone areas. In this study, the FR and SVM models are adopted to produce LSMs for the Nantian area in the southeastern hilly area of China. The validation results show that the landslide susceptibility distribution characteristics of the Nantian area are determined accurately by the two models. The results and findings of this study can help decision-makers, planners, and engineers with slope management and land-use planning.
Meanwhile, the comparison results show that the FR model has a higher prediction rate and considerably higher modeling efficiency than the SVM model in this study area. The FR model has the advantages of simplicity, high computational efficiency, and comprehensible modeling. Therefore, it is worthwhile to extend the FR model to other study areas for LSA.

Disclosure statement
No potential conflict of interest was reported by the authors.