A data-driven approach for landslide susceptibility mapping: a case study of Shennongjia Forestry District, China

Abstract The main purpose of this study was to establish a data-driven approach and assess its potential for shallow landslide susceptibility mapping of the Shennongjia Forestry District, China. For the data processing, Fisher segmenting was used for the classification of topographical factors (gradient, relief amplitude, plan curvature, and normalized difference vegetation index) and the improved Otsu method was used to determine the buffer length thresholds of the locational factors (proximity to roads, rivers and faults). The computations show that these two objective methods help to avoid the uncertainty caused by the subjective judgments and the obtained thresholds are more consistent with actual conditions. Considering the different data types, the locational and topographical factors were evaluated by fuzzy comprehensive evaluation while the geological factors were evaluated by frequency index, and then the three separate evaluations were combined using radar chart analysis. The validation result shows that the susceptibility map based on a data-driven approach has a high accuracy of prediction and this approach will be salutary for risk mitigation and land-use management in mountainous regions.


Introduction
Due to the complex geographical conditions and expanding human construction, landslides often occur during the rainy season in the mountainous regions of central and southern China. Among these landslides, shallow landslides have a wide distribution and occur frequently, which poses serious threats to residents and causes huge economic losses. (Li et al. 2012;Pastor et al. 2014;Xu et al. 2015).
Landslide susceptibility mapping plays an essential role in risk mitigation and land-use management in mountainous areas. A landslide susceptibility map emphasizes static landslide-prone conditions, making it easier to predict the spatial distribution and the probability of landslides (Lee et al. 2008;Eeckhaut et al. 2009). Recently, the Geographic Information System (GIS) has made it possible to apply theoretical methods to the regional landslide susceptibility mapping because of its spatial analysis and data manipulation (Pourghasemi et al. 2013;Crozier 2017;Hong et al. 2018), such as fuzzy (Jamaludin et al. 2006;Bui et al. 2012), artificial neural network (Ayalew & Yamagishi 2005;Rossi et al. 2010), logistic regression (Pradhan 2010;Nejad et al. 2016), and linear weighted combination (Ayalew et al. 2004;Jamaludin et al. 2006;Soltani et al. 2013;Shahabi & Hashim 2015). More recently, comparisons between different methods were conducted. For instance, Yilmaz (2009) assessed landslide susceptibility with four methods (conditional probability, logistic regression, artificial neural network, and support vector machine) and concluded that the difference in the prediction accuracy of the four different methods was within 0.2%. Eker et al. (2014) compared five methods for landslide susceptibility evaluation (linear discriminant analysis, quadratic discriminant analysis, artificial neural network, logistic regression, and spatial regression) and drew similar conclusions that the five different methods have a similar prediction accuracy with a largest difference of less than 0.5%. As the comparison analysis revealed, with the same data processing, there is little difference in the prediction accuracy between the above methods (Melchiorre et al. 2008;Zhang et al. 2016).
Classification and the determination of buffer length threshold are vital components of data processing. However, because there are no universal standards or guidelines, most scholars used subjective decisions commonly, which even led to different results within a same project. For instance, for the buffer length thresholds of roads, Xu et al. (2015) took 250 m as the maximum threshold, while Hong et al. (2016) used 2000 m. A smaller threshold cannot ensure that all the regions influenced by a certain factor are considered, whereas with a greater threshold, there would be an increase in the cost of the risk mitigation and land-use management (Li et al. 2010;Wang et al. 2014). Moreover, different data sources lead to different data types: most topographical factors derived from the digital elevation model (DEM) are continuous; lithological characters, such as types of aquifer are generally discrete, are commonly digitized from the thematic map directly. In most cases, a single method cannot deal sufficiently with the situation of different data types.
For these reasons, in the current paper, considering the different data types of selected influence factors, a data-driven approach is established. Fisher segmenting and the improved Otsu method were employed for data processing and the selected factors were evaluated by fuzzy comprehensive evaluation and frequency index (FI) separately, and then combined using radar chart analysis. Subsequently, based on this data-driven approach, a landslide susceptibility map of the Shennongjia Forestry District, China, where shallow landslides frequently occur, was produced.

Study area
The study area, Shennongjia Forestry District, is located within latitude 31 15 0 and 31 75 0 N and longitude 109 56 0 and 110 58 0 E, covering an area of about 3250 km 2 ( Figure 1). The district is bounded by the Ta-pa Mountains with altitudes ranging from 420 to 3100 m and decreasing from the southwest to the northeast. The average annual temperature is 12 C, and the rainy season is mainly between May and October. The types of aquifer are mainly carbonate and clastic, where channels are well-formed for the migration of groundwater due to the development of karst pipelines. There are more than 300 mountain rivers in the district. The annual amount of groundwater is about 840 million m 3 . The length of the road network is 1500 km. The main north-south travel route is the China National Highway 209 (G209), and the Hubei Provincial Route 307 (S307) runs through the northeastern part. Buildings and tourist attractions are widely distributed along the rivers and roads.
With the acceleration of the town construction, the destruction of the natural environment has led to frequent landslides. It is observed from the landslide inventory that there were 285 shallow landslides over the last decade in the district (Table 1). The existing landslides are mainly small and medium scale (88.4% of the total), the minimum scale is 150 m 3 , whereas the largest landslide reaches about 2500 £ 10 4 m 3 (Figure 2). The frequent landslides have posed serious threats to residents and causes huge economic losses. For instance, Landslide Yaolan occurred twice due to heavy rainfall on 5 and 22 August 2012. Referring to the field survey, this landslide was about 120 m long and 63 m wide with an average thickness of 6.5 m. Landslide Yaolan led to traffic interruption of G209 for days and economic losses of about US$0.75 million, threatening the safety of 10 households as well as the passing vehicles. Therefore, it is necessary for decision-makers to take into account susceptibility to shallow landslides during risk mitigation and land-use management.

Data and methodology
3.1. Data used 3.1.1. Landslide inventory map A landslide inventory is an important component of the spatial database, which records the details of the existing landslides and helps in the analysis of the connections between landslide occurrences and influence factors (Ercanoglu et al. 2004;Reis et al. 2009;Mohammady et al. 2012;Martha et al. 2013). By means of extensive field surveys and spatial analysis, the landslide inventory, which contains existing 285 shallow landslides, was digitized by the School of Urban Design of Wuhan University. In the current paper, 199 landslides (70% of the total) were randomly selected for modelling, and the remaining 86 landslides (30% of the total) were used for validation ( Figure 1).

Topographical factors
The topographical factors of the gradient, relief amplitude (RA), and plan curvature (PC) were derived from the DEM that was provided by the School of Urban Design, Wuhan University. RA is a measure of the surface cutting degree, which is defined as the maximum elevation difference within a certain range. PC describes the curvature of a contour line in the horizontal direction and is an essential measurement of the slope erosion Bui et al. 2012). Normalized difference vegetation index (NDVI) is a measure of surface reflectance and gives a quantitative estimate of the vegetation growth and biomass (Hall et al. 1995;Deng et al. 2007;Zhu et al. 2013;Dou et al. 2015). Using the satellite image of Landsat 8 from February 2017, NDVI was calculated: where NIR and RED represent the spectral reflectance of the near-infrared and red bands of the electromagnetic spectrum, respectively.

Locational factors
The locational factors refer to the proximity to locational elements (roads, rivers, and faults). The influence of the locational factors to landslides can be characterized by the distance between mapping cells and the locational elements. Roads affect landslides in the form of excavation, additional load, etc. The region near rivers is likely to have a higher water content, which may erode the slopes or saturate the lower part of slopes, resulting in a decline in the support capacity. The soil close to a fault is considered to be relatively weak because of the altered surface material structures and the increased permeability (Kanungo et al. 2006;Yalcin & Bulut 2006). The road layer was digitized from the traffic map of the Shennongjia Forestry District with a 1:50,000 scale. The river layer was digitized from the water resource map of the Shennongjia Forestry District with a 1:50,000 scale. The fault layer was digitized from the geological map of the Hubei province with a 1:100,000 scale.

Geological factors
The types of aquifer determine the migration of groundwater, which is one of the important indicators of lithological character (Stegmann et al. 2011;Suzuki & Higashi 2012). The layer of aquifers was directly digitized from the published hydrological map of the Hubei province with a 1:100,000 scale. The aquifers have been divided into five types (metamorphic-carbonate, clastic, clastic-carbonate, carbonate, and carbonate-clastic) according to the aquifer medium and hydrodynamic features.

General approach
The data-driven approach can roughly be divided into a three-step procedure (Figure 3). The first step is the data processing, which includes the classification of topographic factors by Fisher segmenting and the determination of buffer length thresholds for locational factors by the improved Otsu method. The second is the separate evaluations using fuzzy comprehensive evaluation and frequency index. The last step is the combination of the three separate evaluations using radar chart analysis to produce the landslide susceptibility map.

Fisher segmenting
The Fisher segmenting method is an effective clustering analysis of ordered samples based on the minimum sum of variance of all classes. For a group of ordered samples X = {x i , x i+1 , …, x j } (i<j), P = {i, i+1, …, j}, the average Àx p , the diameter, D(i, j), and the objective function, B(n, k), can be determined via the following equations: (1) where n represents the number of samples and k represents the number of classes. For a specific sample and a fixed class number, the smaller the B(n,k) is, the more reasonable the classification.

Improved Otsu method
In the GIS analysis, buffers are built to consider the influence of locational factors on landslides. Before building the buffers, there is a need to determine the buffer length threshold. The Otsu algorithm is an effective algorithm for image binarization in computer vision and image processing (Otsu 1979). The algorithm searches for the optimum threshold separating the image into the foreground and background so that intra-class variance is minimal and inter-class variance is maximal: where s 2 w t ð Þ is the sum of variances of the foreground and background, w 0 (t) and w 1 (t) are the proportions of the two classes separated by a threshold t, and s 2 0 t ð Þ and s 2 1 t ð Þ are the variances of these two classes, respectively. When the sum of variances s 2 w t ð Þ is at its maximum, the threshold t is the optimal threshold.
The traditional Otsu method only classifies the sample into two categories, so it cannot meet the requirements of multiple classes. To build the buffers, L i ð Þ (where i is the number of classifications) is assumed as the distance threshold that divides the existing landslides C into influenced class c where L 0 ð Þ is the maximum distance between the locational factors and the landslides (L max ), and c 0 ð Þ 1 is equal to C. When i is 0, Equations (4) and (5) are the same, which indicates that the first buffer is in the range greater than the threshold L 1 ð Þ , and similarly, the second is L 2 ð Þ to L 1 ð Þ , and the i-th buffer is L i ð Þ to L iÀ1 ð Þ (Figure 4).

Fuzzy comprehensive evaluation
Fuzzy comprehensive evaluation refers to the overall evaluation of a phenomenon affected by multivariate factors and gives a nonnegative evaluation value (Feng et al. 2014). By performing the fuzzy composite operation of the weights and memberships, a fuzzy comprehensive evaluation is established: where W represents the weight vectors of factors, F represents the fuzzy membership matrix, m is the factor number, n is the class number, and f mn is the fuzzy membership for factor m of class n. The final evaluation is determined by the maximum membership principle (Liang et al. 2003;Lv et al. 2013). Fuzzy membership r A (x) is used to characterize the extent of element x belonging to fuzzy set A, whose values lie between 0 and 1. The higher the value is, the greater the degree of belonging to the fuzzy set and vice versa (Devkota et al. 2012). The membership functions that are commonly used include linear, exponential, hyperbolic, and inverse hyperbolic. In the current paper, the fuzzy membership was assumed to follow a Gaussian distribution ( Figure 5) and can be calculated using the following equations (Su et al. 1993): where x i+1 and x i represent the upper and lower bounds of the class interval. Information entropy is a frequently-used objective weighting method based on the characteristics of the original data, which avoids the deviations that arise from subjective decisions and can better characterize the importance of the influence factors (Devkota et al. 2012;).

Radar chart analysis
A radar chart is a graphical tool for displaying multivariate data in the form of a two-dimensional chart and it is useful for displaying multivariate observations with an arbitrary number of variables (Chambers et al. 1983;Friendly 2009). Radar chart analysis involves simple calculations and is easy to understand, and it has been widely used in the fields of product comparison and control of quality improvement (Chang et al. 2012;Kalonia et al. 2013).
As shown in Figure 6, a radar chart consists of equal-angular axes, each representing the susceptibility class of the separate evaluation; lines are drawn connecting data points on adjacent axes, forming a characteristic polygon for the comprehensive evaluation of a distinct mapping cell. According to the area of the polygon, the score function of the comprehensive evaluation is established as follows: ðc 1 Ác 2 þ c 1 Ác 3 þ c 2 Ác 3 Þ; where c i represents the susceptibility class of the separate evaluations.

Data processing
The factor data were processed via the Fisher and improved Otsu method, and the FIs in each class interval were investigated. The FI refers to the ratio of landslide scale frequency (u LS ) to the grid cell proportion (u cell ) in the same class. The FI can objectively reveal the density of landslide occurrences in a certain region, and the regions with relatively higher FI are considered to be more susceptible to landslides. As shown in Table 2, the FIs are the largest in the gradient interval of 13.5 to 21 , in the RA interval of 15 to 25 m, in the PC interval of ¡0.1 to 0.1, and in the NDVI interval of more than 0.36. Using improved Otsu method, buffer length thresholds were determined (Table 3). The maximum road buffer is slightly greater than the maximum river buffer and much smaller than the maximum fault buffer. Moreover, in the regions within 250 m of the locational elements, the FIs of roads are generally higher.

Mapping
The results of weights determined by information entropy included the weights of the topographical factors and the weights of locational factors (Table 4). Among the selected topographical factors, both gradient and RA have a high percentage of the weights (almost three times the weight of PC and twice the weight of NDVI). As for the locational factors, the roads have the greatest weight, the rivers come second, and the faults the last. According to the class intervals shown in Tables 2 and 3, the topographical and locational factors were classified into five (I-V) categories where V represents the interval with the highest FI. To be compatible with the spatial resolution, the layers of each factor were translated into raster ones via the map-making software Arcgis 10.2 with a size of 25 m £ 25 m (Figure 7).
Using the classified layers, the fuzzy memberships of the mapping cells were calculated via Equations (7)-(9). As shown in Figure 8(a), the memberships of mapping cells follow a Gaussian distribution. As the gradient is about 17.25 , the mapping cells have the greatest extent of belonging to class V, and with the increasing gradient, the extent of belonging to class V gradually decreases while the extent of belonging to class III increases. Similarly, as shown in Figure 8 (f), with increasing of the distance to roads, the memberships of mapping cells trend to belong to a lower class.
Based on the obtained fuzzy memberships and weights, the topographical and locational factors were evaluated separately (Figure 9(a,b)). Meanwhile, the geological factor, namely types of aquifer, was also evaluated into five (I-V) categories via the FI (Figure 9(c)). As shown in Table 5, the FIs of each separate evaluation are similar: the FIs increase with the increasing susceptibility class. In particular, the FIs of the locational evaluation change evenly in class I to class III, and increase sharply as the class increases from III to IV, then become even in class IV and class V. The FIs of the geological evaluation stay fairly low in classes I to IV, but increase sharply as the class increases from IV to V. On the basis of radar chart analysis, the above three evaluations were combined. The comprehensive evaluation was classified into the categories I-V with the segmenting points of 3, 12, 27, and 48, and the susceptibility map was generated ( Figure 10). As the map shows, 46% of the district's total area was in susceptibility classes IV and V, which is where more than 70% of existing landslides are distributed.

Validation
The radar charts of the mapping cells in typical regions (Figure 10(a-d)) were analysed comparatively. For the region a, the three lower separate evaluation classes were combined into a low comprehensive evaluation class, and vice versa for the region b. Moreover, a single low or high class could not make an extreme evaluation class, such as the regions c and d ( Table 6).
The validation results indicate that the FIs of both the data for the modelling and the data for the validation increase with the increasing susceptibility class similar to the FIs of the separate evaluations (Table 7). Of the existing landslides, 71.4% in the data for modelling and 72.1% in the data for validation are located in regions with susceptibility classes IV and V (46.8% of district's total area). Meanwhile, in the same susceptibility class, the FIs of both are similar.

Conclusion
A data-driven approach was established for regional landslide susceptibility mapping. This approach could be roughly divided into a three-step procedure: (i) data processing, which includes the classification of topographic factors by Fisher segmenting and the determination of buffer length thresholds for locational factors by the improved Otsu method; (ii) separate evaluations using fuzzy comprehensive evaluation and frequency index; (iii) combination of the separate evaluations by radar chart analysis.
The inaccurate thresholds of the locational factors determined by the subjective decisions could lead to an incomplete consideration or a higher cost and, therefore, the improved Otsu method was established. The buffer length thresholds of roads are 55, 115, 245, and 710 m, the buffer length thresholds of rivers are 50, 105, 240, and 590 m, and the buffer length thresholds of faults are 205, 365, 1470, and 3415 m.
The combination of the separate evaluations by radar chart analysis solves the problem of different data types caused by the different data sources. The FIs of the data for the modelling and the  data for validation have common distribution features and similar values, which indicates that the obtained susceptibility map has a high accuracy of prediction.
Since this is a prototype study performed in landslide-prone region in Shennongjia Forestry District, China, the findings may potentially be utilized for regions with similar topographical and geological conditions.

Discussion
For selection of suitable influence factors, the quality of the landslide inventory map, the difficulties in data acquisition, the credibility of the obtained data, and the connections between landslides and influence factors should be fully considered.   A high-quality landslide inventory map should cover the occurrence time and shape feature of landslides, the physical and mechanical properties of the main soil and rock, and the changes in the main triggering factors, etc. However, the obtained landslide inventory only records the locations and scales. To be compatible with the frequency index, in this paper, the circular buffers were built according to the recorded scales, where the average values were extracted from the factor data layers as the attributes of the existing landslides.
Precipitation is one of the main trigger factors for landslides (Tarolli et al. 2011). The obtained precipitation data is the average annual precipitation (AAP) of the last 10 years from eight meteorological observation stations. Using the ordinary Krigingmethod, a spatial interpolation method, the AAP layer was produced. However, the AAP layer shows that the AAP appears to be uniform throughout the district, which makes the weight of the AAP calculated via information entropy too small to characterize the importance of AAP and, therefore, the AAP was not considered in the current paper. The same can be said for the land-use.
It seems not that the greater the gradient is, the higher the probability of landslide is, which may due to the locations of constructions and the fact that steep slopes stay relatively stable under natural conditions. As the flat terrain is scarce in the district, constructions are mainly concentrated on slopes, leading to a change in the slope structures. Under natural conditions, the stability of the slope is mainly determined by the geological conditions and mechanical properties. When the surrounding environment changes drastically or slopes are subject to constructions, the slopes are relatively unstable.
The diversity of the data sources leads to different data types. For data processing, there is a need to classify the DEM-derived data and build buffers for locational elements. The Fisher segmenting and the improved Otsu method solve these issues. Meanwhile, other objective methods, such as K-means, Jenks, etc., are alternative solutions. The performances of these methods should be investigated in further studies.
A landslide susceptibility map is usually estimated using area under the receiver operating characteristic curve (AUC) (Lee et al. 2008;Chen et al. 2014;Hong et al. 2017). However, since the susceptibility indexes calculated by radar chart analysis are still discrete, the AUC is not suitable for map validation. As shown in Table 7, the FIs of the data for the modelling are very similar with the FIs of the data for validation in the same susceptibility class. This is a fairly good result, which indicates that the data-driven approach has a good ability to predict landslides.
In addition to the grid, the mapping cells come in various forms, such as unique-condition units, sub-basin, slope units, etc. Among these, a slope unit is the basic terrain unit for landslide occurrence and can reveal shape features and keep the differences between the units (Guzzettia et al. 1999;Abella & Westen 2008). However, the flow threshold, which is the key to generating the slope units, is sensitive to hydrological characteristics and difficult to determine. Recently, a generation method of slope units established by our team presents a high consistency between the generated slope units and the actual slopes. If this is reasonable, the generation method of slope units could be applied in landslide susceptibility mapping to better predict the shape features of the landslides.
Based on the discussion above, the future work of our team is mainly focused on the filtering of influence factors and the application of slope units. Meanwhile, some other evaluation method, including the mechanics method, will be considered and introduced, and the detailed geological investigations will be conducted for obtaining the geotechnical parameters at the site-specific scale, where the local geological heterogeneities may predominate.

Disclosure
No competing financial interests exist.