Predicting susceptibility to landslides under climate change impacts in metropolitan areas of South Korea using machine learning

Abstract Landslides cause considerable damage to life and property worldwide. To prevent and respond to landslides, it is necessary to identify vulnerable areas. This study identified areas that are likely to be damaged by landslides and aimed to predict future landslides. We compared several machine learning (ML) algorithms and conducted susceptibility mapping and landslide prediction using the algorithm that performed best. For landslide predictions, the probability distribution of precipitation in the representative concentration pathway scenario 8.5 was used. We accounted for future uncertainties by using several regional climate model scenarios. Among the ML algorithms compared, the random forest had the highest overall prediction accuracy (0.932). Future susceptibility to landslides determined using the random forest and five regional climate models exhibited minor differences among models, but the average susceptibility increased over time. In addition, many urban areas are distributed around forest areas that have high landslide vulnerabilities, which provides an important perspective for urban and environmental planning.


Introduction
Along with flooding, landslides are a major cause of serious damage to life and property worldwide (Malamud et al. 2004; Gomez and Kavzoglu 2005; Lee and Pradhan 2007; Garcia-Rodriguez et al. 2008; Yilmaz 2010; Pham et al. 2020). As the impacts of climate change become more severe, sudden heavy rains could cause more damage through landslides and flooding (Yılmaz 2009). Therefore, studies of landslide-susceptibility assessment and response are necessary to guide disaster reduction and management measures, including land use planning and decision-making processes (Akgun 2012; Nsengiyumva et al. 2018; Dou et al. 2019b, 2020b). Studies of the vulnerability or susceptibility to landslides have been conducted worldwide. Previous research has mostly focused on identifying the factors that cause landslides based on the conditions of the target site, collecting data on those factors, and analyzing vulnerability or sensitivity through statistical models, as shown in Table 1. A review of previous studies indicates that, while the target sites and methodologies differed, the contents of the studies were similar (Tien Bui et al. 2012; Zare et al. 2013; Kumar et al. 2016; Pham et al. 2016; Kornejady et al. 2017; Chen et al. 2017a; Hong et al. 2017; Nsengiyumva et al. 2018; Pham and Prakash 2019; Polykretis et al. 2019; Wang et al. 2020; Dou et al. 2020a).
Many valuable studies exist, but most of them focused on evaluating current susceptibility or sensitivity. Few have focused on predicting the future risk of landslides. In the field of disaster research, predictions or future prospects are important for disaster management (UNISDR 2015). The probabilistic statistical techniques used in previous studies (Table 1) to predict and forecast landslides in a specific vulnerable area also have potential for future-oriented studies (Pourghasemi et al. 2020). Therefore, this study aimed to predict future landslides, a topic that has not received adequate attention in previous studies, by using multiple climate change scenarios. A probabilistic statistical model was constructed to estimate landslide susceptibility because it can account for the uncertainties in the calculations used for prediction (Barría et al. 2019).
Recently, machine learning (ML) algorithms have become popular and are being used extensively for the spatial prediction of diverse types of hazards (Chen et al. 2017a). Also, data-driven models, such as machine learning models, performed better and were considered more efficient than other approaches, such as expert opinion-based methods (Goetz et al. 2015; Pham et al. 2020). In this study, landslide susceptibility was assessed using five ML algorithms widely used in previous studies: Naïve Bayes classifier (NB), k-Nearest Neighbor (kNN), Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM). Predictions of future landslides were then made by considering the probability distribution of precipitation data obtained from representative concentration pathway (RCP) climate change scenarios provided by the Intergovernmental Panel on Climate Change (IPCC) and regional climate models (RCMs) provided by the Korea Meteorological Administration (KMA).
Table 1
Reference | Study area | Method
Zare et al. (2013) | Vaz Watershed, Iran | Multilayer perceptron and radial basis function
Pham et al. (2016) | Uttarakhand state, India | Naïve Bayes Trees, Support Vector Machines
Kumar et al. (2016) | Indian Himalayas | Fuzzy-frequency ratio
Pham et al. (2017) | Indian Himalayas | Multilayer Perceptron Neural Networks
Kornejady et al. (2017) | Golestan Province, Iran | Maximum Entropy
Chen et al. (2017a) | Langao County, China | Rotation forest ensembles, Naïve Bayes Tree
Hong et al. (2017) | Chongren area, China | Frequency ratio, Certainty factor, Index of entropy
Nsengiyumva et al. (2018) | Eastern Province, Rwanda | Spatially different criteria evaluation methods
Polykretis et al. (2019) | Mediterranean catchment, Greece | Adaptive neuro-fuzzy modeling
Pham and Prakash (2019) | Mu Cang Chai, northern Vietnam | Bagging-based Naïve Bayes Trees
Dou et al. (2020a, 2020b) | Northern parts of Kyushu, Japan | Support vector machine hybrid ensembles
Wang et al. (2020) | Sichuan Province, China | Deep belief network (DBN)
In addition, climate models have uncertainties because of uncertainty in the scenario values (Knutti and Sedláček 2013). For this reason, many studies have used an ensemble approach that accounts for uncertainty by using diverse scenarios (Parker 2013). The findings of this study can serve as a data source for formulating long-term policies for response and disaster management related to landslides.

Data
As shown in Figure  . Topographic (elevation, slope, aspect, curvature), geologic (lithology), environmental (road area ratio, forest area ratio), and meteorological (daily maximum precipitation) data pertaining to these variables were collected from the National Geographic Information Institute, the Korea Institute of Geoscience and Mineral Resources, the Korea Meteorological Administration, and the National Disaster Management Research Institute (Table 2). Data were collected on these eight factors as well as the landslide inventory. The four topographic factors, derived from the digital elevation model (DEM), were created using a 10 m grid (Dou et al. 2015), and the others, including the environmental factors, were created using a 250 m grid. All data were resampled to a 250 m grid in consideration of the area of the target site. Because the dataset was created by matching the rainfall data with the landslide occurrence dates from the inventory, the size of the dataset used for analysis was the number of grids multiplied by the number of occurrences. Table 2 shows the source, type, and period of each factor. In addition, Figure 3 shows the mapping of variables used in the study, with July 27, 2011 mapped as an example for Daily Maximum Precipitation (DMP).

Landslide factors analysis
To screen out unnecessary factors prior to the landslide susceptibility assessment (LSA), multi-collinearity analysis was used to examine the relationships among the factors (Bui et al. 2019; Dou et al. 2019a), and the information gain ratio (IGR) was used to determine the degree to which each factor influences the results. Both steps matter because multi-collinearity and weakly influential factors affect the results and accuracy of the model (Zhou et al. 2018). The Variance Inflation Factor (VIF) and tolerance were used to quantify multi-collinearity. Then, the influence of each factor was determined using the IGR technique (Zhou et al. 2018).
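The IGR ranking described above can be illustrated with a minimal sketch. It assumes the standard definition of the gain ratio (information gain divided by split information) applied to factors that have already been discretized into classes; the source does not give the exact formulation it used, and the toy data and the names `rain_class` and `aspect_class` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def information_gain_ratio(factor_classes, labels):
    """Information gain of a discretized factor divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(factor_classes, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels that remains after splitting on the factor.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(factor_classes)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 1 = landslide occurred, 0 = no landslide.
landslide = [1, 1, 1, 1, 0, 0, 0, 0]
rain_class = ["high"] * 4 + ["low"] * 4   # separates the two classes perfectly
aspect_class = ["N", "S"] * 4             # carries no information about the label

igr_rain = information_gain_ratio(rain_class, landslide)
igr_aspect = information_gain_ratio(aspect_class, landslide)
print(igr_rain, igr_aspect)
```

In this toy case the rainfall class receives the maximum ratio and the aspect class receives zero, which is the kind of ordering the IGR step is meant to expose.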

Machine learning algorithms and validation
Each of the five ML algorithms used in this study has its own characteristics. NB is a stochastic-statistical method based on Bayes' rule, in which the prior probability is used to estimate the posterior probability. Bayes' theorem is stated mathematically as follows: $P(A \mid B) = P(A)\,P(B \mid A)/P(B)$. In that equation, $P(A)$ is the prior probability, $P(B)$ is the marginal probability of the evidence, $P(B \mid A)$ is the likelihood, and $P(A \mid B)$ is the posterior probability. kNN was developed by Cover and Hart (1967), is easy to run, and is as simple as NB (Jadhav and Channe 2016). The choice of k, the number of nearest data points considered, is important because it affects the result of the algorithm (Bhavsar and Ganatra 2012; Kim et al. 2012). DT is a popular ML algorithm that takes the form of a tree and is based on decision tree theory. It is useful for decision-making because it provides a simple representation of the results (DeFries and Chan 2000). RF (Breiman 2001) is an ensemble of decision trees that is often used in ML studies alongside algorithms such as SVM and neural networks. SVM, devised by Cortes and Vapnik (1995), is a multi-purpose supervised algorithm that classifies new, unlabeled samples by finding the boundary that best separates the labeled classes. The results of the LSA derived from the five ML algorithms were compared using the receiver operating characteristic (ROC) curve. ROC analysis is widely used to assess model performance (Pham et al. 2016); the curve plots the true positive rate against the false positive rate (1 − specificity). The closer the average AUC (area under the ROC curve) is to 1, the higher the accuracy of the model (Chen et al. 2018).
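To make the AUC criterion concrete, the following sketch computes it directly from its rank interpretation: the AUC equals the probability that a randomly chosen landslide sample receives a higher score than a randomly chosen non-landslide sample, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not values from this study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = landslide cell, 0 = non-landslide cell.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]   # predicted susceptibility
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; for large grids a sort-based implementation from an established library would be the practical choice.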

LSA using different algorithms
The landslide inventory was divided into inventories for landslide occurrence and non-occurrence. Because there was a considerable difference in frequency between the occurrence and non-occurrence areas, under-sampling was performed based on the occurrence area (He and Garcia 2009). The analysis was then conducted by dividing the data into a training set and a test set with a ratio of 70:30 (Dou et al. 2015, 2019b, 2020a, 2020b; Zhou et al. 2018). Tens of thousands of iterations were required to analyze all of the grids because under-sampling was performed prior to grid creation. Landslide susceptibility was assessed using five ML algorithms that have been used widely in recent years: NB, kNN, DT, RF, and SVM. Additionally, these five algorithms were used to account for the uncertainty of the model (Hao et al. 2019). Results were obtained through an analysis involving approximately 50,000 iterations.
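The sampling procedure can be sketched as follows. This is a minimal illustration that assumes random under-sampling of the non-occurrence class down to the size of the occurrence class, followed by a random 70:30 split; the cell identifiers, class sizes, seed, and function names are hypothetical, and the study's actual implementation is not given in the text.

```python
import random


def undersample(samples, labels, rng):
    """Keep every occurrence sample and an equally sized random subset of non-occurrence samples."""
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    neg_kept = rng.sample(neg, len(pos))  # assumes non-occurrence is the majority class
    return [(s, 1) for s in pos] + [(s, 0) for s in neg_kept]


def train_test_split(data, train_fraction, rng):
    """Shuffle a copy of the data and cut it into training and test parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rng = random.Random(0)                           # fixed seed for reproducibility
cells = list(range(1000))                        # hypothetical grid-cell identifiers
labels = [1 if i < 50 else 0 for i in cells]     # 50 occurrence cells, 950 non-occurrence

balanced = undersample(cells, labels, rng)
train, test = train_test_split(balanced, 0.7, rng)
print(len(balanced), len(train), len(test))
```

Because each draw discards most non-occurrence cells, repeating the draw with different seeds and averaging the resulting scores is one way to cover the whole grid, which is consistent with the large number of iterations reported above.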

Predicting landslide susceptibility
Susceptibility was predicted using the highest performing algorithm. Precipitation data from RCP climate change scenario 8.5, obtained from five different RCMs (GRIMs, HadGEM3-RA, RegCM4, SNURCM, and WRF), were used as the main variables for prediction. RCP scenarios are defined by radiative forcing levels that reflect the amount of carbon emissions, the main cause of climate change, and there are four scenarios (2.6, 4.5, 6.0, and 8.5). The higher the number, the higher the carbon emissions, with 8.5 indicating the scenario in which carbon emissions continue at the current level (IPCC 2014). Precipitation data for these five RCMs were obtained from the Regional Climate Detailing Project in East Asia (CORDEX-EA: Coordinated Regional Downscaling Experiment-East Asia, http://cordexea.climate.go.kr/cordex). The climate information portal of the KMA provided the RCM data (Table 2). We used the daily maximum rainfall amounts from each RCM under the RCP scenario, and their probability distributions were used as inputs to determine the future susceptibility to landslides. The temporal targets for forecasting are the 2030s, 2050s, and 2080s, and daily maximum rainfall amounts for the 10 years before and after each target year were used (Figure 2).
Table 3 shows the result of the multi-collinearity analysis. A tolerance of less than 0.2 or a VIF greater than 5 was interpreted as indicating multi-collinearity (O'Brien 2007). The VIF and tolerance of the factor 'slope' were 5.013 and 0.199, respectively. Since the VIF was greater than 5 and the tolerance was less than 0.2, 'slope' was removed from the LSA. When the multi-collinearity analysis was repeated without 'slope', no multi-collinearity was found among the remaining factors.
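For reference, the two multi-collinearity statistics are reciprocals of each other. The block below uses the standard definitions, which are not spelled out in the text; the symbol $R_j^2$ (the coefficient of determination obtained when factor $j$ is regressed on all the other factors) is introduced here for illustration.

```latex
\[
\mathrm{Tolerance}_j = 1 - R_j^2,
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\mathrm{Tolerance}_j}
\]
% Check against the reported values for 'slope':
\[
\frac{1}{\mathrm{VIF}_{\text{slope}}} = \frac{1}{5.013} \approx 0.199,
\qquad
R_{\text{slope}}^2 = 1 - \frac{1}{5.013} \approx 0.80
\]
```

Under these definitions the two thresholds (tolerance below 0.2, VIF above 5) describe the same condition, namely that roughly 80% or more of the variance of a factor is explained by the other factors.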
Figure 4 shows the results of the influencing factor analysis and the average AUC values of the LSA obtained when factors were removed one at a time in order of increasing influence. As can be seen from the results (Figure 4a), the most influential variable was daily maximum precipitation (DMP), with a value of 0.45. The least influential variable was aspect (0.05), which was still much higher than the value for altitude (0.023), the factor that most affected colluvial landslides in Zhou et al. (2018). The influence of factors is difficult to compare accurately across studies because the number of variables differs, but even after taking this into account, it was a high value. When the LSA was performed without the factors having low influence, the AUC value did not differ significantly (Figure 4b). The LSA was therefore implemented using all factors except 'slope', which was removed in the multi-collinearity analysis. In addition, DMP was the most influential factor, as in studies that analyzed flooding in coastal areas (Park and Lee 2020). These results reiterate the importance of responding to heavy rain as part of disaster management.

Comparison of machine learning algorithms
As shown in Figure 5a, the average AUC values of the ROC curves produced by the five algorithms were 0.822 (NB), 0.896 (kNN), 0.869 (DT), 0.932 (RF), and 0.866 (SVM). The accuracy of the model using RF was the highest of the five models; however, the results of the other models were also high. In addition, the density graph of the AUC values shows that the kurtosis for RF is slightly higher than those of the other algorithms. This indicates that the results obtained with RF varied less across iterations, that is, they were more precise. Therefore, we used RF for the landslide susceptibility mapping and prediction. Figure 5b shows the ROC curves obtained using RF.

Predicting landslide susceptibility
The predictions were conducted using monthly average rainfall amounts. In this process, the rainfall predicted to occur in the future was represented as a density function to account for the uncertainty of future rainfall. The rainfall for each month was estimated from its kernel density, and the model was used to calculate the monthly predicted susceptibility. By comparing the monthly susceptibility probabilities, we observed that the probability increased during June, July, and August for most of the RCP 8.5 projections. Based on these data, the average probabilities of susceptibility during June, July, and August were calculated for each combination of RCM and target period (2030s, 2050s, and 2080s) under RCP 8.5 to create a probability map of the maximum susceptibility for the given scenario.
The results of the predictions are shown in Figure 6. These results were obtained for the RCP climate change scenario 8.5 using five different RCMs with the 2030s, 2050s, and 2080s as the target periods. The susceptibility was high, on average, when HadGEM3-RA was used. Moreover, when GRIMs was used, the susceptibility increased over time, while the remaining three RCMs showed tendencies toward reduced susceptibility. This is because the peak precipitation value differed depending on the scenario considered by the RCMs (Figure 3i).
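One possible reading of this step is sketched below: projected rainfall values for a month are turned into a Gaussian kernel density, and the classifier's predicted probability is averaged over that density. This is an assumption about the procedure, not the authors' confirmed implementation. The function `toy_model` is a hypothetical stand-in for the trained RF evaluated at a single grid cell, and the rainfall samples, the bandwidth, and the integration range are invented for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate built from rainfall samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


def expected_susceptibility(density, model, lo, hi, steps=2000):
    """Density-weighted mean of model(r) on [lo, hi], using the midpoint rule."""
    width = (hi - lo) / steps
    weighted = 0.0
    mass = 0.0
    for i in range(steps):
        r = lo + (i + 0.5) * width
        d = density(r)
        weighted += model(r) * d * width
        mass += d * width
    return weighted / mass   # renormalize for density mass outside [lo, hi]


def toy_model(rain_mm):
    """Hypothetical susceptibility curve; stands in for the trained classifier."""
    return 1.0 / (1.0 + math.exp(-(rain_mm - 150.0) / 30.0))


# Hypothetical projected rainfall (mm) for one month in two target periods.
rain_2030s = [80.0, 95.0, 110.0, 130.0, 160.0]
rain_2080s = [100.0, 125.0, 150.0, 190.0, 240.0]

p_2030s = expected_susceptibility(gaussian_kde(rain_2030s, 20.0), toy_model, 0.0, 400.0)
p_2080s = expected_susceptibility(gaussian_kde(rain_2080s, 20.0), toy_model, 0.0, 400.0)
print(p_2030s, p_2080s)
```

Averaging over the density rather than plugging in a single rainfall value lets the spread of the projections, and not only their central tendency, influence the predicted susceptibility.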

Analysis of results from different ML algorithms
This study accounted for model uncertainty by performing LSA using five different ML algorithms. The resulting performance was good for most models, with RF performing best, similar to the findings of previous studies that applied supervised learning classification to spatial data (Cracknell and Reading 2014; Naghibi et al. 2017; Chen et al. 2017b; Pourghasemi et al. 2020).

Difference in susceptibilities based on land cover type
We determined how susceptibility changed according to the type of land cover, assuming that the same land cover is maintained in the future, using five different RCMs under climate change scenario 8.5 (Figure 7a). The risk was highest in forest areas, largely because that is where landslides occur. The uncertainty of the results increased over time. This can be attributed to the uncertainty of climate models and the distribution characteristics of the precipitation values in each RCM. Moreover, it is important to know which of the other land cover types near the forest will be more sensitive in the future. Figure 7b shows the relative susceptibility of three land-cover types near the forest during the target periods under the RCP 8.5 scenario. The uncertainty increased with time; however, the urban area remained more susceptible than other areas. This is because many urban areas are distributed around forest areas that have high susceptibilities to landslides. More indirectly, because of the high degree of urbanization around the forest area, economic damage is high when landslides occur due to heavy rains. These results highlight the need for future efforts in land-use planning to reduce landslide susceptibility in urban areas located near forest areas.

Conclusions
This study evaluated landslide susceptibility in the metropolitan area that includes Seoul, South Korea. Prior to the LSA, multi-collinearity was analyzed among eight factors. The factor 'slope' showed high multi-collinearity and was removed from the LSA. Then, to improve the ML model's performance, the influence of each factor was assessed using the IGR technique. The most influential factor was the daily maximum precipitation (0.45). In addition, performing the LSA while removing the low-influence factors in turn showed that the AUC value did not differ significantly among the results. Based on evaluations using five different ML algorithms with seven factors, RF exhibited the best performance in terms of average AUC (0.932). To predict future landslide susceptibility, projected future precipitation values from the RCP climate change scenario were used, and the target periods were set to the 2030s, 2050s, and 2080s. Additionally, various future possibilities were predicted based on five different RCMs. This allowed us to consider future uncertainties. The landslide susceptibility generally increased over time, with the exception of some results where RegCM4, SNURCM, and WRF were used.
From this study, we arrived at the following additional inferences. First, among the five ML algorithms used to assess landslide susceptibility, RF had the best performance. Similar studies using supervised learning classification have shown comparable results. Second, we found that it is necessary to reduce the landslide susceptibility in forest areas, based on an analysis of changes in landslide susceptibility over the target periods (2030s, 2050s, and 2080s) and land cover types (urban, agricultural, forest, and grassland). We found that urban areas were more susceptible than other land-cover types because they were distributed around the forest areas that were estimated to be more susceptible to landslides. This may indicate that many areas around the forest have been urbanized, and it also highlights the need for future efforts in land-use planning to reduce landslide susceptibility.
Additionally, the results and process of this study reveal some limitations. First, since the purpose of this study was to predict future landslide susceptibility, the effect of data resolution was not considered, although the grid size used for the target site was larger than those of two other studies (Dou et al. 2015; Chang et al. 2019). The resolution of data should be considered in future studies because it is important for determining landslide susceptibility. Second, we predicted future landslides by assuming that the socioeconomic factors (as indicated by land cover type) did not change. In the future, socioeconomic factors may become more important for determining susceptibility; therefore, these factors should be considered in future studies.

Disclosure statement
No potential conflict of interest was reported by the authors.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.