Exploring novel hybrid soft computing models for landslide susceptibility mapping in Son La hydropower reservoir basin

Abstract In this study, two novel hybrid models, namely Bagging-based Rough Set (BRS) and AdaBoost-based Rough Set (ABRS), were used to generate landslide susceptibility maps of the Son La hydropower reservoir basin, Vietnam. In total, 186 past landslide events and twelve landslide affecting factors (slope degree, slope aspect, elevation, curvature, focal flow, river density, rainfall, aquifer, weathering crust, lithology, fault density and road density) were considered in the modeling study. The landslide data were split into training (70%) and testing (30%) sets for model development and validation. The OneR feature selection method was used to select and prioritize the landslide affecting factors based on their importance in model prediction. The performance of the developed hybrid models was evaluated and compared with that of single rough set (RS) and support vector machine (SVM) models using various standard statistical measures, including the area under the receiver operating characteristic (ROC) curve (AUC). The results show that the hybrid BRS model (AUC = 0.845) is more accurate than the other models (ABRS, SVM and RS) in predicting landslide susceptibility. Therefore, the BRS model can be used as an effective tool for developing accurate landslide susceptibility maps of hilly areas.


Introduction
Landslide Incidents (LI) are common in hilly and mountainous areas, where they cause huge losses of life and property and affect the socio-economic conditions of a region (Rossi et al. 2019; Chen et al. 2019b). Landslides are gravitational downward movements of rock, soil and ground mass (Tacconi Stefanelli et al. 2020). Early spatial prediction of landslides is desirable to prevent landslides and the resulting damage (Lombardo et al. 2020). Natural and anthropogenic activities usually affect the occurrence of landslides. Researchers have proposed various methods to identify landslide prone areas and solutions to reduce the adverse consequences of landslides. Landslide susceptibility mapping (LSM) is one of the most appropriate methods, as it helps in forecasting landslide prone locations in specific areas. Generally, it is assumed that future landslides in an area will occur under conditions similar to those of the past. Therefore, the spatial relationship between the variables that affect the occurrence of landslides can be used to identify and predict the locations of future landslides (Chen et al. 2019a; Nhu et al. 2020d).
Recently, hybrid ML methods have yielded promising results in many fields (Pham et al. 2018b; Nhu et al. 2020c), including LSM. Some of these methods are MultiBoost with Logistic Regression, Bagging with RAF, Random Subspace RAF and Rotation Forest with Logistic Model Tree (Fang et al. 2021). Improving predictive capability with different combinations of ML methods is a continuous process. Therefore, in this study, we explored two novel hybrid models, namely Bagging-RS (BRS) and AdaBoost-RS (ABRS), for LSM of the Son La hydropower reservoir basin, Vietnam, in order to select the best model. The performance of the models was evaluated using standard statistical methods, including the area under the receiver operating characteristic (ROC) curve (AUC). The results were also compared with two well-performing single ML methods, namely rough set (RS) and SVM. The main difference between this study and previous works is that, for the first time, the ensemble optimization techniques Bagging and AdaBoost were integrated with RS to develop hybrid models (BRS and ABRS) and explore the possibility of more accurate LSM. We used Weka software and GIS applications to prepare the data and generate the models.

Description of hydropower reservoir basin
The Son La hydropower reservoir area (21°15'15'' to 22°17'10'' N; 102°50'10'' to 104°35'15'' E) is located in the Da river basin in the north-western part of Vietnam. It covers an area of 5381 km² and falls within parts of Lai Chau, Dien Bien and Yen Bai provinces (Figure 1). This area is prone to landslides, as several landslide incidents have been recorded in the catchment area of the hydropower dam. A report of the Central Steering Committee for Natural Disaster Prevention and Control (CCNDPC) indicated that 10 people were killed and 4 injured by landslides and flash floods in the Muong La district of Son La Province alone; in addition, 258 houses were damaged, of which 179 were completely washed away (Vietnam's Central Steering Committee for Natural Disaster Prevention and Control 2017 report).
The topography of the area is hilly, with narrow river valleys and wide plateaus. The hills are aligned in a northwest to southeast direction. Geologically, the area is complex, with folded and faulted rocks indicating tectonic activity in the region. Rock types present in the study area include magmatic and metamorphic rocks (biotite gneiss, biotite-granite-cordierite-sillimanite, quartz-biotite-plagioclase-hypersthene schist, two-pyroxene gneiss, biotite-sillimanite-cordierite-granite gneiss, gneiss biotite-granite, quartz-biotite-sillimanite) and sedimentary rocks (shale, sandstone, siltstone). The major part of the area (75%) is covered by forest land.
The Son La hydropower catchment area has a varied climate, with temperatures ranging from −4.7 °C to 41.8 °C (average 21.4 °C). The hottest season lasts from June to August, while the coldest falls in November and December. The annual average relative humidity fluctuates between 78% and 93%. There are two distinct seasons in the year: the rainy season, which lasts from April to September, and the dry season, from October to March of the following year. Rainfall is unevenly distributed throughout the year. The average number of rainy days per year ranges from 114 to 118. Up to 80% of the annual rainfall falls during the rainy season (May-September). The June-August period has the highest rainfall (>300 mm/month). Most of the landslides occur during July and August.

Landslide inventory
A landslide inventory is one of the prerequisites for LSM. A landslide inventory map provides important information on landslide sites, types of landslide, landslide modulation, and the causes and triggers of landslides (i.e. earthquakes, heavy rainfall and rapid snowmelt) (Tien . Researchers have found that the incidence of past landslides in an area has a strong relationship with the future occurrence of landslides (Galli et al. 2008). Therefore, landslide inventory mapping is a significant and necessary part of landslide studies. In this study, a total of 186 past landslide events were mapped in the study area with the help of Google Earth images and field studies (Figures 1 and 2).
In the study area, rock, debris and mixed types of landslides have been reported. About 80% of the landslides in this area are of small (<200 m³) or medium (200-1000 m³) size, whereas 20% are large sliding blocks (>1000 m³). For the modeling, an equal number of non-landslide points was randomly placed across the study area at locations where no past landslide events have been recorded. This procedure for selecting the non-landslide points follows the literature (Tien Bui et al. 2016). Non-landslide points were assigned the value '0' and landslide points the value '1' in the landslide susceptibility modeling. The landslide data were split in a 70:30 ratio for training (70%) and testing/validation (30%) of the models. This split ratio was selected based on experience and the literature to obtain better model performance (Sahin et al. 2020).
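The sampling and split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the point identifiers, the pool of candidate landslide-free locations, the random seed and the function name `split_inventory` are hypothetical, and the split is done per class so that both sets stay balanced, which the source does not specify.

```python
import random

def split_inventory(landslide_ids, candidate_non_landslide_ids, train_fraction=0.7, seed=42):
    """Balance landslide (1) and non-landslide (0) samples, then split them 70:30."""
    rng = random.Random(seed)
    # Draw as many non-landslide points as there are landslide points.
    non_landslides = rng.sample(candidate_non_landslide_ids, len(landslide_ids))
    train, test = [], []
    for ids, label in ((landslide_ids, 1), (non_landslides, 0)):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train += [(point, label) for point in shuffled[:cut]]
        test += [(point, label) for point in shuffled[cut:]]
    return train, test

landslides = [f"LS{i}" for i in range(186)]        # 186 mapped landslide events
candidates = [f"NL{i}" for i in range(1000)]       # hypothetical landslide-free locations
train, test = split_inventory(landslides, candidates)
print(len(train), len(test))
```

With 186 landslides, the 70% cut falls at 130 points per class, leaving 56 per class for testing.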

Landslide conditioning factors
In the present study, 12 landslide affecting factors, based on topography (slope degree, slope aspect, elevation, curvature), hydrology (river density, rainfall, aquifer), geology (weathering crust, lithology, focal flow and fault density) and anthropogenic variables (road density), were considered in the development of the models. A digital elevation model (DEM) (https://search.asf.alaska.edu/) with a raster resolution of 12.5 m was used to generate the topographical factor maps. Thematic maps of geology and the other factors were generated from data available from government organizations, the literature and Google Earth images. ArcGIS software was used to develop the thematic maps (Figure 3).

Elevation
Elevation is one of the important factors because it affects humidity and temperature regimes, soil moisture, rainfall patterns and weathering (He et al. 2012). Most hills with high elevations are the result of tectonic activity that deformed the rocks, and they are therefore more vulnerable to landslides (Ercanoglu and Gokceoglu 2004). The elevation map was prepared from the DEM in a GIS application (Figure 3(a)).

Slope degree
Slope is the angle that represents the vertical change versus the horizontal change between two points on the surface of the earth or a body (Moosavi and Niazi 2016). It is a measure of the steepness of a hill or of the ground. Slope degree acts as a causal variable for the rate of landslide occurrence (Van Den Eeckhaut et al. 2006) and is widely used in landslide studies (Reichenbach et al. 2018). It is one of the important factors because landslides generally occur on moderate to steep slopes due to the downward gravitational effect on the sliding mass. The slope degree map was generated from the DEM and divided into classes using the 'Slope' tool in a GIS application (Figure 3(b)).
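To illustrate how a slope angle follows from the vertical and horizontal change in a DEM, the sketch below computes slope in degrees for the interior cells of a small grid. It uses simple central differences, which is an assumption made for brevity; the 'Slope' tool used in the study may apply a different neighbourhood scheme, and the function name `slope_degrees` is hypothetical.

```python
import math

def slope_degrees(dem, cell_size):
    """Slope in degrees for interior cells of a DEM grid, using central differences."""
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]   # border cells are left undefined
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            # Steepest rise over run, converted from a gradient to an angle.
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope

# A plane that rises 12.5 m per 12.5 m cell in the x direction.
dem = [[0.0, 12.5, 25.0],
       [0.0, 12.5, 25.0],
       [0.0, 12.5, 25.0]]
print(slope_degrees(dem, cell_size=12.5)[1][1])
```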

Focal flow
Focal flow is used to identify potential streams such as waterways and rivers (Park et al. 2016). It represents the paths along which water moves downslope and is determined by evaluating the eight neighboring cells of each cell in the raster image (DEM). It is one of the important factors for identifying and delineating streams and flow directions in landslide studies. The focal flow map of the study area was created from the DEM using the focal flow tools of ArcGIS (Figure 3(c)).

Aspect
Aspect, the direction in which a slope faces, is one of the important factors in landslide occurrence because it is related to solar radiation and moisture conditions (topographic wetness) (Chen and Chen 2021; Van Phong et al. 2020). The slope aspect of the study area was extracted from the DEM and divided into nine classes: (1) flat, (2) north, (3) northeast, (4) east, (5) southeast, (6) south, (7) southwest, (8) west and (9) northwest using the 'Aspect' tool of a GIS application (Figure 3(d)).
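The nine aspect classes can be derived from an aspect angle as sketched below. The sketch assumes the common GIS convention that aspect is measured in degrees clockwise from north and that flat cells carry a negative value; the 45-degree sectors centred on each compass direction and the function name `aspect_class` are likewise assumptions, not details given in the source.

```python
def aspect_class(aspect_deg):
    """Map an aspect angle (degrees clockwise from north, negative = flat) to one of nine classes."""
    if aspect_deg < 0:
        return "flat"
    names = ["north", "northeast", "east", "southeast",
             "south", "southwest", "west", "northwest"]
    # Each sector spans 45 degrees and is centred on its compass direction.
    index = int(((aspect_deg % 360) + 22.5) // 45) % 8
    return names[index]

for angle in (-1, 0, 90, 200, 350):
    print(angle, aspect_class(angle))
```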

Curvature
Curvature has been mostly employed in landslide research related to water flow (Mersha and Meten 2020;Ghasemian et al. 2020;Thanh et al. 2020). Curvature affects landslides because curvature can control the flow of water on the ground. Runoff is more on convex surface in comparison to concave surface where water accumulates more (Firth and Whitlow 1991). The curvature map was derived from the DEM map with different classes using 'Curvature' tool of GIS application (Figure 3(e)).

Fault density
Fault density determines the degree of rock weakness (Nampak et al. 2014); it is therefore an important factor in the dislodging of rock blocks from the surface, which causes landslides. The fault density variable was created from the geological map using the 'line density' tool of ArcGIS (Figure 3(f)).

Rainfall
Rainfall is the most important factor in causing and triggering landslides (Yang and Adler 2008). Determining the amount and duration of rainfall is essential in landslide studies. Intense rainfall can cause sudden landslides, depending on the topographical and geological conditions of the ground or rock mass. The rainfall map was prepared in four classes from meteorological data for the catchment area of the dam (Figure 3(g)). Average annual cumulative rainfall for the period 2016-2020 was used to generate the rainfall map. These data were obtained from six measuring stations located in and around the study area: Nam Giang and Pa Tan (Lai Chau province), Nam He (Dien Bien province), and Quynh Nhai, Thac Vai and Bo Sinh (Son La province). The mean multi-annual rainfall sum is commonly used in landslide susceptibility studies (Chen et al. 2018a; Zhao et al. 2019).

River density
River density is related to fractures in tectonically disturbed areas and plays an important role in the runoff and infiltration of water, and thus in evaluating the sliding stability of the area (Abedini et al. 2019a). The river density map of the area was prepared by delineating the river network and processing it in a GIS application using the 'density' tool of the Arc Hydrology extension (Figure 3(h)).

Weathering crust
Weathering crust is important in landslide studies because its nature and thickness affect shallow landslides (Van Tu et al. 2016). Weathered rocks are easily eroded and vulnerable to sliding, especially during rains. In this study, the weathering crust map was extracted from the 1:100,000 weathering crust map of north-western Vietnam (Figure 3(i)).

Lithology
The lithology of rocks plays an important role in landslide studies (Segoni et al. 2020). Lithology represents the general physical characteristics (colour, texture, grain size and mineral composition) of rocks. The weathering, porosity and permeability of rocks depend on the lithology of the rock strata, and the strength of rocks depends on the lithology as well as on structural conditions. Therefore, the lithology map of the study area was generated from the 1:100,000 geology map of north-western Vietnam for the modeling (Figure 3(j)).

Aquifer
An aquifer is a water-bearing permeable stratum of rock or sediment whose pores are saturated with water. It is important to identify aquifers in the study area: if they are located at shallow depth, they may cause heavy seepage and leakage through the ground surface, eroding the ground mass and generating pressure on slope faces, which can cause slope failures and landslides. In this study, the aquifer map (1:100,000) was extracted from the geohydrological map of the study area (Figure 3(k)).

Road density
Road density is the ratio of the length of the road network to the area considered (here, the study area). It is one of the important factors in landslide studies (Nhu et al. 2020b). Areas with a high density of roads are more vulnerable to landslides because the surrounding ground mass is disturbed during road construction, which destabilizes the slopes. The road density map was generated from the available road maps of the area and Google Earth images (Figure 3(l)).

Methods used
Rough set
The RS method was first introduced by Pawlak (Pawlak 1982). It is a mathematical tool for dealing with imprecise, inconsistent and incomplete information and knowledge, and it is now used in soft computing for data mining in many fields. RS theory helps in solving problems involving uncertainty: an imprecise concept is described by two precise boundaries, its lower and upper approximations. It has great potential for analogical logic modeling (Skowron and Dutta 2018). Regarding the approximation of concepts over a universe U_1, we suppose that the concepts are perceived only through some subsets of U_1, called samples. This is a typical situation in ML, pattern recognition and data mining (Zhang et al. 2016). An information system IS = (U, R) is assumed to be given (where U ⊆ U_1), and for some C ⊆ U_1 it is a complex
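The idea of bracketing an imprecise concept between two precise sets can be illustrated with a small sketch of the lower and upper approximations from rough set theory. The toy attribute table, the object names and the function name `approximations` are hypothetical and serve only as an illustration; this is not the RS classifier configuration used in the study.

```python
from collections import defaultdict

def approximations(table, attributes, target):
    """Return the lower and upper approximations of `target` (a set of object ids)."""
    # Objects with identical values on `attributes` are indiscernible from each other.
    classes = defaultdict(set)
    for obj, values in table.items():
        classes[tuple(values[a] for a in attributes)].add(obj)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:      # block lies entirely inside the concept
            lower |= block
        if block & target:       # block overlaps the concept
            upper |= block
    return lower, upper

table = {
    "p1": {"slope": "steep", "rain": "high"},
    "p2": {"slope": "steep", "rain": "high"},
    "p3": {"slope": "gentle", "rain": "high"},
    "p4": {"slope": "gentle", "rain": "low"},
}
landslide = {"p1", "p3"}   # the concept to approximate
lower, upper = approximations(table, ["slope", "rain"], landslide)
print(sorted(lower), sorted(upper))
```

Here p1 and p2 cannot be told apart, yet only p1 is a landslide point, so both fall in the boundary region: they belong to the upper approximation but not to the lower one.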

AdaBoost
Adaptive Boosting (AdaBoost) is an ensemble boosting method in which several classifiers are combined to achieve the highest possible prediction accuracy. It is an iterative method: over many iterations, it combines several weak classifiers to create a single strong classifier. In each iteration, AdaBoost adjusts the weights of the classifiers and of the training samples. Any ML algorithm can be used as the base classifier (Xiao et al. 2019). In other words, to improve ML learners and to compensate for their weaknesses on the training examples, weak classifiers are trained repeatedly and combined into a strong classifier, so that errors are reduced in each iteration and a stronger classifier is produced (Uyanık et al. 2020). The prediction provided by AdaBoost is as follows (Eq. (1)) (Wan et al. 2013):
F(x) = \mathrm{sign}\left( \sum_{m=1}^{M} C_m f_m(x) \right)    (1)
where F(x) is the resulting prediction, M is the number of weak classifiers, the sign indicates a positive or negative prediction, f_m(x) is a weak classifier that generates either a positive or a negative prediction, and C_m is a coefficient computed from the learned weights.
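A minimal sketch of this weighted vote is given below, using discrete AdaBoost with one-feature threshold stumps as the weak classifiers. The stump learner, the toy data and the function names are assumptions chosen for brevity; in the study the base classifier is RS and the models were built in Weka.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier f_m(x): returns +1 or -1."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Pick the threshold/polarity pair with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Discrete AdaBoost on a single feature; labels must be +1 or -1."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []                                   # (C_m, threshold, polarity)
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against division by zero
        c_m = 0.5 * math.log((1 - error) / error)
        ensemble.append((c_m, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest, renormalise.
        weights = [w * math.exp(-c_m * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """F(x) = sign(sum over m of C_m * f_m(x))."""
    score = sum(c * stump_predict(x, thr, pol) for c, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]       # no single threshold separates these labels
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump classifies this toy set correctly, but the weighted vote of three stumps does, which is the point of combining weak classifiers.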

Bagging
Bootstrap aggregating (Bagging) is employed to reduce the instability of classification schemes. Bagging has been shown to be very sensitive to variations in the training data, which helps to boost the classification accuracy of a base decision tree classifier by decreasing the variance of the classification error (Weber et al. 2020). The rationale of the Bagging algorithm is that the accuracy of a single ML algorithm is often not high, so the base ML algorithm is trained several times and the outputs of the resulting models are combined into the final output, which improves the prediction accuracy of the final model. Another advantage of this method is that it creates a strong classifier (Yang et al. 2021).
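The procedure can be sketched as follows: draw bootstrap samples with replacement, train one base classifier per sample, and combine the outputs by majority vote. The threshold learner below is only a stand-in for the RS base classifier, and the data, seed, number of models and function names are hypothetical.

```python
import random
from collections import Counter

def bagging_fit(train_fn, samples, n_models=10, seed=0):
    """Train `n_models` base classifiers, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_fn(bootstrap))
    return models

def bagging_predict(models, x):
    """Combine the base classifiers by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def train_threshold(samples):
    """Toy base learner: threshold halfway between the two class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos or not neg:                 # degenerate bootstrap: only one class was drawn
        only = 1 if pos else 0
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= threshold else 0

# (slope in degrees, label) pairs, hypothetical
samples = [(30, 1), (35, 1), (40, 1), (45, 1), (5, 0), (10, 0), (15, 0), (20, 0)]
models = bagging_fit(train_threshold, samples)
print(bagging_predict(models, 50), bagging_predict(models, 2))
```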

Support vector machine
SVM has been one of the most popular supervised ML algorithms since the 1990s (Pisner and Schnyer 2020). SVM performs classification, regression and outlier detection by constructing a separating boundary between two classes. It uses a kernel to enlarge the feature space and to quantify the similarity of two observations (James et al. 2014), and it finds the largest margin between the two classes in the feature space. A generic SVM model can be a two-class or a multi-class model (a combination of a series of two-class SVMs). The two-class SVM is the most widely used ML model (Tien Bui et al. 2012). It is based on the principle of structural risk minimization. SVM algorithms are divided into two main categories, the support vector machine classification model (SVM_CM) and the support vector machine regression model (SVM_RM), each with its own characteristics: SVM_CM is used to separate data belonging to different classes, whereas SVM_RM is used to solve prediction problems. If the data are linearly separable, the formulation is given in Eq. (2). When the data are not linearly separable, the samples can be preprocessed by mapping them into a high-dimensional nonlinear space. If a symmetric kernel satisfying Mercer's conditions is used in the low-dimensional input space, it can be regarded as an inner product in the high-dimensional space, which considerably reduces the computation (Burges 1998). In this case, Eq. (2) is replaced by Eq. (3). The kernel function k(x_i, x_j) is used to create machines with various kinds of nonlinear decision boundaries in the inner-product data space (Table 1).

OneR feature selection
OneR (One Rule) is a simple yet accurate classification algorithm. For each predictor, a rule is created by constructing a frequency table against the target; the rule with the smallest total error is then selected as the 'one rule'. OneR is an efficient selection method that quantifies the statistical relationship between an output variable and a set of selected input variables. A OneR rule is generated separately for each variable in the training dataset, and the variables are ranked according to their significance in solving the landslide prediction problem (Jain and Singh 2018). In the present study, the OneR feature selection technique was used to select suitable affecting variables. The average merit (AM) generated by OneR is the selection criterion: it numerically expresses the importance and rank of each variable. The variables are then ranked by AM in descending order.
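The rule construction described above can be sketched as follows: build a frequency table for each predictor, predict the majority class for each value, and keep the predictor whose rule makes the fewest errors. The toy rows and the function name `one_r` are hypothetical; the AM used in the study is computed by the software with cross-validation, which this sketch does not reproduce.

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """Return (best_feature, rule, errors); `rule` maps each feature value to a class."""
    best = None
    features = [key for key in rows[0] if key != target]
    for feature in features:
        # Frequency table: feature value -> counts of each target class.
        table = defaultdict(Counter)
        for row in rows:
            table[row[feature]][row[target]] += 1
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        # Every sample outside the majority class of its value is an error.
        errors = sum(sum(counts.values()) - max(counts.values())
                     for counts in table.values())
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best

rows = [
    {"slope": "steep", "aspect": "north", "landslide": 1},
    {"slope": "steep", "aspect": "south", "landslide": 1},
    {"slope": "steep", "aspect": "north", "landslide": 1},
    {"slope": "gentle", "aspect": "south", "landslide": 0},
    {"slope": "gentle", "aspect": "north", "landslide": 0},
    {"slope": "gentle", "aspect": "south", "landslide": 1},
]
feature, rule, errors = one_r(rows, "landslide")
print(feature, rule, errors)
```

In this toy table the slope rule misclassifies one sample and the aspect rule two, so slope is selected.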

Validation indicators
In this study, various statistical indicators, including Positive Predictive Value (PPV), Negative Predictive Value (NPV), Sensitivity (SST), Specificity (SPF), Accuracy (ACC), Kappa, Area under the ROC Curve (AUC) and Root Mean Square Error (RMSE), were used to validate and compare the models. These criteria are computed from a confusion matrix, which classifies the model output into presence and absence classes of a particular phenomenon according to a threshold (e.g. a cutoff value) (Naderpour and Mirrashid 2020). The PPV is the ratio of correctly classified landslide pixels to all pixels classified as the landslide (LS) class, while the NPV is the ratio of correctly classified non-landslide pixels to all pixels classified as the non-landslide (NLS) class. The SST is the ratio of correctly classified landslide pixels to the sum of the pixels correctly classified as LS and those wrongly classified as NLS. The SPF is the ratio of correctly classified non-landslide pixels to the sum of the pixels correctly classified as NLS and those wrongly classified as LS. The ACC describes the overall performance of a predictive model and is computed as the proportion of landslide and non-landslide pixels that are correctly classified (Pham et al. 2018a). The RMSE measures the discrepancy between observed and estimated data (Costache et al. 2020; Nhu et al. 2020d). The Kappa index (k) is a statistical measure of the agreement between two raters who each classify all landslide and non-landslide pixels into two mutually exclusive classes, LS and NLS (Costache et al. 2020). The ROC curve, a graph of the true positive rate (sensitivity) against the false positive rate (1 − specificity), can be used as a statistical indicator of the overall performance of the algorithms (Cabrera and Lee 2019).
In this article, the ROC curve describes the ability of the algorithms to correctly predict the occurrence of landslides. It is plotted with the sensitivity on the Y-axis against 1 − specificity on the X-axis. The closer the area under this curve (AUC) is to 1, the higher the accuracy (Pallant 2013): values of 0.5-0.6 indicate a negligible predictive algorithm, 0.6-0.7 fair, 0.7-0.8 good, 0.8-0.9 very good and >0.9 excellent (Hanley and McNeil 1982). The formulas for the evaluation indicators described above are given below (Eqs. (4)-(11)) (Costache et al. 2020; Wu et al. 2020):
PPV = \frac{TP}{TP + FP}
NPV = \frac{TN}{TN + FN}
SST = \frac{TP}{TP + FN}
SPF = \frac{TN}{TN + FP}
ACC = \frac{TP + TN}{TP + TN + FP + FN}
k = \frac{P_a - P_{est}}{1 - P_{est}}
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( X_{MODEL} - X_{ACT} \right)^2}
AUC = \frac{\sum TP + \sum TN}{P + N}
where TP: True Positive, TN: True Negative, FP: False Positive and FN: False Negative. TP and TN are the numbers of landslide and non-landslide pixels that are correctly classified as LS and NLS, respectively. FP and FN are the numbers of pixels that are wrongly classified as the LS and NLS classes, respectively; P_a is the relative observed agreement between raters and P_est is the hypothetical probability of chance agreement; X_MODEL and X_ACT denote the modeled and actual values, respectively, and n is the number of samples; P is the total number of landslide pixels and N is the total number of non-landslide pixels.
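The confusion-matrix indicators defined in this section can be computed as in the sketch below. The chance-agreement term uses the standard Cohen's kappa expression, which is an assumption because the source defines P_est only verbally; the counts in the example and the function names are hypothetical.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """PPV, NPV, SST, SPF, ACC and Kappa from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    # Chance agreement: product of predicted and actual class shares, summed over both classes.
    p_est = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "SST": tp / (tp + fn),
        "SPF": tn / (tn + fp),
        "ACC": acc,
        "Kappa": (acc - p_est) / (1 - p_est),
    }

def rmse(modeled, actual):
    """Root mean square error between modeled and actual values."""
    return math.sqrt(sum((m - a) ** 2 for m, a in zip(modeled, actual)) / len(actual))

metrics = confusion_metrics(tp=40, tn=45, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"RMSE: {rmse([0.5, 0.5], [1, 0]):.3f}")
```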

Methodology
The main steps of the methodology for developing the landslide susceptibility maps and evaluating the models are presented below (Figure 4):
1. Selection of variables affecting the occurrence of landslides in the study area;
2. Identification and delineation of landslides (represented by points) in the study area;
3. Prioritization of landslide affecting factors (variables) using the OneR feature selection method;
4. Out of the total landslide occurrence points, 70% were used for spatial modeling (ABRS, BRS, RS, SVM) and the remaining 30%, which were randomly selected, were used to evaluate the predictive accuracy of the models;
5. To compare the models and select the best performing model, various evaluation criteria were used: PPV, NPV, SPF, SST, ACC, Kappa, AUC and RMSE;
6. Computation of the landslide susceptibility maps using the hybrid and standalone machine learning models.

Results and discussion

Variable selection using OneR method
In spatial modeling of landslide susceptibility, the most important variables are selected and variables with minor influence are eliminated during the modeling process (Chen and Chen 2021). The ranking of the variables for predicting landslide susceptibility zones, obtained using the OneR method, is presented in Figure 5. The results show that focal flow, elevation and lithology are the three most important variables in the occurrence of landslides. The importance of the variables is quantified using the Average Merit (AM), which is based on 10-fold cross-validation (Nhu et al. 2020d). Focal flow, with an AM of 71.71, is the most important variable for the occurrence of landslides in this area, followed by elevation and lithology with values of 66.14 and 63.35, respectively. In the study of Dai and Xu (2013), elevation was likewise reported as the most important factor affecting landslide occurrence.
Depending on the geology, soil, topography, climate and land use, landslide affecting factors may vary from region to region (Nasiri et al. 2019). It is important to note that the ranking of most landslide affecting factors depends on the local geo-environmental conditions. In this study, focal flow is the most important factor, as it relates directly to the flow and infiltration of water on susceptible ground, causing landslides where rainwater acts as the triggering factor. This may be because most of the landslides in this area occur during the rainy season.
Figure 5. Ranking of the conditioning factors (variables) using OneR method.

Training and validation of the models
In this study, we used two hybrid ML models (ABRS and BRS) and two single ML models (RS and SVM) to develop the landslide susceptibility maps. These models were trained on 70% of the data and validated on the remaining 30%. In the hybrid models, Bagging and AdaBoost were used to optimize the training dataset used to train the RS base classifier. The performance of the models depends on the hyper-parameters used to train them. In this study, the optimal values of the hyper-parameters were selected by trial and error (Table 2). The models were validated using various statistical indicators on both the training and testing datasets (Table 3 and Figures 6 and 7). In term of training  (Table 3 and Figure 6).
[Table 2, recoverable entries only (values listed in the order of the model columns, whose headers are missing): Confidence level for interval difference: 0.9, -, -, -; Minimal frequency: 6, -, -, -; Minimal number of intervals: 3, -, -, -; Number of intervals: 5, -, -, -; Discretization: maximal discernibility heuristic local, -, -, -; Indiscernibility for missing: discern from value; Number of decimal places: 2, 2, 2, 2; Reducts: all local; Number of iterations: -, 13, 10; Seed; Number of execution slots; Tolerance of the terminal criterion: -]
The ML algorithms were also validated using the ROC curve, as shown in Figure 7. On the training dataset, ABRS has the highest AUC (0.929) compared with BRS (0.882), RS (0.881) and SVM (0.877), while on the testing dataset BRS has the highest AUC (0.845) compared with ABRS (0.836), RS (0.826) and SVM (0.807). In addition, the statistical significance of the differences between the model results was investigated using the pairwise comparison proposed by Hanley and McNeil (1982). This test uses the AUC values, the number of landslide points and the number of non-landslide points. The main parameter indicating the presence or absence of a statistically significant difference is the p-value: a difference between two AUC values is considered statistically significant when the p-value is lower than 0.05 (Hanley and McNeil 1982). As shown in Table 4, the only statistically significant differences are those between the AUC (success rate) of ABRS and the AUCs of the other three models. From the results of this study, we can state that optimization techniques such as Bagging and AdaBoost can help to improve the performance of a base classifier such as RS, as the performance of both hybrid models (ABRS and BRS) is better than that of the two single models (RS and SVM) (Table 3 and Figures 6 and 7). This is reasonable, as these optimization techniques are able to produce an optimal training dataset for training the hybrid models with RS as the base classifier.
In this study, BRS outperforms ABRS; Bagging therefore appears to be a better optimization technique than AdaBoost. Bagging has several advantages: (1) it can effectively avoid the instability of a classification algorithm by combining multiple base classifiers; (2) it uses random sampling with replacement, which can generate multiple training datasets containing the most useful information of the original training data; and (3) the multiple base classifiers are trained with the same classification algorithm, so the process is not very complex (Sun et al. 2018). In addition, the results of this study are comparable with published works showing that Bagging is more effective than AdaBoost in landslide susceptibility mapping (Bui et al. 2014. In general, the use of statistical models during input, output and spatial analysis processes is time consuming, whereas ML algorithms have the advantage of automatically detecting relationships between dependent and independent variables (Yilmaz 2010). Many researchers around the world have used ML algorithms for landslide susceptibility mapping (Goetz et al. 2015; Pourghasemi and Rahmati 2018; Nam and Wang 2020). However, the predictive accuracy of some of these techniques is still questionable. Choosing the best model among different ML models plays an essential role in assessing landslide susceptibility (Javidan et al. 2019).
Combined ML algorithms have the following advantages: (1) they automate procedures; (2) the researcher can browse several large databases and gather noteworthy insights; (3) instead of being confined to one classifier, they replicate a large number of classifiers while randomly changing the input data; and (4) individual models can be combined into one, creating a more consistent rule or prediction scheme. Compatibility with different parameters, for example different environmental conditions, is usually desired to support planning, particularly in large areas with varied lithology, topography and climate (Camilo et al. 2017).
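A pairwise AUC comparison of this kind can be sketched as below, using the standard error of the AUC given by Hanley and McNeil (1982) and a two-sided z-test. Several points are assumptions: the correlation `r` between the two AUCs defaults to 0, which treats them as independent although the models here were evaluated on the same points; the sample counts of 130 per class are simply 70% of 186 rounded; and the function names are hypothetical. The output is therefore not expected to reproduce Table 4.

```python
import math

def auc_standard_error(auc, n_pos, n_neg):
    """Standard error of an AUC following Hanley and McNeil (1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (auc * (1 - auc)
                + (n_pos - 1) * (q1 - auc ** 2)
                + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(variance)

def compare_aucs(auc_a, auc_b, n_pos, n_neg, r=0.0):
    """Two-sided z-test for the difference between two AUCs; r is their correlation."""
    se_a = auc_standard_error(auc_a, n_pos, n_neg)
    se_b = auc_standard_error(auc_b, n_pos, n_neg)
    z = (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2 - 2 * r * se_a * se_b)
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail of the standard normal
    return z, p_value

# Training AUCs of ABRS and BRS reported in the text; sample counts are assumed.
z, p = compare_aucs(0.929, 0.882, n_pos=130, n_neg=130)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Because models evaluated on the same points are positively correlated, setting `r` to 0 makes the test conservative; a positive `r` shrinks the denominator and yields a smaller p-value.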

Preparation of landslide susceptibility maps
Landslide susceptibility maps were prepared using the ABRS, BRS, RS and SVM models. In this study, we used the natural breaks (NB) method to classify the landslide susceptibility indices of all pixels of the study area generated during the training of the models (Zhou et al. 2018; Nhu et al. 2020d; Wang et al. 2020; Youssef and Pourghasemi 2020). The output maps obtained from the ML algorithms were classified into five categories: very low, low, moderate, high and very high susceptibility (Figure 8). Figure 9 shows the results of the frequency distribution analysis of the landslide susceptibility classes on the maps of the ABRS, BRS, RS and SVM models. The percentages of the area in the high and very high susceptibility classes are 22.72% and 6.177% for ABRS, 28.71% and 24.56% for BRS, 27.47% and 16.13% for RS, and 19.6% and 21.68% for SVM, respectively. In addition, the third graph in Figure 9 shows that the Frequency Ratio (FR) of landslides in the study area for the four algorithms is ABRS = 2.024, BRS = 2.109, RS = 1.992 and SVM = 1.647. The highest value is obtained for the BRS model, which indicates that the map generated by the BRS model is better than the maps of the other models (ABRS, RS and SVM).
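As an illustration of the frequency ratio, the sketch below computes FR for each susceptibility class as the share of landslides falling in the class divided by the share of the study area occupied by the class. This per-class definition is the common one and is assumed here; how the single FR value per model reported above was aggregated is not stated in the text. The counts and the function name `frequency_ratio` are hypothetical.

```python
def frequency_ratio(landslides_in_class, pixels_in_class):
    """FR per class = share of landslides in the class / share of the area in the class."""
    total_landslides = sum(landslides_in_class.values())
    total_pixels = sum(pixels_in_class.values())
    return {
        name: (landslides_in_class[name] / total_landslides)
              / (pixels_in_class[name] / total_pixels)
        for name in pixels_in_class
    }

# Hypothetical counts for a two-class map.
landslides = {"low": 10, "high": 90}
pixels = {"low": 500, "high": 500}
fr = frequency_ratio(landslides, pixels)
print(fr)
```

An FR above 1 means that landslides are over-represented in a class relative to its area, which is what is expected of the high and very high classes on a reliable map.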

Concluding remarks
The main objective of this study was to develop novel hybrid ML models (BRS and ABRS) and to evaluate their performance for accurate landslide susceptibility mapping of the Son La hydropower reservoir basin, Vietnam. The OneR method was used to select and rank the landslide affecting factors considered (slope degree, slope aspect, elevation, curvature, focal flow, river density, rainfall, aquifer, weathering crust, lithology, fault density and road density) for the development of the landslide susceptibility maps. The results show that focal flow had the greatest impact on the occurrence of landslides, followed by elevation and lithology. The accuracy of the maps was evaluated using several statistical measures: PPV, NPV, SPF, SST, ACC, Kappa and the AUC-ROC curve method. Statistical analysis of model performance shows that the novel hybrid BRS model performs best (AUC = 0.845) in mapping landslide susceptibility in the study area. Therefore, the BRS model can also be used in other landslide prone areas to generate accurate landslide susceptibility maps, taking local geo-environmental conditions into account.