Advanced data mining techniques for landslide susceptibility mapping

Abstract This paper describes the development and validation of landslide susceptibility models for mountainous regions using advanced data mining techniques. The investigation was carried out to ascertain the effectiveness of Naïve Bayes Multinomial (NBM) and Random Trees (RT) in landslide susceptibility mapping. The NBM is an advancement of the frequently used Naïve Bayes classifier, while the RT was built to overcome the limitations of traditional random forest classifiers. The geospatial database for this investigation comprises 148 landslide locations and ten (10) landslide conditioning factors. The factors (slope angle, slope elevation, slope aspect, plan curvature, profile curvature, lithology, soil type, stream power index (SPI), sediment transport index (STI), and rainfall) were selected using a multi-criteria decision-making (MCDM) technique. A frequency ratio (FR) analysis was used to obtain the relative significance of the factors in the slides. Predictive models were then developed using data mining techniques: 70% of the geospatial data was used as the training dataset, while the remaining 30% was used for validation. The SVM, RT, and NBM algorithms were applied to the training datasets, and the resulting predictions were used to develop the landslide susceptibility models. A comparative assessment of the two newer classifiers against the well-known traditional learning algorithm, the support vector machine (SVM), was conducted. Model performance measures such as AUROC, RMSE, F-measure, MAE, and ACC were employed to check the predictive capabilities and accuracies of the models. The indices indicated that the SVM model performed better than the other two algorithms on both the training and validation datasets.
Further analysis and comparison of the models reveal that the new data mining techniques are reliable for landslide susceptibility mapping. At the same time, the traditional algorithm remains useful and relevant, especially under similar site conditions. This study provides insights for better planning and development, the provision of mitigation strategies, and further analysis of landslides in the study area, particularly in cases of limited data availability.


Introduction
Landslides are generally considered a form of geo-hazard because they affect human lives either directly or indirectly (Collins and Znidarcic 2004; Gue and Tan 2006; Ibrahim MB et al. 2020). They occur when a soil or rock mass is displaced under the influence of gravity. This phenomenon causes great destruction of lives and property (Figure 1a-f), is a cause of concern to many governments, and instils fear in people living in places that are susceptible to landslides. Generally, there are at least five primary forms of landslide based on the nature of material displacement: rock falls, topples, slides, spreads, and flows (USGS 2004). Landslides are believed to have multiple complex causative and triggering factors that depend on the nature of the environment. When loss of life is directly involved, the phenomenon is labelled as hazardous. In uninhabited areas, landslides affect the environment, such as vegetation and farmland. They also endanger infrastructure built across rough terrain, such as pipelines, roads, highways, earth dams, retaining walls, housing, and small cities (Bacha et al. 2020; Chang Z et al. 2020; Díaz et al. 2020; Li L et al. 2020; Prakash et al. 2020).
Landslides are now a global problem; according to World Bank data, approximately 3.7 million km² of the earth's landmass is under serious threat from landslide activity (Mandal et al. 2021). The report also identified some 300 million people living within those highly landslide-susceptible areas, putting their lives in danger and disrupting massive economic activity within the areas. A recent study by Balogun et al. (2021) reported a total of 1,370 deaths and 784 injuries in 27 European countries alone from 1995 to 2014, in addition to a compensatory cost of about 4.7 billion euros in property loss. Another 200 million dollars was expended to restore damage from two landslide events (1974 and 1998) in Peru. From 1995 to 2004, the world recorded some 163,658 deaths and 11,689 injuries from landslide hazards alone, while from 2004 to 2016 China alone spent over one billion dollars on non-seismic landslides. Overall, the figures relating to landslide losses have kept rising despite governments' and individuals' efforts to curtail the phenomenon (Remondo et al. 2003; Petley 2012; Pourghasemi and Rahmati 2018; Althuwaynee et al. 2021).

Figure 1. (a) The event of a landslide within a community in the study area; (b) a road linking two towns completely cut off; (c) the devastating effects of landslides along a road section linking some communities in the state of Sarawak; (d) a gas pipeline in the study area that was ruptured and lit up the whole area, an event caused by landslide activity triggered by nonstop rain; (e) part of a road leading from Song bazaar to SMK Song, deep in the remote Kapit Division in Sarawak, cut off by a landslide; (f) the condition of a house after a landslide near the water treatment plant in Paitan; Bernama pic, January 15, 2021 (internet sources).
Due to the dynamic nature of landslides and their analysis, there have been improvements in studies that analyze landslides and provide early warnings of their occurrence in susceptible regions. In early landslide studies, geotechnical and geological data obtained mainly from laboratory experiments were used to produce physical models of susceptibility (Pourghasemi and Rahmati 2018). The limitations and challenges of the physical models (e.g. the time consumed in data collection and analysis, and the huge experimental cost, especially when analyzing larger areas) led to statistical analysis. Statistical models best correlate the dynamism in landslides between the many predisposing factors and the slides themselves (Balogun et al. 2021). To this effect, many statistical models such as the frequency ratio (Guzzetti et al. 1999; Gorsevski et al. 2006), fuzzy logic (Akgun et al. 2012; Balal and Cheu 2018; Chen W et al. 2017a; Hauser-Davis et al. 2012; Irvin et al. 1997; Tien Bui et al. 2017a), weight of evidence models (Chang M et al. 2020; Lee JH et al. 2018; Pamela et al. 2018; Polykretis and Chalkias 2018), logistic regression (Bai et al. 2010; Chen W et al. 2017b; Chen W et al. 2019; Pradhan 2010), the analytical hierarchy process (Althuwaynee et al. 2014; Saadatkhah et al. 2014; Mardani et al. 2015; Asadabadi et al. 2017), and many more have been utilized to produce susceptibility models. More recently, soft computing (data mining) procedures are increasingly being adopted for landslide susceptibility analysis due to their impressive performance. This approach combines GIS data and machine learning algorithms to statistically analyze landslides, predict them, and produce susceptibility maps (Li DQ et al. 2021; Saravanan et al. 2021).
The technique uses approximate values to produce very accurate and valuable solutions (Tsangaratos and Ilia 2017b; Moayedi et al. 2019; Goel 2020). Overall, models developed with this technique have shown high prediction performance and high success rates (Ayodele 2010; Oladipupo 2012; Goetz et al. 2015; Dickson and Perry 2016; Shirzadi et al. 2018; Ghorbanzadeh et al. 2019; Hegde and Rokseth 2020). The dynamic nature of landslides, with conditioning and triggering factors that vary across locations, has led researchers to explore different algorithms to harness the maximum prediction rate from soft computing techniques (Chen X and Chen W 2021; Diana et al. 2021; Saha et al. 2021; Youssef and Pourghasemi 2021). To date, many researchers are using machine learning algorithms to mine data and make valuable predictions of landslide occurrence. For instance, kernel logistic regression, naïve Bayes, and radial basis function networks have been used to produce landslide susceptibility maps, while Oh and Lee (2017) utilized artificial neural networks and boosted trees. Others (Dou et al. 2019; Fallah-Zazuli et al. 2019; Chen 2019; Hong et al. 2016; Lay et al. 2019; Lee S et al. 2017; Nhu et al. 2020a; Song et al. 2012; Tien Bui et al. 2016; Vafakhah et al. 2020; Vakhshoori et al. 2019; Liu Z et al. 2021; Oliva-González et al. 2019; Wang et al. 2021) have used different machine learning algorithms to mine data and produce susceptibility maps.
Studies assessing the comparative performance of various algorithms have revealed that the study environment and the conditioning factors under consideration affect the outcome of the analysis (Chen X and Chen W 2021; Hong et al. 2017; Mohammady et al. 2019; Liang et al. 2020; Sun et al. 2021; Tsangaratos and Ilia 2017a; Xu et al. 2016; Yeshwanth et al. 2019). Different predictive results have been recorded for the same machine learning algorithms used in different locations. For instance, Merghadi et al. (2020) reported that the RF algorithm outperformed the conventional SVM, even though both algorithms performed well by surpassing the 0.7 success-rate benchmark. Tien et al. (2020) suggested that deep learning neural networks perform better than the conventional learning algorithms, followed by the SVM and then the RF models. Contrary to the previous investigations, Achour and Pourghasemi (2020) found that the RF model performed better than both the SVM and the Boosted Regression Tree (BRT). However, most researchers have concluded that data mining techniques can be improved by exploring more study areas with different and enhanced newer machine learning algorithms (Balogun et al. 2021).
Thus, this paper explores two relatively new machine learning algorithms, the Naïve Bayes Multinomial (NBM) and Random Trees (RT). These algorithms are expected to perform better in developing landslide susceptibility maps in a mountainous region. Mountainous regions have often been characterized as data-scarce environments for soft computing analysis (Buijs et al. 2009; Lee JH et al. 2018; Marin et al. 2021). Data scarcity arises when the data or information needed for a successful analysis is not readily available; some forms of qualitative judgment, such as the weight of evidence (WoE) and the analytical hierarchy process (AHP), are then employed to augment the missing data (Ibrahim Sameen et al. 2019; Medwedeff et al. 2020; Marin et al. 2021), although the overall quality of the soft computing techniques depends on the quality and quantity of the data. Research has, however, found that some of these algorithms can perform remarkably well in such environments. For instance, the SVM algorithm is known to evade overfitting while handling less training data than other algorithms require (Rahmati et al. 2017; Ibrahim Sameen et al. 2019; Achour and Pourghasemi 2020). The NBM is an advancement of the frequently used Naïve Bayes classifiers, while the RT was built to overcome the limitations of the traditional forest classifiers, with the potential of enhancing the accuracy of results. The final models were compared with a conventional data mining algorithm, the Support Vector Machine (SVM), to assess the performance of the new algorithms. The choice of SVM for validation is due to its impressive performance in previous studies, particularly with fewer learning data on many occasions (Ghorbanzadeh et al. 2019; Goetz et al. 2015; Ibrahim Sameen et al. 2019; Lee JH et al. 2018; Marin et al. 2021; Nhu et al. 2020b; Vakhshoori et al. 2019).
The random trees (RT) classifier is obtained by combining the RF and DT approaches, with an additional step in computing the splits (Merghadi et al. 2020). Furthermore, the RT classifies the dataset decisions by providing additional subsets to tackle overlearning and overfitting, especially with insufficient data. Another advantage of this algorithm is that it can reduce the dataset variance more effectively, because the total learning sample is used to establish the trees in place of bootstrapping. The Naïve Bayes Multinomial was the second ensemble used in this research. The algorithm is built on Bayesian theorems, with improvements over the conventional Naïve Bayes (Chen W et al. 2017c). The idea is to see how well it performs in landslide prediction compared with the other two algorithms.
The landslide conditioning factors in this work were not randomly selected. A probabilistic feature selection technique was used to choose the factors influencing landslides in the study area (Cinelli et al. 2014; Yeshwanth et al. 2019; Jena et al. 2020). The factors were selected using the weight of evidence procedure, which is part of the MCDM technique. The technique pairs the factors against each other, placing the most influential factor above the others based on the supplied evidence (Pourghasemi et al. 2013b; Chen et al. 2016; Polykretis and Chalkias 2018). The selected factors were then scrutinized using the frequency ratio (FR). The FR further reduced the factors to the precise number needed to explain landslides within the study area. This factor selection procedure improves on the older ways of analyzing landslides with GIS data and machine learning. The discrepancies likely in this procedure, especially when allocating values, can be avoided through careful input of the weights and proper allocation of the discounts in the weight of evidence (Ghorbanzadeh et al. 2019; Wang Y et al. 2019; Mohan et al. 2020).
The rest of this paper is structured as follows. Section one (Introduction) discusses the importance of landslide analysis, highlights the various methods (physical, statistical, and data mining), and covers the limitations of existing techniques, the significance of data mining techniques, and the research gap. Section two presents details of the study area, describing its geology and geomorphology. A detailed discussion of the adopted methodology follows in the third section. The results are discussed in section four. Finally, conclusions and recommendations are provided in the last section of the paper.

The study area
The study area for this research is a region of mountainous terrain with a gas pipeline that transports natural gas from Sabah to Sarawak in Malaysia (Figure 2). The site lies within latitudes 0°02′45″N and 01°32′45″N and longitudes 105°24′05″E and 106°10′45″E, covering some 3,811.9 km². The area has a population of about 35,300 according to the year 2000 census report. Being a transit district, the major roads that link the capitals of the two states, Sabah and Sarawak, remain busy almost throughout the day. Economic activities in the area comprise some apple cultivation as well as palm plantations in the highlands. The region is well known for its rough mountainous terrain, measuring over 1,800 m above mean sea level at some points. The climate in Lawas is that of a typical rain forest, an equatorial climate common to areas situated within 10° to 15° latitude of the equator. Temperatures are high, and it is very humid at some points in the year. An annual average high of about 30 °C remains the peak temperature, while an annual low of 24.4 °C is also attainable within the study area. The rainfall in this area is a typical northern Borneo monsoon rain that falls intensely from September to March. High annual rainfall of up to 4,178 mm can be recorded within the mountainous regions of Lawas. These rains are sometimes characterized as continuous downpours because they usually fall for several hours or even days.
Geologically, Lawas is classified as stable, meaning no seismic activity has yet been recorded in the area. The geological composition consists of thick, sequenced layers of Eocene-Oligocene grey-bluish fine- to medium-grained sandstones, together with a formation of red/grey shales forming the soil beds traced to the 'Crocker Formation'. The Crocker Formation has strata that extend towards the northeast, forming steeper terrain as it moves to the east or west. Lithological units in the area comprise thick to very thick-bedded rock units of sandstones and interbedded shales. Rock stratification in this area falls into two categories: the 'sandy sequence' and the 'shale sequence'. An extension of low-lying flat land lies towards the coastal regions and extends into wetlands and swamps, while the hilly regions provide most of the inhabited land. The geology and geomorphology of the study area classify it as an area where landslides can quickly occur with little trigger.

Data and methods
The methodology adopted for this research is shown in Figure 3. The work starts with data collection after an extensive literature review on landslides and their analysis. The identification of the landslide points leads to the development of an inventory map. This inventory was developed using historical landslide records, interpretation of satellite images, and site visitation reports for the past ten years. A total of 148 landslide locations were identified in the study area; these locations served as the reference for the training and validation datasets. To avoid the bias associated with the probabilities, the same number of non-landslide locations was identified. A Digital Terrain Model (DTM) of the study area with a high resolution of 50 × 50 cm was used to derive all the spatial factors contributing to landslides in the area over the past ten (10) years. The datasets were randomly divided in a 70%-30% ratio (Hegde and Rokseth 2020; Mohan et al. 2020): as stated earlier, 70% of the total data is used to train the algorithms, and the remaining 30% is used as validation data. Splitting the datasets into training and validation (testing) sets is necessary for data mining analysis; the idea is to use most of the data for training while the smaller part validates the trained models. The selection of landslide and non-landslide locations is conducted to ensure similarity in the data when it is split in two. The validation datasets are used to make predictions against the trained models; their attribute values are already known, making it easy to identify the correctness of the predictions.
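The balanced sampling and 70%/30% split described above can be sketched as follows. This is a minimal illustration on synthetic placeholder data; the array contents are assumptions, not the study's actual factor values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical balanced sample set: 148 landslide + 148 non-landslide points,
# each row holding the ten conditioning-factor values, with a 0/1 label.
n_samples = 296
X = rng.random((n_samples, 10))          # placeholder conditioning-factor values
y = np.repeat([1, 0], n_samples // 2)    # 1 = landslide, 0 = non-landslide

# Random 70%/30% split into training and validation subsets.
idx = rng.permutation(n_samples)
n_train = int(0.7 * n_samples)
train_idx, valid_idx = idx[:n_train], idx[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_valid, y_valid = X[valid_idx], y[valid_idx]
```

Shuffling before splitting keeps the two subsets statistically similar, which is the stated aim of the balanced location selection.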
A clustering method was applied to separate the landslide data from the non-landslide data. The K-means algorithm can place and group data around a specified cluster centroid (Chang Z et al. 2020; Keyport et al. 2018). The datasets were partitioned into the specified number of clusters (landslides and non-landslides), the clustering proceeding by grouping the datasets into the predefined clusters from the centroids. The landslide conditioning factors were selected using modified feature selection methods (Goetz et al. 2015; Huang and Zhao 2018; Ibrahim MB et al. 2020). Ten landslide conditioning factors, namely slope angle, slope elevation, slope aspect, profile curvature, plan curvature, lithology, soil type, STI, SPI, and rainfall, were used for the analysis. Table 1 lists the data sources, their classification, and the type of data or model obtained, while Table 2 gives details of the soil and geological formations of the study area. Factors associated with the ground surface are usually developed from terrain models such as Digital Terrain Models (DTM) or Digital Elevation Models (DEM). Similarly, the remaining landslide spatial models were created from topographical maps and charts, and the landslide spatial models related to weather and climate were developed using detailed and up-to-date weather records of the study area.
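A minimal sketch of the K-means grouping step, on two synthetic, well-separated point groups standing in for the landslide and non-landslide samples (the data and parameter choices are illustrative only, not the study's implementation):

```python
import numpy as np

def kmeans(points, k=2, iters=50, seed=0):
    """Minimal K-means: assign each point to its nearest centroid,
    then recompute the centroids, for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distances from every point to every centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids, keeping the old one if a cluster empties
        centroids = np.array([points[labels == j].mean(axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids

# two well-separated synthetic groups standing in for the
# landslide and non-landslide samples
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 0.1, size=(30, 2))
group_b = rng.normal(1.0, 0.1, size=(30, 2))
labels, centroids = kmeans(np.vstack([group_a, group_b]))
```

With two predefined clusters, each sample ends up attached to the centroid of its own group, mirroring the landslide/non-landslide partitioning described above.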

Factor selection process
The literature has yet to establish how many factors should be used in a landslide analysis, or how these factors should be drawn. The reason is that problems associated with landslides are always complex, depending on the nature of the environment (Truong et al. 2018). Therefore, the model quality of any landslide analysis depends on the quality of these factors. In recent years, however, statistical interpolations have been employed to help select relevant factors for the analysis, with overwhelming results, for example in (Gigović et al. 2019; Pham et al. 2020; Zhao and Chen 2020). In this research, the weights of evidence (WofE) method was used to trim the number of conditioning factors identified from the site visitation/investigation reports. The principles of this method are like those of Bayesian probability models (Chen X and Chen 2021; Ghorbanzadeh et al. 2019). Many researchers have used this principle to develop landslide susceptibility models for many study scenarios. The WofE technique calculates the weights of every landslide conditioning factor (B) at locations with and without landslides within the selected study area. Thus,

$W_i^{+} = \ln \dfrac{P(B \mid L)}{P(B \mid \bar{L})}$ (1)

$W_i^{-} = \ln \dfrac{P(\bar{B} \mid L)}{P(\bar{B} \mid \bar{L})}$ (2)

where $P$ represents the probability, $\ln$ the natural logarithm, $B$ the presence of a potential landslide-predictive factor, $\bar{B}$ its absence, $L$ the landslide locations, and $\bar{L}$ the non-landslide locations. $W_i^{+}$ indicates the presence of a predictive variable at a landslide location, with a magnitude that expresses a positive correlation between the landslides and the presence of the predictive variable (Equation (1)), while $W_i^{-}$ indicates the absence of a predictive variable, with a negative correlation (Equation (2)).
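The WofE weights in Equations (1) and (2) can be computed directly from pixel-count proportions. The sketch below, on synthetic binary maps, is an illustration of the calculation rather than the study's implementation:

```python
import numpy as np

def weights_of_evidence(B, L):
    """W+ and W- (Equations (1)-(2)) for one binary conditioning-factor
    class B over the mapped locations, with landslide indicator L.
    Probabilities are taken as pixel-count proportions."""
    p = lambda m: m.mean()
    w_plus = np.log((p(B & L) / p(L)) / (p(B & ~L) / p(~L)))     # ln[P(B|L)/P(B|~L)]
    w_minus = np.log((p(~B & L) / p(L)) / (p(~B & ~L) / p(~L)))  # ln[P(~B|L)/P(~B|~L)]
    return w_plus, w_minus

# toy data: the factor class is strongly over-represented at landslide locations
rng = np.random.default_rng(0)
L = rng.random(10_000) < 0.3                       # landslide indicator
B = np.where(L, rng.random(10_000) < 0.8,          # P(B|L)  ~ 0.8
                rng.random(10_000) < 0.2)          # P(B|~L) ~ 0.2
w_plus, w_minus = weights_of_evidence(B, L)
contrast = w_plus - w_minus                        # weight contrast (Equation (3))
```

A positive `w_plus` with a negative `w_minus` (hence a large contrast) marks a factor class that discriminates landslide from non-landslide locations.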
The difference between the two weights, $W_i^{+}$ and $W_i^{-}$, is defined by a factor called the weight contrast:

$W_f = W_i^{+} - W_i^{-}$ (3)

This expression represents the entire spatial relationship between the predictor variable and the landslides. The second phase was to use the Frequency Ratio (FR) method to quantify the level of involvement of the factors in the slides. The FR method has been used in landslide susceptibility analysis for quite some time. It provides the relationship between the landslides in the area, the conditioning factors, and the interrelationships among the factors' variables. The FR is classified as a quantitative statistical approach that can relate the spatial distributions of the landslide-driving factors, their interdependencies, and the landslides themselves. The FR as computed for this research (Table 3) indicates the probability of the ten (10) landslide conditioning factors as dependent variables and the interdependency within the components' pixel counts. The pixel counts signify the specific area covered by each factor in the landslide and non-landslide locations. The percentage of pixel counts is computed as the percentage of landslide pixels for the variable relative to the corresponding percentage pixel count of the whole area (Acharya and Lee 2019). The FR values specify the probability of a particular factor's involvement in landslide occurrence: factors with higher values participate more in the landslides than those with lower values, so the respective FR probabilities can easily be compared to find the factors contributing more (higher values) and less (lower values) to the landslides.
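The FR of one factor class is the ratio of its landslide-pixel percentage to its overall area percentage. A minimal sketch on a synthetic raster (the class proportions are hypothetical, chosen only to make the ratio visible):

```python
import numpy as np

def frequency_ratio(factor_class, landslide):
    """FR for one class of a conditioning factor:
    (% of landslide pixels falling in the class) / (% of all pixels in the class)."""
    pct_landslide = (factor_class & landslide).sum() / landslide.sum()
    pct_domain = factor_class.sum() / factor_class.size
    return pct_landslide / pct_domain

# toy raster: steep slopes cover ~25% of the area but host ~60% of the slides
rng = np.random.default_rng(0)
landslide = rng.random(20_000) < 0.05
steep = np.where(landslide, rng.random(20_000) < 0.6,
                 rng.random(20_000) < 0.25)
fr = frequency_ratio(steep, landslide)
```

An FR above 1 indicates a class over-represented at landslide locations; comparing FR values across classes ranks the factors' contributions as described above.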

Preparation of landslides spatial models
As stated above, we selected ten landslide predisposing factors for this analysis. The number of factors was decided after a series of factor selection procedures; the nature of the study area's terrain was also a consideration. Afterwards, the landslide spatial models were prepared using the relevant data and methods, as explained below.

Slope angle
The slope angle provides details of the surface's steepness, or inclination with respect to the horizontal plane. Sloping terrains with higher angles of inclination are more susceptible to landslides. The slope angle has been used by many researchers for landslide prediction because of its relationship with the gravitational forces acting on the detaching materials (Nath et al. 2020). Landslides occur at specific critical slopes, usually termed unstable slopes, yet it is hard to single out a slope as safe or unsafe without considering other factors, whatever its angle of inclination. In our study area, the slopes range from 0° to about 82° (Figure 4b), a value too high for safe slopes under normal conditions.

Plan curvature
The curvatures specify the nature of the slopes' surface; plan curvature is sometimes referred to as the 'slope of the slopes'. The curvatures originate from the intersections of planes with the terrain surface. Plan curvature (Figure 4d) relates to the convergence or divergence of water as it flows down the slope. This process can quickly erode some sections of the sloped surfaces, causing them to fail, and landslides can then quickly occur in the eroded sections. The nature of these uncertainties has made plan curvature a critical factor for this analysis. The distribution of organic matter in an area is also greatly affected by this factor because it reflects the terrain's morphology (Ramakrishnan et al. 2013).

Slope aspect
The slope aspect is concerned with the orientation of the slopes in the study area. The slopes' exposure is critical because some slope faces are oriented towards heavier rainfall directions than others. This can subject such faces to weathering and degradation that will eventually trigger a landslide. Other parameters related to the slopes' orientation include exposure to direct sunlight, dry/wet heavy winds, degrees of saturation, and other forms of discontinuities (Pradhan 2010; Gigović et al. 2019). In the study area (Figure 4c), the aspect values range from −1°, representing flat areas, to 360°.

Slope elevation
The slope elevation defines the slope's height above mean sea level (Figure 4a). Researchers consider slope elevation an essential factor because it relates the detaching mass to the slope stability conditions and variables. The study area for this research reaches heights of over 800 m above mean sea level, making it more susceptible to landslides (Pourghasemi et al. 2013a).

Rainfall
Rainfall is a vital conditioning factor and a triggering mechanism. Many researchers emphasize the influence of rainfall above other factors on landslides occurrence, especially in areas with no seismic activities (Li WY et al. 2017). Therefore, the kriging method was used to develop the rainfall model ( Figure 4j) using rainfall data for the past ten (10) years collected from 16 weather stations across the study area.

Lithology
This factor is vital in assessing landslides in the area because it reveals the type of rock formation. It is also essential because it relates to rock degradation, making it necessary to know the properties of the area's underlying rocks. The lithology model of the study area was obtained by digitizing details of the rock formation obtained from the geological department. Seven (7) classes of formation were identified (Figure 4f), including the more hardened layers of the study area (Pham et al. 2017).

Soil type
The morphological influence of the soil type is significant when establishing landslide susceptibility; landslide intensity is largely a function of the nature of the soil in the area. Ten (10) categories of soil classes (Figure 4g) were identified. The soil model was digitized from a detailed soil topography map of the study area obtained from the relevant authorities (Ramakrishnan et al. 2013).
Sediment transport index/stream power index (STI/SPI)

The sediment transport index STI (Figure 4i) characterizes the erosion rate within the study location and the flow rate of the eroded materials; it was developed using the DTM of the study area. Another important factor determining water in an erosion-flow scenario is the SPI (Figure 4h). The SPI for this research was also derived from the high-resolution DTM of the study area.
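The paper does not reproduce the formulas for these indices; the definitions commonly used in the terrain-analysis literature, given here for reference in terms of the specific catchment area $A_s$ and the local slope angle $\beta$, are:

```latex
\mathrm{SPI} = A_s \tan\beta ,
\qquad
\mathrm{STI} = \left(\frac{A_s}{22.13}\right)^{0.6}
               \left(\frac{\sin\beta}{0.0896}\right)^{1.3}
```

Both quantities are computed cell by cell from the flow-accumulation and slope grids derived from a DTM, which is consistent with the derivation described above.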

The support vector machine
This algorithm is widely used to establish the landslide susceptibility maps of many places. It separates the linearly separable case from the non-separable case, mapping a nonlinear situation from a low-dimensional input space (Nath et al. 2020). An optimum hyperplane provides the best separation between the classes, thus the expression

$y_i \left( w \cdot x_i + b \right) \ge 1 - \xi_i$ (4)

where $w$ is the coefficient vector that determines the position of the hyperplane in the feature space, $b$ defines the offset between the hyperplane and the origin, and $\xi_i$ are the positive slack variables.

The optimum hyperplane is determined by solving the dual of Equation (4):

$\max_{a} \sum_{i=1}^{n} a_i - \dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j y_i y_j \left( x_i \cdot x_j \right), \quad \text{subject to } \sum_{i=1}^{n} a_i y_i = 0, \; 0 \le a_i \le C$ (5)

where $a_i$ is the Lagrange multiplier and $C$ is a constant called the penalty. The equation can be rewritten to give a classification decision function,

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} a_i y_i \left( x_i \cdot x \right) + b \right)$ (6)

which can be generalized to determine the separating hyperplane with a nonlinear kernel:

$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} a_i y_i K\left( x_i, x \right) + b \right)$ (7)

The function $K(x_i, x_j)$ represents the kernel function. Under this formulation, SVM algorithms provide four (4) different types of kernel function: the radial basis function (RBF), polynomial (PL), sigmoid (SIG), and linear (LN) functions.
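A hedged sketch of this classification step using scikit-learn on synthetic data; the kernel and penalty settings are illustrative defaults, not the study's tuned values:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the 10 conditioning factors of 296 sample points
rng = np.random.default_rng(0)
X = rng.normal(size=(296, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = landslide, 0 = non-landslide

# RBF-kernel SVM; the features are standardized first, since SVM
# is sensitive to feature scaling
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X[:207], y[:207])                     # 70% for training
accuracy = model.score(X[207:], y[207:])        # 30% for validation
```

Swapping `kernel="rbf"` for `"linear"`, `"poly"`, or `"sigmoid"` selects the other three kernel functions named above.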

Random trees classifiers
Random forest is a machine-learning algorithm that has been used in many landslide situations to make predictions and implement them as maps in GIS software. Random forest classifiers usually suffer from high variance, which affects their accuracy compared to other classifiers (Ghorbanzadeh et al. 2019). In this research, random tree classifiers (Figure 5) are introduced to overcome these limitations of the conventional forest algorithm when dealing with high variance. Landslide analysis with soft computing procedures indeed involves many sources of variance. In this situation, the random trees' capability to handle the variance was checked by using the algorithm to train the datasets and monitoring the outcomes of the training process.
Overall, unlike the support vector machine, this classifier can deal with mixed categorical and numerical variables. The classifier is also less sensitive to the scaling of the data, unlike the SVM, for which the data must be normalized before the training process begins. As reported by many scholars, the advantage of the SVM over the random forest is that it performs well even with a small or unbalanced dataset (Ibrahim MB et al. 2020).
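scikit-learn has no classifier named Random Trees; extremely randomized trees (`ExtraTreesClassifier`) are a close analogue, since splits are randomized and each tree is grown on the full learning sample rather than a bootstrap. The sketch below uses synthetic data purely for illustration:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# synthetic stand-in for the conditioning-factor table and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(296, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Extremely randomized trees: random split thresholds lower the variance
# of the ensemble, and bootstrap=False grows every tree on the whole
# training sample; no feature scaling is required for tree classifiers
model = ExtraTreesClassifier(n_estimators=200, bootstrap=False, random_state=0)
model.fit(X[:207], y[:207])
accuracy = model.score(X[207:], y[207:])
```

Categorical factors (lithology, soil type) would only need integer encoding before being passed in, reflecting the mixed-variable tolerance noted above.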

Naïve Bayes multinomial
The NBM algorithm belongs to the class of algorithms built on Bayesian theorem principles. This research considers the algorithm as an advancement of the frequently used Naïve Bayes classifiers. The improvement in the classification process can be viewed as a form of optimization of Naïve Bayes performance: multinomial Naïve Bayes classifiers compute a random variable's likelihood counts differently from Naïve Bayes.
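A minimal sketch of a multinomial Naïve Bayes classifier with scikit-learn on synthetic data. Because NBM models count-like features, the continuous factors are discretized into ordinal class bins here (the bin edges are a hypothetical choice, mimicking classed factor maps):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
raw = rng.normal(size=(296, 10))
y = (raw[:, 0] + 0.5 * raw[:, 1] > 0).astype(int)

# MultinomialNB expects non-negative, count-like features, so each
# continuous factor is discretized into integer class bins 0..4
X = np.digitize(raw, bins=[-1.0, -0.3, 0.3, 1.0])
model = MultinomialNB()
model.fit(X[:207], y[:207])
accuracy = model.score(X[207:], y[207:])
```

The per-class likelihoods are estimated from the binned feature counts, which is the "likelihood counts" computation the text refers to.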

The landslide susceptibility modelling
In this paper, ten landslide-conditioning factors were used as the landslide predictors in the study area (Figure 4a-j). The relationships between the landslide points identified on the inventory map and the conditioning factors were extracted. The extracted data were then divided randomly in the ratio mentioned earlier: 70% as training datasets and the remaining 30% as testing or validation datasets (Figure 6a). Next, the three algorithms discussed earlier were applied to the training datasets for classification. After the training procedure, the results, now predicted values, were used to develop the landslide susceptibility maps. In developing these maps, landslide susceptibility indices (LSIs) were established from the training and testing datasets (Balogun et al. 2021). These values constitute the landslide susceptibility map of the area, developed using the ArcMap software. The generated maps were then reclassified, meaning the maps were categorized according to the severity of the landslide susceptibility. As a result, five categories were delineated from the original map: regions of very high, high, moderate, low, and very low landslide susceptibility. The landslide susceptibility maps are shown in Figure 6b for the SVM model, Figure 6c for the Random Trees, and Figure 6d for the Naïve Bayes Multinomial (Li WY et al. 2017).
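The workflow described above, split 70/30, train, derive LSIs, reclassify into five zones, can be sketched as follows (a hedged illustration with synthetic data and quantile breaks as an assumed reclassification rule; the study performed this step in ArcMap):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=296, n_features=10, random_state=1)

# 70% training / 30% validation split, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=1)

model = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)

# Landslide susceptibility index (LSI): predicted landslide probability.
lsi = model.predict_proba(X_te)[:, 1]

# Reclassify the LSIs into five zones (very low ... very high) using
# quantile breaks -- an assumed scheme for illustration.
breaks = np.quantile(lsi, [0.2, 0.4, 0.6, 0.8])
zones = np.digitize(lsi, breaks)  # 0 = very low, ..., 4 = very high
print(np.bincount(zones, minlength=5))
```

In a GIS workflow, each zone index would then be mapped back onto its pixel to produce the susceptibility raster.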

Evaluation of the model's performance
The ROC and AUC measure and visualize the performance characteristics of our models in the multiclass classification. Figure 7 shows the classes in a confusion matrix, where the algorithms' performance on the datasets is divided into classes. The classes include True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). The sensitivity (True Positive Rate) gives the proportion of the positive class (landslide locations) that is correctly classified as landslides (Equation (8)). In contrast, the specificity (True Negative Rate) tells the proportion of the negative class (non-landslide locations) that is correctly classified as non-landslides (Equation (9)). Between sensitivity and specificity lies the False Negative Rate (FNR), which signifies the proportion of landslide points wrongly classified as non-landslides (Equation (10)). The False Positive Rate (FPR) tells the proportion of non-landslides incorrectly classified as landslides (Equation (11)).
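Equations (8)-(11) can be written out as a small helper (the counts used here are illustrative only, not the study's actual confusion matrix):

```python
def rates(tp, fp, tn, fn):
    """Equations (8)-(11): rates derived from the confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # Eq. (8): true positive rate
    specificity = tn / (tn + fp)   # Eq. (9): true negative rate
    fnr = fn / (tp + fn)           # Eq. (10): false negative rate
    fpr = fp / (fp + tn)           # Eq. (11): false positive rate
    return sensitivity, specificity, fnr, fpr

# Illustrative counts only:
print(rates(tp=80, fp=12, tn=78, fn=18))
```

By construction, sensitivity + FNR = 1 and specificity + FPR = 1, which is a quick sanity check on any reported confusion-matrix figures.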
Other statistical analyses reveal the model performance independently of the ROC analysis. These include the Root Mean Square Error (RSME), the Mean Absolute Error (MAE), Accuracy (ACC), and the F-Measure. For example, the RSME (Equation (12)) takes the square root of the mean squared difference between each observed and predicted value over the total number of non-missing data points.

$\mathrm{RSME} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2}$ (12)

where $N$ is the number of non-missing data points, $x_i$ is the actual observed time series, and $\hat{x}_i$ is the estimated time series. The percentage of correctly predicted values out of the total number of instances defines the accuracy (ACC) of the algorithm (Equation (13)).
The MAE measures the acquired errors between the paired observations in the same class expression (Equation (14)).

$\mathrm{MAE} = \dfrac{|p_1 - a_1| + |p_2 - a_2| + \cdots + |p_n - a_n|}{n}$ (14)

where $p_i$ is the predicted value and $a_i$ is the actual value. The F-Measure is another statistical technique that measures the model's performance. It combines the precision and sensitivity values into a single measure that captures both properties with their exact weighting (Equation (15)).
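Equations (12), (14), and (15) translate directly into code; the sketch below uses illustrative values, not the study's data, and assumes the standard F-measure as the harmonic mean of precision and sensitivity:

```python
import math

def rsme(actual, predicted):
    """Equation (12): square root of the mean squared difference over N points."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Equation (14): mean of the absolute paired differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def f_measure(precision, sensitivity):
    """Equation (15): harmonic mean of precision and sensitivity."""
    return 2 * precision * sensitivity / (precision + sensitivity)

actual = [1, 0, 1, 1, 0]
predicted = [0.9, 0.2, 0.8, 0.4, 0.1]
print(round(rsme(actual, predicted), 3), round(mae(actual, predicted), 3))
```

Both error measures shrink toward zero as predictions approach the observations; the RSME penalizes large individual errors more heavily than the MAE.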

Kappa index
The kappa index is denoted by the relationship between the observed agreement and the agreement expected by chance (Equation (16)). Table 4 expresses the results obtained from validating the three data mining algorithms' performances. The performance evaluation was conducted on both the training and validation datasets. As stated earlier, four (4) performance evaluators were used to check the prediction rate and the success rate of the models developed and of the data used in developing them. Performance validation on the training datasets shows that the traditional data mining algorithm (SVM) is still significant in creating landslide susceptibility models for this study area (Figure 7). However, the two new models also performed above the benchmark of 0.75 and could be used in landslide susceptibility analysis (Table 4).
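The kappa relationship of Equation (16) can be computed from a 2x2 confusion matrix as below (a sketch assuming the standard Cohen's kappa definition; the counts shown are illustrative, not the study's):

```python
def kappa(tp, fp, tn, fn):
    """Cohen's kappa from a 2x2 confusion matrix: (p_obs - p_chance) / (1 - p_chance)."""
    n = tp + fp + tn + fn
    p_obs = (tp + tn) / n
    # Chance agreement derived from the marginal totals of the matrix:
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# Illustrative counts only:
print(round(kappa(tp=80, fp=12, tn=78, fn=18), 3))
```

A kappa of 1 indicates perfect agreement between prediction and observation, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.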

Significance of the statistical evaluation
Landslide models obtained through mathematical simulations are evaluated for model performance using statistical evaluation methods such as the AUC (Figure 8), RSME, MAE, F-Measure, and Kappa, to mention a few. When two or more models are involved in an assessment, a statistical significance test is usually conducted to establish the best model and reduce the subjectivity level in the final report (Table 5). The P-value and Z-value tests for the models were computed and explained using the Wilcoxon signed-rank test technique (Tsangaratos and Ilia 2017a; Khosravi et al. 2018; Hong et al. 2020).
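A minimal sketch of such a paired comparison (assuming SciPy's `wilcoxon`; the per-pixel susceptibility values are synthetic, not the study's model outputs):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(7)
# Hypothetical per-pixel susceptibility indices from two models:
svm_lsi = rng.uniform(0, 1, 100)
rt_lsi = np.clip(svm_lsi + rng.normal(0.05, 0.02, 100), 0, 1)

# Paired, non-parametric test on the two models' outputs.
stat, p_value = wilcoxon(svm_lsi, rt_lsi)
# p < 0.05 suggests the two models' predictions differ significantly.
print(p_value < 0.05)
```

The test is paired (same pixels scored by both models) and makes no normality assumption, which is why it is preferred over a paired t-test for susceptibility indices.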

Results and discussion
The predictive models were built using the WEKA software and data from the ten (10) conditioning factors mentioned earlier (Figure 4a-j). The factors were selected using factor selection procedures and were re-screened and quantified using the frequency ratio (FR). We computed the values of eight performance indicators using various methods: Sensitivity, Specificity, ACC, RSME, F-measure, MAE, AUC, and Kappa. As captured in Table 4, the performance analysis was conducted to further validate the landslide predictions obtained from the SVM, RT, and NBM algorithms. Furthermore, these indices were established to further explain the models' maximum likelihood, as for example in the work of (Gholami et al. 2020; Balogun et al. 2021). A statistical significance test (Table 5) was conducted to check the level of significance among the models, which helps in reducing subjectivity (Ritter and Muñoz-Carpena 2013). The P-value and Z-value tests were part of the statistical significance investigation on the models, as reported similarly by (Chen W and Zhang S 2021; Mohammadifar et al. 2021). The landslide susceptibility models developed from the SVM, RT, and NBM algorithms (Figure 6b-d) were subjected to the performance evaluation (Hong et al. 2020; Li L et al. 2020; Shin 2020). Conducting a performance analysis on results obtained through data mining techniques is crucial and cannot be over-emphasized (Remondo et al. 2003; Brock et al. 2020; Mohammadifar et al. 2021). In addition, the evaluations help verify and define the level of accuracy and performance of the landslide susceptibility models (Althuwaynee et al. 2021). Results from the performance evaluation show that the AUROC value for SVM on the training datasets is 0.833, against 0.814 and 0.792 for RT and NBM, respectively. This means that the SVM model has higher accuracy than the remaining two algorithms.
Thus, an area with environmental conditions similar to this study location can opt for the SVM algorithm, even though the remaining two algorithms also performed well. The SVM algorithm was observed to have greater strength in determining the probability of landslide pixels being correctly classified as landslides (sensitivity). A sensitivity value of 0.807 was observed for the SVM, while values of 0.776 and 0.732 were computed for RT and NBM, respectively. The SVM recorded a specificity value of 0.782, RT recorded 0.778, and NBM recorded 0.741 on the training datasets. Specificity values indicate the non-landslide regions or zones that are correctly classified or identified as non-landslides.
Another performance evaluator computed for this study is the Kappa index. This index is necessary to find substantial agreement or disagreement between the prediction and observation outputs. For this study, the kappa values obtained with the training data were 0.589, 0.553, and 0.579 for the SVM, RT, and NBM algorithms, respectively. Similarly, the RSME was computed, recording 0.224, 0.241, and 0.274 for the three algorithms (SVM, RT, and NBM). In addition, SVM recorded an accuracy (ACC) value of 0.795, the RT algorithm 0.777, and the NBM 0.736. These indicators were also computed on the validation or testing datasets (Table 4). In other words, the statistical evaluation of the testing datasets is a way of validating the training process and the datasets used for the training (Beguería 2006; Pradhan 2013; Truong et al. 2018).
The performance assessment of the three models (Table 4) also revealed the differences in classification accuracy between the training and validation datasets. The models were observed to perform better with the validation datasets; for instance, with an ACC value of 0.791, the SVM recorded an AUC value of 0.841, higher than RT and NBM, which recorded 0.822 and 0.814, respectively. This indicates a better prediction accuracy ahead of the RT and NBM for this study area, as confirmed by a similar study. Furthermore, the observations made on the validation data for the three models indicated that the SVM outperforms the remaining two algorithms in prediction capability (Table 4). The trend in the results and the slight difference between the training and validation datasets may be attributed to the conditional independence assumptions. These assumptions were specific to violations made in the training datasets, resulting in the variance and even the lower performance in some of the indicators recorded for the training datasets. Therefore, despite the NBM's low classification rates on many of the indices, it displayed a better ability to adjust the weights of some of the variables affected by the assumptions, as was also observed in similar studies (Chen X and Chen 2021; Liu X and Wang Y 2021).
In the case of the statistical significance, the Wilcoxon signed-rank test was computed for the p and z values. A comparison between the SVM, RT, and NBM indicated a significant difference among the models. With p values below the 0.05 significance level and z values falling outside the critical range (−1.96 to +1.96), the models are considered significantly different. The significance test results obtained in this research align with many findings using data mining techniques (Chen W et al. 2019). Thus, the results obtained from the Wilcoxon test show that the susceptibility models developed from the three algorithms in this study are significantly different. Hence, based on this evaluation, the three models comprising SVM, RT, and NBM are statistically acceptable for landslide susceptibility analysis and mapping in this study area. Furthermore, the reliability of the models (Figure 6b-d), when compared, has been enhanced. The observed differences are attributed to how well the training process was carried out, plus the sufficiency of the training datasets; this was also observed in many works of literature (Chen W and Zhang Y 2021; Mohammadifar et al. 2021). The absence of a considerable factor difference in the model comparisons entails the absence of significant data overfitting in the training process. With the results analyzed so far, the SVM models have outperformed the remaining algorithms by a small margin. However, the margin is significant enough to conclude that the SVM is the better algorithm for this study area among the three. This is in line with many findings, e.g. (Chang Z et al. 2020; Hong et al. 2017; Mohammadifar et al. 2021; Pradhan 2013), that reported SVM outperforming other traditional algorithms.
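The decision rule described above can be encoded as a small helper (a sketch of the standard two-sided test criterion; the thresholds 0.05 and 1.96 are the conventional values cited in the text, and the example p/z values are hypothetical):

```python
def significantly_different(p_value, z_value, alpha=0.05, z_crit=1.96):
    """Two models differ significantly when p < alpha and |z| exceeds the
    critical value, i.e. z falls outside the range -1.96 ... +1.96."""
    return p_value < alpha and abs(z_value) > z_crit

# Hypothetical test statistics for two model pairs:
print(significantly_different(0.01, -2.40))  # significant
print(significantly_different(0.20, 1.10))   # not significant
```

For a two-sided test at alpha = 0.05, the p-value and z-value criteria are equivalent; checking both simply guards against reporting inconsistencies.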
Landslide analysis using data mining techniques to produce regions of landslide susceptibility from GIS data has been an essential tool in regional planning and management. In addition, the literature has shown that data mining techniques produce landslide susceptibility maps of high predictive accuracy that address real-life landslide scenarios (Ma and Xu 2019; Nhu et al. 2020c; Saha et al. 2021). However, it is still challenging to produce high-accuracy landslide models with these techniques in various places, owing to the dynamism of landslides and the factors involved (Tien Bui et al. 2017b; Tien et al. 2020; Balogun et al. 2021). So far, no machine learning algorithm used in data mining has been observed to fit all regions under all landslide conditioning factors perfectly. For instance, the SVM used in this analysis was found to perform better in many landslide incidences. With this in mind, we compared the SVM models with models from NBM and ensembles of DT and RF to find a higher-accuracy landslide model. The model will help manage landslides in this study area, which has substantial economic relevance and is often disrupted by landslide activities.

Conclusion
This study has assessed the effectiveness of advanced data mining techniques in evaluating landslides in Lawas, an economically important town in Sarawak, Malaysia. In achieving the stated objectives of the study, three machine learning algorithms, namely the SVM, RT, and NBM, were used to train geospatial data extracted from various GIS sources (Bacha et al. 2020; Althuwaynee et al. 2021). The training was performed to develop a classification between identified landslide locations in the area and non-landslide locations by examining the pixels of the two classes. An equal number of non-landslide locations (148) was also identified to tackle the problems associated with imbalances in the probability distribution. The results obtained from this analysis are therefore comparable with results from the literature.
The ten (10) landslide conditioning factors drawn using the WoE technique were quantified using the FR method to establish the most influential. The landslide spatial models were developed using the respective data layers. The selected landslide locations were evenly distributed across the study area to enhance the similarity between the split datasets during the training process. It was also observed from the performance evaluation test conducted (Table 4) that the whole analysis was rightly executed and successfully analyzed. Although all three models displayed positive predictive capabilities, the SVM turned out to work well with the geological/geomorphological conditions of the study area. The geological factors were observed to have the highest contributions to landslide events in the area: Soil type, Slope angle, Elevation, and the Curvatures were observed to have higher FR values than the remaining factors. According to reports, the latest landslide events happened after a continuous downpour that lasted several hours. Comparing this information with our results, it can be concluded that the rain serves as a triggering factor in the first place; the slopes could otherwise have survived the continuous rain that is suspected to be the most influencing factor.
This study can advise that, despite the considerable rainfall intensity in parts of the study area, infrastructure such as the pipeline can be protected by using the SVM model to plan maintenance accordingly. Overall, the data mining technique looks very promising for managing the landslide problems of this study area. Moreover, the study has now revealed more insight into the landslides' causative factors than just rain. When planning, the identified geomorphological factors, such as the nature of the soil and the slope and height of the terrain, should be given proper attention.
The authors would like to extend their acknowledgments to the Editor-in-Chief, the associate editor, and two other anonymous reviewers for the time taken to review this paper to standard.