A novel ensemble classifier of rotation forest and Naïve Bayes for landslide susceptibility assessment at the Luc Yen district, Yen Bai Province (Viet Nam) using GIS

ABSTRACT The objective of this study is to apply a new soft computing approach to landslide susceptibility assessment in the Luc Yen district, Yen Bai province (Viet Nam), using a novel classifier ensemble model of Naïve Bayes and Rotation Forest. First, a history of 95 landslide locations was identified by field investigations and interpretation of aerial photos. In addition, ten landslide causal factors (slope, aspect, elevation, curvature, lithology, land use, distance to roads, distance to rivers, distance to faults, and rainfall) were selected to evaluate the spatial relationship with landslide occurrences, and the Information Gain technique was used to quantify the predictive capability of these factors. Second, landslide susceptibility assessment was carried out using the novel classifier ensemble model. Finally, the performance of the landslide model was validated using the receiver operating characteristic curve technique and statistical index-based evaluations. The novel classifier ensemble model shows high prediction capability (AUC = 0.846) and relatively high accuracy (ACC = 78.77%). The study reveals that this model performs well in comparison to other landslide models such as AdaBoost, Bagging, MultiBoost, and Random Forest. Overall, the novel classifier ensemble model is a promising method that could be used for landslide susceptibility assessment.


Introduction
Landslides are among the most serious natural hazards, with devastating effects on human life and infrastructure (Tsangaratos et al. 2013; Alimohammadlou et al. 2014). Worldwide, 2620 deadly landslide events occurred from 2004 to 2010, killing a total of 32,322 people (Petley 2012). In Asia, approximately 18,000 people died and about 5.5 million people have been affected due to landslides during the period of , and the number of landslides in this region is relatively high in comparison to other regions of the world (EM-DAT 2010).
Viet Nam is one of the top six countries most frequently affected by natural disasters, including landslides (Guha-Sapir et al. 2011). Over the years, landslides have occurred frequently in the country, especially in the north-western mountainous and hilly regions (Tien Bui 2012). However, only a limited number of landslide studies have been carried out in this region (Tien Bui et al. 2013; Tien Bui et al. 2015).
Landslide susceptibility assessment is considered an appropriate means of reducing landslide damage through proper land use planning (Fell et al. 2008). On regional scales, the assessment is based on the statistical assumption that future landslide events will occur under the same conditions as those in the past (Guzzetti et al. 2005). Many methods and techniques have been developed for landslide susceptibility assessment during the last decade. These methods can be broadly grouped into two categories: (1) qualitative methods and (2) quantitative methods (Guzzetti 2006). Qualitative methods are relatively subjective approaches that rely on expert judgement for defining the parameters and assigning weights (Castellanos Abella 2008). Quantitative methods are more objective because they rely on explicit criteria for selecting variables and assigning their weights (Castellanos Abella 2008). Therefore, quantitative methods are preferable for landslide susceptibility assessment.
Even though these models have been applied successfully and efficiently in landslide susceptibility assessment, no model is perfect. Therefore, these models need to be improved to achieve the desired results. The performance of landslide models can be enhanced by using feature selection and ensemble frameworks (Tien Bui et al. 2014). Feature selection quantifies the predictive ability of landslide causal factors; factors with no predictive ability are then removed to improve the performance of landslide models (Martínez-Álvarez et al. 2013). Ensemble frameworks, in turn, combine multiple classifiers and exploit their diversity to improve on the performance of individual classifiers (Kuncheva 2014).
Ensemble frameworks emerged in the 1990s but have received significant attention from researchers only in recent years. Ensemble techniques such as Bagging (Breiman 1996), AdaBoost (Freund & Schapire 1997), Random Subspace (Ho 1998), MultiBoost (Webb 2000), Random Forest (Breiman 2001), and Rotation Forest (Rodriguez et al. 2006) have been applied efficiently to improve the performance of individual classifiers for different problems. Of these, the Rotation Forest technique has produced better outcomes (Rodriguez et al. 2006). Despite their merits, application of these ensemble frameworks to landslide models is still rare. Therefore, the main objective of the present study is to apply a novel classifier ensemble data mining approach to landslide susceptibility assessment at the Luc Yen district in Yen Bai province (Viet Nam). This method is a combination of the Naïve Bayes classifier and the Rotation Forest ensemble. These two methods are current state-of-the-art techniques, but they have so far seldom been used for landslide models. In addition, the performance of the novel classifier ensemble model was compared with other ensemble models such as Bagging, AdaBoost, MultiBoost, and Random Forest.

Study area
The study area of Luc Yen district (latitudes 21°55′30″N to 22°17′30″N and longitudes 104°30′00″E to 105°53′33″E), which is located in the northeast of the Yen Bai province in Viet Nam, is affected by numerous landslides every year (Figure 1). It covers an area of about 810 km², which is 1.2% of the area of the Yen Bai province. According to records, the population of Luc Yen district in 2010 was 103,587 and the average population density was 120 people per km².
Luc Yen district is a mountainous region occupied by hills, small valleys, mounts, cliffs, and plains. The district is dissected by two dominant mountain ranges running in a northwest-southeast direction, namely the Nui Voi and Large Rock mountains. Elevation in the area ranges from 43 to 1325 m above standard sea level, with an average elevation of 262 m. Slope angles in the region vary from nearly flat to 81°. Approximately 29.71% of the study area has very gentle slopes under 8°, and around 12.93% falls into slopes from 8° to 15°. Slopes in the range of 15°-25° occupy about 26.58% of the study area, whereas 20.93% of the study area belongs to slopes of 25°-35°. Around 7.96% of the study area has slopes between 35° and 45°. Only 1.89% of the area has slopes greater than 45° (Figure 6).
Geologically, there are eight main geological formations (Nui Voi, Ngoi Chi, Thac Ba, Phan Luong, An Phu, Tu Le, Ha Giang, and Nui Chua) in the study area. Different types of rocks (sedimentary, igneous, and metamorphic) exist in the study area. The predominant rocks in the area are metamorphic (48%), whereas igneous rocks occupy only 5.4% of the area. Alluvium and recent deposits are also present in places (Figure 6). Different types of land use have been observed in this area, namely forest, barren, cultivation, grass, scrub, and residential area. Forest land occupies the largest area (68.07%), followed by barren and cultivation lands (15.09%), grass and scrub lands (7.36%), and residential area (4.5%). Water bodies occupy only 4.98% of the total area (Figure 6).
Luc Yen district is situated in the tropical monsoon region and thus regularly experiences heavy rainfall during June, July, and August. The annual average rainfall varies from 2500 to 3550 mm. Rainfall usually occurs with high intensity over a short period of time, often triggering landslides and flooding and causing erosion in the study area. The average daily temperature is 22 °C. The temperature in the area varies from 2 °C to 40 °C. The average daily humidity ranges between 60% and 72%.

Materials and methodology
Landslide susceptibility analysis has been carried out in five main steps: (i) data collection from various sources, (ii) preparation of dataset, (iii) evaluation of prediction capability of landslide causal factors, (iv) assessment of landslide susceptibility using the novel classifier ensemble model, and (v) validation and comparison of landslide models.

Landslide inventory map
Preparation of a landslide inventory map is considered a primary and important step in landslide susceptibility assessment (Fell et al. 2008). The map indicates the locations of landslide events that occurred in the past as well as at present. To construct a landslide inventory map, consultation of the literature and interpretation of high-resolution satellite images or air photos are carried out in conjunction with field investigation (Xu et al. 2012; Pradhan 2013).
The landslide inventory map in this study was constructed with the help of air photos (1:33,000) taken in 2013, obtained from the Aerial Photo-Topography Company (Vietnam). Interpretations were carried out under a current national project in Viet Nam at the Vietnam Institute of Geosciences and Mineral Resources, namely 'Survey, assessment and zoning of landslide warning in the mountainous region of Vietnam'. Field investigations were also carried out to check the interpretation results. Figure 2 shows photos of landslides in the study area that were taken during the field work phase. These landslides were classified into three types: translational, rotational, and debris slides. The study area contains 65 translational landslides (68.4% of the total), 18 debris slides (19%), and 12 rotational landslides (approximately 12.6%). Landslide locations were divided randomly into two parts and then converted into raster data with a pixel size of 20 × 20 m for analysis. One part, comprising 75% of the landslide locations (29,038 pixels), was used for the training process, and the other, comprising 25% of the landslide locations (4979 pixels), was used for the validation process.
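The random 75%/25% division of landslide locations can be sketched as follows. This is a minimal illustration only: the function name `split_locations`, the fixed seed, and the use of integer identifiers for the 95 locations are assumptions made for the example, not details reported in the study.

```python
import random


def split_locations(location_ids, train_fraction=0.75, seed=0):
    """Randomly divide landslide locations into training and validation parts."""
    ids = list(location_ids)
    random.Random(seed).shuffle(ids)      # seeded shuffle, so the split is reproducible
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


# 95 landslide locations, identified here by hypothetical integer ids
train_ids, validation_ids = split_locations(range(95))
```

The split is made on locations rather than on pixels, so all pixels of one landslide end up in the same part once the locations are rasterized.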

Landslide causal factors
Based on an analysis of the natural mechanism of landslides and the geo-environmental characteristics of the study area, a total of ten landslide causal factors (slope, aspect, elevation, curvature, lithology, land use, distance to roads, distance to rivers, distance to faults, and rainfall) were selected for landslide analysis in the present study. These factors were reclassified into classes for landslide spatial prediction, based on the frequency analysis of landslides in this study and on previous landslide studies (Table 1). Similar approaches have been adopted by other researchers in the identification of causal factors (Dai & Lee 2002; Pourghasemi et al. 2013; Tien Bui et al. 2016a).

Geomorphologic factors
It is well known that landslides are largely influenced by terrain types; therefore, geomorphologic factors should be taken into account in landslide susceptibility assessment (Dou et al. 2014). In the present study, the geomorphologic factors (slope, aspect, elevation, and curvature) were obtained from a Digital Elevation Model (DEM) with a spatial resolution of 20 m. The DEM was generated from national topographic maps at a scale of 1:50,000 obtained from the Vietnam Institute of Geosciences and Mineral Resources (Tien .
Slope is considered one of the most important factors in slope instability analysis (Sadr et al. 2014): the steeper the slope, the higher the probability of slope failure (Dai et al. 2001). However, variations in soil thickness and strength should also be taken into account. The slope map was constructed with five categories (Tien Bui et al. 2014), namely 0-8°, 8-15°, 15-35°, 35-45°, and > 45° (Figure 3a). The distribution of landslide pixels on the slope map is shown in Figure 6a.
Aspect is also an important factor influencing slope instability, as it controls topographic moisture through the effects of solar radiation and rainfall (Sadr et al. 2014). The aspect map was constructed with nine classes (Tien Bui et al. 2014): flat (-1), north (0-22.5 and 337.5-360), northeast (22.5-67.5), east (67.5-112.5), southeast (112.5-157.5), south (157.5-202.5), southwest (202.5-247.5), west (247.5-292.5), and northwest (292.5-337.5) (Figure 3b). The distribution of landslide pixels on the aspect map is shown in Figure 6b. There are no landslide pixels in the flat class. The highest percentage of landslide pixels belongs to the east class (27%), followed by southeast (22.2%), south (18.51%), northeast (15.65%), southwest (6.62%), north (6%), northwest (2.29%), and west (1.73%).
Curvature reflects the morphology of the terrain surface, representing changes in slope angle along a very small arc of the curve (Tien Bui et al. 2014), and thus influences slope instability. The curvature map (Figure 3d) was generated with three classes (Tien Bui 2012): concave (< -0.05), flat (-0.05 to 0.05), and convex (> 0.05). The distribution of landslide pixels on the curvature map is shown in Figure 6d. Landslide pixels appear only in the concave (51.71%) and convex (48.29%) classes, and no landslide pixels fall in the flat class.

Lithology
Lithology is one of the most important factors influencing the type and mechanism of landslides, because different types of rocks and soils have different internal structures and mineral compositions and thus different susceptibility to landslide occurrence (Ercanoglu 2005).
In this study, the lithology map (Figure 4a) was constructed based on the Geological and Mineral Resources Map of the Luc Yen district at a scale of 1:50,000. Lithology was classified into seven groups (Table 2) based on mineral composition, degree of weathering, and estimated strength and density (Tien Bui 2012). The distribution of landslide pixels on the lithological map is shown in Figure 6e. The highest percentage of landslide pixels falls in group 2 (78.38%), whereas the smallest percentage (0.44%) is observed in group 7.

Land use
Land use patterns affect landslide occurrence through human intervention (Glade 2003). For instance, landslides occur more frequently in barren areas and less frequently in forest and residential regions (Lallianthanga & Lalbiakmawia 2013). The land use map was generated from air photos at a scale of 1:33,000 using Envi 5.0 software with maximum likelihood classification. A total of five land use classes were identified and grouped: forests, grass and scrub lands, barren and cultivated lands, residential area, and water bodies (Figure 4b). The distribution of landslide pixels on the land use map is shown in Figure 6f. The highest percentage of landslide pixels is in forests (64%), followed by grass and scrub lands (17.75%), barren and cultivated lands (16.99%), and residential area (0.75%). There are no landslide pixels in water bodies.

Distance to features
Features such as faults, rivers, and roads should be taken into account in landslide susceptibility assessment (Tien Bui 2012). Faults are products of tectonic activity that break the continuity of soil or rock masses and are considered weak planes influencing slope stability. The fault lines were extracted from the Geological and Mineral Resources Map of the Luc Yen district at a scale of 1:50,000. The distance to faults map was then constructed with six classes by buffering these fault lines across the study area (Tien Bui 2012), namely 0-100 m, 100-200 m, 200-400 m, 400-700 m, 700-1000 m, and > 1000 m (Figure 5a). The distribution of landslide pixels on the distance to faults map is shown in Figure 6g. The percentage of landslide pixels at a distance of 200-400 m is 22.57% and at 400-700 m is 24.91%. Lower percentages of landslide pixels have been observed at distances of 0-100 m (9.96%) and 700-1000 m (9.77%).
The erosion of soil and rock masses caused by river activity has also significantly influenced landslide occurrence in the study area. Drainage density affects terrain moisture, as a denser drainage pattern helps water accumulate and thus makes an area more susceptible to landslide occurrence (Stevens & Wolfe 2012). In this study, river sections that undercut slopes larger than 15° were extracted from national topographic maps at a scale of 1:50,000 (Tien Bui 2012). The distance to rivers map was then constructed with four categories: 0-40 m, 40-80 m, 80-120 m, and > 120 m (Figure 5b). The distribution of landslide pixels on the distance to rivers map is shown in Figure 6h. The percentage of landslide pixels is highest at 0-40 m (91.58%) and much lower in the remaining categories, i.e. 40-80 m (6.4%), 80-120 m (8.39%), and > 120 m (9.23%).
Road sections in the mountainous and hilly regions that undercut slopes larger than 15°, breaking the continuity of soil or rock masses, are considered prone to slope instability. The road networks were extracted from national topographic maps at a scale of 1:50,000. The distance to roads map was then constructed with four intervals (Tien Bui et al. 2015): 0-40 m, 40-80 m, 80-120 m, and > 120 m (Figure 5c). The distribution of landslide pixels on the distance to roads map is shown in Figure 6i. The percentage of landslide pixels is highest at a distance between 0 and 40 m (75.98%) and very small at the other distances: 40-80 m (2.41%), 80-120 m (2.89%), and > 120 m (3.12%).

Rainfall
Rainfall is considered a triggering factor that significantly influences landslide occurrence (Shahabi et al. 2014), because rainfall alters soil properties, for example by decreasing soil shear strength. Rain also causes liquefaction of soil material and even flow of soil or debris masses, increasing the susceptibility of soil masses to landslides (Highland & Bobrowsky 2008). In fact, landslides in the study area usually occur during long periods of intensive rainfall. Rainfall data for the 30 years from 1984 to 2014 were extracted from the Climate Forecast System Reanalysis (CFSR) database in Global Weather data for SWAT (NCEP 2014). The rainfall map was then generated with five classes: less than 2800 mm, 2800-2950 mm, 2950-3100 mm, 3100-3300 mm, and greater than 3300 mm (Figure 5d). The distribution of landslide pixels on the rainfall map is shown in Figure 6j. The two highest percentages of landslide pixels are in the two highest rainfall classes, i.e. 3100-3300 mm (55%) and > 3300 mm (41.73%). Lower percentages of landslide pixels have been observed in the lower rainfall classes, i.e. < 2800 mm (6.51%), 2800-2950 mm (12.11%), and 2950-3100 mm (19%).
(Azhagusundari & Thanamani 2013). Consequently, the accuracy of results can be improved and the process of learning could be implemented more quickly (Doshi & Chaturvedi 2014).

Methodology
Let $z_i$, $i = 1, \dots, n$, be the landslide causal factors and $L_j$, $j = 1, \dots, m$, be the output classes (landslide and non-landslide). The Information Gain value of each landslide causal factor is quantified as the reduction in entropy (information) using the following equation:

$$\mathrm{InfoGain}(L, z_i) = IF(L) - IF_z(L), \qquad (1)$$

where $IF(L)$ is the entropy value of $L$, that is, the expected information needed to classify a landslide causal factor for $L$, and is given by

$$IF(L) = -\sum_{j=1}^{m} P(L_j) \log_2 P(L_j).$$

$IF_z(L)$ is the information of $L$ after integrating the values of the landslide causal factors $z_i$ and is calculated by

$$IF_z(L) = \sum_{i=1}^{n} \frac{|L_i|}{|L|} \, IF(L_i),$$

where $|L_i| / |L|$ is the weight of the $i$th landslide causal factor and $IF(L_i)$ is the entropy of $L$ corresponding to the $i$th landslide causal factor. Factors with a higher Information Gain value are more important to landslide models. Factors with an Information Gain value of zero make no contribution to landslide models and must therefore be removed during dataset preparation.
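The Information Gain computation can be illustrated with a short sketch. It assumes that each causal factor has already been reclassified into discrete classes and that the weighted sum is taken over the subsets of samples sharing the same factor class; the function names and the four-pixel toy dataset are hypothetical and are not the study data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """IF(L) = -sum_j P(L_j) * log2 P(L_j), estimated from class frequencies."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def info_gain(factor_classes, labels):
    """InfoGain(L, z) = IF(L) - IF_z(L) for one reclassified causal factor."""
    total = len(labels)
    subsets = {}
    for factor_class, label in zip(factor_classes, labels):
        subsets.setdefault(factor_class, []).append(label)
    # IF_z(L): entropy of each subset, weighted by its share of the samples
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four pixels, two causal factors
labels = ["landslide", "landslide", "non-landslide", "non-landslide"]
slope_class = ["steep", "steep", "gentle", "gentle"]   # separates the classes perfectly
aspect_class = ["east", "west", "east", "west"]        # carries no class information

gain_slope = info_gain(slope_class, labels)
gain_aspect = info_gain(aspect_class, labels)
```

In this toy case the slope factor removes all of the class entropy, while the aspect factor removes none, so it would be the candidate for removal during dataset preparation.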

Naïve Bayes classifier
The Naïve Bayes classifier is one of the simplest soft computing methods; it is based on Bayesian theory and the maximum a posteriori hypothesis (Rish et al. 2001). The Naïve Bayes classifier relies on the statistical assumption that all values of numeric attributes are independent and normally distributed in each class (Zhang & Su 2004). It has been applied effectively in many fields such as medical diagnosis (Domingos & Pazzani 1997) and management (Hellerstein et al. 2000). However, its application to landslide problems is still limited. Let $t = (t_1, t_2, \dots, t_{10})$ be the attributes of the 10 landslide causal factors and $G_j$, $j \in \{\text{landslide}, \text{non-landslide}\}$, be the classified variables and outputs. The prediction using the Naïve Bayes classifier is as follows:

$$G_{NB} = \underset{G_j}{\arg\max} \; P(G_j) \prod_{i=1}^{10} P(t_i \mid G_j),$$

where $P(G_j)$ is the prior probability of $G_j$, which can be estimated as the proportion of observed cases with output class $G_j$ in the training dataset. $P(t_i \mid G_j)$ is the conditional probability, which can be calculated as follows:

$$P(t_i \mid G_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(t_i - \mu)^2}{2\sigma^2}\right),$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of $t_i$.
The Naïve Bayes classifier is easy to construct and performs surprisingly well in classification. On the other hand, it is also known to give poor probability estimates because of the conditional independence assumption (Zhang & Su 2004). Therefore, some researchers have tried to improve its probability estimates (Friedman et al. 1997; Zadrozny & Elkan 2001). Additionally, the performance of the Naïve Bayes classifier might be improved by using an ensemble classifier framework (Pham et al. 2016c).
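A Gaussian Naïve Bayes prediction of this kind can be sketched in a few lines. The sketch assumes numeric attributes that are normally distributed within each class; the two attributes (slope in degrees and annual rainfall in mm) and the six training rows are hypothetical values chosen only to make the example run, and the function names are illustrative.

```python
from math import exp, pi, sqrt
from statistics import mean, pstdev


def gaussian_pdf(t, mu, sigma):
    """Normal density used as the conditional probability P(t_i | G_j)."""
    return exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)


def fit(rows, classes):
    """Estimate the prior and the per-attribute (mean, std) for every class."""
    model = {}
    for g in set(classes):
        members = [row for row, c in zip(rows, classes) if c == g]
        prior = len(members) / len(rows)
        stats = [(mean(column), pstdev(column)) for column in zip(*members)]
        model[g] = (prior, stats)
    return model


def predict(model, t):
    """Return the class maximizing P(G_j) * prod_i P(t_i | G_j)."""
    scores = {}
    for g, (prior, stats) in model.items():
        score = prior
        for value, (mu, sigma) in zip(t, stats):
            score *= gaussian_pdf(value, mu, sigma)
        scores[g] = score
    return max(scores, key=scores.get)


# Hypothetical training rows: (slope in degrees, annual rainfall in mm)
rows = [(30, 3200), (35, 3300), (40, 3250), (5, 2800), (10, 2900), (8, 2850)]
classes = ["landslide"] * 3 + ["non-landslide"] * 3
model = fit(rows, classes)

steep_wet = predict(model, (33, 3220))
gentle_dry = predict(model, (7, 2820))
```

With many attributes, the product of densities can underflow, so practical implementations usually sum log-probabilities instead of multiplying probabilities.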

Rotation forest ensemble
Rotation Forest is a relatively new framework for creating classifier ensembles. It was first proposed by Rodriguez et al. (2006). The basis of Rotation Forest is that principal component analysis (PCA) is used to extract features and create the training datasets on which the base classifiers are learned (Zhang & Zhang 2009; Koyuncu & Ceylan 2013). The Rotation Forest ensemble has been used to solve several classification problems (Koyuncu & Ceylan 2013). The principal aim of the Rotation Forest ensemble technique is to encourage individual accuracy and diversity simultaneously (Rodriguez 2007). The success of Rotation Forest relies on the rotation matrix created by the transformation method and on the base classifiers (Xia et al. 2014).
Suppose that $x = (x_1, x_2, \dots, x_{10})$ is the vector of the 10 landslide causal factors, $y = (y_1, y_2)$ is the vector of the landslide and non-landslide classes, and $X$ represents the training set. $C_1, C_2, \dots, C_L$ are the classifiers in the ensemble, and $T$ is the set of landslide causal factors. The steps for training classifier $C_i$ are implemented as follows (Rodriguez et al. 2006; Rodriguez 2007; Zhang & Zhang 2009; Xia et al. 2014).
First, the rotation matrix $R_i^a$ is generated by rearranging the matrix $R_i$, which is given by

$$R_i = \begin{bmatrix} a_{i,1}^{(1)}, a_{i,1}^{(2)}, \dots, a_{i,1}^{(Q_1)} & [0] & \cdots & [0] \\ [0] & a_{i,2}^{(1)}, a_{i,2}^{(2)}, \dots, a_{i,2}^{(Q_2)} & \cdots & [0] \\ \vdots & \vdots & \ddots & \vdots \\ [0] & [0] & \cdots & a_{i,K}^{(1)}, a_{i,K}^{(2)}, \dots, a_{i,K}^{(Q_K)} \end{bmatrix}$$

To make the matrix $R_i$: (i) $T$ is split into $K$ subsets, with $Q = 10/K$ landslide causal factors in each subset. (ii) For classifier $C_i$, let $T_{i,j}$ be the $j$th subset of the landslide causal factors, $j = 1, 2, \dots, K$, and let $X_{i,j}$ be the data for the landslide causal factors in $T_{i,j}$ taken from $X$. $X_{i,j}'$ is randomly selected from $X_{i,j}$ with 75% of its size using the bootstrap algorithm. $X_{i,j}'$ is then transformed to obtain the coefficients $a_{i,j}^{(1)}, a_{i,j}^{(2)}, \dots, a_{i,j}^{(Q_j)}$, each of size $Q \times 1$. (iii) A sparse rotation matrix $R_i$ is arranged with the obtained coefficients.
Then, for a given test sample $x$, the confidence for each class is calculated by the average combination method:

$$\mu_k(x) = \frac{1}{L} \sum_{i=1}^{L} g_{i,k}\left(x R_i^a\right), \quad k = 1, 2, \dots, c,$$

where $g_{i,k}(x R_i^a)$ is the probability assigned by the classifier $C_i$ to the hypothesis that $x$ belongs to class $k$.
Lastly, $x$ is assigned to the class with the largest confidence.
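The training and averaging steps can be sketched with NumPy as below. This is an illustrative sketch under several assumptions rather than the implementation used in the study (which ran in Weka): the data are synthetic, the values L = 10 and K = 5 are arbitrary, PCA is computed from the eigenvectors of the covariance matrix, and a small Gaussian Naïve Bayes class written for this example stands in for the base classifier. All function and class names are hypothetical.

```python
import numpy as np


def pca_coefficients(X):
    """Principal axes (as columns) from the eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(Xc, rowvar=False))
    _, vectors = np.linalg.eigh(cov)
    return vectors[:, ::-1]                     # largest-variance axis first


def rotation_matrix(X, K, rng, sample_fraction=0.75):
    """Build the rearranged rotation matrix for one base classifier."""
    n_samples, n_factors = X.shape
    subsets = np.array_split(rng.permutation(n_factors), K)   # step (i)
    R = np.zeros((n_factors, n_factors))
    for cols in subsets:
        # step (ii): bootstrap 75% of the samples, then PCA on this factor subset
        rows = rng.choice(n_samples, size=int(sample_fraction * n_samples), replace=True)
        # step (iii): writing each block at the original factor positions
        # gives the matrix already rearranged to the original factor order
        R[np.ix_(cols, cols)] = pca_coefficients(X[np.ix_(rows, cols)])
    return R


class GaussianNB:
    """Minimal Gaussian Naive Bayes used as the base classifier."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.sigma = np.array([X[y == c].std(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict_proba(self, X):
        z = (X[:, None, :] - self.mu) / self.sigma
        log_post = (np.log(self.prior) - 0.5 * (z ** 2).sum(axis=2)
                    - np.log(self.sigma).sum(axis=1))
        log_post -= log_post.max(axis=1, keepdims=True)       # numerical stability
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)


def fit_rotation_forest(X, y, L=10, K=5, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(L):
        R = rotation_matrix(X, K, rng)
        ensemble.append((R, GaussianNB().fit(X @ R, y)))
    return ensemble


def predict_confidence(ensemble, X):
    """Average the class probabilities of all base classifiers."""
    return np.mean([clf.predict_proba(X @ R) for R, clf in ensemble], axis=0)


# Synthetic stand-in for 10 causal factors and two classes
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, size=(100, 10)),
               data_rng.normal(3.0, 1.0, size=(100, 10))])
y = np.array([0] * 100 + [1] * 100)

ensemble = fit_rotation_forest(X, y)
confidence = predict_confidence(ensemble, X)
predicted = confidence.argmax(axis=1)           # class with the largest confidence
R_demo = ensemble[0][0]
```

Because every PCA block is orthonormal and the blocks cover disjoint factor subsets, each rotation matrix is orthogonal: it rotates the feature axes without discarding information, which is how the ensemble gains diversity while keeping each base classifier accurate.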

The novel classifier ensemble model
In this study, the novel ensemble classifier model is generated by combining the Naïve Bayes classifier and the Rotation Forest ensemble. The Rotation Forest ensemble was first applied to create the training subsets. Thereafter, the Naïve Bayes classifier was used to construct base classifiers from these subsets for classification. The methodological flow chart of the novel classifier ensemble model is shown in Figure 7. The advantage of the novel classifier ensemble model is that the training subsets are optimized using the Rotation Forest ensemble and are then used to train the Naïve Bayes base classifier. Therefore, the novel classifier ensemble model could improve the predictive capability of the Naïve Bayes base classifier.

Statistical index-based evaluations
Five statistical indexes, namely Positive Predictive Value (PPV), Negative Predictive Value (NPV), sensitivity, specificity, and accuracy, were chosen to evaluate the performance of the landslide models. PPV is the probability that pixels classified as 'landslide' are classified correctly. NPV is the probability that pixels classified as 'non-landslide' are classified correctly. Sensitivity is the probability that landslide pixels are correctly classified as 'landslide'. Specificity is the probability that non-landslide pixels are correctly classified as 'non-landslide'. Accuracy is the proportion of landslide and non-landslide pixels that are correctly classified. These values were calculated from the confusion matrix (Dou et al. 2015), which comprises true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) and was obtained through the training and validation process in Weka software, version 3.6.11.
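The five indexes follow directly from the four confusion-matrix counts, as the following sketch shows. The formulas are the standard definitions of these indexes; the function name and the counts in the example are hypothetical and are not the values reported in this study.

```python
def evaluate(tp, fp, tn, fn):
    """Statistical indexes computed from confusion-matrix counts."""
    return {
        "PPV": tp / (tp + fp),                        # correct among predicted landslide
        "NPV": tn / (tn + fn),                        # correct among predicted non-landslide
        "sensitivity": tp / (tp + fn),                # landslide pixels found
        "specificity": tn / (tn + fp),                # non-landslide pixels found
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # all correctly classified pixels
    }


# Hypothetical pixel counts, for illustration only
indexes = evaluate(tp=80, fp=20, tn=70, fn=30)
```

PPV and NPV are conditioned on the predicted class, whereas sensitivity and specificity are conditioned on the true class, so the two pairs can differ noticeably when the classes are unbalanced.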
The overall performance of the landslide models is evaluated using the Receiver Operating Characteristic (ROC) curve technique. The ROC curve is a graph in which each point represents a pair of sensitivity and 100-specificity values corresponding to a particular decision threshold (Fawcett 2006; Dou et al. 2014). The area under the ROC curve (AUC) indicates the goodness-of-fit of landslide models on the training data and their prediction capability on the validation data (Jones and Athanasiou 2005). An AUC value of 1 represents a perfect model, whereas an AUC value of 0 indicates a non-accurate model. The closer the AUC value is to 1, the better the performance of the landslide model (Walter 2002; Pourghasemi et al. 2012). According to Kantardzic (2011), AUC values can be classified into intervals with corresponding performance levels: 0-0.7 (poor), 0.7-0.8 (fair), 0.8-0.9 (good), and 0.9-1.0 (very good).
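The construction of the ROC curve and its area can be sketched as follows. The sketch uses fractions rather than percentages (so the horizontal axis is 1 - specificity), takes every distinct score as a decision threshold, and integrates with the trapezoidal rule; the function names, scores, and labels are hypothetical.

```python
def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical susceptibility indexes and true classes (1 = landslide)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
```

For these six pixels, eight of the nine landslide/non-landslide pairs are ranked correctly, so the area equals 8/9.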

Feature selection using Information Gain method
The predictive capability of the ten landslide causal factors was evaluated on the training data using the Information Gain method with ten-fold cross-validation. The Average Information Gain (AIG) value and its standard deviation were calculated for each factor and ranked (Table 3). In general, all ten landslide causal factors contribute to the landslide models (AIG > 0). Aspect shows the highest contribution to landslide models in the study area (AIG = 0.189), followed by slope (AIG = 0.166), rainfall (AIG = 0.154), curvature (AIG = 0.15), lithology (AIG = 0.138), elevation (AIG = 0.13), land use (AIG = 0.074), and distance to rivers (AIG = 0.004). Distance to faults and distance to roads have the least contribution to landslide models, with an AIG of 0.002.

Model performance and validation
The performance of the novel classifier ensemble model for landslide susceptibility assessment is shown in Table 4 and Figures 8 and 9. The results show that the novel model has a very high degree of goodness-of-fit on the training data, with a predictive accuracy of 87.37% and an area under the ROC curve of 0.94 (AUC = 0.94). More specifically, the probability that pixels are classified correctly as the 'landslide' class is 95.13% (PPV = 95.13%), whereas the probability that pixels are classified correctly as the 'non-landslide' class is 79.62% (NPV = 79.62%). The probability that landslide pixels are classified correctly as the 'landslide' class is 82.36% (sensitivity = 82.36%). The probability that non-landslide pixels are classified correctly as the 'non-landslide' class is 93.20% (specificity = 93.20%).
The novel classifier ensemble model was validated using the validation dataset, which was not used during the training process. The results indicate that the novel model performs well in landslide susceptibility assessment, with a predictive accuracy of 78.77% and an AUC value of 0.846. Moreover, the probability that pixels are classified correctly as the 'landslide' class is 78.79% (PPV = 78.79%). The probability that pixels are classified correctly as the 'non-landslide' class is 78.75% (NPV = 78.75%). The probability that landslide pixels are classified correctly as the 'landslide' class is 78.76% (sensitivity = 78.76%). The probability that non-landslide pixels are classified correctly as the 'non-landslide' class is 78.78% (specificity = 78.78%).

Reclassification of landslide susceptibility map
The landslide susceptibility map is the final result of landslide susceptibility assessment using the novel classifier ensemble model. To construct this map, landslide susceptibility indexes were extracted after the model training phase was completed successfully. Using ArcGIS software 10.2, each pixel inside the study area was assigned a unique susceptibility index, and the landslide susceptibility map was then reclassified by ranking and grouping the landslide susceptibility indexes.
According to Pradhan and Lee (2010), landslide susceptibility classes can be defined based on the percentage of area of the region. First, the susceptibility indexes of all cells were sorted in descending order. These indexes were then grouped into several groups according to the area percentage of the region. Moreover, Althuwaynee et al. (2014) proposed that landslide susceptibility might be classified into five categories (very high, high, moderate, low, and not susceptible). In this study, the landslide susceptibility map (Figure 10) was constructed with five classes on the basis of the area percentage of the region, namely: Not susceptible (50%), Low (20%), Moderate (15%), High (10%), and Very high (5%).
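The area-percentage reclassification can be sketched as follows: the indexes are ranked in descending order and cut at cumulative area shares of 5%, 15%, 30%, and 50%. The function name and the evenly spaced example indexes are hypothetical; the study performed this step in ArcGIS.

```python
from collections import Counter

# Class shares in percent of the study area, from most to least susceptible
AREA_SHARES = [("Very high", 5), ("High", 10), ("Moderate", 15),
               ("Low", 20), ("Not susceptible", 50)]


def reclassify(indexes):
    """Assign a susceptibility class to each pixel by ranked area percentage."""
    order = sorted(range(len(indexes)), key=lambda i: indexes[i], reverse=True)
    classes = [None] * len(indexes)
    start, cumulative = 0, 0
    for name, share in AREA_SHARES:
        cumulative += share
        end = round(cumulative * len(indexes) / 100)
        for pixel in order[start:end]:
            classes[pixel] = name
        start = end
    return classes


# Hypothetical susceptibility indexes for 100 pixels
indexes = [i / 100 for i in range(100)]
classes = reclassify(indexes)
class_counts = Counter(classes)
```

Because the class limits are defined by area share rather than by fixed index values, the thresholds adapt to the distribution of indexes produced by each model.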

Model comparison
The performance of the novel classifier ensemble model was compared to other ensemble techniques using Na€ ıve Bayes as a base classifier such as Bagging, AdaBoost, MultiBoost. These models are well known as boosting techniques that is one of the most important recent methodological developments in classification (Friedman & Tibshirani 2000). Additionally, an individual classifier ensemble of Random Forest was also taken into account for comparison.
Bagging is one of the earliest ensemble learning algorithms proposed by Breiman (1996). It is known as a bootstrap aggregation using the training dataset to generate multiple random subsets. After that, the Na€ ıve Bayes classifier-based model is constructed on the base of each subset. The final classifier ensemble model is formed by integrating these classifiers.
AdaBoost is one of the most popular boosting algorithms for classification (Mease & Wyner 2008). AdaBoost was introduced by Freund and Schapire (1997), and it is known as an extremely effective adaptive boosting (Dietterich 2002). It creates the training subsets and assigns the weights for each subset through sampling process using base training set, and then the Na€ ıve Bayes classifier uses these weighted subsets for classification.
MultiBoost is a combination of boosting and wagging techniques for reducing both variance and bias and avoiding the over-fitting (Geoffrey 2000). Using the training set, the subsets of training are built through random selection. These subsets are then assigned the weights through the boosting technique. Thereafter, the Na€ ıve Bayes classifier model uses these subsets to produce the outcomes. However, the training process is to be continuous by resetting the weights of subsets according to the overall accuracy performance of the Na€ ıve Bayes classifier model. Training process is finished if the optimal weights are assigned in training subsets to get the highest overall accuracy performance.
Random Forest, proposed by Breiman (2001), is a combination of multiple decision tree classifiers that utilizes both bagging and random variable selection. In the beginning, the training subsets are generated randomly from the original training dataset using a bootstrap aggregation approach. Random Forest is an effective ensemble technique that can obtain good results with both low bias and low variance (Gislason et al. 2006).
Using the training dataset, the performance of the four landslide susceptibility models, namely Bagging, AdaBoost, MultiBoost, and Random Forest, is shown in Table 5 and Figure 11. It can be clearly seen that these four models have a high degree of goodness-of-fit in landslide susceptibility assessment. Of these models, the Random Forest model is the highest (Accuracy = 96.26%, AUC = 0.994), followed by the AdaBoost model (Accuracy = 83.40%, AUC = 0.906), the MultiBoost model (Accuracy = 83.24%, AUC = 0.903), and the Bagging model (Accuracy = 83.17%, AUC = 0.901), respectively. Overall, the novel classifier ensemble model has a higher degree of goodness-of-fit than Bagging, AdaBoost, and MultiBoost. However, it is lower than that of the Random Forest model.
The validation of the four landslide models has been carried out using the validation dataset. The results are shown in Table 6 and Figure 12. The predictive accuracy of the MultiBoost model is the highest (79.3%), followed by the Bagging model (79.03%), the AdaBoost model (77.44%), and the Random Forest model (67.53%), respectively. The MultiBoost model and the Bagging model have higher accuracy than the novel classifier ensemble model, while the AdaBoost model and the Random Forest model have lower accuracy. Regarding the area under the ROC curve, the Bagging model has the highest value among these four models.
The performance capability of the novel classifier ensemble model has been further compared with the four other landslide models using McNemar's test. As described by Everitt (1992), it is a statistical test based on the chi-square statistic (χ²) (Kuncheva 2004). This test assesses the significance of differences between the landslide models. If the χ² value is greater than the critical value of 3.841459, which corresponds to a level of significance (p) of less than 0.05, the null hypothesis that the two models do not differ can be rejected, and the two models are considered significantly different (Dietterich 1998).
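McNemar's test for a pair of models can be sketched as follows. The sketch assumes the continuity-corrected statistic, (|b - c| - 1)² / (b + c), where b and c count the validation samples that only one of the two models classifies correctly; whether this correction was applied in the present study is not stated, and the function name and the toy predictions are hypothetical.

```python
import math

CRITICAL_VALUE = 3.841459  # chi-square, 1 degree of freedom, alpha = 0.05


def mcnemar(y_true, pred_a, pred_b):
    """Return (chi-square, p-value) of McNemar's test for two classifiers."""
    # b: samples that model A classifies correctly and model B does not.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    # c: samples that model B classifies correctly and model A does not.
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)  # continuity correction (assumed)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy example: model A is always right, model B is wrong on 12 of 20 samples.
y_true = [1] * 20
chi2, p_value = mcnemar(y_true, [1] * 20, [1] * 8 + [0] * 12)
significant = chi2 > CRITICAL_VALUE and p_value < 0.05
```

The two conditions in the last line are equivalent for one degree of freedom, since a χ² value of 3.841459 corresponds to p = 0.05.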
The results of McNemar's test comparing the prediction ability of the novel classifier ensemble model with that of the other landslide models (AdaBoost, Bagging, MultiBoost, and Random Forest) are shown in Table 7. The comparison of the novel classifier ensemble model with the AdaBoost model has the smallest chi-square value (38.823), which is still far higher than the critical value of 3.841459. Furthermore, the p-value of all tests (p < 0.0001) is far lower than 0.05. Therefore, the difference between the novel classifier ensemble model and each of the four other landslide models is statistically significant. Its performance nevertheless remains comparable to that of the other landslide models.
No.  Parameter       AdaBoost  Bagging  MultiBoost  Random Forest
1    True positive   3742      4351     4347        2226
2    True negative   3969      3519     3550        4499
3    False positive  1237      628      632         2753
4    False negative  1010      1460     1429        480
5    PPV (
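The validation accuracies reported for the four models can be recomputed from the confusion-matrix counts listed above. In the sketch below, the assignment of each column of counts to a model is an assumption, made because it reproduces the reported accuracies (77.44%, 79.03%, 79.30%, and 67.53%). The function name `summarize` is hypothetical, and the index definitions are the standard ones.

```python
def summarize(tp, tn, fp, fn):
    """Statistical indexes (in percent) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100 * (tp + tn) / total,
        "ppv": 100 * tp / (tp + fp),          # positive predictive value
        "npv": 100 * tn / (tn + fn),          # negative predictive value
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
    }


# (TP, TN, FP, FN); the column-to-model mapping is inferred, not given.
counts = {
    "AdaBoost": (3742, 3969, 1237, 1010),
    "Bagging": (4351, 3519, 628, 1460),
    "MultiBoost": (4347, 3550, 632, 1429),
    "Random Forest": (2226, 4499, 2753, 480),
}

accuracy = {name: round(summarize(*c)["accuracy"], 2)
            for name, c in counts.items()}
```

Each column sums to 9958 validation samples, so, for example, the AdaBoost accuracy is (3742 + 3969) / 9958 = 77.44%.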

Discussions
Landslide susceptibility assessment has been carried out at Luc Yen district, Yen Bai province (Viet Nam) using the novel ensemble classifier model, which is a combination of the Naïve Bayes classifier and the Rotation Forest ensemble. Naïve Bayes is an effective classifier. However, in landslide problems, its performance is affected by the independence assumption (Pham et al. 2016e). In contrast, Rotation Forest is a promising ensemble technique that can be used to improve the performance of individual classifiers (Pham et al. 2016e). Therefore, an ensemble classifier framework encompassing these two techniques could result in better performance of landslide susceptibility assessment. Landslide causal factors are usually used to prepare input data for running landslide models. Selection of these factors plays a crucial role in obtaining quality output from the model (Tien Bui 2012). Feature selection is an effective method for selecting the variables in input data for modelling (Pham et al. 2015b), and it can be used to identify the irrelevant or unimportant variables in the set of variables. These variables are then removed to optimize the inputs and improve the prediction accuracy of modelling (Dash & Liu 1997). In this study, the Information Gain method was used to select the best landslide causal factors for the novel classifier ensemble model in landslide susceptibility assessment in the study area. Results show that all ten landslide causal factors (slope, aspect, elevation, curvature, rainfall, land use, lithology, distance to rivers, distance to faults, and distance to roads) have predictive capability for landslide modelling. However, aspect and slope have the highest contribution to the landslide models, which is in agreement with other studies carried out by Sadr et al. (2014) and Van Den Eeckhaut et al. (2006).
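The idea behind the Information Gain criterion can be shown with a short sketch. It assumes that each causal factor has already been discretized into classes and uses the textbook definition, namely the entropy of the landslide labels minus their conditional entropy given the factor. The function names and the toy data are hypothetical and do not reproduce the exact evaluation procedure of this study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy given the factor classes."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example: 1 = landslide, 0 = non-landslide.
labels = [1, 1, 0, 0]
gain_informative = information_gain(["steep", "steep", "gentle", "gentle"], labels)
gain_uninformative = information_gain(["north", "south", "north", "south"], labels)
```

A factor whose classes separate landslide from non-landslide cells perfectly reaches the maximum gain of 1 bit here, whereas a factor unrelated to the labels has a gain of 0 and would be a candidate for removal.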
Analysis results show that the novel classifier ensemble model has the best degree of fit for landslide susceptibility assessment compared with the other models on the basis of the area under the ROC curve. Moreover, regarding predictive accuracy, it is higher than the AdaBoost model by 1.33% and the Random Forest model by 11.24%. However, it is slightly lower than the Bagging model (by 0.26%) and the MultiBoost model (by 0.53%). Results of the present study are comparable with Rodriguez et al. (2006), who showed that the Rotation Forest ensemble performs significantly better than other models such as AdaBoost and Random Forest; however, its performance is lower than that of the MultiBoost ensemble and quite similar to that of the Bagging ensemble. In comparison with other methods, the novel classifier ensemble model uses the Naïve Bayes classifier, which is able to deal with uncertainty, and the Rotation Forest ensemble, which is more effective in dealing with small sample sizes and with high-dimensional and complex data structures (Pham et al. 2016d).
Moreover, the present study used McNemar's statistical test to evaluate the significance of the differences between the novel classifier ensemble model and the other landslide models. McNemar's statistical test is known as one of the most powerful statistical tests for comparison (Roggo et al. 2003), and it should be used to evaluate the performance of landslide models. Results (Table 7) show that the performance of the novel classifier ensemble model is statistically different from that of the other models (AdaBoost, Bagging, MultiBoost, and Random Forest).

Conclusions
A new methodological approach that combines the Rotation Forest ensemble and the Naïve Bayes classifier has been proposed for landslide susceptibility assessment at Luc Yen district of Yen Bai province (Viet Nam). This combined approach has not been applied so far in other landslide studies. The performance of the novel landslide model was compared with that of other landslide models using current state-of-the-art ensemble frameworks (AdaBoost, Bagging, MultiBoost, and Random Forest). In addition, a feature selection method using the Information Gain technique has been adopted to select the best landslide causal factors for running the landslide models. The analysis indicates that the novel classifier ensemble method is a promising technique that could be considered as an alternative for the assessment of landslide susceptibility. The analysis also reveals that the performance of the novel model is comparable with that of other landslide models such as AdaBoost, Bagging, MultiBoost, and Random Forest. Moreover, when using this model, the Information Gain technique should be used as a feature selection method to evaluate the importance of landslide causal factors for landslide susceptibility assessment. Additionally, this novel classifier ensemble method can be used for the evaluation of different types of landslides under varying geo-environmental conditions. Results of the present study could be helpful for natural hazard management, planning, and decision making in areas affected by landslides.