COLAFOS: a hybrid machine learning model to forecast potential coseismic landslide severity

ABSTRACT Timely and rational prediction of coseismic landslides is crucial for the design and development of key infrastructure capable of protecting human lives in seismically active regions. This research introduces the novel hybrid Coseismic Landslide Forecasting model (COLAFOS), which takes into consideration three parameters, namely the average slope of the active areas, the slope aspect and the type of geological forms. The developed model was tested on two datasets from the island of Lefkada, Greece, for the years 2003 and 2015. COLAFOS is a hybrid model, employing the Fuzzy c-Means clustering, the Ensemble Adaptive Boosting (ENS_AdaBoost) and the Ensemble Subspace k-Nearest Neighbour (ENSUB k-NN) algorithms. The introduced model correctly classified the coseismic landslides according to their severity, with success rates of 70.07% and 72.88% for 2003 and 2015, respectively. The algorithm performed very well for the classes of major severity, reaching an accuracy of up to 92%. Accuracy, Sensitivity, Specificity, Precision and F1 Score were used to evaluate the performance of the model. Given that this is a multi-class classification problem, the 'One Versus All' strategy was used in the evaluation process. Although the datasets were relatively unbalanced, the evaluation indices confirmed the efficiency of the model.


Introduction
Landslides occur in a variety of environments, characterized by either steep or gentle slope gradients, from mountain ranges to coastal cliffs or even underwater. Gravity is the primary driving force for a landslide to occur; however, other factors also affect slope stability in ways that make a slope prone to failure. In many cases, a landslide is triggered by a specific event, such as heavy rainfall, an earthquake, a slope cut to build a road and many others (Huang et al., 2020). Earthquake-induced landslides are among the most catastrophic consequences of earthquakes, as evidenced by many historic events over the past decades, especially in countries with high seismicity (Dunn, 1973).
The 1994 Northridge earthquake triggered more than 11,000 landslides (Harp & Jibson, 1996). During the 2008 Wenchuan earthquake in China, the earthquake-induced Tangjiashan landslide caused a mass movement of over 20.37 million m³ and blocked the main river channel, forming a landslide dam that posed a deadly threat to the lives of millions of people downstream (Yin et al., 2009). According to Jibson et al. (Robinson et al., 2017), landslides triggered by earthquakes have repeatedly had a massive impact on human lives and facilities. The correlation of the coseismic landslide pattern with geological and topographical variables (i.e. lithology, slope angle, slope aspect and volume) has been investigated by many researchers. According to field observations, the density of the mapped landslide concentration is normally associated with the magnitude of seismic shaking (Zhao & Crosta, 2018). In particular, it has been shown that landslide frequencies are higher in areas of highest Peak Ground Acceleration (PGA) and that landslide density decays with epicentral or fault distance (Ayalew et al., 2011; Bezdek, 2013; Papathanassiou et al., 2005).
Pinpointing the areas most vulnerable to coseismic landslides is vital in order to take timely actions and reduce the risk level. Realistic prediction of coseismic landslides is crucial for the design of key infrastructure capable of protecting human lives in seismically active regions.
The Newmark sliding mass model has been extensively utilized since the 1960s to estimate earthquake-induced displacements in slopes, earth dams and landfills (Du et al., 2018a, 2018b; Kokusho, 2019). As technology develops, new methods and techniques have been proposed for assessing the degree of risk within an area of interest, such as satellite imagery and Geographic Information Systems (GIS) combined with statistical analyses of geo-environmental and seismologic factors (Louvari et al., 1999). In particular, the characteristics of a landsliding area are statistically related to topographic, geologic and seismic parameters, e.g. slope angle, slope aspect, curvature, lithology, Peak Ground Acceleration (PGA), seismic intensity distribution and distance from the seismic fault or epicentre (Sachpazi et al., 2000). These correlations can provide crucial information for seismic landslide hazard analysis and the planning of mitigation measures in regions prone to earthquake-induced landslides (Cover & Hart, 1967; Meunier et al., 2007; Robinson et al., 2017; Wei & Lu, 2017). Moreover, the coupling effect between topography and soil amplification can lead to complex wave propagation patterns, due to the scattering and diffraction of waves within the low-velocity near-surface layers (Asimaki & Mohammadi, 2018; Assimaki & Jeong, 2013; Wang et al., 2018). These ground-motion effects have a significant impact on the assessment of coseismic landslides. However, only limited efforts have been devoted to assessing these effects with empirical models (Song & Rodriguez-Marek, 2015; Tsai & Lin, 2018). There is a clear need to develop innovative numerical schemes to address the above challenges.
Being able to predict the severity of an upcoming landslide after an earthquake could be extremely beneficial. It can lead to specific pre-emptive actions that reduce damage and avoid injuries or deaths; overall, it can effectively limit disastrous consequences. Short-term forecasting approaches usually consider features derived from the use of expensive equipment (Cao et al., 2016; Lee et al., 2015; Shiels et al., 2008). The aim of this research is the development of an equally efficient long-term model that does not require significant resources. In the existing literature, there is a lack of models that can predict the severity of coseismic landslides by assessing only three features, namely the slope angle, the slope aspect and the geological forms of a specific area. To the best of our knowledge, COLAFOS is the first flexible, intelligent model in the literature that not only forecasts potential seismic landslides but also applies Machine Learning and Fuzzy Algebra in order to assign severity labels to the examined areas.
The research described herein is an extension of a previous effort by Psathas et al. (2020). The development of the COLAFOS (Coseismic Landslides Forecasting Severity) model could assist risk management organizations, public agencies, stakeholders and even governments in better distributing staff and financial resources to each area. This can support the confrontation of potential consequences and the development of appropriate mitigation plans, capable of increasing the resilience of the communities. In the literature, the statistical analysis of geo-environmental and seismologic factors is mainly performed by bivariate and multivariate approaches.
This research paper introduces the novel COLAFOS hybrid algorithm, which relates three parameters of coseismic landslides, namely Slope, Aspect and Geological forms, to the actual landslide severity. In particular, it discusses the development of a hybrid model that combines the Unsupervised Fuzzy C-Means Clustering algorithm (FCM) (Calvo et al., 2012; Samworth, 2012) with the Ensemble Adaptive Boosting (Freund & Schapire, 1997; Hastie et al., 2009; Hsu et al., 2003; Ying et al., 2013) and the Ensemble Subspace k-Nearest Neighbour classification algorithms (Grabowski, 2002; Khalilia & Popescu, 2012). The introduced COLAFOS model comprises two Computational Intelligence (CI) sub-models. The first sub-model follows the Unsupervised Fuzzy c-Means Clustering (FCM) algorithm and assigns labels to the available data records. The second sub-model performs classification using a robust hybrid Machine Learning algorithm. This hybrid approach simplifies and at the same time extends previous research efforts of our team (Psathas et al., 2020). More specifically, it considers three independent parameters (which can be derived at low cost) instead of the five used so far. Finally, this manuscript does not use nominal classes for the average aspect, but considers actual degrees (numerical values) instead.
Existing research has tried to model and estimate the values of the independent parameters Planar Area and Area before an earthquake incident. In this research, the aforementioned variables were measured after the landslide and were taken into consideration for the determination of the clusters.
Several models in the literature aim to model landslides (Collins et al., 2012; He et al., 2019; Marjanović et al., 2011; Papathanassiou et al., 2017a; Shirzadi et al., 2018; Thai Pham et al., 2019). An important disadvantage of the existing approaches is that they are too slow to effectively inform emergency responders. Thus, there is a need for a faster and more flexible model that relies on more affordable factors. Furthermore, most of the current approaches use crisp values for the determination of slope angle or slope aspect classes; as a result, coseismic landslides characterized by borderline values can easily be misclassified. To the best of our knowledge, this is the first time that such a hybrid algorithm is introduced for landslide severity forecasting in the literature.

Literature review
The dataset used is not publicly available. It was developed by the lab of Technical Geology of the Department of Civil Engineering, Democritus University of Thrace. Thus, to the best of our knowledge, no other similar research efforts following this modelling approach have been published on this specific dataset. However, the paragraphs below give a short overview of already published research on landslide severity prediction. Dehnavi et al. (2015) proposed a novel hybrid model, based on a Step-Wise weight Assessment Ratio Analysis (SWARA) method and an Adaptive Neuro-Fuzzy Inference System (ANFIS), for the evaluation of landslide severity in Iran. They achieved accuracies of 0.84 and 0.80 in testing and training, respectively, considering 12 landslide predisposing factors: lithology, slope angle, slope aspect, plan curvature, profile curvature, altitude, distance to streams, distance to faults, distance to roads, land use, seismicity and rainfall. Wang et al. (2022) presented a hybrid Particle Swarm Optimization-Extreme Learning Machine (PSO-ELM) AdaBoost model for future debris-flow volume estimation, using a dataset of measured debris-flow volumes from 60 catchments in China. Their model achieved a MAPE close to 0.1, considering the following five features: topographic relief, channel length, distance from seismic fault, average channel gradient and total volume of co-seismic landslide debris. Di Napoli et al. (2020) studied the Cinque Terre area by considering thirteen independent variables, namely slope angle, aspect angle, planform curvature, profile curvature, distance to roads, distance to streams, topographic wetness index, topographic position index, stream power index, agricultural terraces' state of activity, land use, geo-lithology and soil thickness. They achieved an ROC value of 0.9, which is a clear indication of good convergence. Tsangaratos et al. (2018) developed an ML model with an accuracy index of 0.812 for the prediction of moving landslide rate curves on the island of Lefkada, Greece, using lithological units, slope angle, slope orientation, distance from tectonic features, distance from the hydrographic network and distance from the road network. Chen, Chen, Tsangaratos, Ilia, and Wang (2020) used Genetic Algorithms (GA) as a feature selection method, whereas the PSO algorithm was used to optimize the structural parameters of two ML models, namely Support Vector Machines (SVM) and Artificial Neural Networks (ANN). Overall, 12 independent variables were considered: elevation, slope angle, slope aspect, curvature, plan curvature, profile curvature, topographic wetness index, stream power index, distance to faults, distance to river, lithology and hydrological cover. The Achaia Regional Unit, located in Northern Peloponnese, Greece, was the application area. The ANN model showed the highest prediction accuracy, with an Area Under the Curve (AUC) equal to 0.800, followed by the SVM, for which the AUC was equal to 0.750.
Several other research efforts for the prediction of landslide severity have been published in the literature. However, to the best of our knowledge, the research presented herein is the only one that uses only three independent variables, offering a simpler and more flexible model. The accuracy is as high as 92% for the classes of major severity.

Area of research
'Lefkada' is a Greek island in the Ionian Sea with an area of 302 km² (Vött et al., 2009). It is connected to the mainland by a floating bridge. Lefkada town is the capital, located in the northern part of the island, which is 35 km long and 15 km wide. Its basic geological characteristics can be described as follows: (a) a carbonate sequence of the Ionian zone; (b) limestone of the Paxos (Apulia) zone, restricted to the south-west (SW) peninsula of the island; and (c) a few outcrops of Ionian flysch (turbidites) and Miocene marls-sandstones, mainly in the northern part of the island (Dudani, 1978; Papathanassiou et al., 2021).
The geological zones of Ionian and Paxos are separated by a boundary running in the north-west (NW)-south-east (SE) direction of the region. It projects onshore in south-central Lefkada near 'Hortata' village, in the form of a buried thrust fault covered by scree (a collection of rock fragments at the base of crags, mountain cliffs or volcanoes) and late Quaternary deposits (Keefer, 1984). 'Pleistocene' and especially 'Holocene' coastal deposits are spread over the northern edge of the island, where its capital is located, in the valley of 'Vassiliki' and on the coast of 'Nydri'.
Due to its location in the 'Ionian' Sea and to the complex crustal deformation resulting from the subduction of the African Plate towards the north-east (NE) and the continental collision of the 'Apulian' platform further to the NW, Lefkada is one of the most tectonically active areas on the European continent (Khalilia et al., 2014; Rifkin & Klautau, 2004; Papathanassiou et al., 2013). Its principal active tectonic structure is the 140 km long, dextral strike-slip 'Cephalonia-Lefkada' Transform Fault (Map 1) (CTF; Cushing, 1985; Parker et al., 2011), which has a Global Positioning System (GPS) slip rate bracketed between 10 and 25 mm/year. Most of the slope failure cases have been reported on the western part of the island, which owes its steep morphology to this offshore CTF and its onshore sub-parallel fault, the Athani-Dragano fault (Dudani, 1978; Parise & Jibson, 2000). The latter is a north-north-east-south-south-west striking fault, forming a narrow elongated continental basin, precisely depicted in the region's morphology and indicated on satellite images and aerial photos.
There is a detailed record of at least 23 events with significant impact on the ground of Lefkada island (Hatzfeld et al., 1995). A first conclusion drawn from these events is that earthquakes occur in pairs (twin or cluster events), with time intervals between occurrences ranging from 2 months to 5 years, e.g. 1612-1613 (16 months); 1625-1630 (5 years); 1722-1723 (10 months); 1767-1769 (2 years); 1783-1783 (2 months, possible aftershock); 1867-1869 (2 years); 1914-1915 (2 months); and 1948-1948 (2 months). Therefore, it is of great importance to pinpoint the locations of coseismic landslides, since this will help to reduce the hazards and increase the resilience of the island.

Coseismic landslides at the island of Lefkada
The most recent and well-examined earthquakes are those of 2003 and 2015. The penultimate earthquake (2003) caused massive slope failures on the western part of the island; the amount of debris material that arose was remarkably larger than that of 2015. Numerous landslides occurred across the whole island, and especially in the NW and central areas, on both natural and cut slopes, as well as on downstream road-embankment slopes. Among the most indicative failures, rock falls with diameters up to 4 m were observed along the 6 km long road of 'Tsoukalades-Agios Nikitas', which lies within the epicentral area, accompanied by gravel, small rock and soil slides (Hatzfeld et al., 1995). The frequent occurrence of these failures led to the closure of the road network, which lasted for more than two years. The reported rock falls followed the trace of a 300 m high morphological scarp, and especially a 10-40 m high artificial slope (Hatzfeld et al., 1995).
Regarding the 2015 earthquake, the dominant geological effects were related to slope failures, i.e. rock falls and slides, and shallow and deep-seated landslides on both natural and cut slopes (Keefer, 1984). These failures were documented on the western part of the island, while the densest concentration of these phenomena was reported in the coastal zone from 'Porto Katsiki' to 'Egremnoi-Gialos' beaches and along the 6 km long coastal road of 'Tsoukalades-Agios Nikitas' (Keefer, 1984). Shallow landslides and rock slides were mainly generated in areas where clastic material covered the bedrock, and particularly in places where the rock mass was heavily jointed. Deep-seated landslides were mainly documented in the area of 'Egremnoi' (Louvari et al., 1999), where large amounts of debris material moved downslope, inducing severe damage to the road network and to residential houses. The debris consists of coarse-grained material with a significant amount of large-diameter gravel and a few boulders.
The earthquake-induced landslide density has been investigated. Event-based inventories have been developed by taking into account aerial and satellite imagery obtained from Google Earth. This process aims to enrich and update the existing landslide datasets previously compiled for the two earthquakes. According to the multi-date satellite imagery and to the knowledge gained from local residents, landslide activity between the aforementioned major earthquakes along the western part of Lefkada is considered insignificant. Each satellite image (corresponding to a period of 2-3 years) comprises only Coseismic Landslides (COLA) and does not include those triggered by other factors. In total, 301 and 596 COLA were mapped as caused by the earthquakes of 2003 and 2015, respectively. For the extraction of morphological and terrain parameters of the compiled landslide datasets, a detailed Digital Elevation Model (DEM) with a spatial resolution of 5 m was used. The 5 m DEM was obtained from the Hellenic Cadastre and was extracted from aerial imagery stereo-pairs, with a vertical accuracy of 4 m (Louvari et al., 1999).
Having completed the polygon-based inventories, a statistical analysis of the COLA distribution took place. In total, 596 and 301 landslides were identified, covering 1.29 and 1.6 km² (planar area) for the 2015 and 2003 events, respectively. It is pointed out that these planar-oriented areas are obtained as projected measurements. The minimum and maximum COLA areas were evaluated as 40.4 and 42,940 m² for the 2015 earthquake, whereas for the penultimate event the relevant values were 129.8 and 98,300 m², respectively (Louvari et al., 1999). The corresponding values of minimum and maximum COLA for the 2003 and 2015 events were also estimated by considering the digital elevation model for the delineation of the landslide area. More specifically, for 2015 the minimum value was 51.30 m² and the maximum was 58,330 m², in a total area of 1.78 km². On the other hand, for the 2003 earthquake, the total landslide area covered 2.28 km², where the minimum COLA was 140.9 m² and the maximum was 148,469 m² (Louvari et al., 1999).

Dataset pre-processing
The coseismic landslides' features were acquired through two *.xlsx files, one for each year. They comprise six columns, five of which are numeric (Planar Area, Average Slope, Area, Average Aspect, Id) and one nominal (Geological Form). The 'id' column was used neither for the clustering nor for the classification processes; it was used only for distinguishing the unique landslide incidents. Each row of the aforementioned files represents the features of a landslide (planar area, average slope, area, average aspect and geological form), as well as the number of the landslide that the row corresponds to. As already mentioned in the area of research section, overall 301 and 596 landslides are registered for the years 2003 and 2015, respectively. Given that there are several landslides with two or more geological forms, the coseismic landslides were reassigned; therefore, the total numbers of instances for 2003 and 2015 are 423 and 767, respectively. For both years, the *.xlsx files comprise the same features, so the same data pre-processing was applied to both datasets. However, in the 2003 coseismic landslides, two additional geological forms were observed compared to those of 2015. Due to this fact, the data pre-processing for these two years was done separately. In this way, there can be an overall evaluation of the proposed algorithm, in order to conclude whether consistency and efficiency can be achieved regardless of the year.

Processing average slope
The unit used for inclination measurements is degrees. During the experiments, it was observed that there are variations between landslides that have the same severity but quite different slopes. For this reason, the natural logarithm function ln(x) was applied to the slope values in order to smooth out any spikes (Othman & Gloaguen, 2013).
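As a minimal sketch of this transformation (the function name and sample values are illustrative, not taken from the dataset):

```python
import math

def smooth_slopes(slopes_deg):
    """Apply the natural logarithm to slope angles (in degrees)
    to compress outliers and smooth out spikes."""
    return [math.log(s) for s in slopes_deg]

# Example: two landslides of similar severity but very different raw slopes
raw = [12.0, 48.0]
smoothed = smooth_slopes(raw)
# the log-scale difference ln(48) - ln(12) = ln(4) is far smaller
# than the raw 36-degree gap
```

The logarithm compresses the right tail of the slope distribution, so landslides of similar severity end up with closer feature values.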

Processing average aspect
Regarding the average aspect, the initial approach was to transform it from nominal to numeric on a scale from one to eight, according to Figure 1.
Nonetheless, it was considered more efficient to separate the landslides with an aspect equal to 10° from those with an aspect equal to 350° (Figure 1). For this purpose, the actual aspect in degrees was used for each landslide.
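For illustration, an eight-class nominal coding can be sketched as follows (a hedged Python sketch; the ±22.5° sector boundaries around each cardinal direction are an assumption about Figure 1's scheme). It shows why raw degrees were preferred: aspects of 10° and 350° collapse into the same nominal class, although they lie on opposite sides of north:

```python
def aspect_class(deg):
    """Map an aspect in degrees [0, 360) to one of eight compass
    classes (1=N, 2=NE, ..., 8=NW), with sectors assumed to be
    centred on the cardinal and intercardinal directions."""
    sectors = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    idx = int(((deg % 360) + 22.5) // 45) % 8
    return idx + 1, sectors[idx]

# 10 deg and 350 deg both fall into class 1 ("N") under the nominal
# coding, so the numeric-degree representation retains more information.
```

Using the actual degrees avoids this loss of resolution at class boundaries.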

Labelling geological forms
The authors translated the variable (Geological Forms) from nominal to numeric, in order to develop proper labels on a scale from 1 to 20 for 2003 and from 1 to 18 for 2015. Tables 1 and 2 present the labels corresponding to each Geological Form.
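A sketch of such a nominal-to-numeric encoding (the form names below are illustrative placeholders, not the actual Table 1/2 categories):

```python
def label_geological_forms(forms):
    """Assign integer labels (1, 2, ...) to geological-form names
    in their order of first appearance, and return both the encoded
    sequence and the mapping."""
    labels = {}
    for f in forms:
        if f not in labels:
            labels[f] = len(labels) + 1
    return [labels[f] for f in forms], labels

# Hypothetical example rows (one geological form per landslide instance)
codes, mapping = label_geological_forms(
    ["limestone", "flysch", "limestone", "marl"])
```

Since the 2003 dataset contains two extra forms, running the encoder separately per year naturally yields the 1-20 and 1-18 scales mentioned above.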

Fuzzy C-means clustering of landslides' data
Among the innovations of the proposed hybrid algorithm is its flexibility. A statistical analysis of the datasets pinpointed that some values of the area and the planar area caused a high standard deviation and a non-representative mean value. Thus, classifying the landslides with classical statistical methods is prohibitive. This motivated our research team to develop a new labelling approach.
The Fuzzy C-Means clustering (FCM) algorithm was employed in order to assign proper labels to all data records. FCM allows a point to be partially assigned to more than one cluster, with different degrees of membership which sum to 1. The purpose of using FCM is to produce the labels required for the development of the hybrid Machine Learning (ML) model. To achieve this goal, the authors used the features Area and Planar Area to perform the clustering.
Fuzzy C-Means clustering was developed by Dunn in 1973 (Samworth, 2012) and was improved by Bezdek (Calvo et al., 2012). It is the fuzzy equivalent of the nearest-mean 'hard' clustering algorithm (Duda et al., 1973), which minimizes the following objective function:

J_m = Σ_{i=1..N} Σ_{j=1..C} u_ij^m ||x_i − c_j||²   (1)

where m is the fuzzifier (the fuzzifier m determines the level of cluster overlapping), u_ij is the degree of membership of x_i in cluster j, x_i is the i-th d-dimensional measured data point, c_j is the d-dimensional centre of the cluster and ||·|| is any norm expressing the similarity between any measured data point and the centre. The process of fuzzy partitioning is performed through an iterative optimization of the objective function shown in Equation (1), with the update of the membership values u_ij and the cluster centres c_j:

u_ij = 1 / Σ_{k=1..C} (||x_i − c_j|| / ||x_i − c_k||)^{2/(m−1)},   c_j = Σ_{i=1..N} u_ij^m x_i / Σ_{i=1..N} u_ij^m   (2)
The condition for stopping the iterations is max_ij |u_ij^(k+1) − u_ij^(k)| < ε, where ε is a termination criterion taking values between 0 and 1 and k denotes the iteration step. This procedure converges to a local minimum or a saddle point of J_m. The feature vectors used as input are a subset of the source dataset, as they include only the Planar Area and the Area. Table 3 presents the names and the default values of the FCM hyperparameters. The fuzzification parameter m (also known as the fuzzifier) determines the level of cluster fuzziness and controls the degree of fuzzy overlapping between clusters. For large m values, the corresponding membership degrees u_ij are small, which implies clusters with smaller boundaries. If the fuzzifier m equals 1, the membership degrees are equal to 0 or 1 and the result is a crisp partitioning. The fuzzification parameter m was set to 1.2, which is the default value.
The hyperparameter maxIterations is the maximum number of iterations of the optimization of the objective function shown in Equation (1). The hyperparameter minImprovement is the minimum improvement in the objective function between successive iterations.
When the objective function improves by a value below this threshold, the optimization stops. A smaller value produces more accurate clustering results, but the clustering can take longer to converge. As already mentioned, the FCM hyperparameters were assigned their default values, based on the existing literature (Barlow et al., 2015; Chang et al., 2006). The *.xlsx datafiles were initially transferred to Matlab tables for further processing. Then, the FCM.m script (MATLAB code) was executed to perform the clustering. The goal of the FCM.m script is to create the clusters of the coseismic landslides according to their severity and, therefore, to assign to the data records proper labels to be used for the classification. The initial number of clusters was chosen to be six. The names of the first five clusters derive from the combination of the three potential states (Low, Medium, High) of the parameters Planar Area and Area. The 6th cluster is related to extreme values of both Planar Area (PA) and Area (AR). Table 4 presents the cluster numbers and the corresponding names; the first part of each name corresponds to Planar Area and the second to Area.
The FCM.m script is presented in the form of natural language in Algorithm 1.
Algorithm 1. The FCM.m Matlab Script.
Part 1:
Step 1: Read and convert each *.xlsx file to a Matlab table.
Part 2 (for each table):
Step 1: Apply the FCM algorithm with the options described in Table 3 and the clusters described in Table 4; calculate the centres of the clusters and the membership degrees of each observation.
Step 2: Classify each instance in the cluster for which its membership degree is the highest.
Part 3:
Step 1: Pass the clusters to the original data.
Step 2: Plot the original data with their clusters (Figure 2).
The total number of landslides assigned to each cluster for 2003 and 2015 is presented in Table 5.
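The clustering routine can be sketched in Python as follows (a minimal re-implementation for illustration, not the original FCM.m MATLAB script; the hyperparameter names mirror those discussed above, and numpy is assumed to be available):

```python
import numpy as np

def fcm(X, c, m=1.2, max_iter=100, min_improvement=1e-5, seed=0):
    """Minimal Fuzzy C-Means: returns cluster centres and the
    membership matrix U, whose rows sum to 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1
    prev_obj = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # centre update: weighted means of the data points
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # avoid division by zero
        obj = float((Um * d ** 2).sum())         # objective J_m
        # membership update, then row-normalization
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
        if prev_obj - obj < min_improvement:     # minImprovement stop rule
            break
        prev_obj = obj
    return centres, U
```

A hard label per landslide is then obtained, as in Part 2 / Step 2 of Algorithm 1, by taking the cluster with the highest membership degree (`U.argmax(axis=1)`).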

A novel post-label assignment FCM clustering algorithm
One of the main contributions of this research paper is the development of a novel variation of the FCM algorithm (introduced by our research team), capable of overcoming specific inherent limitations of the classical approach.
As can be seen in Figure 2(a, b) in the previous section, the clusters derived from the application of the FCM algorithm had a specific limitation. Some instances could be characterized as 'weak' members of a specific cluster (e.g. point (1981, 4.2·10⁴) in Figure 2(a) or point (2744, 3.3678·10⁴) in Figure 2(b)). More specifically, some instances were assigned almost the same degree of membership for two clusters, or their degree of membership to their first cluster was low. Such instances can be characterized as weak members of their closest cluster. Taking this observation into consideration, the need to create new clusters proved imperative. The new clusters were created between the older ones, aiming to reassign labels to instances which did not seem to be typical cases of any specific cluster in the first place. The algorithm was based on the following rule: landslides that were weak members of a specific cluster, with a degree of membership less than a certain threshold, were reassigned via a new algorithmic approach. To perform this, weights of importance were assigned to each one of the independent factors (Fan et al., 2002; Tian et al., 2017).
Equation (3) implements a Fuzzy Conjunction operation between fuzzy sets, using a function f which assigns the weight w_i to the membership degree μ_i (Iliadis & Papaleonidas, 2016), where i = 1, 2, … , k, k is the number of instances examined and n is the number of the employed features (Joutsijoki & Juhola, 2011). In this function, a denotes a membership degree and w the corresponding weight, obtained empirically following a trial-and-error process (Iliadis & Papaleonidas, 2016). The Hamacher T-Norm, T(a, b) = ab / (a + b − ab), was used to perform the Fuzzy Conjunction (Iliadis & Papaleonidas, 2016; Scordilis et al., 1985).
The Matlab script T-Norm_Clustering.m was developed from scratch in order to apply the FCM algorithm with the Hamacher fuzzy Conjunction function. The script is presented in the form of natural language in Algorithm 2.
Algorithm 2. The T-Norm_Clustering.m Matlab Script.
Step 1: The threshold is defined at 0.7.
Step 2: The weights are defined: the weight of the highest membership degree is 4, that of the second highest is 2, and it is 0 for the rest of the clusters.
Step 3: If the membership degree for a cluster is higher than 0.7, then the incident belongs to this cluster.
Step 4: Otherwise, if the highest membership degree is under 0.7, the incident belongs to a new cluster, with a membership degree calculated by Equations (3)-(5).
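The reassignment rule of Algorithm 2 can be sketched in Python. This is an illustrative sketch only: the Hamacher product T-norm is standard, but the exact weighted-membership function f of Equations (3)-(5) is not reproduced in the text, so f(μ, w) = μ^(1/w) is used here as a stand-in assumption (marked in the code):

```python
def hamacher(a, b):
    """Hamacher product T-norm: T(a, b) = ab / (a + b - ab)."""
    return 0.0 if a == b == 0 else (a * b) / (a + b - a * b)

def reassign(memberships, threshold=0.7, weights=(4, 2)):
    """If the top membership degree reaches the threshold, keep the
    original cluster; otherwise place the instance in an intermediate
    cluster between its two strongest clusters (e.g. 4.5 between
    clusters 4 and 5). ASSUMPTION: the weighted-membership function
    is taken as f(mu, w) = mu ** (1 / w)."""
    order = sorted(range(len(memberships)), key=lambda j: -memberships[j])
    best, second = order[0], order[1]
    if memberships[best] >= threshold:
        return best + 1.0, memberships[best]     # keep original label
    a = memberships[best] ** (1.0 / weights[0])  # weighted degrees (assumed f)
    b = memberships[second] ** (1.0 / weights[1])
    degree = hamacher(a, b)                      # conjunction degree
    lo, hi = sorted((best + 1, second + 1))
    return lo + (hi - lo) / 2.0, degree          # intermediate cluster label
```

For a strong member (e.g. top membership 0.9) the original cluster is kept; for a weak member with its mass split between two clusters, an intermediate label such as 1.5 or 4.5 is produced.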
The T-Norm_Clustering.m script led to the development of new clusters between the already existing ones, for both years 2003 and 2015. In total, four new clusters were developed for 2003 and five for 2015. The label of each new cluster was based on those located on either side of it; for example, a cluster lying between clusters 4 and 5 was named 4.5. The results obtained by the application of both the FCM and the Hamacher Conjunction functions are presented in Figure 3(a, b). It is clear that the T-Norm approach optimized the classification of each instance according to its severity. Table 6 presents the exact number of instances corresponding to each cluster.

Classification methodology
Following the clustering of the coseismic landslides using the parameters Planar Area and Area, a classification attempt was made in order to predict the proper class with only three factors. The three independent variables used for the classification are the Average Slope, the Average Aspect and the Geological Form of the landslides. The Geological Forms are labelled as indicated in Tables 1 and 2. The cluster label that was assigned by the Fuzzy C-Means algorithm with the Hamacher aggregation is the target variable.
A total of 25 classification algorithms were evaluated. The one with the highest performance was the Ensemble Adaptive Boosting algorithm (Ensemble AdaBoost). The authors observed that the AdaBoost classification algorithm converged well for clusters 1, 1.5 and 2, and extremely well for all other classes. Thus, if an observation was classified in one of the first three clusters, another algorithm was applied in order to classify it more accurately. The algorithm that offered the best performance was the Ensemble Subspace k-Nearest Neighbours (Ensemble Subspace k-NN). The hybrid classification model is presented in Figure 4. The next sections discuss the Ensemble AdaBoost and the Ensemble Subspace k-NN.
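The two-stage routing described above can be sketched as follows (a minimal sketch of the control flow only; the two callables stand in for trained Ensemble AdaBoost and Ensemble Subspace k-NN classifiers, which are not reproduced here):

```python
def hybrid_predict(x, adaboost_predict, subspace_knn_predict,
                   coarse_clusters=(1.0, 1.5, 2.0)):
    """Two-stage COLAFOS-style routing: AdaBoost classifies first;
    if its prediction falls in the clusters it handles less reliably
    (1, 1.5 and 2), the instance is re-classified by the Ensemble
    Subspace k-NN model."""
    label = adaboost_predict(x)
    if label in coarse_clusters:
        return subspace_knn_predict(x)   # second opinion
    return label                         # trust AdaBoost elsewhere
```

The design choice is that the second model is only consulted where the first one is known to be weaker, so the extra cost is paid only for a subset of instances.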

Ensemble adaptive boosting (ensemble AdaBoost)
Ensemble learning, in general, produces predictions by combining a number of individual models. By combining them, the ensemble tends to be more flexible (less bias) and less data-sensitive (less variance) (Hastie et al., 2009). Ensemble Adaptive Boosting performs especially well with decision trees. It is the most popular boosting technique; it was formulated for classification (and later extended to regression) by Freund and Schapire (1997). The key idea of the model is learning from previous mistakes, i.e. misclassified data points: AdaBoost increases the weight of the data points that were misclassified (Ying et al., 2013). Its algorithmic steps are described below:
1. Initialize the weights of the data points. If the training set has n data points, then each point's initial weight is w(i) = 1/n, where i = 1, 2, 3, …, n.
2. Train a weak learner, h_t(x), on a random sample weighted by the (initially uniform) distribution w(i).
3. Calculate the weighted error rate (e) of the weak learner. The weighted error rate is the proportion of wrong predictions, with each wrong prediction counted according to the weight of its data point: the higher the weight, the more the corresponding error contributes to e.
4. Calculate this learner's weight in the ensemble: TW = lr · ½ ln((1 − e)/e), where TW is the weight of this learner, lr the learning rate and e the weighted error rate.
The higher the weighted error rate of a learner, the less decision power that learner is given during the later voting; conversely, the lower the weighted error rate, the higher the decision power.

Update weights of wrongly classified points:
If the model got this data point correct, the weight stays the same.
If the model got a data point wrong, its new weight is w* = w · e^TW, where w* is the new weight of the data point, w its old weight and TW the weight of this learner. The higher the weight of the learner (i.e. the more accurately it performs), the more boost (importance) the misclassified data will get. The weights of all data points are normalized after the misclassified ones have been updated.

Repeat Step 2 until the preset number of learners is reached. Finally, AdaBoost makes its prediction through a weighted vote: each learner's prediction is multiplied by that learner's weight and the results are summed. Obviously, a learner with a higher weight has a more significant influence on the final decision (Freund & Schapire, 1997).
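The steps above can be sketched in Python (the paper's experiments used MATLAB; this minimal illustration uses decision stumps as weak learners and binary labels in {−1, +1}, both simplifying assumptions):

```python
import math

def best_stump(X, y, w):
    """Exhaustively pick the threshold stump with the lowest weighted error."""
    best = (None, 1.0)
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                pred = lambda x, j=j, thr=thr, sign=sign: sign if x[j] >= thr else -sign
                err = sum(wi for xi, yi, wi in zip(X, y, w) if pred(xi) != yi)
                if err < best[1]:
                    best = (pred, err)
    return best

def adaboost_train(X, y, n_learners=10, lr=1.0):
    """Minimal AdaBoost sketch. X: list of feature vectors, y: labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                               # step 1: uniform initial weights
    ensemble = []                                   # (learner_weight, stump) pairs
    for _ in range(n_learners):
        stump, err = best_stump(X, y, w)            # step 2: weak learner on weighted data
        err = max(err, 1e-10)                       # step 3: weighted error (avoid log(0))
        if err >= 0.5:
            break
        tw = lr * 0.5 * math.log((1 - err) / err)   # step 4: this learner's weight
        for i in range(n):                          # step 5: boost misclassified points
            if stump(X[i]) != y[i]:
                w[i] *= math.exp(tw)
        s = sum(w)
        w = [wi / s for wi in w]                    # normalize the weights
        ensemble.append((tw, stump))
    return ensemble

def adaboost_predict(ensemble, x):
    """Final prediction: weighted vote of the weak learners."""
    score = sum(tw * h(x) for tw, h in ensemble)
    return 1 if score >= 0 else -1
```

The exhaustive stump search keeps the sketch self-contained; any weak learner that accepts sample weights could replace it.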

Ensemble subspace k-nearest neighbours (ensemble subspace k-NN)
The k-nearest neighbours (k-NN) is a lazy and non-parametric learning algorithm (Hechenbichler & Schliep, 2004). The k-NN classifier is a traditional classification rule, which assigns to a test sample the label of the majority of its k nearest neighbours from the training set (Marjanović et al., 2011). Given a set X of n points and a distance function, k-NN search finds the k closest points to a query point (Bora et al., 2014). Domeniconi and Yan (2004) introduced a weighted voting method, called the distance-weighted (DW) k-nearest neighbour rule (Wk-NN). In this approach, the closer a neighbour is, the greater the weight that corresponds to it. If d_i denotes the distance of the i-th nearest neighbour from the query x′, with d_1 the smallest and d_k the largest of the k distances, the weight of the i-th nearest neighbour is defined as w_i = (d_k − d_i)/(d_k − d_1) when d_k ≠ d_1, and w_i = 1 otherwise. The classification result of the query is then determined by majority weighted voting: the query is assigned to the class whose neighbours accumulate the largest sum of weights. Based on this weighting, the closer a neighbour is to an observation, the more it weighs compared to a farther one. This means that the neighbour located farthest from all others corresponds to a weight of zero, while the one closest to the observation corresponds to a weight of one; all neighbours in between get values between 0 and 1, depending on their distance. Despite the fact that the k-NN algorithm is often used as a benchmark, there are several cases in which it outperforms more complex algorithms; it remains an effective and flexible algorithm. Nevertheless, k-NN and its variations, such as the weighted k-NN mentioned above, suffer from an inability to process high-dimensional data. In order to achieve better performance with k-NN (regardless of the data), ensemble approaches were introduced. The most common and most consistent ensemble method for k-NN is the Ensemble Subspace k-NN. Related modelling efforts using this algorithm can be found in Bora et al. (2014) and Khalilia and Popescu (2012).
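The distance-weighted vote can be sketched as follows (a minimal Python illustration, not the authors' code; Euclidean distance is an assumption):

```python
def dw_knn_predict(X_train, y_train, x_query, k=3):
    """Distance-weighted k-NN: the nearest neighbour gets weight 1,
    the farthest of the k neighbours gets weight 0, the rest are linear in between."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    # k nearest neighbours of the query, as (distance, label) pairs
    neigh = sorted((dist(x, x_query), y) for x, y in zip(X_train, y_train))[:k]
    d1, dk = neigh[0][0], neigh[-1][0]
    votes = {}
    for d, label in neigh:
        w = 1.0 if dk == d1 else (dk - d) / (dk - d1)   # linear distance weight
        votes[label] = votes.get(label, 0.0) + w
    # majority weighted voting: class with the largest accumulated weight wins
    return max(votes, key=votes.get)
```

Note that a single very close neighbour can outvote two distant ones of another class, which is exactly the behaviour the plain majority rule lacks.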
As already mentioned, ensemble models make predictions by combining a number of individual models; the resulting ensemble tends to be more flexible (less bias) and less data-sensitive (less variance) (Hastie et al., 2009).
The basic random subspace algorithm uses the following hyperparameters:
- m: the number of dimensions (variables) to sample for each learner.
- d: the number of dimensions in the data, i.e. the number of columns (predictors) in the data matrix X.
- n: the number of learners in the ensemble (set using the NLearn input).
The basic random subspace algorithm performs the following steps: (1) Choose without replacement a random set of m predictors from the d possible values.
(2) Train a weak learner using just the m chosen predictors.
(3) Repeat steps 1 and 2 until there are n weak learners.
(4) Predict by averaging the score predictions of the weak learners and assigning the category with the highest average score.
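Steps (1)-(4) can be sketched in Python (an illustrative sketch, not MATLAB's implementation; `train_fn` and `predict_fn` are hypothetical hooks for an arbitrary base learner):

```python
import random

def subspace_ensemble(X, y, m, n_learners, train_fn, seed=0):
    """Random subspace: each learner is trained on m of the d predictors."""
    rng = random.Random(seed)
    d = len(X[0])
    models = []
    for _ in range(n_learners):
        feats = rng.sample(range(d), m)             # step 1: m predictors, no replacement
        Xs = [[x[j] for j in feats] for x in X]     # project the data onto the subspace
        models.append((feats, train_fn(Xs, y)))     # step 2: train a weak learner
    return models                                   # step 3: repeat until n learners

def subspace_predict(models, x, predict_fn):
    """Step 4: combine the learners' votes and take the top class."""
    votes = {}
    for feats, model in models:
        label = predict_fn(model, [x[j] for j in feats])
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

For the COLAFOS setting, the base learner would be the (weighted) k-NN classifier of the previous section; here the vote counting stands in for MATLAB's score averaging.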

Learning using 10-fold cross validation and grid search
The hyperparameters' tuning was performed by combining 10-fold Cross Validation with Grid Search. As indicated in the existing literature (Hsu et al., 2003), this combination is one of the most widely used strategies in machine learning (Bergstra & Bengio, 2012). The procedure is as follows: (a) The range and the step (unit of transition to the next value) of each hyperparameter's values are initialized. (b) All combinations of hyperparameter values are considered; the classification accuracy of each one (after performing 10-fold cross validation) is calculated, and the combination with the greatest accuracy is considered optimal. (c) Step (a) is then repeated with a new, narrower range and step defined around the optimal hyperparameters, and step (b) is performed again in order to reach a higher classification accuracy.
The process is repeated, shrinking the range and the step of the hyperparameters even further, until the classification accuracy is optimal.

Evaluation of the classification model
Validation of the developed model was initially performed by considering the Accuracy index. However, in various cases Accuracy can lead to wrong conclusions; for this reason, additional classification indices were used in order to estimate the efficiency of the model. Furthermore, the problem we are dealing with is a multi-class classification problem; thus, the 'One Versus All' strategy (Kritikos et al., 2015; Papathanassiou et al., 2005) was used. Table 7 presents the name, abbreviation and calculation of the five performance indices used.
The performance indices for the 'One Versus All' approach are defined as follows. Precision answers the question: what proportion of the predicted positives for each class is truly positive? Sensitivity (also known as Recall) answers a different question: what proportion of the actual positives for each class is correctly classified? It is important when the cost of false negatives is high. Specificity (SPC) for each class is the ratio of the number of cases correctly classified as negative to the number of cases that are actually negative. Accuracy (ACC) is the most intuitive performance measure and expresses the proportion of all cases that are correctly classified.
Accuracy is a reliable measure only for symmetric datasets, where the numbers of false positives and false negatives are almost the same, i.e. where the class distribution is balanced. The F1 score for each class can be interpreted as the harmonic mean (weighted average) of the Precision and Recall for that class. The F1 score is needed when we want to seek a balance between Precision and Recall. According to the existing literature, the F1 score is a better metric when there are imbalanced classes, as in the present case. Using it as a metric, we can be sure that if its value is high, both the precision and the recall of the classifier indicate good results. In our case, the F1 score is the final overall criterion of good performance (Sammut & Webb, 2017; Tharwat, 2020).
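The one-versus-all indices for a single class reduce to simple counts over the confusion matrix, as the following sketch shows (illustrative Python, matching the definitions above):

```python
def one_vs_all_metrics(y_true, y_pred, cls):
    """One-versus-all performance indices for class `cls`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0            # predicted positives that are true
    sns = tp / (tp + fn) if tp + fn else 0.0             # sensitivity / recall
    spc = tn / (tn + fp) if tn + fp else 0.0             # specificity
    acc = (tp + tn) / len(y_true)                        # all correct classifications
    f1 = 2 * prec * sns / (prec + sns) if prec + sns else 0.0  # harmonic mean
    return {'PREC': prec, 'SNS': sns, 'SPC': spc, 'ACC': acc, 'F1': f1}
```

Running this once per cluster label reproduces the per-class rows of a table such as Table 7.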

Experimental results
The experiments were performed with the use of Matlab R2019a software. As it has already been mentioned, the tuning of the hyperparameters was applied with the combination of 10-fold cross validation and the grid search. The initial range for the AdaBoost hyperparameters is presented in Table 8.
The optimal values of the AdaBoost hyperparameters are presented in Table 9. The Ensemble AdaBoost achieved an accuracy equal to 64% and 68% for 2003 and 2015, respectively. The Confusion Matrix for each year is presented in Tables 10 and 11, and the corresponding ROC Curves in Figures 5 and 6.
Tables 12 and 13 present the values of the performance indices for the above optimal algorithm. SNS stands for Sensitivity, SPC for Specificity, ACC for Accuracy, PREC for Precision and F1 for F1 Score.
Tables 12 and 13 clearly show that the algorithm accurately classifies all the landslides that belong to Cl 2.5 and above. It is extremely significant that the algorithm can indicate with high accuracy the coseismic landslides that have the potential to be the most disastrous. Thus, this model can be used by public agencies and stakeholders, in order to develop appropriate mitigation plans increasing the resilience of the community. In this way, the existing resources can be used more efficiently.
Regarding the prediction of the first three clusters, there is clearly a need for a more robust machine learning classification algorithm. The Ensemble Subspace k-NN was used in order to classify all of the landslides that were predicted in the first three clusters. Once again, the tuning of the hyperparameters was performed using 10-fold cross validation combined with grid search. The initial range of the hyperparameters' values for the Ensemble Subspace k-NN algorithm is presented in Table 14.
The optimal hyperparameters' values for the Ensemble Subspace k-NN were determined via trial and error and they are presented in Table 15.
The Ensemble Subspace k-NN algorithm achieved an overall accuracy equal to 70.07% and 72.88% for 2003 and 2015, respectively.
Regardless of the overall accuracy, it is really important that the accuracy is extremely high (very close to 100%) for all areas that exhibit frequent and significant seismic landslides. The overall performance drops due to the limited number of available instances for the three clusters Cl 1, Cl 1.5 and Cl 2 (the minority ones), which exhibit rare and insignificant landslides.
In practical terms, the COLAFOS manages to achieve the target which is the determination of all the risky clusters along with their respective severity index. The Confusion Matrix for each year is presented in Tables 16 and 17 and the corresponding ROC Curves in Figures 7 and 8.
Tables 18 and 19 present the values of the performance indices for the above optimal algorithm. SNS stands for Sensitivity, SPC for Specificity, ACC for Accuracy, PREC for Precision and F1 for F1 Score.
After applying the Ensemble k-NN algorithm, a significant increase in the value of overall accuracy and in the values of all respective indices was observed. From the above tables, we can clearly see that the COLAFOS model offers extremely high accuracy for all risky or extremely risky areas that offer frequent and major landslides, which was the motivation for this research.
On the other hand, it performs very well for cluster 1, whereas it has a moderate performance for clusters 1.5 and 2. This is due to the fact that these clusters are minority ones, as they are characterized by a small number of seismic landslides, with a limited number of respective cases. Consequently, the need to work on them is not imperative.
Overall, this research has managed to forecast the coseismic landslide severity level of the really dangerous areas by considering only three independent variables, which is a real achievement. Moreover, we should emphasize that the COLAFOS model is the first in the literature that forecasts the severity of landslide risk for specific geographical areas. All indices, especially the accuracy and the F1 score, provide the ground truth for a very flexible model that can predict the most severe landslides and can also handle quite satisfactorily the ones that are less dangerous. It is worth mentioning that the results for 2003 and 2015 are similar, which means that the algorithm is consistent; the results for 2015 are somewhat better, due to the use of contemporary equipment for feature extraction.

Discussion and conclusions
As already described in Section 2.1, there are several attempts to predict the severity of landslides, both for the island of Lefkada and for other parts of the world. These efforts have achieved an overall accuracy as high as 90% (in rare cases). Although the present effort does not achieve such a high overall accuracy, it is very important to emphasize that it achieves success rates of 85% and 92% for 2003 and 2015, respectively, for classes 2.5 and above. The goal, after all, is the proper allocation of financial and governmental resources during an earthquake and the landslides it triggers. The introduced COLAFOS model can predict, with much higher accuracy than the models existing in the literature, the areas where the landslide severity is most likely to be extreme. In addition, the fact that only three independent variables are used, in contrast to the existing literature (see Section 1.1) where a significantly larger number of variables is considered, makes the introduced approach less complicated, much faster, less biased and more trustworthy. This research addresses a highly significant topic of huge contemporary importance: coseismic landslides are one of the most crucial hazards in earthquake-prone countries. The area of concern is of essential importance to urban and lifeline planning and, consequently, to the functioning of societies.
In particular, the classification of coseismic landslides according to their severity is a really interesting, and challenging task. This paper introduces an innovative version of the FCM clustering algorithm and additionally it proposes the novel COLAFOS (Clustering and Classification hybrid model). COLAFOS is based on a new version of the Fuzzy C-Means in order to achieve initial labelling of the available records, and two well-known Machine Learning algorithms namely: Ensemble AdaBoost plus Ensemble Subspace k-NN.
The application of Fuzzy C-Means employing the Hamacher Fuzzy Conjunction operator (T-Norm) managed to label the areas under study to 10 clusters for year 2003 and to 11 clusters for 2015. Thereafter, Ensemble AdaBoost and Ensemble Subspace k-NN, Machine Learning models were developed, using only three independent parameters namely: Average Aspect, Average Slope and Geological Form.
At this point, it should be clarified that the novelty of the COLAFOS model is not based on the analysis of the phenomenon itself, but on the fact that the developed hybrid model considers only the Average Slope, the Average Aspect and the Geological Forms, which are low cost to obtain and thus affordable to acquire and to feed the model with timely data through the years. It should be emphasized that there is no respective model in the literature with comparable accuracy values for the most risky areas.
The overall accuracy for 2003 and 2015 is more than satisfactory (especially for the risky instances) and it proves that the model can be used in real-life applications. It is very important that the COLAFOS performs perfectly for all the cases that have frequent and severe seismic landslides (based on historical data). There is some level of limitation for areas that offer rare and minor incidents and thus they are considered by the model as minority classes.
More specifically, the efficiency of the model is also perceivable from Tables 18 and 19, where the indices Accuracy, Sensitivity, Specificity, Precision and F1-score range at the highest levels, especially for clusters Cl 2.5 and above. However, as already mentioned, there are some limitations for clusters Cl 1, Cl 1.5 and Cl 2, which represent areas of limited interest, as they exhibit rare and insignificant seismic landslide phenomena. Overall, COLAFOS is very effective and has the potential for very good practical application.
In conclusion, the hybrid model described herein managed to correctly classify coseismic landslides according to their severity for the island of Lefkada, Greece (an area of high seismic risk). Being able to predict the severity of an upcoming landslide after an earthquake can be extremely beneficial for the effective treatment of disastrous consequences. The development of such a model could assist risk managers, public agencies and stakeholders, or even governments, to design pre-emptive actions and policies in order to confront potential impacts. It can also facilitate the development of appropriate mitigation plans, increasing the resilience of the respective communities.
Future work will focus on predicting the exact time frame and the area that a landslide will probably occur after an earthquake, or after a respective natural disaster.

Disclosure statement
No potential conflict of interest was reported by the authors.

Notes on contributors
Anastasios Panagiotis Psathas holds a BSc in Mathematics from Aristotle University of Thessaloniki, Greece, a M.B.A. from the Department of Economics Sciences, Democritus University of Thrace, Greece and a M.Sc. in Applied Mathematics from Department of Civil Engineering, Democritus University of Thrace, Greece. He is currently a PhD Candidate in Department of Civil Engineering-Lab of Mathematics and Informatics (ISCE), Democritus University of Thrace, Greece and a researcher in Risk & Resilience Assessment Center "RiskAC". His academic interest focuses on Security of Critical Infrastructures and Cybersecurity using AI Techniques. Articles regarding his research activity were published in international journals and conference proceedings. He has been a member of organizing committees and organizing, publicity and publication co-chair in conferences, as well as an experienced reviewer in both international Conferences and Journals.
Dr. Papaleonidas Antonios is a Computer Science engineer specialized in the scientific fields of Artificial Intelligence, Real Time monitoring systems and Data Analysis. He holds a PhD in Hybrid Computational Intelligence methodologies, and a M.Sc. in Sustainable Management of the Environment and Natural Resources by using Hybrid Intelligent Systems, both from the Democritus University of Thrace. He also holds a Certificate of Pedagogical and Teaching Competence from the Greek School of Pedagogical and Technological Education. He participates in several funded research projects, he is a member of the organizing, publicity and publication chairs of a large number of conferences, while at the same time he is an experienced reviewer and member of the Program Committee in dozens of conferences and journals. He has more than 20 publications in international conferences and journals, and he has been a Doctoral Track Chair of the AIAI 2022 international conference. Since 2019 he is permanent Computer Science Teaching Personnel at the Civil Engineering Department of the Democritus University of Thrace.
Professor Lazaros Iliadis is Head of the Civil Engineering Department, in the School of Engineering, Lab of Mathematics and Informatics of the Democritus University of Thrace, Greece. He holds a BSc in Mathematics from the Aristotle University of Thessaloniki Greece, MSc in Computer Science from the University of Wales Swansea, United Kingdom and PhD in Expert -Knowledge Based Systems, from the Aristotle University of Thessaloniki Greece. Professor Lazaros Iliadis is author of two Academic Books, he has published 88 research papers in International Scientific Journals and 114 research papers in Proceedings of International Scientific Conferences. Moreover, he has run numerous Competitive Research Projects.
Dr. George Papathanassiou is a geologist with an expertise in engineering geology and geohazards. In particular, his research work dealt with soil liquefaction, landslide hazard assessment, engineering and earthquake geology. Articles regarding his research activity were published in international journals and proceedings. He has participated in multiple post-earthquake field surveys and the rapid mapping of geohazard events in Greece and Italy. From July 2018 to September 2021, he worked as an Assistant Professor at the Department of Civil Engineering at the Democritus University of Thrace. From September 2021 till today, he is working as an Assistant Professor at the Department of Geology at the Aristotle University of Thessaloniki.
Dr Sotiris Valkaniotis is a geologist with an expertise in active tectonics, landslides and remote sensing application in geohazards. He is currently a postdoctoral researcher in the Department of Civil Engineering, Democritus University of Thrace, Greece (Risk & Resilience Assessment Center "RiskAC") and also an External Research collaborator with Geodynamic Institute, National Observatory of Athens, Greece and a Senior Research Analyst of University of Philippines Resilience Institute, Quezon City, Philippines. His research work is focused on seismotectonics, study of earthquake and surface deformation, mapping and evaluating landslides and their hazard, high resolution topography using photogrammetry, satellite imagery and UAS, satellite mapping of ground deformation using SAR and optical satellite imagery, Quaternary geology and paleoseismology/archaeoseismology. Has participated in multiple post-earthquake field surveys and the rapid mapping of geohazard events globally. His published research work includes 41 publications in peer-reviewed journals and more than 120 short papers, abstracts and posters.