Detection of areas prone to flood risk using state-of-the-art machine learning models

Abstract The present study evaluates the susceptibility to floods in the Buzău river basin, Romania, using the following 6 machine learning models: Support Vector Machine (SVM), J48 decision tree, Adaptive Neuro-Fuzzy Inference System (ANFIS), Random Forest (RF), Artificial Neural Network (ANN) and Alternating Decision Tree (ADT). In the first stage of the study, an inventory of the areas affected by floods was compiled for the study area, and 205 flood points were identified. Further, 12 flood predictors were selected for the final susceptibility mapping. The six models were trained using 70% of the total flood points, each associated with the values of the flood predictors. The highest accuracy (0.973) was obtained by the RF model, while J48 had the lowest performance (0.825). The six flood susceptibility maps were then produced by classifying the flood predictors' values into flood and non-flood pixels. High and very high flood susceptibility values cover between 17.71% (MLP) and 27.93% (ANFIS) of the study area. The validation of the results, performed using the ROC curve, shows that the most accurate flood susceptibility values are also those of the RF model (AUC = 0.996).


Introduction
Natural hazards cause extensive human and financial losses worldwide each year (Alam et al. 2020, 2021). In many parts of the globe, the acceleration of processes such as climate change, excessive surface sealing, and uncontrolled deforestation has led to a marked increase in the severity and number of natural hazards such as droughts and floods (Didovets et al. 2019; Feng et al. 2020; Kocaman et al. 2020; Peptenatu et al. 2020). Among these processes, global climate change is responsible for amplifying the intensity and frequency of torrential rains, which are the main factor generating floods (Shahid et al. 2016; Markus et al. 2018; Sepehri et al. 2020). Climate change accelerates hydrological cycles, alters the magnitude and timing of streamflow, and threatens the water resources and environmental sustainability of basins (Chen et al. 2017; Lu et al. 2019a, 2019b; Tian et al. 2020). As a result, there is a significant increase in flash-flood phenomena, which propagate very large volumes of water from the upper parts of river basins towards the lowlands, where water accumulation causes massive floods. According to Giang et al. (2020), the economic damage caused by floods worldwide will reach approximately US$300 billion by 2050. Because of their very rapid genesis, floods generated by flash floods cannot be predicted with a lead time longer than a few hours (Douinot et al. 2016; Ma et al. 2018). Therefore, it is necessary to identify the areas susceptible to flooding quickly and as accurately as possible. Flood risk assessment is an essential topic for the European Union countries, of which Romania is one (Haer et al. 2020). Thus, flood protection measures, which also include flood risk assessment, are provided for in the Floods Directive 2007/60/EC adopted by the European Parliament (Costache et al. 2020c).
Hydraulic modelling, performed with programs such as Mike Flood (Löwe et al. 2017; Li et al. 2018; Dat et al. 2019) or HEC-RAS (Leon and Goodell 2016; Thakur et al. 2017; Khalfallah and Saidi 2018), is time-consuming, applies only to certain sectors of the monitored river, and requires detailed Digital Elevation Model information. In contrast, with the development of GIS knowledge and several platforms (Zuo et al. 2015; Yang et al. 2020; Zhao et al. 2020; Hu et al. 2020b, 2020c), the determination of flood susceptibility through GIS techniques combined with machine learning and artificial intelligence is much faster, can cover an entire river basin down to the smallest streams, and relies on data that are often freely available. For this reason, researchers around the world have recently shown growing interest in applying machine learning (ML) based models to calculate flood susceptibility (Azareh et al. 2019; Arabameri et al. 2019; Costache 2019a, 2019b). The application of these modern models can yield final results with high precision. ML-based models also include specific procedures for validating the results and evaluating model performance (Li et al. 2019; Costache 2019c; Xue et al. 2020; Hu et al. 2020a; Ma et al. 2021). Besides supporting flood and flash-flood early warning systems, the accurate estimation of flood-exposed areas can also help in drawing up flood defence plans for river basins (Albano et al. 2017; Johann and Leismann 2017; Brillinger et al. 2020). The following models are among the most popular ML algorithms used to estimate flood susceptibility: Artificial Neural Network (ANN) (Costache et al. 2020b), Support Vector Machine (SVM) (Nguyen et al. 2017; Tehrany et al. 2019; Sahana et al. 2020), decision tree-based models (Khosravi et al. 2018; Costache 2019c), naïve Bayes (Ali et al. 2020; Tang et al. 2020), deep learning neural network (Costache et al. 2020a), and extreme learning machine. All the models applied in the above works achieved an accuracy higher than 80%, which demonstrates the high efficiency of ML models (Lee et al. 2017; Pham et al. 2020a). Some differences between the characteristics of the existing methods are also worth mentioning. For example, the Artificial Neural Network is a parametric classifier based on the hyperparameters used in the training stages, whereas the Support Vector Machine is a nonparametric classifier based on a hyperplane that separates the classes (Ren 2012). Decision tree models have clearer rules than the Artificial Neural Network and can be trained faster. Also, the training procedure of decision trees does not require the optimization of the geometry and internal network (Gharaei-Manesh et al. 2016).
In this context, the present study aims to identify the areas exposed to floods through a comparative analysis of the results provided by the following 6 machine learning-based models: SVM, J48 decision tree, Adaptive Neuro-Fuzzy Inference System (ANFIS), Random Forest (RF), ANN and Alternating Decision Tree (ADT). The study focuses on the Buzău river basin in Romania. Checking validation and uncertainty is one of the most critical steps in statistical analysis (Yang et al. 2015; Shi et al. 2017; Chao et al. 2018; Chen et al. 2020; Zhang et al. 2020). The ROC curve method and the following statistical metrics are employed to validate the results: accuracy, sensitivity, specificity, F-score, and Kappa index.

Study area
The study area is located in the south-eastern part of Romania and covers 5350 km². The altitudes of the study area span a wide range, between 1 m and 1925 m, and indicate a relief energy that favours the propagation of floods from the upper areas to those at low altitudes. The steep slopes, higher than 25°, in the upper part of the basin, combined with the flat surfaces in the lowlands, are another element that confirms the potential for flood propagation in the upper area and for flow accumulation in the lower part of the basin. From the geological point of view, deposits of the internal Cretaceous flysch can be encountered in the mountain region of the study area, while the hilly area is dominated by Miocene Sarmato-Pliocene deposits. The plain region is characterized by sedimentary rocks such as clays, gravels, and sands. The geological structure, combined with exogenous factors, determines the occurrence of several geomorphological phenomena, such as gully erosion and landslides, which are closely related to flash floods. Across the study area, the mean annual precipitation is around 750 mm/year, while the maximum rainfall in 24 hours, equal to 115.4 mm, occurred on 12 July 1969 at Lăcăuți meteorological station (Minea 2013). From the hydrological point of view, the most important event occurred in 1975 when, after a rainfall, the discharge of the Buzău river, the main collector in the hydrographic basin, reached 2100 m³/s at Măgura hydrometric station. Other important flood events on the main tributaries occurred in 2005, when the maximum discharge reached 56.2 m³/s on the Câlnău river and 54 m³/s on the Slănic river (Negru 2010). Land use, including forests (40.7%), arable land (30.9%), and built-up areas (4.6%), is another environmental variable that strongly influences the flooding potential across the study area.

Flood inventory
It is widely accepted that the accurate prediction of the areas that may be affected by a phenomenon in the future must be based on the characteristics of the factors that favoured the occurrence of that phenomenon in the past (Ali et al. 2020; Yariyan et al). Thus, in the present study, an inventory of the locations affected by floods in the period 1990-2020 was compiled, taking into account only those events that caused damage to socio-economic elements. A total of 205 flood locations were identified over the research zone (Figure 1). The archives of the General Inspectorate for Emergency Situations and of the National Administration of Romanian Waters were consulted to identify the flood locations. To increase the performance of the applied models, another data set of 205 non-flood locations was generated. The 2 data sets were divided into a training data set (70%) and a validation data set (30%). This ratio between the training and validation data sets is recommended by most researchers working on natural hazard susceptibility assessment (Lee et al. 2017; Kalantar et al. 2018; Ali et al. 2020; Pham et al. 2020a). Moreover, in a study on natural hazard susceptibility computation, Sahin et al. (2020) demonstrated that the results achieved with a 70:30 ratio between the training and validation data sets had a lower error than the results achieved with other ratios.
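The 70:30 partition of the two data sets can be sketched as follows. This is a minimal illustration that assumes each class is shuffled and cut separately (a stratified split) and that the cut index is rounded down; the point identifiers, the seed, and the function name `stratified_split` are hypothetical, and the exact partition sizes used in the study may differ from those produced here.

```python
import random

def stratified_split(flood, non_flood, train_fraction=0.7, seed=42):
    """Split each class separately so both subsets stay balanced."""
    rng = random.Random(seed)
    train, valid = [], []
    for label, points in ((1, flood), (0, non_flood)):
        shuffled = points[:]              # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))   # rounded down (assumption)
        train += [(p, label) for p in shuffled[:cut]]
        valid += [(p, label) for p in shuffled[cut:]]
    return train, valid

# Hypothetical identifiers standing in for the 205 flood and 205 non-flood locations.
flood_ids = [f"F{i}" for i in range(205)]
non_flood_ids = [f"N{i}" for i in range(205)]

train_set, valid_set = stratified_split(flood_ids, non_flood_ids)
print(len(train_set), len(valid_set))
```

Splitting per class keeps the flood and non-flood proportions identical in the training and validation samples, which matters when both are later used to compute accuracy-type metrics.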

Flood conditioning factors
While the flood locations are the dependent variable in estimating flood susceptibility, 12 flood predictors are used as explanatory variables, and the spatial distribution of the flood exposure values is derived from them. The following predictors were selected after a careful analysis of the literature (Nguyen et al. 2017; Tang et al. 2020; Costache et al. 2020c): slope, altitude, aspect, TPI, TWI, convergence index, plan curvature, hydrological soil groups, land use, distance from rivers, lithology and rainfall. The first 7 predictors, which are also morphometric indices, were derived from the Digital Elevation Model (DEM) extracted from the Shuttle Radar Topographic Mission (SRTM) 30 m. At present, no DEM with a resolution finer than 30 m is available for the perimeter covered by the study area. Also, the DEM extracted from the SRTM 30 m database has proved to be a high-quality solution in previous studies on the same research topic (Dahri and Abida 2017; Sahana et al. 2020). Therefore, all 7 morphometric factors have the same spatial resolution of 30 × 30 m. To include the other 5 flood predictors in the analysis, their associated GIS layers were directly derived as, or converted into, raster layers with the same spatial resolution of 30 × 30 m. The most accurate data sources currently available for the study area were used for all flood predictors. A short description of each flood predictor is given below. The slope gradient is an essential characteristic of the ground surface, which strongly contributes to surface runoff and flow accumulation. The slope factor was derived from the DEM, and its values range between 0° and 55.96° (Figure 2a).
The elevation is another critical factor in defining the exposure to floods because it highlights the differences between the high areas where surface runoff is formed and the low areas where it accumulates (Costache 2019a). Within the study area, the altitudes range from 1 m to 1925 m (Figure 2b). The aspect is another predictor with an important influence on flood occurrence (Figure 2c). According to Costache et al. (2020c), the eastern and south-eastern slopes cover 30% of the study area. The convergence index is a morphometric indicator that highlights the valley perimeters with negative values and the interfluvial areas with positive ones (Figure 2d). Land use, which reflects the response of the natural environment to human activities (Zhang et al. 2021), is one of the most important flood predictors since it influences the velocity of surface runoff through the different values of the Manning roughness coefficient. Arable lands and forests account for around 75% of the study area (Figure 2e). The maximum distance from rivers within the Buzău river catchment reaches 10648 m (Figure 2f). The closer an area is to the rivers, the more prone it is to floods. The Hydrological Soil Group (HSG) influences the velocity of water infiltration and, ultimately, the accumulation potential at the ground surface (Norbiato et al. 2008). All four HSGs can be found within the study area, and HSG B (55%) covers the most extensive surfaces (Figure 3a). Along with its influence on water infiltration, lithology has also contributed to shaping the river valley morphology. The Buzău river basin has 12 lithological categories (Figure 3b), among which flysch covers the highest percentage (25%). Plan curvature is used to subdivide the hillslopes into concave, convex, and planar regions, the latter having a plan curvature equal to 0. In our study area, the plan curvature values range from −4.032 to 4.48 (Figure 3c). Rainfall is the environmental factor that determines flood genesis.
Within the study area, the multiannual average rainfall ranges from 469 mm to 716 mm (Figure 3d). The Topographic Position Index (TPI) is a morphometric variable that highlights the difference in elevation between a specific cell and its neighbouring cells. In the present research area, the maximum TPI value is 153.8, while the minimum is −122.8 (Figure 3e). The Topographic Wetness Index (TWI) values indicate the areas of the ground surface where flow accumulation is favoured from a morphometric perspective. In the present study, the TWI values range from 0 to 19.89 (Figure 3f).
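To make the two DEM-derived indices described above more concrete, the sketch below computes TPI as the elevation of a cell minus the mean elevation of its eight neighbours, and TWI with the commonly used definition ln(a / tan β), where a is the specific catchment area and β the local slope. The 3 × 3 neighbourhood, the minimum-slope floor, and the toy elevation grid are assumptions made for illustration; the study itself does not state which window size or GIS tool was used.

```python
import math

def tpi(grid, row, col):
    """Elevation of an interior cell minus the mean elevation of its 8 neighbours."""
    neighbours = [
        grid[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
        if (r, c) != (row, col)
    ]
    return grid[row][col] - sum(neighbours) / len(neighbours)

def twi(specific_catchment_area, slope_deg, min_slope_deg=0.1):
    """ln(a / tan(beta)); the slope is floored so flat cells do not divide by zero."""
    beta = math.radians(max(slope_deg, min_slope_deg))
    return math.log(specific_catchment_area / math.tan(beta))

# Toy elevation grid (metres): the central cell lies below its surroundings.
dem = [
    [100.0, 102.0, 104.0],
    [ 98.0,  95.0, 101.0],
    [ 97.0,  96.0,  99.0],
]
valley_tpi = tpi(dem, 1, 1)          # negative value: valley-like position
gentle = twi(1000.0, slope_deg=2.0)  # gentle slope: higher wetness index
steep = twi(1000.0, slope_deg=20.0)  # steep slope: lower wetness index
print(valley_tpi, gentle, steep)
```

Negative TPI values mark cells lower than their surroundings, and TWI grows as the contributing area increases or the slope flattens, which is why both indices point towards places where water tends to accumulate.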

Support vector machine (SVM)
In flood susceptibility modelling, the SVM algorithm applies nonlinear transformations of the flood predictors into a higher-dimensional feature space (Yilmaz 2010; Ghorbanzadeh et al. 2019; Nguyen et al. 2019). Based on statistical learning theory, SVM is first trained with a training dataset. A first advantage of the SVM model is that it can reduce both the test error and the model complexity. In this regard, the SVM method tries to find an optimal hyperplane that separates the two classes (flood and non-flood locations) (Kalantar et al. 2018). In most situations, the separating boundary is a non-linear surface. In this case, the data set is classified with a kernel-based expression (Costache et al. 2020c) in which $\alpha_i$ and $\alpha_i^{*}$ represent the Lagrange multipliers ($\alpha_i \ge 0$, $\alpha_i^{*} \le C$), $K(x_i, x_j)$ represents the kernel function, and $b$ represents the offset of the hyperplane from the origin.
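As a sketch of the kind of expression referred to above, a standard kernel decision function consistent with the symbols defined there can be written as follows. Its exact form in Costache et al. (2020c) is not reproduced here, so both the expression and the radial basis function kernel shown with it (with an assumed width parameter $\gamma$) should be read as assumptions rather than as the authors' formula.

```latex
f(x) = \sum_{i=1}^{n} \left( \alpha_i - \alpha_i^{*} \right) K(x_i, x_j) + b,
\qquad 0 \le \alpha_i,\ \alpha_i^{*} \le C

K(x_i, x_j) = \exp\!\left( -\gamma \, \lVert x_i - x_j \rVert^{2} \right)
```

Under this reading, only the training points with non-zero multipliers (the support vectors) contribute to the sum, and $C$ bounds how strongly any single point can influence the boundary.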

J48 decision trees
The J48 decision tree algorithm, whose structure includes a root node, internal nodes, and leaf nodes, is used in the present article to create a binary classification of the flood predictors into flood and non-flood pixels (Tien Bui et al. 2012). The root node contains all the input data, the internal nodes apply the decision function, and the leaf nodes are associated with the output data (Pham et al. 2017b). The J48 training process is carried out in 2 main phases: i) construction of the classification tree; ii) pruning of the classification tree. In the first phase, the input variable with the highest gain ratio is determined at the root node. Based on the root node values, the training dataset is split to create sub-nodes. In the second phase, an individual gain ratio is generated for each sub-node. The classification of the flood predictors into flood or non-flood is then carried out taking into account the gain ratio associated with each sub-node (Pham et al. 2017b).
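The gain ratio used to choose a split can be illustrated with a short sketch. It assumes the usual C4.5-style definition (information gain divided by the split information) for a binary split of one numeric predictor at a threshold; the slope values, labels, and thresholds are hypothetical and serve only to show that a split separating the classes well scores higher.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0                      # a split that separates nothing is useless
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - w_left * entropy(left) - w_right * entropy(right)
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info

# Hypothetical slope values (degrees) and labels (1 = flood, 0 = non-flood).
slope = [1.0, 2.0, 3.0, 4.0, 20.0, 25.0, 30.0, 35.0]
flood = [1, 1, 1, 1, 0, 0, 0, 0]

perfect = gain_ratio(slope, flood, threshold=10.0)   # separates the two classes
weak = gain_ratio(slope, flood, threshold=1.5)       # isolates a single point
print(perfect, weak)
```

Dividing by the split information penalises splits that carve off very small subsets, which is the main reason the gain ratio is preferred over the raw information gain.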

Adaptive neuro-fuzzy inference system (ANFIS)
ANFIS is a hybrid algorithm proposed by Jang (1993) that combines the Artificial Neural Network and the Fuzzy Logic (FL) method. ANFIS is a data-driven algorithm characterized by a self-learning function that can be run without input conditions (Gao et al. 2019; Costache 2019a; Ghorbanzadeh et al. 2020). FL relies on a fuzzy inference system (FIS) to transform a specific set of inputs into an output (Polykretis et al. 2019). The fuzzy if-then rule is a way to describe a particular problem in linguistic terms. The if-then rules have the following form and belong to the Takagi and Sugeno type (Celikyilmaz and Turksen 2009):
Rule 1: if $x$ is $A_1$ and $y$ is $B_1$, then $f_1 = p_1 x + q_1 y + r_1$
Rule 2: if $x$ is $A_2$ and $y$ is $B_2$, then $f_2 = p_2 x + q_2 y + r_2$
where $A_1$, $A_2$, $B_1$ and $B_2$ are membership functions (MFs) for the inputs $x$ and $y$, and $p_i$, $q_i$ and $r_i$ ($i = 1, 2$) are consequent parameters (Costache 2019a; Moayedi et al. 2019a, 2019b). The ANFIS model structure contains six layers and was described in detail by Costache (2019a).
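How two Takagi-Sugeno rules of this form combine into one output can be sketched as below. The example assumes Gaussian membership functions and the usual weighted-average defuzzification (each rule's linear consequent weighted by its firing strength); the membership centres, widths, and consequent parameters are hypothetical values, not parameters fitted in the study.

```python
import math

def gaussian_mf(value, centre, sigma):
    """Gaussian membership degree in [0, 1]."""
    return math.exp(-0.5 * ((value - centre) / sigma) ** 2)

def sugeno_output(x, y, rules):
    """First-order Takagi-Sugeno inference: weighted average of linear consequents."""
    weights, outputs = [], []
    for rule in rules:
        # Firing strength: product of the memberships of x in A_i and y in B_i.
        w = gaussian_mf(x, *rule["A"]) * gaussian_mf(y, *rule["B"])
        p, q, r = rule["pqr"]
        weights.append(w)
        outputs.append(p * x + q * y + r)   # f_i = p_i x + q_i y + r_i
    return sum(w * f for w, f in zip(weights, outputs)) / sum(weights)

# Hypothetical rules: (centre, sigma) for A_i and B_i, and (p_i, q_i, r_i).
rules = [
    {"A": (0.0, 1.0), "B": (0.0, 1.0), "pqr": (0.0, 0.0, 0.1)},
    {"A": (1.0, 1.0), "B": (1.0, 1.0), "pqr": (0.0, 0.0, 0.9)},
]

mid = sugeno_output(0.5, 0.5, rules)   # both rules fire equally
low = sugeno_output(0.0, 0.0, rules)   # rule 1 dominates
print(mid, low)
```

In ANFIS, the membership parameters and the consequent parameters are the quantities adjusted during training; the inference step itself stays as simple as this weighted average.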

Random forest
Random Forest (RF) is a decision tree-based method widely used in natural hazards research for classification and prediction. The method was introduced by Breiman (2001) and integrates the random subspace model with bagging ensemble learning. The RF procedure is characterized by the following four stages: i) defining and resampling the training data several times; ii) selecting a random set of features for each resample; iii) assigning a decision tree to each resample and its random feature set; iv) aggregating the decision trees assigned to the resamples into a single final model.
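The four stages listed above can be sketched in a few lines. To keep the example short, each "tree" is simplified to a one-level decision stump and the aggregation is a majority vote; a real Random Forest grows deeper trees and draws a feature subset at every split. The training rows, the number of trees, and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Best single-threshold split over the allowed features (a one-level tree)."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            for low_label in (0, 1):
                pred = [low_label if row[f] <= t else 1 - low_label for row in rows]
                errors = sum(p != y for p, y in zip(pred, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, t, low_label)
    _, f, t, low_label = best
    return lambda row: low_label if row[f] <= t else 1 - low_label

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]               # i) bootstrap resample
        features = rng.sample(range(len(rows[0])), n_features)   # ii) random feature set
        forest.append(fit_stump([rows[i] for i in idx],          # iii) one tree per resample
                                [labels[i] for i in idx], features))
    return forest

def predict(forest, row):
    votes = Counter(tree(row) for tree in forest)                # iv) aggregate the trees
    return votes.most_common(1)[0][0]

# Hypothetical training rows: [slope in degrees, distance from river in metres].
rows = [[2, 50], [3, 80], [1, 30], [4, 60], [20, 900], [25, 1200], [30, 700], [22, 1500]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = flood, 0 = non-flood

forest = fit_forest(rows, labels)
print(predict(forest, [0.5, 10]), predict(forest, [40, 2000]))
```

Because every tree sees a different resample and a different feature subset, individual errors tend to cancel out in the vote, which is the intuition behind the robustness usually attributed to the method.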

Multilayer perceptron
The Multilayer Perceptron (MLP) is a neural network with a structure composed of three layers, one of which is an intermediate layer that is not directly connected to the input and output data. The input data are passed to the middle, or hidden, layer of the network through the units of the input layer (Pham et al. 2017a). The output layer then provides output signals in response to the information received from the intermediate layer (Dodangeh et al. 2020). In the present study, the number of neurons in the input layer is equal to the number of flood predictors, while the output layer contains two neurons associated with the flood and non-flood points. The number of hidden neurons is calculated with the following rule (Sheela and Deepa 2013; Costache and Bui 2019): $2 \times N_i + 1$, where $N_i$ is the number of flood predictors.
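Applied to the 12 flood predictors of this study, the rule above gives the network dimensions directly. The short sketch below only performs this arithmetic; the function name `hidden_neurons` is a hypothetical helper.

```python
def hidden_neurons(n_inputs):
    """Rule quoted in the text: 2 * N_i + 1 hidden neurons."""
    return 2 * n_inputs + 1

n_predictors = 12                 # number of flood predictors used in the study
layer_sizes = (n_predictors, hidden_neurons(n_predictors), 2)
print(layer_sizes)                # (12, 25, 2)
```

With 12 inputs, the rule yields 25 hidden neurons between the 12-neuron input layer and the 2-neuron output layer.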

Alternating decision tree
In the Alternating Decision Tree (ADT) model, the Boosting algorithm is used to improve the initial decision tree method. ADT contains a first layer of decision nodes, each of which imposes a predicate condition. The second layer of ADT contains multiple prediction nodes, each holding a single number (Hong et al. 2015). In the present case study, ADT is used to classify the flood predictors' values into flood and non-flood pixels. Thus, the information on the spatial distribution of flood and non-flood points and the flood predictors' values is used as input to the ADT model. Its functionality is described by rules in which $W$ indicates the sum of the values provided by the prediction nodes, $c_1$ is a precondition, $c_2$ is the base condition, and $a$ and $b$ are 2 real numbers. The methodological steps followed in the present research are highlighted in Figure 4.
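One common way to read an ADT is sketched below: every rule consists of a precondition $c_1$, a base condition $c_2$, and two real numbers $a$ and $b$; whenever the precondition holds, the rule adds $a$ if the base condition is true and $b$ otherwise, and the sign of the total decides the class. This is an assumption about the general ADT scheme rather than a reproduction of the relations used in the study, and the rules, thresholds, and field names are hypothetical.

```python
def adt_score(sample, root_value, rules):
    """Sum the root prediction and, for every rule whose precondition c1 holds,
    the value a (base condition c2 true) or b (base condition c2 false)."""
    score = root_value
    for precondition, condition, a, b in rules:
        if precondition(sample):
            score += a if condition(sample) else b
    return score

# Hypothetical rules: (c1, c2, a, b).
rules = [
    (lambda s: True, lambda s: s["dist_river_m"] < 200, 0.8, -0.6),
    (lambda s: s["dist_river_m"] < 200, lambda s: s["slope_deg"] < 5, 0.5, -0.3),
]

near_flat = {"dist_river_m": 100, "slope_deg": 2}
far_steep = {"dist_river_m": 3000, "slope_deg": 30}

score_near = adt_score(near_flat, 0.0, rules)   # positive score: flood class
score_far = adt_score(far_steep, 0.0, rules)    # negative score: non-flood class
print(score_near, score_far)
```

The magnitude of the summed score can also be read as a measure of confidence, which is convenient when the output has to be mapped as a continuous susceptibility value.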

Flood predictors selection using multicollinearity assessment
The multicollinearity among the flood predictors was assessed to prevent redundant information from entering the modelling. The Tolerance (TOL) values range between 0.248 for altitude and 0.935 for aspect. The other multicollinearity indicator, the Variance Inflation Factor (VIF), has a minimum of 1.070 for aspect and a maximum of 4.032 for altitude (Table 1). Therefore, it can be assumed that serious multicollinearity is not present among the flood predictors, and all 12 variables are taken into account in the modelling process.
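The two indicators are reciprocal (VIF = 1 / TOL), so the reported extremes can be checked against each other. The sketch below does this for the two values quoted above; the screening thresholds in the code (TOL below 0.1 or VIF above 10) are a widely used rule of thumb and are an assumption here, since the text does not state which limits were applied.

```python
# TOL values reported in the text for the two extreme predictors.
tolerance = {"altitude": 0.248, "aspect": 0.935}

# VIF is the reciprocal of TOL.
vif = {name: 1.0 / tol for name, tol in tolerance.items()}

# Rule-of-thumb screening limits (assumed, not taken from the study).
TOL_MIN, VIF_MAX = 0.1, 10.0
flagged = [name for name in tolerance
           if tolerance[name] < TOL_MIN or vif[name] > VIF_MAX]

for name in tolerance:
    print(f"{name}: TOL = {tolerance[name]:.3f}, VIF = {vif[name]:.3f}")
print("flagged predictors:", flagged)
```

The reciprocals reproduce the reported VIF values of about 4.032 and 1.070, and neither predictor is flagged under the assumed limits.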

Flood susceptibility models performance
Once the flood predictors had been established, their values were extracted at the flood and non-flood locations using GIS software. The records of flood and non-flood pixels from the training sample, together with the flood predictors' values, were then used to run the six machine learning models. The models' performance was evaluated using several statistical metrics presented in Table 2. The statistical metrics were calculated for both the training and validation samples. The accuracy values that exceed the threshold of 0.8 confirm that all the data mining models performed very well.
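The metrics named in this paper (accuracy, sensitivity, specificity, F-score, and Kappa index) all follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts (TN = 115, FP = 25, FN = 24, TP = 116) are one column of the training-data confusion-matrix values reported in this paper, used here purely as an illustration, and they give an accuracy equal to the lowest value quoted in the abstract (0.825).

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    # Agreement expected by chance, needed for Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
        "kappa": kappa,
    }

metrics = classification_metrics(tp=116, tn=115, fp=25, fn=24)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the training sample is balanced, the chance agreement is 0.5, so the Kappa index is simply twice the accuracy minus one for these counts.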

Flood susceptibility mapping and validation
The classification of each flood predictor's pixels into flood and non-flood categories allowed the computation of a flood susceptibility map for each machine learning model. The results provided by the SVM model show that 39.42% of the study area has a very low flood susceptibility, 17.8% a low flood susceptibility, 14.5% a medium flood susceptibility, and 28.28% a high and very high flood exposure degree (Figure 5a). With the J48 algorithm, 45.83% of the study area has a very low flood susceptibility, 15.65% a low flood susceptibility, and 12.87% a medium flood susceptibility, while over 26% of the research area has a high and very high flood susceptibility (Figure 5b). The application of the ANFIS model revealed that a very low flood susceptibility covers 44.66% of the Buzău catchment, a low flood susceptibility covers 15.57%, and a medium flood susceptibility covers 11.84%, while the high and very high flood susceptibility is spread over 27.93% of the research area (Figure 5c). The RF model gives the following results: 73.44% for very low flood susceptibility, 5.1% for low flood susceptibility, 3.55% for medium flood susceptibility, and 17.91% for very high flood susceptibility (Figure 5d). According to the MLP model, 73.36% of the Buzău catchment falls in the very low flood susceptibility class, 5.29% has a low flood susceptibility, 3.64% has a medium flood susceptibility, and 17.71% is characterized by a high and very high flood susceptibility (Figure 5e). The use of ADT shows that 61.81% of the research area has a very low flood exposure, 12.16% has a low flood susceptibility, and 7.73% has a medium flood susceptibility, while 18.3% has a high and very high exposure to floods (Figure 5f).
The use of the ROC curve for results validation shows that the highest AUC in terms of Success Rate was achieved by the RF model (0.996), followed by MLP (0.995), ADT (0.992), J48 (0.983), ANFIS (0.98), and SVM (0.958) (Figure 6a). The corresponding Prediction Rate curves are presented in Figure 6b.
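The AUC values reported above can be interpreted as the probability that a randomly chosen flood point receives a higher susceptibility score than a randomly chosen non-flood point. The sketch below computes the AUC directly from this pairwise definition, counting ties as one half; the scores and labels are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """AUC as the share of (flood, non-flood) pairs ranked correctly;
    tied pairs count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical susceptibility scores and observed classes (1 = flood, 0 = non-flood).
scores = [0.95, 0.80, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
print(auc)   # 0.9375
```

Here 15 of the 16 flood/non-flood pairs are ordered correctly, giving 0.9375. Computing the value on the training points corresponds to the Success Rate, and computing it on the validation points corresponds to the Prediction Rate.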

Discussions
Given the severe floods of recent years in Romania, assessing the susceptibility to these hazards in the most vulnerable regions of the country is an activity that cannot be omitted or postponed. Machine learning algorithms provide a very accurate assessment of flood susceptibility and, therefore, the use of state-of-the-art models has become indispensable (Azareh et al. 2019). Thus, for the Buzău river basin, a very accurate estimation of flood susceptibility was made through 6 machine learning models that also provided highly accurate results in previous studies on natural hazard susceptibility assessment (Zare et al. 2013; Hong et al. 2015; Kalantar et al. 2018; Polykretis et al. 2019; Chen et al. 2020). It should be noted that the maximum accuracy of the models applied in this study, obtained by RF (AUC = 0.996), exceeded the maximum accuracy of the models used by Costache et al. (2020c), who also estimated the flood susceptibility within the Buzău river basin. The maximum accuracy achieved by Costache et al. (2020c) was an AUC of 0.979 and was associated with the Support Vector Machine - Index of Entropy ensemble. Moreover, all the models applied in the present research obtained, in terms of Prediction Rate, a higher accuracy than the models used by Popa et al. (2019) in a study of the same area. In the present study, the regions with a very low potential for flooding are highlighted more strongly: they cover between 39% and 74% of the Buzău river basin, compared with the very low susceptibility areas obtained by Costache et al. (2020c), which cover between 19.35% and 28.62%, and by Popa et al. (2019), which cover 6%. Also, in the present study, the areas with a medium susceptibility are limited to a maximum of 14.5%, compared with the previous studies, in which they reached a maximum of 43% of the total. This shows that the extreme values of flood susceptibility are better highlighted in the present case study than in the studies by Popa et al. (2019) and Costache et al. (2020c). These differences between studies conducted on the same river basin can be explained by the fact that Popa et al. (2019) and Costache et al. (2020c) classified the continuous values of the flood predictors into several classes before introducing them into the machine learning models, while in the present study unclassified factors were used. In all these studies, a high flood exposure can be observed along the major river valleys, in the depression areas, and over the surfaces with a low slope angle. As demonstrated in previous works on natural hazard susceptibility assessment, the parameters of the models have a significant influence on the final results. For example, in the case of the Multilayer Perceptron Neural Network, the accuracy of the results is strongly affected by the back-propagation function, whose main purpose is to reduce the classification error as much as possible during the training phase (Saha et al. 2021). Moreover, for the SVM algorithm, the model complexity and precision are strongly influenced by the cost parameter (Oh et al. 2018), while the accuracy of the Random Forest model is largely determined by hyperparameters such as the number of decision trees (Lee et al. 2017).
Training data
TN  125  115  132  135  125  132
FP  15  25  8  5  15  8
FN  12  24  9  3  13  10
TP  128  116  131  137  127

Conclusions
Land use planning and flood risk management require a good knowledge of the areas prone to significant flood events. The Buzău river basin is a region of Romania that frequently faces flood hazards. In this situation, an accurate flood susceptibility map is a valuable tool for the authorities in charge of flood mitigation actions. In this study, six machine learning models were trained to obtain the most accurate flood susceptibility maps across the Buzău river basin. Twelve flood predictors, selected after the multicollinearity assessment, together with 205 flood and 205 non-flood locations, were used for training the models and validating the resulting maps. The most accurate model was Random Forest, which achieved an AUC of 0.996, while the lowest accuracy was associated with the J48 model, with an AUC of 0.978. These very high AUC values can be considered evidence that the 30 m DEM extracted from the SRTM database is a reliable source for deriving the morphometric indicators used as flood predictors. Therefore, we can conclude that the present results can serve as reliable tools for governmental decision-makers. The very high accuracy of the algorithms can also encourage other researchers to use them to estimate flood susceptibility in different research areas.