Performance of multi-level association rule mining for the relationship between causal factor patterns and flash flood magnitudes in a humid area

Abstract Multi-level association rule mining, which integrates K-means clustering with the Apriori algorithm, is proposed to investigate the causal factor patterns of flash floods. The procedure consists of three steps: first, the association between causal factors and flash flood occurrence is analysed; second, to identify the contribution of soil moisture (SM) to flash flood hazards, the association between risk indicators and SM and the linkage between SM and risk magnitude are examined; finally, considering total 24-h rainfall and SM patterns together, the association rules for risk magnitude are extracted. The method has been tested in a humid area of southern China. The results show that: (1) flash flood hazards are especially active after prolonged and periodic intense rainfall, and because the soil is then saturated, flash floods are easily triggered even by slight rainfall; (2) severe flash floods are easily triggered by extreme rainfall, and SM is the critical indicator of 5-year and 20-year floods; and (3) owing to the differences in steady infiltration rate and the instability of the soil type, conservation of water and soil is an indispensable and co-ordinate part of flood control. Results are expected to be applicable for decision-making in flood control and flood prediction.


Introduction
With the rapid production of surface runoff, flash floods are considered one of the most dangerous natural hazards, and their occurrence is associated with various causes, including hydrological factors, geological factors, topographic characteristics and climatic influences (Gan et al. 2018). However, intense rainfall is often regarded as the cause of flash flood events over a small-scale watershed (Modrick and Georgakakos 2015; Youssef et al. 2016; Mahmood et al. 2017). Several studies have shown that early warning based on rainfall thresholds is essential to predict flash flood occurrence and provides helpful guidance for reducing flood damage (Gourley et al. 2014; Li et al. 2018). Flash floods also depend on the degree of soil saturation, while changes in precipitation and evapotranspiration modify the soil moisture (SM) content (Grillakis et al. 2016; Massari et al. 2018). Meanwhile, the spatial distribution of SM provides a crucial link with hydrological and ecological processes (Manfreda 2008; Santi et al. 2013; Renzullo et al. 2014). Manfreda and Fiorentino (2008) developed a model to define the relative saturation, in which the rainfall forcing was interpreted as an additive noise in the soil water balance, and concluded that the flood probability can be derived mathematically as a function of daily rainfall and antecedent SM. Some regional applications were successfully conducted based on the proposed model (Gioia et al. 2014). Based on hydrological models, the initial SM conditions can explain the differences among the risk magnitudes of flash floods (Meng et al. 2017; Zhai et al. 2018). Thus, identifying the causal factors of flash flood events is of primary importance to explain the causal mechanisms of flash floods and improve their forecasting (Saharia et al. 2017).
The wide application of data-driven methods opens another opportunity to revisit hydrological challenges based on massive amounts of data, especially for complex and nonlinear relationships (Shen et al. 2018). For instance, satellite SM observations have provided global or regional scale information, such as Soil Moisture Active Passive (Entekhabi et al. 2010), Global Land Data Assimilation System (Rodell et al. 2004), and the North American Land Data Assimilation System. They have shown promising performance, with moderate correlation between satellite SM observations and ground data (Abhishek et al. 2012; Matgen et al. 2012; Alvarez-Garreton et al. 2015), and are useful in analysing the relationship between hydrological and environmental causes and flash flood events. Intelligently transforming vast data into useful information and knowledge is critical to the causal analysis of flash floods.
Association rule mining (ARM) is one of the most instrumental technologies in data mining. It aims to extract correlations, frequent patterns, or associations among item sets in a database (Pears et al. 2013; Son et al. 2018). If hazard risk is considered as a complex system, the transfer mechanism between causal factors and risk magnitudes can be highlighted by 'support' and 'confidence', which are two parameters in the procedure of association rules. For example, data mining has been used to determine the cause-and-effect relationships between hydrological parameters and landslide movement (Ma et al. 2017). Studies have shown that ARM has successfully determined the level of association in various fields and can provide evidence for probable cause-and-effect relations (Qodmanan et al. 2011; Nahar et al. 2013; Guo et al. 2014; Peng et al. 2018).
In this paper, we use multi-level data mining methods to analyse the cause and effect of flash floods. Three novel aspects are discussed: (1) the major causal patterns of flash flood events that indicate the relationship between risk factors and SM; (2) the association between SM and the magnitude of flash floods; and (3) the causal mechanisms of flash flood hazards in the study area.

Study area
We have performed a case study in a humid area, the Upper Hanjiang River in Guangdong Province, southern China, covering an area of 3239 km², with 105 rivers going through this delta. The two main watersheds are Wuhua River and Qinjiang River (Figure 1). According to 23 observational stations in the study area, the average annual rainfall is 1542 mm, and the summer months (from May to August) are the wettest, accounting for approximately 79.9% of annual rainfall. Upper Hanjiang River is prone to flash flood hazards. For instance, due to Typhoon Utor (2013), the average rainfall recorded during the period of August 14-18 was 229.6 mm, and the peak stage was recorded as 19.78 mm by Jianshan Rainfall Station at 4 PM on August 17. This rainfall event caused an extreme flash flood hazard, which affected a population of 3.68 × 10⁵ and destroyed 28 bridges, directly resulting in an economic loss of US $87.7 million (Meng and Wang 2013). Consequently, flash flood risk management in Upper Hanjiang River faces a severe challenge.

Data collection
Causal factors of flash flood hazards vary with the specific hydrological and environmental characteristics of each location. After a careful review of the literature and a field survey in the study area (Zhang et al. 2010; Zheng et al. 2016), total 24-h rainfall (TR), elevation (DE), slope degree (SD), vegetation cover (VC), soil type (ST), drainage density (DD), and SM were selected as risk factors of flash flood hazards in Upper Hanjiang River. In this research, 31 flash flood events in the study area over the period of 2011, 2012, and 2013 have been considered. TR (mm) is calculated from hourly rainfall data for the 1 day prior to the day of flash flood occurrence at 23 observational stations of Upper Hanjiang River, which were obtained from the Hydrology Bureau of Guangdong Province (http://www.gd3f.gov.cn/). DE (m) reflects the vertical distance between the terrain surface and sea level; low-DE areas are prone to flooding because runoff from rainfall naturally flows from highlands to lowlands. The data are obtained from SRTDEM 90M, which were acquired from the United States Geological Survey (USGS; http://ned.usgs.gov/). SD (°) reflects the change in altitude per unit distance. Mountain areas have steep slopes that prevent water from collecting when it rains, while flatlands have a gentle SD and are easily threatened by flooding. The data were extracted or calculated from the digital elevation model (DEM) by using geographic information system (GIS) techniques.
VC (%) reflects the underlying surface. Rainfall is intercepted by the plant canopy, and greater VC leads to more interception. Moreover, vegetation provides protection when runoff occurs. The vegetation data in this study are obtained from the USGS (http://ned.usgs.gov/) for 2013.
ST influences the infiltration capability (Costache and Zaharia 2017; Costache 2019), and a large grain size means a strong infiltration capability. The soil data in this study are obtained from the Hydrology Bureau of Guangdong Province (http://www.gd3f.gov.cn/). DD (m/km²) reflects the amount of catchment in the study area, and a high DD relates to a high risk of flash floods. Based on the DEM, the spatial distribution of DD is generated by using the 'LineDensity' tool of GIS. SM (m³/m³) controls the partitioning of rainfall into runoff and infiltration and therefore has an important effect on the runoff behaviour of a catchment (Scipal et al. 2005). It is thus very important to obtain the SM status to determine the magnitude of flash flood events, which is particularly useful for mechanism analysis of flash flood hazards (Koster et al. 2010). Satellite passive microwave radiometers are sensitive to SM, enabling global estimation at daily frequency and approximately 9-25 km spatial resolution. Although bias still exists between satellite SM observations and ground data, satellite SM observations capture the trend of SM changes well (Owe et al. 2008; Jackson et al. 2010; Mai et al. 2016). The SM data in this study are obtained from the European Space Agency (ESA). The ESA Climate Change Initiative (CCI) SM data product consists of three surface SM data sets: the 'ACTIVE Product' and the 'PASSIVE Product' were created by using scatterometer and radiometer SM products, respectively; the 'COMBINED Product' is a blended product based on both scatterometer and radiometer products (Liu et al. 2011, 2012; Wagner et al. 2012). The data set spans over 39 years, covering the period from November 1978 to June 2018 (http://www.esa-cci.org/). The annual average SM of the ESA CCI SM product in China in 2013 is shown in Figure 2, in which the SM of southern China is in the interval of 0.2-0.6 m³/m³.
The spatial distributions of SM in typical flash flood events in the study area are shown in Figure 3.

Apriori algorithm
ARM is an important data mining method (Agrawal et al. 1993). It determines the relationships between items or features that occur together in a database. An association rule is written as 'A ⇒ B'. Let I be the set of all items and D a set of transactions in the database, where each transaction is a non-empty subset of I, A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule states that transactions in D that contain A tend to also contain B. A is the premise of the association rule, and B is its consequence. Whether the rule 'A ⇒ B' holds in D depends on the following three measures: support, confidence, and lift. Support is the proportion of transactions in D that contain both A and B and is taken to be the probability P(A ∪ B), where P(A ∪ B) denotes the probability that a transaction contains every item of A and of B. It is defined as follows:

$$\mathrm{support}(A \Rightarrow B) = P(A \cup B)$$

Confidence is the proportion of transactions containing A that also contain B and is taken to be the conditional probability P(B | A), which is defined as follows:

$$\mathrm{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{P(A \cup B)}{P(A)}$$

Given a set of transactions D, the task of ARM is to generate all rules that reach a user-specified minimum support (minSup) and minimum confidence (minConf).
A lift larger than 1 indicates that A has a positive effect on the occurrence of B. It is defined as follows:

$$\mathrm{lift}(A \Rightarrow B) = \frac{P(A \cup B)}{P(A)\,P(B)}$$

Because it reduces the search space, the Apriori algorithm is the most popular algorithm for ARM (Agrawal and Srikant 1994). It uses an iterative approach known as a level-wise search, where k-item sets are used to explore (k + 1)-item sets. The Apriori algorithm involves the following two steps:
Step 1. Detect large item sets whose support is larger than minSup.
Step 2. By using the large item sets obtained in step 1, generate strong association rules whose confidence is larger than minConf.
The rules generated by the Apriori algorithm must satisfy the following criteria:

$$\mathrm{support}(A \Rightarrow B) \ge \mathrm{minSup}, \qquad \mathrm{confidence}(A \Rightarrow B) \ge \mathrm{minConf}$$

Strong association rules will be marked in the database for the decision makers, while redundant data will be deleted.
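The two Apriori steps can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study; the function names `apriori` and `rules` and the five toy transactions (with item labels in the style of TR1, SM2, F1) are hypothetical and chosen only to show the level-wise search and the support, confidence, and lift calculations.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Step 1: return {itemset: support} for every itemset with support >= min_sup."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in tsets) / n

    current = {frozenset([i]) for t in tsets for i in t}  # candidate 1-item sets
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_sup}
        frequent.update(level)
        # Join: merge frequent k-item sets into (k + 1)-item candidates.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k + 1}
        # Prune: keep a candidate only if all of its k-item subsets are frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k))}
        k += 1
    return frequent


def rules(frequent, min_conf):
    """Step 2: return (lhs, rhs, support, confidence, lift) for strong rules."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = sup / frequent[lhs]      # P(A and B) / P(A)
                if conf >= min_conf:
                    lift = conf / frequent[rhs]  # P(A and B) / (P(A) P(B))
                    out.append((lhs, rhs, sup, conf, lift))
    return out


# Hypothetical toy transactions, not data from the study.
demo = [
    {"TR1", "SM2", "F1"},
    {"TR1", "SM2", "F1"},
    {"TR1", "SM3", "F1"},
    {"TR4", "SM3", "F4"},
    {"TR1", "SM2", "F2"},
]
for lhs, rhs, sup, conf, lift in rules(apriori(demo, 0.4), 0.7):
    print(sorted(lhs), "=>", sorted(rhs), round(sup, 2), round(conf, 2), round(lift, 2))
```

In this toy database the rare items TR4 and F4 fall below the support threshold and never reach the rule-generation step, which is the situation that motivates mining a sampled sub-database for severe events.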

K-means clustering
Clustering analysis is a process by which large datasets can be separated into several groups, such that the data in the same group are more similar to each other than to those in other groups (Wu et al. 2016; Marco et al. 2017). Continuous indicators must be discretized before association rule algorithms can be applied, and K-means clustering analysis is used here to meet this requirement. The K-means algorithm aims at grouping observations according to a distance measure in the K-dimensional space of x. It proceeds as follows:
Step 1. Suppose the observations are to be grouped into k clusters; select the initial centroid x_g for each cluster (g = 1, 2, …, k).
Step 2. Calculate the distance d(x_g, x_i) between the current data vector x_i and the centroids x_g as follows:

$$d(x_g, x_i) = \sqrt{\sum_{j} \left(x_{gj} - x_{ij}\right)^2} \qquad (1)$$

where j indexes the components of the data vectors. For quantitative variables, such as TR, DE, and SD, the Euclidean distance of Eq. (1) is used; for categorical variables, such as ST, we maintained the original clusters. K-means clustering determines a set of k centroids so as to minimize the distances d(x_g, x_i).
Step 3. If x_i is already a member of the group whose mean is closest, then repeat step 2 for x_{i+1}; otherwise, reassign x_i to the group whose mean is closest and return to step 1 with the updated centroids.
Step 4. The process is iterated until the centroids are stable; that is, until a full cycle through all observations produces no reassignments.
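As an illustration of how K-means can discretize one continuous indicator into four ordered classes, the sketch below uses the batch (Lloyd) variant of the algorithm on scalar values rather than the one-observation-at-a-time reassignment described in the steps above. The function name `kmeans_1d`, the quantile-based initial centroids, and the rainfall values are assumptions made for this example and do not come from the study.

```python
def kmeans_1d(values, k, max_iter=100):
    """Batch K-means on scalar values; returns (centroids, labels), ordered low to high."""
    xs = sorted(values)
    # Assumption: initial centroids are taken at evenly spaced quantiles of the data.
    centroids = [xs[(2 * g + 1) * len(xs) // (2 * k)] for g in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment: each value joins the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda g: abs(v - centroids[g]))
                      for v in values]
        if new_labels == labels:
            break  # a full pass produced no reassignments
        labels = new_labels
        # Update: each centroid becomes the mean of its current members.
        for g in range(k):
            members = [v for v, lab in zip(values, labels) if lab == g]
            if members:
                centroids[g] = sum(members) / len(members)
    # Relabel so that class 0 is the lowest cluster and class k - 1 the highest.
    order = sorted(range(k), key=lambda g: centroids[g])
    rank = {g: r for r, g in enumerate(order)}
    return [centroids[g] for g in order], [rank[lab] for lab in labels]


# Hypothetical 24-h rainfall totals in mm, not data from the study.
rain = [5, 8, 12, 48, 55, 60, 110, 120, 240, 260]
centres, classes = kmeans_1d(rain, 4)
items = [f"TR{c + 1}" for c in classes]  # categorical items usable in ARM
print(centres)
print(items)
```

The resulting labels (TR1 to TR4 here) are the kind of categorical items that an association rule algorithm can take as input.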
The ARM procedure proposed in this study comprises three stages: K-means clustering classification, multi-level ARM, and deeper association rules discovery. The framework is shown in Figure 4.

Historical flash flood inventory
Based on the flash flood inventory covering the period from 2008 to 2013 in the study area, 31 flash flood records are summarized in Table 1, which shows the antecedent rainfall processes of the flash flood events. Recurrence intervals are used to indicate the magnitude of a flash flood: F1 stands for a recurrence interval of less than 5 years, F2 for one between 5 and 20 years, F3 for one between 20 and 50 years, and F4 for one between 50 and 100 years. Both daily rainfall and 7-day antecedent rainfall have been analysed to show the roles of daily rainfall and accumulated rainfall in flash flood hazards (Figure 5). It can be seen that a rainstorm on the recorded day of the flash flood event might induce 5-year flash floods (Figure 5(a)), and that pronounced antecedent rainfall might also induce 5-year flash floods, as shown in Figure 5 … Figure 5(f); and the flash flood events with a return period of more than 50 years are shown in Figure 5(g) and (h). The findings suggest that intense rainfall is one of the triggering factors of flash flood hazards in Upper Hanjiang River. Flash floods may be triggered by a moderate rainfall event that follows several days of rain, or by accumulated rainfall of more than 100 mm. There is also evidence that SM is another crucial factor in flash flood hazards because it influences runoff. After continuous rainfall, the soil becomes saturated, and overland flow increases when further rain falls. In this final period, flash floods are easily triggered even by slight rainfall.
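The magnitude classes and the 7-day antecedent rainfall described above can be expressed as two small helper functions. This is only a sketch: the function names, the treatment of class boundaries (the lower bound is taken as inclusive, and any return period of 50 years or more is mapped to F4), and the daily rainfall series are assumptions for illustration and are not taken from Table 1.

```python
def magnitude_class(return_period_years):
    """Map a return period in years to the magnitude classes F1 to F4."""
    # Assumption: lower bounds are inclusive; 50 years and above is treated as F4.
    if return_period_years < 5:
        return "F1"
    if return_period_years < 20:
        return "F2"
    if return_period_years < 50:
        return "F3"
    return "F4"


def antecedent_rainfall(daily_mm, event_index, window=7):
    """Sum of daily rainfall over the `window` days before the event day."""
    start = max(0, event_index - window)
    return sum(daily_mm[start:event_index])


# Hypothetical daily rainfall series in mm; the event occurs on the last day.
daily = [0, 12, 30, 5, 0, 40, 25, 18, 60]
event = len(daily) - 1
print(daily[event], antecedent_rainfall(daily, event), magnitude_class(10))
```

In this made-up series the event-day rainfall is moderate while the accumulated rainfall over the previous seven days exceeds 100 mm, which corresponds to the second triggering pattern described above.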

Qualitative statement by K-means clustering
For the discrete indicators, such as ST, we defined four classified groups: sand clay, clay loam, sandy loam, and heavy clay. The continuous indicators, namely TR, DE, SD, VC, DD, and SM, were each clustered into four groups by using the K-means clustering method, so that values within a group are more similar to each other than to values in other groups. In this study, risk magnitude was indicated by probability of occurrence and is therefore a continuous attribute as well. Because of the low probability of large-magnitude flash floods in our study area, K-means clustering analysis would be likely to neglect these low-probability events. To address this issue, we took the flooding return period as the classification standard for risk magnitude. The qualitative values of the datasets were classified, and the results are shown in Table 2. Spatial distributions of classified DE, SD, VC, and DD are shown in Figure 6. By using standard GIS tools, topographic information was processed to delineate and subdivide the watershed into 258 sub-catchments, of which the largest covers an area of 36.05 km² and the smallest 0.03 km². The zoning map of the watershed in the study area is shown in Figure 6(f). In order to present the spatial distributions of SM in flash flood hazards, four historical events with different return periods are selected (shown in Figure 2 … flood records, 4028 datasets have been generated. Using the ARM model proposed in Figure 4, the rule extractions for SM and risk magnitude of flash floods have been processed by multi-level ARM. To detect the severe flash flood events, which are the minority samples in the flash flood database of Upper Hanjiang River, deeper association rules have been generated, and the results are shown below. As listed in Table 3, 'lhs' means the left-hand side and 'rhs' the right-hand side, which represent the causes and consequences of the association rules, respectively.
Setting minSup = 0.2 and minConf = 0.8, 19 rules have been generated to display the association between risk indicators and risk magnitude of flash flood hazards. In the transaction set, 'Rule 1' holds with a support of 0.41, which is the largest support among the listed rules, a confidence of 0.89, a lift of 1.12, which is more than 1, and a count of 1803. The support of 0.41 means that the proportion of the transactions that contain both 'TR1' and 'F1' is 0.41, and the confidence of 0.89 means that 89% of the transactions that contain 'TR1' also contain 'F1'. In other words, a strong association is evident between TR1 and F1. Some multiple rules have been mined as well, such as 'Rule 4'. Deeper rule extraction on samples with F2, F3, and F4 has been conducted to detect the severe flash flood events in the database. Thus, a new database of 903 datasets is constructed. Setting minSup = 0.25 and minConf = 0.9, 9 additional association rules are obtained. Rules from ID 20 to ID 28 show that severe flash flood events and a high risk magnitude are associated with increasing 24-h rainfall, such as 'Rule 24'. It can be concluded that TR is one of the critical attributes in Upper Hanjiang River flash floods; considerable TR would probably result in a large flood magnitude.
Furthermore, the mining results for multiple environmental factors are also shown in Table 3 and display the combined effect of these factors in flash flood hazards. For multiple-dimension mining, some rules with special combinations in 'lhs' are generated, such as 'Rule 5', with a support of 0.28 and a confidence of 0.95. Evidently, the combination of high levels of TR and SM, a sandy clay area, and vegetation cover of less than 30% is associated with 100-year flash floods.
Based on the results described in Figure 5 and Table 3, SM is one of the crucial factors in flash flood hazards, especially in the case of light rain. Multi-level ARM has been conducted to examine the effect of SM on flash floods. The association rules between risk indicators and SM are listed in Table 4. Setting minSup = 0.12 and minConf = 0.4, 20 rules have been generated, all with 'rhs' consequences of {SM = SM2} or {SM = SM3}, which indicates a high rate of SM2 and SM3 in the flash flood events database of the humid area. As shown in the soil category in Figure 6(c), the proportion of sand clay is 80%, making it the dominant ST in the study area. Rules 1, 5, and 6 show the strong associations between ST1 and SM2 and the combined effects of ST and other risk indicators. We have constructed a new database by sampling SM1 and SM4 to mine the rare rules apart from SM2 and SM3. Setting minSup = 0.15 and minConf = 0.5, rules with ID from 21 to 36 have been obtained. 'Rule 21', with a support of 0.24 and a confidence of 0.72, indicates that the proportion of the transactions containing 'ST1' and 'SM1' is 24%, and transaction containing both 'VC1' and 'SM1' accounts for 72% of the … Table 4 is that, different from the results in Table 3, a rare association exists between TR and 'SM4', because SM is dependent on prolonged rainfall, which agrees well with Figure 3. To find the relationship between SM and flood magnitude, minSup = 0.2 and minConf = 0.8 have been set in the entire database, and minSup = 0.1 and minConf = 0.3 in the sampled database with F2, F3, and F4. Table 5 shows the association rules between SM and flood magnitude. It can be concluded that if SM is in the range of SM1 and SM2, then the flash flood return period would be less than 20 years, as shown in Rules 1, 2, and 4. When SM is in the range of SM3, severe flash floods with a 100-year return period would likely occur, as shown in Rule 5.
From Tables 3 to 5, it can be noted that TR, ST, and SM are the critical risk indicators of flash flood hazards in Upper Hanjiang River. Considering the high rate of sandy clay in the study area, we have built a database containing TR, SM, and flood magnitude. Rule extractions are indicated in Table 6. Setting minSup = 0.1 and minConf = 0.9, 'Rule 1:
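As a consistency check on how the three measures relate, the values reported for Rule 1 (support 0.41, confidence 0.89, lift 1.12) can be combined using the standard definitions of confidence and lift. The two marginal frequencies obtained below are implied by these rounded values; they are not figures reported in the tables and should be read as approximate.

```latex
\begin{align*}
\mathrm{confidence} &= \frac{P(\mathrm{TR1} \cup \mathrm{F1})}{P(\mathrm{TR1})}
  &&\Rightarrow& P(\mathrm{TR1}) &\approx \frac{0.41}{0.89} \approx 0.46,\\
\mathrm{lift} &= \frac{\mathrm{confidence}}{P(\mathrm{F1})}
  &&\Rightarrow& P(\mathrm{F1}) &\approx \frac{0.89}{1.12} \approx 0.79.
\end{align*}
```

Under these definitions, a lift only slightly above 1 together with a high confidence is what one would expect when the consequence itself is very common in the database, as F1 appears to be here.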
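The effect of mining a sampled sub-database can be shown with a small sketch: a rule that concerns a rare class has low support in the full database but a much higher support once the database is restricted to the transactions of interest. The helper names `rule_metrics` and `sample_by_items`, the item labels, and the toy counts are hypothetical and do not reproduce the databases of this study.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs in a list of item sets."""
    n = len(transactions)
    n_lhs = sum(lhs <= t for t in transactions)
    n_both = sum((lhs | rhs) <= t for t in transactions)
    support = n_both / n if n else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


def sample_by_items(transactions, keep_items):
    """Keep only the transactions that contain at least one of keep_items."""
    return [t for t in transactions if t & keep_items]


# Hypothetical database dominated by low-magnitude events.
full_db = ([{"TR1", "F1"}] * 17) + ([{"TR2", "F2"}] * 2) + ([{"TR4", "F4"}] * 3)
severe_db = sample_by_items(full_db, {"F2", "F3", "F4"})

print(rule_metrics(full_db, {"TR4"}, {"F4"}))    # low support in the full database
print(rule_metrics(severe_db, {"TR4"}, {"F4"}))  # higher support after sampling
```

Note that support and confidence computed in a sampled database are relative to that sub-database only, so they are not directly comparable with the values obtained from the full database.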

Discussion of flash flood mechanism
TR is one of the critical attributes in Upper Hanjiang River flash floods; a larger TR would probably result in a larger flood magnitude. The 100-year flash floods are more likely attributable to extreme rainfall. During rainy seasons, with prolonged and periodic intense rainfall, slopes become more unstable, and water levels in small watersheds rise rapidly. In addition, rainfall, as one of the most common and important causes of flash floods, may also reduce the mechanical strength of the slip surface in most areas. Meanwhile, other hydrological and environmental factors act in combination with rainfall and play a crucial role in our study area, which has increased the risk of flash floods.
It is possible that strong rainfall events happen in the mountains and flash floods occur downstream; thus, SM is considered another important factor that affects the occurrence of flash flood hazards. If SM is at a high level, relative soil saturation will be observed, and the flash flood return period would be less than 20 years; severe flash floods with a 100-year return period are more likely to be caused by sudden rainstorms, even if SM is in the range of SM3. The steady infiltration rate influences SM and the occurrence of overland flow. In the study area, the four STs have different steady infiltration rates, the greatest of which is that of sandy loam, followed by sandy clay, clay loam, and heavy clay. Consequently, improving the steady infiltration rate in the … Table 6 with the confidence of 1 indicates two causal factor patterns of the 20-year return period: one shows that extreme rainfall would induce flash flood hazards, and the other shows that even light rainfall would induce flash floods in an area with saturated SM. A rare association exists between TR and SM, because the SM data used in this study are from one day before the flash flood events. Meanwhile, strong associations have been observed between ST and SM. SM is dependent on prolonged rainfall and ST. Sand clay is the carbonate weathered soil in tropical and subtropical areas, and soil parent rocks have considerable influence on the formation of sand clay. Sand clay is easily corroded and weathered. Owing to the high proportion of sand clay, especially in the mountainous regions in the east, west and south (Figure 6(c)), much of the surface becomes a cover of loose solid matter that accumulates on the ground, which might lead to flash floods after rain. In addition, flash flood hazards are more likely to occur where slopes are steep.
The 100-year flash floods are not induced by a single factor; they are associated with combined factors, such as extreme rainfall, high levels of SM, a sandy clay area, and vegetation cover of less than 30%.
The ARM method provides a simplified description of the relationship between causal factor patterns and flash flood magnitudes, especially between rainfall, SM and runoff generation in a humid area. The case study in this paper demonstrates that not only sudden rainfall but also relatively saturated areas are responsible for flash flood events. The results are consistent with the probability distribution of runoff derived by Manfreda and Fiorentino (2008), in which runoff is described as a function of rainfall depth and the state of the basin. ARM is capable of coping with multiple datasets and of mining association rules with single and combined causal factors, which provides a feasible and effective method for analysing flash flood mechanisms.

Conclusions
This research has presented multi-level ARM to explore the cause-and-effect relationship between SM, hydrological and environmental indicators, and flood magnitude. Using a case study in Upper Hanjiang River based on a flash flood database covering the period from 2008 to 2013, four types of rule extractions have been detected in humid area, including association rules for risk indicators and flood magnitude, association rules for risk indicators and SM, association rules for SM and flood magnitude, and association rules for TR, SM and risk magnitude. Deeper ARM has been carried out in the sampled database to extract rules from small samples. Several distinct characteristics and notable patterns of flash floods in Upper Hanjiang River are described in this study. ARM exhibits a good performance in detecting knowledge of flash flood hazards by setting minSup and minConf, especially in the large database. It has been found that the flash flood hazards in Upper Hanjiang River are especially active after the prolonged and periodic intense rainfalls, and flash floods are easily triggered by even a slight rainfall due to the relatively saturated soil. This finding also highlights that severe floods in Upper Hanjiang River are easily triggered by extreme rainfalls, while SM is the critical indicator of 5-year and 20-year floods. Owing to the difference of steady infiltration rate, ST is one of the important parameters in flash flood scenarios in Upper Hanjiang River. Therefore, soil conservation is an indispensable and co-ordinate part of flood control.
This research has demonstrated the integration of K-means clustering and the Apriori algorithm for discussing the mechanism of flash floods, and in particular has explored the potential mechanism by which SM affects small-scale flash flood events. The results exemplify the influence of TR and SM and display the linkage between causal factor patterns and flash flood magnitudes. An advantage of this approach is that the association rules extracted by ARM relate the pattern combinations not only to flood occurrence but also to flood magnitude; additionally, the runoff mechanism in different climatic conditions has been discussed. However, this study is based on 31 flash flood events in Upper Hanjiang River. The database should be updated when more flood events recorded at high spatio-temporal resolution become available, and more potential influencing factors, such as TR, could be taken into account in further studies. The data mining approach to flash flood mechanisms has been applied in a humid area and is expected to be extended in future work to study areas with different climatic conditions or various underlying surfaces. This study is expected to provide scientific support for the rapid and reasonable diagnosis of flash flood mechanisms and to provide a basis for decision-making in the risk management of flash flood hazards.