Analysis of secondary-factor combinations of landslides using improved association rule algorithms: a case study of Kitakyushu in Japan

Abstract Landslide analysis helps protect residents' safety and property from landslides. The predominant method is susceptibility assessment, which is cumbersome and time-consuming. The association rule algorithm (ARA) is proposed to mine the correlation between the factors and landslides simply and rapidly. The original ARA cannot reflect the scope of landslides, which is non-negligible for landslide analysis, and is thus improved to mine the frequent secondary-factor combinations (SFCs). Firstly, eight factors are selected using the out-of-bag error and the chi-squared (χ²) test. The accuracy of the factor selection is further verified by a landslide susceptibility assessment, which is predicted using 30% of the study grid data selected randomly as the training data. The improved ARA employs the area of historical landslides to mine the frequent SFCs, and the results are then verified by the frequency ratio and the χ² test. It is concluded that the frequent SFCs are (21, 41), (21, 74), (34, 41), (34, 74), (41, 74), (21, 41, 74), and (34, 41, 74), and that the areas with these SFCs need special protection. The present study provides a valuable reference for the primary prevention of landslides.


Introduction
The occurrence of geo-hazards leads to casualties, property damage, and environmental issues (Wang et al. 2019; Li et al. 2021). The prediction and prevention of geo-hazards is a focus of current research (Metternicht et al. 2005; Ma et al. 2019; Li et al. 2020). However, geo-hazards such as landslides are difficult to predict accurately even with current advanced technology, owing to complex natural and human factors, such as real-time rainfall and mining, and compound environmental elements, such as geological and climatic conditions. Analyzing and predicting landslides influenced by multiple factors is thus a difficult problem in scientific research.
Anthropogenic activities, such as deforestation, engineering construction, and improper land use, and natural events, such as heavy rainfall and earthquakes, result in slope instability and reshape the topography with complex dynamics (Confuorto et al. 2017; Cebulski et al. 2020; Gomes et al. 2020). Nevertheless, the importance of the influencing factors is difficult to determine. The common methods are field surveys and aerial photograph interpretation (Eker and Aydın 2021). Meanwhile, the machine learning algorithm (MLA) is an alternative method because of its low monetary and time costs (Sameen et al. 2020). It is broadly divided into supervised, unsupervised, and reinforcement learning (Leem and Kim 2020). In recent years, MLA has been widely used in biomedicine (Sung et al. 2020; Qin et al. 2021), software information technology (González et al. 2020; Singh and Singh 2020), and the ecological environment (Ge et al. 2020; Obsie et al. 2020). Currently, landslide susceptibility is assessed using MLA, such as the support vector machine (SVM; Saha and Saha 2020; Wang, Feng, et al. 2020), deep learning algorithms (Dao et al. 2020), artificial neural network ensembles (Bragagnolo et al. 2020; Fang et al. 2020), optimized machine learning methods, and optimized intelligence models (Chen, Chen, Peng, et al. 2021; Chen and Li 2020; Zhao and Chen 2020).
Hypothesis tests, such as the chi-squared (χ²) test, and statistical analysis methods are proposed to select the significant factors (Pourghasemi, Kornejady, et al. 2020; Sahin et al. 2020; Sameen et al. 2020; Wang, Kariminejad, et al. 2020). To date, these methods have been successfully utilized in many fields, namely text categorization (Yang 1997), credit risk assessment (Attigeri et al. 2017), and landslide susceptibility mapping (Sahin et al. 2020). The statistical analysis methods primarily include discriminant analysis (Dao et al. 2020), cluster analysis (Melchiorre et al. 2008), and correlation analysis (Wistuba et al. 2021). The statistical analysis methods commonly used for the selection of landslide-influencing factors are multicollinearity analysis, accuracy analysis in the random forest (RF) model (Sun et al. 2020), and recursive feature elimination (Sun et al. 2020). Relevant literature proposes various models for the susceptibility assessment of landslides using the selected factors (Pham et al. 2016; Huang and Zhao 2018). Although the assessment results are refined and accurate, the assessment process using conventional statistical methods is laborious, cumbersome, and time-consuming. These methods are thus inconvenient for susceptibility assessments that involve large amounts of data and lengthy processing.
An alternative approach needs to be developed due to the limitation of susceptibility assessment of landslides. Researchers tend to pay attention to the factor combination prone to landslides (Pourghasemi, Kariminejad, et al. 2020;Yao et al. 2019), which requires the methods to extract useful information from large amounts of data quickly and accurately. Data mining is an effective method to extract knowledge from complicated data (Ouyang et al. 2011;Witten et al. 2011;Sameen et al. 2020).
Association rule algorithm (ARA) is an effective data mining algorithm to analyze the landslide factors (Tsai et al. 2013).
ARA is a process of discovering associations among items or itemsets (Bagui et al. 2020). A classic algorithm for discovering frequent itemsets and association rules is the Apriori algorithm, which requires scanning the database multiple times (Agrawal and Srikant 2000; Xie et al. 2019). It is applied in many research fields, such as engineering applications (Guo et al. 2014; Singh et al. 2018) and data management (Cheng et al. 2015). The Frequent Pattern (FP)-Growth algorithm is another well-known algorithm that discovers frequent itemsets in a concise form (Pei and Yin 1970) and scans the dataset only twice without generating candidate itemsets (Bagui et al. 2020). Owing to its efficiency, it is widely applied to electric management (Wang and Cheng 2018) and network detection (Hidayanto et al. 2017). The Eclat algorithm uses a vertical database representation and avoids traversing the database repeatedly (Zaki 2000). It helps in understanding the associations between items and performs better on long patterns (Das et al. 2018). It is applied in various studies, namely transport management (Zheng and Wang 2014; Das et al. 2018) and energy consumption (Liu et al. 2020).
Mining the deep information of landslides, namely the frequent secondary-factor combinations (SFCs), using ARA is simpler and more rapid for the preliminary prevention of landslides, and it also further analyzes the association between causative factors and landslides. However, the original ARA is difficult to apply to landslide analysis because it mines the frequent itemsets by counting the number of landslide occurrences. It cannot reflect the scope of landslides, which is a non-negligible parameter for the analysis. The original ARA thus needs to be improved by incorporating the area of historical landslides before it can be applied to the present problem. Few studies in the current literature have improved the original ARA.
In the present study, the influencing factors of landslides are selected using the out-of-bag (OOB) error and the χ² test. The receiver operating characteristic (ROC) curve is used to verify the landslide susceptibility assessments obtained using the RF model, the deep belief network (DBN) model, and the SVM model, which further verifies the accuracy of the factor selection. The Apriori, FP-Growth, and Eclat algorithms are improved to mine the frequent SFCs. The association rules between the SFCs and landslides are then verified using the frequency ratio (FR) and the χ² test.

Methodological flow
The methodology consists of five main steps: (a) preparing the data of influencing factors and historical landslides; (b) selecting the influencing factors and determining their importance to establish a factor system; (c) evaluating the landslide susceptibility using various models based on the selected factors and verifying the assessment results to further prove the accuracy of the factor selection; (d) taking the area of historical landslides as a parameter to improve the original ARA and mining the frequent SFCs; and (e) verifying the frequent SFCs. The flowchart of the methodology is presented in Figure 1.

Selection of influencing factor
OOB error and χ² test
The OOB error is an index for feature selection (Arora and Kaur 2020; Wang, Kariminejad, et al. 2020). It can be used not only to obtain the significance of features but also to determine the optimal number of features. Another index for feature selection is the Gini importance, but it is difficult to determine the optimal number of features with that index. The OOB error is thus used in the present study to determine the importance of the influencing factors and the optimal number of factors.
Assume that the total number of OOB samples is O and that they are classified using the RF classifier. Because the true classes of the OOB samples are known, the number of classification errors X can be counted, and the OOB error is the ratio of X to O:

$$\text{OOB error} = \frac{X}{O}$$

In this study, the influencing factors are ranked according to their OOB scores obtained using the RF algorithm, and the least important factors are eliminated by recursive feature elimination. Based on the OOB errors of the different factor sets in the elimination process, the optimal factor set is selected as the factor system.
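The elimination procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: `rank_importance` and `oob_error` are hypothetical callables that, in practice, would wrap a random forest fit on the current factor set, and the factor names and numbers are toy values.

```python
def select_factors(factors, rank_importance, oob_error):
    """Drop the least important factor each round; keep the set with the lowest OOB error."""
    current = list(factors)
    best_set, best_error = list(current), oob_error(current)
    while len(current) > 1:
        scores = rank_importance(current)  # {factor: importance score}
        weakest = min(current, key=lambda f: scores[f])
        current.remove(weakest)
        error = oob_error(current)  # X / O for a model trained on `current`
        if error < best_error:
            best_set, best_error = list(current), error
    return best_set, best_error


# Toy stand-ins (assumptions): fixed importance scores and an OOB error
# that depends only on how many factors remain.
toy_importance = {"F1": 0.4, "F2": 0.3, "F3": 0.2, "F4": 0.1}
toy_error_by_size = {4: 0.20, 3: 0.15, 2: 0.18, 1: 0.30}

best_set, best_error = select_factors(
    ["F1", "F2", "F3", "F4"],
    rank_importance=lambda fs: {f: toy_importance[f] for f in fs},
    oob_error=lambda fs: toy_error_by_size[len(fs)],
)
print(best_set, best_error)
```

With these toy inputs the three-factor set has the lowest error, so it is the one returned.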
The χ² test is a hypothesis testing method for categorical variables based on the chi-squared distribution. It is a well-established technique for testing independence and determining whether variables are related (Doğan et al. 2021). Under the null hypothesis that the variables are independent, the actual values should not differ significantly from the theoretical values, and χ² is calculated by Eq. 1:

$$\chi^2 = \sum \frac{(A - T)^2}{T} \tag{1}$$

where A is the actual value and T is the theoretical value.
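As a minimal sketch of Eq. 1, the statistic can be computed from a contingency table of factor level against landslide presence. The function name `chi_squared` and the counts are illustrative assumptions rather than data from the study, and the theoretical values T are taken here as the expected counts under independence.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a 2-D contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, actual in enumerate(row):
            # Theoretical value under independence of rows and columns.
            theoretical = row_totals[i] * col_totals[j] / total
            stat += (actual - theoretical) ** 2 / theoretical
    return stat


# Hypothetical counts: rows = factor level present / absent,
# columns = landslide cells / non-landslide cells.
table = [[30, 70], [10, 90]]
stat = chi_squared(table)
CRITICAL_VALUE = 3.84  # critical value quoted in this paper
print(round(stat, 2), stat > CRITICAL_VALUE)
```

A statistic above the critical value leads to rejecting independence, that is, the factor is taken to be related to landslide occurrence.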

Susceptibility assessment model
The RF model, DBN model, and SVM model have been extensively applied in susceptibility assessment, and they provide a solid foundation for the assessment. In the present study, the landslide region is set to 1, while the non-landslide region is set to 0. 30% of the study data are selected randomly as the training data to predict the landslide susceptibility of each grid cell in the study area. The ROC curve is a commonly used method to verify the assessment results of landslide susceptibility (Chen, Lei, et al. 2021). In the present study, the landslide susceptibility is thus evaluated using the three models, and the accuracies of the assessment results are verified using the ROC curve. The more accurate the assessment results are, the higher the accuracy of the factor selection is.
The RF model is an ensemble learning algorithm built from unpruned classification trees, which are created by bootstrap sampling and random feature selection; the result is obtained by a majority vote of the classification trees (Xie et al. 2019).
The DBN is an efficient unsupervised deep learning algorithm and a probabilistic generative model composed of restricted Boltzmann machines (RBMs). An RBM consists of a visible layer and a hidden layer, and there are no connections between the neurons within each layer. A DBN is built by stacking several RBMs: the hidden layer of one RBM is the visible layer of the next RBM, so the output of one RBM is the input of the next.
The SVM can use various kernel functions, such as the linear function, polynomial function, and radial basis function (RBF). The main parameters are the penalty parameter (c) and the kernel function parameter (g). In the present study, the RBF is chosen as the kernel function, and the optimal c and g are found using grid search and cross-validation.

Improved association rule learning
Association rule learning is a common approach for discovering strong rules hidden in a large database, and it is used here for mining the frequent SFCs of landslides in the study area.

Original ARA
The dataset of the original ARA has the form {TID: itemset}, in which the TID is the transaction identifier and the itemset is the content of that transaction. These two parameters are the data basis for finding the association rules and frequent itemsets. There are two sub-problems in the original ARA: (a) finding the frequent itemsets whose supports are greater than the specified minimum support, and (b) determining the strong association rules based on the frequent itemsets and the minimum confidence. The support and confidence are obtained by Eq. 2 and Eq. 3, respectively.

$$\mathrm{Support}(A, B) = P(A \,\&\, B) \tag{2}$$

$$\mathrm{Confidence}(A \Rightarrow B) = P(B \mid A) = \frac{P(A \,\&\, B)}{P(A)} \tag{3}$$

where P(A&B) is the probability that A and B occur concurrently; P(B|A) denotes the probability of B given A; and P(A) is the probability of A.
The Apriori algorithm is an ARA for Boolean mining based on a level-wise recursion with two key steps, namely the connection step and the pruning step. The FP-Growth algorithm compresses the data of frequent itemsets into a frequent pattern tree and retains the itemset association information. It does not need candidate sets and only needs to traverse the database twice. The Eclat algorithm is a depth-first search algorithm based on set intersection. It is applied to sequential and parallel problems with the characteristic of local reinforcement. Its inverted (vertical) representation takes the item as the key and the transaction IDs as the value. The detailed steps of the Apriori, FP-Growth, and Eclat algorithms are shown in Figure 2.
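To make Eq. 2 and Eq. 3 concrete, the following sketch computes count-based support and confidence on the inverted layout described for the Eclat algorithm (item as key, set of transaction IDs as value). The transactions and function names are hypothetical and only illustrate the definitions.

```python
# Hypothetical transactions: TID -> set of secondary-factor codes
transactions = {
    1: {21, 41, 74},
    2: {21, 41},
    3: {34, 41, 74},
    4: {21, 74},
    5: {34, 41},
}


def vertical(transactions):
    """Invert {TID: itemset} into {item: set of TIDs}."""
    index = {}
    for tid, items in transactions.items():
        for item in items:
            index.setdefault(item, set()).add(tid)
    return index


def support(itemset, index, n):
    """Eq. 2: fraction of transactions that contain every item of the itemset."""
    tids = set.intersection(*(index[i] for i in itemset))
    return len(tids) / n


def confidence(antecedent, consequent, index, n):
    """Eq. 3: support of A and B together divided by the support of A."""
    return support(antecedent | consequent, index, n) / support(antecedent, index, n)


index = vertical(transactions)
n = len(transactions)
sup = support({21, 41}, index, n)        # 2 of 5 transactions
conf = confidence({21}, {41}, index, n)  # 2 of the 3 transactions containing 21
print(sup, round(conf, 3))
```

Every transaction contributes exactly one count here, regardless of how large the corresponding landslide is.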
The original ARA mines the itemsets that meet the support and confidence requirements from a large number of itemsets by counting the occurrences of items. It can only mine frequent SFCs from the number of landslide occurrences and is therefore difficult to apply to the present study.

Improved ARA
In the present study, the scope of landslides is a non-negligible parameter for association rule analysis. However, the two parameters of the original ARA cannot accurately reflect it. A characteristic is therefore introduced, and the improved ARA mines the frequent itemsets based on {TID: itemset, characteristic}, in which the characteristic is a continuous variable. The frequent itemsets are mined based on the accumulation of the corresponding characteristic rather than the number of occurrences of the itemsets.
The characteristic introduced in the improved ARA is, in this paper, the area of historical landslides. The support and confidence are redefined using the area of historical landslides (Eqs. 4, 5):

$$\mathrm{Support}(A, B) = \frac{\mathrm{Area}(A \,\&\, B)}{\sum \mathrm{Area}} \tag{4}$$

$$\mathrm{Confidence}(A \Rightarrow B) = \frac{\mathrm{Area}(A \,\&\, B)}{\mathrm{Area}(A)} \tag{5}$$

in which Area(A&B) is the area of historical landslides with the secondary-factors A and B; Area(A) is the area of historical landslides with the secondary-factor A; and ΣArea is the total area of historical landslides.
For the improved Apriori algorithm, after the datasets are scanned, the candidate itemsets are generated by accumulating the landslide area. The frequent SFCs are mined and then connected and pruned based on the support in Eq. 4. The confidence used to generate rules is updated using Eq. 5. For the FP-Growth algorithm, the nodes created from the frequent item table also store the accumulated characteristic when the FP-trees are built. For the Eclat algorithm, the support is calculated from the accumulated characteristic rather than the length of the TID set, and the same modification as in the Apriori algorithm applies when the candidate itemsets are generated.
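The improved Apriori procedure can be sketched as below. This is an illustrative reading of Eqs. 4 and 5, not the authors' code: the records, areas, and the threshold passed to `improved_apriori` are hypothetical, and the names `area_support`, `improved_apriori`, and `area_confidence` are introduced only for this example.

```python
from itertools import combinations

# Hypothetical records: (secondary-factor codes, landslide area in m^2)
records = [
    ({21, 41, 74}, 500.0),
    ({21, 41}, 150.0),
    ({34, 41, 74}, 250.0),
    ({21, 74}, 50.0),
    ({34, 41}, 50.0),
]


def area_support(itemset, records, total_area):
    """Eq. 4: landslide area carrying every item, over the total landslide area."""
    covered = sum(area for items, area in records if itemset <= items)
    return covered / total_area


def improved_apriori(records, min_support):
    """Level-wise mining in which support accumulates area instead of counts."""
    total_area = sum(area for _, area in records)
    items = sorted({i for itemset, _ in records for i in itemset})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        kept = {}
        for candidate in level:
            s = area_support(candidate, records, total_area)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        # Connection step: join frequent k-itemsets into (k+1)-candidates.
        k += 1
        joined = {a | b for a, b in combinations(kept, 2) if len(a | b) == k}
        # Pruning step: every k-subset of a candidate must itself be frequent.
        level = [c for c in joined
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent


def area_confidence(antecedent, consequent, frequent):
    """Eq. 5: Area(A & B) / Area(A), expressed through the stored supports."""
    return frequent[antecedent | consequent] / frequent[antecedent]


frequent = improved_apriori(records, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
rule_conf = area_confidence(frozenset({21}), frozenset({41}), frequent)
```

Because support is a sum of non-negative areas, it can only decrease as items are added to an itemset, so the usual Apriori pruning argument still holds for the area-weighted version.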
In addition, the FR is used to verify the association between landslides and the frequent SFCs obtained using the improved ARA. The FR is obtained by Eq. 6:

$$FR = \frac{A_{LF_i} / A_{F_i}}{A_L / A} \tag{6}$$

where $A_{LF_i}$ is the area of historical landslides with the SFC $F_i$; $A_{F_i}$ is the area with the SFC $F_i$; $A_L$ is the area of landslides; and $A$ is the area of the study area.
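A short worked example of Eq. 6 follows. The function name and the area values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def frequency_ratio(landslide_area_with_sfc, area_with_sfc, landslide_area, study_area):
    """Eq. 6: landslide density inside the SFC zone over the density of the whole area."""
    return (landslide_area_with_sfc / area_with_sfc) / (landslide_area / study_area)


# Hypothetical areas in km^2 (not values from the study)
fr = frequency_ratio(0.6, 40.0, 1.5, 500.0)
print(round(fr, 2))
```

Here the landslide density in the SFC zone is 0.015 against 0.003 for the whole area, so landslides are over-represented in the zone and the FR exceeds one.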

Study area
Kitakyushu is located in northern Kyushu Island, Japan (Figure 3). It spans latitudes 33°58′–33°43′ N and longitudes 130°40′–131°01′ E, with an area of 488.78 km². The terrain tilts from north to south with a relative altitude of 954 m. According to the geological characteristics and terrain genesis of the study area, it can be divided mainly into four regions, namely the southern mountain region, central plain region, northeastern mountain region, and northwestern hilly region. The terrain is smooth, with an overburden soil layer about 1.30–1.76 m thick. According to the Ministry of Land, Infrastructure, Transport and Tourism of Japan, the geological condition of the study area is complex, with active tectonic movement. The geological formations are mainly sedimentary and igneous rocks, and the landfill area accounts for more than 5% of the study area. The study area is warm and humid throughout the year, with an average annual temperature of 16.2 °C and an average annual precipitation of 1265 mm (Sun et al. 2011). Northern Kitakyushu has a typical Sea of Japan climate, while the eastern region has a Seto Inland Sea climate, which is warm and dry. The precipitation varies significantly and is concentrated in the rainy season and the typhoon season. Meanwhile, the study area is located in the Pacific Rim Volcanic Seismic Zone at the junction of the Eurasian and Pacific plates, with frequent crustal movement. Landslides induced by rainfall and earthquakes thus occur frequently, and most of them are shallow landslides with a sliding surface depth of less than 6 m. The data of historical landslides are obtained from the Bureau of Land Policy in the Ministry of Land, Infrastructure, Transport and Tourism of Japan. The geological environment of the landslide-prone regions is complex, with active tectonic movements such as earthquakes.
The historical landslides from 1992 to 2011 are shown in Figure 3.

Case influencing factors
Landslides are typical multi-factor complex geo-hazards, and their mechanisms are complicated by various inducing factors. Kitakyushu is taken as the study area, and the data for the study area are obtained from the field environment and related literature. The digital elevation model (DEM) data at a resolution of 10 m are provided by the Geospatial Information Authority of Japan. The geological conditions, such as lithology, surface information, and runoff, are provided by the Land and Water Resources Bureau in the Ministry of Land, Infrastructure, Transport and Tourism of Japan. The present study establishes a factor system consisting of ten factors, namely soil thickness (ST), cumulative runoff (CR), distance from road (DRO), topography, elevation, distance from construction line (DCL), slope, distance from railway (DRA), lithology, and distance from river (DRI). There are two qualitative factors and eight quantitative factors in the established factor system. Each factor is divided into four levels, which avoids the excessive computation caused by too many factor levels and yields 40 secondary-factors. Four quantitative factors, namely the ST, CR, elevation, and slope, are reclassified using the natural break method. However, the other four quantitative factors, namely the DRO, DCL, DRA, and DRI, are unsuitable for the same method because they do not affect the entire study area and their impacts disappear beyond a short distance. These four factors are therefore reclassified within a certain distance according to their actual scope of influence. The maps of the various factors are shown in Figure 4.

Selection and verification of influencing factor
A significant characteristic of the RF model is the OOB estimate, from which the feature importance can be calculated. Based on recursive feature elimination, the factor importance (the OOB score) for the various factor sets and the OOB errors of those factor sets are presented in Figure 5 and Figure 6, respectively.
As shown in Figure 5 and Figure 6, the OOB error is lowest with eight influencing factors, and the factors selected in the present study are thus DCL, topography, DRO, slope, DRI, ST, CR, and DRA.
The χ² test can be used to verify the significance of the factors affecting landslides. The χ² values of the eight factors are obtained using Statistical Product and Service Solutions (SPSS), listed in Table 1, and compared with the critical value of the test. All eight χ² values are greater than the critical value (k = 3.84), which supports the accuracy of the factor selection.

Landslide susceptibility assessment
The landslide susceptibility assessment is employed to further verify the accuracy of the factor selection. The assessment results of landslide susceptibility are obtained using the RF model, DBN model, and SVM model. The results are then classified using the natural break method into five levels, namely very low, low, medium, high, and very high susceptibility, and the level maps obtained using the three models are shown in Figure 7. The ROC curves of the assessment results of the three models are obtained using the historical landslide data as a reference to verify the accuracy (Figure 8).
As can be seen from Figure 7, the level distributions of the landslide susceptibility maps obtained by the three models are very similar. The high-susceptibility areas are mainly distributed in the south-central region, while the low-susceptibility areas are distributed in the northern region. The area under the curve (AUC) lies in the range of 0.5-1; an AUC of 1 indicates perfect prediction, while an AUC of 0.5 indicates a prediction no better than chance. The AUCs of the RF model, DBN model, and SVM model are 0.909, 0.878, and 0.809, respectively. All three AUCs are greater than 0.8, which indicates that the assessment results are highly accurate. The highest accuracy is recorded by the RF model, followed by the DBN model and the SVM model. Meanwhile, the results further support the accuracy of the factor selection.
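For readers who want to reproduce the AUC check, the value can be computed directly as the probability that a randomly chosen landslide cell receives a higher susceptibility score than a randomly chosen non-landslide cell, which equals the area under the ROC curve. The labels and scores below are hypothetical and are not outputs of the three models.

```python
def auc(labels, scores):
    """AUC as the fraction of (landslide, non-landslide) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical grid cells: 1 = landslide, 0 = non-landslide
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
value = auc(labels, scores)
print(round(value, 3))
```

In this toy case 11 of the 12 pairs are ranked correctly. The pairwise loop is quadratic in the number of cells, so a rank-based implementation would be preferable for a full study grid.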

Mining and verification of the frequent SFC
The secondary-factors of the study area are coded and shown in Table 2.
The improved ARA is executed in Python, and two parameters, namely the minimum support and the minimum confidence, need to be set in the algorithm. Two methods are usually applied: the trial and error method, and replacing these parameters with other parameters. However, a substitute parameter still needs to be set if the second method is used, so the issue is not fundamentally addressed. The former approach is thus employed in the present study. The minimum support and confidence are set to 60% and 70%, respectively, using the trial and error method, and the mining results are presented in Table 3. Meanwhile, the FR and the χ² test are used to verify the association between the frequent SFCs and landslides, and the results are presented in Table 4.

Table 3 (partially recoverable entries): (21, 41, 74), 88.89%; (34, 41, 74), 88.20%.

The greater the FR is, the greater the probability of landslides is. As can be seen from Table 4, the FRs are greater than one, which indicates that the frequent SFCs are prone to landslides. Sorted by FR in descending order, the SFCs are (34, 41), (34, 41, 74), (21, 41, 74), (21, 41), (41, 74), (34, 74), and (21, 74). The χ² values are greater than the critical value, which also denotes that the frequent SFCs are prone to landslides. Sorted by χ² in descending order, the SFCs are (34, 41), (41, 74), (34, 41, 74), (21, 41), (21, 41, 74), (34, 74), and (21, 74). All FRs are greater than one and all χ² values are greater than the critical value, which indicates a tight relationship between the SFCs and landslides. The most frequent SFC is (34, 41), namely a distance from road > 100 m combined with mountain topography, and the areas with the frequent SFCs need special protection.

Comparison with original ARA
The dataset of the original ARA has the form {TID: itemset}, and the number of itemsets is the number of historical landslides. The TID is the identifier of a landslide, and the itemset consists of the secondary-factors with the largest area in the corresponding landslide. However, with the original ARA, no combination meets the minimum support used for the improved ARA. The minimum support is thus set to 20% using the trial and error method, while the minimum confidence is set to 40%. As a result, there are three SFCs, namely (21, 41), (21, 74), and (41, 74), and their confidences are 48.59%, 41.90%, and 50.70%, respectively.
Even with the minimum support set to 20%, the confidences of the three SFCs are only about 50% at most. This indicates that mining results that take the number of landslides as the research object are not accurate enough for the study area, and that the improved ARA is more applicable to the study area than the original ARA.
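The difference between the two support definitions can be seen on a small hypothetical inventory in which one large landslide carries one combination and several small landslides carry another. The data and function names are illustrative assumptions only.

```python
# Hypothetical landslides: (secondary-factor codes, area in m^2)
landslides = [
    ({34, 41}, 9000.0),  # one large landslide
    ({21, 74}, 300.0),
    ({21, 74}, 400.0),
    ({21, 74}, 300.0),
]


def count_support(itemset, landslides):
    """Original ARA: share of landslide events that carry the combination."""
    hits = sum(1 for items, _ in landslides if itemset <= items)
    return hits / len(landslides)


def area_support(itemset, landslides):
    """Improved ARA: share of landslide area that carries the combination."""
    total = sum(area for _, area in landslides)
    covered = sum(area for items, area in landslides if itemset <= items)
    return covered / total


for sfc in ({34, 41}, {21, 74}):
    print(sorted(sfc), count_support(sfc, landslides), area_support(sfc, landslides))
```

Counting events ranks (21, 74) first with a support of 0.75, whereas weighting by area ranks (34, 41) first with a support of 0.9, which is the kind of reversal the area-based definition is meant to capture.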

Discussions
According to the relevant literature (Xie et al. 2019), geo-hazards including landslides are analyzed using data statistics and research reports. However, much of the literature on the analysis of geo-hazards pays particular attention to susceptibility assessment. MLAs, such as the RF model and SVM model, are the most commonly used approach (Chang et al. 2019; Merghadi et al. 2020). In recent years, deep learning algorithms, such as the DBN model, have begun to outperform traditional methods and have developed rapidly (Wang, He, et al. 2020). These models have been the focus of studies on landslide prevention. Although the susceptibility assessment results are refined and accurate, the assessment process is laborious, cumbersome, and time-consuming. An alternative approach thus needs to be developed. Mining the deep information of landslides using a data mining algorithm is simpler and more rapid, and it is also valuable for the preliminary prevention of landslides. Current studies have investigated the triggering factors and threshold analysis of landslides using data mining methods and have generated association rules between triggering factors and deformation (Ma et al. 2017; Miao et al. 2021). Meanwhile, researchers have determined the cause-and-effect relationships between factors and landslide movement by identifying the contribution of each parameter to landslides using association rule mining (Ma et al. 2017). However, the original ARA can only be used for discrete problems, and problems involving a continuous variable cannot be solved. The methods of the previous studies cannot be applied to this study because they cannot accurately reflect the scope of landslides, which is a non-negligible parameter, and few current studies have improved the original ARA. The present study thus introduces the area of historical landslides to improve the original ARA so that it can be applied to the analysis of landslide factors.
Landslide information can be mined using the improved association rule analysis. Finding the frequent SFCs and discovering the association rules between landslides and the influencing factors are particularly useful for landslide prevention. The study provides a novel insight into the improvement of the ARA and a valuable reference for the primary prevention of landslides. However, the proposed method mines the association rules based on the scope of landslides; it thus addresses the issue only in terms of space and has certain spatiotemporal limitations. Future research should concentrate on extracting more valuable landslide information to optimize the analysis of landslide prevention and rescue.

Conclusions
In the present study, the ARAs, namely the Apriori algorithm, FP-Growth algorithm, and Eclat algorithm, are improved for mining the frequent SFCs of landslides. Few studies have optimized the ARA in this way or applied an improved ARA to landslide analysis. The conclusions are as follows:
1. The influencing factors of landslides in the study area are selected using the OOB error and the χ² test. The factors are used as the evaluation indices to evaluate the landslide susceptibility using the RF model, DBN model, and SVM model, which further verifies the accuracy of the factor selection.
2. The ARA is improved by introducing a continuous variable, namely the area of historical landslides, so that it can be applied to the present study, and the frequent SFCs are mined. The frequent SFCs are (21, 41), (21, 74), (34, 41), (34, 74), (41, 74), (21, 41, 74), and (34, 41, 74), and the association between the SFCs and landslides is verified using their FRs and the χ² test. Their FRs are 5.20, 3.71, 5.97, 4.95, 5.16, 5.23, and 5.91, respectively, and their χ² values are 37.00, 4.29, 46.80, 22.99, 43.88, 30.97, and 39.81, respectively, all of which are greater than the critical value.
3. Sorted by FR, the SFCs are (34, 41), (34, 41, 74), (21, 41, 74), (21, 41), (41, 74), (34, 74), and (21, 74); sorted by χ², they are (34, 41), (41, 74), (34, 41, 74), (21, 41), (21, 41, 74), (34, 74), and (21, 74). The most frequent SFC is (34, 41), namely a distance from road > 100 m combined with mountain topography, and the areas with the frequent SFCs need special protection.
4. The results obtained using the original ARAs are not accurate enough for the study area, and the improved ARA is more applicable than the original ARA. The improved ARA provides a valuable reference for the primary prevention of landslides.

Disclosure statement
There are no financial competing interests.

Funding
This work was supported by the National Natural Science Foundation of China under Grant No. 51478483 and No. 41702310 and the China Scholarship Council.

Data availability statement
The data that support the findings of this study are available from the corresponding authors, upon reasonable request.