Performance comparison of landslide susceptibility mapping under multiple machine-learning based models considering InSAR deformation: a case study of the upper Jinsha River

Abstract Landslide susceptibility mapping (LSM) evaluates the spatial probability of landslide occurrence from a set of environmental factors. However, most evaluation methods ignore the dynamic characteristics of landslides, which makes it difficult to obtain reliable predictions. Taking the upper reaches of the Jinsha River as the study area, this article introduces deformation data into the landslide characteristic model and proposes an improved landslide susceptibility evaluation method. Four machine learning models of landslide susceptibility were constructed from 20 landslide-related factors. The prediction accuracy of the models is compared, and both the relative performance of the models and the improvement brought by deformation information are evaluated. The results show that the Random Forest and XGBoost models perform better than the SVM and logistic regression models. The prediction accuracy of the Random Forest and XGBoost models improves clearly after InSAR deformation is introduced: 96.9 and 93.19% of landslide areas were reasonably classified as high or very high risk. Compared with the results of the traditional model, the proportion of high and very high risk pixels in the landslide area increased by 2.97 and 1.13%, respectively. In addition, the percentage of high and very high risk areas in the susceptibility evaluation area increased from 15.45 to 16.23% and from 18.73 to 21.89%, respectively. The accuracy of the Random Forest and XGBoost models increased from 0.793 to 0.878 and from 0.776 to 0.812, respectively, and the AUC increased by 0.9 and 1.7%, respectively. SHAP and traditional feature importance analysis reveal that rainfall, aspect, temperature and NDVI are the main influencing factors of landslides in the upper reaches of the Jinsha River.


Introduction
The prediction of landslide susceptibility is an important subject in landslide hazard research. Landslide susceptibility mapping (LSM) analyzes the relationship between geological and environmental factors and landslide distribution in order to predict the spatial probability of landslide occurrence (Gao et al. 2023). LSM plays an important role in revealing the controlling factors of landslides and identifying landslide-prone areas, and has become an important means of pre-disaster warning (Fell et al. 2008; Huang et al. 2022). LSM uses landslide samples and characteristic factors to predict landslide probability and rests on two main assumptions: (1) future events are likely to occur under conditions similar to those of past events; (2) landslide conditioning factors are spatially correlated and can be used as predictors (Stefan et al. 2021; Huang et al. 2022; Merghadi et al. 2020).
Currently, there are four main types of LSM methods: physics-based models, heuristic models, statistical models and machine learning (ML) models (Merghadi et al. 2020). Each has advantages and limitations. Physics-based models require a large amount of detailed data to provide reliable results; they perform well in small regions where such data are available but are not very practical for LSM over large regions (Huang et al. 2020). Subsequently, heuristic and statistical models were gradually adopted for LSM (Stefan et al. 2021). The heuristic model ranks or weights the influencing factors of landslides according to expert knowledge and transforms them into rules that can be implemented by computers, but this method is highly subjective and its predictions may be unreliable (Dou et al. 2020). Statistical models have been successfully applied to LSM in the last decade (Merghadi et al. 2020). However, the mechanism of geological disasters such as landslides is complex, and simple statistical relationships struggle to describe the link between landslides and environmental factors (Huang et al. 2020; Stefan et al. 2021); they have difficulty handling complex nonlinear relationships and achieving high accuracy (Frattini et al. 2010). Machine learning models can effectively capture the nonlinear relationship between landslides and environmental factors, with higher flexibility and accuracy. Since 2010, machine learning methods such as support vector machine (SVM), logistic regression (LR), random forest (RF) and XGBoost have been widely applied in landslide susceptibility research because of their high prediction accuracy (Bui et al. 2016; Dou et al. 2020; Hu et al. 2020; Chen and Zhang 2021; Zhao et al. 2021; Ado et al. 2022). Trigila (2015) predicted shallow landslide susceptibility on steep slopes based on LR and RF models, and the performance evaluation demonstrated the excellent predictive ability of these models. Zhang (2022) used XGBoost and RF models to predict the susceptibility of landslides triggered by the 2018 Hokkaido earthquake and obtained accurate predictions. Goetz (2015) tested SVM in Austria and showed that it can be used to predict landslide susceptibility.
Machine learning modeling has advantages in LSM, but performance differs between modeling tools. Some scholars have compared the performance of machine learning models (Kavzoglu et al. 2014; Chen et al. 2017; Song et al. 2018). Kavzoglu (2014) compared the performance of decision analysis, SVM and logistic regression models in shallow landslide mapping, and the results showed that the former two performed better than the traditional logistic regression method. Chen (2017) compared the performance of decision tree and logistic regression models in the landslide susceptibility analysis of Taibai County, and concluded that the LR model performed in a well-balanced way in training and validation. Song (2018) applied XGBoost and logistic regression models to predict landslide susceptibility in the Three Gorges Reservoir area, compared their performance parameters, and found that XGBoost performed better when LSM involved unbalanced landslide data. However, there is no consensus on which ML algorithm is best for predicting landslide susceptibility. In addition, the prediction accuracy of landslide modeling is affected not only by the quality of the ML algorithm used, but also by the landslide causative factors (Novellino et al. 2021; Huang et al. 2022).
At present, the prediction of landslide susceptibility is mostly based on static environmental factors (elevation, slope, distance to river, etc.), and the influence of dynamic characteristic factors is often ignored, resulting in inaccurate landslide prediction and false negative errors (judging unstable areas to be stable). In recent years, InSAR technology has been widely used in displacement monitoring and early identification of disasters, and is an effective method for obtaining large-scale, high-resolution surface deformation (Havenith et al. 2006; Othman and Gloaguen 2013; Zhang et al. 2013; Wasowski and Bovenga 2014; Shafique et al. 2016; Nishiguchi et al. 2017; Zhao et al. 2018). Moreover, some studies have used PS-InSAR (Persistent Scatterer InSAR) and SBAS-InSAR (Small Baselines Subset InSAR) to predict the future development of landslides (Yao et al. 2017; Fan et al. 2019; Zhang et al. 2020; Xu et al. 2021). InSAR deformation reflects the motion state of landslides and can be used as a characteristic factor in susceptibility assessment models. Deformation results can compensate for the weakness of traditional LSM prediction models in analyzing the dynamic characteristics of landslides. Based on ground deformation acquired with InSAR, some scholars have corrected LSM errors or analyzed the resulting improvement of landslide susceptibility models (Novellino et al. 2021; Huang et al. 2022; Gao et al. 2023). Novellino (2021) proposed a combination of machine learning and InSAR technology for landslide risk analysis; this method provides a different landslide risk prediction model that gives a better understanding of the landslide process. Taking a reservoir area as an example, Gao (2023) introduced deformation data as a dynamic factor, built a convolutional neural network (CNN) model, and found that the model was effective at identifying landslide-prone areas undergoing slow deformation. Huang (2022) proposed an improved LSM method based on InSAR deformation results, in which the fusion of the random forest model and the InSAR method increased the sensitivity level by 2.74% compared with the original model. These studies promote the application of InSAR technology in LSM. However, the InSAR work in these studies fails to consider deformation information on a long time scale and the gaps in deformation data caused by the satellite observation direction (Line of Sight, LOS). Moreover, most existing studies tested a single model and rarely compared how much the introduction of deformation factors improves different models, which is a problem that urgently needs to be addressed in landslide susceptibility evaluation. To address it, this article carries out an LSM study that considers InSAR deformation. SVM, LR, RF and XGBoost are representative machine learning models that are easy to implement in ArcGIS, Python, GEE and other environments. Because these models are readily accepted and applied, this article chooses these four for a comparative study. The purpose of this article is to compare the performance of the models in the susceptibility prediction of the study area according to the evaluation parameters, and to quantitatively assess how much InSAR technology improves model performance and which model is better suited to the introduction of InSAR deformation characteristics.
After the machine learning models predicted the landslide susceptibility, we further explained the predicted results. Explainability has always been an important issue in machine learning, because it is needed to understand how machine-learning models make decisions (Al-Najjar et al. 2022). Some recent research has applied explainability methods to machine-learning models (Novellino et al. 2021; Al-Najjar et al. 2022; Ekmekcioğlu et al. 2022a), such as Shapley Additive Explanations (SHAP) or Local Interpretable Model-Agnostic Explanation (LIME). Al-Najjar (2022) employed a SHAP approach and proposed an explainable artificial intelligence (XAI) method for landslide prediction. The results show that the XAI method can measure the impact, interaction and correlation of conditioning factors. Ekmekcioğlu and Koc (2022b) calculated the contribution of risk conditioning factors to event outcome prediction based on the SHAP algorithm, which increased the interpretability of the method. Collini (2022) introduced explainable artificial intelligence technology in the study of LSM for rainfall-induced landslides. Therefore, we combined SHAP and feature importance analysis to explain the influencing factors of landslides.
Under the influence of special topographic, hydrological and geological structural conditions, many landslides have developed in the upper reaches of the Jinsha River (Harp et al. 2011; Du et al. 2020; Liu et al. 2021). These landslides seriously threaten the safety of nearby residents and major hydropower projects (Yao et al. 2022a) and directly affect the development of society and the economy (Huang et al. 2022). There is an urgent need for a systematic and accurate landslide susceptibility evaluation that considers landslide deformation characteristics. Therefore, this article takes the upper reaches of the Jinsha River as the study area. After considering multiple landslide-related factors and InSAR deformation results, we evaluated and analyzed the landslide susceptibility based on four machine-learning models.

Study area
The study area is located in the southeast of the Tibetan Plateau (Figure 1), at the junction of Sichuan and Tibet, and is an essential route for the Sichuan-Tibet Railway. The study area is a typical alpine valley landscape with steep topography on both sides of the river; the slope is generally between 20° and 45°, and the elevation is between 2500 and 5400 m. Under the action of plate compression and river erosion, the riverbed is relatively narrow, about 60-150 m, and the valley is V-shaped (Yao et al. 2022a). The temperature in the study area changes seasonally. Summer (June-August) has the highest temperature, with an average of about 10 °C, while winter (December-February) has the lowest temperature, about −15 °C. Spatially, the temperature increases gradually from north to south (upstream to downstream). Annual precipitation ranges from 500 to 800 mm and falls mainly in summer (Yao et al. 2022a). The fault zones in the study area are densely distributed along the NNW direction. Under the influence of special topographic and tectonic conditions, a large number of landslides have developed on both sides of the river in the study area, and the height difference between the front and back edges of the landslides exceeds 1 km. In recent years, several landslide events have occurred in the study area (Song et al. 2018; Zhao et al. 2018; Zhang et al. 2020), which caused losses of engineering facilities and seriously threatened the lives of nearby and downstream residents.

Geospatial database

Landslide inventory
Spatial geometry data of landslides are among the basic data for landslide susceptibility mapping (Guzzetti et al. 1999). In this article, using Google Earth images and InSAR deformation results, a total of 100 landslides were interpreted from their geomorphic features and deformation characteristics (Figure 1). It is worth noting that these landslides do not include ancient landslide deposits. The total area of landslides in the study area is 57.1 km², and the largest landslide is 8.27 km². The landslides in the study area are mainly rock landslides, and most of them are large and medium-sized deep-seated landslides. Figure 2 shows giant landslides identified by geomorphic features and InSAR deformation results. Machine-learning models require both landslide and non-landslide samples. Therefore, 63 non-landslide samples were selected by stratified random sampling within a 2 km buffer around the landslides. In the subsequent machine-learning training, we input the landslide and non-landslide data as positive and negative samples, respectively.

Landslide-related environmental factors
It is important to select appropriate influencing factors for landslide susceptibility assessment (Wang et al. 2019). The factors affecting landslides are complex. Because geographical and geological conditions differ, there is no unified standard for selecting environmental factors in different regions (Huang et al. 2021). Some scholars have studied the causative factors of landslides (such as the Xiongba, Sala and Baige landslides) in the upper reaches of the Jinsha River (Fan et al. 2019; Liu et al. 2021; Zhang et al. 2020; Yao et al. 2022a, 2022b). The study area is located in the suture zone of the Jinsha River. The topographic conditions of the study area have an important effect on the occurrence of landslides (Liu et al. 2020). Chen and Zhang (2021) found that the spatial distribution of landslides was strongly positively correlated with the distance to the river and the fault zone, and that the main influencing factors were rainfall, rivers, temperature change and earthquakes. Liu (2021) found that the slope of landslides in the upper reaches of the Jinsha River is mainly 10°-45°, and that vegetation changes quickly. Human activities such as vegetation destruction, land use and road construction are also among the inducing factors of landslides (Lissak et al. 2020; Novellino et al. 2021). Therefore, factors related to topography, geology, hydrology and human activities are collected in this article as environmental influencing factors of landslides.
In addition to the topographic factors of elevation, aspect, slope, plane curvature and profile curvature, we also introduced slope length and the topographical position index (TPI), which reflects the relative position of a point with respect to its surrounding area. In terms of geological processes, we collected lithology, distance to historical earthquakes and distance to faults as geological influencing factors. Rainfall infiltration and river erosion can reduce slope strength and increase the weight of the slope body, which are important factors affecting landslides (Sahin 2020a; Wang et al. 2022). We took distance to river, rainfall, temperature, stream power index (SPI) and topographical wetness index (TWI) as hydrological influencing factors. SPI represents the effect of runoff concentration on soil erosion, and TWI is an indicator of the effect of regional topography on runoff accumulation (Amatya et al. 2021). Human activities (such as road building and vegetation destruction) inevitably disturb the slope balance. Land use/land cover (LULC), NDVI (Normalized Difference Vegetation Index) and distance to roads are considered as human activity factors.
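The two hydrological indices can be illustrated with a minimal sketch. The article does not state the formulas it used; the sketch assumes the commonly used definitions TWI = ln(a / tan β) and SPI = a · tan β, where a is the specific catchment area and β is the local slope. The function names and the `eps` guard for flat cells are illustrative assumptions, not part of the original workflow.

```python
import math

def twi(specific_catchment_area, slope_deg, eps=1e-6):
    """Topographic wetness index, assuming TWI = ln(a / tan(beta))."""
    # eps avoids division by zero on flat cells (an assumption of this sketch)
    tan_beta = max(math.tan(math.radians(slope_deg)), eps)
    return math.log(specific_catchment_area / tan_beta)

def spi(specific_catchment_area, slope_deg):
    """Stream power index, assuming SPI = a * tan(beta)."""
    return specific_catchment_area * math.tan(math.radians(slope_deg))

# Hypothetical cell: specific catchment area of 100 m, slope of 45 degrees
wetness = twi(100.0, 45.0)
power = spi(100.0, 45.0)
```

Under these definitions, gentle slopes with large upslope areas give high TWI, while steep slopes with large upslope areas give high SPI.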
Therefore, a total of 18 landslide-related environmental factors are selected in this article. All influencing factor data are shown in Table 1. All the above factors were calculated and retrieved using ArcGIS software and open-source remote sensing data. To facilitate model calculation, we resampled all the influencing factors to the same raster resolution.

Dynamic feature factor
The dynamic characteristics of landslides can be obtained by InSAR technology. In recent years, InSAR and its derived technologies (SBAS-InSAR, PS-InSAR, MAI, etc.) have been widely used for monitoring surface deformation and ground changes (Yao et al. 2022b). In this article, a total of 191 scenes of Sentinel-1 SAR data spanning 4 years were collected with a temporal resolution of 12 days (Table 2). The surface deformation velocity in the study area was obtained using the SBAS-InSAR technology (Section 3.1). Because of the limited satellite observation angle, a single observation direction inevitably has invisible areas, such as shadow and layover regions. Therefore, we integrated the multi-orbit deformation results and obtained a complete surface deformation field with 30 m spatial resolution.

Methods
In this article, an 8 km buffer zone along the upper Jinsha River was established as the study area. First, the surface deformation field was introduced as the dynamic characteristic factor of landslides and combined with the 18 environmental characteristic factors for landslide susceptibility mapping. Then, four statistical and machine learning models were used to construct susceptibility maps of the study area. Finally, the accuracy of the different models in the study area was evaluated. The research flow is shown in Figure 3.

Environmental characteristic factors
In this article, terrain-related factors are calculated from the SRTM (Shuttle Radar Topography Mission) DEM with a 30 m resolution. Several topographic analysis tools in ArcGIS software (such as slope, aspect and curvature) are used to calculate all topographic factors. Furthermore, the Euclidean distance tool in ArcGIS software calculates the distance between each pixel and historical earthquakes, faults, rivers and roads. LULC, NDVI, rainfall and temperature of the study area were retrieved using MODIS remote sensing images and the GEE (Google Earth Engine) remote sensing big data platform. Figure 4 shows the feature factors obtained in the study.

Dynamic characteristic factors
InSAR (Interferometric Synthetic Aperture Radar) technology has been widely applied in hazard identification and early warning (Dong et al. 2018; Lissak et al. 2020; Xu et al. 2021; Liu et al. 2022). It uses the phase information of SAR backscattering and, under optimal conditions, can provide millimetre-level measurement accuracy along the LOS direction (Hanssen, 2001; Ferretti et al. 2007). We downloaded the open-source Sentinel-1 SAR data covering the study area (https://scihub.copernicus.eu) with a time span of 4 years. The GAMMA software and the SBAS-InSAR method were used to obtain the LOS surface deformation in the ascending and descending directions (Costantini, 1998; Goldstein and Werner, 1998; Lyons and Sandwell, 2003). Precise orbit files were downloaded and preprocessed. To reduce the low-coherence area, the thresholds of the temporal and perpendicular baselines were set at 36 d and 200 m, respectively. The SAR data are listed in Table 2. A single observation direction suffers from shadow and layover problems, which leave gaps in the deformation data and make it difficult to meet the data demand. In this article, SAR data from ascending and descending orbits were collected and processed, and the complete deformation field was calculated with a data fusion method.
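The baseline thresholds above determine which acquisitions are paired into interferograms. The following is a minimal sketch of that selection rule only, not of the GAMMA processing chain; the function name, the input layout (acquisition date plus perpendicular baseline relative to a common reference) and the sample values are hypothetical.

```python
from datetime import date
from itertools import combinations

def select_sbas_pairs(acquisitions, max_days=36, max_perp_m=200.0):
    """Return date pairs whose temporal and perpendicular baselines are within limits.

    acquisitions: list of (acquisition date, perpendicular baseline in metres
    relative to a common reference scene). This layout is an assumption.
    """
    pairs = []
    for (d1, b1), (d2, b2) in combinations(sorted(acquisitions), 2):
        temporal_baseline = (d2 - d1).days
        perpendicular_baseline = abs(b2 - b1)
        if temporal_baseline <= max_days and perpendicular_baseline <= max_perp_m:
            pairs.append((d1, d2))
    return pairs

# Hypothetical acquisitions on a 12-day repeat cycle
scenes = [
    (date(2019, 1, 1), 0.0),
    (date(2019, 1, 13), 50.0),
    (date(2019, 1, 25), -180.0),
    (date(2019, 2, 18), 30.0),
]
pairs = select_sbas_pairs(scenes)
```

With a 12-day revisit, a 36 d limit allows each scene to pair with up to three later scenes, provided the perpendicular baseline condition also holds.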

Landslide susceptibility models
Machine learning models offer accurate and reliable performance in landslide susceptibility prediction and can handle large and complex data sets, so they are widely used in landslide prediction. As the most commonly used machine learning models, logistic regression, decision trees (including decision tree ensemble methods) and support vector machines are simple to implement and computationally inexpensive, which makes them suitable for predicting landslide susceptibility in the study area of this article (Kavzoglu et

Feature selection
In landslide susceptibility mapping, selecting a good feature subset is the key to building a reliable susceptibility model (Stefan et al. 2021). Feature selection is a common technique for choosing the optimal feature subset with which to build a robust learning model. The feature selection process mainly includes three steps: feature ranking, model prediction and statistical significance analysis. Feature ranking is used to determine which features have the most impact on model construction (Santos et al. 2014). In this article, four filter feature ranking methods, namely chi-squared, information gain, rank correlation and random forest feature importance, were evaluated to identify the landslide conditioning factors with the greatest influence on landslide susceptibility.
The most commonly used statistical significance test, the Wilcoxon signed-rank test, was applied to determine the best subset model. As a general practice, a result is considered statistically significant only if the p-value is less than 0.05. First, in the feature ranking stage, each generated feature ranking is used to determine the feature scores. Then, the number of times each feature ranking is repeated is counted, and the ranking that is repeated most often is selected. Finally, once the feature ranking, logistic regression model and statistical significance test are completed, we take the optimal feature subset as the feature data set for the susceptibility model.

Logistic regression (LR)
The logistic regression model is a multivariate statistical method built on the logistic function, used to study how a binary dependent variable (landslide or non-landslide) depends on a set of independent variables (Chen et al. 2019; Zhao et al. 2019). The model takes the characteristic factors as independent variables and the landslide/non-landslide label as the dependent variable. By determining the logistic regression coefficients b, the probability P (0 < P < 1) of landslide occurrence under a given combination of evaluation factors is calculated. The logistic regression equation is as follows:
$$P = \frac{1}{1 + e^{-z}} \qquad (1)$$
$$z = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \qquad (2)$$
where P is the probability of landslide occurrence; $x_1, x_2, \dots, x_n$ are the feature factors; $b_1, b_2, \dots, b_n$ are the logistic regression coefficients; and $b_0$ is a constant. The correlation between the feature factors and their contributions to the model was determined by assessing the estimated coefficients and the statistical significance of the logistic regression model.
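As a small illustration of how a fitted logistic regression model turns factor values into a probability, the sketch below evaluates the logistic function of a linear combination of factors. The coefficient and factor values are hypothetical and are not the fitted values of this study; the function name is likewise an assumption.

```python
import math

def landslide_probability(factors, coefficients, intercept):
    """Logistic probability P = 1 / (1 + exp(-z)), z = b0 + sum(b_i * x_i)."""
    z = intercept + sum(b * x for b, x in zip(coefficients, factors))
    # Numerically stable evaluation of the logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

# Hypothetical normalized factors (e.g. rainfall, slope, NDVI) and coefficients
p = landslide_probability([0.8, 0.5, 0.2], [1.2, 0.9, -1.5], -0.6)
```

A positive coefficient raises the predicted probability as its factor increases, and a negative coefficient lowers it, which is how the sign and size of each coefficient are read as a factor contribution.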

Random forest (RF)
Random forest is a widely used ensemble learning method, first proposed by Breiman (2001). RF takes the decision tree as its base model and generates a series of different decision trees by constructing different training datasets and feature subsets (Pradhan 2013). Combining many decision trees improves the prediction accuracy of the RF model. Unlike general statistical models, RF is not affected by the multicollinearity problem faced by regression analysis.
Through bootstrap resampling, N samples are randomly drawn from the original training samples to generate a new training set. Each independently drawn training set is used to train one tree, and the forest is composed of the n decision trees generated from these sample sets (Breiman, 2001). Each decision tree votes independently. Finally, the votes of all decision trees are combined, and the category with the most votes is designated as the final output (Sahin and Colkesen 2021). The generalization error $P^{*}$ of random forest is bounded as follows:
$$P^{*} \le \frac{\bar{\rho}\,(1 - s^{2})}{s^{2}} \qquad (3)$$
where $P^{*}$ is the generalization error of random forest, $\bar{\rho}$ is the average correlation between the decision trees, and s is the average strength of the decision trees. In this article, we determined the hyperparameters of the random forest algorithm through parameter tuning. The number of trees was set to 80 and the minimum size of the terminal nodes was set to 8 (Probst et al. 2019). The number of variables randomly sampled as candidates at each split was set to 5 (about one-third of the number of factors in the raster stack).
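The bootstrap, voting and error-bound ideas above can be sketched in a few lines. This is an illustration of the mechanics only, not the implementation used in the study; the function names are hypothetical, and the bound assumes Breiman's (2001) form, mean correlation times (1 − s²) / s².

```python
import random
from collections import Counter

def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement from the training samples."""
    n = len(samples)
    return [samples[rng.randrange(n)] for _ in range(n)]

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def generalization_error_bound(mean_correlation, strength):
    """Upper bound on the forest error, assuming Breiman's (2001) form."""
    s2 = strength ** 2
    return mean_correlation * (1.0 - s2) / s2

rng = random.Random(0)           # fixed seed so the sketch is repeatable
training = list(range(10))       # stand-in for 10 training samples
resampled = bootstrap_sample(training, rng)
label = majority_vote([1, 0, 1, 1, 0])   # votes from five hypothetical trees
```

The bound shrinks when the trees are individually stronger (larger s) or less correlated with each other, which is the motivation for sampling both the training data and the candidate variables at each split.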

Support vector machine (SVM)
Support vector machine (SVM) is a supervised learning method based on statistical learning theory (Huang and Zhao 2018; Luo et al. 2019). SVM uses nonlinear mapping functions to convert the model input variables into a high-dimensional space so that they can be accurately fitted by a linear model (Yao et al. 2008). SVM has several advantages: it handles nonlinear and high-dimensional pattern recognition problems well when only a small number of samples is available, and it has been widely used in landslide susceptibility assessment research (Yao et al. 2008; Xu et al. 2013). The model is described as follows. Consider a set of linearly separable vectors $x_i$ ($i = 1, 2, \dots, n$) that belong to two classes $y_i = \pm 1$, namely landslide and non-landslide, where n is the number of training samples. The goal of the SVM training process is to find an optimal hyperplane that divides the training data into the two output classes (Luo et al. 2019). Its expression is:
$$\min \frac{1}{2}\|\omega\|^{2} \qquad (4)$$
The formula should meet the following conditions:
$$y_i\left((\omega \cdot x_i) + b\right) \ge 1 \qquad (5)$$
where $\|\omega\|$ is the norm of the normal of the hyperplane and b is a constant.
Based on the Lagrange multiplier, the cost function can be expressed as:
$$L = \frac{1}{2}\|\omega\|^{2} - \sum_{i=1}^{n} \lambda_i \left( y_i\left((\omega \cdot x_i) + b\right) - 1 \right) \qquad (6)$$
where $\lambda_i$ is the Lagrange multiplier. For the non-separable case, the relaxation variable $\xi_i$ can be introduced as a constraint:
$$y_i\left((\omega \cdot x_i) + b\right) \ge 1 - \xi_i \qquad (7)$$
At this point, L is extended with a term that penalizes the relaxation variables $\xi_i$ (Equation 8). Previous studies have shown (Xu et al. 2012) that the Gaussian kernel function gives better results. Therefore, this article selected the Gaussian kernel function, whose expression is shown in Equation (9):
$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^{2}\right) \qquad (9)$$
To improve the training accuracy and prevent overfitting, the gamma term ($\gamma$) of the radial basis function (RBF) kernel was set to 0.2.
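To make the role of the gamma term concrete, the sketch below evaluates a Gaussian (RBF) kernel between two feature vectors, assuming the common form K(x, x') = exp(−γ‖x − x'‖²) with γ = 0.2 as stated above. The function name and the sample vectors are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma=0.2):
    """Gaussian (RBF) kernel, assuming K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical samples described by two normalized factors each
same = rbf_kernel([0.0, 0.0], [0.0, 0.0])   # identical samples
near = rbf_kernel([0.0, 0.0], [1.0, 2.0])   # squared distance of 5
```

A larger gamma makes the kernel value fall off faster with distance, so each training sample influences a smaller neighbourhood and the decision boundary becomes more flexible; a smaller gamma gives a smoother boundary.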

Extreme gradient boost (XGBoost)
XGBoost is a tree-based ensemble algorithm and is currently one of the representative ensemble learning methods; it works by successively adding new functions that fit the residuals (Chen and Guestrin 2016). The XGBoost algorithm is good at handling discrete independent variables and classification problems. It takes weak evaluators (decision trees) as base learners and combines them during training to obtain a strong ensemble evaluator (Park and Kim 2019). Compared with other machine learning algorithms, XGBoost uses a second-order Taylor expansion to optimize the loss function and adds regularization terms to control the complexity of the model, which greatly improves its computational efficiency and generalization ability. The XGBoost algorithm can be expressed as follows:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \qquad (10)$$
where $\hat{y}_i$ is the predicted landslide susceptibility, $f_k$ is the kth classification tree, produced by the kth round of iteration, K is the total number of trees, and F is the collection of all classification trees.
The expression of the objective function (Obj) is as follows:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (11)$$
where $\sum_{i=1}^{n} l(y_i, \hat{y}_i)$ is the loss function, which evaluates the error between the predicted susceptibility and the true landslide susceptibility, and $\sum_{k=1}^{K} \Omega(f_k)$ is the regularization term that avoids over-fitting, shown in Equation (12). The XGBoost algorithm keeps adding classification trees, each learning a new function to fit the previous prediction error.
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \|\omega\|^{2} \qquad (12)$$
where $\omega$ is the leaf node score, T is the total number of leaf nodes, and $\gamma$ and $\lambda$ are the weighting factors (Chen and Guestrin 2016). Through Taylor expansion of the objective function, the optimal objective function is as follows:
$$Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \qquad (13)$$
where $g_i$ is the first-order partial derivative of the loss function, $h_i$ is the second-order partial derivative, and $I_j$ is the set of samples assigned to leaf j. The reduction of loss after splitting is as follows:
$$Gain = \frac{1}{2}\left[ \frac{\left(\sum_{i \in I_L} g_i\right)^{2}}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^{2}}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^{2}}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \qquad (14)$$
where $I_L$ and $I_R$ are the sample sets of the left and right child nodes after the split and $I = I_L \cup I_R$.
Through debugging, we adjusted several hyperparameters in XGBoost. The number of trees in the ensemble (n_estimators) was set to 50. To prevent overfitting, we set subsample and colsample_bytree to 0.6 and 0.75, respectively.
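The quantities that drive tree growth in XGBoost can be illustrated with a short sketch. It assumes the standard formulation of Chen and Guestrin (2016), in which the optimal leaf score is −G / (H + λ) and the split gain compares the structure scores of the children with that of the parent, where G and H are the sums of first- and second-order derivatives in a node. Function names and the sample gradients are hypothetical; this is not the xgboost library API.

```python
def leaf_weight(gradients, hessians, lam):
    """Optimal leaf score -G / (H + lambda) for the samples in one leaf."""
    return -sum(gradients) / (sum(hessians) + lam)

def structure_score(gradients, hessians, lam):
    """G^2 / (H + lambda), the quality term of one node."""
    g_total = sum(gradients)
    return g_total * g_total / (sum(hessians) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction of a split: half the score improvement minus gamma."""
    parent = structure_score(g_left + g_right, h_left + h_right, lam)
    children = structure_score(g_left, h_left, lam) + structure_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma

# Hypothetical logistic-loss derivatives for an initial prediction of 0.5:
# g = p - y and h = p * (1 - p); left node holds landslides, right node non-landslides
g_l, h_l = [-0.5, -0.5], [0.25, 0.25]
g_r, h_r = [0.5, 0.5], [0.25, 0.25]
gain = split_gain(g_l, h_l, g_r, h_r, lam=1.0, gamma=0.0)
w_left = leaf_weight(g_l, h_l, lam=1.0)
```

A split is kept only when its gain is positive, so a larger γ prunes more aggressively and a larger λ shrinks the leaf scores towards zero.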

Performance evaluation
In this article, 30% of the samples were selected for model validation. The prediction accuracy of a susceptibility model is usually tested with the receiver operating characteristic (ROC) curve and the area under the curve (AUC; Wang et al. 2019). The horizontal axis of the ROC curve represents the proportion of non-landslides predicted as landslides (false positive rate). The vertical axis represents the proportion of landslides correctly predicted (true positive rate), and the AUC represents the overall accuracy of the susceptibility model. In addition, accuracy, kappa value, precision, recall and F1 score were used to comprehensively evaluate the quality of LSM.
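The evaluation indices listed above can all be derived from the confusion matrix and the predicted scores. The sketch below shows one way to compute them with the standard definitions; the function names and the six-sample example are hypothetical and do not reproduce the values reported in this study. The AUC is computed as the fraction of landslide/non-landslide pairs in which the landslide receives the higher score.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, with 1 = landslide."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Cohen's kappa: agreement beyond what the class frequencies would give by chance
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

def auc(y_true, scores):
    """AUC as the share of (landslide, non-landslide) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation samples
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Accuracy, precision, recall, F1 and kappa depend on the 0.5 threshold chosen here, whereas the AUC summarizes the ranking quality over all thresholds.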

Feature selection of landslide-influencing factors
Feature ranking is used to determine which features have greater influence on the probability of landslide occurrence. In this article, the feature correlation was measured based on statistical tests. Four filter-based feature ranking methods were used to comprehensively evaluate the feature factors suitable for landslide susceptibility assessment in the study area. In this process, feature importance is estimated by iterative prediction: in each iteration, new features are added to the previous dataset according to their importance. Landslide susceptibility is affected by various factors. Table 3 shows the feature weights and rankings obtained using the four methods. The weights in Table 3 represent the contribution of each factor to landslide susceptibility. A large weight indicates a factor that is conducive to the occurrence of landslides, and a weight of zero indicates a factor with no influence. The weights predicted by the four methods show that aspect, rainfall, temperature, earthquake and deformation have the greatest influence on landslides, whereas curvature, slope, slope length, SPI, TPI and TWI contribute little to landslide occurrence.
We further performed significance tests using the Wilcoxon signed-rank test to select the best subset of features. In the statistical hypothesis test, the Wilcoxon signed-rank test was used to describe the correlation between pairs of feature factors: the larger the test statistic, the greater the correlation between the two features. We generally assume that the more strongly two feature factors are correlated, the more similar the roles they play in machine learning. Therefore, highly correlated factors need to be removed from the feature subset to avoid data redundancy and reduced processing efficiency. Figure 5 shows the correlation test results for all subsets. The results show that elevation is highly correlated with the other topographic feature factors (profile curvature, slope, slope length, SPI and TPI). Because the topographic feature factors in this article are calculated from the elevation data, they are highly correlated with the terrain. The study area is relatively small, and landslides are distributed along both sides of the Jinsha River, so the change of elevation has little relationship with the distribution of landslides. Therefore, elevation is deleted from the feature subset in this article. In addition, the fault factor is highly correlated with lithology and plane curvature. The study area is located in the Jinshajiang suture zone, and in this special tectonic environment fault locations often coincide with changes in lithology and with areas of large relief. Therefore, only the lithology and plane curvature factors were retained, and faults were deleted from the feature subset. Finally, according to the feature weight and correlation analysis, ten environmental factors (slope, aspect, earthquake, river, rainfall, temperature, NDVI, road, LULC and lithology) together with deformation were retained as the best susceptibility evaluation dataset.
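The redundancy-removal step described above can be illustrated with a minimal sketch. It is not the authors' procedure: it swaps in the Pearson correlation coefficient as the association measure (the article uses a Wilcoxon signed-rank statistic), and the function names, the 0.8 threshold and the toy values are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(vx * vy)


def drop_redundant(columns, weights, threshold=0.8):
    """Visit features in descending weight order and keep a feature only if
    its |r| with every already kept feature is at or below the threshold."""
    kept = []
    for name in sorted(columns, key=lambda k: weights[k], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy values, not data from the study.
columns = {
    "rainfall": [1.0, 2.0, 3.0, 4.0, 5.0],
    "elevation": [2.1, 4.0, 6.2, 7.9, 10.0],  # nearly proportional to rainfall here
    "ndvi": [0.9, 0.1, 0.5, 0.3, 0.7],
}
weights = {"rainfall": 0.30, "elevation": 0.05, "ndvi": 0.20}

selected = drop_redundant(columns, weights)
print(selected)
```

In this toy case the low-weight column that nearly duplicates a higher-weight one is dropped. In the article itself, the final choice between correlated factors also relies on geological judgment, which a purely numerical filter of this kind does not capture.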

Evaluation of landslide susceptibility
In this study, four evaluation models (logistic regression, random forest, SVM, and XGBoost) were used to perform LSM. Landslide susceptibility was calculated twice: once considering deformation characteristics and once considering only environmental characteristics. Figure 6 shows the evaluation results of landslide susceptibility in the study area. The natural breaks method was used to divide the susceptibility zones into five categories: very high, high, medium, low and very low. The results showed that the areas with high and very high susceptibility were consistent with the landslide distribution area.
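As an illustration of the five-class partition, the sketch below computes natural breaks for a small list of susceptibility values. It assumes that the natural breakpoint method corresponds to the Fisher-Jenks criterion (class limits that minimise the within-class sum of squared deviations). The function names and the toy probabilities are hypothetical, and the simple dynamic programme is only practical for small samples.

```python
from bisect import bisect_left

LABELS = ["very low", "low", "medium", "high", "very high"]


def natural_breaks(values, k):
    """Return the upper bound of each of k classes, chosen so that the total
    within-class sum of squared deviations is minimal."""
    data = sorted(values)
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    # Prefix sums give the squared deviation of any sorted slice in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(data):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def ssd(i, j):
        """Sum of squared deviations of data[i:j] about its mean (j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best cost of splitting data[:j] into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + ssd(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    cut[c][j] = i
    # Walk back through the stored cut points to recover the class bounds.
    bounds = []
    j = n
    for c in range(k, 0, -1):
        bounds.append(data[j - 1])
        j = cut[c][j]
    return bounds[::-1]


def classify(p, bounds):
    """Map a susceptibility value to one of the five class labels."""
    return LABELS[min(bisect_left(bounds, p), len(bounds) - 1)]


# Hypothetical susceptibility probabilities, not values from the study.
probs = [0.02, 0.05, 0.21, 0.24, 0.48, 0.52, 0.71, 0.75, 0.93, 0.97]
bounds = natural_breaks(probs, 5)
print(bounds, classify(0.6, bounds))
```

For a full raster, GIS packages usually estimate the breaks from a sample of pixels, since the exact search grows quadratically with the number of values.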
The susceptibility partitions predicted by the four models are quite different. The highly prone areas of RF, SVM and XGBoost are concentrated near the landslides, while the prone areas of the LR model are more dispersed. The susceptibility distribution of RF is more consistent with the normal distribution, while SVM and XGBoost have a large concentration in the medium and very low susceptibility classes. Moreover, the results from the same model are similar overall, but they differ depending on whether the deformation factor is considered. With the introduction of the deformation characteristic factor, the high susceptibility area of the LR model is more concentrated in the large-deformation landslide area. The deformation factor receives a higher weight in the logistic regression calculation, which leads to a greater dependence of the susceptibility on the deformation information. However, this is not conducive to the comprehensive judgment of landslide characteristics and susceptibility prediction. After introducing the deformation feature factor, the susceptibility results of the RF and SVM models changed less, and the susceptibility of a few areas decreased. The XGBoost results show an increase in high susceptibility areas; these areas have large deformation values but were not judged as highly susceptible based on the previous feature factors. These results indicate that, for predicting landslide susceptibility in the study area, introducing deformation information is more favorable than considering only static feature factors.
Table 4 shows the susceptibility classification statistics of the four models before and after the introduction of deformation factors. The statistics show that the very high susceptibility areas in the LR and SVM models decreased by 22.78 and 15.07%, respectively, and the medium and low susceptibility areas increased. In contrast, the RF and XGBoost results show that the proportion of very high and high susceptibility areas increased by 0.23-1.65%. Comparing the distribution of landslides with the high susceptibility area shows that introducing deformation information reduces the false positive error in the high susceptibility area on the one hand, and on the other hand correctly includes rapidly deforming areas in the high susceptibility area, raising the risk level of areas with large deformation. Table 5 shows the statistical pixel values of landslide susceptibility in the landslide area before and after the introduction of deformation. The prediction accuracy of the random forest and XGBoost models is clearly improved after InSAR deformation is introduced: 96.9 and 93.19% of the landslide area was reasonably classified into the high and very high risk classes. Compared with the result of the traditional model, the proportion of high risk and very high risk pixels in the landslide area increased by 2.97 and 1.13%, respectively. However, after InSAR deformation was introduced into the LR model, about 63% of the pixels were reclassified into high risk areas, and very high risk pixels were greatly reduced. This indicates that the InSAR deformation distribution is quite different from the classification results of the original LR model and is not suitable for use in the LR model. After InSAR deformation was introduced into the SVM model, the high and very high risk areas were slightly reduced, and InSAR deformation did not improve the SVM model much.
To further verify the model's reliability, two typical landslides were selected from the random forest susceptibility results for verification. Figure 7 shows the random forest susceptibility assessment and the deformation information of the L1 and L2 landslides (marked in Figure 6b). Both landslides are located on the right bank of the Jinsha River. The time-series InSAR results (Figure 7d, h) show that the cumulative deformation of these two landslides over the past 4 years reaches 50-60 cm. They are in the stage of rapid creep, with greater risk and susceptibility. However, without considering the deformation information, these two landslide areas are classified only as high susceptibility areas (Figure 7a, e), so their susceptibility is underestimated. In the optimized random forest susceptibility results (Figure 7b, f), the landslide areas are correctly classified as very high susceptibility. These results show the importance and reliability of deformation factors in landslide susceptibility assessment.

Model performance validation and comparison
The ROC curve and evaluation parameters such as accuracy, F1 and recall were used to evaluate the model results. The ROC curves of the four susceptibility models are shown in Figure 8, and the confusion matrix is shown in Table 6. The results show that all four models have good predictive performance. After introducing the deformation characteristic factor, the prediction performance of the random forest, support vector machine and XGBoost models was greatly improved, and their AUC values increased by 0.832, 0.32 and 1.535, respectively, whereas the AUC value of the logistic regression model decreased. In addition, accuracy indexes derived from the confusion matrix, such as accuracy, Kappa, precision, recall and F1, describe the accuracy of the models, and the results are shown in Table 6. The accuracy, Kappa, precision, recall and F1 of the random forest, SVM and XGBoost models improved after the introduction of the deformation factor, and the random forest and XGBoost models have the best performance. The accuracy and accuracy of random forest and XGBoost were 0.878, 0.746 and 0.812, 0.94, respectively. In contrast, the predictive performance of the logistic regression model is poor, and the introduction of the deformation factor reduces it further. Therefore, based on the ROC curve and accuracy evaluation, the random forest and XGBoost models have better prediction accuracy. The introduction of the deformation factor increases the predictive performance of these models, which can then predict landslide susceptibility better and more reasonably.
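For reference, the indexes named above can be computed from binary labels and model scores as in the following minimal sketch. The function names, the 0.5 decision threshold and the toy labels and scores are assumptions made for illustration and are not taken from Table 6. AUC is computed here as the probability that a randomly chosen landslide sample scores higher than a randomly chosen non-landslide sample, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 = landslide."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and Cohen's kappa."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and scores, not results from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

report = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(report, auc)
```

The threshold-based indexes depend on the chosen cut-off, whereas AUC summarises the ranking quality over all thresholds, which is one reason the two kinds of measure are reported together.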
Based on the susceptibility evaluation and the model performance comparison, we compared our results with previous studies in the literature. In terms of landslide susceptibility prediction in the study area, we effectively identified the Xiongba landslide (Figure 7b), the Sala landslide (Figure 7f) and other high-risk areas, which is consistent with previous studies on landslides and susceptibility in this area (Liu et al. 2021; Wang et al. 2022). In addition, Wang et al. (2022) adopted two deep learning (DL) algorithms, a convolutional neural network (CNN) and a deep neural network (DNN), to predict landslide susceptibility in the Jinsha River region. They found that rainfall, NDVI and topography are closely related to landslide distribution in this region, which is similar to the conclusion of this article. In terms of machine learning model performance, Zheng (2021) demonstrated the value of InSAR deformation data for LSM in the Jinsha River Basin. By comparing different machine learning methods, that study concluded that the RF and SVM methods have better recognition accuracy. Merghadi et al. (2020) provide an extensive analysis and comparison of different ML techniques based on case studies. Their results show that, compared with other machine learning algorithms, tree-based ensemble algorithms achieve excellent results, and the random forest algorithm provides more robust performance for accurate landslide susceptibility mapping. Sahin (2020a) uses tree-based ensemble learning algorithms to predict the potential distribution of landslide susceptibility. The results show that gradient boosting algorithms such as XGBoost, as well as RF, perform strongly in landslide susceptibility prediction. In conclusion, the landslide susceptibility predictions for the Jinsha River area in this article are similar to those of previous studies. In terms of model performance, random forest and XGBoost are more reliable and perform better in predicting susceptibility in the study area.

Discussion
Landslide susceptibility mapping is very important for landslide prediction and disaster reduction policies. How to use high-quality sample datasets and adopt reasonable evaluation models is the focus of current landslide susceptibility research (Huang et al. 2021; Zhao et al. 2023). However, without dynamic characteristic factors, it is impossible to characterize the state of landslide movement, so traditional machine learning models struggle to obtain reliable susceptibility results (Huang et al. 2022). In this article, SBAS-InSAR technology is used to obtain the surface deformation field of the study area, which is introduced into the susceptibility evaluation dataset as the dynamic characteristic factor. At the same time, the model can be further optimized by adjusting the hyperparameters in the testing process, so that the spatial distribution of landslide susceptibility can be reasonably predicted.

Limitations and improvements
Based on four machine learning methods, this article introduces deformation information as a dynamic characteristic factor to evaluate landslide susceptibility. The research focuses on the improvement of model performance and the comparison between models after the introduction of the dynamic deformation factor. However, this study still has the following limitations and room for improvement. Although machine learning models such as random forest, SVM, logistic regression and XGBoost have been widely used in studies on landslide susceptibility, these models have inevitable limitations: (1) Overfitting. Overfitting occurs when the model fits the training data too closely, which leads to poor generalization ability. (2) Explainability. When machine learning models make predictions, we cannot clearly explain or identify the logic behind them. (3) Sensitivity to outliers. SVM and logistic regression can be sensitive to outliers in the training data, leading to poor performance. (4) Data requirements. Machine learning models require high-quality data for effective training. If the data are not representative or contain errors, the model may perform poorly. We adjusted the hyperparameters of the machine learning models, optimized the data set, and carried out interpretability analysis to mitigate these limitations as much as possible. In addition, the deformation characteristic adopted in this article is the average deformation rate over a long time period, and deformation on a three-dimensional time scale is difficult to integrate into the machine learning model. Therefore, the model only reflects the general trend in recent years. Future research may consider incorporating the acceleration of deformation into the model. When improving the susceptibility model, we suggest considering not only the optimization of the model structure but also the model data, which are equally important for the accuracy and reliability of the results.

Landslide-related environmental factors
The comparison of accuracy and reliability shows that the random forest and XGBoost models have higher performance. Based on the landslide susceptibility results of these two models, the landslide-related environmental factors in the study area are revealed. Figure 9 shows the importance ranking of landslide-related environmental factors predicted by the XGBoost and random forest models. We find that rainfall, aspect, temperature, earthquake and NDVI are the main influencing factors of landslides. Roads and LULC among the human activity factors, and rivers and lithology among the natural factors, play a secondary role in landslide occurrence. Other characteristic factors have little effect on landslides. Deformation is the external manifestation of the dynamic change of landslides; it can be used to analyze landslide susceptibility, but it is not a cause of the disaster. SHAP feature importance can be considered a substitute for traditional feature importance, with higher reliability in comparison (Al-Najjar et al. 2022). SHAP gives a numerical value for the influence of each feature in the model, which can explain the importance of landslide features. The RF model has the best predictive performance in this article, so we chose it for the SHAP analysis. Figure 10 shows the SHAP values from the RF model. In the figure, the feature variables are arranged from high to low importance. The horizontal position of the points reveals the strength of the relationship between the feature and landslide occurrence, as well as whether it is positive or negative. The color of the points indicates whether the feature value is high or low. According to the SHAP results, aspect, temperature and rainfall have the greatest influence on landslides, followed by earthquake, NDVI, river, etc., which is consistent with the results of the feature importance analysis in this article.
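To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values for a small, hypothetical three-feature model by enumerating all feature subsets. It is an illustration under stated assumptions, not the procedure used for Figure 10: the toy model, the feature names and the single all-zero baseline are invented, and practical SHAP implementations for tree models use faster algorithms and average over background data instead of one reference point.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).
    Features outside a coalition are set to their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                phi[i] += w * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    """Hypothetical susceptibility score with one interaction term."""
    rainfall, ndvi, deformation = z
    return 0.5 * rainfall - 0.3 * ndvi + 0.2 * rainfall * deformation


x = [1.0, 0.5, 2.0]         # hypothetical sample: rainfall, NDVI, deformation
baseline = [0.0, 0.0, 0.0]  # assumed reference point
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, and the interaction term is shared equally between the two features involved. A summary plot such as Figure 10 collects per-sample values of this kind for every feature.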

Conclusion
In this article, a landslide susceptibility evaluation method based on machine learning models that considers landslide deformation factors is proposed. Starting from the traditional landslide characteristic data set, surface deformation is introduced as the dynamic characteristic factor of landslides. The advantage of this method in LSM is that it not only extracts the spatial characteristics of landslides but also captures their dynamic changes through the deformation factor. First, based on SBAS-InSAR technology, we obtained a complete four-year surface deformation velocity field of the study area using ascending and descending orbit images. Then, four machine learning methods (logistic regression, random forest, support vector machine and XGBoost) were used to predict landslide susceptibility in the upper reaches of the Jinsha River. Finally, the performance of the machine learning models and the improvement brought by the InSAR deformation were evaluated by comparing the prediction results and the model performance indexes.
The results show that the random forest, XGBoost and SVM models have good performance. The predictive performance of random forest and XGBoost is greatly improved after the introduction of the deformation characteristic factor: 96.9 and 93.19% of the landslide area is reasonably classified into the high and very high risk classes. Compared with the result of the traditional model, the proportion of high risk and very high risk pixels in landslide areas increased by 2.97 and 1.13%, respectively. The susceptibility evaluation of the whole study area showed that the predicted proportions of high risk and very high risk areas of the random forest and XGBoost models increased from 15.45 to 16.23% and from 18.73 to 21.89%, respectively. The accuracy of the prediction models increased from 0.793 to 0.878 and from 0.776 to 0.812, respectively, and the AUC increased by 0.9 and 1.7%. The SHAP and feature importance analyses of random forest and XGBoost reveal the influencing factors of landslides. The results show that rainfall, slope aspect, temperature, earthquake and NDVI are important factors affecting landslides, while road, land use, river and lithology are secondary factors. This study can provide a scientific reference for LSM in the Jinsha River Basin.

Figure 1. Location of the study area and distribution of landslides along the Jinsha River.

Figure 2. Examples of landslides identified by geomorphic features and InSAR deformation results.
records the detailed parameters of SAR images. First, an adaptive filtering function is used to smooth the generated interferograms, and the minimum cost flow (MCF) method then generates the unwrapped deformation phase. Finally, the singular value decomposition (SVD) method is applied to calculate the time-series deformation. The landslides in the study area are mainly distributed on the right bank of the Jinsha River, where effective deformation is better captured by the ascending images. Therefore, to obtain the complete surface deformation of the study area, we use the ascending images as the main data, and the descending images are used to fill the shadow and layover areas. Previous studies have introduced InSAR deformation characteristics into the prediction of landslide susceptibility (Lissak et al. 2020; Merghadi et al. 2020; Hu et al. 2021; Huang et al. 2022). Different from previous work, the InSAR work in this article has the following two innovations: (1) This article establishes a long time-series InSAR deformation field covering the study area, with a time span of 4 years. The average deformation rate over four years can effectively suppress atmospheric and noise errors in the Jinsha River basin and provide reliable deformation data for landslide susceptibility prediction. Moreover, the deformation results in this article have been verified by a cross-comparison method (Yao et al. 2022b) and are reliable. (2) In alpine and canyon areas, due to the limitation of SAR observation direction, the application of InSAR technology has serious

Figure 3. Research methodological flow of the study.

Figure 5. Correlation coefficient of each covariate based on Wilcoxon signed-rank test.

Figure 7. Typical regional susceptibility optimization results. (a-d) The susceptibility results and deformation information of L1 landslide (marked in Figure 6b), (a) random forest susceptibility results without considering deformation, (b) random forest susceptibility results considering deformation. (e-h) The susceptibility results and deformation information of L2 landslide (marked in Figure 6b), (e) random forest susceptibility results without considering deformation, (f) random forest susceptibility results considering deformation.

Figure 8. ROC curves of four susceptibility models. (a) ROC test without considering deformation characteristic factors, (b) ROC test considering deformation characteristic factors.

Figure 9. Significance of landslide-related environmental factors of the study area predicted by XGBoost and random forest models.

Figure 10. Summary plot of SHAP values derived from RF model.

Table 1. Landslide-related environmental factor categories in the study area.

Table 2. Detailed parameters of SAR images.
al. 2014; Chen et al. 2017; Song et al. 2018). Therefore, we used four machine learning models (logistic regression, random forest, SVM and XGBoost) for landslide susceptibility mapping in the study area. Machine learning algorithms require training data sets and test data sets. Training datasets are generally used to build prediction models, while test datasets are used to analyze predictive capability or to evaluate model performance based on specific accuracy measures (Sahin et al. 2020b). Therefore, we stratified and randomly divided the landslide and non-landslide datasets into two parts: 70% training data and 30% validation data. In total, 75 landslide and 40 non-landslide samples constituted the training data, and the remaining 25 landslide and 24 non-landslide samples served as the validation data. The specific model methods are introduced as follows.
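The stratified 70/30 division described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the fixed random seed and the integer sample identifiers are assumptions. Only the class totals (100 landslide and 64 non-landslide samples) are taken from the counts given in the text, and the per-class counts produced by this sketch need not match those reported there.

```python
import random


def stratified_split(samples, labels, train_fraction=0.7, seed=0):
    """Shuffle each class separately and send the same fraction of every
    class to the training set, so class proportions are preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(train_fraction * len(idxs))
        train_idx.extend(idxs[:n_train])
        test_idx.extend(idxs[n_train:])

    train = [(samples[i], labels[i]) for i in sorted(train_idx)]
    test = [(samples[i], labels[i]) for i in sorted(test_idx)]
    return train, test


# Hypothetical sample identifiers; 1 = landslide, 0 = non-landslide.
labels = [1] * 100 + [0] * 64
samples = list(range(len(labels)))

train, test = stratified_split(samples, labels)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same validation samples.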

Table 3. Detailed feature importances calculated by four ranking algorithms.

Table 4. Statistical pixel values of landslide susceptibility before and after the introduction of deformation.

Table 5. Statistical pixel values of landslide susceptibility in landslide area before and after deformation introduction.

Table 6. Statistical confusion matrix of the four susceptibility models.
(Bui et al. 2016) identified the long-term deformation process of landslides in the upper reaches of the Jinsha River by combining SBAS-InSAR technology and remote sensing images. It is found that InSAR technology can provide important data support for early identification and susceptibility assessment of landslides, making the prediction model more reliable. The selection of the landslide susceptibility prediction model has a great influence on the evaluation results. For different study areas and data sets, it is necessary to find appropriate evaluation models (Bui et al. 2016). In this article, four susceptibility models (logistic regression, random forest, support vector machine and XGBoost) are used for LSM of the study area. The ROC curve, F1, recall, precision, accuracy and other indicators were used to validate the accuracy of the model predictions. The results show that the AUC of the random forest, XGBoost and SVM models is greatly improved after the introduction of landslide dynamic characteristic factors, and the prediction accuracy is 90.767, 90.24 and 80.136, respectively. The confusion matrix also shows that the precision of the random forest and XGBoost models is improved. The random forest and XGBoost models have good prediction ability and can produce high-quality LSM. Based on the case study in this article, we analyze the reasons why the random forest and XGBoost models are superior to SVM and logistic regression. (1) We adopted many landslide-related factors, and the relationships between these variables are complex. Random forest and XGBoost are based on decision trees, which capture complex interactions between variables and nonlinear relationships better. Logistic regression is a generalized linear model. When the number of features is large, the logistic regression model tends to underfit, which reduces its accuracy after the introduction of the deformation factor. Although a kernel function mechanism is introduced in SVM to solve nonlinear problems, its relative performance is poor when the data dimension is large. (2) Random forest and XGBoost are ensemble methods, which combine multiple weak learners to create a strong learner. This allows them to reduce overfitting and improve generalization performance. Compared with SVM and logistic regression, they are more robust to outliers, which have a great impact on linear models, and their stronger anti-interference, anti-overfitting and generalization capabilities yield more reliable prediction results in this study area. (3) Random forest and XGBoost provide feature importance scores, which can help filter data sets suitable for LSM in the study area.