Landslide susceptibility mapping with feature fusion transformer and machine learning classifiers incorporating displacement velocity along Karakoram highway

Abstract The Karakoram Highway (KKH) is a pivotal gateway within the framework of the China-Pakistan Economic Corridor. Nevertheless, its distinct and intricate geographical characteristics make it susceptible to recurrent landslides. As an essential tool, landslide susceptibility mapping (LSM) is significant in managing and alleviating landslides near the KKH. Moreover, landslide conditioning factors (LCFs) are crucial determinants influencing the outcomes of LSM. However, existing methods primarily rely on machine learning algorithms, which do not adequately account for the intricate spatial characteristics and patterns between LCFs and landslide occurrences. In response, this study introduces the feature fusion transformer (FFTR) module, built on the transformer framework, to fuse the spatial information features of all LCFs. The abstract high-level spatial features obtained are then fed into diverse machine learning classifiers, including random forest (RF), extreme gradient boosting (XGBoost), gradient boosting decision trees (GBDT), categorical boosting (CatBoost), and extremely randomized trees (ET), to generate landslide susceptibility maps. Displacement velocity calculated by Persistent Scatterer Interferometric Synthetic Aperture Radar (PS-InSAR) is incorporated in the LSM. The results demonstrate that FFTR-RF achieves the best performance in the area under the curve (AUC) (94%), accuracy (87.31%), precision (87.21%), recall (88.02%), and F1-score (87.61%). Incorporating displacement velocity into the model-predicted LSM results enhances the comprehensiveness of LSM. These methods will furnish early warning systems for landslide disasters along the KKH, thus aiding recommendations for mitigating landslides' social and economic losses.


Introduction
The China-Pakistan Economic Corridor (CPEC) is a paradigmatic project and flagship initiative of China's ambitious 'Belt and Road' initiative (Abid and Ashfaq 2015; Javaid 2016; Khan and Liu 2019). As a critical strategic route, the KKH is often called the 'Friendship Road' between China and Pakistan (Nilofar et al. 2014). It has catalyzed economic and trade cooperation between China and Pakistan (Kulsoom et al. 2023). The KKH faces significant challenges due to its high-altitude mountainous terrain, intricate geographical conditions, and intense precipitation, making it highly susceptible to landslides (Hussain et al. 2023). These landslides often destroy bridges, block roads, and cause casualties and significant socio-economic losses (Zimmerman 2015; Ali et al. 2019). Furthermore, landslides exhibit complex causative characteristics, necessitating careful consideration of various conditioning factors when creating susceptibility maps for landslide-prone areas (Wang et al. 2019).
LSM aims to calculate the likelihood of occurrence of landslides within a specific area based on the landslide conditioning factors (LCFs) (Brabb 1985; Guzzetti et al. 1999). LSM techniques can be classified into statistical, empirical, physics-based, and machine-learning approaches (Guzzetti et al. 2012; Dou et al. 2019; Whiteley et al. 2019; Hussain et al. 2022). However, the development of statistical, empirical, and physics-based methods has encountered limitations (Huang et al. 2018; Roy et al. 2019; Di Napoli et al. 2020; Huang et al. 2020a; Merghadi et al. 2020; Loche et al. 2022). In recent years, machine learning methods have shown higher accuracy and efficiency in LSM, as they can capture non-linear relationships between LCFs and landslide data (Melchiorre et al. 2006; Merghadi et al. 2020; Qing et al. 2020). Various machine learning algorithms have been employed in this context, including but not limited to logistic regression, support vector machine, RF, XGBoost, naive Bayes, and CatBoost (Huang and Zhao 2018; Hussain et al. 2021; Khalil et al. 2022; Li et al. 2022; Ye et al. 2022; Zhang et al. 2022; Gupta and Shukla 2023; Huang et al. 2023a). However, there is still room for development in LSM based on the above research. The results of LSM significantly rely on the distribution characteristics of LCFs. Therefore, considering contextual features in geographic spatial information is crucial for LSM. Nevertheless, traditional machine learning methods struggle to adequately account for the complex contextual spatial information and patterns among landslide data and various LCFs (Chen et al. 2021). Therefore, it is imperative to extend the research on LSM into deep learning to enrich the comprehension of the spatial relationship between landslides and LCFs and uncover intricate patterns of landslide occurrences in space.
As deep learning rapidly advances, researchers have focused on incorporating deep learning methods into LSM to understand and represent regions prone to landslides (Bui et al. 2020; Huang et al. 2020b). Several deep learning techniques, including the convolutional neural network (CNN), residual network (ResNet), recurrent neural network (RNN), and long short-term memory (LSTM), have been successfully employed in LSM (Wang et al. 2019; Fang et al. 2020; Wang et al. 2020; Chen et al. 2021; Ngo et al. 2021; Lv et al. 2022; Huang et al. 2023b; Jiang et al. 2023; Zhang et al. 2023). In recent years, the development of the transformer architecture has shown remarkable success in various downstream tasks (Zheng et al. 2021; Amatriain 2023). The effectiveness of the transformer stems from its self-attention mechanism (Yang et al. 2021). It is precisely due to its remarkable ability to unearth spatial relationships that the transformer can learn the patterns and features of landslide occurrences in space, enabling a comprehensive analysis of large-scale landslide susceptibility. The multi-head attention mechanism further enriches feature representations and explores intricate relationships between different features (Vaswani et al. 2017). Therefore, the transformer holds significant advantages in capturing long-range contextual spatial relationships between landslide data and various LCFs. Utilizing the transformer to fuse multiple complex LCF features demonstrates the capacity to improve LSM. Nonetheless, it is worth noting that dynamic surface displacement is a crucial dataset capable of capturing landslide susceptibility but is often disregarded in the mapping process based on deep learning models.
Hence, incorporating displacement velocity into the LSM is necessary. Surface displacement velocity effectively reflects the dynamic changes in slopes. Therefore, recent studies have detected and updated landslide inventories based on surface displacement velocity (Zhao et al. 2018; Rehman et al. 2020; Su et al. 2021; Su et al. 2022; Hussain et al. 2023). InSAR is a new earth observation technique, notable for its extensive monitoring time and the capacity to calculate precise and reliable data regarding surface deformation (Mondini et al. 2021). Among InSAR techniques, PS-InSAR is an advanced technique well-suited for the millimeter-level monitoring of large-scale surface deformation (Beladam et al. 2019). The timely acquisition of slope displacement velocity is attainable through the PS-InSAR technique, and incorporating it into LSM is a viable option (Hussain et al. 2021). Subsequently, upon the computation of displacement velocity, its incorporation into LSM exhibits the potential for augmenting the overall reliability and comprehensiveness of the mapping.
Building upon the preceding context, this study focuses on the buffer zone along the KKH as the study area and conducts LSM by considering fifteen LCFs that span topography, geology, hydrology, and other categories. The LCFs are categorized into continuous and discrete types and then fed into the FFTR module for feature fusion, resulting in advanced spatial features. Subsequently, the fused features are input into various classifiers, including RF, XGBoost, GBDT, CatBoost, and ET, to compute landslide classification results within the region. These results are further utilized to produce landslide susceptibility maps. Finally, the displacement velocity calculated through PS-InSAR is incorporated into the LSM to create the ultimate landslide susceptibility map. By generating the susceptibility maps, this study aims to contribute to the prevention and mitigation of life and property losses along the KKH and promote the development of the regions adjacent to the highway.

Methods
In summary, based on the source data shown in Figure 1, a landslide inventory and LCFs were created. Firstly, all LCFs were input into the landslide susceptibility prediction models to obtain susceptibility maps. Secondly, displacement velocity was calculated using the PS-InSAR technique based on the Sentinel-1 datasets. Finally, the displacement velocity was integrated into the susceptibility maps predicted by the models, resulting in the generation of the final landslide susceptibility maps (Figure 1).

Study area
This study focuses on the KKH within the CPEC, which spans approximately 353 km in a northwest-southeast direction, with a 10 km buffer zone along the highway. The China-Pakistan Economic Corridor stretches from Kashgar in the north to Gwadar Port in the south, bordered by Tajikistan and Afghanistan to the west and China and India to the east (Figure 2). The region possesses a strategically significant geographical location.
The corridor traverses multiple high-altitude mountain ranges, including the Tian Shan, Pamir, Karakoram, and Himalayan ranges (Searle et al. 1999). It is adjacent to the Taklamakan Desert, the Tarim Basin, and the Qinghai-Tibet Plateau, encompassing parts of the Indian plains (Figure 2). The KKH, with its lowest point at 460 m and highest point at 4,733 m, crosses numerous rivers and glaciers and experiences an annual rainfall ranging from 150 mm to 2,000 mm (Rashid et al. 2020). The region's extremely high temperatures and heavy rainfall contribute to frequent landslides along the KKH (Yang et al. 2020). Due to its unique and treacherous geographical setting, the KKH is susceptible to frequent landslides.

Landslide inventory and conditioning factors
The landslide inventory is vital in documenting the occurrence and characteristics of landslides, including their locations and extents. It serves as a crucial basis for conducting LSM, enabling us to enhance the accuracy and reliability of methodologies (Guzzetti et al. 2012). A comprehensive landslide inventory within the study area is developed, encompassing 547 individual landslide areas with a cumulative area of 245 km² (Figure 3). These areas include various landslides, such as debris flows, scree slopes, and rockfalls. Initially, relevant landslide disaster data from publications by the Geological Survey of Pakistan (GSP) represent authentic instances of landslide disasters within the study region, providing valuable data for analysis (Fayaz et al. 1985; Khan et al. 1986; Khan et al. 2003). Furthermore, a publication related to CPEC is also referenced (Yixuetao and Shao 2021). Moreover, Sentinel-2 satellite imagery (2022) combined with Google Earth imagery facilitated detailed observations and validation of the collected data, ensuring the landslide inventory's reliability and robustness. Random sampling was conducted within the landslide areas to consider the spatial information of these landslides (Figure 3). In total, there are 2,111 landslide locations and 2,145 non-landslide locations. These locations were randomly split into training and testing data with a ratio of 7:3 (Table 1).
Due to the complex and multifaceted nature of landslide occurrences, which are influenced by various interacting factors, selecting conditioning factors is of utmost importance (Huang and Zhao 2018; Gaidzik and Ramírez-Herrera 2021). Therefore, considering the characteristics of the study area, fifteen LCFs were selected: geology, which includes lithology (Figure 7) and fault lines; topography, which includes elevation, curvature, plan curvature, profile curvature, aspect, slope, roughness, and topographic wetness index (TWI); hydrology, which encompasses distance to river and precipitation; as well as landcover, normalized difference vegetation index (NDVI), and distance to road (Table 2, Figures 4-6). Among these, the influences of the river and road factors on landslide hazards are analyzed using Euclidean distance analysis, providing insights into the spatial relationships between these factors and landslide occurrences (El Jazouli et al. 2019). The calculation of NDVI utilized Sentinel-2 satellite imagery from May, July, and September 2022, coinciding with the surface deformation monitoring period. The calculation formula is NDVI = (NIR − Red) / (NIR + Red), where NIR and Red represent the reflectance values in the near-infrared and red bands. One of the significant intrinsic factors contributing to landslide occurrences in the study area is the unique internal geological setting, particularly the lithology (Guzzetti et al. 1996; Henriques et al. 2015). The predominant lithology within the study area consists of sedimentary rocks, including siliciclastic sedimentary rocks, unconsolidated sediments, mixed sedimentary rocks, and carbonate sedimentary rocks, collectively accounting for approximately 62% of the total area. Mixed sedimentary rocks are the most prevalent, covering approximately 38% of the study area. Additionally, the remaining lithology is predominantly igneous rocks, with acid plutonic rocks being the most common, occupying approximately 28% of the study area (Figure 7).
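For illustration, the NDVI computation from the near-infrared and red bands described above can be sketched in plain NumPy; the reflectance values in the example are hypothetical:

```python
import numpy as np

def ndvi(nir, red, eps=1e-10):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    A small eps guards against division by zero over dark pixels."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Hypothetical reflectance values for two pixels (vegetated vs. bare ground)
print(ndvi([0.45, 0.30], [0.05, 0.25]))
```

Applied per pixel over the Sentinel-2 rasters, this yields the NDVI conditioning-factor layer.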

Feature fusion transformer
This study introduces FFTR to integrate the spatial features of all the conditioning factors (Figure 9). The conditioning factors are divided into two types for feature fusion: discrete features and continuous features. The discrete features include landcover, lithology, and aspect (Figures 6 and 7). Before embedding coding, landcover and aspect (each with ten classes) are encoded with indices [1, 2, 3, …, 10], while lithology (with eight classes) is encoded using indices [1, 2, 3, …, 8]. For the fusion of discrete features, the factors are initially input into their respective embedding layers (Figure 9). Taking the processing of the landcover feature in the embedding layer as an example, the feature dimension output by the embedding layer is 16; thus, the shape of the output from the embedding layer is (10, 16). These three discrete features are then input into the stack layer, resulting in a new shape of (10, 3, 16). Subsequently, the features are further processed in the transformer block for subsequent feature fusion. The 'flatten' layer transforms the multidimensional representation of discrete features into a one-dimensional form, effectively flattening the input; by doing so, the shape (10, 3, 16) is reshaped to (10, 48). At this point, the fusion of discrete features is concluded (Figure 9). Some details still require clarification regarding the handling of discrete features. The transformer block is founded on the attention mechanism (Figure 8). The essence of the attention mechanism is an addressing process: given a query vector (Q) for the factor features, it computes an attention distribution over the key vectors (K) and applies it to the value vectors (V) to calculate the attention value. The transformer block in this study applies a multi-head self-attention mechanism based on Scaled Dot-Product Attention (Vaswani et al. 2017). The computation principle of the attention mechanism is as follows:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

where √d_k represents the scale parameter. Through Multi-Head Attention, the model is fragmented into multiple heads, ultimately establishing distinct subspaces. Multi-Head Attention is accomplished by applying the Scaled Dot-Product Attention process H times (in this study, H is set to 4) and then combining the outputs (Vaswani et al. 2017). The input is first processed through linear transformations, Qᵢ = QWᵢ^Q, Kᵢ = KWᵢ^K, Vᵢ = VWᵢ^V; after the linear transformations, the matrix multiplication between Q and K and the scaling by √d_k are performed. The multi-head attention mechanism follows the procedure:

MultiHead(Q, K, V) = Concat(head₁, …, head_H) W^O, where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

The data is then normalized, transforming its mean to 0 and standard deviation to 1, by applying layer normalization (Ba et al. 2016). After normalization, the features are fed into the feedforward layer, which is formed by two linear transformations and a non-linear activation function, ReLU:

FFN(x) = max(0, xW₁ + b₁) W₂ + b₂

After another skip connection and layer normalization, the operation of a single transformer block is complete. In this study, a total of three blocks were utilized.
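As an illustrative sketch (not the study's implementation), the scaled dot-product and multi-head attention computations described above can be written in plain NumPy; the matrix shapes and H = 4 follow the text, while the random weights are placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, weights, H=4):
    """Project X to Q/K/V, split into H heads, attend, concatenate, project."""
    W_q, W_k, W_v, W_o = weights
    n, d = X.shape
    d_h = d // H
    Q = (X @ W_q).reshape(n, H, d_h).transpose(1, 0, 2)  # (H, n, d_h)
    K = (X @ W_k).reshape(n, H, d_h).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, H, d_h).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(Q, K, V)        # (H, n, d_h)
    concat = heads.transpose(1, 0, 2).reshape(n, d)      # re-join the heads
    return concat @ W_o

rng = np.random.default_rng(0)
d = 16                                    # embedding dimension from the text
X = rng.standard_normal((3, d))           # 3 stacked discrete-feature tokens
weights = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, weights, H=4)
print(out.shape)   # (3, 16)
```

Each head attends in its own 4-dimensional subspace (16 / 4), which is what lets the block model distinct relationships among the stacked LCF embeddings.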
For the fusion of continuous features, the continuous features are expanded in dimension and then input into a concatenate layer to combine the dimensions. The resulting concatenated features are further passed through layer normalization for regularization. This operation ensures consistent scaling and reduces the impact of variations in the continuous feature distributions. The output shape after layer normalization is (1, 12). Thus, the fusion of continuous features is complete (Figure 9).
After completing these two fusion steps, the fused outputs for discrete and continuous features are concatenated, yielding the output shape (10, 60), and input into a concatenate layer. The concatenated features are then passed to a Multilayer Perceptron (MLP) (Taud and Mas 2018). In this study, two hidden layers are incorporated into the MLP architecture: the first consists of 120 hidden units and the second of 60 hidden units. Batch normalization is utilized as the regularization method within the MLP to address the potential 'Internal Covariate Shift' issue (Ioffe and Szegedy 2015). In the absence of batch normalization, the activation layer can be represented as z = g(Wu + b); with the inclusion of batch normalization, it becomes z = g(BN(Wu)), where g is the activation function and the bias b is subsumed by the normalization's shift parameter. The last layer of FFTR is a dense, fully connected layer that connects all neurons from the previous layer to every neuron in the current layer. The activation function applied in this layer is the sigmoid function, and the output dimension is 1, indicating that the fused spatial features are output as a single scalar value. The optimizer adopted is adaptive moment estimation (Adam) (Kingma and Ba 2014). The loss function is the binary cross-entropy loss:

L = −(1/N) Σᵢ [yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ)]

where yᵢ is the true label of sample i and ŷᵢ is the predicted probability.
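The sigmoid output layer and binary cross-entropy loss described above can be sketched as follows (a minimal NumPy illustration; the input scores are hypothetical):

```python
import numpy as np

def sigmoid(z):
    """Final dense-layer activation mapping a scalar score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """L = -(1/N) * sum_i [y_i*log(p_i) + (1 - y_i)*log(1 - p_i)]."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical fused-feature scores -> sigmoid -> susceptibility probabilities
probs = sigmoid(np.array([2.0, -1.5, 1.0]))
print(binary_cross_entropy([1, 0, 1], probs))
```

The clipping step matters in practice: a confidently wrong prediction with p exactly 0 or 1 would otherwise produce an infinite loss.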

Feature fusion transformer and machine learning classifiers
The fused spatial features obtained from the FFTR module are combined with the LCFs to generate the spatial feature of the landslide. Combining the abstract fused features derived from FFTR with the LCFs allows the susceptibility model's learning perspective to encompass global and local features. This approach maximizes the effective exploration and utilization of the distinct characteristics inherent in all LCFs. These combined features are then utilized with RF, XGBoost, ET, CatBoost, and GBDT to classify the data into landslide and non-landslide instances. The final step in the process is to output the probabilities of landslide occurrence and generate susceptibility maps (Figure 10). RF employs an ensemble learning technique combining multiple decision trees, culminating in a powerful classifier (Breiman 2001). It is known for its effectiveness in handling missing values and its ability to assess the importance of different features in the classification process. In this study, feature selection is accomplished using the Gini coefficient:

Gini(p) = Σₖ pₖ(1 − pₖ) = 1 − Σₖ pₖ²

where pₖ represents the probability that a sample belongs to the class labeled as k.
The Extra Trees (ET) algorithm is also an ensemble learning method that utilizes a set of decorrelated decision trees for classification (Geurts et al. 2006). In contrast to RF, ET does not employ the bagging technique but instead uses all training samples for each decision tree. Additionally, ET chooses split points fully at random rather than searching for locally optimal cut-points, reducing variance and preventing the undue influence of certain features. The feature splits in ET are also based on the Gini coefficient.
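The Gini impurity used by both RF and ET for evaluating feature splits can be sketched as:

```python
import numpy as np

def gini_impurity(labels):
    """Gini(p) = 1 - sum_k p_k^2 over the class proportions p_k at a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))   # maximally mixed two-class node -> 0.5
print(gini_impurity([1, 1, 1, 1]))   # pure node -> 0.0
```

A split is preferred when the weighted impurity of the resulting child nodes is lower than that of the parent, i.e. the split separates landslide from non-landslide samples more cleanly.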
GBDT is a classifier that utilizes CART regression trees as its base learners. It is an iterative algorithm that constructs a set of weak learners and combines their outputs to generate the final prediction. GBDT can automatically discover high-order relationships between features. The feature splitting in GBDT is based on the Friedman mean squared error (Friedman 2001). The loss function for classification is the logarithmic loss, defined as L = −Σₖ yₖ log(pₖ), where yₖ indicates whether the sample belongs to class k and pₖ represents the predicted probability that the sample belongs to the class labeled as k.
Under the framework of the GBDT algorithm, two notable improvements emerged: XGBoost and CatBoost. XGBoost introduces a technique that pre-sorts and stores the data, allowing for efficient reuse during subsequent computations (Chen and Guestrin 2016). Its prediction is updated additively:

ŷᵢ⁽ᵗ⁾ = ŷᵢ⁽ᵗ⁻¹⁾ + fₜ(xᵢ)

where ŷᵢ⁽ᵗ⁾ denotes the predicted result of sample i after the t-th iteration, ŷᵢ⁽ᵗ⁻¹⁾ the prediction from the previous t − 1 classifiers, and fₜ(xᵢ) the function of the t-th classifier. CatBoost focuses on handling categorical features efficiently. It employs a specific type of decision tree called 'oblivious trees' as the base learners (Dorogush et al. 2018; Prokhorenkova et al. 2018). Typically, numerical representations are derived from categorical feature values via the following equation:

value = (countInClass + prior) / (totalCount + 1)

where 'countInClass' is the number of samples with the current categorical feature value whose label is 1, 'prior' is the initial value of the numerator determined by the initial parameters, and 'totalCount' is the overall count of samples with the same categorical feature value as the current sample, including the current sample. In this study, five evaluation metrics were employed: accuracy, precision, recall, F1-score, and the AUC value (Ji et al. 2020).
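The target-statistic formula above can be sketched as follows. Note this is the simplified (greedy, whole-dataset) form of the encoding; CatBoost itself computes ordered statistics over random permutations to avoid target leakage:

```python
def catboost_target_statistic(values, labels, prior=0.5):
    """Encode each categorical value as (countInClass + prior) / (totalCount + 1)."""
    encoded = {}
    for v in set(values):
        idx = [i for i, x in enumerate(values) if x == v]
        count_in_class = sum(labels[i] for i in idx)  # samples of this value labeled 1
        total_count = len(idx)                        # all samples with this value
        encoded[v] = (count_in_class + prior) / (total_count + 1)
    return encoded

# Hypothetical categorical LCF values (e.g. lithology classes) and labels
cats = ['a', 'a', 'b', 'a', 'b']
y    = [ 1,   0,   1,   1,   0 ]
print(catboost_target_statistic(cats, y))   # {'a': 0.625, 'b': 0.5}
```

The prior smooths the estimate toward 0.5 for rare categories, so a lithology class seen only once does not get an extreme encoding.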

Displacement velocity calculation
This study employed the PS-InSAR technique to calculate the displacement velocity within the study area. PS-InSAR is based on multi-scene SAR data of the same geographical region, which can detect Persistent Scatterer points unaffected by temporal and spatial baseline decorrelation. This technique offers significant temporal coverage for displacement monitoring, the capacity to mitigate atmospheric interference, and the capability to discern millimeter-scale displacement (Beladam et al. 2019).
The SAR data comprised ascending and descending scenes acquired from 2021 to 2023, totaling 100 scenes (Table 3).

Incorporation of displacement velocity with landslide susceptibility mapping
This study classifies the PS-InSAR-calculated displacement velocity into three levels to incorporate the PS-InSAR outcomes into the landslide susceptibility predicted by the model. Following the landslide displacement velocity classification of Cruden and Varnes (1996), moderate, high, and very high landslide susceptibility levels were assigned.
More specifically, the displacement velocity distribution is divided into three intervals: (μ − σ, μ + σ) is moderate; (μ − 2σ, μ − σ] and [μ + σ, μ + 2σ) are high; the rest are very high. Here μ is the mean and σ is the standard deviation, calculated as μ = (1/n) Σᵢ vᵢ and σ = √((1/n) Σᵢ (vᵢ − μ)²), where vᵢ is the displacement velocity of the i-th persistent scatterer. Finally, landslide susceptibility predictions missed by the model, or predictions that differ from the PS-InSAR calculations, are integrated.
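A minimal sketch of the three-level velocity classification, assuming the open/closed interval endpoints stated above (the sample velocities are hypothetical):

```python
import numpy as np

def classify_velocity(v):
    """Map displacement velocities to 'moderate', 'high', or 'very high'.

    moderate: (mu - s, mu + s); high: (mu - 2s, mu - s] and [mu + s, mu + 2s);
    very high: everything else."""
    v = np.asarray(v, dtype=float)
    mu, s = v.mean(), v.std()
    out = np.full(v.shape, 'very high', dtype=object)          # default: tails
    out[(v > mu - 2 * s) & (v < mu + 2 * s)] = 'high'          # within 2 sigma
    out[(v > mu - s) & (v < mu + s)] = 'moderate'              # within 1 sigma
    return out

print(classify_velocity([0.0, 10.0, -10.0, 50.0, -50.0]))
```

Applying the masks from widest to narrowest reproduces the interval endpoints exactly: a velocity equal to μ + σ falls in the high band, and one equal to μ + 2σ falls in the very high band.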

Comparison of landslide susceptibility prediction models
The fused spatial features obtained from the FFTR module were utilized as input for the machine learning classifiers, creating five combination models. The combination of FFTR and RF exhibited the most outstanding performance among these models, achieving an AUC value of 94%. The accuracy of this model was measured at 87.31%, while precision, recall, and F1-score were calculated as 87.21%, 88.02%, and 87.61%, respectively (Table 4).
The results of this study demonstrate that the FFTR, integrating all LCFs, exhibited superior performance among all modules. The FFTR module achieved an AUC of 91.77%. It also gained an accuracy of 84.34%, precision of 84.96%, recall of 84.18%, and F1-score of 84.57%, all of which ranked first among the baseline models in performance. Following FFTR, the RF module attained an AUC of 89.18%. The remaining modules, ranked in descending order of performance, were XGBoost, ET, CatBoost, and GBDT (Table 5). Notably, combining FFTR with RF increased the AUC by 2.23% compared to using FFTR alone and by 4.82% compared to using RF alone. Remarkable enhancements in performance were observed across all evaluation metrics for the remaining modules as well. These findings underscore the significant improvements achieved by incorporating FFTR for feature fusion, further enhancing the overall performance of the models.
Based on the classification performance across all data, the FFTR-RF model demonstrates better results than other models with the highest AUC value. The larger area under the ROC curve suggests its superior ability to distinguish between positive and negative instances (Figure 11). Similarly, the larger area under the Precision-Recall (PR) curve indicates its better precision and recall trade-off (Figure 12). These results demonstrate that the FFTR-RF model exhibited the most optimal performance among the evaluated models regarding classification ability.

Displacement velocity results
The PS-InSAR technique processes SAR data encompassing 100 scenes acquired between 2021 and 2023. Two primary software packages were utilized: SARscape 5.6 and the MATLAB-based SARPROZ.
For SARscape, the initial step is to generate a connectivity map and automatically select the super-master image. After the interferometric process, the displacement velocity is obtained by completing two inversions and geocoding. Notably, a coherence threshold of 0.6 is applied.
For SARPROZ, co-registration is completed after the data input. In multi-image sparse points processing, the threshold is 0.6, and the displacement velocity is output by geocoding. The displacement velocity along the slope in the study area was calculated. The velocity values ranged from −101.46 mm/year to 92.03 mm/year within the study area, with a mean value of 1.68 mm/year and a standard deviation of 19.41 mm/year (Figure 13). Certain zones within the study area exhibited high levels of deformation along the slope; these may signify the presence of potential landslide-prone areas.

Landslide susceptibility mapping
After feeding all the LCFs into the model, the model output the probability of landslide occurrence, which was then used to classify the landslide susceptibility into five levels: very low (0 to 0.2), low (0.2 to 0.4), moderate (0.4 to 0.6), high (0.6 to 0.8), and very high (0.8 to 1) (Figure 14). The resolution is 30 m. The classification is based on the calculated probabilities, and each susceptibility level represents a different likelihood of landslide occurrence.
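The five-level binning of model probabilities on 0.2-wide intervals can be sketched as:

```python
import numpy as np

LEVELS = ['very low', 'low', 'moderate', 'high', 'very high']

def susceptibility_level(prob):
    """Bin probabilities in [0, 1] into five classes on 0.2-wide intervals."""
    idx = np.minimum((np.asarray(prob, dtype=float) / 0.2).astype(int), 4)
    return [LEVELS[i] for i in idx]

print(susceptibility_level([0.05, 0.35, 0.55, 0.75, 0.95]))
```

The clamp to index 4 keeps a probability of exactly 1.0 in the very high class rather than overflowing the bin list.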
In all the landslide susceptibility maps, the central part of the study area exhibits higher susceptibility to landslides. Using the FFTR-RF model for LSM, the outputs demonstrate that the percentage of areas classified as low (including low and very low) is 57.64%, moderate is 29.45%, and high (including high and very high) is 12.91%. This distribution indicates that a portion of the study area is susceptible to landslide hazards, emphasizing the need for effective mitigation and management strategies in these high-risk areas.
For these five different landslide susceptibility maps, comparisons were made in terms of the displacement velocity results. More specifically, for the three susceptibility classes (moderate, high, and very high), the intersection ratio (IR) between the model-predicted results (m) and the displacement velocity results (p) was calculated. FFTR-RF displays the highest IR with the deformation results computed from PS-InSAR, thus also establishing it as the top-performing model among the five (Table 6). Therefore, the FFTR-RF model, with the best performance, was chosen. The results from LSM within regions categorized as 'moderate,' 'high,' and 'very high' were integrated with the results obtained from PS-InSAR. Simultaneously, the displacement velocity was also categorized as 'moderate,' 'high,' and 'very high.' The union of the results for these three categories from both the FFTR-RF model and PS-InSAR was taken. For 'high' and 'very high', in cases where there is inconsistency in the category values for a particular pixel, the value from PS-InSAR was adopted for that pixel.
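The exact IR formula is not reproduced here; one plausible reading, used purely for illustration, is the share of PS-InSAR-flagged pixels that the model also flags. The masks below are hypothetical:

```python
import numpy as np

def intersection_ratio(m, p):
    """Assumed IR definition: |m AND p| / |p|, over boolean susceptibility masks."""
    m = np.asarray(m, dtype=bool)   # moderate-or-above pixels from the model
    p = np.asarray(p, dtype=bool)   # moderate-or-above pixels from PS-InSAR
    return np.logical_and(m, p).sum() / p.sum()

model_mask = np.array([1, 1, 0, 1, 0], dtype=bool)
insar_mask = np.array([1, 0, 0, 1, 1], dtype=bool)
print(intersection_ratio(model_mask, insar_mask))   # 2 pixels agree out of 3 -> 2/3
```

Under this reading, a higher IR means the model's susceptible zones cover more of the independently detected deforming slopes, which is how Table 6 ranks the five combinations.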
The final LSM results are generated by integrating the FFTR-RF model's predictive outcomes and the PS-InSAR calculated results (Figure 15). Notably, in regions A, B, C, and D, it is evident that the FFTR-RF model exhibits misclassification and omissions for high and very high susceptibility results. In region B, some moderate susceptibility results also show misclassifications and omissions. In summary, integrating the results from PS-InSAR has significantly enhanced the mapping comprehensiveness of LSM.

Newly identified landslide-prone regions
The areas classified as high and very high in landslide susceptibility in the final integrated LSM result were overlaid to identify new areas with landslide propensity along the KKH (Figure 17). The high and very high susceptibility areas likely indicate areas with a tendency for landslides. Furthermore, some of these areas are not included in the landslide inventory compiled in this study, suggesting that they may represent areas with a high potential for future landslides. We recommend regular monitoring and timely preventive measures for these areas with identified landslide propensity to mitigate potential risks.

Feature fusion transformer parameters
The parameter configuration for FFTR in this study is presented in Table 7. Notably, the multi-head attention mechanism plays a vital role in capturing contextual information related to landslides, enabling the model to learn various contextual patterns (Vaswani et al. 2017). The number of heads in the multi-head attention mechanism is critical, as it determines the model's ability to capture a wide range of relevant features and enhance its expressive capacity. However, a larger number of heads increases model parameters and computational complexity. In this study, four was chosen as the number of heads, striking a balance between capturing diverse contextual information and maintaining computational efficiency. Furthermore, the Adam optimizer was employed, and the 'weight decay' parameter was utilized for weight decay regularization. Weight decay helps control the model's complexity by adding a regularization term to the loss function, preventing overfitting (Loshchilov and Hutter 2017). The 'weight decay' parameter applies L2 regularization to the model's weights in the Adam optimizer; during each weight update, the 'weight decay' value is multiplied by the weights, gradually reducing them. Notably, a 'weight decay' value greater than 0.001 led to underfitting in this study. Hence, a 'weight decay' value of 0.001 was selected.
The overall model comprises 29,369 parameters, with the transformer blocks accounting for 14,016 parameters.

Sensitivity of FFTR-ML
The Monte Carlo Cross-Validation method was used to assess the model's data sensitivity and generalization capability (Xu and Liang 2001). The dataset was randomly partitioned in each iteration, with 70% of the data allocated to training, 20% to validation, and the remaining 10% excluded from analysis. This process was repeated ten times to perform model training and validation. Using AUC, accuracy, precision, recall, and F1-score as evaluation criteria, the average values of these metrics were calculated from the ten shuffle-split results and visualized using a radar chart (Figure 18). AUC, accuracy, precision, recall, and F1-score were meticulously recorded for each of the ten cross-validation runs (Figure 19).
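The shuffle-split scheme (70% train, 20% validation, 10% held out, repeated ten times) can be sketched as:

```python
import numpy as np

def monte_carlo_splits(n, n_iter=10, train=0.7, val=0.2, seed=0):
    """Yield (train_idx, val_idx) index pairs from fresh random permutations.

    The remaining 10% of indices is simply left out of each round."""
    rng = np.random.default_rng(seed)
    n_train, n_val = int(n * train), int(n * val)
    for _ in range(n_iter):
        perm = rng.permutation(n)
        yield perm[:n_train], perm[n_train:n_train + n_val]

splits = list(monte_carlo_splits(100, n_iter=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))   # 10 70 20
```

Unlike k-fold cross-validation, the validation sets of different iterations may overlap; averaging the metrics over the ten independent splits is what gives the sensitivity estimate.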
The results demonstrate that, compared to the other models, the FFTR-RF model achieved the highest performance, indicating its robust generalization ability.

Mapping discrepancies in FFTR-XGBoost
Among the diverse model mapping outcomes, the LSM generated by the FFTR-XGBoost model exhibits pronounced dissimilarity to those produced by the other models (Figure 14). In the IR results, the FFTR-XGBoost value is the lowest (Table 6). A crucial insight from these mapping results is that the moderate category, characterized by probabilities ranging from 0.4 to 0.6, represents the smallest proportion among the five models, accounting for only 7.56% of pixel values.
The scarcity of outcomes in the moderate category can be attributed to the nature of the XGBoost model. Each weak learner in XGBoost is sequentially fitted based on the preceding weak learner, and their predictions are ultimately weighted and aggregated to form the final prediction. The training process of XGBoost explicitly optimizes the classification results of samples (refer to Equation (13)) rather than directly optimizing the probability values. Consequently, the loss function penalizes misclassified samples, encouraging the model to favor definitive predictions rather than assigning ambiguous probabilities around 0.5. In essence, XGBoost is predisposed toward outcomes with either low or high probabilities, resulting in fewer outcomes falling within the moderate category. Therefore, its mapping results differ significantly from the other maps. With the high and very high categories having similar proportions in all model results, the lowest percentage of the moderate category accounts for its lowest IR value.
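This polarization of boosted-tree probabilities is easy to reproduce with a small experiment. The sketch below uses scikit-learn's `GradientBoostingClassifier` as an accessible stand-in for XGBoost and synthetic data, and simply measures how many held-out predictions fall into the ambiguous 0.4-0.6 band.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba = gbdt.predict_proba(X_te)[:, 1]

# Fraction of predictions in the ambiguous 0.4-0.6 band: the boosting loss
# rewards confident classification, pushing probabilities toward the
# extremes, so this band is typically sparsely populated.
moderate = np.mean((proba >= 0.4) & (proba <= 0.6))
print(f"share of 'moderate' (0.4-0.6) probabilities: {moderate:.2%}")
```

In a susceptibility map, such a probability distribution translates directly into a small moderate class, consistent with the 7.56% reported above.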

Future research
This study conducted LSM along the KKH using a series of landslide conditioning factors to analyze the spatial distribution of susceptibility. However, spatial LSM still has limitations for timely alerts in areas at risk of potential landslides. Therefore, future research will explore LSM on a temporal scale.
In addition, landslides encompass different categories, each with distinct prevention and control measures. Thus, considering these categories in LSM is a prospective avenue for future research. In the area around the KKH, the effect of glacier or snow melting on landslide hazards in summer, as temperatures rise, is not included in this study. Subsequent research will continue to improve the susceptibility mapping by considering the effect of snow and ice meltwater on landslides.

Conclusions
In recent decades, landslide disasters have profoundly impacted the KKH, a pivotal economic conduit linking China and Pakistan. Consequently, this article employs a comprehensive array of LCFs, including lithology, distance to fault, elevation, curvature, plan curvature, profile curvature, aspect, slope, roughness, TWI, precipitation, distance to river, landcover, NDVI, and distance to road. Based on these data, a landslide inventory is compiled within a 10-kilometer buffer zone of the KKH. Subsequently, landslide susceptibility models are formulated. Finally, displacement velocity is incorporated into the LSM.
The substantive contributions of this article are as follows:
1. To enhance the exploration of spatial patterns and interrelationships between LCFs and landslide data, a transformer-based FFTR feature fusion module is introduced. The 15 LCFs are categorized into continuous and discrete data subsets, facilitating effective feature fusion. The fused features are subsequently fed into five distinct machine learning classifiers: RF, ET, GBDT, XGBoost, and CatBoost. Comprehensive evaluation metrics collectively indicate that the FFTR-RF model yields the best outcomes for landslide susceptibility prediction.
2. To capture the dynamic deformation of landslides, this article utilizes the PSInSAR technique to monitor surface deformation continuously. Employing a SAR dataset of 100 images from 2021 to 2023, the displacement velocity in the study area is calculated.
3. Building on the two points above, and to enhance the comprehensiveness of susceptibility mapping, this study categorizes displacement velocity into three distinct classes: moderate, high, and very high. These classifications are then integrated into the susceptibility results generated by the model, contributing to the derivation of the final susceptibility map.
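The integration step in the third contribution can be sketched as a per-pixel combination of the model's susceptibility classes with the PSInSAR velocity classes. The max-upgrade rule and the integer class encodings below are illustrative assumptions for a toy 3x3 raster, not the paper's exact fusion procedure.

```python
import numpy as np

# Susceptibility classes from the model's map (toy 3x3 raster):
# 1=very low, 2=low, 3=moderate, 4=high, 5=very high.
susceptibility = np.array([[1, 2, 3],
                           [4, 5, 2],
                           [3, 1, 4]])

# Displacement-velocity classes from PSInSAR on the same grid:
# 0=no upgrade, 3=moderate, 4=high, 5=very high.
velocity = np.array([[0, 0, 5],
                     [3, 0, 4],
                     [0, 5, 0]])

# Per-pixel maximum: a pixel's class is upgraded wherever the measured
# deformation indicates higher hazard than the model alone predicts.
final_map = np.maximum(susceptibility, velocity)
print(final_map)
```

Under this rule, the final map can only become more conservative: no pixel's susceptibility class is ever lowered by the velocity layer.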
The method proposed in this study enhances the accuracy of landslide warnings and assists governmental and societal institutions in developing robust mechanisms for landslide disaster prevention and early warning.

Figure 1. Flow chart of the study.

Figure 8. Explanatory diagram of the transformer block.

Figure 9. The framework of the FFTR.

Figure 12. PR curve of different models.

Figure 14. LSM results of different models.

Figure 15. LSM results. (a) is the final integrated LSM result; (b) is the LSM result of FFTR-RF.

Figure 19. Evaluation metric for each shuffle split.

Table 1. Training and testing data.
Figure 11. ROC curve of different models.