A transformer-based Siamese network and an open optical dataset for semantic change detection of remote sensing images

ABSTRACT Recent change detection (CD) methods focus on extracting deep change semantic features. However, existing methods overlook fine-grained features and are poor at capturing long-range space-time information, so micro changes are missed and the edges of change types are smoothed. In this paper, a transformer-based semantic change detection (SCD) model, Pyramid-SCDFormer, is proposed, which precisely recognizes small changes and the fine edge details of changes. The model selectively merges different semantic tokens in the multi-head self-attention block to obtain multiscale features, which is crucial for extracting information from remote sensing images (RSIs) containing changes at multiple scales. Moreover, we create a well-annotated SCD dataset, Landsat-SCD, with unprecedented time series and change types in complex scenarios. Compared with three convolutional-neural-network-based, one attention-based, and two transformer-based networks, experimental results demonstrate that the Pyramid-SCDFormer stably outperforms the existing state-of-the-art CD models, with MIoU/F1 improvements of 1.11/0.76%, 0.57/0.50%, and 8.75/8.59% on the LEVIR-CD, WHU_CD, and Landsat-SCD datasets, respectively. For change classes with proportions below 1%, the proposed model improves MIoU by 7.17-19.53% on the Landsat-SCD dataset. The recognition performance for small-scale changes and fine edges of change types is thus greatly improved.


Introduction
Change detection (CD) of remote sensing images (RSIs), the process of extracting land cover change information by analysing a pair of co-registered remote sensing images of the same area at distinct periods, is a hot topic in the intelligent interpretation of remote sensing images (Shafique et al. 2022). The definition of change detection varies considerably across applications. Detecting changes manually is a time-consuming and labor-intensive task (Singh 1989). Therefore, automated CD is one of the key technologies for earth observation applications and plays an important role in urban expansion (Chen and Shi 2020), deforestation (De Bem et al. 2020), disaster assessment (Fujita et al. 2017), and other practical applications (Naegeli, Huss, and Hoelzle 2019). Especially in areas with fragile ecological environments, regular and long-term monitoring of land cover is increasingly vital. Fortunately, with the improvement of earth observation technologies, massive high-quality multitemporal wide-coverage RSIs provide key data support for CD tasks.
According to the type of semantic label information desired in the output change map, CD is divided into two categories: binary change detection (BCD), which answers only 'where changes happen', and semantic change detection (SCD), which answers both 'where changes happen' and 'how changes happen' in parallel. Hence, SCD provides a 'from-to' change map indicating the change direction and contains more comprehensive land-cover change information; the acquisition of detailed change-type conversion information is crucial for specific applications (Peng et al. 2021).
Current CD networks are data-driven deep learning-based methods, so well-annotated CD datasets play a crucial role in exploring novel CD methods. In the context of CD tasks, there are many BCD datasets (Chen and Shi 2020; Wang et al. 2018), while few well-annotated 'from-to' change datasets are available as open source. Additionally, existing datasets suffer from several bottlenecks: (1) lack of long-range multi-temporal RSIs; (2) lack of multiple change types with detailed annotation; (3) lack of SCD information, i.e. they mostly reflect whether there is a change but do not report the direction of land cover transformation, for example, from farmland to building. To a certain extent, the exploration of 'from-to' CD datasets accelerates future research on SCD methods. In this paper, we create a large-scale SCD dataset with more varied change types and 'from-to' information, richer prior land cover information, and longer time series RSIs.
At present, the majority of CD methods resort to various convolutional neural networks (CNNs) to realize BCD. One line of work fuses the bands of a pair of images and inputs them into an end-to-end network to get a pixel-level change map (Alcantarilla et al. 2018; Peng, Zhang, and Guan 2019). Another is based on deep Siamese networks: the input image pair is fed into two weight-sharing feature-extraction branches and embedded into a feature space where the distance between changed pairs is large and between unchanged pairs is small. However, the intrinsic locality of the convolution operation degrades performance in representing image features and further constrains the modeling of explicit long-range relations.
To enhance the performance of CD, the latest studies focus on increasing the receptive field and improving feature extraction and refinement. Peng, Zhang, and Guan (2019) and Zhang et al. (2018) utilize multiscale atrous convolution to extract multiscale features. Others explore the nonlocal operations of attention mechanisms to enhance the global context of features (Chen and Shi 2020). Nevertheless, existing CNN-based CD methods generally still struggle to relate long-range concepts in space-time, yet long-range context information is essential for the semantic changes of bitemporal RSIs. Inspired by the encouraging performance of the transformer in the computer vision (CV) area, transformer-based approaches have been proposed for the downstream task of CD (Bandara and Patel 2022a; Chen, Qi, and Shi 2021). Thanks to global self-attention, vision transformers (ViTs) shape long-range spatial and temporal relations better than CNN-based networks (Dosovitskiy et al. 2020). However, their ability to capture multiscale features degrades when visual tasks involve objects at different scales, so they fail to capture small objects and fine object edges. Moreover, despite the wider receptive field and more robust long-range context modeling of ViTs, transformer-based SCD work has not been carried out in depth, and most recent related research focuses on BCD.
Based on the above research on CD, we posit that extracting objects at various scales, especially small objects and fine edges in the change map, requires capturing long-range and multiscale context features in SCD tasks. To meet these challenges, we introduce the Pyramid-SCDFormer network. At the encoder stage, the shunted self-attention (SSA) module (Ren et al. 2021) is integrated to better model multiscale features among different attention heads within one self-attention layer, and multi-level features from the Siamese network are then concatenated to obtain the distance feature maps of the bitemporal RSIs. The Semantic Change Map (SCM) is finally obtained after processing by a Multi-Layer Perceptron (MLP) and upsampling in the decoder.
For feature extraction, the Pyramid-SCDFormer differs from previous transformer-based CD methods: thanks to the SSA module, it learns to extract pyramid features of changed objects at different scales at different attention heads within one attention layer in an efficient and effective manner. Hence, it retains more fine-grained features and clearly identifies small objects and fine edges of changes that other models easily ignore. For the distance maps of the bitemporal images, we concatenate features at different levels from the encoder stage to obtain the final distance map, whereas previous CD models mainly calculate the absolute distance.
In sum, the main contributions are as follows: (1) For the clear recognition of small-scale objects and fine boundaries of changed objects, we propose a novel end-to-end SCD network based on a transformer module, Pyramid-SCDFormer, which utilizes the SSA module to capture features at different scales simultaneously with favorable efficiency, then integrates different hierarchical features of the changed land covers in a Siamese network to obtain fine-grained features of changes.
(2) A new, open-source optical satellite SCD dataset with unprecedented time series and semantic change types, Landsat-SCD, is presented, comprising 8468 pairs of multispectral Landsat images with 10 change-type classes. Landsat-SCD is available to advance state-of-the-art models in BCD and SCD tasks. (3) Extensive experiments confirm the validity of the proposed Pyramid-SCDFormer. The proposed model well mitigates the misdetection of small-scale changes and fine edges, and achieves state-of-the-art performance on the LEVIR-CD, WHU_CD, and the proposed Landsat-SCD benchmarks.
The remainder of this paper is structured as follows. Section 2 presents the related work of the proposed network and dataset. Section 3 describes the proposed dataset in detail. Section 4 presents the architecture of the proposed Pyramid-SCDFormer network and each network module will be introduced in detail. Section 5 reports the experimental results and discusses the performance of the proposed network. Section 6 concludes this paper with remarks and future work.

CNN-based CD methods
Deep learning-based CD methods for RSIs are rapidly evolving and yielding good results, including supervised (Zhang et al. 2018; Chen and Shi 2020; Li et al. 2021), unsupervised (Gong et al. 2019), and semi-supervised (Bandara and Patel 2022b) approaches on different CD datasets. Here, we focus on supervised CNN-based methods for CD tasks. Most prior CD methods benefit from the semantic representation capability of CNNs; to improve their recognition performance, scholars mainly optimize the network structure and introduce attention mechanisms and other tricks.
There are roughly three ways to improve the feature extraction of CNN-based architectures: multiscale features, spatial-temporal features, and residual connections. Zhang et al. (2018) introduce atrous convolution in ResNet101 to increase the receptive field for capturing multiscale context information and use Atrous Spatial Pyramid Pooling (ASPP) to extract features that keep characteristics at various scales. Yang et al. (2020) propose an end-to-end deep learning CD framework based on D-LinkNet to overcome the boundary errors of traditional block-based processing. Li et al. (2021) propose the MFCN network, which uses multiscale convolution filters to extract detailed information. Gedara Chaminda Bandara, Gopalakrishnan Nair, and Patel (2022) first use Denoising Diffusion Probabilistic Models (DDPM) to leverage more multiscale information from RSIs and then train a CD classifier for the precise CD task. For exploring temporal features, the BiDateNet network (Papadomanolaki et al. 2019) imports Long Short-Term Memory networks (LSTMs) to improve CD accuracy. Song et al. (2018) propose the convLSTM network, combining 3D Fully Convolutional Networks (FCNs) and LSTMs for hyperspectral image CD to preserve spectral-spatial features. Chen and Shi (2020) utilize classic residual connections for coarse-grained and fine-grained features of changes. UNet++ (Peng, Zhang, and Guan 2019) employs dense skip connections to improve spatial accuracy and reduce pseudo-changes by enhancing scale variance. FC-Siam-Co and FC-Siam-Di make the best of skip connections to achieve multiscale feature extraction for CD. CNN-based CD methods are good at extracting high-level semantic features that reveal the change of interest but focus on local modeling; consequently, the above methods fail to alleviate the pseudo-change effect.
Consequently, attention mechanisms have also been applied to CD to obtain discriminative information. A deeply supervised image fusion network (IFN) fuses multi-level deep features in a channel-attention-wise manner to improve boundary completeness and internal compactness. To capture more discriminative features, DSAMNet imports a Convolutional Block Attention Module (CBAM) to obtain spatial and channel information simultaneously. DASNet and DTCDSCN introduce a dual attention module (DAM) to get more discriminative information on semantic changes in RSIs. However, these approaches only reweight feature information in the spatial and channel dimensions, which limits the capture of the long-range spatiotemporal information required for accurate CD. As a supplement, the self-attention mechanism can capture more contextual information to solve long-range dependency problems and relieve the influence of pseudo-changes (Chen and Shi 2020).

Transformer-based vision methods
The transformer (Vaswani et al. 2017) boomed in 2017 with its success in Natural Language Processing (NLP). Building on this intuition, ViT models have been proposed one after another and achieve promising performance across classic CV tasks, such as classification (Deng et al. 2009), object detection (Everingham et al. 2010), and semantic segmentation (Zhou et al. 2017), often matching or outperforming CNN-based models. The transformer architecture is based on the self-attention mechanism, whose cost is quadratic in the number of pixels; hence, recent studies generally exploit down-sampling and token-merging strategies to reduce computation. Wang, Li, et al. (2022) propose the UVACD network, combining a transformer and a CNN with the help of spatial and temporal information to extract more distinguishable change information. Dosovitskiy et al. (2020) apply down-sampling projection to reduce computation cost, but the output carries only single-scale, coarse-grained information. Wang et al. (2021) merge tokens through linear projection and adopt a Spatial-Reduction Attention (SRA) layer to reduce computational cost. However, the above ViTs largely retain static receptive fields for each token feature within one self-attention layer, which is insufficient for capturing the boundary and shape of the change of interest. Yet fine SCD is significant for practical applications, such as urban expansion, land desertification, and agricultural land occupation.
To deal with the above problems, we propose the Pyramid-SCDFormer method for the fine SCD task, especially the extraction of small changes and detailed edges. In particular, the model learns to extract pyramid features of changed objects at different scales at different attention heads within one attention layer. The Pyramid-SCDFormer, a transformer Siamese network embedded with the SSA module, is effective in representing multiscale semantic features and capturing long-range spatiotemporal information about the change of interest.

The existing CD datasets
There are currently many CD datasets for remote sensing from drones and satellites (Shafique et al. 2022). The majority of CD datasets contain binary labels, showing only change and no-change information. Table 1 shows that the publicly available CD datasets have some limitations.
(1) Lack of long-range multi-temporal RSIs. Most existing public CD datasets contain bitemporal RSIs of the same area, and a few contain remote sensing images of three phases. The SZTAKI AirChange dataset (Benedek and Szirányi 2009) is the earliest CD dataset, comprising 1000 pairs of bitemporal images with a size of 800 × 600 and a resolution of 0.5 m. To make the best of the rich change information in high-resolution RSIs, the Learning, Vision and Remote Sensing Laboratory released LEVIR-CD (Chen and Shi 2020) to monitor changes in buildings; it contains 637 pairs of bitemporal aerial images with a size of 1024 × 1024 and a resolution of 0.3 m. The Wuhan University (WHU) Building CD dataset (Ji, Wei, and Lu 2018) is also a bitemporal building change detection dataset, with a higher spatial resolution. The Onera Satellite Change Detection (OSCD) dataset comprises 24 pairs of multispectral Sentinel-2 satellite images from five areas around the world, each with a size of 600 × 600 and a resolution of 10 m. Furthermore, the bitemporal hyperspectral image CD dataset 'River' (Wang et al. 2018) was proposed for the objective evaluation of CD methods.
(2) Lack of multiple change types with detailed annotation. LEVIR-CD and WHU_CD contain binary building CD labels, and SZTAKI AirChange and OSCD only label binary land cover information. The High Resolution Semantic Change Detection (HRSCD) dataset (Daudt et al. 2019) is an SCD dataset, but it does not explicitly represent the transformation relationships between land covers, and its label accuracy of 80-85% leads to inaccurate borders in some cases. The Hi-UCD dataset (Tian et al. 2020) only includes 9-class land cover maps and binary change maps. Although HRSCD, Hi-UCD, and HCCD (López-Fandiño et al. 2018) contain more than two change types, they are still lacking in accuracy and semantic richness.
(3) Lack of SCD information. The above-mentioned public datasets mostly reflect whether there is a change but do not report the direction of land cover transformation. Therefore, refined SCD is hindered by the lack of SCD datasets.
Existing CD datasets do not meet the needs of SCD methods, and there is still room for improvement toward larger scale and richer change information. Firstly, high-resolution RSIs cannot provide large-scale and ultra-long time series Land Use and Land Cover (LULC) monitoring; satellite data such as Landsat imagery, with its rich historical archive, wider spatial coverage, and higher temporal resolution, is a good supplement. Secondly, the definition of change types across existing CD datasets is too broad to meet practical applications. The proposed Landsat-SCD dataset largely complements existing CD datasets in spatial scale, time span, and diversity of change types.
Landsat-SCD: a new well-annotated dataset for semantic change detection

Few SCD datasets exist, yet mainstream CD methods have high requirements for data quality and quantity. In this context, Landsat-SCD focuses on refined land cover semantic changes and provides a benchmark with longer time series and more varied semantic change types for evaluating refined SCD models of RSIs. Multispectral RSIs have become a vital data source in the field of CD owing to their relatively high spatiotemporal and spectral resolution and good data accessibility. The source data of the Landsat-SCD benchmark are Landsat-like images taken between 1990 and 2020 in Tumushuke (39°39′N-40°4′N, 78°53′E-79°19′E), Xinjiang, adjacent to the Taklimakan Desert with its fragile ecological environment and located on the Belt and Road Economic Belt. Detailed information about the Landsat-SCD dataset is shown in Table 2.
The Landsat-SCD dataset provides 10 change types with much more fine change information than is previously available in the context of CD datasets, where each 'from-to' change type is a separate class representing land-cover transitions. The change type codes and color maps corresponding to the specific dataset are shown in Table 3.
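The 'from-to' encoding can be illustrated with a small sketch. The class names and code ordering below are hypothetical stand-ins for Table 3, which defines the dataset's actual codes; only the structure follows the text: each ordered land-cover transition is its own class, plus a 'no change' class, giving 10 classes in total.

```python
# Hypothetical sketch of 'from-to' semantic change encoding.
# The real class codes come from Table 3; names and ordering here are assumptions.
TRANSITIONS = [
    ("farmland", "desert"), ("desert", "farmland"),
    ("desert", "water"), ("water", "desert"),
    ("water", "farmland"), ("farmland", "building"),
    ("building", "farmland"), ("desert", "building"),
    ("building", "desert"),
]
# Each (from, to) transition gets its own integer class; code 0 means 'no change'.
CODE = {pair: i + 1 for i, pair in enumerate(TRANSITIONS)}

def semantic_change_code(before: str, after: str) -> int:
    """Map a pair of land-cover labels to a 'from-to' change class code."""
    if before == after:
        return 0  # unchanged land cover
    return CODE[(before, after)]
```

With the no-change class, the nine transitions above yield the dataset's 10 change-type classes.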
Some examples of image pairs and labels are depicted in Figure 1. Despite its unprecedented size and quality, the challenges of this dataset need to be discussed. First, the dataset contains many complicated detection scenes with unprecedented multiple change types: the study area is adjacent to the edge of the desert, the buildings therein are small and scattered, and the data source is Landsat-series imagery with a resolution of 30 m. These points challenge both accurate manual annotation and new, robust CD models. Second, label imbalance also arises in the proposed dataset: in line with the characteristics of the real world, changed pixels occupy a much smaller ratio than unchanged land covers. Thus, we provide a realistic evaluation benchmark for SCD methods.

The Pyramid-SCDFormer for the precise and fine SCD task
This section describes a robust SCD network (Pyramid-SCDFormer) for monitoring LULC with bitemporal optical satellite RSIs. An overview of the network is first provided, after which each network module is introduced in detail.

Overall model architecture
The Pyramid-SCDFormer takes bitemporal RSIs as input and outputs a pixel-level 'from-to' change map in which each pixel carries a unique encoded 'from-to' change label. We focus on semantic CD, which reflects not only change versus no change but also the transition from one land cover to another. The proposed Pyramid-SCDFormer architecture (Figure 2) consists of three parts: the Siamese pyramid transformer encoder, the fusion module that builds multi-level distance maps from bitemporal feature pairs, and the prediction head of the decoder. Concretely, the inputs I1, I2 of size H0 × W0 × 3 are fed into the Siamese pyramid transformer encoder. The patch embedding mixes convolutions of different kernel sizes to achieve image-to-token conversion and yields semantic tokens of size H0/4 × W0/4 × C, where C is the token dimension. In the i-th stage, the output feature maps F^i_{Tj} have size H0/2^(i+1) × W0/2^(i+1) × 2^(i-1)C, where i = {1, 2, 3, 4} indexes the stage and Tj = {T1, T2} indexes the period. These feature maps are sent into the fusion module to obtain the distance maps, followed by the prediction head of the decoder to acquire the Semantic Change Map (SCM).
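The stage-wise tensor sizes can be checked with a short sketch (our reading of the text, not the authors' code): the patch embedding yields an H0/4 × W0/4 × C map, and each subsequent encoder stage halves the spatial size and doubles the channels.

```python
def stage_shapes(h0: int, w0: int, c: int, num_stages: int = 4):
    """Per-stage (height, width, channels) of the pyramid encoder output."""
    shapes = []
    h, w, ch = h0 // 4, w0 // 4, c  # after patch embedding
    for _ in range(num_stages):
        shapes.append((h, w, ch))
        h, w, ch = h // 2, w // 2, ch * 2  # halve resolution, double channels
    return shapes
```

For a 256 × 256 input with C = 64, the four stages give 64 × 64 × 64 down to 8 × 8 × 512.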

The Siamese pyramid transformer encoder
In the pyramid transformer encoder phase, the bitemporal images I1, I2 of size H0 × W0 × 3 are first fed into the patch embedding, which expresses the input images as a few high-level semantic tokens and yields an input sequence of size H0/4 × W0/4 × C. Four feature-extraction stages follow, each containing a linear embedding and several SSA transformer blocks. After each stage, the height and width of the feature maps are halved and the channel number is doubled, so the four output feature maps F^1_{Tj}, F^2_{Tj}, F^3_{Tj}, F^4_{Tj} have size H0/2^(i+1) × W0/2^(i+1) × 2^(i-1)C, where i indexes the stage and Tj the period. The branches of the two Siamese networks share weights.
Shunted Self-Attention (SSA): Different from ViT (Dosovitskiy et al. 2020), the Swin transformer, and PVT, SSA enables the self-attention to simultaneously extract multiscale features at different attention heads within one attention layer. The extracted features are more discriminative and contain more fine-grained information, which benefits distinguishing changes of interest at different scales.
The input tokens are projected into Query (Q), Key (K), and Value (V) vectors, and Multi-Head Self-Attention (MSA) uses different attention heads to compute attention scores simultaneously. Multiscale Token Aggregation (MTA) down-samples the K, V of different heads to different sizes. The SSA is calculated by:

Q_i = X W^Q_i
K_i = MTA(X, r_i) W^K_i,  V_i = MTA(X, r_i) W^V_i
h_i = Softmax(Q_i K_i^T / √d_h) V_i
MSA(X) = Concate(h_1, …, h_N)

where X is the input feature map; W^Q_i, W^K_i, and W^V_i are linear projection parameters; and r_i is the downsampling rate in the i-th head. When r_i is large, more K, V tokens are merged and the computation cost is low, capturing large-scale objects; when r_i is small, the computation cost is higher but more detail information is preserved. Hence, we subtly mix multiple r_i to extract multi-granularity features within one self-attention layer. d_h is the dimension of Q and K, h_i is the output of the i-th head, Concate(·) is the concatenation operation, and MSA(X) is the output feature map of the MSA module. By integrating variant r_i in different attention heads, the key and value vectors capture different scales within a single self-attention layer.
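A minimal NumPy sketch of the idea (our simplification, not the authors' implementation: MTA is approximated by average pooling over r_i × r_i windows, and the final output projection is omitted). Heads with a large r_i attend over few, coarse K/V tokens cheaply, while small-r_i heads keep fine detail.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shunted_self_attention(X, Wq, Wk, Wv, rates):
    """Single-layer SSA sketch.
    X: (H, W, C) feature map; Wq/Wk/Wv: per-head (C, d_h) projection matrices;
    rates: per-head K/V downsampling rate r_i (a stand-in for the paper's MTA)."""
    H, W, C = X.shape
    tokens = X.reshape(H * W, C)          # queries keep full resolution
    heads = []
    for Wq_i, Wk_i, Wv_i, r in zip(Wq, Wk, Wv, rates):
        # MTA stand-in: average-pool r x r windows before projecting K and V
        pooled = X[:H - H % r, :W - W % r].reshape(H // r, r, W // r, r, C).mean(axis=(1, 3))
        kv = pooled.reshape(-1, C)        # (HW / r^2, C) merged tokens
        Q = tokens @ Wq_i                 # (HW, d_h)
        K, V = kv @ Wk_i, kv @ Wv_i       # (HW / r^2, d_h)
        d_h = Q.shape[-1]
        h_i = softmax(Q @ K.T / np.sqrt(d_h)) @ V
        heads.append(h_i)
    return np.concatenate(heads, axis=-1)  # Concate(h_1, ..., h_N)
```

Mixing, e.g., rates = [1, 2] gives one detail-preserving head and one coarse-context head within the same layer, which is the multi-granularity behavior the paper describes.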
Distance Map: The distance map leverages the multi-level feature maps from each stage of the encoder and computes the distance metric at each pyramid level, whereas traditional CD methods (Chen, Qi, and Shi 2021) directly calculate the absolute distance. The distance map is calculated as:

F^i_dist = Concate(F^i_{T1}, F^i_{T2})

where F^i_{T1} and F^i_{T2} represent the output features of the i-th stage in the T1 and T2 periods.
Detail-specific Feed-forward Layer: Based on the traditional feed-forward layer, we insert a detail-specific layer between the two fully connected layers to complement local details.
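As a sketch (our reading of the fusion step, following the statement that bitemporal features are concatenated rather than differenced), the per-level distance maps are channel-wise concatenations:

```python
import numpy as np

def distance_maps(feats_t1, feats_t2):
    """Per-level distance maps F_dist^i = Concate(F_T1^i, F_T2^i): channel-wise
    concatenation of bitemporal features, instead of the absolute difference
    |F_T1^i - F_T2^i| used by earlier CD methods.
    feats_t1/feats_t2: lists of (H_i, W_i, C_i) arrays from the Siamese encoder."""
    return [np.concatenate([f1, f2], axis=-1)   # (H_i, W_i, 2 * C_i)
            for f1, f2 in zip(feats_t1, feats_t2)]
```

Concatenation doubles the channel count per level but lets later layers learn the comparison instead of fixing it to a subtraction.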

The prediction head of decoder
Multi-level distance maps are produced in the Siamese pyramid transformer encoder stage, and the multiscale features at each level are then aggregated to predict the semantic change map in the prediction head of the decoder. Concretely, we utilize an MLP layer to unify the channel dimension of each distance map F^i_dist and upsample each one to the size of H0/4 × W0/4 × C_conc. The upsampled distance maps are concatenated and fused by an MLP layer, followed by upsampling with a factor of 4 to obtain a feature map of size H0 × W0 × C_conc. Finally, this feature map is processed by MLP and softmax layers to obtain the SCM of size H0 × W0 × N_cls, where N_cls = 10 is the number of change types.
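The decoder steps above can be sketched in NumPy (our reading; the `proj`, `fuse`, and `cls_w` matrices are hypothetical 1×1-projection weights standing in for the paper's MLP layers, and nearest-neighbour repetition stands in for its upsampling):

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling stand-in for the decoder's upsample ops."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prediction_head(dist_maps, proj, fuse, cls_w):
    """Sketch of the prediction head. dist_maps: per-level (H_i, W_i, C_i)
    maps with H_i = H0 / 2^(i+1); proj: per-level (C_i, C_conc) matrices;
    fuse: (4 * C_conc, C_conc); cls_w: (C_conc, N_cls)."""
    unified = []
    for i, (d, w) in enumerate(zip(dist_maps, proj)):
        u = d @ w                      # MLP unifies channel dimensions
        if i:
            u = upsample(u, 2 ** i)    # bring level i to the H0/4 x W0/4 grid
        unified.append(u)
    fused = np.concatenate(unified, axis=-1) @ fuse  # concat + fusing MLP
    full = upsample(fused, 4)          # back to H0 x W0
    return softmax(full @ cls_w)       # per-pixel change-type scores
```

The output is an H0 × W0 × N_cls map of per-pixel class probabilities, from which the SCM is read off by an argmax over the last axis.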

Experimental results and discussion
In this section, we introduce the existing state-of-the-art CD methods. Then, the model evaluation metrics and implementation details are presented. Lastly, we compare the SCD performance of the proposed Pyramid-SCDFormer with the existing state-of-the-art methods on LEVIR-CD, WHU_CD, and the proposed Landsat-SCD dataset.

The existing state-of-the-art methods
In total, three CNN-based methods, one attention-based method, and two transformer-based methods are compared in the experiments to evaluate the effectiveness of the proposed method.
(1) FC-EF (Daudt, Le Saux, and Boulch 2018): a CNN-based network. The concatenated bitemporal images are fed into a fully convolutional network, with skip connections transporting multiscale features.
(2) FC-Siam-Di: a CNN-based network. Bitemporal images are fed into a Siamese network to capture multi-level features, and their differences are transported to the decoder.
(3) FC-Siam-Co: a CNN-based network. Bitemporal images are fed into a Siamese network to extract multi-level features, and concatenations at different levels from the encoder are used to detect changes.
(4) DTCDSCN: an attention-based CNN method. Bitemporal images are fed into a Siamese-based network that employs DAM to explore the correlation of channel and spatial dimensions, capturing more discriminative features.
(5) BIT (Chen, Qi, and Shi 2021): a transformer-based method. Semantic tokens are fed into an encoder-decoder transformer architecture to enhance context information.
(6) ChangeFormer (Bandara and Patel 2022a): a transformer-based method. A transformer encoder in a Siamese network extracts detail and semantic features of the bitemporal images, and a light decoder fuses the multi-level features to acquire the change map.

Metrics and implementation details
In this work, we conduct experiments on the LEVIR-CD, WHU_CD, and proposed Landsat-SCD datasets. To evaluate the effectiveness of the proposed Pyramid-SCDFormer network, we present four encoder variants with different configurations, as shown in Table 4. We use five evaluation metrics for the SCD results: Overall Accuracy (OA), Precision (P), Recall (R), F1-score (F1), and Mean Intersection over Union (MIoU).
These metrics are computed as OA = (TP + TN)/(TP + TN + FP + FN), P = TP/(TP + FP), R = TP/(TP + FN), F1 = 2PR/(P + R), and MIoU is the mean over classes of IoU = TP/(TP + FP + FN), where TP, FP, TN, and FN are the numbers of true positive, false positive, true negative, and false negative pixels, respectively. We implement all experiments with the PyTorch library on a GeForce RTX 3090 GPU. All networks are randomly initialized by default. We train all models with the cross-entropy loss function and the AdamW optimizer (Loshchilov and Hutter 2018) with a weight decay of 0.01. The learning rate is initially set to 0.0001, except for the WHU_CD dataset, where it is 0.00001, and the batch size is 6.
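The metric definitions translate directly into code; a small helper for the binary case (assuming MIoU averages the IoU of the 'change' and 'no change' classes):

```python
def cd_metrics(tp: int, fp: int, tn: int, fn: int):
    """OA, Precision, Recall, F1, and MIoU from confusion-matrix counts.
    Binary case: MIoU averages the IoU of 'change' and 'no change'."""
    oa = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    iou_change = tp / (tp + fp + fn)     # IoU of the 'change' class
    iou_nochange = tn / (tn + fp + fn)   # IoU of the 'no change' class
    miou = (iou_change + iou_nochange) / 2
    return oa, p, r, f1, miou
```

For the 10-class Landsat-SCD case, the same IoU formula is applied per change type and averaged over all classes.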

Comparisons on the public CD datasets
To demonstrate the robustness of the Pyramid-SCDFormer model, we selected two publicly available CD datasets, LEVIR-CD and WHU_CD, to strengthen the results section. Table 5 presents the segmentation performance, model parameters, and training time cost of different CD methods on the LEVIR-CD and WHU_CD test sets. Table 5 shows that the proposed method outperforms most baselines on both datasets in precision, recall, OA, MIoU, and F1, and performs consistently well on WHU_CD with slightly more training time and fewer model parameters. Significantly, Pyramid-SCDFormer-B improves on the previous state-of-the-art network in OA/MIoU/F1 by 0.12/1.11/0.76% and 0.90/0.57/0.50% on the LEVIR-CD and WHU_CD datasets, respectively. The second-ranked Pyramid-SCDFormer-S obtains OA/MIoU/F1 of 98.29/84.27/90.84%. Pyramid-SCDFormer-T and DTCDSCN obtain similar recognition results on LEVIR-CD. FC-EF gets the lowest OA/MIoU/F1 of 94.91/47.45/48.69% among the nine compared networks on LEVIR-CD; with only four max-pooling and four upsampling layers, it is shallower than U-Net (Ronneberger, Fischer, and Brox 2015), so its ability to extract change features is insufficient. FC-Siam-Co is 0.18/1.77/1.51% higher than FC-Siam-Di in OA/MIoU/F1 on LEVIR-CD, which indicates that the concatenation operation preserves more change information than differencing. The segmentation of the DTCDSCN network is only slightly lower than that of Pyramid-SCDFormer-B on WHU_CD, thanks to its spatial pyramid module and attention module extracting multiscale contextual features in the decoder phase, the same idea as the Siamese pyramid transformer encoder of this paper.
The radar plot (Figure 3) shows the accuracy of the 'change' type for different CD models on the LEVIR-CD dataset. Because the 'change' type accounts for a small proportion of pixels, its Recall (R_1), Precision (P_1), F1 (F1_1), and MIoU (MIoU_1) better highlight the advantages of the proposed model on small-scale classes of changed interest. For the 'change' type, all transformer-based models yielded precision > 81%, recall > 73%, F1 > 77%, and MIoU > 63%. The Pyramid-SCDFormer achieves the highest precision/recall/F1/MIoU of 71.91/83.66/86.44/81.05% for the 'change' type. Hence, transformer-based models perform stably on the CD problem.
Moreover, the visual comparison results on the LEVIR-CD and WHU_CD datasets are displayed in Figure 4. For better visualization, white, black, red, and green represent TP, TN, FP, and FN, respectively. Overall, the proposed model achieves better results than the other models. The Pyramid-SCDFormer shows stronger robustness under 'non-semantic change' conditions, such as changes caused by lighting (Figure 4(a)) and seasonal differences (Figure 4(c)), which indicates that the proposed model effectively learns global context information over long-range spatiotemporal conditions and excludes irrelevant changes. In scenes with more complex backgrounds, it recognizes finer detail than the other networks (Figure 4(b,c)). The proposed network also better handles multiscale changes (Figure 4(b,d,f,j)): for large-scale changes, its larger receptive field allows complete feature extraction and yields a more intact building shape (Figure 4(d,f)); for small-scale changes, it achieves finer detail recognition than the other networks (Figure 4(b,j)). This outstanding recognition of multiscale change types is thanks to the SSA module, which captures multiscale features within one self-attention layer via multiscale token aggregation.

Comparisons on the Landsat-SCD dataset
As shown in Figure 5, we compare the OA curves of the existing CD models and the Pyramid-SCDFormer model on the Landsat-SCD dataset in the training and validation phases. It can be observed that the Pyramid-SCDFormer shows a good fit between the training and validation curves and achieves high and stable learning performance. Note: Four colors are used for better visualization: white for true positive, black for true negative, dark gray for false positive ('no change' wrongly classified as 'change'), and light gray for false negative ('change' missed). Note: P_1, R_1, F1_1, and MIoU_1 represent precision, recall, F1 score, and Mean Intersection over Union, respectively. The value at the inner circle of the radar plot is 40% and that of the outer boundary is 90%; a higher value means higher accuracy.
As shown in Table 6, the three variants of the Pyramid-SCDFormer with different configurations all achieve the best results compared with other state-of-the-art methods on the Landsat-SCD dataset. Pyramid-SCDFormer-B achieves the highest OA/MIoU/F1 of 96.08/59.91/72.50%. Notably, compared with the best existing network (BIT), the OA/MIoU/F1 of Pyramid-SCDFormer-B increase by 0.99/8.75/8.59% on the Landsat-SCD dataset. The third-ranked Pyramid-SCDFormer-T obtains OA/MIoU/F1 of 95.75/56.13/68.52%, which are 0.66/4.97/4.61% higher than the current state of the art, respectively. This large improvement not only further confirms the effectiveness of the SSA module and the fusion of the distance map, but also demonstrates the gain from their combination.
Note: Bold indicates the best and second-best experimental results.
Figure 6. Visualizing comparison results on the Landsat-SCD dataset.
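The section does not spell out how the distance map is formed; one plausible minimal form, assuming it is a pixel-wise distance between bitemporal feature embeddings (a sketch, not the paper's actual module), is:

```python
import numpy as np

def distance_map(f1, f2):
    # Pixel-wise Euclidean distance between bitemporal feature maps:
    # (channels, H, W) -> (H, W); large values hint at semantic change.
    return np.sqrt(((f1 - f2) ** 2).sum(axis=0))

rng = np.random.default_rng(1)
f_t1 = rng.standard_normal((16, 4, 4))
f_t2 = f_t1.copy()
f_t2[:, 0, 0] += 5.0             # inject a strong change at one pixel
d = distance_map(f_t1, f_t2)
print(d.argmax())                # 0: the changed pixel stands out
```

Fusing such a map with the decoder features gives the network an explicit change-magnitude prior on top of the learned semantic features.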
Note: Ten colors are used for better visualization of the ten change types. Blue with RGB (0, 47, 167) denotes false positives and false negatives. Table 7 shows the MIoU performance of the 10 change types for different CD models on the Landsat-SCD dataset. Notably, the proposed model brings a clear improvement for change types with small proportions. 'Water to farmland' is the smallest change type (0.09%); compared with the best-performing baseline (BIT), the proposed model improves it by 7.17%. For change types with proportions below 1%, such as 'water to farmland', 'building to desert', 'building to farmland', 'desert to building', and 'farmland to building', MIoU increases by 7.17-19.53% compared with the best existing models. For change types with proportions between 1% and 20%, such as 'farmland to desert', 'desert to farmland', 'desert to water', and 'water to desert', MIoU increases by 3.49-9.53%. Hence, the proposed model is more effective at boosting change types with small proportions. Figure 6 shows the performance of different CD methods on the Landsat-SCD dataset. Blue pixels represent recognition errors, and fewer blue pixels mean fewer misclassifications. In general, the semantic CD results of the proposed model are closest to the ground truth. First, the Pyramid-SCDFormer model keeps more precise boundaries of multiscale change objects than all the baselines in Figure 6(a-c), which demonstrates that more useful fine-grained features are preserved to improve accuracy; compared with other state-of-the-art models, missed and false alarms are significantly reduced in the semantic change maps. Second, the proposed model accurately identifies small change objects that other existing models are prone to miss in the relatively complex scenarios of the Landsat-SCD dataset, such as Figure 6(a,c).
For example, the proposed model not only accurately identifies almost all small-scale 'desert to farmland' changes, but also maintains fine boundary information of the changes in Figure 6(c). Therefore, the Pyramid-SCDFormer can effectively recognize scale-variant change types and keep finer boundaries. In particular, the improvement in recognizing small-scale changes is most obvious in complex scenarios.

Ablation experiments
To verify the effectiveness of the SSA module, we perform ablation experiments on the LEVIR-CD, WHU_CD, and Landsat-SCD datasets. Pyramid-SCDFormer-same indicates the same depth as ChangeFormer, where depth denotes the number of SSA transformer blocks at each stage. Table 8 shows that our Pyramid-SCDFormer model outperforms the ChangeFormer model in precision/recall/OA/MIoU/F1 at the same depth, with much smaller computational complexity and fewer model parameters. Considering the segmentation results on the three datasets, it is worth noting that the improvement is smallest on the WHU_CD dataset, which has higher spatial resolution, and largest on the Landsat-SCD dataset, which has lower spatial resolution. Our Pyramid-SCDFormer model thus yields larger gains on datasets containing more information about small objects and details. This further indicates the efficiency and cost-effectiveness of our Pyramid-SCDFormer model in recognizing small-scale and fine edges of change types.
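The core idea the ablation isolates, multiscale token aggregation within one self-attention layer, can be sketched in a few lines. This is a toy, single-head numpy illustration of the principle (queries at full resolution attend to keys/values pooled at several rates), not the paper's actual SSA implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pool_tokens(x, rate):
    # Merge neighbouring tokens by average pooling along the sequence:
    # (N, C) -> (N // rate, C). A coarser rate means a larger scale.
    n, c = x.shape
    return x[: n - n % rate].reshape(-1, rate, c).mean(axis=1)

def multiscale_attention(x, rates=(1, 2, 4)):
    """One scale per 'head group': keys/values aggregated at different
    rates, so a single attention layer mixes several token scales."""
    outs = []
    for r in rates:
        kv = pool_tokens(x, r)
        attn = softmax(x @ kv.T / np.sqrt(x.shape[1]))
        outs.append(attn @ kv)
    return np.concatenate(outs, axis=-1)  # concat over scale 'heads'

rng = np.random.default_rng(2)
tokens = rng.standard_normal((16, 8))     # 16 tokens, 8 channels
out = multiscale_attention(tokens)
print(out.shape)  # (16, 24): three scales, 8 channels each
```

Pooling the keys and values also shrinks the attention matrix, which is consistent with the lower computational complexity reported in Table 8 relative to full-resolution attention at the same depth.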

Conclusion
In this work, a new SCD benchmark dataset, Landsat-SCD, is created, which largely complements existing SCD datasets. We benchmark the Landsat-SCD dataset using classical approaches for BCD and SCD tasks. Extensive experimental results show that the proposed dataset is challenging and useful, and it will facilitate future research on effective methods for refined SCD tasks. We then present a novel transformer-based Siamese network, Pyramid-SCDFormer, trained end-to-end from scratch, which surpasses the state of the art for bitemporal remote sensing SCD. Compared with the prior three CNN-based, one attention-based, and two transformer-based networks, the Pyramid-SCDFormer achieves the best performance on the LEVIR-CD, WHU_CD, and Landsat-SCD datasets. Most notably, the SSA is introduced into the pyramid Siamese architecture to effectively capture multiscale context features, achieving precise recognition of multiscale changes and the fine edges of objects in complicated detection scenes.
In future work, we will continue to expand the multi-region dataset to further verify the generalization of CD models, and we hope to promote the development of SCD. Moreover, we will study how to reduce the computational cost while ensuring that the SCD model can still extract refined and multiscale changes in complex scenes.

Disclosure statement
No potential conflict of interest was reported by the author(s).