Scene-level buildings damage recognition based on Cross Conv-Transformer

ABSTRACT Unlike pixel-based and object-based image recognition, a scene-level perspective can improve the efficiency of assessing large-scale building damage. However, the complexity of disaster scenes and the scarcity of datasets are major challenges in identifying building damage. To address these challenges, the Cross Conv-Transformer model is proposed to classify and evaluate the degree of damage to buildings using aerial images taken after earthquakes. We employ Conv-Embedding and Conv-Projection to extract features from the images. The integration of convolution and Transformer reduces the computational burden of the model while enhancing its feature extraction capabilities. Furthermore, a two-branch Conv-Transformer architecture with global and local attention is designed, allowing the branches to focus on global and local features, respectively. A cross-attention fusion module merges feature information from the two branches to enrich the classification features. Finally, we use aerial images captured after the Beichuan and Yushu earthquakes as both the training and test sets to assess the model. The proposed Cross Conv-Transformer model improved classification accuracy by 4.7% and 2.1% compared with ViT and EfficientNet, respectively. The results show that the Cross Conv-Transformer model significantly reduces misclassification between the severely and moderately damaged categories.


Introduction
Earthquakes pose a significant threat to human society worldwide, resulting in substantial environmental damage, casualties, and property losses (Chen, Wang, and Xiao 2018). Accurately assessing and mapping the damage to buildings after an earthquake is crucial for the prompt and precise allocation of rescue resources (Duarte, Nex, and Kerle 2020).
From the perspective of data, building damage information can be extracted from several kinds of remote sensing data, including optical data (Fan et al. 2019b), synthetic aperture radar (Adriano et al. 2019), and LiDAR data (Wang and Li 2020). Medium-resolution optical satellite images can provide an overview of the damage caused by earthquake disasters on a large scale. However, their limited resolution prevents the detection of finer-scale earthquake damage (Fan et al. 2021). Aerial images, by contrast, provide clear and detailed information because of their higher spatial resolution, which makes them the most effective data for classifying damaged buildings after an earthquake.
In terms of damage detection methods, there are both single-phase and multi-phase classification approaches. Multi-phase methods primarily rely on detecting changes between pre- and post-disaster data to identify damage information (Akhmadiya et al. 2020). In contrast, single-phase methods use only post-disaster data to identify damage information (Settou, Kholladi, and Ben Ali 2022). With multi-phase methods, it is difficult to obtain data of consistent quality before and after the disaster, and preparing such data is meticulous and time-consuming. Single-phase methods circumvent issues caused by differences in acquisition periods, weather conditions, and background factors that can significantly affect the accuracy of image classification. Hence, single-phase remote sensing image information extraction proves to be a more effective approach.
Machine learning has been used to extract disaster loss information from remote sensing images. Bialas et al. used the random forest method to extract buildings from high-resolution aerial images (Bialas, Oommen, and Havens 2019). The results showed that a machine learning algorithm can maintain relatively stable segmentation performance for a given task as long as the features used for classification are correctly selected. However, such methods require manual feature design and encounter difficulties in model training (Naito et al. 2020; Mangalathu et al. 2020). Convolutional neural networks (CNNs), as powerful deep learning structures, can automatically extract rich hierarchical features from satellite images. Consequently, several remote sensing image methods based on CNN frameworks have been developed (Ma et al. 2019; Zhu et al. 2017). Gebrehiwot et al. adopted a VGG convolutional neural network architecture to classify UAV images acquired after a flood disaster (Gebrehiwot et al. 2019). The experiments revealed that the deep convolutional neural network is superior to a support vector machine (SVM) classifier in flood area classification. Similarly, Yang et al. used a variety of CNN models with transfer learning to identify earthquake-damaged buildings. Among them, DenseNet121 achieved the best performance in the classification task. However, the recognition accuracy remained below 90%, and the models did not classify the levels of damage to collapsed buildings (Yang, Zhang, and Luo 2021). To address this issue, Prashath et al. developed a lightweight CNN for extracting damage information from UAV images, achieving a model accuracy of 91% (Prashath, Priyadharshini, and Lakshmi 2021). However, the receptive field of a CNN is limited, and increasing the depth of the model to enlarge the receptive field often results in information loss.
Some researchers have instead improved the identification accuracy of building damage from the perspective of the dataset. Wang et al. developed a building damage classification method that includes building localization and addresses the imbalanced sample distribution (Wang, Alvin Wei, and Zhang 2022). The results show that the architecture could identify building damage well. Moreover, other researchers have used an attention-based strategy to classify damaged buildings at the pixel level (Liu et al. 2022). Shen et al. introduced a cross-directional attention module to explore the correlation between pre-disaster and post-disaster images, proposing a two-stage convolutional neural network called BDANet for building damage assessment. The model achieved state-of-the-art performance on the xBD dataset (Shen et al. 2022). Shi et al. proposed an improved YOLOv4 model to detect objects using only the aerial image datasets of collapsed buildings after the Beichuan and Yushu earthquakes, and the extraction accuracy reached 93.76% (Shi et al. 2021). However, compared with the scene-level perspective, all these methods are complex because they first require determining the footprint of each building. Furthermore, when assessing post-disaster building damage, relying solely on the perspective of objects and pixels often leads to inaccurate positioning of detection frames, fragmented pixel classification results, and inefficient model training.
The Transformer has achieved remarkable results in the field of natural language processing (NLP) (Vaswani et al. 2017). It has several advantages over CNNs, including parallel computing, a global view, and flexible stacking. Additionally, it can capture global context information, establish long-range dependencies, and extract more powerful features. For instance, Hong et al. re-examined hyperspectral image classification from the sequence perspective of the Transformer and introduced a new backbone network called SpectralFormer, which achieved high accuracy in hyperspectral image classification (Hong et al. 2021). Jia and Wang presented a new multi-scale convolutional embedding module for hyperspectral images to efficiently extract spatial-spectral information. This module can be effectively combined with the Transformer to leverage unlabeled data for training (Jia and Wang 2022). Thus, harnessing the potential of the Transformer can significantly enhance a model's capability to identify scenes depicting damage caused by natural disasters, even against complex backgrounds.
The primary contributions of this paper are as follows: (1) An aerial image dataset is created for the classification of scenes depicting damaged buildings using data augmentation and noise addition.
The rest of this article is organized as follows: Section 2 introduces the Transformer mechanism in detail and its most recent research applications in disaster information extraction. Section 3 introduces the study area, the dataset creation process, and the details of our method. Section 4 introduces the experimental design, the evaluation metrics, and the experimental results. Section 5 summarizes the full text and discusses future research directions. Section 6 presents the conclusions of the paper.

Related work
Following the successful application of Transformers in speech recognition and machine translation, Transformer-based networks have also found their way into the field of computer vision (Radford and Tim 2021). The pioneering network that fully embraces the Transformer architecture for image classification is the Vision Transformer (ViT) (Dosovitskiy et al. 2020). In this model, the input image is split into fixed-size image patches, which are then mapped to a sequence of tokens through linear projection. Position embeddings are added to the sequence, and the resulting representation is passed through the Multi-Head Self-Attention mechanism for global attention modeling. Subsequently, the output is forwarded to the Head module for classification. Research findings demonstrate that the ViT model achieves classification accuracy on the ImageNet dataset comparable to that of CNN-based classification models. For instance, Bazi et al. employed the ViT model to classify datasets of remote sensing scenes (Bazi et al. 2021). The results underscore the ViT model's proficiency in extracting multi-channel characteristics from remote sensing images and accurately discerning them.
As illustrated in Figure 1, the ViT model consists of three main components: an embedding layer, an encoder, and the final head. Initially, the input image is divided into fixed-size image patches, and each patch is flattened into a one-dimensional vector. These flattened patches are then converted into tokens by a linear projection, and an additional class token is introduced to encode category information. Notably, the linear projection discards the position of each image patch with respect to the original image. To address this, a positional embedding is added to the token representation after the linear projection. The learned positional embeddings encode the relative positions of the patches, which can be seen from the cosine similarity between them: patches sharing the same row or column exhibit high similarity. Subsequently, all the tokens are fed into the Transformer encoder, followed by the MLP Head, for classification.
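As a rough illustration of the embedding steps described above, the following NumPy sketch splits an image into patches, projects them linearly, prepends a class token, and adds position embeddings. The 16-pixel patch size, the 64-dimensional embedding, and the random weights are assumptions chosen for illustration; they are not the configuration of any model discussed in this paper.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)   # (num_patches, patch*patch*C)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for one aerial sample
patches = patchify(image, 16)                 # 14 x 14 = 196 patches of length 768

embed_dim = 64                                # hypothetical embedding width
w_proj = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
cls_token = np.zeros((1, embed_dim))          # a learnable parameter in a real model
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0] + 1, embed_dim))

# Linear projection, class token prepended, position embedding added.
tokens = np.concatenate([cls_token, patches @ w_proj], axis=0) + pos_embed
print(tokens.shape)
```

The resulting sequence has one more row than the number of patches because of the class token.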
The most crucial component of the Transformer architecture is the Transformer Encoder (Voita et al. 2020), which comprises the Multi-head Self-Attention (MSA) module and the Feed Forward Network (FFN) (Mangan and Alon 2003; Xiong et al. 2020). The MSA module, serving as the core of the Transformer, consists of a linear layer, a self-attention layer, and a concatenation layer. The process begins by converting a clipped 2D image into a token sequence, denoted as X, by passing it through the linear projection layer and incorporating position information. Subsequently, three learnable weight matrices are introduced for the query (Q), key (K), and value (V). Multiplying X by these matrices yields Q, K, and V, and the MSA module weights the information in V according to the dot products between Q and K, so that the information with the highest weight dominates the output. This operation establishes global connections among all image patches, enabling the identification of the relative importance of a patch embedding compared to others in the sequence. Consequently, the MSA module determines the focal point of the visual task by establishing the center of attention.
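The attention computation described above can be sketched for a single head as follows. This is the standard scaled dot-product formulation, written under the assumption of random weights and small, arbitrary dimensions; the function name is hypothetical.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])             # pairwise token affinities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights


rng = np.random.default_rng(0)
num_tokens, dim = 5, 8                                  # illustrative sizes only
x = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))

out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over all tokens, which is what gives every patch a global view of the sequence. A multi-head module runs several such heads in parallel and concatenates their outputs.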
Research on natural disaster information extraction based on the Transformer has been gaining increasing attention. Roy et al. introduced the Flood-Transformer model, which represents the first vision-transformer-based model capable of detecting and segmenting flood areas from aerial images (Roy et al. 2022). This model employed the SWOC Flood method to segment the dataset and achieved a mean Intersection over Union (mIoU) score of 0.93, surpassing other existing methods. Furthermore, Mohammadian and Ghaderi proposed the SiamixFormer model, which consists of two transformer encoders and takes both pre- and post-disaster images as input (Mohammadian and Ghaderi 2022). The outputs from each stage in both encoders are fed into a temporal transformer for feature fusion, which generates queries, keys, and values from the pre- and post-disaster images. Additionally, the model incorporates temporal features in the fusion process. The SiamixFormer model was evaluated on the xBD building disaster dataset for building change detection and demonstrated superior performance compared to state-of-the-art models.
Although the Transformer has shown promising results in image classification, it requires more computational resources than CNN models. The CNN structure leverages spatial subsampling and weight sharing to capture information that is invariant to shifting, scaling, and distortion. This characteristic provides an advantage that is not present in the Transformer model. Furthermore, the hierarchical structure of convolution allows the model to consider different levels of local spatial context information in damaged buildings. This includes capturing simple low-level edge features as well as high-level texture and semantic information (C. F. Chen et al. 2019). To address the limitations of the Transformer and enhance its performance, recent studies have proposed various variants of the Vision Transformer (ViT) model. Some of these approaches incorporate distillation techniques for data-efficient training of vision transformers (Touvron et al. 2020), while others adopt the pyramid structure of CNNs to leverage its benefits (W. Wang et al. 2021). Among them, the approach of integrating CNN and Transformer offers the benefits of straightforward design and efficient training, which has made it a current research focus. For instance, Rosso et al. combined ViT and ResNet to create an AI-driven framework for the automated hierarchical classification of road tunnel defects, with the aim of improving the efficiency of this potent indirect measurement approach (Rosso et al. 2023). Zhang et al. introduced a purely data-driven deep learning model, EPT, to mine potential crustal and tectonic movement patterns from global historical earthquake catalog data. By employing multi-head self-attention from ViT, it captures long-term dependency relationships within regional time series, highlighting the connections between salient features and mitigating the difficulty that Long Short-Term Memory (LSTM) networks have in focusing on long-term information in extended time series (Zhang et al. 2023).
In contrast to the aforementioned methods, we present a novel two-branch model that incorporates both local and global attention mechanisms to extract multi-scale features. Our approach aims to leverage the strengths of both convolutional and transformer models in image classification, as well as harness the effectiveness of multi-scale feature fusion in visual tasks.

Material
On May 12, 2008, a magnitude 8.0 earthquake struck Wenchuan, Sichuan Province. On April 14, 2010, a magnitude 7.1 earthquake occurred in Yushu, Qinghai Province. Both earthquakes caused a considerable number of buildings to collapse, as well as many deaths and major economic losses. Beichuan County was one of the places most severely affected by the Wenchuan earthquake. Most masonry structures in Beichuan were damaged in various ways, including wall cracking, partial collapse, and complete collapse. In our research, remote sensing images captured by aerial photography on the second day after the Yushu and Beichuan earthquakes were chosen as the data source, with an image resolution of 0.5 meters. The geographical location and aerial images of the study area are shown in Figure 2. The selected images cover the entire city area and contain a large number of damaged and undamaged buildings, which provides better data support for deep learning training.
The main reasons for selecting post-earthquake aerial images of Beichuan and Yushu as the image data source are as follows. (1) The structures of most buildings in Beichuan and Yushu are different. Most of the damaged buildings in Beichuan are masonry structures, and most of the damaged buildings in Yushu are concrete structures. The two structure types appear differently in remote sensing images once damaged. (2) The backgrounds of the two regions are quite different. The vegetation in the Beichuan area is relatively lush, so the buildings are often surrounded by vegetation. Yushu is located in an area with sparse vegetation, and the color of damaged buildings is usually similar to the surrounding background, which makes accurate classification difficult. Therefore, using data from these two places for training and testing helps verify the model's robustness.
Furthermore, because of the severe damage caused by the earthquakes in Beichuan and Yushu, most buildings adjoined one another without clear boundaries. Therefore, in the dataset design, group building scenes were chosen as the units for recognizing overall damage levels. This approach also enhances the efficiency of large-scale building damage recognition.
According to the seismic damage assessment standards issued by the State Seismological Administration of China and the actual situation of building damage in the two earthquakes, we focus on the levels of building damage. The sample categories were defined through visual interpretation, expert knowledge, and on-site investigation. Damaged buildings are then divided into three levels based on the damage rate of the buildings in the images, as listed in Table 1, where the damage rate is the ratio of the number of damaged buildings to the total number of buildings.
Because buildings in the Beichuan and Yushu areas are densely distributed, interleaved, and overlapping, it is challenging to distinguish the damage category of individual buildings. There is no clear demarcation between building fragments, which further complicates identification. Additionally, the severe damage is concentrated in specific areas after the two earthquakes. To address these challenges in post-earthquake building damage scene identification, we propose a sample generation method that uses groups of buildings as the unit for scene identification. To ensure the quality of the generated samples, we follow a step-by-step approach. First, we use the administrative boundary vectors of Beichuan and Yushu to extract the aerial image of the region of interest. Next, the road vector data of the same areas are used to divide the aerial image into blocks. We then apply a fixed-size sliding window to extract slice images from each block. A slice image is selected as a sample only if the building area within it exceeds 50% of the entire image area. This criterion ensures that all sample images contain a sufficient number of buildings.
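The slice selection rule can be sketched as below. The sketch assumes that building coverage is available as a binary mask aligned with the image and that the window advances by its own width; neither detail is specified in the text, and the function name is hypothetical.

```python
import numpy as np


def select_slices(building_mask, window=224, stride=224, min_ratio=0.5):
    """Return top-left corners of windows whose building share exceeds min_ratio.

    building_mask: 2D array with 1 (or True) where a pixel belongs to a building.
    stride: assumed equal to the window size (non-overlapping slices).
    """
    height, width = building_mask.shape
    kept = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            tile = building_mask[top:top + window, left:left + window]
            if tile.mean() > min_ratio:      # building area / slice area
                kept.append((top, left))
    return kept


# Toy block of 448 x 448 pixels with two partially built-up quadrants.
mask = np.zeros((448, 448), dtype=bool)
mask[0:224, 0:224] = True                    # fully covered by buildings
mask[224:448, 0:112] = True                  # exactly half covered
print(select_slices(mask))
```

Only the first quadrant is kept here, because the rule requires the building share to exceed 50% rather than merely reach it.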
This research classifies the damage degree of groups of damaged buildings into three levels, based on the actual damage observed in the two earthquakes. The classification process relies on the overall and local image features of the group buildings following the earthquake, employing a block assessment method.
In the classification process, the collapse rate of all buildings within each block is assessed comprehensively. This assessment provides an indication of the damage degree of the buildings within the block. Subsequently, all image slice samples within the block are labeled according to the corresponding damage category. Table 2 lists the block-level collapse rate, which is the ratio of the number or area of collapsed buildings to the total number or area of buildings in the entire block.
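A minimal sketch of the block-level labeling step follows. The cut-off values 0.3 and 0.6 are placeholders invented for illustration only; the actual thresholds are those of Table 2, which is not reproduced in this text. The function names and slice identifiers are likewise hypothetical.

```python
def collapse_rate(collapsed, total):
    """Ratio of collapsed buildings (by count or by area) to all buildings in a block."""
    if total <= 0:
        raise ValueError("a block must contain at least one building")
    return collapsed / total


def damage_level(rate, slight_max=0.3, moderate_max=0.6):
    """Map a collapse rate to a damage level.

    slight_max and moderate_max are PLACEHOLDER thresholds, not the paper's values.
    """
    if rate <= slight_max:
        return "slight"
    if rate <= moderate_max:
        return "moderate"
    return "severe"


def label_block(collapsed, total, slice_ids):
    """Give every slice in a block the damage level of the block as a whole."""
    level = damage_level(collapse_rate(collapsed, total))
    return {slice_id: level for slice_id in slice_ids}


print(label_block(collapsed=7, total=10, slice_ids=["block3_s0", "block3_s1"]))
```

Labeling by block rather than by slice means that neighboring slices share one label, which matches the block assessment method described above.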
Finally, to account for the large number of buildings in the study area and the memory requirements for model training, the sample partition size within each block is set to 224 × 224. We divide the dataset into four categories; Table 3 lists the number of samples in each part of the sample set. There are 5560 severely damaged, 5272 moderately damaged, 5165 slightly damaged, and 5046 negative samples, for a total of 21,043 samples. To ensure the model's comprehensive grasp of damaged building attributes across diverse locations, we carefully curated an equal number of sample images for each building damage category from the datasets of the two earthquake disasters. This approach bolsters sample representativeness and mitigates challenges arising from imbalanced sample categories during model training.

Methods
In previous studies, modifications were made to the Transformer block by incorporating convolution. This involved either replacing the multi-head attention with a convolutional layer or introducing an additional convolutional layer within the Transformer sequence structure to capture local relationships (Gulati et al. 2020). In contrast, our approach is inspired by recent advancements that introduce convolution into the Transformer network, specifically in two key aspects of the vision Transformer (Wu et al. 2021). First, we use convolution instead of the existing linear projection when performing attention operations. Second, we design a hierarchical structure that generates patch tokens with different resolutions. This approach significantly reduces the computational load of the linear projection in the Vision Transformer (ViT) and enhances the efficiency of model training.
As illustrated in Figure 3, the Conv-Transformer model comprises three stages, each consisting of two components: Conv-Token Embedding and Conv-Transformer. In the initial stage, instead of using the embedding operation of the ViT model, the image is fed into the Conv-Token Embedding layer, and the resulting two-dimensional token map is reshaped into a sequence to be processed by the subsequent layers. A normalization layer is then applied to the tokens. This design enables the Conv-Transformer structure to progressively reduce the number of tokens at each stage while increasing the token width. This process achieves spatial down-sampling and enriches the feature representation (Touvron et al. 2020; Wu et al. 2021).
Subsequently, in the Conv-Transformer part, convolution is employed to perform Conv-Projection operations. These operations create the embeddings for the query, key, and value (Yuan et al. 2021). It is worth noting that the class token is only added in the final stage. Lastly, the predictions for the samples' classes are made using the MLP (fully connected layer) head.
To enhance the feature extraction capability, we incorporate the convolution operation into the Transformer network by utilizing the Conv-Token Embedding layer and Conv-Projection within the multi-head self-attention module.
To be specific, the Conv-Token Embedding operation aims to capture local spatial context information, ranging from low-order edge details to high-order semantic information. It follows a multi-stage hierarchical approach similar to that of CNNs. In this operation, an image or the token map from the previous stage is taken as input, and the Conv-Token Embedding operation generates a new token map. The resulting token map is then flattened into a one-dimensional sequence and passed on to the subsequent Transformer part. The Conv-Token Embedding layer allows the dimension and number of tokens at each stage to be adjusted through the convolution parameters. As depicted in Figure 3, in stage 1 the Conv-Token Embedding uses a convolution kernel size c = 7, a number of convolution kernels s = 64, and a stride p = 4. In stage 2, the convolution kernel size is c = 3, the number of convolution kernels is s = 192, and the stride is p = 2. In stage 3, the convolution kernel size is c = 3, the number of convolution kernels is s = 384, and the stride is p = 2. By applying the Conv-Token Embedding layer in each stage, the token sequence length is reduced while the token dimensionality is increased. This enhances the ability of each layer's tokens to represent complex visual patterns across a large spatial range.
Figure 4 illustrates the implementation details of the Conv-Projection in our Conv-Transformer structure. Initially, the token sequence is reshaped into a two-dimensional token map. Subsequently, Conv-Projection is performed using a convolution operation with a convolution kernel size (S) of 3. The number of convolution kernels employed in this operation is identical to that of the Conv-Token Embedding used in the corresponding stage. Finally, the token map, after undergoing Conv-Projection, is flattened into one-dimensional sequences forming the query, key, and value components. This processed token sequence then proceeds to the subsequent Conv-Transformer stage for further processing.
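The stage-wise Conv-Token Embedding settings given above determine how the token count shrinks and the token width grows. The short calculation below traces this for a 224 × 224 input. The padding of half the kernel size is an assumption (the text does not state the padding), and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# Kernel size c, number of kernels s, and stride p for the three stages.
STAGES = [
    {"kernel": 7, "channels": 64, "stride": 4},
    {"kernel": 3, "channels": 192, "stride": 2},
    {"kernel": 3, "channels": 384, "stride": 2},
]


def token_shapes(size=224):
    """Return (number of tokens, token dimension) after each stage."""
    shapes = []
    for stage in STAGES:
        padding = stage["kernel"] // 2       # ASSUMED padding, not given in the text
        size = conv_out(size, stage["kernel"], stage["stride"], padding)
        shapes.append((size * size, stage["channels"]))
    return shapes


for index, (tokens, dim) in enumerate(token_shapes(), start=1):
    print(f"stage {index}: {tokens} tokens of dimension {dim}")
```

Under these assumptions the token map goes from 56 × 56 to 28 × 28 to 14 × 14, so the sequence length falls by a factor of four per stage while the channel width rises from 64 to 384.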
By incorporating Conv-Token Embedding and Conv-Projection into each stage, we have devised the Conv-Transformer structure (Graham et al. 2021). This design eliminates the need for a separate position embedding module, thereby simplifying the design of visual tasks that involve variable input resolutions. The Conv-Transformer structure effectively captures local spatial context and enables the model to handle varying input sizes without relying on explicit position information.
Furthermore, the incorporation of multi-branch methods in CNN networks has been shown to be effective in capturing features at different scales, thereby enhancing feature richness (Shocher et al. 2020). This approach has found success in various computer vision tasks, including object detection and recognition (Knyaz, Kniaz, and Remondino 2018). For instance, Fan et al. proposed a two-branch feature extraction network architecture called bLVNet-TAM, which achieved promising results in video action recognition tasks (Fan et al. 2019a). While the use of multi-scale feature representations is well established in CNN models, its application in Transformers is relatively limited. Therefore, in our research, we adopt a two-branch Transformer structure to classify and analyze the scenes of damaged buildings after earthquakes, leveraging the benefits of multi-scale feature extraction.
On the other hand, the size of the token patch has an impact on the accuracy and complexity of ViT. For example, when the patch size is 16, the performance of ViT is 6% better than with a patch size of 32, but it uses more storage resources. Therefore, to take advantage of finer-grained patches while balancing complexity, we introduce a two-branch Transformer. Each branch operates on a distinct scale (that is, a distinct patch size in patch embedding), and a simple and effective module is proposed to fuse information between the branches.
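The cost side of this trade-off can be made concrete with a small calculation, assuming a 224 × 224 input (the sample size used in this study) and non-overlapping patches. The helper name is hypothetical, and the count of query-key pairs is only a proxy for the cost of full self-attention.

```python
def vit_token_stats(image_size, patch_size):
    """Token count and number of query-key pairs for full self-attention."""
    tokens = (image_size // patch_size) ** 2
    return {"tokens": tokens, "attention_pairs": tokens * tokens}


fine = vit_token_stats(224, 16)
coarse = vit_token_stats(224, 32)
print(fine, coarse)
print(fine["attention_pairs"] / coarse["attention_pairs"])
```

Halving the patch size quadruples the number of tokens and multiplies the attention map size by sixteen, which is why running the fine-grained scale in only one of two branches helps to contain the cost.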
To sum up, we design a two-branch model: (1) Big-Branch: this branch employs a larger patch size, more transformer encoders, and a larger embedding size. (2) Small-Branch: this branch has a smaller patch size, fewer encoders, and a smaller embedding size. After the outputs of the two branches are merged, the CLS tokens at the ends of the two branches are used for prediction.
Moreover, to enable the model to capture both global and local information within the image across the two branches, we propose the incorporation of global and local attention mechanisms. These mechanisms are derived from the original Self Attention model and are implemented in the Big and Small branches, respectively.
In Figure 5(a), the Self Attention mechanism divides the image into fixed-size patches and applies attention to capture the relationships between all patches. However, the useful attention is often concentrated on only a small portion of the total image area, so this approach incurs redundant computation and potential interference from irrelevant features. To address these issues, we introduce Local Attention in the Conv-Transformer part of each stage within the Big-branch. Here, the token feature map after Conv-Embedding and Conv-Projection is reshaped into a tensor of shape (H/L × W/L, L × L, C), as illustrated in Figure 5(b). The tokens are thus divided into windows of L × L tokens, and Self Attention is applied within each window, giving H/L × W/L independent attention windows. Additionally, to enhance feature representation, Global Attention is employed in the Small-branch. As depicted in Figure 5(c), we apply a G × G uniform grid to the token map, reshaping it to (G × G, H/G × W/G, C), followed by Self Attention within this sparse global grid. By using local windows and a dilated global grid (L = G = 4), our approach captures information from both local and global perspectives while balancing the computation between the two. Importantly, these methods exhibit linear complexity with respect to spatial size or sequence length, thereby reducing computational complexity.
However, previous two-branch models simply concatenate information from the two branches and feed it to the subsequent classifier. This approach fails to consider the correlation and information redundancy between the branches, resulting in reduced classification efficiency and performance. To address this limitation, we introduce the Cross Attention module, which allows for the fusion of information from the two branch transformers.
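The local window and global grid partitions described for the Big and Small branches can be expressed as two reshaping operations. The sketch below is one plausible reading of the tensor shapes given in the text, not the authors' implementation; the function names are hypothetical, and the grid result is transposed so that each row is one attention group of G × G tokens.

```python
import numpy as np


def window_partition(x, L):
    """Group an (H, W, C) token map into contiguous L x L windows.

    Returns (H/L * W/L, L*L, C): attention is applied over axis 1 of each window.
    """
    H, W, C = x.shape
    x = x.reshape(H // L, L, W // L, L, C)
    x = x.transpose(0, 2, 1, 3, 4)           # (H/L, W/L, L, L, C)
    return x.reshape(-1, L * L, C)


def grid_partition(x, G):
    """Group an (H, W, C) token map into sparse groups on a G x G uniform grid.

    Returns (H/G * W/G, G*G, C): the tokens of a group are H/G rows and
    W/G columns apart, so each group spans the whole map.
    """
    H, W, C = x.shape
    x = x.reshape(G, H // G, G, W // G, C)
    x = x.transpose(1, 3, 0, 2, 4)           # (H/G, W/G, G, G, C)
    return x.reshape(-1, G * G, C)


# An 8 x 8 map whose single channel holds the flat index of each token.
token_map = np.arange(64).reshape(8, 8, 1)
windows = window_partition(token_map, 4)
grids = grid_partition(token_map, 4)
print(windows[0, :8, 0])                     # neighbouring tokens
print(grids[0, :8, 0])                       # tokens spread over the map
```

In both cases attention runs over a fixed number of tokens per group (L × L or G × G), so the total cost grows linearly with the number of groups, in line with the complexity claim above.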
In order to effectively fuse and integrate information from the two Conv-Transformer branches operating at different scales, we adopt a Cross Attention token fusion approach in our study (Chen, Fan, and Panda 2021). The underlying concept of the Cross Attention module is illustrated in Figure 6, where it involves the interaction between the class token of one branch and the patch tokens of the other branch. To facilitate the integration of multi-scale features, we use the class token of each branch as an agent to exchange information with the patch tokens of the other branch, and subsequently incorporate this information back into its own branch. Given that the class token has acquired abstract knowledge shared by all patch tokens of its branch, the interaction with the patch tokens from the other branch contributes to capturing information at diverse scales (Huang et al. 2020). Once the class token has fused with the tokens of the other branch, it interacts with its own patch tokens in the subsequent transformer encoder. This allows the class token to pass the information learned from the other branch on to its own patch tokens, thereby enhancing the representation of each patch token.
Figure 7 shows the details of the Cross Attention method. Specifically, for the Big-branch, it first collects the patch tokens from the Small-branch and then concatenates them with its own class token, as shown in equation (1). Let x^i be the token sequence of branch i (including the patch and CLS tokens), where i can be the Big-branch or the Small-branch, and let x^i_cls represent the class token of branch i.
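The equations referenced in this passage are not reproduced in the text as it stands. Assuming that the module follows the cross-attention formulation of Chen, Fan, and Panda (2021), which this section cites, they would take roughly the following form for the Big-branch (superscript l) attending to the Small-branch (superscript s). The numbering (1) to (4) is inferred from the surrounding references and is itself an assumption.

```latex
\begin{align}
\mathbf{x}'^{l} &= \left[\, f^{l}(\mathbf{x}^{l}_{cls}) \,\Vert\, \mathbf{x}^{s}_{patch} \,\right] \tag{1}\\
\mathbf{q} &= \mathbf{x}'^{l}_{cls}\mathbf{W}_q,\quad
\mathbf{k} = \mathbf{x}'^{l}\mathbf{W}_k,\quad
\mathbf{v} = \mathbf{x}'^{l}\mathbf{W}_v \tag{2}\\
\mathbf{A} &= \mathrm{softmax}\!\left(\frac{\mathbf{q}\mathbf{k}^{\top}}{\sqrt{C/h}}\right),\quad
\mathrm{CA}(\mathbf{x}'^{l}) = \mathbf{A}\mathbf{v} \tag{3}\\
\mathbf{y}^{l}_{cls} &= f^{l}(\mathbf{x}^{l}_{cls}) + \mathrm{MCA}\!\left(\mathrm{LN}\!\left(\left[\, f^{l}(\mathbf{x}^{l}_{cls}) \,\Vert\, \mathbf{x}^{s}_{patch} \,\right]\right)\right),\quad
\mathbf{z}^{l} = \left[\, g^{l}(\mathbf{y}^{l}_{cls}) \,\Vert\, \mathbf{x}^{l}_{patch} \,\right] \tag{4}
\end{align}
```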
where f^l(·) is a function used for alignment of dimensions. Then, because the information from the patch tokens is to be fused into the class token, Cross Attention (CA) is performed between the class token and the patch tokens, with the class token as the sole query. In the equations for CA, W_q, W_k, and W_v are learnable parameters, and C = 192 and h = 6 are the embedding dimension and the number of heads. Because only the class token is used as the query, the computational and memory complexity of generating the attention map is linear, which improves the overall efficiency of the process. In addition, just as with self-attention in the Transformer, we use multiple heads in Cross Attention (MCA). Equation (4) gives the Cross Attention mechanism with layer normalization and a residual connection. Here, f^l(·) and g^l(·) are the projection function for dimension alignment and the back-projection function, respectively. After the Cross Attention module, we normalize the output and add it to the input of the previous layer to form a residual connection, which gives the final result z^l. Finally, the fused class token is selected and passed through the MLP Head for image scene classification.
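A compact NumPy sketch of this fusion step is given below, using C = 192 and h = 6 from the text. It is a simplified illustration rather than the authors' implementation: the projection functions and layer normalization are omitted, both branches are assumed to be already aligned to the same dimension, the weights are random, and the token count of the other branch (49) is arbitrary.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(cls_token, other_patches, w_q, w_k, w_v, num_heads):
    """Multi-head cross attention with the class token as the only query.

    cls_token: (1, C) class token of one branch.
    other_patches: (N, C) patch tokens of the other branch.
    Returns the attended token (1, C) and the attention map (h, 1, N + 1).
    """
    x = np.concatenate([cls_token, other_patches], axis=0)   # (N + 1, C)
    c = x.shape[1]
    d = c // num_heads                                        # per-head width C / h
    q = (cls_token @ w_q).reshape(1, num_heads, d).transpose(1, 0, 2)
    k = (x @ w_k).reshape(-1, num_heads, d).transpose(1, 0, 2)
    v = (x @ w_v).reshape(-1, num_heads, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))     # one row per head
    out = (attn @ v).transpose(1, 0, 2).reshape(1, c)         # heads concatenated
    return out, attn


rng = np.random.default_rng(0)
C, h, N = 192, 6, 49                                          # N is illustrative
cls_big = rng.normal(size=(1, C))
patches_small = rng.normal(size=(N, C))
w_q, w_k, w_v = (rng.normal(scale=0.02, size=(C, C)) for _ in range(3))

out, attn = cross_attention(cls_big, patches_small, w_q, w_k, w_v, h)
fused_cls = cls_big + out                                     # residual connection
print(fused_cls.shape, attn.shape)
```

Because there is a single query, the attention map has one row per head instead of N + 1 rows, which is the source of the linear cost noted above.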

Evaluation metrics
The experimental results of the baseline models and our proposed Cross Conv-Transformer model are evaluated using Overall Accuracy (OA) and the confusion matrix. OA is the ratio of the number of correctly classified images to the total number of images in the test set, computed after model training is complete (Foody 2020). OA serves as the primary metric for characterizing the image classification performance of the model, with values ranging from 0 to 1; higher values indicate better classification performance.
Additionally, the confusion matrix provides detailed information about the correctly classified and misclassified instances for each class (Xu, Zhang, and Miao 2020). It is a tabular representation in which columns represent the predicted class of the instances and rows represent the actual class of the instances. Each element $X_{ij}$ in the matrix represents the number of images that actually belong to the $i$th category (row) and are predicted to belong to the $j$th category (column).
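As a minimal sketch of how these two metrics relate, the snippet below builds a confusion matrix with rows as actual classes and columns as predicted classes and derives OA from its diagonal. The class names and the toy label lists are hypothetical and are not taken from the dataset.

```python
def confusion_matrix(actual, predicted, classes):
    """Return a matrix whose rows are actual classes and columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def overall_accuracy(matrix):
    """OA = correctly classified images (diagonal) / all images."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for six test images.
classes = ["severe", "moderate", "slight"]
actual = ["severe", "severe", "moderate", "moderate", "slight", "slight"]
predicted = ["severe", "moderate", "moderate", "moderate", "slight", "moderate"]

cm = confusion_matrix(actual, predicted, classes)
oa = overall_accuracy(cm)
print(cm)             # [[1, 1, 0], [0, 2, 0], [0, 1, 1]]
print(round(oa, 4))   # 0.6667
```

Off-diagonal entries show which pairs of classes are confused, which a single OA value cannot reveal.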

Experiment setting
A GeForce RTX 3080 GPU was used in this experiment. The settings and parameters were adjusted gradually during the training phase. The maximum number of training epochs was set to 200. The optimizer used for training was Adam, and a batch size of 12 was selected. The initial learning rate was set to 1e-4 and was gradually reduced during training. The momentum was set to 0.85, and the weight decay coefficient was set to 0.005. During training, the loss function was monitored, and training was stopped when the loss no longer showed improvement. The trained model weights were then saved for later use during verification and evaluation. Throughout the training process, important evaluation metrics were recorded to assess the performance of the model.
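The stopping rule described above can be sketched as a small early-stopping helper. The `patience` and `min_delta` values and the loss record are hypothetical, since the text gives only the 200-epoch cap and the criterion that training stops once the loss no longer improves.

```python
MAX_EPOCHS = 200  # maximum number of training epochs (from the text)


def stopping_epoch(losses, patience=10, min_delta=0.0, max_epochs=MAX_EPOCHS):
    """Return the 1-based epoch at which training would stop.

    Training stops once the loss has failed to improve on its best value by
    more than `min_delta` for `patience` consecutive epochs, or when the
    epoch cap or the end of the loss record is reached.
    """
    best = float("inf")
    stale = 0
    epoch = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best - min_delta:
            best, stale = loss, 0      # improvement: remember it and reset the counter
        else:
            stale += 1                 # no improvement this epoch
            if stale >= patience:
                break
    return epoch


# Hypothetical loss record: improves for five epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.45] * 50
print(stopping_epoch(losses, patience=10))   # 15
```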

Results
To compare the proposed method with pure CNN and pure Transformer models, ResNet and EfficientNet were selected as representative CNN baselines and ViT as the representative Transformer baseline. We evaluate the performance of the models using the loss and accuracy curves recorded during training and the confusion matrices of the classification results.
After multiple training iterations, the average accuracies of the ResNet, EfficientNet, ViT and Cross Conv-Transformer models ultimately reached 91.34%, 94.89%, 93.64%, and 97.61%, respectively. The changes in accuracy and loss during the training phase are depicted in Figure 9.
The curves show that the Cross Conv-Transformer model consistently outperforms both ViT and EfficientNet throughout the entire training process. Notably, our proposed model exhibits higher accuracy at the beginning and end of training, a narrower fluctuation range in the accuracy curve, faster accuracy improvement, and a shorter training period. Conversely, the ViT model exhibits higher loss values as training approaches convergence. EfficientNet also outperforms ViT, which is likely because a pure Transformer model requires a substantial amount of training data to fully leverage its advantages.
In our study, we employed 95% of the data (19,459 samples) for training and validation. To ensure balanced data categories, an equal amount of data was assigned to each category in the training set. Subsequently, 5% of the data (1,584 samples) was reserved for testing. Notably, the test set contained more positive samples (damaged buildings) than negative samples, a choice that reflects the classification task's emphasis on identifying damaged buildings. The confusion matrices comparing each model's predictions with the ground truth are presented in Tables 4-7. Analysis of the confusion matrices demonstrates that our model outperforms the baseline models.
Among the baseline models in Tables 4-6, ResNet yielded the lowest accuracy, reaching only 91.34%. This can be attributed to its straightforward single-branch convolutional stacking, which causes the model to discard many essential feature details. Specifically, 33 severely damaged samples were misclassified as moderately damaged, while 27 and 25 moderately damaged samples were misclassified as severely damaged and lightly damaged, respectively. Additionally, 21 and 14 lightly damaged samples were misclassified as moderately damaged and severely damaged, respectively.
The ViT model exhibits classification errors primarily between severely damaged and moderately damaged samples, as well as between moderately damaged and slightly damaged samples. Among the misclassifications, 22 severely damaged samples were classified as moderately damaged, 30 and 16 moderately damaged samples were misclassified as severely damaged and slightly damaged, respectively, and 31 slightly damaged samples were misclassified as moderately damaged.
EfficientNet performs slightly better than ViT but still exhibits some misclassifications. Notably, 18 severely damaged samples were misclassified as moderately damaged, 8 and 10 moderately damaged samples were misclassified as severely damaged and slightly damaged, respectively, and 31 slightly damaged samples were misclassified as moderately damaged.
In contrast, our proposed model (Table 7), which combines the strengths of CNN and Transformer, demonstrates superior performance. It effectively reduces misclassifications between moderate damage and severe damage, particularly improving the model's sensitivity in classifying moderate damage. This improvement is particularly meaningful in the classification of building damage scenes after an earthquake. In summary, the training loss and accuracy curves show that the Cross Conv-Transformer model converges faster during the training phase. Compared with the baseline models, the Cross Conv-Transformer model reduces its training loss earlier, and when the loss curve reaches a plateau, it attains higher accuracy than ViT, ResNet, and EfficientNet. These findings are consistent with the lightweight design of our proposed model, which not only facilitates convenient training but also exhibits stronger feature extraction capabilities.
Moreover, examining the confusion matrices of the models on the earthquake disaster buildings damage dataset, we observe that misclassification is most prevalent between severe damage and moderate damage. This can be attributed to the limited disparity in image texture and shape between severely damaged and moderately damaged buildings, as well as the minimal contrast between the background and the color of the damaged structures. Nevertheless, the Cross Conv-Transformer model produces fewer misclassifications than both ViT and EfficientNet.
As illustrated in Figure 10, integrating the recognition outcomes of building damage scenes with the geographic information of the image slices enables a more accurate evaluation of damage conditions in different locations following disasters. In this study, we selected the blocks within the primary urban area of Yushu as the fundamental units for earthquake disaster assessment. The seismic image slices were grouped into individual blocks, and the predominant sample category in each block was taken as the earthquake damage category of that block. The building damage classification standard follows methods used in research on similar regions (Zhao et al. 2013). The experimental results demonstrate that out of the 137 blocks in Yushu's main urban area, the Cross Conv-Transformer model correctly detected the building damage levels in 133 blocks. The accuracy of block-level damage identification reached 97.08%, with a Kappa coefficient of 0.82, surpassing the second-best performing EfficientNet model by 4.35%. False detections mainly occurred between neighborhoods classified as moderately damaged and lightly damaged.
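A minimal sketch of the block-level aggregation, assuming each image slice carries a block identifier and a predicted damage class and that a block takes its most frequent slice class. The block identifiers, labels, and tie-breaking behaviour are hypothetical; only the count of 133 correct blocks out of 137 comes from the text.

```python
from collections import Counter, defaultdict


def block_damage_levels(slice_predictions):
    """Assign each block the most frequent class among its image slices.

    slice_predictions: iterable of (block_id, predicted_class) pairs.
    Ties are resolved in favour of the class seen first in the block.
    """
    per_block = defaultdict(Counter)
    for block_id, label in slice_predictions:
        per_block[block_id][label] += 1
    return {block: counts.most_common(1)[0][0] for block, counts in per_block.items()}


# Hypothetical slice-level predictions for two blocks.
slices = [
    ("block_01", "severe"), ("block_01", "severe"), ("block_01", "moderate"),
    ("block_02", "slight"), ("block_02", "moderate"), ("block_02", "slight"),
]
levels = block_damage_levels(slices)
print(levels)   # {'block_01': 'severe', 'block_02': 'slight'}

# Block-level accuracy reported in the text: 133 of 137 blocks correct.
block_accuracy = 133 / 137
print(f"{block_accuracy:.2%}")   # 97.08%
```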
Subsequently, we conducted ablation experiments to assess the efficacy of the two-branch architecture and the Cross Attention component within our proposed framework. Furthermore, to elucidate the impact of our designed attention and feature fusion mechanism in buildings damage scene recognition, we employed Grad-CAM to visualize the attention feature heat map. The intensity of the heat map corresponds to the level of importance attributed by the model.
Table 8 presents the results of the ablation experiments. Initially, we employed a single-branch structure, which reduced performance by 2.62% compared with the two-branch structure. Subsequently, incorporating Cross Attention as the feature merging component within the two-branch structure improved the accuracy of the scene classification task by 0.82% compared with simply merging the features from the two branches. These findings highlight the effectiveness of the two-branch framework in providing a richer combination of local and global information, consequently enhancing the classification performance of the model. Furthermore, the use of Cross Attention facilitates better integration of features at different scales, surpassing the performance of simple stacking operations.
Figure 10. Block-level distribution map of earthquake damage degree in Yushu.
Figure 11 visually demonstrates the impact of the feature attention and fusion model we designed. As a result of this model, the attention is directed towards the features of damaged areas in the images. In scenes depicting completely collapsed buildings, the model places greater attention on the building debris scattered on the ground. Moreover, the attention mechanism allows the model to focus on both global and local information, enabling more precise delineation of the boundaries of the damaged areas.
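For reference, the Grad-CAM weighting can be sketched in a few lines of NumPy, assuming the feature maps of a chosen layer and the gradients of the class score with respect to them have already been extracted from the network. The array shapes and random inputs are hypothetical placeholders, and the layer visualized in Figure 11 is not specified here.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map.

    activations: (K, H, W) feature maps of the chosen layer
    gradients:   (K, H, W) gradients of the class score w.r.t. those feature maps
    Returns an (H, W) map scaled to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))                   # one importance weight per channel
    cam = np.tensordot(weights, activations, axes=(0, 0))   # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0.0)                              # ReLU keeps positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam


rng = np.random.default_rng(0)
activations = rng.random((8, 7, 7))       # hypothetical: 8 channels on a 7 x 7 map
gradients = rng.normal(size=(8, 7, 7))
heat_map = grad_cam(activations, gradients)
print(heat_map.shape)
```

In practice the resulting map is upsampled to the input image size and overlaid on the image, so brighter regions mark the areas that contributed most to the predicted class.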

Discussion
Image classification plays a crucial role in the field of computer vision and finds significant applications in assessing damaged buildings using remote sensing images. In this research, we propose the Cross Conv-Transformer model for accurately classifying post-earthquake building damage. Our experimental findings demonstrate the strong performance of our model, as well as of the ResNet, EfficientNet and ViT models, in the task of disaster building damage classification. However, despite the overall success, there are instances of misclassification that can be attributed to several factors. One prominent factor is the complex background environment in aerial images following an earthquake. Additionally, damaged building debris and exposed soil look strikingly similar, which increases the likelihood of misclassification. Our experiments reveal that misclassification tends to occur more frequently between buildings with moderate damage and those with severe damage. Notably, our proposed method exhibits superior performance compared to the baseline models, as it effectively reduces misclassification across all damage classes.
To summarize, the Cross Conv-Transformer model demonstrates superior capability in classifying damaged buildings. Our model leverages the inherent advantages of convolutional neural networks to enhance feature extraction for building damage scenes, while also capitalizing on the strengths of Transformers, such as parallel computing, global vision, and flexible stacking. Consequently, the model effectively captures the long-range dependencies between different regions within the building damage images. Furthermore, we employ a two-branch approach, incorporating window-sized attention and grid-sized attention, along with feature fusion, to facilitate the model's comprehensive learning of local and global features, thereby improving its robustness. As the Transformer structure continues to evolve, we anticipate the emergence of more Transformer-based networks in the near future. These networks are expected to possess enhanced feature extraction capabilities and offer more targeted remote sensing high-resolution image classification. Additionally, we aim to continuously enhance our natural disaster remote sensing dataset by expanding its scope to encompass a greater variety of data and disaster types. In future research, we will emphasize exploring the relationship between the degree of building damage and the features extracted by deep learning models, with the objective of improving the model's proficiency in extracting fine-grained building damage features.

Conclusions
In this study, we have created a dataset specifically for buildings damage scene recognition, using aerial images from the Beichuan and Yushu earthquakes. Notably, we have incorporated both Convolutional Neural Networks (CNN) and Transformers into the task of earthquake disaster scene recognition. Our proposed approach is a two-branch Conv-Transformer model that incorporates local and global attention mechanisms. Through extensive experiments, we have observed that the Cross Conv-Transformer model outperforms the baseline models in terms of classification accuracy across different levels of building damage. Furthermore, it exhibits lower loss during the training phase. By analyzing the confusion matrix, we found that the Cross Conv-Transformer model effectively reduces the classification errors between severely damaged and moderately damaged buildings. Moreover, the classification accuracy of the Cross Conv-Transformer model is consistently higher during the training phase. Our research demonstrates the effective application of the Cross Conv-Transformer model for extracting valuable disaster information from aerial images.

Disclosure statement
No potential conflict of interest was reported by the author(s).

(2) A Cross Conv-Transformer model is proposed that leverages the strengths of CNN's feature extraction and Transformer's global attention capabilities by incorporating both global and local attention.
(3) We propose the Cross Attention in the two-branch model, which offers the advantages of linear computation and memory to filter features from different branches.

Figure 1. Map of vision in transformer.

Figure 2. Location of the study areas.

Figure 4. The flowchart of the Conv-Projection structure.

Figure 5. Self Attention (a), Local Attention (b) and Global Attention (c). (Local Attention can only obtain the information of the image in the window, while Global Attention can pay attention to the information of the entire image.)

Figure 6. Cross Attention implementation in detail. (Only the class tokens are fused because the class tokens represent all the information of the branch patches.)

Figure 8 presents the structure of the proposed Cross Conv-Transformer network, incorporating the various modules discussed earlier. The network is composed of two branches: the Big-branch and the Small-branch. In the Big-branch, the input images are divided into patches of size 16 × 16 pixels, while in the Small-branch, the images are divided into patches of size 12 × 12 pixels. The original Transformer encoder is replaced with the previously proposed Conv-Transformer structure. Each branch is further divided into three stages, where each stage consists of a Conv-Embedding and a Conv-Transformer part. The Conv-Embedding operation divides the input image into small patch sequences, which are then processed by the subsequent Conv-Transformer part to extract global features. The Big-branch has 1, 4, and 16 ($N_4$, $N_5$, $N_6$) Conv-Transformer parts in the three stages, respectively, while the Small-branch has 1, 2, and 10 ($N_1$, $N_2$, $N_3$) Conv-Transformer parts. After each stage, a new token map is generated, gradually reducing in size and increasing in dimension. The token map obtained in the third stage is transformed into a sequence through layer normalization and then fed into the previously designed Cross Attention module to fuse the information learned by the two branches, allowing the model to capture global and local features.

Figure 7. Cross Attention implementation in detail. (The class token in the Big-branch interacts with the patch tokens in the Small-branch as the query, and the class token in the Small-branch performs the same operation to complete the interaction of feature information between the two branches.)

Figure 8. Cross Conv-Transformer network structure. (The network is divided into two branches, each composed of a different number of Conv-Transformer parts. The final outputs of the two branches are sent to the Cross-Attention module for information fusion and filtering, and finally to the MLP for classification.)

Figure 9. Accuracy and loss curves of the four models during training.

Table 1. Classification of building damage in aerial images after earthquake.

Table 2. Example images of classification instances.
Background of water bodies, open spaces, etc.

Table 3. Distribution of the sample set.

Figure 3. The flowchart of the Conv-Transformer structure.

Table 4. Confusion matrix of the ResNet.

Table 5. Confusion matrix of the ViT.

Table 6. Confusion matrix of the EfficientNet.

Table 7. Confusion matrix of the Cross Conv-Transformer.

Table 8. Results of the ablation experiments.
Figure 11. Heat map of the model's feature.