A lightweight distillation CNN-transformer architecture for remote sensing image super-resolution

ABSTRACT Remote sensing images exhibit rich texture features and strong autocorrelation. Although super-resolution (SR) methods for remote sensing images based on convolutional neural networks (CNN) can capture rich local information, their limited receptive field prevents them from establishing long-distance dependencies on global information, leading to low reconstruction accuracy. Furthermore, existing SR methods are difficult to deploy on mobile devices due to their large numbers of network parameters and high computational demand. In this study, we propose a lightweight distillation CNN-Transformer SR architecture, named DCTA, for remote sensing SR to address the aforementioned issues. Specifically, the proposed DCTA first extracts coarse features through a coarse feature extraction layer and then learns deep features of remote sensing images at different scales through a feature distillation extraction module that fuses CNN and Transformer. In addition, we introduce a feature fusion module at the end of the feature distillation extraction module to control information propagation, aiming to select the informative components for better feature fusion. The extracted low-resolution (LR) feature maps are reorganized by the up-sampling module into accurate high-resolution (HR) feature maps to generate high-quality HR remote sensing images. Experiments comparing different methods demonstrate that the proposed approach performs well on multiple datasets, including NWPU-RESISC45, Draper, and UC Merced, balancing reconstruction performance and network complexity while achieving competitive subjective and objective results.


Introduction
Remote sensing images are a valuable data source for obtaining ground information, with applications in both civilian and military fields. These images provide rich texture details that are crucial for various tasks, such as change detection (Bai et al. 2022; Zhu et al. 2022), object identification (Jing Wang et al. 2022), land cover classification (Y. Xu et al. 2022; Xue et al. 2022), hyperspectral image classification (Asker 2023; Firat et al. 2023a; Fırat et al. 2023b; Fırat, Asker, and Hanbay 2022), and other fields (R. Li et al. 2022; Yu et al. 2023; Yuan et al. 2023; S. Zhang et al. 2021). However, the obtained images are often low-resolution (LR) due to limitations in transmission conditions and imaging equipment. Increasing the resolution of remote sensing images through hardware can be both time-consuming and costly. With the proposal of the comprehensive positioning, navigation, and timing system (PNT system) (Prol et al. 2022), remote sensing algorithms for mobile terminals and edge devices have become a new research direction. Therefore, it is particularly important to design a lightweight super-resolution (SR) reconstruction technique from a software perspective to improve the resolution of remote sensing images.
SR reconstruction techniques refer to the reconstruction of high-quality HR images from observed low-quality LR images via specific algorithms. Existing SR reconstruction techniques are divided into three major categories, i.e. interpolation (Lei Zhang and Wu 2006), reconstruction (X. Li et al. 2010), and learning-based methods (Zeng et al. 2019). Interpolation-based methods insert new pixel points based on the spatial relationship between sub-pixels and surrounding pixels. Widely used interpolation methods include nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation; they are relatively easy to implement and fast. Reconstruction-based methods mathematically combine the overall information of an HR image to reconstruct it; commonly used methods include iterative back-projection (Qiu, Cheng, and Wang 2021), projection onto convex sets (B. Wang et al. 2015), and maximum a posteriori estimation (Jakhetiya et al. 2016). Learning-based methods obtain prior knowledge to guide image reconstruction by learning the mapping relationship between LR images and HR images; common examples include manifold learning and sparse coding methods. These traditional SR reconstruction methods mainly rely on the construction of constraint terms and the accuracy of inter-image alignment to achieve the reconstruction effect. However, they are not suitable for SR reconstruction at large magnifications, and the reconstruction results often suffer from issues such as blurred edge textures and low-quality details.
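The interpolation-based methods mentioned above can be illustrated with a short sketch. The following pure-NumPy code implements nearest-neighbor and bilinear upscaling for a single-channel image; bicubic interpolation follows the same pattern with a 4 × 4 neighborhood and cubic weights, and is omitted here for brevity.

```python
import numpy as np

def nearest_upscale(img, t):
    """Nearest-neighbor interpolation: each new pixel copies its
    nearest source pixel.  img: (H, W) array, t: integer scale factor."""
    return np.repeat(np.repeat(img, t, axis=0), t, axis=1)

def bilinear_upscale(img, t):
    """Bilinear interpolation: each new pixel is a distance-weighted
    average of its four surrounding source pixels."""
    h, w = img.shape
    # map target pixel centres back into source coordinates
    ys = (np.arange(h * t) + 0.5) / t - 0.5
    xs = (np.arange(w * t) + 0.5) / t - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]   # vertical weights
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]   # horizontal weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```

This also illustrates why such methods blur edges: every output pixel is a weighted average of its neighbors, so high-frequency detail cannot be recovered.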
With the continuous development of deep learning, convolutional neural networks (CNN) have shown great potential in the field of image SR reconstruction thanks to their powerful fitting ability, attracting wide attention from scholars. Dong et al. (2014) first applied CNN to an image SR reconstruction task, completing image feature extraction with three convolutional layers and obtaining better reconstruction results than traditional SR reconstruction methods. However, interpolating the LR images before feature extraction increases the computational demand, and errors introduced by the interpolation process add uncertainty to the reconstruction results. To address the reconstruction problem caused by interpolation amplification, Dong, Change Loy, and Tang (2016) further proposed the FSRCNN algorithm, which avoids the interpolation problem by adding deconvolution layers to the network, increasing network depth, and adjusting the convolutional kernel size to improve both reconstruction effectiveness and training speed. Kim, Lee, and Lee (2016) first applied a recursive neural network to SR tasks, mitigating gradient explosion (or vanishing) through a recursive supervision strategy and the idea of residual learning. Shi et al. (2016) proposed an image SR method based on sub-pixel convolution layers, which computes convolution directly on LR image features to obtain HR images and achieves satisfactory speed and reconstruction quality. Ledig et al. (2017) first introduced generative adversarial networks into SR tasks; the model pits a generative network and a discriminative network against each other, and iteratively training the two networks enables the recovery of high-frequency information in SR tasks.
The deep learning-based SR algorithms mentioned above are primarily designed for reconstructing natural images. However, remote sensing images present unique challenges due to their high spatial distribution, the varied sizes and shapes of ground objects, and the need to extract high-frequency information for effective SR reconstruction. Therefore, these natural-image SR algorithms cannot be directly applied to the task of remote sensing image reconstruction. To this end, Lei, Shi, and Zou (2017) proposed a new remote sensing image SR algorithm with a combined local-global network that adopts a multi-fork structure to learn multilevel representations of remote sensing images. W. Xu et al. (2018) designed a remote sensing SR algorithm based on a deep memory connection network, which achieves better reconstruction capability by efficiently combining image details with environmental information. Jiang, Wang, Yi, Jiang, Xiao et al. (2018) designed a deep distillation recursive network, which extracts remote sensing image features by combining ultra-dense residual blocks with multi-scale purification units and further promotes feature representation through a distillation mechanism. Dong et al. (2019) employed a new multi-perceptual attention network and a transfer learning strategy to reconstruct remote sensing images by combining enhanced residual blocks with residual channel attention mechanisms. He et al. (2022) proposed a ResNet-based dense spectral Transformer to achieve spectral SR of multispectral remote sensing images; the network combines the Transformer with ResNet to meet the need for learning long-range relationships in remote sensing images. Zhilin Wang, Shang, and Wang (2022) proposed a remote sensing image SR network based on a Swin Transformer fusion attention network, which uses a Swin Transformer module with a fused attention mechanism to extract high-frequency information and a gradient method to extract edge details, effectively enhancing the network's ability to reconstruct details.
The above-mentioned deep learning-based remote sensing image SR algorithms mainly extract deep features of remote sensing images through CNNs. However, since CNNs lack the ability to model long-range dependencies, they may not handle global features in remote sensing images well. In contrast, the Transformer is a neural network suited to sequence data, which can effectively model dependencies between sequence elements. In remote sensing image SR, the Transformer can help the model learn the global features of the image, thereby improving performance. Therefore, combining CNN and Transformer can merge their advantages to enhance the performance of remote sensing image SR models. At the same time, existing Transformer-based remote sensing SR methods have too many network parameters and too high a computational cost, making them challenging to deploy on related hardware. To address these challenges, we propose a lightweight distillation CNN-Transformer architecture for remote sensing SR, called DCTA, which mainly solves the problems of insufficient feature information and heavy computation when combining CNN and Transformer for remote sensing image reconstruction. As shown in Figure 1, our proposed DCTA network achieves the highest reconstruction performance on the NWPU-RESISC45 dataset with only 0.53M parameters. Specifically, the DCTA network uses fewer layers and filters, which significantly reduces the number of network parameters. Moreover, considering the particularity of remote sensing images, the network combines CNN and Transformer modules to effectively extract remote sensing image features.
(4) While keeping fewer network parameters and floating-point operations, the designed DCTA network achieves state-of-the-art remote sensing image reconstruction performance on three remote sensing datasets. Related ablation experiments also verify the validity of DCTA's structural design.
The rest of this paper is organized as follows: Section 2 provides a summary of related works on CNN-based and Transformer-based SR for remote sensing images. Section 3 elaborates on the detailed structure of the proposed DCTA network. Section 4 covers the experimental design, including a comparison of the subjective and objective experimental results as well as ablation experiments. Finally, Section 5 summarizes the study.

Related work
In this section, we provide an overview of the research related to remote sensing image SR using both CNN-based and Transformer-based methods.

CNN-based remote sensing image SR
CNN-based algorithms have shown remarkable performance in various remote sensing image tasks, such as change detection (D. Wang et al. 2023; Zhu et al. 2022), pan-sharpening (Jiaming Wang, Shao, Huang, Lu, and Zhang 2022), and super-resolution (Jiaming Wang, Shao, Huang, Lu, Zhang, and Li 2022), owing to their advanced feature representation capabilities. As a result, CNN-based SR for remote sensing images has become a mainstream approach. For instance, Lei, Shi, and Zou (2017) introduced a multi-fork network structure that combines local and global information to learn multi-level features for remote sensing image processing. Jiang, Wang, Yi, and Jiang (2018) presented a remote sensing image SR network that progressively enhances the image details using a dense link network. Pan et al. (2019) proposed a residual dense back-projection network for remote sensing image SR, which utilizes a residual back-projection module to simplify the network and speed up the reconstruction process. Lei, Shi, and Zou (2019) developed a dual-path framework for remote sensing image generation using a coupled discriminative GAN network with a coupled discriminative loss. D. Zhang et al. (2020) designed a remote sensing image SR network based on a hybrid higher-order attention model for remote sensing feature detail recovery, connecting the feature extraction and feature refinement networks through frequency-aware connections to greatly improve SR performance. Liu et al. (2022) proposed a pairwise learning-based graph neural network that exploits self-similar feature blocks in remote sensing images by aggregating across scales. The successful SR reconstruction of remote sensing images heavily relies on the extraction of high-frequency information, making it crucial to develop a refined CNN-based feature extraction module.

Transformer-based remote sensing image SR
The Transformer was originally proposed in 2017 by Vaswani et al. (2017) and has gained significant popularity in the natural language processing (NLP) field. In recent years, there has been growing interest in exploring the potential of Transformers in computer vision applications. Transformers have been adopted to slice and encode images, replacing convolutions, to obtain internal connections through attention models. Recently, the visual Transformer has also been widely used in remote sensing images for tasks such as semantic classification (Strudel et al. 2021), building extraction (Chen et al. 2022), super-resolution (Lei, Shi, and Mo 2021), etc. Cai and Zhang (2022) proposed a texture transfer transformer-based SR network for remote sensing images, which reduces the dependence on reference images through a feature fusion scheme of the U-transformer and produces rich remote sensing texture information. Tu et al. (2022) designed a remote sensing image SR network by combining the Swin Transformer and CNN, which extracts deep features from a residual dense Swin Transformer to produce HR images. An et al. (2022) designed a new end-to-end remote sensing SR framework, which solves the multi-frame remote sensing image SR problem by introducing the Transformer. Lei, Shi, and Mo (2021) and S. Wang et al. (2021) previously proposed Transformer-based remote sensing image SR networks that respectively employ a multi-level enhancement structure and a context transformation layer to extract contextual features and improve SR reconstruction. In contrast, our approach combines the CNN and Transformer frameworks to extract deep features of remote sensing images, resulting in a lightweight yet effective remote sensing image SR framework. This approach enlarges the network's perceptual field and improves the high-frequency feature detail representation of remote sensing images.

The proposed method
In this section, we describe the overall structure of DCTA and the detailed structure of the DCTB components, i.e. the dual attention distillation block (DADB) and the affine-swin transformer block (ASTB).

Network architecture
In this work, we design an SR network for remote sensing images based on a distillation CNN-Transformer architecture, i.e. DCTA. As shown in Figure 2, the network can be divided into four major components, i.e. the coarse feature extraction layer (CFEL), the feature distillation extraction module (FDEM), the feature fusion module (FFM), and the up-sampling module. Specifically, given an LR remote sensing image I_LR with spatial size H × W and an HR remote sensing image I_HR with spatial size tH × tW, the proposed DCTA network aims to reconstruct the SR remote sensing image I_SR:

I_SR = F(I_LR),

where F(.) denotes the SR function of the proposed network, and H, W, and t indicate the height, width, and scale factor, respectively. The four-step process is as follows: (1) Coarse Feature Extraction Layer. First, the LR remote sensing image I_LR is fed into a 3 × 3 convolutional layer, which extracts the coarse features of I_LR, with the function expression:

M_0 = F_1(I_LR),

where M_0 denotes the coarse feature of I_LR, and F_1(.) denotes the CFEL's 3 × 3 convolution operation.
(2) Feature Distillation Extraction Module. Second, M_0 is fed to the DCTB, whose function expression is:

M_1 = F_DCTB(M_0),

where M_1 denotes the refined distillation features extracted by the DCTB. Considering that the FDEM is a stack of multiple DCTBs, the features extracted by each DCTB can be defined as follows:

M_i = F_DCTB(M_{i-1}), i = 1, ..., n,

where M_i denotes the fine features extracted from the remote sensing image by the i-th DCTB.
(3) Feature Fusion Module. Third, the above-extracted distillation features are input to the FFM to achieve efficient feature fusion without increasing the network's parameters; the process is expressed as follows:

M_F = F_3×3(F_1×1(Concat(M_1, ..., M_n))),

where F_3×3 and F_1×1 represent convolution operations of 3 × 3 and 1 × 1, respectively, and Concat denotes channel-wise concatenation.
(4) Up-sampling Module. Finally, the efficiently fused remote sensing features are upsampled and reconstructed, outputting the final SR remote sensing image:

I_SR = F_up(M_F),

where F_up denotes the reconstruction function of the pixel shuffle layer. The loss function for our DCTA model is formulated as follows:

L(Θ) = (1/N) Σ_{n=1}^{N} || I_SR^n − I_HR^n ||_1,

where Θ denotes the DCTA parameters, ||·||_1 denotes the l1 norm, and I_SR^n and I_HR^n denote the n-th reconstructed SR image and the corresponding ground truth image, respectively.
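The four-step pipeline above can be sketched in PyTorch as follows. This is a minimal, runnable illustration only: the channel width (48), the placeholder block body standing in for a real DCTB, and the exact fusion order inside the FFM (a 1 × 1 convolution on the concatenated features followed by a 3 × 3 convolution) are our assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DCTBPlaceholder(nn.Module):
    """Stand-in for one distillation CNN-Transformer block (DCTB).
    The real block contains the DADB and two ASTBs; a residual 3x3
    convolution is used here only to keep the sketch runnable."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.act(self.conv(x))

class DCTASketch(nn.Module):
    """Four-stage pipeline: CFEL -> stacked DCTBs (FDEM) -> FFM ->
    pixel-shuffle up-sampling, matching the equations in the text."""
    def __init__(self, ch=48, n_blocks=6, scale=4):
        super().__init__()
        self.cfel = nn.Conv2d(3, ch, 3, padding=1)        # coarse feature extraction
        self.blocks = nn.ModuleList(DCTBPlaceholder(ch) for _ in range(n_blocks))
        self.ffm = nn.Sequential(                          # feature fusion module
            nn.Conv2d(ch * n_blocks, ch, 1),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.up = nn.Sequential(                           # up-sampling module
            nn.Conv2d(ch, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, i_lr):
        m = self.cfel(i_lr)                                # M_0
        feats = []
        for blk in self.blocks:                            # M_1 ... M_n
            m = blk(m)
            feats.append(m)
        fused = self.ffm(torch.cat(feats, dim=1))          # M_F
        return self.up(fused)                              # I_SR

l1_loss = nn.L1Loss()  # the l1 training objective from the loss equation
```

A 48 × 48 LR patch passed through this sketch yields a 192 × 192 output for the ×4 scale factor, matching the training patch size used later in the paper.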

Dual attention distillation block (DADB)
In the following subsection, we provide a detailed description of the DADB's structure, as shown in Figure 3. The proposed enhanced spatial residual attention (ESRA) and enhanced contrast channel residual attention (ECCRA) enable DCTA to extract enhanced multi-scale remote sensing features in both the channel and spatial dimensions while maintaining a small number of model parameters. Note that the ESRA and ECCRA are linked together to effectively distill fine remote sensing features while reducing redundant extracted information, as shown in Figure 4. The detailed structure of the DADB is shown in Table 1. The details of ESRA and ECCRA are discussed in the following sections: (1) Enhanced Spatial Residual Attention (ESRA). To maximize the effectiveness of DCTA, remote sensing image features should be focused on spatial scales of critical importance. To this end, we designed an ESRA block based on enhanced spatial attention (Fang et al. 2022), which pays more attention to the regions of interest by refining the features. Specifically, the extracted coarse remote sensing image features M_0 are first fed to a 1 × 1 convolution layer to reduce the channel size. The features then pass through a strided convolution kernel and a max-pooling layer to expand the perceptual field with a smaller number of parameters, followed by a residual group convolution layer. To recover the spatial dimension and channel size, an upsampling layer and a 1 × 1 convolution layer are attached. Finally, the attention mask is generated through the sigmoid layer, and element-wise multiplication with the coarse remote sensing image feature M_0 realizes the extraction of fine remote sensing image features. The expression of the ESRA block can be defined as:

M_ESRA = F_ESRA(M_0),

where F_ESRA represents the feature extraction process of the ESRA block.
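The ESRA steps described above can be sketched as follows. The channel reduction ratio, group count, and pooling configuration are illustrative assumptions; only the overall sequence (1 × 1 reduction, strided convolution plus max pooling, residual group convolution, upsampling and 1 × 1 restoration, sigmoid mask) is taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESRASketch(nn.Module):
    """Enhanced spatial residual attention, roughly following the steps
    in the text.  The reduction ratio and group count are assumptions."""
    def __init__(self, ch, reduction=4, groups=4):
        super().__init__()
        r = ch // reduction
        self.reduce = nn.Conv2d(ch, r, 1)                   # channel reduction
        self.strided = nn.Conv2d(r, r, 3, stride=2, padding=1)
        self.group = nn.Conv2d(r, r, 3, padding=1, groups=groups)
        self.restore = nn.Conv2d(r, ch, 1)                  # channel restoration

    def forward(self, m0):
        a = self.reduce(m0)
        # strided conv + max pooling widen the perceptual field cheaply
        a = F.max_pool2d(self.strided(a), 3, stride=1, padding=1)
        a = a + self.group(a)                               # residual group conv
        a = F.interpolate(a, size=m0.shape[-2:], mode='bilinear',
                          align_corners=False)              # recover spatial size
        mask = torch.sigmoid(self.restore(a))               # spatial attention mask
        return m0 * mask                                    # refined features M_ESRA
```

The sigmoid mask keeps the output the same shape as the input, so the block can be dropped into the DADB without changing surrounding layers.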
(2) Enhanced Contrast Channel Residual Attention (ECCRA). To effectively capture the global information of remote sensing images and further improve the detailed structural information, we designed an ECCRA block. The proposed ECCRA differs from traditional attention blocks in that it uses contrast information, the sum of the standard deviation and the mean, to calculate the channel attention weights, and then uses a residual-in-residual structure within the module to alleviate problems such as vanishing information during feature transfer. Let Q = [q_1, ..., q_c, ..., q_C] be the input, where C represents the number of feature maps. The contrast information value can then be calculated by:

H(q_c) = sqrt( (1/(HW)) Σ_{(i,j)} ( q_c^{i,j} − (1/(HW)) Σ_{(i,j)} q_c^{i,j} )² ) + (1/(HW)) Σ_{(i,j)} q_c^{i,j},

where H(q_c) denotes the global contrast information evaluation function. Specifically, the extracted remote sensing image features M_ESRA are first fed into the contrast layer and then into three 1 × 1 convolutions via residuals. Finally, the calculated feature information is fed into the sigmoid layer, where element-wise multiplication is performed with the extracted remote sensing image features M_ESRA. Thus, the expression of ECCRA is defined as:

M_DADB = F_ECCRA(M_ESRA),

where F_ECCRA denotes the feature extraction process of the ECCRA block and M_DADB denotes the feature output of the overall DADB.
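The contrast statistic H(q_c) is simple to compute; the sketch below shows it in NumPy, together with a sigmoid gating. The three learned 1 × 1 convolutions and the residual connections of the real block are deliberately omitted, so this illustrates only the contrast layer and the gating.

```python
import numpy as np

def contrast_info(q):
    """Global contrast statistic H(q_c) used by ECCRA-style channel
    attention: the standard deviation plus the mean of each feature map.
    q has shape (C, H, W); the result has shape (C,)."""
    return q.std(axis=(1, 2)) + q.mean(axis=(1, 2))

def eccra_weights(q):
    """Sigmoid of the contrast statistic as channel-attention weights.
    In the real block these pass through learned 1x1 convolutions first."""
    return 1.0 / (1.0 + np.exp(-contrast_info(q)))
```

Unlike plain global average pooling, the standard-deviation term lets channels with high internal contrast (edges, textures) receive larger weights even when their mean activation is small.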

Affine-swin transformer block (ASTB)
The Swin Transformer (Liang et al. 2021) is a Transformer architecture that employs local attention and shifted windows and has achieved great performance in computer vision tasks. For remote sensing image reconstruction tasks, we propose an ASTB based on the Swin Transformer, with which features can be extracted from shallow to deep by varying downsampling scales, gradually expanding the network's perceptual field while the locality of the windows keeps the module's computational complexity low. As shown in Figure 5, the remote sensing image features extracted by the DADB are passed to the layer normalization (LN) layer and the multi-head self-attention (MSA) layer in turn, and the output of the MSA layer is added element-wise to the output of the DADB. The resulting features are then passed to the LN layer and the ResMLP layer in turn, and finally an element-wise addition with the output of the MSA branch produces the final remote sensing image features. Therefore, the overall process of ASTB can be defined as:

X = F_MSA(F_LN(M_DADB)) + M_DADB,
M_ASTB = F_ResMLP(F_LN(X)) + X,

where F_LN, F_MSA, and F_ResMLP represent the LN, MSA, and ResMLP function operations, respectively.
In the above ASTB, a new residual MLP layer is designed to mitigate the issue of gradient explosion during feature information transfer. By introducing the affine transformation layer, the network's training becomes more stable without adding additional training costs. As shown in Figure 6, the transfer process of ResMLP follows: Affine − Linear − GeLU − Linear − Affine. Benefiting from ResMLP, our proposed ASTB achieves advanced reconstruction performance in remote sensing image SR tasks by exploiting cross-window information through the shifted-window attention and alternating it with the ResMLP. To strike a balance between network complexity and reconstruction performance, we incorporate two ASTBs within each DCTB.
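The Affine − Linear − GeLU − Linear − Affine sequence can be sketched as below. The hidden expansion factor is an assumption; per the ASTB equations, the skip connection is applied outside this module, so the forward pass here returns only the transformed features.

```python
import torch
import torch.nn as nn

class Affine(nn.Module):
    """Learned per-channel scale and shift, used in place of a
    normalization layer to stabilise training at negligible cost."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta

class ResMLP(nn.Module):
    """Affine - Linear - GELU - Linear - Affine over token features of
    shape (batch, tokens, dim).  The expansion factor is an assumption;
    the residual connection is added outside, in the ASTB."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        self.body = nn.Sequential(
            Affine(dim),
            nn.Linear(dim, dim * expansion),
            nn.GELU(),
            nn.Linear(dim * expansion, dim),
            Affine(dim))

    def forward(self, x):
        return self.body(x)
```

Because the affine layers start at identity scale and zero shift, they leave the signal unchanged at initialization, which is what makes them essentially free in terms of training cost.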

Experimental results
In this section, we present our experimental results from four aspects, i.e. the dataset, implementation details, comparison with the state-of-the-art, and ablation studies.

Datasets
Our DCTA is evaluated through experiments conducted on three publicly available remote sensing datasets: the NWPU-RESISC45 dataset (Cheng, Han, and Lu 2017), the Draper dataset, and the UC Merced dataset (Yang and Newsam 2010). Figure 7 displays selected samples from these datasets. The NWPU-RESISC45 dataset comprises 45 scene categories, each containing 700 remote sensing images with a size of 256 × 256 pixels. For training and testing, we utilized 2,250 and 90 images, respectively, from this dataset.
The Draper dataset comprises 324 scene categories, each containing 5 images, with an original image size of 3,099 × 2,329 pixels. To obtain HR images, we crop the original images to a size of 192 × 192 pixels via bicubic interpolation. We selected 1,000 remote sensing images for training and 200 images for testing from this dataset.
The UC Merced dataset is composed of 21 categories of remote sensing scenes, each containing 100 images with a size of 256 × 256 pixels.We divided the dataset into two parts, with 1,050 images used for training and the remaining 1,050 images utilized for testing.

Implementation details
In this study, we focus on the 4× scale factor for remote sensing image reconstruction. The LR images in the training set are all obtained by bicubic downsampling. In the training phase, the training images are augmented by random rotation and horizontal flips. Similar to previous work (Y. Wang et al. 2023), we use five evaluation metrics to measure the reconstruction quality of SR images: PSNR, SSIM (Zhou Wang et al. 2004), FSIM (Lin Zhang et al. 2011), VIF (Sheikh and Bovik 2006), and ERGAS (Ranchin and Wald 2000). We also analyze the number of network parameters (Params) and floating-point operations (FLOPs) of the models. Note that all reconstruction results are evaluated on the Y channel of the YCbCr color space.
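Evaluating on the Y channel of YCbCr is a standard SR convention; as a sketch, the following NumPy code extracts the Y channel (ITU-R BT.601 coefficients, 8-bit range) and computes PSNR. The exact conversion constants used by the paper's evaluation script are not stated, so BT.601 is an assumption.

```python
import numpy as np

def rgb_to_y(img):
    """Luma (Y) channel of YCbCr per ITU-R BT.601 for 8-bit RGB input;
    img has shape (..., 3) with values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((np.asarray(x, dtype=np.float64)
                   - np.asarray(y, dtype=np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

In an evaluation loop one would call `psnr(rgb_to_y(sr_image), rgb_to_y(hr_image))` for each test pair and average over the test set.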
We used the Adam optimizer (Kingma and Ba 2014) to train our model with β1 = 0.9, β2 = 0.999, and ε = 1e-8. The initial learning rate was set to 5e-4 and halved every 200 epochs, for a total of 1,200 epochs. For each training mini-batch, 16 randomly cropped patches of size 48 × 48 were used as input. The specific device parameters are shown in Table 2. The source code is available at https://github.com/Yu-Wang-0801/DCTA.
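The optimizer and learning-rate schedule described above map directly onto PyTorch's `Adam` and `StepLR`; the snippet below is a minimal sketch with a stand-in parameter in place of the full model.

```python
import torch

# a stand-in parameter; in practice the DCTA model's parameters go here
params = [torch.nn.Parameter(torch.zeros(8))]

# Adam with the hyper-parameters reported in the paper
opt = torch.optim.Adam(params, lr=5e-4, betas=(0.9, 0.999), eps=1e-8)
# halve the learning rate every 200 epochs
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200, gamma=0.5)

for epoch in range(1200):
    # ... one epoch over mini-batches of 16 randomly cropped 48x48 patches ...
    opt.step()
    sched.step()
```

After 1,200 epochs the learning rate has been halved six times, ending at 5e-4 × 0.5^6 ≈ 7.8e-6.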
(1) Quantitative Results on NWPU-RESISC45 Dataset: The best results are highlighted in red font in Table 3, which presents the performance comparison of various methods on the NWPU-RESISC45 dataset. The results indicate that our proposed method achieves better objective evaluation metrics than the other compared algorithms, as demonstrated in Table 3. Specifically, while maintaining fewer network parameters and floating-point operations, our method outperforms the other algorithms on five commonly used objective evaluation metrics. Figure 8 shows the reconstructions of the different comparison algorithms on NWPU-RESISC45 images, where we mark the regions in which the differences between reconstructed images are most noticeable. The texture detail recovery capability of the proposed method is observed to be superior to that of the other competing methods.
(2) Quantitative Results on Draper Dataset: To further validate the effectiveness of DCTA, we conducted relevant experiments on the Draper dataset. Table 4 shows the comparison of the algorithms on the Draper dataset, where the best results are marked in red font. From Table 4, it can be seen that the proposed method obtains the best reconstruction results in PSNR and ERGAS. Although DCTA is worse than RCAN in three evaluation metrics (i.e. SSIM, FSIM, and VIF), the number of network parameters of DCTA is only 4.9% of RCAN's, and its FLOPs are 4.4% of RCAN's. Thus, we believe our method presents a better balance between performance and efficiency than the other competing algorithms. The reconstructed images obtained by the various competing algorithms on the Draper dataset are compared in Figure 9. It can be observed that our proposed method reconstructs finer feature information on vehicles and aircraft runway lines than the other algorithms.
(3) Quantitative Results on UC Merced Dataset: To further validate the generality of DCTA, we tested the proposed method as well as the other competing methods on the UC Merced dataset. Table 5 shows the algorithm comparison on the UC Merced dataset, where the best results are marked in red font. From Table 5, it can be seen that DCTA outperforms the other comparison algorithms on all five commonly used objective evaluation metrics with a low number of network parameters and FLOPs. Figure 10 compares the reconstructed images of the different competing algorithms on the UC Merced dataset. As shown in Figure 10, our method outperforms the other competing algorithms in reconstructed details, indicating the superior performance of the proposed method from both quantitative and qualitative perspectives.

Ablation studies
In this subsection, we design a series of ablation experiments to evaluate the effectiveness of each component of the DCTA network and the rationality of the number of DCTBs.
(1) Effectiveness of the proposed DADB and ASTB: From Table 6, the reconstruction performance of our network differs by 0.0798 and 0.0158 in terms of the ERGAS and spectral angle mapper (SAM) (Yifan Zhang, Backer, and Scheunders 2009) metrics when the ASTB component is removed from DCTA. The results indicate that the ASTB (as explained in Section 3.3) improves the extraction of deep features from remote sensing images and expands the perceptual field. Additionally, we performed another experiment to evaluate the effectiveness of the DADB. In this experiment, we removed the DADB from the network and observed a degradation of 0.0216 and 0.0044 in the ERGAS and SAM values of the reconstruction results, respectively. This phenomenon also proves that the DADB facilitates the extraction of high-frequency details from remote sensing images. To further verify the persuasiveness of the above ablation experiments, we validated the networks in different modes using error comparison plots. The subjective results are shown in Figure 11. From the detail expansion region in Figure 11, it is clear that adding the proposed components leads to better-reconstructed images, proving the effectiveness of our proposed components.
(2) Effectiveness of different numbers of DCTBs: To verify the influence of the number of DCTBs on the network, we designed a set of ablation experiments for comparison; the experimental results are shown in Table 7. It can be seen from Table 7 that as the number of DCTBs increases, most of the objective evaluation indicators show an upward trend. The experiments show that when the number of DCTBs increases to 8, the network parameters reach 1.01M. Therefore, after careful consideration, we set the number of DCTBs to 6 to ensure optimal reconstruction performance while keeping the network lightweight.
(3) Effectiveness of the contrast convolution layers: To verify the effectiveness of the contrast convolution layer in ECCRA, we conducted related experiments for comparison, and the

Verification of superiority in classification tasks
The remote sensing image SR reconstruction technique is often used as a pre-processing step in computer vision tasks. To better validate the usefulness of reconstructed remote sensing images for downstream tasks, we used an unsupervised semantic segmentation algorithm (ISODATA) to evaluate the images reconstructed by the various competing algorithms. We conducted the experiments on the ENVI 5.3 software platform, setting the number of classes to 5 and the maximum number of iterations to 5. The classification results are shown in Figure 12, where it can be observed that the classification results of our DCTA-reconstructed images are closer to those of the ground truth images. We also notice that the reconstructed results of the other competing algorithms lack the relevant texture detail that facilitates correct classification. This experiment further supports the effectiveness of the proposed DCTA model in reconstructing remote sensing images with superior performance.

Conclusion
In this paper, we present a novel lightweight super-resolution (SR) architecture, named DCTA, specifically designed for remote sensing applications. Our approach introduces a unique distillation CNN-Transformer block (DCTB) that combines the strengths of CNN and Transformer structures in a lightweight manner. The proposed DCTB enables the extraction of deep features at various scales in remote sensing images, effectively enhancing the network's perceptual field and efficiently utilizing global feature information. To validate the effectiveness of DCTA, we conduct experiments on three datasets: NWPU-RESISC45, Draper, and UC Merced. The results demonstrate that our method achieves an excellent balance between computational cost and reconstruction performance, outperforming existing methods. Furthermore, we verify the efficiency of our design through comprehensive ablation experiments. The deployment of our method on hardware devices is anticipated, as it can enhance the accuracy of downstream tasks related to remote sensing images, such as change detection and building extraction.

Figure 1. The trade-off between the number of model parameters and PSNR compared with other SR algorithms on the NWPU-RESISC45 dataset for ×4 SR.

Figure 3. The overall structure of the proposed Dual Attention Distillation Block (DADB).

Figure 8. Comparison of subjective results on the NWPU-RESISC45 dataset with other comparison algorithms. Best viewed zoomed in.

Figure 9. Comparison of subjective results on the Draper dataset with other comparison algorithms. Best viewed zoomed in.

Figure 10. Comparison of subjective results on the UC Merced dataset with other comparison algorithms. Best viewed zoomed in.

Figure 11. MSE maps between the reconstructions and the ground truth for the different ablation experiments. The color of each MSE map encodes the magnitude of the error between the images.

Figure 12. Comparison of the classification results obtained by applying the ISODATA classification method to the outputs of the different algorithms. Zoom in for more details.

Table 4. Comparison of the Params, FLOPs, PSNR, SSIM, FSIM, VIF, and ERGAS results of different algorithms on the Draper dataset. Parameters (Params) and floating-point operations (FLOPs) are measured on an LR image of 48 × 48 pixels.

Table 5. Comparison of the Params, FLOPs, PSNR, SSIM, FSIM, VIF, and ERGAS results of different algorithms on the UC Merced dataset. Parameters (Params) and floating-point operations (FLOPs) are measured on an LR image of 64 × 64 pixels.
experimental results are shown in Table 8. We re-performed the related experiments on the NWPU-RESISC45 dataset after removing the contrast convolution layer in ECCRA. From Table 8, we can see that with the assistance of the contrast convolution layers, our DCTA network stably improves the reconstruction quality of remote sensing images.

Table 6. Results of the ablation experiments on the NWPU-RESISC45 dataset with scale factor ×4.

Table 7. Comparison of the reconstruction performance of different numbers of DCTBs on the NWPU-RESISC45, Draper, and UC Merced datasets.

Table 8. Comparison of the reconstruction performance with and without the contrast convolution layers on the NWPU-RESISC45 dataset.