MC-Net: multi-scale contextual information aggregation network for image captioning on remote sensing images

ABSTRACT Remote Sensing Image Captioning (RSIC) plays a crucial role in advancing semantic understanding and has increasingly become a focal point of research. Nevertheless, existing RSIC methods grapple with challenges due to the intricate multi-scale nature and multifaceted backgrounds inherent in Remote Sensing Images (RSIs). Compounding these challenges are the perceptible information disparities across diverse modalities. In response to these challenges, we propose a novel multi-scale contextual information aggregation image captioning network (MC-Net). This network incorporates an image encoder enhanced with a multi-scale feature extraction module, a feature fusion module, and a finely tuned adaptive decoder equipped with a visual-text alignment module. Notably, MC-Net possesses the capability to extract informative multiscale features, facilitated by the multilayer perceptron and transformer. We also introduce an adaptive gating mechanism during the decoding phase to ensure precise alignment between visual regions and their corresponding text descriptions. Empirical studies conducted on four publicly recognized cross-modal datasets unequivocally demonstrate the superior robustness and efficacy of MC-Net in comparison to contemporaneous RSIC methods.


Introduction
Recent remarkable advances in deep learning technology have facilitated a better comprehension of remote sensing images. Conventional visual tasks, such as classification, detection, and segmentation, have made significant achievements, whereas semantic-level visual tasks such as image captioning remain challenging problems. RSIC involves a concise and coherent description of a complex scene captured in a specific image using natural language. The goal is not only to recognize the information pertaining to target objects in an image but also to understand the relationship between objects and to generate syntactically accurate and semantically fluent descriptive sentences.
In recent years, inspired by advances in machine translation and computer vision, researchers have explored various techniques for the task of RSIC, which holds promise for applications in remote sensing image retrieval (Cheng et al. 2021), object detection (Zhang et al. 2019), military intelligence generation (Shi and Zou 2017), and disaster assessment (Liu et al. 2018).
Traditional image captioning methods can be classified into template-based (Farhadi et al. 2010; Ordonez, Kulkarni, and Berg 2011) and retrieval-based (Kulkarni et al. 2013; Ushiku et al. 2015) methods. Template-based methods use a predefined template to generate image descriptions. Retrieval-based methods generate a description of the query image from the descriptions of similar images. The quality of the sentences generated by these methods therefore depends on the quality of the templates and of the retrieval step.
The advent of deep learning has significantly advanced the captioning of remote sensing images, where the encoder-decoder network (Anderson et al. 2018; Lu et al. 2017; Vinyals et al. 2015; Xu et al. 2015) has been widely adopted, and it can overcome the limitations of traditional template and retrieval approaches. Natural image captioning has made remarkable progress with the rapid development of artificial intelligence (AI). Remote sensing images have unique characteristics such as large imaging ranges, the presence of objects of different scales, and complex scenes in the same image (Du et al. 2021; Feng et al. 2023). In addition, complex scenes often depict several types of feature targets, making it challenging for category labels to capture the entire image content. These characteristics significantly increase the difficulty of captioning RSIs.
Researchers have also conducted studies specific to remote sensing. Some benchmark datasets (Lu et al. 2017; Qu et al. 2016) have been established to meet the data dependence of deep learning methods, including the largest cross-modal dataset established in an earlier work (Cheng et al. 2022). Researchers have used multilevel attention (Yuan, Li, and Wang 2019), denoising (Huang, Wang, and Li 2020), label information (Zhang et al. 2019), and attribute information (Zhang et al. 2019) to obtain richer image features. Wang et al. (2022) combined multilabel semantic information as a priori information and designed two methods for fusing semantic attributes and image features. On the decoder side, researchers have optimized the generated statements by extending long short-term memory (LSTM) (Fu et al. 2020), using support vector machines (SVM) (Hoxha and Melgani 2020), and optimizing loss functions during the training process (Li et al. 2020). Some innovative approaches have been introduced to model architectures that incorporate image retrieval (Wang et al. 2020), sound information (Lu, Wang, and Zheng 2019), and interpretability enhancement (Wang et al. 2020). Sumbul, Nayak, and Demir (2020) developed a summary-driven model, whereas Hoxha, Melgani, and Demir (2020) combined retrieval and captioning methods to generate multiple captions and then compared them to reference captions to determine the final caption. To address label scarcity in remote sensing image captioning, Yang, Ni, and Ren (2022) incorporated meta-learning into a remote sensing image captioning framework, which extracts meta features from two support tasks, natural image classification and remote sensing image classification, and transfers the meta features to RSIC.
To address the multi-scale problem, Zhang et al. (2019) introduced multiscale cropping to enhance data. Li et al. (2021) incorporated static and multiscale features by using recurrent attention. Zhang et al. (2021) reduced the burden of hidden states because the proposed language state (LS) provides text features exclusively so that the hidden states only guide the visual-textual attention process. Furthermore, Cheng et al. (2022) fused specific spatial regions and image scales. Wang et al. (2022) proposed a multiscale feature representation structure based on a two-stage training strategy. Existing remote sensing image captioning methods neglect to filter the redundant features of remote sensing images and do not make effective use of multi-scale image features, which hinders the development of remote sensing image description. How to obtain salient visual features is therefore a problem that needs to be solved.
To address these problems, we propose a multi-scale contextual information aggregation network, which includes a multi-scale feature image encoder and an adaptive visual-text alignment decoder. The image encoder uses a multiscale feature extraction module (MS) to extract image features, for which channel relation modeling is proposed to filter redundant features. Furthermore, two feature fusion methods are proposed, using an MLP and a transformer to model the local and global context information of the images, respectively. Finally, a visual-text alignment mechanism is used to generate descriptions that are both syntactically accurate and semantically fluent. Based on the above design, the three proposed modules work together to generate accurate and informative descriptions. The contributions of this paper can be summarized as follows:
(1) We propose a multi-scale contextual information aggregation network for RSIC that extracts deep multi-scale features and handles the multifaceted backgrounds inherent in remote sensing images (RSIs) by exploiting a multi-scale feature image encoder and an adaptive visual-text alignment decoder.
(2) We propose multi-scale feature extraction and feature aggregation modules in the image encoder. The former extracts deep features from remote sensing images at various scales and filters redundant features. The latter adaptively fuses cross-scale feature information with the aim of improving the ability to understand the contextual information of RSIs. A visual-text alignment mechanism is introduced to the decoder to generate accurate descriptive sentences.
(3) We demonstrate the effectiveness and robustness of MC-Net on four cross-modal benchmark datasets. The results show that the proposed MC-Net achieves state-of-the-art performance.

Method
The architecture of the MC-Net model, which is based on an encoder-decoder structure, is shown in Figure 1. We provide a detailed overview of the MC-Net model, which includes the encoder based on multi-scale feature extraction and fusion, and the decoder, which uses the visual-text alignment mechanism. Specifically, the encoder leverages the MS to extract features at various scales and then combines this information to obtain optimized image features. On the decoding side, the fused multiscale feature information is input to an LSTM decoder, which generates the caption of the remote sensing image. In the training process, cross-entropy loss is used to train MC-Net.

Multi-scale feature extraction module
Remote sensing image captioning aims to understand the information of an image at a fine-grained level and describe it in a fluent sentence. In this work, we design an MS module to extract visual features at different scales and filter redundant features of RSIs.
A pretrained VGG-16 was used to extract the visual features. Specifically, we divide the input features into four groups along the channel dimension, with each group denoted as $X_i \in \mathbb{R}^{H \times W \times C_i}$, $i \in \{1, 2, 3, 4\}$, where $H$, $W$, and $C_i$ denote the height, width, and number of channels of each group, respectively. A convolution is computed for each group. For the first group of input feature vectors, the output features are obtained directly by a 1 × 1 convolution. The inputs of the second, third, and fourth groups are formed together with the output of the previous group, and the output features of each of these groups are obtained using a 3 × 3 convolution block, so that successive groups obtain progressively larger receptive fields (1 × 1, 3 × 3, 5 × 5, 7 × 7). Furthermore, the channel relationship modeling module takes as input the previous set of feature vectors output by the spatial relationship modeling module together with the current feature vectors. This design filters redundant information and yields an optimized image feature representation. Figure 2 shows the structure of the MS module based on a multiscale convolution layer. The right panel of Figure 2 shows an expanded diagram of the channel relationship modeling module.
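The receptive-field sequence quoted above (1 × 1, 3 × 3, 5 × 5, 7 × 7) follows from chaining 3 × 3 convolutions across groups. The sketch below only does this bookkeeping and the channel split. It assumes stride-1, undilated convolutions and equally sized groups, neither of which is stated in the paper, and the function names are hypothetical.

```python
def stacked_receptive_fields(num_groups=4, kernel=3):
    """Effective receptive field of each group when group i reuses the
    output of group i-1 (assumes stride 1 and no dilation)."""
    fields = [1]  # group 1 only applies a 1x1 convolution
    for _ in range(num_groups - 1):
        # every additional k x k convolution widens the field by (k - 1)
        fields.append(fields[-1] + (kernel - 1))
    return fields


def split_channels(num_channels, num_groups=4):
    """Split channel indices into contiguous, equally sized groups
    (equal sizes are an assumption made for illustration)."""
    if num_channels % num_groups:
        raise ValueError("num_channels must be divisible by num_groups")
    size = num_channels // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


print(stacked_receptive_fields())  # [1, 3, 5, 7]
print(split_channels(8))           # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each later group therefore sees a wider neighborhood of the input without needing a larger kernel of its own.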
In the workflow described above, $I$ denotes the input of the encoder, $K_i^s$ denotes the convolution operation corresponding to the $i$th group of feature vectors, and $F_i$ denotes the corresponding output. The two convolutions in the module have their own learnable parameters, one of which is denoted $W_{SS}$.
Next, the channel-level information is obtained by using global average pooling. We then use a fully connected layer and an activation function to extract useful channel information.
The output of the $i$th attention group is then computed, and from it the final output of the spatial multi-scale features is obtained. We further concatenate the optimized features of each group along the channel dimension to obtain the final feature.
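The channel-level step (global average pooling, then a fully connected layer and an activation) can be illustrated with a small pure-Python sketch. The sigmoid activation, the single-layer form, and all function names are assumptions made for illustration; the exact formulation used by MC-Net may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_average_pool(feature_map):
    """feature_map: list of C channels, each an H x W nested list.
    Returns one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def channel_weights(pooled, fc_weights, fc_bias):
    """One fully connected layer followed by a sigmoid (assumed activation),
    giving a weight in (0, 1) for every channel."""
    return [sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
            for row, b in zip(fc_weights, fc_bias)]


def reweight(feature_map, weights):
    """Scale every channel by its weight to suppress redundant channels."""
    return [[[w * v for v in row] for row in ch]
            for ch, w in zip(feature_map, weights)]


# Toy input: two 2x2 channels and an identity fully connected layer.
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
pooled = global_average_pool(fmap)
weights = channel_weights(pooled, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
filtered = reweight(fmap, weights)
```

Channels that receive a weight near zero contribute little to the concatenated feature, which is one way to read the "filtering of redundant features" described above.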

Feature aggregation module
We propose two feature aggregation methods to adaptively fuse cross-scale feature information with the aim of improving the ability to understand the contextual information of remote sensing images. Our strategy includes local modeling and global modeling.

Local modeling of images
The extracted multi-scale image features are fed into the MLP for feature learning, and the features on the four scales are attention-weighted using the sigmoid activation function. Figure 3 shows the local image modeling process. First, we obtain the feature $S$ by concatenating the extracted multiscale remote sensing image features. Next, the concatenated image features are downsampled through the FC layer, and the correlation between the multiscale features is learned by the MLP. We then calculate the weight matrix $W$ by applying the sigmoid activation function to the downsampled image features.
Subsequently, the features of the different scales are multiplied by the weight matrix $W$, which acts as the weighting coefficient, to obtain the attention-weighted image features.
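The local modeling steps (concatenate, downsample with an FC layer, score with an MLP, apply a sigmoid, reweight each scale) can be sketched as follows. The ReLU after the downsampling layer, the use of one scalar weight per scale, and every name in the code are assumptions for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weights, bias):
    """Plain fully connected layer: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def local_fusion(scale_features, w_down, b_down, w_score, b_score):
    # S: concatenation of the per-scale feature vectors
    s = [v for feat in scale_features for v in feat]
    # FC downsampling followed by ReLU (the ReLU is an assumption)
    hidden = [max(0.0, h) for h in linear(s, w_down, b_down)]
    # one sigmoid weight per scale
    weights = [sigmoid(z) for z in linear(hidden, w_score, b_score)]
    # multiply every scale by its weight
    weighted = [[w * v for v in feat]
                for w, feat in zip(weights, scale_features)]
    return weights, weighted


# Toy input: four scales with two values each, all-zero parameters.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w_down = [[0.0] * 8 for _ in range(2)]
w_score = [[0.0] * 2 for _ in range(4)]
weights, weighted = local_fusion(features, w_down, [0.0, 0.0],
                                 w_score, [0.0] * 4)
```

With all-zero parameters every scale receives the neutral weight 0.5; trained parameters would instead emphasize the scales that matter for a given image.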

Global modeling of images
The global context information of an image is important for the quality of the image description. We exploit the transformer to fuse the image features of different scales and obtain global information.
The process of global modeling of images is shown in Figure 4. After the multiscale features are extracted, they are fed into the transformer encoder, where positional encoding accounts for the position of each element in the sequence. The encoded information is then fed into an encoder that consists of four identical encoding modules, each containing two sublayers, namely multi-head attention and a feed-forward network. The feed-forward network extracts the features of the input sequence, and each sublayer reduces the loss of information and ensures training stability through residual connections and layer normalization. In this workflow, $E$ is the input image block feature, $E_{pos}$ is the positional encoding, MSA denotes multi-head self-attention, MLP denotes the feed-forward network, and LN denotes layer normalization. The transformer encoder module employs multi-head self-attention to identify the interrelationships among objects in remote sensing images. The encoder module does not alter the dimensionality of the image features.
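To make the statement that the encoder preserves feature dimensionality concrete, here is a deliberately reduced sketch of one attention sublayer: positional encoding is added, single-head scaled dot-product self-attention is applied, and a residual connection follows. It assumes identity query, key, and value projections and omits the multiple heads, layer normalization, and feed-forward network of the full module; the names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections
    (a simplification; the real module learns Q, K, V projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, tokens))
                    for j in range(d)])
    return out


def attention_sublayer(tokens, pos_encoding):
    # add positional encoding to the image block features
    x = [[t + p for t, p in zip(tok, pe)]
         for tok, pe in zip(tokens, pos_encoding)]
    attended = self_attention(x)
    # residual connection keeps the input alongside the attended output
    return [[a + b for a, b in zip(row_x, row_a)]
            for row_x, row_a in zip(x, attended)]


tokens = [[1.0, 0.0], [1.0, 0.0]]
out = attention_sublayer(tokens, [[0.0, 0.0], [0.0, 0.0]])
print(out)  # [[2.0, 0.0], [2.0, 0.0]]
```

The output has the same number of tokens and the same feature size as the input, which is why several such modules can be stacked.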

Visual-text alignment decoder
We employ an adaptive gating mechanism to adaptively select image information and language information during the decoding phase. The visual-text alignment module consists of a gated attention LSTM (G-LSTM) and an adaptive language LSTM (AL-LSTM). The multiscale contextual features of the image extracted by the encoding network are fed to the decoding LSTM to generate descriptive statements of the image. The input vectors of the gated attention LSTM at each time step are the embedding vector of the current word, the average-pooled features of the image, and the previous hidden state of the adaptive language LSTM. The attention mechanism then guides the LSTM toward specific positions of the multiscale features, whereas the attention vector is optimized by the gating mechanism. Furthermore, the semantic gating vector facilitates the adaptive alignment of the visual features of the decoding process and the textual information of the description statement. Finally, the context vector generated by gated attention and the hidden state of the attention LSTM are fed to the language LSTM. The decoding process of the adaptive visual alignment of the two-layer LSTM is as follows.
Here, the global average feature is the average-pooled image feature; $W_e$ is the word embedding matrix; $W_{va}$, $W_{ha}$, and $W_a^T$ are learnable parameters; and $a_t$ is the vector of attention weights assigned to the regional feature vectors, whose weighted combination forms the visual attention vector. The attention mechanism guides the decoding process to generate a weighted average feature vector at each time step, and the quality of the generated image description depends greatly on the attention result. In this study, we optimize the attention vector by extending the existing attention mechanism and combining it with the gating mechanism, which retains useful attention information during the decoding process. In the derivation of the optimized attention vector $V'$, $w_q^i$, $w_v^i$, $b^i$, and $b^g$ are learnable parameters and ⊙ denotes element-by-element multiplication.
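One common way to combine attention with a gate is to compute an information vector and a sigmoid gate from the same query and attended feature and multiply them element by element. The sketch below follows that general pattern purely as an assumption; the parameter grouping, the function names, and the exact form of the gate are illustrative and may differ from the formulation used by MC-Net.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(w_q, w_v, bias, query, attended):
    """Compute w_q @ query + w_v @ attended + bias for list-based matrices."""
    return [
        sum(a * q for a, q in zip(row_q, query))
        + sum(c * v for c, v in zip(row_v, attended))
        + b
        for row_q, row_v, b in zip(w_q, w_v, bias)
    ]


def gated_attention(query, attended, info_params, gate_params):
    """Return gate * info element-wise; each params tuple is (w_q, w_v, bias).
    The two-branch structure is an assumed, illustrative design."""
    info = affine(*info_params, query, attended)
    gate = [sigmoid(z) for z in affine(*gate_params, query, attended)]
    return [g * i for g, i in zip(gate, info)]


zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
refined = gated_attention(
    query=[1.0, 2.0],
    attended=[4.0, 6.0],
    info_params=(zeros, identity, [0.0, 0.0]),  # info branch copies attended
    gate_params=(zeros, zeros, [0.0, 0.0]),     # gate is sigmoid(0) = 0.5
)
print(refined)  # [2.0, 3.0]
```

A gate value near zero in some dimension suppresses that part of the attention result, which matches the stated aim of retaining only useful attention information.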
To select visual or sentence context information to generate sentences, we introduce a semantic gate. The context vector trades off how much new information the network is considering from the image with what it already knows in the decoder memory.
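The trade-off performed by the semantic gate can be pictured as a convex combination of a sentence-context vector and a visual-context vector, controlled by a scalar gate in [0, 1]. The combination rule and the names below are assumptions in the spirit of adaptive attention models; the convention that a gate of 1 relies on decoder memory and a gate of 0 relies on the image is the one stated in this paper.

```python
def mix_context(gate, sentence_context, visual_context):
    """Blend sentence and visual context with a scalar gate in [0, 1].
    gate = 1 -> only decoder memory, gate = 0 -> only image information."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return [gate * s + (1.0 - gate) * v
            for s, v in zip(sentence_context, visual_context)]


sentence = [1.0, 0.0]
visual = [0.0, 2.0]
print(mix_context(1.0, sentence, visual))  # [1.0, 0.0]
print(mix_context(0.0, sentence, visual))  # [0.0, 2.0]
print(mix_context(0.5, sentence, visual))  # [0.5, 1.0]
```

Function words such as "of" or "the" would typically push the gate toward 1, whereas visually grounded words such as object names would push it toward 0.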
The semantic gate $b_t$ is a scalar in the range [0, 1]. A value of 1 implies that the long- and short-term visual and linguistic information in the decoder memory is used, and 0 means that only spatial image information is used when generating the next word. To calculate $b_t$, we add an additional element $S_t$ to the decoding model, which indicates the extent to which the model attends to the sentence context. In this computation, $V = \{v_1, v_2, \ldots, v_L\}$ denotes the image feature vectors, and $w_s$ and $W_h$ are learnable weight parameters. Furthermore, the context vector $c_t$ is input to the language LSTM to obtain the final output caption after the softmax layer.
Our model is trained by maximum likelihood estimation (MLE), that is, by minimizing the negative log-likelihood loss. The input $x_t$ and the previous hidden state $h_{t-1}$ are combined to obtain the hidden state $h_t$ in the training phase. Then, the softmax function is used to calculate the probability distribution of words during sentence generation, and the word with the highest probability is selected as the predicted word. The predicted word then serves as the input for the next time step. These steps are repeated until the network predicts the end token. The loss function for model training is the sum of the negative log-likelihoods of the correct word at each time step, $L(\theta) = -\sum_{t=1}^{T} \log p_\theta(s_t^* \mid s_1^*, \ldots, s_{t-1}^*)$, where $\theta$ is the parameter to be learned, $T$ is the sentence length, and $(s_1^*, \ldots, s_T^*)$ represents the reference descriptive sentence.
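The training loss and the greedy word choice described above can be illustrated with a few lines of Python. This is a minimal sketch that assumes the per-step softmax outputs are already available as lists of probabilities; the function names are hypothetical.

```python
import math


def caption_nll(step_probs, target_ids):
    """Sum of negative log-likelihoods of the correct word at each step.
    step_probs[t] is the softmax distribution over the vocabulary at step t."""
    return -sum(math.log(probs[target])
                for probs, target in zip(step_probs, target_ids))


def greedy_word(probs):
    """Index of the most probable word, used as the next-step input."""
    return max(range(len(probs)), key=probs.__getitem__)


# Two decoding steps over a two-word vocabulary.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
loss = caption_nll(step_probs, [0, 1])   # -(ln 0.5 + ln 0.75)
next_word = greedy_word([0.1, 0.7, 0.2])  # 1
```

The loss shrinks toward zero as the model assigns probability close to 1 to every reference word.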

Dataset
(1) The UCM-Captions Dataset (Qu et al. 2016)  The dataset is also larger and more varied in its feature categories, which accurately reflects the intricate image variation in remote sensing images, including high intraclass diversity and interclass similarity.Students with a remote sensing background manually annotated the dataset using comprehensive expressions to ensure rich structural and lexical diversity in the description statements.The dataset contained a minimum of six words per description statement.

Evaluation metrics
Five commonly used evaluation metrics were employed to evaluate the performance of the proposed MC-Net for RSIC: BLEU (Papineni et al. 2002), ROUGE (Rouge 2004), METEOR (Banerjee and Lavie 2005), CIDEr (Vedantam, Zitnick, and Parikh 2015), and SPICE (Anderson et al. 2016). These metrics objectively measure the correlation between the generated description sentences and the ground truth. For all of them, a higher value indicates that the generated image description is closer to the ground truth. BLEU is a machine translation metric that analyzes the n-gram correlation between generated description sentences and reference sentences. ROUGE, originally used for text summarization, calculates the longest common subsequence (LCS) and then obtains the F-measure. METEOR, another machine translation metric, has a strong correlation with human judgment. CIDEr and SPICE are specifically designed for image captioning. CIDEr comprehensively evaluates the performance of the model, emphasizing the quality of image content. SPICE focuses on evaluating the sentence structure of a generated description sentence by utilizing a scene graph to encode the targets, attributes, and their relationships within the sentence.
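As an illustration of the LCS-based F-measure behind ROUGE, the following sketch scores one candidate caption against one reference. It is a simplified single-reference version with a hypothetical `beta` parameter defaulting to 1; published evaluation toolkits handle multiple references and may use a different `beta`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate and a single reference."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


cand = "a plane on the runway".split()
ref = "a plane is on the runway".split()
score = rouge_l(cand, ref)  # LCS = 5, precision = 1, recall = 5/6
```

Because the LCS respects word order but allows gaps, the missing word "is" lowers recall without breaking the match for the rest of the sentence.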

Implementation details
In our study, we evaluated the performance of the model using four datasets, each split into 80% for model training, 10% for model validation, and 10% for model testing. Prior to training, the input images were resized to 224 × 224 pixels. The feature extraction network of all the models was standardized using VGG16 as the backbone model, with the encoder model fine-tuned. An Adam optimizer was used. The initial learning rates of the encoder and decoder were set to 1e-4 and 5e-4, respectively. Every five epochs during training, the learning rate was reduced to 0.8 times its current value. The batch size was 64 and the maximum number of epochs was 100. The word-embedding dimensionality was set to 512, while the number of encoding layers of the transformer was four. The model with the maximum CIDEr value was selected for testing. During sentence generation, we used beam search with a beam size of five to generate candidate sentence fragments for an image. The maximum length of the generated sentences was set to 25.
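The data split and the learning-rate schedule can be written down compactly. The sketch below assumes zero-based epoch numbering and a decay that compounds every five epochs, which is one reading of the schedule described above; the function names are hypothetical.

```python
def split_counts(num_images):
    """80/10/10 train/validation/test split using integer arithmetic."""
    train = num_images * 80 // 100
    val = num_images * 10 // 100
    test = num_images - train - val
    return train, val, test


def step_decay_lr(initial_lr, epoch, step=5, factor=0.8):
    """Learning rate after compounding a decay of `factor` every `step`
    epochs (zero-based epochs are an assumption)."""
    return initial_lr * factor ** (epoch // step)


print(split_counts(2100))  # (1680, 210, 210)
encoder_lr_at_10 = step_decay_lr(1e-4, 10)  # 1e-4 * 0.8 ** 2
decoder_lr_at_5 = step_decay_lr(5e-4, 5)    # 5e-4 * 0.8
```

Under this reading, the encoder learning rate has fallen to about 6.4e-5 by epoch 10, so the pretrained backbone is adjusted ever more gently as training proceeds.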

Ablation experiments
A series of ablation experiments is designed to evaluate the effectiveness of the different modules of MC-Net. The baseline model serves as the control, while stacked convolution, Transformer-based feature aggregation, and MLP-based feature aggregation represent the different configurations tested. We name the multi-scale feature extraction module MSF, the local modeling module LM, and the global modeling module GM.
Tables 1-4 display the captioning results of each submodule, demonstrating the impact of the submodules on the captioning performance of the MC-Net model. The experimental results show that adding each submodule outperformed the baseline model across the datasets, and the best image description accuracy was achieved by adding the global modeling module. Specifically, on the RSICD dataset, the multi-scale feature module improved the CIDEr and SPICE values by approximately 1%, the METEOR and ROUGE values by approximately 3%, and the BLEU values by more than 5%, compared to the baseline model. Similarly, the global modeling module showed a significant improvement in each indicator, which was comparable to the performance of the multi-scale feature module, but the performance improvement varied slightly across different datasets. For instance, the multi-scale feature module performed slightly better than the global modeling module on the UCM-Captions and Sydney datasets, whereas the global modeling module outperformed the multi-scale feature module on the RSICD and NWPU-Captions datasets. Our findings demonstrate that incorporating multiscale feature extraction and aggregation modules can improve model performance. Furthermore, the Transformer-based global modeling approach outperformed the feature aggregation approach using MLP and sigmoid in the fusion module. The highest performance was achieved by combining stacked-convolution multiscale feature extraction with transformer-based feature aggregation, highlighting the ability of MC-Net to generate more accurate and fluent sentences.

Comparative experiments
To evaluate the effectiveness of the MC-Net proposed in this study, we compared it with other methods, including Attend (Xu et al. 2015), Convcap (Aneja, Deshpande, and Schwing 2018), CSMLF (Wang et al. 2019), Multimodal (Qu et al. 2016), Sound-a-a (Lu, Wang,  (Hoxha, Scuccato, and Melgani 2023). Among the aforementioned methods, the CSMLF approach is a retrieval-based technique that utilizes metric learning to acquire semantic embeddings, projects image features and sentence representations into the same embedding space, calculates the similarity of the input image and the description statement, and selects the nearest neighboring sentence as the description statement for the test image. The Multimodal technique is a typical encoder-decoder structure that employs a CNN encoder and an LSTM decoder to generate captions. The FC-ATT/SM-ATT model is based on an attribute mechanism that guides the captioning model to focus on high-level features of images. The SAT(LAM)/adaptive (LAM) approach integrates label information and subsequently uses LSTM to maintain attention mapping for better sentence representation. The Sound-a-a method generates descriptions of remote sensing images by utilizing sound to guide the attention mechanism. The GVFGA+LSGA approach introduces the average pooled features in the encoding segment to guide the entire model. MLCA-Net uses multi-attention to strengthen the spatial and contextual information of an image. Among the compared computer vision domain techniques, Show, Attend and Tell is the first method to incorporate the attention mechanism into the decoder of the encoder-decoder network to assign varying weights to different feature regions of the input image at different decoding steps, thereby guiding the dynamic concentration on image regions. The Convcap technique encodes sentences and generates descriptive sentences using a CNN as the decoder. SD-Net introduces a novel decoder that is based on support vector machines (SVMs). PPIC-Net introduces postprocessing strategies based on hidden Markov models (HMMs) and the Viterbi algorithm to rectify errors and improve sentence quality.
Tables 5-8 show the performances of the different captioning methods on the four cross-modal datasets. In general, the results indicate that MC-Net exhibits competitive results across all four datasets when compared to existing image description generation methods in both the remote sensing and computer vision fields. Unlike natural images, remote sensing images possess multiscale characteristics and exhibit background complexity, rendering the direct application of natural domain methods to remote sensing images challenging. Compared with the latest SD-Net and PPIC-Net, our model performs better in terms of semantic fluency and syntactic accuracy; for example, the CIDEr score of MC-Net improves by 8%. The performance improvement is due to the use of a multi-scale feature extraction module, which can better deal with multi-scale targets. Notably, the GVFGA+LSGA remote sensing image captioning method exhibits the best performance among all the comparison models because it introduces global average pooling features as global image information at the encoding side and proposes a language state (LS) at the decoding side, which reduces the burden on the hidden states by providing text features exclusively.
In comparison, the MC-Net model displays the highest performance, in particular an improvement of 5% in the BLEU4, METEOR, and CIDEr evaluation metrics on the Sydney Captions dataset, because of its effective extraction of multi-scale features and global contextual features, which enables a deeper comprehension of images. Additionally, the gated adaptive attention mechanism effectively guides sentence generation by considering both image information and sentence context information. On the UCM-Captions dataset, the MC-Net model outperforms the current best-performing model GVFGA+LSGA, exhibiting an improvement of approximately 2% across eight evaluation metrics, and a more significant improvement of approximately 4% in BLEU values and SPICE metrics, as presented in Tables 5-8. On the RSICD dataset, the overall improvement is more significant, especially with respect to the METEOR value, which exhibits an 8% improvement, although MC-Net scores lower than the GVFGA+LSGA method on the ROUGE and SPICE metrics. This is related to the uneven data categories in the dataset, with relatively small-scale changes and many feature categories exhibiting similar scenes. Nonetheless, overall, the MC-Net model proposed in this paper displays apparent advantages, in particular exhibiting the best results on the NWPU-Captions dataset, which has twice as many data categories as the UCM-Captions dataset, with richer intraclass diversity and interclass diversity. The CSMLF approach exhibited the lowest performance among all the compared models, indicating that the CNN + LSTM structure is better suited for remote sensing image captioning than the metric learning semantic embedding-based approach. The metric learning-based approach utilizes descriptions of similar images as descriptive information for the query image, resulting in generated statements with limited diversity. Additionally, on the UCM-Captions dataset, the LSTM-based decoder structure significantly outperforms the CNN-based decoder, highlighting the advantage of LSTM in sequence modeling tasks. The introduction of attention in the CNN+LSTM framework leads to a significant improvement in the model results, with models using label information to guide the attention mechanism performing better. This suggests that using only visual features to guide sentence description is often insufficient, and models incorporating attribute and label information can perform better. However, the visual-text alignment model proposed in this study, which includes an MS module and a TR module, demonstrates the best results, highlighting its superiority over other models.
The performances of the proposed model, MC-Net, were compared on four distinct datasets, as depicted in Tables 5-8. The experimental results demonstrate that MC-Net performed better on the smaller datasets, UCM-Captions (2100 images) and Sydney-Captions (613 images), compared to the two larger datasets, RSICD with 10,921 images and NWPU-Captions with 31,500 images. This indicates that the performance of image description generation tasks is correlated with the size of the dataset and the scale of the constructed vocabularies, especially given the diverse information and annotations provided by different professionals in the NWPU-Captions dataset. The RSICD dataset also posed a challenge owing to its similar visual categories, such as sparse, medium, and dense residential areas. Nevertheless, our proposed model outperformed the comparison models for all four datasets, demonstrating its generalizability.

The visualization of captioning results
We also present the captioning results of the description sentences using the decoding model based on visual-text alignment achieved through gating and adaptive mechanisms. Figure 5 shows examples of the captioning results, wherein the category information is marked in green, relevant proxemics are marked in blue, and incorrect or missing words in the description process are marked in red. The proposed MC-Net model performs well on the high-resolution remote sensing image captioning dataset, generating statements that are syntactically correct and similar to the ground-truth sentences. The model can accurately describe the attribute information of the main objects in the images, such as aircraft, buildings, and tennis courts, and their relationships, while generating novel vocabulary. Specifically, the narrow runway in Figure 5(b) and the white roofs in Figure 5(c) accurately represent the target attribute information. The visualization result in Figure 5(c) accurately describes the five storage tanks, which indicates that the multi-scale feature extraction module of the model can effectively extract the multi-scale information of the image. In addition, 'near a road', 'next to four table tennis tables' and other relational terms accurately describe the scene information in the image, which indicates that the multi-scale feature fusion module of the model can effectively obtain the global contextual information of the image. The MS effectively extracts multiscale information, whereas the TR module effectively obtains global contextual information. However, we also notice some challenges, such as the difficulty in detecting small targets like 'cars' in the first row of Figure 5(a). These experiments show that more discriminative image features and effective use of contextual features are crucial for image captioning and scene understanding. Achieving alignment of visual features and textual information is also important for generating grammatically accurate and natural statements with diverse semantics.

Parameters analysis
We conducted a parameter sensitivity analysis, shown in Table 9, which reports the captioning performance for different values of H, the number of attention heads in the global modeling module (GM). The accuracy of the generated image description sentences varies with the number of heads, and the best result is achieved with four heads, where CIDEr reaches 3.355. This is higher than the score obtained with the LM method, indicating that better performance can be achieved using the multi-scale feature aggregation based on global modeling.

Speed performance
The time cost of our model is shown in Table 10. To evaluate the efficiency of our method, we measured the inference speed (images per second), the Memory Access Costs (MACs), and the number of parameters on the UCM-Captions dataset. The results show that our method has more MACs and more parameters than the baseline model because of the added multi-scale feature and global modeling modules. Considering both time cost and performance, our method obtains a significant performance gain at a relatively small additional time cost.
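As a rough illustration of how two of these quantities can be obtained, the sketch below measures throughput in images per second for an arbitrary inference callable and counts the parameters of a single convolution layer. The helper names `images_per_second` and `conv2d_params` are hypothetical, and the timing loop is a generic pattern rather than the measurement protocol used for Table 10.

```python
import time


def images_per_second(infer_fn, images, warmup=2):
    """Return throughput of `infer_fn` over `images` in images per second.

    A few warm-up calls are made first so that one-off setup costs
    do not distort the timed loop.
    """
    for image in images[:warmup]:
        infer_fn(image)
    start = time.perf_counter()
    for image in images:
        infer_fn(image)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast callables.
    return len(images) / max(elapsed, 1e-12)


def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in a square 2D convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)
```

For example, `conv2d_params(3, 64, 7)` counts 3 × 64 × 49 weights plus 64 biases, which shows how larger kernels in additional branches increase the parameter total.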

Different scale of contextual information analysis
We also analyzed the features fed into the multi-scale feature extraction module of the image encoder. To assess the effect of image features at different scales, the ablation experiments consider combinations of different scales (1 × 1, 3 × 3, 5 × 5 and 7 × 7). Table 11 shows the experimental results. Using only the 1 × 1 scale for feature extraction gives the lowest captioning performance. As more scales are fused and information from larger receptive fields becomes available, the quality of the generated image description sentences improves, and fusing all four scales simultaneously gives the best results. The difference between these two settings is about 10%, indicating that image features at different scales affect the performance of image captioning.
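The effect of the four scales on the receptive field can be sketched with a toy example. The code below applies k × k mean filters to a small single-channel grid and collects one response map per scale. This is only a stand-in under stated assumptions: the mean filter replaces the learned convolutions of the actual module, and the names `box_filter` and `multi_scale_features` are hypothetical.

```python
def box_filter(image, k):
    """Apply a k x k mean filter with zero padding (output keeps the input size)."""
    height, width = len(image), len(image[0])
    radius = k // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = i + di, j + dj
                    # Positions outside the image contribute zero.
                    if 0 <= y < height and 0 <= x < width:
                        total += image[y][x]
            out[i][j] = total / (k * k)
    return out


def multi_scale_features(image, scales=(1, 3, 5, 7)):
    """Return one filtered map per scale, keyed by kernel size."""
    return {k: box_filter(image, k) for k in scales}


# A 7 x 7 grid with a single active pixel at the center.
grid = [[0.0] * 7 for _ in range(7)]
grid[3][3] = 1.0
features = multi_scale_features(grid)
```

In this toy input the single active pixel influences k × k output positions at scale k, so the 7 × 7 branch aggregates context from a much wider neighborhood than the 1 × 1 branch.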

Conclusion
In this study, we have proposed a novel network called MC-Net for remote sensing image captioning. MC-Net addresses the multi-scale and complex background problems in practical applications. The proposed MC-Net consists of a multi-scale feature extraction module and a feature fusion module. Additionally, an adaptive visual-text alignment mechanism is used to generate accurate and fluent descriptive sentences. Ablation experiments confirm the effectiveness of each module. Comparative experiments demonstrate the generalizability and robustness of MC-Net. Despite the superior performance achieved by our proposed model, we believe that the MS and FA modules are not the only way to address the multi-scale nature and background complexity of remote sensing images, and we encourage the exploration of other approaches. Future research will explore more fine-grained image captioning approaches to improve the captioning results for complex large scenes. We also plan to extend our research to unsupervised learning, so that fine-grained semantic descriptions can be generated automatically for the abundance of unlabeled complex images and put to better use in real scenes.

Figure 1. The overall architecture of the MC-Net. The image encoder with multi-scale feature extraction and feature aggregating module is used to extract visual features, and the description decoder with visual-text alignment LSTM is used to generate description sentences.

Figure 2. Outline of the multi-scale visual feature extraction module.

Figure 3. The process of local modeling of images.

Figure 4. The process of global modeling of images.

Figure 5. Captioning results of MC-Net on four datasets. The first to the fourth columns are selected examples from UCM-Captions, Sydney-Captions, RSICD, and NWPU-Captions, respectively.
(1) The UCM-Captions Dataset (Qu et al. 2016) is built on the UCMerced Land Use Dataset (Yang and Newsam 2010), which was originally designed for remote sensing image classification tasks. To enrich each remote sensing image, five distinct descriptive statements, each annotated by the same person, were incorporated. The annotated sentence descriptions cover 21 categories, with each category containing 100 images. Each image is 256 × 256 pixels, with a spatial resolution of 0.3048 m per pixel. The dataset contains 2,100 images paired with 10,500 sentence descriptions.

(2) The Sydney-Captions Dataset (Qu et al. 2016) enriches each image by including five distinct sentences, each annotated by the same person, across seven categories. The number of images differs from category to category. All images (500 × 500 pixels) in the dataset have a resolution of 0.5 m. It consists of 613 images and 3,065 sentence descriptions and has been frequently used for comparative evaluations with the UCM-Captions Dataset in remote sensing description generation tasks.

(3) The RSICD (Lu et al. 2017) was developed in 2017, featuring 10,921 images distributed across 30 categories, each having a size of 224 × 224 pixels. Each image in the dataset is accompanied by 1-2 description statements, totaling 24,233 annotations. The description statements are sourced from multiple individuals, resulting in five sentence descriptions for each image. In instances where an image has fewer than five sentences, existing sentence descriptions are extended through random copying. Consequently, the five descriptive statements for each image in this dataset may not be entirely distinct.

(4) The NWPU-Captions Dataset (Cheng et al. 2022) contains 31,500 images (256 × 256 pixels) across 45 categories, each having a resolution of 0.228 m. Unlike the other datasets discussed previously, the NWPU-Captions Dataset comprises five unique descriptions labeled by different individuals, ensuring a diverse set of sentences.
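The dataset statistics and the random-copying rule described for RSICD can be checked with a short sketch. The helper `pad_captions` is a hypothetical illustration of extending a caption list to five entries by randomly duplicating existing ones. It is not the original dataset tooling, and the seeded generator is an assumption made only for reproducibility.

```python
import random


def pad_captions(captions, target=5, rng=None):
    """Extend `captions` to `target` entries by randomly copying existing ones.

    Assumption: copies are drawn uniformly from the original captions.
    Lists that already have `target` or more entries are returned unchanged.
    """
    if not captions:
        raise ValueError("at least one caption is required")
    rng = rng or random.Random(0)
    padded = list(captions)
    while len(padded) < target:
        padded.append(rng.choice(captions))
    return padded


# Consistency checks on the figures quoted in the text.
ucm_images = 21 * 100             # 21 categories, 100 images each
ucm_sentences = ucm_images * 5    # five sentences per image
sydney_sentences = 613 * 5        # five sentences per image

example = pad_captions(["a plane on a runway", "an airport with one plane"])
```

Because the copies are drawn from the existing sentences, the padded list always reaches five entries but contains at most as many distinct sentences as the original annotation.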

Table 1. Settings and results of ablation experiments on the UCM-Captions.

Table 2. Settings and results of ablation experiments on the Sydney-Captions.

Table 3. Settings and results of ablation experiments on the RSICD.

Table 4. Settings and results of ablation experiments on the NWPU-Captions.

Table 10. Comparison of our methods in terms of inference speed (images per second), MACs and parameters. All results are reported based on the UCM-Captions.

Table 9. Comparison of captioning performance with different numbers of attention heads (H) on the UCM-Captions.