Multiscale Cascaded Network for the Semantic Segmentation of High-Resolution Remote Sensing Images

Abstract As remote sensing images have complex backgrounds and varying object sizes, their semantic segmentation is challenging. This study proposes a multiscale cascaded network (MSCNet) for semantic segmentation. The input remote sensing images are processed at scales of 1, 1/2, and 1/4, representing high, medium, and low resolutions. First, 3 backbone networks extract features at the different resolutions. Then, after fusion through a multiscale attention network, the fused features are input into a dense atrous spatial pyramid pooling network to obtain multiscale information. The proposed MSCNet introduces multiscale feature extraction and attention mechanism modules suitable for remote sensing land-cover classification. Experiments are performed on the Deepglobe, Vaihingen, and Potsdam datasets, and the results are compared with those of existing classical semantic segmentation networks. The findings indicate that the mean intersection over union (mIoU) of the MSCNet is 4.73% higher than that of DeepLabv3+ on the Deepglobe datasets. On the Vaihingen datasets, the mIoU of the MSCNet is 15.3% and 6.4% higher than those of a segmented network (SegNet) and DeepLabv3+, respectively. On the Potsdam datasets, the mIoU of the MSCNet is higher than those of a fully convolutional network, Res-U-Net, SegNet, and DeepLabv3+ by 11.18%, 5.89%, 4.78%, and 3.03%, respectively.


Introduction
Image semantic segmentation is a computer-vision task that aims to assign each pixel in an image a label representing its semantic category, thereby segmenting the image into different categories. Given the rapid development of convolutional neural networks (CNNs), image semantic segmentation technologies combined with a CNN achieve better segmentation results than conventional algorithms. However, certain challenges are associated with the existing semantic segmentation algorithms (e.g., intraclass semantic error recognition, small-scale object loss, and fuzzy segmentation boundaries). Therefore, to improve the semantic segmentation accuracy, capturing additional feature information and optimizing object boundaries are crucial.
The problems and challenges associated with object complexity in remote sensing images make their semantic segmentation different from conventional computer-vision tasks. For remote sensing images, classic semantic segmentation networks cannot deal properly with situations where some types of predictions are best handled at a lower resolution and other objects are best handled at a higher resolution. That is, in remote sensing images, serious scale inconsistencies exist among ground objects; a fully convolutional method with a fixed receptive field size cannot effectively recognize objects at different scales. Typically, this causes incomplete recognition of large-scale objects, wrong prediction categories, and inefficient recognition of small-scale objects. For fine details, such as the edge of each category, the image size can usually be enlarged to recognize them better. Moreover, predictions for larger categories that require more global context usually improve after the image size is reduced, because the network can gain more of the necessary contextual semantic information. The aforementioned problem can be solved using multiscale methods. Therefore, to achieve effective recognition of objects with different scales in remote sensing images, an appropriate receptive field size should be used by the network when learning the characteristics of these objects (Li et al. 2022; Peng et al. 2022; Shang et al. 2020). Furthermore, an attention mechanism can allocate different weights according to the characteristics and relevance of a task and extract the information that is important to the task. Such a mechanism can expand the receptive field to obtain global information and adaptively reflect the dependence between different positions based on various inputs, ignoring unnecessary information while extracting key information (Mehta and Murala 2021; Zhao et al. 2021). Thus, we perform multiscale feature extraction and introduce an attention mechanism to build a network for the semantic segmentation of remote sensing images. The primary contributions of this study are as follows:

Related works
Semantic segmentation Long et al. (2015) proposed the first fully convolutional network (FCN) and successfully applied it to the field of image semantic segmentation. Notably, an FCN uses a fully convolutional layer to replace the fully connected layer to achieve end-to-end training.
Although it can adapt to an input image of any size, its disadvantages are evident, and the results are not sufficiently precise. Although the effect of upsampling 8 times is considerably better than that of upsampling 32 times, the result of the former is still blurry and smooth and is insensitive to details in images; moreover, each pixel is classified independently. In an FCN, the relationship between pixels is not fully considered, and the spatial regularization step used in the usual pixel-classification-based segmentation methods is ignored, resulting in a lack of spatial consistency. Ronneberger et al. (2015) proposed U-Net, a U-shaped encoder-decoder structure. In the decoder, skip connections were used to fuse more encoder features to improve the segmentation effect. Badrinarayanan et al. (2017) proposed a segmented network (SegNet) comprising 2 modules: an encoder and a decoder. Indices were first recorded at the pooling layers of the encoder, followed by upsampling based on the index values to reduce the number of training parameters. However, the network was not sufficiently refined owing to the incomplete use of contextual information. Zhao et al. (2017) proposed PSPNet, which employed a pyramid pooling module to capture the contextual information of images and pooling layers of different sizes to extract multiscale features. Although the network sufficiently fused high-level features, low-level features were insufficiently fused; hence, the segmentation effect was limited. Chen et al. (2014, 2017, 2018a, 2018b) proposed the DeepLab series of networks, which introduced atrous convolution to expand the receptive field and extract additional scale-feature information. These networks were extremely influential in semantic segmentation. Tao et al. (2020) proposed hierarchical multiscale attention for semantic segmentation. Network predictions at certain scales are better at resolving particular failure modes, and the network learns to favor those scales in such cases to generate better predictions.
Semantic segmentation of remote sensing images Zhao et al. (2021) proposed a novel multistage-fusion and multisource-attention network for the multimodal segmentation of remote sensing data. The multistage-fusion module fused complementary information after calibrating deviation information by filtering the noise from multimodal data. Meanwhile, the proposed multisource-attention mechanism aggregated similar feature points to enhance the discriminability of features with different modalities. Although this network effectively improves multimodal fusion, it still needs improvement with regard to mining complementary features and the multimodal fusion mechanism. Qi et al. (2020) proposed a replaceable module, called the AT block, using multiscale convolution and an attention mechanism as the building block of ATD-LinkNet. The AT block fused features at different scales and effectively used the considerable spatial and semantic information in remote sensing images. This deep network with an attention mechanism and multiscale convolution performed well on the segmentation of ultrahigh-resolution remote sensing images. Furthermore, they adopted dense upsampling convolution to refine the boundaries in the road-extraction task. Chen et al. (2021) proposed a remote sensing semantic segmentation algorithm based on Res-U-Net combined with atrous convolution. The conventional U-Net for semantic segmentation was improved as a backbone network. A residual convolution unit was employed to replace the original U-Net convolution unit to increase the network depth and prevent vanishing gradients. To detect more feature information, a multibranch atrous-convolution module was added between the encoder and decoder modules to extract semantic features. Moreover, the atrous-convolution dilation rate was modified to ensure a better effect of the network with regard to small-object category segmentation.

Overview
The MSCNet is a multiscale, multiresolution associative network suitable for the semantic segmentation of remote sensing images in land-cover classification (Figure 1). We first rescale the original images via bilinear interpolation; the scales of the input remote sensing images are 1, 1/2, and 1/4. Subsequently, 3 backbone networks (ResNet101, DenseNet169, and VGG16) extract features from the remote sensing images at the different resolutions. Using the multiscale attention module SCAM, features are then extracted along the 2 dimensions of channel and space.
Thereafter, the features extracted at the different scales are fused to generate F1 via a concatenate operation. The fused features are input into DenseASPP (Yang et al. 2018) to obtain multiscale information and generate F2. Then, a concatenate operation is performed on F1 and F2, and the final prediction results are generated. Thus, effective recognition of objects at different scales in remote sensing images can be achieved. Notably, the network uses an appropriate receptive field size when learning the characteristics of objects with different scales. Incorporating an attention mechanism into the network ensures more focus on the locations that require attention.
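The pipeline described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: small convolution layers stand in for the ResNet101, DenseNet169, and VGG16 backbones, and placeholder layers stand in for the SCAM and DenseASPP modules; only the multiscale rescaling, the fusion into F1, the context extraction into F2, and the final concatenation follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSCNetSketch(nn.Module):
    """Structural sketch of the MSCNet pipeline (not the authors' code)."""

    def __init__(self, num_classes=7, feat=64):
        super().__init__()
        # Placeholder backbones; the real network uses ResNet101,
        # DenseNet169, and VGG16 for scales 1, 1/2, and 1/4.
        self.backbone_full = nn.Conv2d(3, feat, 3, padding=1)
        self.backbone_half = nn.Conv2d(3, feat, 3, padding=1)
        self.backbone_quarter = nn.Conv2d(3, feat, 3, padding=1)
        self.scam = nn.Identity()                 # stands in for SCAM
        self.dense_aspp = nn.Conv2d(3 * feat, feat, 1)  # stands in for DenseASPP
        self.head = nn.Conv2d(3 * feat + feat, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        # Rescale the input to scales 1/2 and 1/4 via bilinear interpolation.
        x_half = F.interpolate(x, scale_factor=0.5, mode="bilinear",
                               align_corners=False)
        x_quarter = F.interpolate(x, scale_factor=0.25, mode="bilinear",
                                  align_corners=False)
        f_full = self.scam(self.backbone_full(x))
        f_half = self.scam(self.backbone_half(x_half))
        f_quarter = self.scam(self.backbone_quarter(x_quarter))
        # Upsample back to full resolution and concatenate -> F1.
        f1 = torch.cat([
            f_full,
            F.interpolate(f_half, size=(h, w), mode="bilinear",
                          align_corners=False),
            F.interpolate(f_quarter, size=(h, w), mode="bilinear",
                          align_corners=False),
        ], dim=1)
        f2 = self.dense_aspp(f1)      # multiscale context -> F2
        return self.head(torch.cat([f1, f2], dim=1))
```

The per-pixel class scores keep the spatial size of the input, as required for dense prediction.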
The numbers of channels in the input and output of each convolution of DenseNet are considerably lower than those of ResNet; consequently, the number of parameters of the fully connected layer is also considerably lower. Although DenseNet requires fewer parameters, its feature maps are considerably larger, resulting in a much larger computation amount and memory footprint during convolution. Moreover, DenseNet must frequently read memory; therefore, its training speed is slower than that of ResNet. As a result, considering the memory consumption, training speed, number of parameters, and other factors, ResNet101 and DenseNet169 are employed as the feature-extraction networks for input images with scales of 1 and 1/2, respectively. However, for input images with a scale of 1/4, VGG16, which has fewer layers and parameters, is employed for feature extraction. By combining the 3 above-mentioned networks during training, feature extraction can be performed on images of different scales.
For images with a scale of 1 (i.e., images at their original size), ResNet101 is employed for feature extraction (Figure 2). First, a 7 × 7 × 64 convolution is performed. Notably, 3 + 4 + 23 + 3 = 33 building blocks are present, with each building block comprising 3 layers. The fully connected layer is then used for the classification.
For input images with a scale of 1/2, DenseNet169 is selected as the feature-extraction backbone network (Figure 3). As per the design of the dense block, each layer receives the inputs of all the previous layers; thus, after concatenation, the number of input channels remains large. Moreover, the 3 × 3 convolutions of each dense block are preceded by a 1 × 1 convolution operation. The objective here is to reduce the number of input feature maps, thereby reducing the dimension and calculation amount and integrating the characteristics of each channel.
For images with a scale of 1/4, VGG16 is selected as the backbone network for feature extraction (Figure 4). It comprises 13 convolution layers and 3 fully connected layers. VGG16 has 6 block structures with the pooling layers as boundaries, and the number of channels within each block structure is the same. Because the convolution and fully connected layers involve weight coefficients, they are called weight layers; notably, the pooling layers do not involve weight coefficients. A total of 13 convolution layers and 5 pooling layers are responsible for feature extraction, and the last 3 fully connected layers are responsible for the classification.

SCAM
In computer-vision tasks, an attention mechanism can focus on the important areas of images and discard irrelevant information. In the human visual cortex, attention is used to analyze complex scene information quickly and effectively; this mechanism is introduced into computer-vision tasks to improve their performance. Each channel of a feature map acts as a special detector; therefore, determining which features the channels focus on is crucial. A spatial attention module, in turn, focuses on where these features are useful. Notably, the SCAM adopts a combination of spatial and channel attention modules; Figure 5 shows its structure. First, feature map F0 is input into the spatial attention module to generate feature map F1. Second, the original feature map F0 is combined with the generated feature map F1 to generate a new feature map F2. Third, the feature map F2 is input into the channel attention module to generate feature map F3. Finally, feature map F3 is combined with feature map F2 to obtain the resultant output.
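The SCAM ordering (spatial attention, combine, channel attention, combine) can be sketched as below. The stand-in attention maps and the elementwise-multiply combination operator are assumptions for illustration; the paper does not specify these exact forms at this point.

```python
import torch
import torch.nn as nn

class SCAMSketch(nn.Module):
    """Sketch of the SCAM ordering: spatial attention first, then channel
    attention, each combined with its input (combination via elementwise
    multiplication is an assumption)."""

    def __init__(self, channels):
        super().__init__()
        # Stand-in spatial attention: a 1-channel map from a 7x7 convolution.
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3),
                                     nn.Sigmoid())
        # Stand-in channel attention: per-channel weights from pooled features.
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(channels, channels, 1),
                                     nn.Sigmoid())

    def forward(self, f0):
        f1 = self.spatial(f0)   # spatial attention map F1
        f2 = f0 * f1            # combine F0 with F1 -> F2
        f3 = self.channel(f2)   # channel attention weights F3
        return f2 * f3          # combine F2 with F3 -> output
```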
The channel attention module aims to make the input image more meaningful by calculating the weight coefficient of each channel of the input through the network. Specifically, to improve the feature-representation ability, it focuses more on channels containing key information and less on those with less important information. In the channel attention module, the lightweight channel attention mechanism ECA-Net and a pooling module are combined to develop the ECAP (ECA-Net-pooling) module (Figure 6).
ECA-Net. The dimension reduction in the 2 fully connected layers after global average pooling in SENet (Hu et al. 2018) affects the weight learning of the channel attention module. After channel-level global average pooling without dimension reduction, the ECA captures local cross-channel interaction information by considering each channel and its k neighbors. ECA-Net ensures model efficiency and accurate calculation and achieves better results than SENet (Wang et al. 2020). In ECA-Net, a method that obtains local cross-channel interaction information is explored to ensure model efficiency and effectiveness. A band matrix W_k is used to learn channel attention. For the aggregated feature y ∈ R^C, obtained without reducing the dimension, the channel weights are computed as ω = σ(C1D_k(y)), where C1D indicates one-dimensional convolution with kernel size k, and σ represents the sigmoid function.
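An ECA-style channel attention layer, following the published ECA-Net design (Wang et al. 2020), can be sketched as follows: channel-wise global average pooling, a one-dimensional convolution with kernel size k over the channel descriptor, and a sigmoid to produce per-channel weights.

```python
import torch
import torch.nn as nn

class ECASketch(nn.Module):
    """ECA-style channel attention: ω = σ(C1D_k(y)) applied to the
    channel-wise global-average-pooled descriptor y, with no
    dimensionality reduction."""

    def __init__(self, k=3):
        super().__init__()
        # 1-D convolution across the channel dimension, kernel size k.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                         # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                    # channel-wise GAP -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # local cross-channel mixing
        w = self.sigmoid(y).unsqueeze(-1).unsqueeze(-1)
        return x * w                              # reweight the channels
```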
Adaptive pooling. Average pooling obtains the average value of each rectangular area; it can thus pass the information of all features in a feature map to the next layer. In contrast, maximum pooling retains only the feature with the largest value. Notably, remote sensing images have more useful information and less background information; therefore, average pooling can be used to retain more image information. Adaptive pooling is unique in that the size of the output tensor is a given output_size. If the size of the pooling kernel is kernel_size, then output_size is related to padding, stride, kernel_size, and input_size as follows:

output_size = (input_size + 2 × padding − kernel_size) / stride + 1.

Thus, the pooling module herein adopts an adaptive average pooling design. The sizes of the feature-map outputs are 1 × 1, 2 × 2, 3 × 3, and 6 × 6, and the number of channels does not change. Finally, the feature-map outputs corresponding to the 4 scales are added and then integrated with spatial attention.
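The 4-scale adaptive average pooling can be sketched as below. Because maps of sizes 1 × 1, 2 × 2, 3 × 3, and 6 × 6 cannot be summed directly, this sketch assumes each pooled map is bilinearly upsampled back to the input size before the addition.

```python
import torch
import torch.nn.functional as F

def ecap_pooling(x, sizes=(1, 2, 3, 6)):
    """Adaptive average pooling at 1x1, 2x2, 3x3, and 6x6 (channel count
    unchanged); the upsample-before-sum step is an assumption."""
    h, w = x.shape[-2:]
    out = torch.zeros_like(x)
    for s in sizes:
        pooled = F.adaptive_avg_pool2d(x, s)           # (N, C, s, s)
        out = out + F.interpolate(pooled, size=(h, w),
                                  mode="bilinear", align_corners=False)
    return out
```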
Spatial attention here is a multiscale attention mechanism (MSAM) used for locating the information of interest and suppressing unnecessary information (Figure 7). As not all regions in an image contribute equally to a task, only the task-related regions should be considered; for example, in classification tasks, the aim of the spatial attention model is to identify the parts of the input most important to the network. The MSAM employs convolution blocks of 3 × 3, 5 × 5, and 7 × 7 to build a multiscale spatial attention module. Each larger two-dimensional convolution is divided into 2 smaller one-dimensional convolutions, based on the idea of factorization into smaller convolutions in Inception V3 (Szegedy et al. 2016). In particular, the 3 × 3, 5 × 5, and 7 × 7 convolutions are divided into 1 × 3 and 3 × 1, 1 × 5 and 5 × 1, and 1 × 7 and 7 × 1 convolutions, respectively. This technique saves many parameters, accelerates the operation, and mitigates overfitting; moreover, it increases the nonlinear expression ability of the layers. Asymmetric convolution splitting is more effective than symmetric splitting into several identical small convolution kernels, as it can handle more and richer spatial features and increase feature diversity.
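A sketch of the MSAM's asymmetric-convolution branches is given below. How the three branches are merged into a single attention output (summation before a sigmoid here) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class MSAMSketch(nn.Module):
    """Multiscale spatial attention sketch: each k x k convolution is
    factorized into a 1 x k followed by a k x 1 convolution; branch
    merging via summation is an assumption."""

    def __init__(self, channels):
        super().__init__()

        def factorized(k):
            # 1 x k then k x 1, with padding preserving spatial size.
            return nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2)),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0)),
            )

        self.b3, self.b5, self.b7 = factorized(3), factorized(5), factorized(7)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Sum the 3 x 3, 5 x 5, and 7 x 7 branches, then squash to (0, 1).
        return self.sigmoid(self.b3(x) + self.b5(x) + self.b7(x))
```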

Dense atrous spatial pyramid pooling
The emergence of atrous (hole) convolution improves the receptive field size without losing information. Atrous spatial pyramid pooling (ASPP) stacks atrous convolutions with different atrous rates in parallel or in cascade to obtain multiscale information. DenseASPP combines the ideas of DenseNet and ASPP. The feature maps are convolved and sampled using atrous convolution kernels with varying dilation rates. A dense connection structure is employed among the sampled feature maps, combining the convolution feature map of each layer with those of all subsequent layers; thus, the feature map of each layer is the combination of those of all previous layers. Target image features at different scales are obtained via atrous convolutions with different dilation rates and then combined via dense connection. This dense connection obtains a larger receptive field than a single atrous convolution or a simple pyramid of multiple atrous convolutions. Therefore, it has better advantages with regard to feature reuse than ASPP or an encoder-decoder model; DenseASPP gains more receptive field than conventional ASPP. In ASPP, the atrous convolution layers work in parallel, and the subbranches do not share any information during the feedforward process. In contrast, the dilated convolution layers in DenseASPP share information through skip connections. The layers with small and large dilation rates work interdependently: the feedforward process forms a dense feature pyramid and produces a larger effective filter that senses large contextual information. A 3 × 3 convolution with dilation rate d has a receptive field of 2d + 1. Hence, the largest receptive field of ASPP (3, 6, 12, 18, 24), attained by its largest parallel branch, is as follows:

R_ASPP,max = 2 × 24 + 1 = 49.

However, because the branches in DenseASPP are stacked through dense connections, their receptive fields compose, and the largest receptive field of DenseASPP (3, 6, 12, 18, 24) is as follows:

R_DenseASPP,max = 2 × (3 + 6 + 12 + 18 + 24) + 1 = 127.

Moreover, owing to the dense connection and the merging of multiscale feature maps, DenseASPP can compensate for the gaps left by atrous convolution kernels with large dilation rates, which cannot sample detailed information. The network also has the advantage of DenseNet, alleviating the problem of gradient disappearance in deep networks.
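The receptive-field arithmetic above can be checked directly: a single 3 × 3 convolution with dilation rate d has a receptive field of 2d + 1, and two stacked convolutions with receptive fields R1 and R2 compose to R1 + R2 − 1.

```python
def rf_3x3_dilated(rate):
    # Receptive field of one 3x3 convolution with dilation `rate`.
    return 2 * rate + 1

def aspp_max_rf(rates):
    # Parallel ASPP branches: the largest receptive field is that of the
    # branch with the largest dilation rate.
    return max(rf_3x3_dilated(r) for r in rates)

def dense_aspp_max_rf(rates):
    # Densely stacked branches: receptive fields compose as R = R1 + R2 - 1.
    rf = 1
    for r in rates:
        rf = rf + rf_3x3_dilated(r) - 1
    return rf

print(aspp_max_rf((3, 6, 12, 18, 24)))        # 49
print(dense_aspp_max_rf((3, 6, 12, 18, 24)))  # 127
```

These values match the two receptive-field expressions given above.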

Loss function
The loss function uses pixelwise cross-entropy loss. It checks each pixel separately, comparing the predicted class vector with the one-hot encoded target vector.
The cross-entropy loss evaluates the class prediction of each pixel vector separately and then averages over all pixels; thus, the pixels of the image are learned equally. However, class imbalance often occurs in remote sensing images; the training is then dominated by the classes with more pixels, and learning the features of small objects is difficult, thereby reducing the effectiveness of the network. Therefore, the loss is weighted per class to offset the problem of uneven categories in the dataset.
The network output is the pixelwise softmax:

p_k(x) = exp(a_k(x)) / Σ_{k'=1}^{K} exp(a_{k'}(x)),

where p_l(x) indicates the output probability of pixel x on the channel where the real label l is located. The weight parameter for class i is chosen to be inversely proportional to its pixel frequency N_i/N, where N is the total number of pixels in the dataset and N_i is the number of ith-class pixels in the dataset.
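The inverse-frequency class weights can be computed as below. The exact form of the weight is not fully specified above; this sketch assumes w_i = N / (C × N_i), where C is the number of classes, which is one common inverse-frequency choice.

```python
def class_weights(pixel_counts):
    """Inverse-frequency class weights for a weighted cross-entropy loss.
    Assumed form: w_i = N / (C * N_i), with N the total pixel count,
    N_i the pixel count of class i, and C the number of classes."""
    n_total = sum(pixel_counts)
    c = len(pixel_counts)
    return [n_total / (c * n_i) for n_i in pixel_counts]
```

In PyTorch, such weights can be passed as the `weight` argument of `torch.nn.CrossEntropyLoss` so that underrepresented classes contribute more to the loss.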

Datasets and protocols
The Deepglobe datasets (Demir et al. 2018) contain 803 remote sensing images with a size of 2,448 × 2,448 pixels and a resolution of 50 cm; each image has a corresponding annotation. The objects in the datasets are divided into 7 categories: urban land, agricultural land, rangeland, forest land, water, barren land, and unknown. However, unknown is not included in the evaluation criteria.
The Vaihingen datasets (Rottensteiner et al. 2012) contain 33 high-resolution remote sensing images with fine annotation. They are named after Vaihingen, the city from where the data were collected. The average size of an image in the Vaihingen datasets is 2,494 × 2,064 pixels. The objects in this dataset are classified into 6 categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. However, the clutter/background category is not included in the evaluation criteria.
The Potsdam datasets (Rottensteiner et al. 2012) contain 38 finely labeled high-resolution remote sensing images. They are named after Potsdam, where the data-collection site is located. The size of all images in the Potsdam datasets is 6,000 × 6,000 pixels. The objects in this dataset are divided into the same 6 categories as in the Vaihingen dataset; however, only 5 categories are included in the evaluation criteria.
The images of each dataset are divided into 512 × 512 subimages, and different overlap sizes are set based on the image size. Furthermore, the pixels on edges that are not divisible by 512 are discarded. Then, 60%, 20%, and 20% of the patches are randomly selected as the training, validation, and test datasets, respectively.
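The tiling step can be sketched as below: the function enumerates the top-left corners of the 512 × 512 patches, and edge remainders not divisible by the tile size are discarded, as described above.

```python
def tile_coords(height, width, tile=512, overlap=0):
    """Top-left (row, col) corners of tile x tile patches covering an
    image; edge pixels that do not fill a whole tile are discarded."""
    stride = tile - overlap
    coords = []
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            coords.append((y, x))
    return coords
```

For a 2,448 × 2,448 Deepglobe image with no overlap, this yields a 4 × 4 grid of patches, with the 400-pixel remainder on each edge dropped.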
All the experiments in this study are conducted using PyTorch on a single computer equipped with one NVIDIA Tesla A100 GPU. The optimizer is stochastic gradient descent with momentum; the values of the weight decay and momentum are 0.0001 and 0.9, respectively, and the initial learning rate is 0.001.
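The stated training configuration corresponds to the following PyTorch optimizer setup; the one-layer model here is a stand-in, not the MSCNet.

```python
import torch

# Hypothetical stand-in model; the paper trains the MSCNet with
# SGD + momentum 0.9, weight decay 1e-4, and initial learning rate 1e-3.
model = torch.nn.Conv2d(3, 7, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)
```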

Experiment with the Deepglobe datasets
In the ablation experiment, to confirm the effect of the SCAM, the SCAM was added into DeepLabv3+. Table 1 presents the experimental results. The mean intersection over union (mIoU) is the standard measure for semantic segmentation; because it is concise and representative, it has become the most common standard, and most researchers use it to report their results. It can be calculated as follows:

mIoU = (1/k) Σ_{i=1}^{k} TP_i / (TP_i + FP_i + FN_i),

where TP_i, FP_i, and FN_i are the numbers of true-positive, false-positive, and false-negative pixels of class i, respectively, and k is the number of classes. By adding the SCAM into DeepLabv3+, the intersection over union (IoU) of each class increases to varying degrees compared with the baseline, and the mIoU increases by 1.93%. With regard to individual categories, the IoUs of water (WT) and urban land (UL) increase considerably, by 3.13% and 3.02%, respectively. A pooling module is then added on top of the previous experiment; that is, a complete ECAP module is added into DeepLabv3+. The experimental results indicate that although UL's IoU decreases slightly and the IoUs of agricultural land (AL) and rangeland (RL) differ negligibly (by <0.1%), the IoUs of the remaining categories increase. Subsequently, the MSAM is added into DeepLabv3+. Clearly, the IoUs of all classes increase to varying degrees compared with the baseline; among them, WT's IoU increases the most (8.01%). In particular, adding the spatial attention module improves these values more than adding the channel attention module, indicating that spatial attention has a greater impact on the experimental results. Finally, when the complete SCAM is added into DeepLabv3+, the mIoU reaches 74.80%, which is 4.27% higher than the mIoU of DeepLabv3+ alone.
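The mIoU formula above can be computed from a per-class confusion matrix as follows; this is a standard computation, not code from the paper.

```python
def mean_iou(conf):
    """mIoU from a k x k confusion matrix (rows: ground truth, columns:
    prediction): IoU_i = TP_i / (TP_i + FP_i + FN_i), averaged over classes."""
    k = len(conf)
    ious = []
    for i in range(k):
        tp = conf[i][i]
        fp = sum(conf[r][i] for r in range(k)) - tp  # predicted i, truth != i
        fn = sum(conf[i][c] for c in range(k)) - tp  # truth i, predicted != i
        ious.append(tp / (tp + fp + fn))
    return sum(ious) / k
```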
The second ablation experiment compares the MSCNet with MSFE plus DenseASPP (i.e., the network without the SCAM). Similarly, DeepLabv3+ is included as the classical semantic segmentation network for comparison. Table 2 presents the mIoU results of the 3 algorithms. Clearly, the MSCNet outperforms the other 2 algorithms in terms of both the overall mIoU and the per-category IoU. The mIoU of the MSCNet is 75.26% (i.e., 4.73% and 3.19% higher than those of DeepLabv3+ and MSFE + DenseASPP, respectively). In terms of individual categories, compared with the corresponding values for DeepLabv3+, the values for WT, UL, and forest land (FL) are 11.74%, 4.14%, and 3.97% higher, respectively, for the MSCNet. In particular, compared with MSFE + DenseASPP, the values for all the categories are up to 2% higher or more, among which the value for WT is 5.45% higher, followed by FL (3.97% higher).
Table 3 presents the precision results. Precision refers to the proportion of true positives among all the samples predicted as positive. The formula for calculating precision is as follows:

Precision = TP / (TP + FP).

The precision of the MSCNet is 83.64%, which is 5.5% and 3.58% higher than those of DeepLabv3+ and MSFE + DenseASPP, respectively. For individual categories, compared with the corresponding values for DeepLabv3+, the improvement with regard to WT is the highest (15.07% higher), followed by rangeland (RL, 6.45% higher) and UL (4.49% higher) for the MSCNet. Compared with the values for MSFE + DenseASPP, the improvement with regard to barren land (BL) is the highest (7.23% higher), followed by WT (6.99% higher) for the MSCNet. However, the value for the AL category slightly decreases, by approximately 1%, compared with the other 2 algorithms. Table 4 presents the results of the F1 score. The F1 score is defined as the harmonic mean of precision and recall. The formula for calculating the F1 score is as follows:

F1 = 2 × Precision × Recall / (Precision + Recall).

The F1 score for the MSCNet is 85.30%, which is 3.14% higher than that of DeepLabv3+ and 2.11% higher than that of MSFE + DenseASPP. For individual categories, the values for the MSCNet with regard to UL, AL, RL, FL, WT, and BL are 2.67%, 2.29%, 2.74%, 2.43%, 7.68%, and 1.05% higher than the corresponding values for DeepLabv3+, and 1.34%, 1.52%, 1.86%, 2.22%, 3.44%, and 2.27% higher, respectively, than the corresponding values for MSFE + DenseASPP.
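The precision and F1 formulas above translate directly to code; these are standard definitions, not code from the paper.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): the fraction of positive predictions
    that are correct."""
    return tp / (tp + fp)

def f1_score(p, r):
    """F1 = 2 * precision * recall / (precision + recall): the harmonic
    mean of precision and recall."""
    return 2 * p * r / (p + r)
```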
Figure 8 presents a comparison of the experimental results on the Deepglobe dataset. Evidently, the MSCNet exhibits the best result, which is consistent with the segmentation results presented in Tables 1-3. However, certain problems remain, such as partial segmentation errors and blurred boundaries.

Experiment with the Vaihingen datasets
The MSCNet is compared with SegNet, DeepLabv3+, the unified perceptual parsing network (UperNet) (Xiao et al. 2018), and the multifeature attention fusion network (MFNet) (Xiang 2021) on the Vaihingen datasets. The mIoU results are compared for these datasets, as mIoU is one of the most commonly used criteria among the different evaluation metrics of semantic segmentation. Table 5 presents the experimental results with the Vaihingen datasets. The mIoU of the MSCNet is 79.30%, which is 15.3%, 6.4%, 4.42%, and 3.32% higher than those of SegNet, DeepLabv3+, UperNet, and MFNet, respectively. For each category, the IoU improvement occurs

Experiment with the Potsdam datasets
On the Potsdam datasets, the MSCNet is compared with an FCN, Res-U-Net, SegNet, and DeepLabv3+. As with the Vaihingen datasets, mIoU is chosen as the evaluation index. Table 6 presents the experimental results with the Potsdam datasets. According to the experimental results, the mIoU of the MSCNet is higher than those of the FCN, Res-U-Net, SegNet, and DeepLabv3+ by 11.18%, 5.89%, 4.78%, and 3.03%, respectively. In terms of categories, the IoUs of buildings and cars for the MSCNet achieve the highest values among the 5 algorithms. Figure 10 presents the MSCNet results. Evidently, the segmentation result is generally good, but some segmentation errors are associated with the boundaries.

Discussion
The experiments on the 3 datasets show that the MSCNet has a higher accuracy than the other networks. In the ablation and comparative experiments on the Deepglobe datasets, after gradually introducing the different modules, the mIoU, precision, and F1 score all improve to varying degrees. On the Vaihingen datasets, except for the slightly reduced IoU of the car class, all other categories show varying degrees of improvement; perhaps because of the small extent of the car class in remote sensing images, there are some omissions and false detections. On the Potsdam datasets, most categories improve, but the low vegetation class somewhat declines. This may be due to similarities between this class and other classes (such as the tree class) in remote sensing images, resulting in some false positives. These false and missed detections indicate that our algorithm still has some shortcomings, which we need to address through in-depth improvements in future research.

Conclusion
For remote sensing images, classical semantic segmentation networks cannot deal appropriately with situations where some types of predictions are best handled at a lower resolution and other objects are best handled at a higher resolution. This can be solved using a multiscale method. Thus, we propose the MSCNet for the semantic segmentation of high-resolution remote sensing images. The MSCNet is suitable for remote sensing land-cover classification, and it addresses the characteristics of remote sensing images with complex backgrounds and varying object sizes. The experimental results indicate that the MSCNet can effectively improve the accuracy of remote sensing land-cover classification. However, although the segmentation results of the proposed network are generally good, certain segmentation errors are associated with the boundaries. Further research is required to address these errors.
Here, x denotes the pixel position in 2 dimensions, a_k(x) represents the value of the kth channel corresponding to pixel x in the last output layer of the network, and p_k(x) represents the probability of pixel x belonging to class k. The loss function uses the weighted cross-entropy loss given below:

E = − Σ_x w(x) log(p_l(x)),

where w(x) is the class weight of the true label of pixel x.

Table 1 .
Results of the ablation experiment.

Table 5 .
Experimental results with the Vaihingen datasets.

Table 6 .
Experimental results with the Potsdam datasets.