MSNet: multispectral semantic segmentation network for remote sensing images

ABSTRACT In research on the automatic interpretation of remote sensing images, semantic segmentation based on deep convolutional neural networks has developed rapidly, with steady improvements in segmentation accuracy and model generalization. However, most network designs target only the three visible RGB bands of remote sensing images so that mature natural-image semantic segmentation networks and pre-trained models can be borrowed directly, which wastes the spectral information carried by invisible bands such as the near-infrared (NIR). Combining the advantages of multispectral data in distinguishing typical features such as water and vegetation, we propose a novel deep neural network structure called the multispectral semantic segmentation network (MSNet) for semantic segmentation of multi-class feature scenes. The multispectral remote sensing image bands are split into two groups, visible and invisible; ResNet-50 extracts features from both groups in the encoding stage; cascaded upsampling recovers feature map resolution in the decoding stage; and a feature pyramid structure fuses the multi-scale image features and spectral features from the upsampling process layer by layer to obtain the final segmentation results. Training and validation on two publicly available datasets show that MSNet has competitive performance. The code is available at https://github.com/taochx/MSNet.


Introduction
Deep learning algorithms learn representative and discriminative features hierarchically from data and have been introduced into the remote sensing field, where they have developed rapidly. Information such as the spectra and textures of remote sensing images is fed into convolutional neural network models as low-level features for pixel-based semantic segmentation to derive feature classification information. Compared with traditional optical remote sensing image segmentation, deep feature-based methods use neural networks to implicitly establish pixel-to-semantic mapping relationships (Schuegraf and Bittner 2019): the network automatically learns to extract the target features of remote sensing images and completes the whole segmentation process without manual feature engineering, while improving both the accuracy of the results and the generalization ability of the model.
Unlike natural images, remote sensing images exhibit highly complex backgrounds and shadows, large scale variation, widely dispersed targets, and steadily increasing spatial resolution, all of which make segmentation tasks more difficult. Efficient and intelligent remote sensing image segmentation through deep feature mining has therefore become a research hotspot, but deep learning algorithms currently cannot be applied directly to many tasks such as remote sensing image interpretation or data processing, the most notable reason being the large number of bands in remote sensing images.
Multispectral remote sensing images contain at least four bands. The conventional approach reflects the spectral characteristics of different ground objects through combinations of, or calculations between, the visible and invisible bands, which are then used for classification and interpretation. However, most deep learning-based processing methods follow the design of natural-image semantic segmentation, operating mainly on the three visible RGB bands, so that classical segmentation networks for digital pictures, such as DeepLab or UNet (Ronneberger, Fischer, and Brox 2015), can be applied directly with pre-trained models such as ResNet (He et al. 2016) as the backbone. This network structure is not suitable for multispectral remote sensing images containing nonconventional data such as NIR and digital surface models (DSM) (Yuan, Shi, and Gu 2021). The commonly adopted solutions are to increase the number of channels in the network input layer or to use image fusion techniques (Carvalho et al. 2021). Although expanding the number of input channels is workable, it changes the structure of the pre-trained model and invalidates its weights, and training on the multispectral data brings no significant improvement over the RGB results. Data fusion approaches, by contrast, have gradually matured: a dual-branch network structure integrates image and spectral information to improve segmentation accuracy, so image feature extraction can fully exploit the structure and weights of pre-trained models while still extracting the spectral and geometric features of remote sensing images (Ha et al. 2017; Hazirbas et al. 2017; Huang et al. 2020; Song, Lichtenberg, and Xiao 2015; Sun, Zuo, and Liu 2019; Yuan, Shi, and Gu 2021).
We propose a novel deep neural network structure called the multispectral semantic segmentation network (MSNet) for extracting image features and spectral features from the full-band data of multispectral remote sensing images. The network is a two-branch parallel structure that splits the multispectral bands into visible and invisible light groups and uses ResNet-50 in both encoding branches to simultaneously extract image features from the visible bands and spectral features from the invisible bands. In the decoding stage, cascaded upsampling recovers the feature map resolution, and a feature pyramid structure fuses the multi-scale image features and spectral features layer by layer to derive the semantic segmentation results. The main contributions of this paper are: (1) We design MSNet, a semantic segmentation network for multispectral remote sensing images that effectively fuses image features with spectral features; experiments on two public datasets, WHU GID and ISPRS Potsdam, verify that MSNet has competitive performance. (2) We validate the active role of the spectral feature information provided by invisible light bands and spectral indices, such as NIR, DSM, NDVI, and NDWI, in the semantic segmentation of remote sensing images. (3) We propose a method for dynamically synthesizing multichannel remote sensing image datasets: spectral indices derived from band calculations augment the spectral feature information of the data, with NDVI and NDWI generated on the fly while loading the original multispectral dataset consisting of RGB and NIR, which is thereby dynamically synthesized into a six-channel dataset.
In this data pattern, MSNet has excellent segmentation accuracy, which further proves the advantages of spectral features in the semantic segmentation task of remote sensing images.

Semantic segmentation of remote sensing images
With the increasing spatial resolution of optical remote sensing images, the phenomena of "same spectrum, different substances" and "different spectra, same substance" appear more frequently, and traditional segmentation methods based on shallow and mid-level features have limited room for improvement and face new challenges. With the development of artificial intelligence and deep learning, segmentation methods based on deep features have gradually migrated to optical remote sensing images and been adapted to the characteristics that distinguish them from digital pictures, such as multiple spectral bands, high ground-object complexity, high inter-class similarity, and intra-class dissimilarity (Panboonyuen et al. 2019). The prominent representative is the convolutional neural network (CNN)-based segmentation model, which realizes image segmentation by learning the features of the optical remote sensing image layer by layer while training a neural network classifier to classify the image at the pixel level (Waldner and Diakogiannis 2020). The fully convolutional network (FCN) (Long, Shelhamer, and Darrell 2015) semantic segmentation model, presented in 2015, was epoch-making for image segmentation, realizing pixel-level semantic segmentation. It replaces the fully connected layers used for classification in the CNN structure with convolutional layers and upsamples the resulting feature heat maps to the size of the original input image through deconvolution; combined with information from the intermediate pooling layers, it generates the predicted segmentation map. Building on the FCN pattern of downsampling followed by upsampling, the deep convolutional encoder-decoder structure SegNet (Badrinarayanan, Kendall, and Cipolla 2017) is used for recognizing and segmenting urban targets such as streets and vehicles. Furthermore, the UNet (Ronneberger, Fischer, and Brox 2015) structure is widely used in remote sensing image segmentation tasks because of its more accurate performance when training data are scarce. DeconvNet (Noh, Hong, and Han 2015) adopts a similar encoder-decoder idea to upsample images to their original size, using deconvolution layers to densify the sparse upsampled feature maps instead of pooling operations. Dilated convolution is also commonly used to ease the tension between feature map size and receptive field size: convolving the input with a dilated kernel yields feature maps at different scales without shrinking the receptive field, thus capturing more contextual information. Typical representatives of the segmentation structures built on this idea are DeepLab and PSPNet (Zhao et al. 2017). In 2017, PSPNet not only applied dilated convolution to ResNet but also added a pyramid pooling module for better multi-scale context aggregation and global information acquisition to support scene parsing of remote sensing images.
A segmentation method based on graph theory and selective search (Guo et al. 2018) is used to augment the training data, and a conditional random field (CRF) refines the boundaries of the segmentation results (Guo et al. 2017). Based on DeepLabV3+, an improved atrous spatial pyramid pooling (ASPP), fully connected fusion paths, and pre-trained encoders have been used to segment high-resolution remote sensing images (Chen et al. 2019). For building extraction from high-resolution remote sensing images, the UNet structure has been adopted with a channel attention mechanism and an adversarial network integrated into it (Pan et al. 2019b). A generative adversarial network (GAN) has been used for feature classification of high-resolution remote sensing images, demonstrating the performance of GANs for semantic segmentation of remote sensing images (Benjdira et al. 2019).
The performance of most early neural networks depends heavily on the number of training samples: they are strongly supervised semantic segmentation methods, yet labeling training samples is costly in both time and money, and manual labeling is subjective and uncertain. When original samples are insufficient, as with hyperspectral remote sensing data, the performance of strongly supervised segmentation is limited. An effective remedy is therefore to fuse the unique spectral feature information of remote sensing images and to fully exploit and deeply mine the various feature types of the limited samples. Meanwhile, more and more researchers are focusing on weakly supervised and unsupervised semantic segmentation of remote sensing images: instead of costly pixel-level annotation, models are trained with easily obtained auxiliary annotations such as boxes, lines, and points in the image, or with semantic annotation data generated automatically by computer; however, most current results still fall short of strongly supervised methods. In general, improved strongly supervised segmentation methods remain the mainstream, and weakly supervised and unsupervised techniques still require substantial research breakthroughs.

Data fusion semantic segmentation
A common approach to improving deep convolutional neural networks is to fine-tune a network pre-trained on a huge RGB scene dataset using a transfer learning strategy, but this scheme is not applicable to multispectral and hyperspectral remote sensing images. A deep fully convolutional neural network (DFCNN)-based method uses residual correction to fuse data from heterogeneous sensors (optical and laser) for semantic segmentation of remote sensing images, laying the foundation for related research (Audebert, Le Saux, and Lefèvre 2017). The idea of integrating geometric and spectral information to improve segmentation accuracy is increasingly popular, given the availability of drastically different, geo-registered data in the remote sensing field (Yuan, Shi, and Gu 2021). Early multi-source data fusion with deep learning was designed for RGB-Depth datasets (Song, Lichtenberg, and Xiao 2015). Although there are significant differences between RGB-D images and multispectral remote sensing images (Lin et al. 2020), these practical methods have substantially helped subsequent multispectral image processing. An end-to-end deep convolutional network was proposed to extract the boundary information of objects in remote sensing images; the extracted boundary features were then fed into an FCN together with color and DSM (digital surface model) information for training, achieving good segmentation results (Marmanis et al. 2018). DeepQuantized-Net, a deep superpixel-wise convolutional neural network, improves object patterns by using superpixels instead of pixels, reduces computational cost, and achieves accurate segmentation on the IND dataset (a new RGB-Depth dataset for building roof extraction) (Khoshboresh-Masouleh and Shah-hosseini 2021). FuseNet (Hazirbas et al. 2017) is a CNN with two encoders that extracts features from RGB and depth images simultaneously. A new deconvolutional network processes RGB and depth images separately in both the encoding and decoding stages, enhancing shared information by borrowing features from each other to achieve competitive segmentation accuracy. MFNet (Ha et al. 2017), a convolutional neural network architecture based on an RGB-Thermal dataset, combines RGB and thermal infrared images for semantic segmentation of self-driving street scenes, with segmentation accuracy significantly improved by the added thermal information. The RIT-18 dataset (Kemker, Salvaggio, and Kanan 2018), synthesized from real multispectral images, demonstrates that synthetic multispectral imagery can assist in training an end-to-end semantic segmentation framework when labeled image data are scarce. CoinNet (Pan et al. 2019a) takes full advantage of the initial parameters in the first convolutional layer of a pre-trained network, and comparative experiments on multispectral datasets demonstrate the effectiveness of the proposed improvements.
TU-Net and TDeepLab (Iwashita et al. 2019) combine visible and thermal images, and the implicitly trained network is robust to illumination changes. To achieve robust and accurate semantic segmentation for self-driving cars, RTFNet (Sun, Zuo, and Liu 2019) fuses RGB and thermal information in a new deep neural network, using ResNet for feature extraction and a newly developed decoder to recover feature map resolution, with excellent results. In a follow-up study, the authors evaluated six visual feature modalities on existing state-of-the-art (SOTA) single-modal and data fusion semantic segmentation CNNs and proposed a dynamic fusion module (DFM) that can be easily deployed in existing data fusion networks to fuse different types of visual features effectively and efficiently; the resulting DFM-RTFNet achieved competitive performance on road benchmark tests.
At present, research on data fusion semantic segmentation is still concentrated in the field of digital pictures, where the "Y"-shaped network structure, simultaneously encoding RGB and thermal infrared images and then merging and decoding the feature maps, has been established and achieves significant accuracy gains from feature-level data fusion. In the remote sensing field, multi-source data fusion at the pixel level is mainly based on optical imagery combined with DSM, LiDAR, and SAR. However, mining the spectral feature information of remote sensing images and fusing the visible bands, invisible bands, and even spectral index information at the feature level also holds research value and potential.

Methods
The multispectral semantic segmentation network for remote sensing images designed in this paper aims to utilize all the band information of remote sensing images. Through band splitting and recombination, independent downsampling of visible light and invisible light bands is performed, and the image features of objects in the visible light band and the spectral features of the invisible light band are simultaneously extracted and merged. The structure of this dual-branch synchronous encoding is to receive different types of data simultaneously, and to directly invoke the pretrained model for deep-level feature extraction. In the decoding process, the feature map is fused layer by layer to achieve high-precision semantic segmentation results of remote sensing images.

The overall architecture
We present a new deep neural network structure called MSNet for semantic segmentation of multi-class feature scenes in multispectral remote sensing images. Figure 1 shows the overall structure of MSNet, which consists of two parts: 1) band splitting and simultaneous feature extraction, which splits the multispectral bands into visible and invisible light groups, both encoded with ResNet-50 for feature extraction; 2) feature fusion decoding, which adopts cascaded upsampling to recover feature map resolution and uses a feature pyramid structure to fuse the multi-scale image features and spectral features layer by layer during upsampling, finally obtaining the semantic segmentation results.

Bands splitting and simultaneous feature extraction
We split the bands of multispectral remote sensing images into two groups, visible and invisible light, and then carry out grouped synchronous feature extraction. The visible band part is the RGB of conventional digital pictures, mainly used for extracting color, texture, shape, and spatial relationship features. The invisible light band part contains two data types: the NIR band of the original multispectral remote sensing image, and the spectral indices (NDVI, NDWI, etc.) synthesized during data loading, which are used to extract spectral features. MSNet therefore has two data patterns: the multispectral pattern (4-channel, 4c) and the dynamically synthesized multichannel pattern (6-channel, 6c); Figure 1 shows MSNet in the 6c data pattern. This dual-branch synchronous coding structure is the conventional scheme in current multi-source data fusion segmentation models: it receives different types of data simultaneously and can directly call the pre-trained model to extract deep-level features. It also draws on the fundamental hypothesis behind Inception, that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly (Chollet 2017). ResNet-50 is used in the grouped synchronous coding stage for feature extraction, balancing deep feature extraction against model training efficiency; it is worth noting that the number of input channels must be initialized according to the number of bands in the data pattern. Remote sensing images have lower resolution than digital pictures, and pooling layers cause information loss, ignore the correlation between the whole and its parts, and further reduce resolution. Some models remove the pooling layer in their design to address these problems, but doing so significantly increases the computational cost of the model.

Image feature and spectral feature fusion decoding
We present MSNet for simultaneous encoding and decoding of multispectral remote sensing images through band grouping, similar to siamese neural networks (Iwashita et al. 2019), which, however, usually merge only the results of the last stage of each subnetwork decoder. We borrow from other successful network designs: U2-Net uses a saliency fusion module to generate saliency probability maps at each decoder level (Qin et al. 2020), and UPerNet fuses features of multiple semantic levels using feature pyramids (Xiao et al. 2018). Figure 2 shows the fusion decoding process: the results of each layer of the grouped decoders are retained, and a feature pyramid structure fuses the multi-scale image features and spectral features from the upsampling process layer by layer, finally obtaining the semantic segmentation results.
The left and right decoders represent the upsampling processes of the RGB group and the NIR, NDVI, NDWI synthesized group, respectively. Both adopt skip connections implemented by channel splicing, which combine the deep, semantic, coarse feature maps of the decoder with the shallow, low-level, fine feature maps of the encoder. The middle part is a feature pyramid structure, which also uses channel splicing to fuse the multi-level feature results of the two band-group decoders step by step while keeping the upsampling synchronized. This approach is effective for remote sensing image tasks that must account for scene, texture, and material, as well as annotation mixing and the lack of significant depth-of-field contrast. We use the pixel shuffle method (PixelShuffle) (Shi et al. 2016) in the decoding stage of the model; originally proposed for image super-resolution, it upsamples low-resolution feature maps into high-resolution images through convolution and multi-channel rearrangement.
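One decoder step combining these two ingredients, channel-splicing fusion and PixelShuffle upsampling, might look like the following sketch. The channel sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """PixelShuffle upsampling (Shi et al. 2016): a conv expands the
    channels by r^2, then the channels are rearranged into a feature
    map r times larger in each spatial dimension."""
    def __init__(self, in_ch: int, out_ch: int, r: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))

up = UpsampleBlock(in_ch=512, out_ch=256)
deep = torch.randn(1, 512, 16, 16)   # coarse, semantic decoder feature map
skip = torch.randn(1, 256, 32, 32)   # fine, low-level encoder feature map

# Skip connection by channel splicing: concatenate along the channel axis.
fused = torch.cat([up(deep), skip], dim=1)   # shape (1, 512, 32, 32)
```

Unlike transposed convolution, PixelShuffle performs all its learning in ordinary convolutions at low resolution, which avoids checkerboard artifacts and keeps the upsampling cheap.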

Multispectral remote sensing image datasets
To verify that our proposed MSNet has competitive performance and model generalization ability in the semantic segmentation of multispectral remote sensing images, two public multispectral datasets, WHU GID (Tong et al. 2020) and ISPRS Potsdam ("2D Semantic Labeling - Potsdam," n.d.), are selected, and three images from each dataset are reserved as test data. Table 1 shows the details of the datasets and preprocessing.
The Geospatial Data Abstraction Library (GDAL) ("GDAL - GDAL documentation," n.d.) is used to load the multispectral remote sensing image data. According to the channel requirements of the different network models, band splitting, band-order adjustment, spectral index calculation, multi-channel data synthesis, and other operations are performed to load the remote sensing image datasets dynamically. All of this processing happens in memory, so preprocessing only requires image slicing, and no additional dataset organization is needed for the experimental requirements of a specific network model; this avoids large amounts of redundant data occupying storage space and improves the utilization efficiency of the remote sensing datasets. We divide the training and validation of the network models according to the following dataset loading patterns: (1) 3c data pattern: applicable to digital picture semantic segmentation networks such as UNet; the complete 4-band multispectral images are loaded, but only the visible RGB bands are read.
(2) 4c data pattern: applicable to visible and invisible light fused semantic segmentation networks such as RTFNet, loading and reading complete 4-band multispectral remote sensing images, with the bands split into RGB + NIR. It should be noted that the original band order of the GID dataset is NIR-R-G-B, and the band order is adjusted to R-G-B-NIR in order to maintain the consistency of the model calculation.
(3) 6c data pattern: In order to dig deeper into the spectral feature information of remote sensing images, when loading and reading 4-band multispectral remote sensing images, band operations are performed to generate spectral indices: normalized difference vegetation index (NDVI) (Rouse et al. 1974) and normalized difference water index (NDWI) (Gao 1996), which are synthesized into the 6-channel dataset.
NDVI and NDWI were selected for data synthesis because, whether for the two public datasets used in this paper or for other optical remote sensing images, these two spectral indices are widely applicable and can be obtained directly by band calculation, and their combination of reflectance across multiple wavelength ranges enhances the characteristics of vegetation and water.
It should be noted that there is no water in the Potsdam dataset, so NDWI is not calculated; however, the dataset contains DSM, which can serve as a supplementary spectral channel. The DSM is a ground elevation model that includes the heights of surface buildings, bridges, trees, etc.; based on the digital elevation model (DEM), it further covers surface information beyond the bare ground and is the most realistic representation of terrain undulation. After dynamic synthesis, the 6-channel dataset is split into two groups: RGB + NIR-NDVI-NDWI for the GID dataset and RGB + NIR-NDVI-DSM for the Potsdam dataset.
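The dynamic 6-channel synthesis for the GID case can be sketched as below. NDVI is the standard (NIR - R)/(NIR + R). One caveat: the paper cites Gao (1996), whose NDWI uses a SWIR band that a 4-band RGB + NIR image does not have; with only these bands available, the green/NIR formulation (McFeeters) is the natural choice, so its use here is our assumption.

```python
import numpy as np

def synthesize_6c(img4: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """img4: float array of shape (4, H, W), band order R-G-B-NIR.
    Returns a (6, H, W) array: RGB + NIR-NDVI-NDWI."""
    r, g, nir = img4[0], img4[1], img4[3]
    ndvi = (nir - r) / (nir + r + eps)   # vegetation index (Rouse et al. 1974)
    # Assumption: green/NIR NDWI (McFeeters), since no SWIR band exists
    # in 4-band RGB+NIR data; the paper cites Gao (1996).
    ndwi = (g - nir) / (g + nir + eps)
    return np.concatenate([img4[:3], nir[None], ndvi[None], ndwi[None]], axis=0)

img = np.random.rand(4, 256, 256).astype(np.float32)
six = synthesize_6c(img)   # shape (6, 256, 256), indices in [-1, 1]
```

Because the synthesis runs during dataset loading, only the original 4-band slices ever touch disk, matching the in-memory processing scheme described above.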

Implementation details
We implemented MSNet with the PyTorch 1.8.0 framework, and network model training was performed on a GPU server with 2 CPUs (Intel Xeon E5-2640 v4), 128 GB RAM in total, and 4 GPUs (NVIDIA Tesla V100 16 GB), 64 GB VRAM in total. Main parameter settings: the data batch size is set per network to maximize memory utilization; the number of threads is 40; the initial learning rate is 1e-4 with the ReduceLROnPlateau dynamic adjustment strategy; the optimizer is AdamW (Loshchilov and Hutter 2018); the cross-entropy loss function is used for the multi-class labeled datasets; and the number of training epochs is 100. The normalization parameters, the mean and standard deviation of each channel, are pre-calculated from the corresponding multispectral remote sensing datasets rather than taken from the default values used for digital picture segmentation.
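The training configuration above can be expressed compactly as follows. The model here is a trivial stand-in for MSNet, and the scheduler's factor and patience are PyTorch defaults, as the paper does not state them.

```python
import torch
from torch import nn, optim

model = nn.Conv2d(6, 5, kernel_size=1)   # stand-in: 6 bands in, 5 classes out
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=10)  # defaults, assumed
criterion = nn.CrossEntropyLoss()        # multi-class pixel-wise loss

x = torch.randn(2, 6, 64, 64)            # dummy 6c batch
y = torch.randint(0, 5, (2, 64, 64))     # dummy label map
for epoch in range(2):                   # the paper trains for 100 epochs
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())          # plateau scheduler watches the loss
```

In practice the scheduler would be stepped on the validation loss rather than the training loss shown in this toy loop.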

Evaluation metrics
The evaluation work is divided into two parts: segmentation accuracy evaluation and network model efficiency evaluation.
To enable an objective and fair comparison with other proposed network models, six common evaluation criteria are used: Overall Accuracy (OA), Precision, Recall (Davis and Goadrich 2006), F1-Score (F1) (Huang et al. 2015), Intersection over Union (IoU), and Frequency Weighted Intersection over Union (FWIoU). These criteria are calculated from the confusion matrix, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Overall Accuracy measures the global accuracy of the model predictions: OA = (TP + TN) / (TP + TN + FP + FN). Precision measures the proportion of samples predicted positive by the model that are correct, Precision = TP / (TP + FP), and Mean Precision (MP) averages it across all categories. Recall measures the proportion of all positive samples that the model predicts correctly: Recall = TP / (TP + FN).
F1-Score combines Precision and Recall: F1 = 2 × Precision × Recall / (Precision + Recall). Intersection over Union measures the ratio of intersection to union between the predicted result of a category and its ground truth, IoU = TP / (TP + FP + FN), and Mean Intersection over Union (mIoU) averages this ratio over all categories. Frequency Weighted Intersection over Union weights the IoU of each category by that category's pixel frequency and sums the products: FWIoU = Σ_k (N_k / N) × IoU_k, where N_k is the number of pixels of category k and N is the total number of pixels. The network model efficiency evaluation covers the overall training time, saved model size, maximum batch size within the total 64 GB VRAM, number of model parameters, floating point operation speed, inference time, and inference speed.
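All of the accuracy metrics above follow mechanically from a single multi-class confusion matrix; a minimal sketch (our own helper, not the authors' evaluation code):

```python
import numpy as np

def metrics_from_confusion(cm: np.ndarray):
    """cm[i, j] = number of pixels of true class i predicted as class j.
    Returns (OA, mean precision, mean recall, per-class F1, mIoU, FWIoU)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp           # predicted as class k but wrong
    fn = cm.sum(axis=1) - tp           # true class k but missed
    oa = tp.sum() / cm.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    freq = cm.sum(axis=1) / cm.sum()   # class frequency weights
    fwiou = (freq * iou).sum()
    return oa, precision.mean(), recall.mean(), f1, iou.mean(), fwiou

# Toy 2-class example: 50 + 45 correct pixels out of 100.
cm = np.array([[50, 2], [3, 45]])
oa, mp, mr, f1, miou, fwiou = metrics_from_confusion(cm)
```

For the toy matrix, OA is 0.95 and the meadow-style class imbalance effect is visible in how FWIoU down-weights the rarer class relative to mIoU.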

Comparative with the state-of-the-art methods
We use the GID and Potsdam datasets to compare both data patterns of MSNet experimentally with several mature semantic segmentation models: DeepLabV3+, FPN, PSPNet, UNet, and RTFNet. All models uniformly use ResNet-50 as the backbone network, improving feature extraction efficiency while ensuring the fairness of the comparison. Since RTFNet has already been shown to have significant advantages over comparable networks such as ERFNet (Romera et al. 2018), PSPNet (Zhao et al. 2017), SegNet (Badrinarayanan, Kendall, and Cipolla 2017), DUC-HDC, MFNet (Ha et al. 2017), and FuseNet (Hazirbas et al. 2017) on 4-band images, no repeated validation experiments are conducted in this paper. Figures 3 and 4 show the validation results and trends of the overall evaluation index values of each network model during training on the GID and Potsdam datasets, respectively. Tables 2 and 3 give the segmentation accuracy of each network model for all label categories after the final round of training, along with the overall validation results.

Comparison of training stages
In the validation results of the training phase on the GID dataset, except for the meadow category, whose segmentation indices vary greatly, the variation in segmentation accuracy and IoU of the other categories is within 6% across the network models. The meadow variation is caused by the uneven number and scale of feature categories in the GID images: less than 30% of the images contain labels for all feature categories, and the sample size for meadow is insufficient. Combined with the overall evaluation index values, MSNet (6c) has the best overall performance.
In the validation evaluation results of the training phase of each network model in the Potsdam dataset, all categories were better segmented and identified without significant differences. In the overall evaluation and the detail evaluation results for distinguishing the category labels, the results showed a trend of DeepLabV3+(3c) < PSPNet(3c) < FPN(3c) < UNet(3c) < RTFNet(4c) < MSNet(4c) < MSNet(6c).
The validation results on the two datasets (Tables 2 and 3) show that our proposed MSNet achieves the best results in both per-category and overall indices in the 4c and 6c data patterns, with the highest accuracy in the 6c data pattern. These results demonstrate the performance of MSNet and validate the effectiveness of our proposed method for dynamically synthesizing multichannel remote sensing image datasets. Compared with the visible RGB bands, the 4-channel data formed by adding the NIR band increase the spectral characteristics and expand the amount of remote sensing information, and segmentation accuracy improved significantly in the validation results of both RTFNet(4c) and MSNet(4c). The 6-channel data synthesized by band calculation add the spectral indices NDVI and NDWI, which is more consistent with conventional remote sensing data processing and feature segmentation methods. The segmentation accuracy of MSNet(6c) improves further, confirming the positive role of spectral feature information in semantic segmentation of remote sensing images. Figures 3 and 4 show the validation results of the training process of each network model on the GID and Potsdam datasets, respectively, where the evaluation curves of MSNet rise faster and fluctuate less.
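The 6-channel synthesis itself is straightforward. A minimal NumPy sketch using the standard index definitions NDVI = (NIR − R)/(NIR + R) and NDWI = (G − NIR)/(G + NIR) might look as follows; the function name and the epsilon guard are our own illustration, not taken from the paper's released code:

```python
import numpy as np

def synthesize_six_channels(rgb, nir, eps=1e-6):
    """Dynamically synthesize a 6-channel stack (R, G, B, NIR, NDVI, NDWI)
    from a 3-band visible image and a NIR band.

    rgb : float array of shape (H, W, 3), band order R, G, B
    nir : float array of shape (H, W)
    """
    red, green = rgb[..., 0], rgb[..., 1]
    # Standard spectral-index definitions; eps guards against division by zero.
    ndvi = (nir - red) / (nir + red + eps)       # vegetation response
    ndwi = (green - nir) / (green + nir + eps)   # water response
    return np.dstack([rgb, nir, ndvi, ndwi])     # (H, W, 6)

# Toy example: a bright-NIR "vegetation" pixel yields high NDVI, low NDWI.
rgb = np.full((2, 2, 3), 0.2, dtype=np.float32)
nir = np.full((2, 2), 0.8, dtype=np.float32)
stack = synthesize_six_channels(rgb, nir)
print(stack.shape)  # (2, 2, 6)
```

Because the indices are computed on the fly from bands already in memory, the 6c data pattern adds no extra storage to the dataset, only a small per-batch computation.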

Comparison of test stages
To further compare the actual segmentation accuracy of the models, we randomly selected three images from each of the two datasets for testing, generated comparison figures between each model's segmentation results and the ground-truth labels, and built confusion matrices to quantitatively evaluate the segmentation accuracy of each category.
MSNet already leads in the 4c data pattern, and in the 6c data pattern most index values improve slightly further, except for the IoU of meadow and the overall mIoU; the gains are most notable in the segmentation accuracy of vegetation and artificial objects. Figure 5 shows the visualization of the test data in all channels of the GID dataset together with the ground truth. Farmland is segmented more accurately and separated more clearly from roads, and in water segmentation, bridges over rivers are also effectively distinguished. These typical results demonstrate that the 6-channel data synthesized by adding NDVI and NDWI derived from band calculation achieve excellent segmentation results in MSNet. Figure 6 shows the quantitative evaluation of the prediction results generated by each network model on the GID test dataset (Figure 3); the overall quantitative evaluation of each metric in the test phase is consistent with the evaluation results of the training validation phase (Table 2), which further confirms the competitive performance of our proposed MSNet. Figure 6 also includes detailed per-category evaluation comparisons. It is worth noting that meadow and forest data are absent from the first set of statistical plots because these two categories do not appear in the randomly selected test image. Two factors mainly explain the remaining discrepancies. The first is the choice of evaluation indicators: to avoid the limitations of a single-indicator evaluation, six indicators are selected for the overall evaluation and two for the per-category evaluation.
Precision evaluates the exactness of the prediction results, i.e. the proportion of predictions that are correct, and is generally used together with Recall, which evaluates completeness, i.e. the proportion of ground-truth pixels that are recovered. In this paper we selected the more intuitive IoU instead of Recall; the relationship between the two indicators is detailed in Equation 8. A high Precision value combined with a low IoU value indicates that the model predicts a category precisely but incompletely. From Figure 6 we find that the prediction precision of UNet(3c) for water in the first two images is below 20%, significantly lower than that of the other network models, and the prediction of RTFNet(4c) for forest is also below the average of all network models. Neither MSNet model stands out in Precision, but both perform well in IoU, predicting the extracted categories more completely, which Figure 5 also confirms.
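The exactness/completeness contrast can be made concrete with the standard confusion-matrix definitions Precision = TP/(TP+FP) and IoU = TP/(TP+FP+FN). The following sketch, with illustrative numbers of our own, shows how a class can reach perfect Precision while its IoU stays low:

```python
import numpy as np

def per_class_metrics(confusion):
    """Per-class Precision and IoU from a confusion matrix whose rows are
    ground-truth classes and columns are predicted classes."""
    confusion = np.asarray(confusion, dtype=np.float64)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp   # predicted as this class, but wrong
    fn = confusion.sum(axis=1) - tp   # this class's pixels predicted as others
    precision = tp / np.maximum(tp + fp, 1)
    iou = tp / np.maximum(tp + fp + fn, 1)
    return precision, iou

# A class scores high Precision yet low IoU when many of its pixels are
# missed (large FN): predictions are exact but incomplete.
cm = [[10, 90],    # class 0: 10 pixels correct, 90 mislabeled as class 1
      [0, 100]]    # class 1: all 100 pixels correct
p, iou = per_class_metrics(cm)
print(p[0], iou[0])  # 1.0 0.1
```

Here class 0 is never predicted wrongly (Precision = 1.0), but 90% of its pixels are missed, so IoU collapses to 0.1; this is the pattern seen in the GID test evaluation.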
The second is the problem of samples: as with the category number and proportion issues noted in the training validation stage, the GID dataset consists of GF-2 remote sensing images of more than 60 different cities in China, and the category labels are not finely outlined but synthesized from land-cover data, so features and landforms within the same category can differ completely, and the sample proportions within each category label are unbalanced. This imbalance means that most images contain only some of the feature categories. In the training and validation phases, a large number of images were fed to the network continuously as sliced samples, which largely mitigated these problems; in the testing phase, however, only three randomly selected images were used, and with such a small sample size the problems in the dataset are highlighted.
In the Potsdam validation results, MSNet also achieved the best results, with all metric values improving further in the 6c data pattern; since the Potsdam dataset contains no water but does include DSM data, DSM was used in place of NDWI in the 6c data synthesis. Figure 7 shows the visualization of the test data in all channels of the Potsdam dataset and the ground truth, together with the prediction results of each network model. The segmentation of vegetation improves most visibly, while the added DSM data drives the accuracy of house and road identification. Figure 8 shows the quantitative evaluation of the prediction results generated by each network model on the Potsdam test dataset (Figure 4); the overall quantitative evaluation of each metric in the test phase is consistent with that of the training validation phase (Table 3), further confirming the better performance of MSNet. Figure 8 also includes detailed per-category evaluation comparisons. In Precision, which reflects the exactness of the predictions, the network models perform similarly across the test image categories, with MSNet(6c) slightly higher for imp_surf; in IoU, which characterizes completeness, MSNet(6c), which incorporates the NDVI and DSM spectral features, has a clear advantage and achieves the best results in all categories. Meanwhile, the Precision and IoU values of each category are coordinated and stable across network models, without the contrary cases seen in the GID test results, mainly because the Potsdam imagery was acquired in a single city, so features and landforms within the same category do not change significantly.
The relatively balanced sample proportions within each category label also confirm, to some extent, that the GID dataset suffers from problems such as labeling errors.
The validation results of the two datasets can be summarized as follows: 1) our proposed MSNet for semantic segmentation of multispectral remote sensing images has competitive performance; 2) in the NIR band, water has higher spectral absorptivity and vegetation and artificial features have higher spectral reflectance, and these spectral characteristics facilitate the semantic segmentation of remote sensing images; 3) combining the spectral indices NDVI and NDWI, obtained by band calculation, into 6-channel data helps improve the accuracy of semantic segmentation of multispectral remote sensing images, especially the identification of vegetation, water, and artificial objects.

Ablation study
In the ablation study, we mainly verify the effects of the data grouping, encoding, and decoding methods on the segmentation results; the main results on the GID dataset are shown in Table 4. (1) Encoding method: splitting the bands into two groups in the 4c and 6c data patterns reduces training time and improves training efficiency, while increasing the number of filters has little effect on the segmentation results. (2) Decoding structure: the conventional fusion decoding method sums the feature maps of the branch structures directly, whereas MSNet adopts feature fusion in the decoding part, that is, the two groups of encoding results representing image features and spectral features are decoded independently and then fused layer by layer through the feature pyramid structure; Table 4 shows that decoding by feature fusion yields generally better final results. (3) Spectral index information: the ablation results also confirm the influence of spectral index information such as NDVI and NDWI derived from band calculation; the same spectral index information produces completely opposite effects on segmentation accuracy in MSNet_2, which uses the conventional fusion decoding method, and MSNet(6c), which uses the feature fusion decoding method.
In addition, we tried different band combinations in the 4c data pattern: replacing NIR with NDVI or NDWI in the GID dataset, and replacing NIR with DSM in the Potsdam dataset. The highest accuracy among these combinations in the final validation results was achieved by RGB + NIR. A possible reason is that NDVI and NDWI derived directly from band calculation are not post-processed (e.g. by threshold splitting) and contain some incorrect spectral index values. The DSM in the Potsdam dataset can directly identify houses and trees with significant height information, but height differences among other low features are not obvious and are difficult to distinguish.

Evaluation in efficiency
Model efficiency evaluation has received extensive attention in deep learning research, with emphasis on the inference speed of a network model in the prediction phase; however, elements such as the overall time consumption and computational cost during the training phase also deserve attention. Hardware devices, especially GPUs, are necessary for deep learning research, and GPU servers with multiple GPUs for parallel computing, larger numbers of CUDA cores, and larger video memory have become common in the field. The complexity of the model determines the number of samples loaded per batch and the computation time. We also tried more complex backbone networks, such as ResNet-101 and ResNet-152, for deeper feature mining; comparing the training validation results on the GID dataset, the improvement in the evaluation metrics is not obvious (Table 5), mainly due to the limited accuracy of sample labeling and the unbalanced distribution of sample categories, while training time increases significantly, leading to higher computational cost and lower computational efficiency. Blindly designing overly complex models and exchanging high computational cost for small accuracy improvements is not an approach worth advocating.
Table 6 compares MSNet with UNet(3c) and RTFNet(4c) under the same hardware environment and operating parameters while ensuring the highest video memory utilization. Model training and validation time (runtime), model size, maximum batch size, model parameters, floating point operations (FLOPs), and frames per second (FPS) are selected as measurement metrics. The model efficiency of our proposed MSNet is at the average level in both the 4c and 6c data patterns, and combined with the model testing results, it can be confirmed that MSNet delivers favorable segmentation accuracy while balancing computational cost and inference speed.

Discussion
In the testing phase, we selected three sets of images and ground-truth labels from each of the two datasets to visually compare the predicted segmentation results of each network model (Figures 5 and 7). The confusion matrix is again used to calculate each evaluation metric on the predicted segmentation maps, quantitatively evaluating the test results, especially the degree of segmentation in each category, which is an effective way to objectively reflect the true performance of the network models.

Dual-branch model structure
In the field of natural images, pre-trained models such as ResNet help accelerate neural network training and improve training results, mainly because their parameters are trained on massive data such as ImageNet and encode generic features, and most transfer learning based on them has been effective (He, Girshick, and Dollar 2019). However, these parameters are based on three RGB channels and are not directly applicable to multi-band remote sensing images. We conducted comparative experiments for this purpose: Table 7 shows the test results for three images in the GID dataset, where, without changing the original structure of UNet, the number of input channels was increased to load the NIR, NDVI, and NDWI bands, forming UNet(4c) and UNet(6c); the indicators in their test results are significantly lower than those of UNet(3c), which uses only the RGB bands. When the input data have different structures, separate networks are usually used to process each data type and fuse them in the classification stage; the dual-branch model structure has therefore been used in past multi-source heterogeneous data fusion (Yuan, Shi, and Gu 2021) to retain the advantages of the pre-trained model on the RGB bands while using other data as auxiliary features to improve recognition accuracy.
Multi-source data fusion models, represented by RTFNet and others, use a dual-branch structure to group and extract features in the encoding stage. Our network extends the dual-branch structure all the way to the decoding stage: the multi-source data are encoded and decoded simultaneously after grouping, which resembles a siamese neural network structure, but the data fusion part adopts a feature pyramid structure to fuse the multi-scale image features and spectral features layer by layer during upsampling. The results shown jointly in Tables 6 and 7 and Figure 5 demonstrate the validity of MSNet. Segmentation accuracy and inference speed for the various feature classes differ significantly across the semantic segmentation networks under different data pattern combinations, and this difference can provide empirical guidance in scenarios oriented to the interpretation of specified features. 3c data pattern: the mainstream approach in computer vision; the four classical models selected in this paper have simple network structures, train and infer quickly, and perform better in recognizing artificial features such as buildings and roads, but mis-segmentation and omission are more evident in areas of complex ground coverage; for example, in the last image, none of the four models can completely segment the bridge over the water.
4c data pattern: with the addition of the NIR band and a dual-branch network structure, model complexity increases significantly; this approach improves the recognition of vegetation and water bodies, which respond sensitively in the NIR band, and the results of RTFNet(4c) and MSNet(4c) show a significant improvement in the edge segmentation accuracy of vegetation and water. 6c data pattern: while loading the four-band multispectral images composed of RGB and NIR, band calculation is performed simultaneously to generate the spectral indices NDVI and NDWI, which are dynamically synthesized into 6c data. MSNet(6c) adds this spectral-index information to form strengthened supervised semantic segmentation, which mainly improves the recognition accuracy of features with fragmented and fine distributions and further improves the overall accuracy.
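The layer-by-layer fusion in the decoder can be illustrated with a toy NumPy sketch. Element-wise addition stands in for the paper's fusion convolutions and nearest-neighbour repetition for learned upsampling, so this shows only the data flow of the feature pyramid, not MSNet's actual layers:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def pyramid_fuse(image_feats, spectral_feats):
    """Fuse two independently decoded branches layer by layer.

    image_feats, spectral_feats : lists of (C, H, W) arrays ordered from the
    coarsest to the finest decoding stage, each level doubling in resolution.
    """
    fused = image_feats[0] + spectral_feats[0]       # coarsest level
    for img, spec in zip(image_feats[1:], spectral_feats[1:]):
        fused = upsample2x(fused) + img + spec       # propagate upward
    return fused

# Three pyramid levels: 4x4 -> 8x8 -> 16x16, 8 channels each.
levels = [(8, 4, 4), (8, 8, 8), (8, 16, 16)]
branch_a = [np.ones(s, dtype=np.float32) for s in levels]  # image branch
branch_b = [np.ones(s, dtype=np.float32) for s in levels]  # spectral branch
out = pyramid_fuse(branch_a, branch_b)
print(out.shape)  # (8, 16, 16)
```

The key point is that both branches contribute at every resolution level, rather than being merged once at the end, so spectral cues influence both coarse semantics and fine boundaries.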

The effect of spectral features
Previous studies have confirmed that fusing RGB and thermal infrared bands in semantic segmentation tasks on natural images yields better results than using RGB 3-band images alone when the task scenes contain classification targets with obvious temperature contrasts (Ha et al. 2017; Hazirbas et al. 2017; Wang et al. 2016). Traditional remote sensing image processing recombines the NIR band with the visible bands to synthesize standard false-color images for visual interpretation. Based on the significant differences in spectral reflectance of features such as vegetation and water in the NIR band, we fuse the RGB and NIR bands, which is essentially a fusion of image features with spectral features, and design a semantic segmentation network model accordingly.
Through radiometric calibration, each band of a remote sensing image is converted into apparent reflectance with actual physical meaning, and analysis based on the full-spectrum characteristics of ground objects is the main method in remote sensing data processing. Vegetation reflectance rises sharply from the red to the near-infrared band, and water absorptivity rises from the green to the near-infrared band. NDVI and NDWI can eliminate most irradiance changes related to radiometric calibration, solar azimuth angle, terrain, cloud shadows, and atmospheric conditions, enhance the response to vegetation and water, and achieve rapid differentiation from other ground objects.
The Potsdam dataset contains a digital surface model (DSM), which provides depth information similar to the depth of field in natural pictures. The experiments in this paper only verify the effectiveness of the proposed MSNet for such special spectral information, and no deeper research was conducted on the DSM. For instance, a DSM can convert two-dimensional remote sensing images into three-dimensional data; unlike data with high spectral dimensionality, this is a data cube with real height differences, and a 3D convolutional neural network could extract richer hierarchical features from it, helping to improve the extraction accuracy of local details (Konapala, Kumar, and Khalique Ahmad 2021; Roy et al. 2020).
Our experiments validate the effectiveness and advantages of semantic segmentation using multispectral remote sensing data after band combination calculation. If multispectral images with more bands, for example short-wave infrared (SWIR) and mid-wave infrared (MIR), are used to generate additional spectral index layers, this will theoretically help improve the model's segmentation performance and also enable weakly supervised or even unsupervised segmentation of conventional features such as impervious surfaces, vegetation, and water (Waser et al. 2021; Xun et al. 2021). However, caution should be exercised regarding the autocorrelation that additional band data may introduce, and the curse of dimensionality familiar from hyperspectral remote sensing images should be avoided.

Conclusion
This paper introduces our proposed multispectral semantic segmentation network, MSNet, for remote sensing images. Experimental validation and testing on two public datasets show that MSNet has competitive performance in prediction accuracy, training time, and inference speed compared with semantic segmentation models of the same type, and the contribution of the spectral feature information provided by NIR, DSM, NDVI, and NDWI to semantic segmentation accuracy, when fused with visible light, is demonstrated. We also propose a method for dynamically synthesizing multichannel remote sensing image datasets, loading multispectral image data while simultaneously performing band calculation to generate the spectral indices NDVI and NDWI. The experimental results of MSNet show that overall segmentation accuracy improves, with the extraction of vegetation and artificial objects benefiting most, further proving the advantages of spectral features in the semantic segmentation of remote sensing images.
The original intention of MSNet is to make full use of all the band data of multispectral remote sensing images. We increase the spectral feature information by obtaining spectral indices through band calculation and dynamically synthesizing multichannel datasets, and the excellent results achieved prove that adding spectral information helps improve the semantic segmentation accuracy of multispectral remote sensing images. At this stage, intelligent interpretation of remote sensing images based on deep learning still widely adopts mature neural network models designed for natural pictures, and deep learning networks dedicated to the remote sensing field are urgently needed. In subsequent work, we will focus on semantic segmentation that fuses the spectral information of remote sensing images, designing end-to-end neural network structures and integrating spectral information and geoscientific knowledge into the network framework, which will help solve the difficult problem of classifying natural geographical elements.