Histogram matching-based semantic segmentation model for crop classification with Sentinel-2 satellite imagery

ABSTRACT Accurate and near-real-time crop mapping from satellite imagery is crucial for agricultural monitoring. However, the seasonal nature of crops makes it challenging to rely on traditional machine learning methods and previous samples generated within specific domains. In this study, we improved the histogram matching method for color correction of multi-temporal images and tested the performance and prediction classification accuracy of three semantic segmentation models based on weak samples. Classification experiments were conducted for nine categories in two cities in Henan province from 2019 to 2022 using 10 m resolution Sentinel-2 images with different feature selection schemes. We trained the models using classified and recorrected results in four selected sites in 2019 and 2020, and designed experiments to assess the performance of the improved histogram matching method and verify the transferability of semantic segmentation models across regions and years. The experimental results showed that the UNet++ model with feature selection and improved histogram matching methods outperformed other models, such as DeepLab V3+ and UNet, in crop classification transfer cases, with better model performance and higher classification accuracy. The UNet++ model without training samples achieved optimal overall accuracy, Kappa coefficient, and mean F1-score values from 2019 to 2022, exceeding 87%, 82%, and 65%, respectively. Moreover, the representative error of weak samples and prediction classification results were analyzed to improve the model robustness. As an application of transfer-learning in crop mapping, the proposed model effectively addressed the classification problem of multispectral satellite imagery with missing labels.


Introduction
Agricultural practices and food security are among the main issues of global concern (e.g. Food and Agriculture Organization). Because crops are seasonal, ground data on crops are highly time-sensitive. Obtaining the spatiotemporal distribution of crops based on a few available ground samples is of great significance for agricultural monitoring and yield estimation (Wang et al. 2022a). Remote sensing (RS) satellites have provided multi-sensor and multiscale data sources for target recognition and land-use classification. Combined with appropriate classifiers, these data can serve near-real-time or long time-series vegetation monitoring tasks. Zhang et al. (2022) detected forest damage based on Sentinel-2 images and the UNet++ network to formulate proactive prevention and control strategies. Wang et al. (2022b) used the Gaofen and Sentinel-2 multispectral satellite images to classify crops in complex agricultural areas, providing valuable data for agricultural management. Song et al. (2021) evaluated the applicability of Landsat, Sentinel, and the Moderate Resolution Imaging Spectroradiometer in national-scale crop mapping. In particular, the Sentinel-2 satellites developed by the European Space Agency provide optical images with 13 bands and a fused spatial resolution of 10 m for monitoring agricultural practices, especially crop classification (Martinez et al. 2021; Wang et al. 2022a).
However, most classification methods need the support of massive ground surveys and samples, while obtaining these data sets under smallholder patterns requires more time and labor (Xu et al. 2022). Kaiser et al. (2017) highlighted the limitations of deep learning models, noting their heavy dependence on large labeled data sets that are resource-intensive and time-consuming to acquire. At the same time, deep learning networks performed rather robustly against noise in the training labels, which opened up the intriguing possibility of obtaining samples from existing maps or supervised classification results. Labeled data sets of this kind, which are not completely reliable and may even contain misclassified pixels, are commonly referred to as weak samples (Hao et al. 2020; Kaiser et al. 2017; Paris and Bruzzone 2021). It is therefore essential to establish and evaluate a deep learning model that can automatically extract detailed crop information from images using a few weak samples (Diakogiannis et al. 2020; Khan, Fraz, and Shahzad 2021; Li et al. 2020). Automating this process can provide near-real-time crop spatial distribution products and enhance the application efficiency of digital agriculture.
Crop classification from RS imagery is challenging, particularly in different topographic regions with widespread and scattered smallholders in China. Furthermore, modern technologies such as soil improvement and interplanting make agricultural activities highly dynamic, resulting in higher requirements for satellite sensors (Martinez et al. 2021). Several studies have demonstrated that an appropriate classifier and feature selection of satellite images are the critical factors affecting the classification results (Wang et al. 2022a; Martinez et al. 2021; Xu et al. 2022). Furthermore, the three red-edge bands and a narrow near-infrared band of Sentinel-2 data are sensitive to crop classification in complex agricultural areas (Khan, Fraz, and Shahzad 2021; Waldner and Diakogiannis 2020; Wang et al. 2022a; Yuan, Shi, and Gu 2021). Compared with Landsat and MODIS satellite data, Song et al. (2021) recognized the red-edge and short-wave infrared bands in Sentinel-2 imagery as more valuable for corn and soybean classification. Wang et al. (2022) employed time-series Sentinel-2 images to map corn, soybeans, and others with an overall accuracy exceeding 85%. Such outcomes demonstrated the significance of the red-edge bands and vegetation indices in classification tasks. Wang et al. (2021) applied multi-temporal Sentinel-2 images to classify corn, cotton, soybean, winter wheat, and other categories, and highlighted the impact of crop growth period differences on feature selection. Collectively, these studies furnish valuable insights for the successful application of Sentinel-2 images in agricultural monitoring and crop classification.
In data preprocessing, the feature selection strategy plays an essential role in improving classification accuracy from single or multi-temporal RS images. However, differences in phenological periods or imaging conditions lead to feature differences between multi-temporal or simultaneous images. Color equalization algorithms should adjust such gray-level or color differences so that the images stay within a relatively consistent brightness and color range in the application scene. As concluded by Cao et al. (2015), multispectral image matching methods can be classified into two categories: approaches involving filtering (such as wavelet transformation and Fourier transformation) and approaches involving statistics (such as moment matching and histogram matching). The histogram matching (HM) method is more popular because of its ability to enhance the contrast of images and correct for nonlinear response detectors (Cao et al. 2015). In terms of multi-crop classification, it is necessary to further test the performance of the HM approach in processing multiple features (including spectral, vegetation indices, and texture features) and its impact on the classifier model.
Given the intraclass and interclass diversity of crops across seasons, the quality of the labeled sample data set is significantly influenced by mixed pixels or small parcels, and samples collected in past seasons become ineffective (Paris and Bruzzone 2021; Zhu et al. 2022). The classification accuracy of RS images using machine learning algorithms or deep learning networks is affected by the quality of training samples. Although several countries have published agricultural data, such as the Cropland Data Layer and the Crop Inventory, such data are not readily available in China due to the smallholder planting mode (Wang et al. 2022b). Kaiser et al. (2017) used OpenStreetMap weak samples to extract buildings and roads and pointed out that high-precision models can be trained using weak samples with some errors. Hao et al. (2020) selected the Cropland Data Layer and Sentinel-2 images as a training data set to identify crops in three test regions, with an overall classification accuracy higher than 86%. Paris and Bruzzone (2019) used weakly labeled samples to train a convolutional neural network (CNN) and demonstrated its effectiveness in land-cover classification. The above studies have proved the applicability of weakly labeled samples in RS classification without considering the feature differences between multi-temporal images. Similarly, the crop classification results obtained using traditional pixel-based machine learning algorithms (such as support vector machine and random forest) can be regarded as weak samples after processing the "salt and pepper phenomenon." Combining feature selection and HM methods, it is necessary to further test and analyze the performance of such non-100% accuracy samples in different semantic segmentation networks.
Deep learning has been widely applied in RS image recognition and classification thanks to the effectiveness of deep feature extraction and model generalization of complex data (Perantoni and Bruzzone 2022; Wang et al. 2022b). Semantic segmentation serves as a critical pixel-level classification method in deep learning networks and has been employed to address RS problems of multi-scale and diverse applications (Waldner and Diakogiannis 2020; Wang et al. 2021, 2022a). The advantage of semantic segmentation-based methods mainly depends on the powerful encoder to extract complex features from a few samples (Wang et al. 2022b). Among many semantic segmentation networks, the encoder-decoder architecture is the most commonly used one, mainly because of its powerful multi-level learning capacity, especially in intelligent interpretation of RS images.
Crop mapping is a pixel-wise classification task using RS imagery to generate thematic maps that depict the spatiotemporal distribution of identifiable agricultural planting types (Li et al. 2022). Wang et al. (2022a) classified seven summer crops based on Sentinel-2A data sets with 14 feature bands and found that the UNet++ model exhibited the optimal performance, with an overall accuracy above 91%, while ignoring the feature differences of images in multiple phenological periods. Martinez et al. (2021) tested a hybrid deep learning architecture upon time series SAR datasets and indicated its advantages in N-to-N crop classification in tropical sites. However, due to climatic factors such as continuous precipitation, obtaining time series optical satellite images during short crop growth periods is often difficult. Adrian et al. (2021) fused Sentinel SAR and optical images and obtained higher training performance of the UNet model, with an overall accuracy of 0.992, indicating that the semantic segmentation model can improve crop classification accuracy. In addition, the attention mechanism is often introduced to improve the efficiency and quality of deep learning networks due to its effective use of spatial and semantic information of RS images. Several studies have proved that attention mechanisms, such as self-attention and the Spatial and Channel Squeeze & Excitation Block (scSE), can effectively segment land cover, crop, road, and water information on medium- and high-resolution images (Ienco et al. 2019; Martinez et al. 2021; Qi et al. 2020; Xu et al. 2020; Zhang et al. 2022). Spectral, spatial, and textural features provide more information for crop classification in complex agricultural areas. The semantic segmentation models have shown stronger generalization ability in the intelligent classification of multi-feature RS images. These studies provide a reference for the experiments of weak samples in different deep learning networks. Especially for semantic segmentation models, the generalization ability and applicability remain very challenging problems. In most cases, preprocessing steps such as feature selection and image matching based on weak samples should be considered before inputting data to and testing such networks.
This study aims to assess the performance of feature-selection schemes with improved HM methods in different semantic segmentation networks with attention mechanisms based on weak samples. This assessment included (1) comparing the preprocessing performance of the improved HM method in multi-temporal Sentinel-2 images; (2) evaluating the training performance of different schemes in semantic segmentation models based on weak samples; (3) selecting the optimal training model to evaluate the prediction classification accuracy and model generalization ability. Overall, the proposed method combines ideas distilled from image consistency processing and semantic segmentation to demonstrate its applicability.

Study sites
This study focused on the main crop classification in the central part of Henan province, China (Figure 1). The area belongs to the north warm temperate continental monsoon climate zone with an annual precipitation of about 700 mm and an annual average temperature of ~14°C. The annual precipitation mainly occurs from July to September, and the simultaneous heat and precipitation promote crop growth while limiting the acquisition of multi-temporal optical images. The summer crops in this area are diverse, and their growing season is ~110 days from early June to mid-September. As shown in Figure 1

Sentinel-2 data
Due to the low vegetation cover in the early stage of crops such as soybean and tobacco, we collected Sentinel-2 images in the late growth stage from late July to early September each year to reduce the impact of soil (Table 1). The Google Earth Engine (GEE) platform can conveniently obtain high-quality and cloudless images and their vegetation indices and texture features (Adrian, Sagan, and Maimaitijiang 2021). The GEE provides calibrated and corrected Sentinel Level-2A bottom-of-atmosphere reflectance data. The sustained precipitation in August made it difficult to obtain cloudless and high-quality images during this period. As listed in Table 2, the two feature schemes include 10 and 18 features, respectively. The coastal aerosol, water vapor, and cirrus bands with 60-m resolution were removed, and the resolution of the preprocessed image was 10 m.
Currently, machine learning methods are commonly used for feature importance evaluation and selection. Wang et al. (2022b) employed the XGBoost algorithm to obtain 14 features from 38 feature sets to classify crops in complex agricultural areas, which represents a feature importance evaluation method based on the Gradient Boosting Decision Tree. Additionally, the random forest and feature sensitivity analysis were adopted to select feature subsets (Izquierdo-Verdiguier and Zurita-Milla 2020; Wang et al. 2021). Based on prior knowledge and the role of different vegetation indices in classification, we selected 8 features derived from Sentinel-2 images.
Given the varying recognition abilities of feature variables on categories, selecting suitable features through analysis of the training dataset can enhance the explanatory power of the classifier. As listed in Table 2, scheme 1 consists of only the 10 spectral features of the Sentinel-2 images, whereas scheme 2 augments scheme 1 with an additional 5 vegetation indices and 3 texture features. The importance of feature combinations for crop classification in complex agricultural planting areas is tested by comparing model performance and classification accuracy. In scheme 2, five vegetation indices (detailed in the supplementary material of Chaves et al. (2020)) were selected: the normalized difference vegetation index (NDVI), used to distinguish vegetation from non-vegetation; the modified normalized difference water index (MNDWI), used to recognize water; the normalized difference built-up index (NDBI), used to identify buildings; the modified soil adjusted vegetation index (MSAVI), used to reduce soil influence; and the NDVI red-edge (NDVIRE), used to reflect crop growth. The near-infrared band-based texture features were derived from the gray level co-occurrence matrix in GEE, and the sum average (SAVG), correlation (CORR), and dissimilarity (DISS) were selected in scheme 2 (Wang et al. 2022a, 2022b).
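As an illustration of how the five scheme-2 indices can be derived from reflectance bands, the following is a minimal sketch using their commonly cited definitions. The exact formulations used in this study are those in the supplementary material of Chaves et al. (2020) and may differ; in particular, the choice of red-edge band for NDVIRE and the function and argument names are assumptions made here for illustration.

```python
import numpy as np


def normalized_difference(a, b, eps=1e-10):
    """Return (a - b) / (a + b); eps guards against division by zero."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return (a - b) / (a + b + eps)


def vegetation_indices(green, red, red_edge, nir, swir1):
    """Compute five indices from reflectance arrays scaled to [0, 1].

    Assumption: `red_edge` is a single red-edge band chosen by the user;
    the study does not state which red-edge band enters NDVIRE.
    """
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    # MSAVI in its closed form (no explicit soil-adjustment factor).
    msavi = (2 * nir + 1 - np.sqrt((2 * nir + 1) ** 2 - 8 * (nir - red))) / 2
    return {
        "NDVI": normalized_difference(nir, red),
        "MNDWI": normalized_difference(green, swir1),
        "NDBI": normalized_difference(swir1, nir),
        "MSAVI": msavi,
        "NDVIRE": normalized_difference(nir, red_edge),
    }
```

Stacking these five arrays and the three texture features onto the 10 spectral bands would give the 18-feature input of scheme 2.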

Sample data
We used a handheld Global Positioning System receiver, the Jisibao UG905, to collect ground samples in mid-July every year. Four sample plots were investigated yearly with a total area of 4.11 km², each covering an area of about 1 km². Figure 2(a-d) shows the sample plots in different years. In addition to the sample plots, we obtained the location and type information of typical crops along the way. These samples were used to establish an interpretation mark library for manual sample selection. In this study, we classified crops and land use into 9 categories: corn, soybean, peanut, tobacco, non-cultivated land (including harvested and abandoned cropland, NCL), other crops (such as pepper, rehmannia, and vegetables), forest land (including nursery stock planted on cultivated land), urban (including greenhouses, roads, and buildings), and water.
Based on the ground samples and manually selected sample points, the classification results of four sites (Figure 1) in 2019 and 2020 were produced using the random forest method in GEE. However, the "salt and pepper" problem in the classification results made it impossible to input them to the model directly as label data. Therefore, recorrection processing methods, including format conversion, single-pixel elimination, and manual editing, were used to improve the accuracy of the weak labels. This process included (1) converting raster classification results (in tif format) to vector data (in shp format); (2) merging each selected single pixel with the neighboring unselected polygon with the longest shared boundary; (3) manually editing the significant misclassifications; (4) selecting random points to verify the results. These methods were provided by ArcMap, developed by Environmental Systems Research Institute, Inc, located in Redlands, California, USA. We selected 20 random crop sample points for each site every year, and found a total of 13 misclassifications, such as soybean and other crops, greenhouses and non-cultivated land, and peanut and sweet potato. Finally, this kind of non-100% accuracy data shown in Figure 2(e-h) was converted to TIF format with the same coordinate system and resolution as the corresponding image, and used as weak labels for model training and performance evaluation.
In addition to the four sample plots (S1-S4 in Figure 1), the study area was divided into 5 × 5 km regular grids. We selected a crop point for each grid and assigned the attribute values to evaluate the prediction classification accuracy and spatiotemporal generalization capability of different semantic segmentation models without training samples. The distribution of validation sample points is shown in Figure 1. Here, water samples were not included in the validation data set because water was relatively easy to extract from images, and this study mainly focused on crop classification in complex agricultural areas.

Model data set processing
The initial step is to select the reference image and use histogram matching to process the Sentinel-2 images. Then, these matched images were scaled to 8 bit, and the bands were stretched to satisfy the input requirements of semantic segmentation networks. The data stretching method used in this study is linear stretching, which is provided by ENVI software. Finally, the preprocessed images with a resolution of 10 m and weak labels of four sites in 2019 and 2020 were clipped to a patch size of 256 × 256 pixels using a regular-grid algorithm with zero repetition rate. The remaining matched images in the study area were clipped in the same way and used for model prediction classification. When the remaining image edge was narrower than 256 pixels, the patch was filled with the background value to complete the clipping process. It is worth noting that the clipped images maintained the same spatial position and coordinate system as the original images so that the predicted classification results could be mosaicked quickly during postprocessing. A total of 624 pairs of training data sets were obtained in this study. The pixel proportions of corn, peanuts, soybean, tobacco, non-cultivated land, other crops, forest land, urban, and water were 40.95%, 5.20%, 7.89%, 1.18%, 0.68%, 8.03%, 13.36%, 21.81%, and 0.90%, respectively. Due to the lower proportion of tobacco and non-cultivated land, it is necessary to deal with the class imbalance to reduce its impact on the semantic segmentation models (Wang et al. 2022b). Although the proportion of water pixels is also low, its broad spatial distribution and distinct spectral features meant that upsampling could be avoided.
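The regular-grid clipping with zero repetition rate and background filling at the image edges can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (a band-first array layout and a background value of 0), not the implementation used in the study; the function name and the returned offsets are hypothetical.

```python
import numpy as np


def tile_with_padding(image, patch=256, fill=0):
    """Split a (bands, height, width) array into non-overlapping tiles.

    Edges narrower than `patch` are padded with `fill` (assumed background
    value). Each tile is returned with its top-left pixel offset so that
    predictions can be mosaicked back into place afterwards.
    """
    _, height, width = image.shape
    pad_h = (-height) % patch  # rows needed to reach a multiple of patch
    pad_w = (-width) % patch   # columns needed to reach a multiple of patch
    padded = np.pad(
        image,
        ((0, 0), (0, pad_h), (0, pad_w)),
        mode="constant",
        constant_values=fill,
    )
    tiles = []
    for top in range(0, padded.shape[1], patch):
        for left in range(0, padded.shape[2], patch):
            tiles.append((top, left, padded[:, top:top + patch, left:left + patch]))
    return tiles
```

Keeping the pixel offsets alongside each tile is one simple way to preserve the spatial position of the patches; with georeferenced rasters the same role would be played by the geotransform of each clipped file.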

Improved histogram matching method
Histogram matching (HM) can be used as a lightweight normalization for image processing, such as feature matching, especially for images from different regions or phenological periods (Horn and Woodham 1979). The traditional HM algorithm matches the cumulative probability distribution function of the target detector to that of the reference detector (Cao et al. 2015). HM manipulates the pixels of a target image so that its histogram matches the histogram of the reference image. Unlike grayscale or RGB images, RS images usually have more channels and different ranges of feature values for each channel. Therefore, HM is done independently for each channel (band), as long as the number of channels (bands) is equal in the target image and the reference. The steps of the improved HM algorithm proposed in this paper were as follows:
(1) Even after radiometric calibration and atmospheric correction, satellite images still have noise or feature outliers. In particular, a small number of high reflectivity values are often caused by bright artificial objects or clouds. Therefore, for the histogram of band $i$ ($B_i$), we continuously traverse the number of pixels for each reflectance value. If, starting from a certain reflectivity value ($r_{max}$), the number of pixels is lower than the given threshold ($T_g = 20$) for 10 consecutive values, then the reflectivity values of the pixels greater than $r_{max}$ are recorrected to $r_{max}$.
(2) For the input image, it has a probability density function (Equation (1)).
$$p(r_k) = \frac{n_k}{N} \tag{1}$$
where $n_k$ is the frequency of the reflectivity value $r_k$ in $B_i$, and $N$ is the total number of pixels. $p(r_k)$ is the probability of $r_k$, and its value range is [0, 1].
(3) For the reference image, the probability density function of $r_k$ in $B_i$ can easily be mapped to its cumulative distribution function by Equation (2).
$$S(r_k) = \sum_{j=0}^{k} p(r_j) = \sum_{j=0}^{k} \frac{n_j}{N}, \quad k = 0, 1, \ldots, L-1 \tag{2}$$
where $S(r_k)$ is the cumulative HM distribution of $r_k$ in $B_i$. $L$ is the total number of gray levels within the range of [0, $r_j$].
(4) Similarly, for the input image, the cumulative probability distribution of $z_k$ in $B_i$ is determined using Equation (3).
$$G(z_k) = \sum_{j=0}^{k} p(z_j) \tag{3}$$
where $G(z_k)$ is the cumulative HM distribution of $z_k$ in $B_i$.
(5) Finally, using the equation $S(r_k) = G(z_k)$, the desired output image, whose histogram is similar to the reference histogram, can be obtained.
In this paper, performance metrics including histogram, mean, variance, and mean gradient are used to compare the image quality before and after HM, following existing works (Cao et al. 2015; Jayasankari and Domnic 2020).
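The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. Several details are assumptions made here: reflectance is stored as non-negative integers (for example, scaled reflectance), the traversal in step (1) starts at the most frequent value so that an empty low tail does not trigger the clipping, and the equality of the two cumulative distributions in step (5) is solved by linear interpolation. The function names are hypothetical.

```python
import numpy as np


def clip_outliers(band, t_g=20, run=10):
    """Step (1): clip the sparse high tail of one integer-valued band.

    Scans upward from the histogram mode (an assumption) and finds the
    first value r_max from which `run` consecutive values each have fewer
    than `t_g` pixels; values above r_max are set to r_max.
    """
    values = np.asarray(band).astype(np.int64).ravel()
    counts = np.bincount(values)  # counts[v] = number of pixels equal to v
    sparse_run = 0
    for value in range(int(np.argmax(counts)), counts.size):
        sparse_run = sparse_run + 1 if counts[value] < t_g else 0
        if sparse_run == run:
            r_max = value - run + 1  # first value of the sparse run
            return np.minimum(band, r_max), r_max
    return np.array(band, copy=True), int(values.max())


def match_band(target, reference):
    """Steps (2)-(5): map target values so that the two CDFs agree."""
    t_vals, t_idx, t_counts = np.unique(
        target.ravel(), return_inverse=True, return_counts=True
    )
    r_vals, r_counts = np.unique(reference.ravel(), return_counts=True)
    t_cdf = np.cumsum(t_counts) / target.size      # cumulative distribution of target
    r_cdf = np.cumsum(r_counts) / reference.size   # cumulative distribution of reference
    # For each target value, pick the reference value with the same CDF level.
    mapped = np.interp(t_cdf, r_cdf, r_vals)
    return mapped[t_idx].reshape(target.shape)


def improved_hm(target, reference):
    """Apply outlier clipping and matching band by band on (bands, H, W) arrays."""
    if target.shape[0] != reference.shape[0]:
        raise ValueError("target and reference must have the same number of bands")
    matched = []
    for t_band, r_band in zip(target, reference):
        t_clipped, _ = clip_outliers(t_band)
        r_clipped, _ = clip_outliers(r_band)
        matched.append(match_band(t_clipped, r_clipped))
    return np.stack(matched)
```

Because each band is processed independently, the same routine applies whether the input holds the 10 spectral bands of scheme 1 or the 18 features of scheme 2, provided each feature is first scaled to non-negative integers.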

Semantic segmentation models based on FCN
The fully convolutional network (FCN), employing an encoder-decoder structure, learns features from multi-channel RS images and accommodates input images of arbitrary size (Zhu et al. 2017). Several FCN-based semantic segmentation models have been constructed to achieve pixel-level image classification, such as UNet (Ronneberger, Fischer, and Brox 2015), DeepLab V3+ (Chen et al. 2018), and SegNet (Badrinarayanan, Handa, and Cipolla 2015). As shown in Figure 3, skip connections were introduced to fuse information from different depths and recover the fine-grained spatial information lost during downsampling (Zhu et al. 2017). Upsampling and the softmax in the last layer of the decoder were used to recover the reduced-size features produced by downsampling to the original image size.

Semantic segmentation networks selection
In this study, we conducted model training and performance evaluation using the preprocessed dataset with the DeepLab V3+, UNet, and UNet++ networks. Figure 4(a) illustrates the UNet (Ronneberger, Fischer, and Brox 2015) network, which consists of downsampling and upsampling layers. The UNet++ (Figure 4(b)), as modified by Zhou et al. (2018), incorporated dense skip connections to extract multi-scale feature maps from multi-level convolution pathways. The DeepLab V3+ (Chen et al. 2018) (Figure 4(c)) improved the segmentation performance by using the Atrous Spatial Pyramid Pooling (ASPP) to resample features at different scales. Specifically, we appended a combined attention module in the decoder part to obtain fine-grained semantic segmentation improvement. The Spatial and Channel Squeeze & Excitation Block (scSE, Figure 5), as described in detail by Roy et al. (2018), enables the learning of more meaningful feature maps that are relevant spatially and channel-wise. Additionally, a backbone pretrained on the ImageNet dataset and named RegNetY_320 was adopted in the networks to extract the multi-level feature maps.

Model parameters and training environment
To test the performance of preprocessed images and weak labels, Table 3 shows the parameters common to all the semantic segmentation networks. We selected stochastic gradient descent (SGD) as the optimizer to adjust the parameters after each sample and accelerate convergence. In the scheduler, T_0 is the number of epochs for the first restart, and T_mult is the value controlling the speed of the learning rate. For the multiple classes, we adopted a joint loss function combining the label smoothing cross-entropy loss (LSCE loss) and the dice coefficient loss (DC loss), following Wang et al. (2022b). The smoothing factor was set to 0.1 in the LSCE loss. The joint loss and mean Intersection over Union (mIoU) values were used to evaluate model performance, and training was terminated if neither value improved for 10 epochs. Because the number of samples was small, all 624 pairs were used for both model training and validation. In particular, random linear stretching of 0.5%, 1%, or 2% was applied to reprocess the input images to evaluate the validation accuracy of the model during the training process. Wang et al. (2022a) pointed out that upsampling of small samples in imbalanced classes can improve the model performance and classification accuracy. Hence, three-fold upsampling was used to increase the sample number in the model. In this study, the upsampling method involves generating an augmented dataset by copying the specified categories in the training dataset multiple times before model training. The model training is subsequently conducted using this augmented dataset. All experiments were performed on an Ubuntu workstation with a single NVIDIA Tesla T4 graphics card (16 GB RAM), using the PyTorch and GDAL packages based on the Python platform.
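To make the joint loss concrete, the following NumPy sketch combines a label smoothing cross-entropy term with a dice coefficient term for a single patch. It is an illustration only: the text does not state how the two terms are weighted, so an unweighted sum is assumed here, and the smoothing convention (spreading the smoothing factor uniformly over all classes) is likewise an assumption. Function names are hypothetical, and a training implementation would use the PyTorch equivalents so that gradients are tracked.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def joint_loss(logits, labels, smooth=0.1, eps=1e-7):
    """Return (total, lsce, dice) for logits (C, H, W) and integer labels (H, W).

    Assumption: total = lsce + dice with equal weights.
    """
    n_classes = logits.shape[0]
    probs = softmax(logits, axis=0)
    one_hot = np.eye(n_classes)[labels].transpose(2, 0, 1)  # (C, H, W)

    # Label smoothing: true class gets 1 - smooth, plus smooth / C for every class.
    target = one_hot * (1.0 - smooth) + smooth / n_classes
    lsce = -(target * np.log(probs + eps)).sum(axis=0).mean()

    # Dice loss averaged over classes, computed on soft probabilities.
    intersection = (probs * one_hot).sum(axis=(1, 2))
    denominator = probs.sum(axis=(1, 2)) + one_hot.sum(axis=(1, 2))
    dice = 1.0 - ((2.0 * intersection + eps) / (denominator + eps)).mean()

    return lsce + dice, lsce, dice
```

With a smoothing factor of 0.1, the cross-entropy term stays above zero even for a perfect prediction, which is one reason the loss values reported for weak labels should not be expected to approach zero.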

Evaluation indicators
The evaluation indicators of prediction classification included the overall accuracy (OA), mean F1 score (F1-score), and Cohen's kappa coefficient (Kappa) (Diakogiannis et al. 2020; Wang et al. 2022a; Xu et al. 2020). The user's accuracy (UA) and producer's accuracy (PA) were used to evaluate the accuracy of crops based on the optimal semantic segmentation model. Finally, the landscape indicators, including patch density (PD) and edge density (ED), were used to evaluate the patch fragmentation of the optimal classification results. The greater the patch density, the more substantial the overall heterogeneity and the higher the fragmentation, while the greater the edge density, the more complex the edge shape of the crop and land use classification (Su et al. 2022).
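For reference, the accuracy indicators can all be derived from a single confusion matrix. The sketch below uses the standard definitions and assumes that rows hold the reference (validation) classes and columns hold the predicted classes, and that every class appears at least once in both; the function name is hypothetical.

```python
import numpy as np


def accuracy_metrics(confusion):
    """Compute OA, Kappa, PA, UA and mean F1 from a square confusion matrix.

    Assumed layout: confusion[i, j] = number of validation samples of
    reference class i that were predicted as class j.
    """
    cm = np.asarray(confusion, dtype=np.float64)
    total = cm.sum()
    diagonal = np.diag(cm)
    reference_totals = cm.sum(axis=1)
    predicted_totals = cm.sum(axis=0)

    oa = diagonal.sum() / total
    # Agreement expected by chance, used by Cohen's kappa.
    expected = (reference_totals * predicted_totals).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)

    pa = diagonal / reference_totals   # producer's accuracy (recall)
    ua = diagonal / predicted_totals   # user's accuracy (precision)
    f1 = 2.0 * pa * ua / (pa + ua)

    return {"OA": oa, "Kappa": kappa, "PA": pa, "UA": ua, "F1": f1.mean()}
```

For a two-class matrix [[50, 10], [5, 35]], this gives an OA of 0.85 and a Kappa of about 0.69, which illustrates how Kappa discounts the agreement expected by chance.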

Images and histogram based on improved HM method
The reference, target, and images with improved Consequently, these regions exhibit spectral characteristics of bare land at the false color composite bands, increasing the contrast between vegetation and soil to a certain extent. The improved HM method not only makes the colors of the Sentinel-2 images consistent, but also improves the visual information of the images.
In addition, Figure 7 shows the histogram comparison of 10 spectral bands of the original image and the preprocessed image. Figure 7(a,b) shows that the original histogram results of RE3, NIR, and NNIR exhibit significant differences. The reason for this problem is not only the difference in the original image acquisition date but also the difference in the ground-object types and pixel numbers. The histograms of the reference image and of the image processed with the improved HM are remarkably similar (Figure 7(c,d)). Compared with the original histogram results, the reflectivity curves of B, G, R, RE1, and SWIR2 bands are more concentrated, while those of RE2, RE3, NIR, NNIR, and SWIR2 bands vary with two peaks. Therefore, the improved HM method made the features of multi-temporal images more consistent.

Model performance evaluation
The training dataset of 624 pairs was increased to 1905 pairs with three-fold upsampling of tobacco and non-cultivated land. The epochs, joint loss, and mIoU results of different schemes and semantic segmentation networks based on weak training datasets are shown in Figure 8. The numbers of epochs of all models are very close, within 70-72. For the same feature scheme and network, the training data set processed with the improved HM method obtains lower loss and higher mIoU values. The different networks reached their optimal performance at about the 61st epoch. Across both feature schemes, the curves of DeepLab V3+ are the closest to each other, the curves of UNet differ the most, and the UNet++ model achieved the optimal performance. Affected by the sample accuracy, the loss values of all training models are still high. Improving the accuracy and quality of samples can enhance the model performance (Wang et al. 2022b).
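The copy-based upsampling can be sketched as below. The reading of "three-fold" used here, in which every patch containing a specified rare class receives three additional copies, is an assumption; it is consistent with the reported counts, since 1905 − 624 = 1281 = 3 × 427, which would correspond to 427 patches containing tobacco or non-cultivated land. That patch count is inferred from the arithmetic and is not stated in the text. The function name and class codes are hypothetical.

```python
import numpy as np


def oversample_indices(label_patches, rare_classes, extra_copies=3):
    """Return dataset indices after copying patches that contain rare classes.

    label_patches: sequence of (H, W) integer label arrays.
    rare_classes: class codes to upsample (hypothetical codes in this sketch).
    extra_copies: additional copies per qualifying patch (assumed to be 3).
    """
    indices = list(range(len(label_patches)))
    for i, labels in enumerate(label_patches):
        if np.isin(labels, rare_classes).any():
            indices.extend([i] * extra_copies)
    return indices
```

Returning indices rather than duplicated arrays keeps memory use flat; a data loader can then draw the same patch several times per epoch.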
The total time (TT, min), total epoch (TE), optimal loss value (OL), optimal mIoU value (OM), and average time of each epoch (AT, min) of different schemes and networks are listed in Table 4. For the same network, scheme 2 has a lower loss value and a higher mIoU value than scheme 1. Therefore, feature selection can enhance the training model performance. In addition, the UNet model requires the shortest total and average time to accomplish the training process, while the UNet++ model requires the longest time. The UNet++ model of scheme 2 using the improved HM method obtains the lowest loss value and highest mIoU value, 0.457 and 0.760, respectively, which are 0.009 lower and 0.017 higher than those without the improved HM method. Consequently, the improved HM method not only reduces the feature divergence of Sentinel-2 images but also provides better model performance under the same feature scheme.

Classification results of the optimal semantic segmentation model
Figure 9 shows the classification results and accuracies of crops and other categories from 2019 to 2022 based on the UNet++ model. The OA values based on UNet and DeepLab V3+ models of both schemes with or without the improved HM method are lower than 82.50%. Building on previous work (Adrian, Sagan, and Maimaitijiang 2021; Wang et al. 2022a, 2022b; Yuan, Shi, and Gu 2021), higher classification accuracy can be achieved based on better model performance as listed in Table 4. Without training samples, the UNet++ model of scheme 2 with the improved HM method yields the optimal prediction classification accuracies within a specific and adjacent area. The predicted crop classification time for the entire study area from 2019 to 2022 is 8.62 minutes. The predicted classification accuracies demonstrate the stronger spatiotemporal generalizability of the UNet++ model based on feature selection and the improved HM method. The success of prediction classification based on semantic segmentation models can be attributed to the multi-level feature maps-generating strategy in multi-feature RS imagery (Diakogiannis et al. 2020; Paris and Bruzzone 2019; Zhou et al. 2018).
It can be seen that the results of scheme 1 without the improved HM method in 2022 show significant misclassifications, especially for forest land.The model with improved HM method can obtain higher OA, Kappa, and F1-score values based on the same scheme.Moreover, in addition to the F1-score in 2021, the classification accuracy values of scheme 2 without improved HM method exceeds the values of scheme 1 with improved HM method in 2021 and 2022.For the UNet++ model, the loss value of scheme 2 without the improved HM method is 0.002 higher than that of scheme 1 with the improved HM method, while the mIoU value of scheme 2 without the improved HM method is 0.002 higher than that of scheme 1 with improved HM method.As a result, the prediction classification accuracy of models with similar performance has advantages and disadvantages.However, according to the F1-score values from 2019 to 2022, scheme 2 without the improved HM method achieves better classification results than scheme 1 with the improved HM method.These results indicate that the vegetation indices and texture features can effectively improve the model performance and enhance the prediction classification accuracy.In addition, the OA, Kappa, and F1score values of scheme 2 with improved HM method based on the UNet++ model are higher than 87%, 82%, and 65%, respectively.Consequently, the proposed method in this study effectively improves the model performance and prediction classification accuracy.
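The accuracy indicators compared above can all be derived from a single confusion matrix. The following is a minimal sketch under the standard definitions of overall accuracy, Cohen's Kappa, producer's accuracy (PA), user's accuracy (UA), and per-class F1-score; the function name `accuracy_report`, the row/column convention (rows as reference, columns as prediction), and the toy counts are assumptions for illustration, not values from this study.

```python
def accuracy_report(confusion):
    """OA, Kappa, per-class PA/UA/F1, and mean F1 from a confusion matrix.

    confusion[i][j] counts validation pixels of reference class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    ref_totals = [sum(row) for row in confusion]                               # row sums
    pred_totals = [sum(confusion[r][c] for r in range(n)) for c in range(n)]   # column sums
    diag = [confusion[k][k] for k in range(n)]

    oa = sum(diag) / total
    # Chance agreement expected from the marginal totals.
    pe = sum(ref_totals[k] * pred_totals[k] for k in range(n)) / total ** 2
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 0.0

    pa = [diag[k] / ref_totals[k] if ref_totals[k] else 0.0 for k in range(n)]    # recall
    ua = [diag[k] / pred_totals[k] if pred_totals[k] else 0.0 for k in range(n)]  # precision
    f1 = [2 * p * u / (p + u) if (p + u) else 0.0 for p, u in zip(pa, ua)]
    return {"OA": oa, "Kappa": kappa, "PA": pa, "UA": ua,
            "F1": f1, "mean_F1": sum(f1) / n}


# Toy three-class confusion matrix (made-up counts, for illustration only).
report = accuracy_report([[50, 5, 0], [10, 30, 5], [0, 5, 45]])
print({key: report[key] for key in ("OA", "Kappa", "mean_F1")})
```

The mean F1-score weights every class equally, which helps explain why it can remain well below OA when minor categories are poorly classified.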

PA and UA of crops based on the optimal semantic segmentation model
In addition to the OA, Kappa, and F1-score values, the UA and PA indicators were used to further assess the accuracy of different categories. We analyzed the PA, UA, PD, and ED values of the optimal model classification results, i.e. the prediction classification results of scheme 2 with the improved HM method of the UNet++ model. As listed in Table 5, among the crops, the classification precision values of soybean are relatively low.
The values of non-cultivated land, forest land, and urban are lower. On the one hand, these categories have fewer verification sample pixels. On the other hand, field paths are generally about 2 m wide and are difficult to recognize effectively because of the influence of high-coverage vegetation such as corn. However, their distribution was recorded during the ground survey, leading to low PA and UA values.
Similarly, individual trees or tree rows in the field (such as poplar) were not recorded during the ground survey, while the shading of the field by trees made the crops more likely to be mistaken for forest land. In addition, owing to the image resolution, image pixels in interplanting mode are usually classified as the crop with higher coverage. For example, corn and soybean interplanting will be classified as corn in the late growth stage. Hence, the combination of ground sample results and mixed pixels of satellite imagery leads to the low classification accuracy of minor ground objects.
The PD and ED values of the prediction classification from 2019 to 2022 in Table 5 indicate that the heterogeneity and fragmentation of the classification landscape increased year by year, while the edge shape values decreased in 2022. As shown in Figure 9, except for the classification in 2022, the spatial distribution of soybean and other crops was more concentrated. The higher ED value in 2021 showed the more contiguous agricultural landscape. As listed in Table 5, corn, soybean, other crops, and urban had higher PD and ED values, indicating a more scattered distribution and more complex edge shapes. Owing to their concentrated planting pattern, the classification accuracy of peanuts and tobacco was higher than that of soybean. Therefore, the classification accuracy of minor crops is greatly affected by their spatial distribution.
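As a rough illustration of how PD and ED relate to fragmentation, the sketch below computes landscape-level patch density and edge density on a small classified grid. It assumes FRAGSTATS-style definitions (patches per 100 ha, edge metres per hectare), 4-neighbour connectivity, and a 10 m pixel; these choices, the function name `patch_and_edge_density`, and the toy grid are assumptions and may differ from the exact settings used for Table 5.

```python
from collections import deque


def patch_and_edge_density(grid, pixel_size_m=10.0):
    """Return (patch density per 100 ha, edge density in m/ha) for a label grid."""
    rows, cols = len(grid), len(grid[0])
    area_ha = rows * cols * pixel_size_m ** 2 / 10_000.0

    # Count patches: connected groups of same-class pixels (4-neighbour rule).
    seen = [[False] * cols for _ in range(rows)]
    patches = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            patches += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == grid[y][x]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))

    # Count interior edges: neighbouring pixel pairs with different classes.
    edges = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and grid[r][c] != grid[r][c + 1]:
                edges += 1
            if r + 1 < rows and grid[r][c] != grid[r + 1][c]:
                edges += 1

    patch_density = patches / area_ha * 100.0
    edge_density = edges * pixel_size_m / area_ha
    return patch_density, edge_density


# Toy 3 x 3 classification result with three classes (illustrative only).
toy_grid = [[1, 1, 2],
            [1, 2, 2],
            [3, 3, 2]]
print(patch_and_edge_density(toy_grid))
```

Under these definitions, a class scattered across many small fields raises both values, which is consistent with the lower accuracy reported for scattered minor crops.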

Contrast profiles of original images and improved HM images
Figures 6 and 7 show the visualization and histogram results of original images and improved HM images in 2020 and 2021. As described by Cao et al. (2015), Cui et al. (2017), and Helmer and Ruefenacht (2006), the mean value reflects the brightness of the image, and the higher the value, the greater the brightness of the image. Variance reflects the dispersion of gray levels of each pixel in the image relative to the average value of gray levels, and is used to evaluate the amount of image information. The mean gradient reflects the ability to express the contrast of minute details in the image, and the larger the value, the higher the image's sharpness. In this paper, two multispectral remote sensing image pairs were selected to test the improved HM performance.
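The three indicators described above can be computed per spectral band. The sketch below assumes a commonly used form of the mean (average) gradient, the mean of sqrt((dx^2 + dy^2) / 2) over forward differences; the exact formulation used in this study is not stated here, so the formula, the function name `contrast_profile`, and the random test band are assumptions.

```python
import numpy as np


def contrast_profile(band):
    """Return (mean, variance, mean gradient) of a single 2-D image band."""
    band = np.asarray(band, dtype=np.float64)
    # Forward differences along columns (dx) and rows (dy), cropped to a common shape.
    dx = band[:-1, 1:] - band[:-1, :-1]
    dy = band[1:, :-1] - band[:-1, :-1]
    mean_gradient = np.sqrt((dx ** 2 + dy ** 2) / 2.0).mean()
    return float(band.mean()), float(band.var()), float(mean_gradient)


# Synthetic band standing in for one Sentinel-2 band (random values, illustration only).
rng = np.random.default_rng(0)
synthetic_band = rng.integers(0, 4000, size=(64, 64))
mean, variance, gradient = contrast_profile(synthetic_band)
print(f"mean={mean:.1f}, variance={variance:.1f}, mean gradient={gradient:.1f}")
```

Applying the same function to a band before and after HM processing gives the kind of paired values plotted in a contrast profile.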
Figure 10 shows the change curves of the mean, variance, and mean gradient of 10 spectral bands before and after improved HM processing. Compared with the contrast profiles of the original images, the metric values of four spectral bands (RE2, RE3, NIR, and NNIR) improved significantly, with little change in the other spectral bands. In particular, the mean and variance of SWIR2 in Figure 10(b) even decreased. Several existing studies have shown that the red-edge, narrow near-infrared, and near-infrared spectral bands of Sentinel-2 images play a more critical role in complex crop classification (Chaves, Picoli, and Sanches 2020; Portales-Julia et al. 2021; Song et al. 2021). Combining the performance and the prediction classification accuracy, the UNet++ model based on feature selection and the improved HM method can be applied to complex agricultural areas with weak samples.
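For context, the classic single-band histogram matching that the improved method builds on can be sketched as follows. This is the standard CDF-based mapping only, not the improved HM method of this study, whose modifications are not reproduced here; the function name `match_histogram` and the small arrays are illustrative assumptions.

```python
import numpy as np


def match_histogram(target, reference):
    """Map target band values so their CDF follows that of the reference band."""
    target = np.asarray(target)
    reference = np.asarray(reference)

    # Unique gray levels, their counts, and the index of each target pixel's level.
    t_values, t_inverse, t_counts = np.unique(
        target.ravel(), return_inverse=True, return_counts=True)
    r_values, r_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical cumulative distribution functions of both bands.
    t_cdf = np.cumsum(t_counts) / target.size
    r_cdf = np.cumsum(r_counts) / reference.size

    # For each target level, pick the reference level with the same CDF value.
    mapped_levels = np.interp(t_cdf, r_cdf, r_values)
    return mapped_levels[t_inverse].reshape(target.shape)


# Tiny example: a dark target band is pulled toward a brighter reference band.
target_band = np.array([[0, 1], [2, 3]])
reference_band = np.array([[10, 20], [30, 40]])
print(match_histogram(target_band, reference_band))
```

For a multispectral image, the function would be applied band by band, with the reference and target images covering the same area and a similar phenological stage.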

Analysis of feature selection and model efficiency
The semantic segmentation networks adopt an encoder-decoder architecture to achieve N-to-N classification tasks in a complex agricultural area. The improved HM method and the UNet++ model with the scSE module demonstrate superiority over the other models (Figures 8 and 9). Training and prediction time are often used to evaluate a network. In addition to the total time and average epoch time in Table 4, Figure 11 shows the total prediction time of the three networks for each scheme with or without the improved HM method. For the same prediction data set, an increased number of satellite image features does not lead to a significant increase in prediction time; some models, such as UNet and UNet++, even require a shorter time. This result differs from the time required for model training. Although the increased features lead to a longer training time, Table 4 shows an acceptable training time in RS based on a few weak samples. Compared with the experimental results of Wang et al. (2022a) and Li et al. (2021), weak samples reduce the labeling workload but lead to a higher loss value of the training model. The higher loss value indicates that the problems of imbalanced classes and low-quality samples need to be further addressed (Yuan, Shi, and Gu 2021). The favorable trade-off between prediction classification accuracy and regional model efficiency achieved in this study brings advantages, such as the potential to use near-real-time satellite images for cropping and land use monitoring.

Representative error analysis of weak samples and classification results
Figure 9 and Table 5 show the classification accuracies of land use and crop types. The factors behind the lower accuracy of soybean, non-cultivated land, forest land, and urban need to be analyzed. As shown in Figure 12(a-d), the prediction classification results in different sample plots from 2019 to 2021 show that the field roads (about 2 m wide) cannot be effectively recognized and extracted from the 10 m-resolution satellite images. The soybeans in Figure 12(b) are misclassified as corn because of the influence of corn plant height (more than 2 m) in the later growth stage (4 September 2020). Therefore, the accuracy of N-to-N crop classification is affected not only by imbalanced classes, model parameters, image resolution, feature selection, and sample quality, but also by factors such as agricultural mode, crop phenology and types, the degree of field fragmentation, and the consistency of image features. Figure 12(e-f) mainly shows the classification results of urban and water categories in other geographical areas. The classification results of urban and water information based on scheme 2 and the optimal semantic segmentation model are also satisfactory. Notably, single houses covering one or a few pixels can be effectively identified, mainly because of their distinctive spectral characteristics and the weaker influence of other ground objects. Similarly, water is also extracted with high classification accuracy and finer edge information. Owing to the more concentrated planting mode and simpler shapes, the classification accuracies of peanuts, tobacco, and water were higher, and their PD and ED values were lower. Consequently, it is necessary to further improve the weak samples and extract finer road information from high-resolution images to reduce the effect of representative error on the semantic segmentation models and classification accuracy in complex agricultural areas.

Assessment of the model-based transferable learning
In this experiment, we evaluated the crop classification results and accuracy of different deep-learning models in the same city and an adjacent city for multiple years without samples. On the one hand, the trained semantic segmentation model can accomplish image-based crop classification without samples, reducing the dependence on training samples in traditional supervised classification methods and improving the generalization ability of crop classification models. On the other hand, the model can quickly and accurately obtain multi-year crop classification results. The predicted classification time (8.62 minutes) and classification accuracy (Figure 9) of Sentinel-2 images from 2019 to 2022 in the entire study area demonstrate advantages in reducing manual interpretation workload. This study mainly constructed a regional crop semantic segmentation model according to Tobler's First Law of Geography. The model showed generalization ability in adjacent regions with similar crop types and phenological periods.
The UNet++ network of scheme 2 with the improved HM method performed best in the entire study area, with OA, Kappa, and F1-score exceeding 87%, 82%, and 65%, respectively. In the domestic transfer cases, this model produced satisfactory prediction results with insignificant gaps in accuracy. However, the performance of the models in larger regions or in other countries with similar crop types and conditions remains to be examined. For transfer across time, we used the weak samples in 2019 and 2020 to train the models and predict the classification results from 2019 to 2022. Specifically, these semantic segmentation models were transferred to more years with different classification accuracy, as shown in Figure 9. The UNet++ model performed well in cross-year transfer, indicating that the improved HM method enhanced the contrast and consistency of multi-temporal images, and thus alleviated the effect of differences in crop phenology. Without training samples, the model's transferability led to satisfactory performance in the study area across regions and years based on the weak samples. In this study, the source and target domains were selected from latitudes in close proximity with similar growth phenology. Despite the advancements made, certain limitations persist in this study, pertaining to the geographic area and target domain categories. These limitations offer promising avenues for future enhancements. The construction of geographical zones also serves to mitigate the accumulation and propagation of representative errors across distinct regional models, especially for minor crops planted regionally.

Conclusions
RS-based accurate crop mapping is critical for monitoring agricultural practices and food production. However, due to the seasonal nature of crops, previous samples and traditional machine learning methods generated within a specific domain often lose their validity across years and regions. In this study, we proposed an improved HM method to alleviate the difference between multi-temporal images and the negative impact of domain shift, thus enhancing the transferability for efficient semantic segmentation of remotely sensed crop and land use images. Specifically, the dataset used for model training involved the multi-feature images and the weak labels extracted from Sentinel-2 imagery. Then, three semantic segmentation networks, DeepLab V3+, UNet, and UNet++, were selected to evaluate the model performance and transferability in adjacent regions without samples. It was found that the improved HM method can enhance the contrast of multi-temporal images and the model performance of different feature selection schemes. Coupling spectral, vegetation indices, and texture features, the UNet++ model based on an improved HM method and weak samples had better performance than the two other models and higher classification accuracy in transfer cases, with OA, Kappa, and F1-score exceeding 87%, 82%, and 65%, respectively.
Moreover, the representative error of weak samples and prediction classification results were analyzed to improve the model's robustness further. The method proposed in this study provides an efficient solution for crop and land use mapping in label-missing regions and years. In future research, we will continue to explore the potential and spatiotemporal generalization ability of semantic segmentation networks for geospatial vision tasks.
Four sites, each covering ~20 × 20 km, in Xuchang city were chosen to generate weak samples in 2019 and 2020 as the training data set. The sample plots and validation sample points were used to assess the prediction classification accuracy of the optimal training model. The prediction classification results of both cities from 2019 to 2022 were used to test the model generalization and transfer learning ability.

Figure 1. Geographic location, terrain, samples, and training sites in the central part of Henan province, China.

Figure 3. Remote sensing image classification process based on encoder-decoder architecture.

Figure 6. Reference image, target images, and images with improved HM method. (a,c) The target image (a) and improved HM-based result (c) on September 4, 2020. (b,d) The target image (b) and improved HM-based result (d) on July 31, 2021. The R/G/B channels of the Sentinel-2 images were NNIR/SWIR1/RE1.
appear visually similar due to the similarity of vegetation characteristics in the same combined bands after HM processing.The emerald-colored areas in the southern region of Figure 6(d) are primarily dominated by soybean cultivation.At this stage, soybeans exhibit limited vegetation coverage and are influenced by the soil spectrum, which closely resembles the spectral characteristics of bare land.

Figure 7. Histogram results of 10 spectral bands of the images shown in Figure 6. Panels (a), (b), (c), and (d) correspond to those in Figure 6(a-d).

Figure 8. Training results of both feature schemes with and without improved HM method. The position of the gray line represents the optimal loss and mIoU values of the training model.

Figure 9. Prediction classification results and evaluation indicators of both feature schemes with and without improved HM method based on the UNet++ model.

Figure 10. Contrast profiles of original images and improved HM images. (a) The original image obtained on September 4, 2020, and its improved HM-based image. (b) The original image obtained on July 31, 2021, and its improved HM-based image. The images correspond to those in Figure 6(a-d).

Figure 11. Total prediction time of three networks for each scheme with or without improved HM method from 2019 to 2022.

Figure 12. Samples and prediction classification results of scheme 2 with improved HM method based on UNet++ model. (a) Sample plot (S1) and classification result in 2019; (b) sample plot (S4) and classification result in 2020; (c) sample plot (S1) and classification result in 2021; (d) sample plot (S2) and classification result in 2022; (e) the crop and urban classification in other subregions in 2019; (f) the crop and river classification in other subregions in 2020.

Table 1. Sentinel-2 images from 2019 to 2022 used in this study. The data follow the U.S. Military Grid Reference System (US-MGRS) to set the satellite orbit number, such as T49SFT.

Table 2. Sentinel-2 schemes with feature selection. For spectral features, the center wavelength is listed in nm.

Table 3. Parameters of the semantic segmentation networks.

Table 4. Parameter results of different schemes and networks. The total time, total epoch, optimal loss value, optimal mIoU value, and average time of each epoch are abbreviated as TT, TE, OL, OM, and AT, respectively, and the unit of TT and AT is min.

Table 5. PA and UA results of scheme 2 with improved HM method based on the UNet++ model.