Enhanced crop classification through integrated optical and SAR data: a deep learning approach for multi-source image fusion

Agricultural crop mapping has advanced over the last decades due to improved approaches and the increased availability of image datasets at various spatial and temporal resolutions. Considering the spatial and temporal dynamics of different crops during a growing season, multi-temporal classification frameworks are well-suited for mapping crops at large scales. Addressing the challenges posed by imbalanced class distribution, our approach combines the strengths of different deep learning models in an ensemble learning framework, enabling more accurate and robust classification by capitalizing on their complementary capabilities. This research aims to enhance the crop classification of maize, soybean, and wheat in Bei’an County, Northeast China, by developing a novel deep learning architecture that combines a three-dimensional convolutional neural network (3D-CNN) with a variant of convolutional recurrent neural networks (ConvRNN). The proposed method integrates multi-temporal Sentinel-1 polarimetric features with Sentinel-2 surface reflectance data for multi-source fusion and achieves an overall accuracy of 91.7%, a Kappa coefficient of 85.7%, and F1 scores of 93.7%, 92.2%, and 90.9% for maize, soybean, and wheat, respectively. Our proposed model is also compared with alternative data augmentation techniques, maintaining the highest mean F1 score (87.7%). The best performer was weakly supervised with ten per cent of ground truth data collected in Bei’an in 2017 and used to produce an annual crop map for measuring the model’s generalizability. The model learning reliability of the proposed method is interpreted through the visualization of model soft outputs and saliency maps.


Introduction
Crop mapping is essential for the assessment of the underlying factors for farming system changes and the management of crops. Northeast China has become one of the main breadbaskets of the country, serving an increasingly important role in agricultural production and international trade of certain crops such as soybeans (Dong et al. 2016; Yang et al. 2019). Targeting the economic sustainability of agricultural development, however, the retrieval of quantitative information from the changes in the local croplands has been limited due to the annual crop rotation practice featured in this region (You et al. 2021). As such, accurate annual crop maps are still in high demand by local authorities in China to build near real-time crop monitoring mechanisms for early yield assessment of major crops at the county-level scale. Many studies have made considerable progress in the development of crop mapping systems by using satellite imagery with moderate spatial resolutions due to their coverage and regular repeat acquisitions (Boryan et al. 2011; Defourny et al. 2019; Inglada et al. 2015). Considering the spectral characteristics observed in commonly used optical satellite sensors such as Landsat, MODIS and Sentinel-2, many studies have investigated and quantified the dynamics (i.e. seasonal changes) of vegetation indices (VIs) and optical bands, using them as distinctive input features to accurately identify crop types throughout the growing seasons (Fan et al. 2014; Song et al. 2017; You and Dong 2020; Zheng et al. 2015; Zhong et al. 2016). In light of existing research in automated crop identification, our study seeks to develop a novel approach for enhancing crop mapping performance leveraging the potential of satellite remote sensing data, which can contribute towards addressing the pressing need for sustainable agricultural development in Northeast China.
Cloud cover and/or adverse weather conditions can limit the quality of optical acquisitions and impact crop monitoring capabilities, resulting in data loss within time series of satellite acquisitions during the growing season (Griffiths, Nendel, and Hostert 2019; Kussul et al. 2018; Sonobe et al. 2014). Synthetic aperture radar (SAR) sensors are active remote sensors that can operate independently of weather conditions or solar illumination. SAR images provide unique radar-related information primarily responding to the biophysical properties of vegetation (e.g. Gao et al. 2018; Qu et al. 2020; Sun et al. 2019). Many studies have demonstrated the feasibility of using radar polarimetric features to detect crop types, generated by specific polarimetric decomposition algorithms that include Pauli, Cloude-Pottier, Freeman-Durden, H/A/α, Huynen, Yamaguchi, Neumann, and Krogager (Gao et al. 2018; He et al. 2020; Liao et al. 2020; Xie et al. 2019). Both optical and radar data have individually demonstrated their capability for crop mapping, and the fusion of image data from both sources is increasingly explored to improve crop mapping performance (e.g. Gao et al. 2018; Li et al. 2022; Liao et al. 2020; Moumni, Lahrouni, and Jung 2021; Sun et al. 2019; Van Tricht et al. 2018). The combination of optical and radar data provides complementary information that can reduce temporal gaps in data capture, which can contribute significantly to identifying crops in cloud-prone regions (Liao et al. 2020; Sun et al. 2019). Similarly, combining multi-sensor data yields richer information on certain crops to overcome the heterogeneity of some areas caused by mixed crops (Moumni, Lahrouni, and Jung 2021). Most previous studies have either stacked optical and radar data at the pixel level for crop classification (e.g. Gao et al. 2018; Liao et al. 2020; Moumni, Lahrouni, and Jung 2021; Van Tricht et al. 2018), or independently trained on the image data from the two sources, e.g. Sentinel-1 and Sentinel-2, using separate models in parallel. Subsequently, the resultant outputs from each model are integrated into one learned feature sequence (Teimouri et al. 2022).
Previous studies employed machine learning models, such as Decision Tree (DT), Support Vector Machine (SVM), and Random Forest (RF), to identify crops based on multi-temporal observations (Bargiel 2017; Gao et al. 2018; Pelletier et al. 2016; Teluguntla et al. 2018; Zhong, Gong, and Biging 2014); however, conventional machine learning models were not originally designed to process temporal data. Additionally, the enhanced representation of crop growth patterns requires phenological metrics defined with expertise in multi-temporal remote sensing data (You and Dong 2020), and those designed metrics are not always available until the end of the crop growth cycle (Xu et al. 2021). Although machine learning approaches improve classification performance with increasing dimensions of input variables and reduce the requirements for designating threshold-based classification rules, the temporal relationship in multi-temporal satellite data cannot be fully and automatically utilized. More recently, studies demonstrated that a series of deep learning networks could successfully explore the sequential relationships within time-series remote sensing data for crop classification (Crisóstomo de Castro Filho et al. 2020; Dou et al. 2021; Liao et al. 2020; Rußwurm and Körner 2020; Sun et al. 2020; Xu et al. 2020; Zhao et al. 2021; Zhong, Hu, and Zhou 2019). These deep neural architectures include one-dimensional Convolutional Neural Networks (1D-CNNs), Long Short-Term Memory (LSTM), and variants or combinations of both architectures. Given that 1D-CNN and LSTM architectures are naturally suited to extracting sequential dependencies within multi-temporal remote sensing data, these models generally outperform non-temporal models such as RF in terms of classification performance for maize and soybean (Xu et al. 2020) and other crops (Liao et al. 2020; Rußwurm and Körner 2020; Zhong, Hu, and Zhou 2019). However, temporal models are not used for the extraction of spatial features from satellite imagery.
The spatial relationship, known as the spatial arrangement of the adjacent pixels represented by the data matrix in remote sensing images, is also a main consideration for crop classification with remote sensing data. Two-dimensional Convolutional Neural Networks (2D-CNNs) are used to extract multi-level spatial features from satellite data for crop classification (He et al. 2020; Kussul et al. 2017; Wei et al. 2019). A patch-based CNN architecture is designed for regional-level classification on medium-resolution satellite imagery by collecting a series of image patches as inputs instead of the pixel-based samples used for machine learning models, 1D-CNNs or LSTM models (Sharma et al. 2017). 2D-CNNs only focus on the spatial dimension of the multidimensional input (the image size and the channel-wise image bands), whereas the temporal dependencies are not considered. Therefore, three-dimensional Convolutional Neural Networks (3D-CNNs) are proposed for the extraction of spatio-temporal features from image data. Fewer studies have applied 3D-CNN-based architectures for crop classification (e.g. Adrian, Sagan, and Maimaitijiang 2021; Ji et al. 2018; Teimouri et al. 2022). Roy et al. (2019) showed that a hybrid 3D-2D CNN had an improved performance over using a standalone 3D-CNN or 2D-CNN. Another approach to obtaining spatio-temporal features is Convolutional Recurrent Neural Networks (ConvRNNs), and the variants represented by different recurrent units have been used to identify a large number of crop classes in a hierarchical framework (Turkoglu et al. 2021). To the best of our knowledge, there is little research regarding the synergistic use of 3D-CNN, 2D-CNN, and ConvRNN architectures for crop classification. Despite the findings in previous studies, annual crop mapping in Northeast China remains challenging due to the high intra-class variance and inter-class similarity of spectral qualities and phenology of crops in the region, which are influenced by varying climate conditions, geomorphic characteristics, and cropping systems (Wang, Azzari, and Lobell 2019). Additionally, regular and cloud-free time series acquisitions are often limited for agricultural monitoring at a large scale (Defourny et al. 2019). As a result, this study utilizes a small number of available optical acquisitions for large-scale crop mapping as supplementary sources for time series SAR data to develop models that enhance crop mapping accuracy. The study aims to develop a novel framework that combines 3D-CNN, 2D-CNN, and ConvRNN architectures for county-level crop mapping based on the fusion of multi-temporal optical and SAR images for Bei'an county in Northeast China in 2017 at a 10 m spatial resolution. This spatio-temporal model contributes to improved performance in identifying crops during the growing season and addressing imbalanced class distribution, which could lead to model bias towards majority classes. The resulting crop maps can be used for dynamic monitoring of interannual crop growth in the same area and provide annual crop inventory information for local authorities to evaluate land-use policies. In this study, the proposed model is assessed for crop mapping and juxtaposed with models presented in previous studies (Ji et al. 2018; Pelletier, Webb, and Petitjean 2019; Roy et al. 2019; Turkoglu et al. 2021). Additionally, the models are examined in relation to data augmentation techniques and evaluated across three randomly selected geographical locations. Subsequently, the optimally chosen model is employed to generate an annual crop map for Bei'an in 2017 through model inference.

Study area
Bei'an is a county located in the northeast part of Heilongjiang province in China (47°35'N ~ 48°33'N, 126°16'E ~ 127°53'E) (Figure 1). According to Bei'an Municipal People's Government (http://www.hljba.gov.cn/), the total area of Bei'an county is approximately 7149 km². Bei'an is subject to a cold and temperate continental monsoon climate. The average annual temperature is around 1.2°C with annual effective accumulated temperature ranging from 18.30°C to 23.50°C. Bei'an receives an average annual precipitation of 529 mm, with the majority of rainfall occurring during the summer months from June to August. The average total amount of annual surface water resources is approximately 1.156 billion cubic metres. Bei'an is geographically located in the transitional zone between Songnen Plain and the Khingan Mountains, which is regarded as one of the world's three Chernozem (black soil) belts. Given the favourable soil fertility, meteorological conditions, and regional temperature, this region serves as an ideal ecological habitat conducive to crop growth and agricultural yield. According to Heihe Social and Economic Statistics Yearbook (2018), the total crop sown area of Bei'an approximates 2190 km². Summer maize and soybean are the primary crop types, accounting for 29.5% and 61.8% of the total sowing area, respectively. In contrast, wheat, as one of the minority crop types in Bei'an, covers 2.9% of the total sown area. According to the local crop sowing scheme, the growing season of maize often spans from late April to late September, and soybeans are normally sown from early May to mid-September. These periods might vary annually due to crop rotation cycles in the study area over the years.

Sentinel-1/2 datasets and pre-processing
In this study, both time-series Sentinel-1B Single Look Complex products (Interferometric Wide swath SLC) and Sentinel-2A/B (Level-1C) image datasets were acquired from the Sentinel Scientific Data Hub (https://scihub.copernicus.eu/dhus/#/home). Considering the local cropping practice in which the majority of crops were planted and harvested from early May to late September 2017, the image acquisitions were collected within this window, corresponding to the vegetative growing cycle of the recorded staple crops in Bei'an. As such, twenty-three Sentinel-1 acquisitions and three Sentinel-2 acquisitions were collected. The selection of Sentinel-2 data was based on the criterion that the average percentage of cloud coverage for the acquisition candidates is less than 8%.
The pre-processing of time series Sentinel-1 images was completed using the Sentinel Application Platform (SNAP) developed by the European Space Agency (ESA). The standard pre-processing steps follow Qu et al. (2020), which typically include radiometric calibration, multi-temporal speckle filtering (Refined Lee) and geocoding. Backscatter values were converted to decibel (dB) scale, and the cross-ratio of the backscatter was calculated by subtracting VV from VH in dB, since a ratio in linear units becomes a difference on the logarithmic scale. Sentinel-1 operates as an inherently dual-polarized SAR platform, which can constrain the extent of polarimetric information that can be explored, compared to quad-polarimetric SAR systems providing fully polarimetric observations. However, quad-polarization satellite acquisitions often suffer from reduced swath coverage, revisit time, and accessibility. As an alternative, a compact polarimetric technique, m-chi decomposition (Raney et al. 2012), has demonstrated its utility for crop mapping using dual-pol data (Sonobe 2019). The m-chi decomposition parameters were also obtained using SNAP. Each type of Sentinel-1 data was resampled to 10 m spatial resolution.
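The dB conversion and cross-ratio described above can be illustrated with a minimal sketch. The function names and the sample backscatter values are hypothetical and are not taken from the SNAP workflow used in the study.

```python
import math


def to_db(sigma0_linear):
    """Convert a linear backscatter power value to decibels."""
    return 10.0 * math.log10(sigma0_linear)


def cross_ratio_db(vh_linear, vv_linear):
    """Cross-ratio VH/VV expressed in dB, computed as VH_dB - VV_dB."""
    return to_db(vh_linear) - to_db(vv_linear)


# Hypothetical linear sigma0 values for one pixel.
vh, vv = 0.01, 0.1
cr = cross_ratio_db(vh, vv)
```

The subtraction in dB is equivalent to taking the ratio in linear units first and converting afterwards, which is the logarithm rule the text refers to.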
The pre-processing of Sentinel-2 images consists of the transformation from top-of-atmosphere (TOA) Sentinel-2 Level-1C reflectance images to bottom-of-atmosphere (BOA) Level-2A using Sen2Cor. Additionally, Band 4 (Red, 10 m), Band 8A (Vegetation Red Edge, 20 m), and Band 11 (SWIR, 20 m) were selected for their sensitivity to differentiate soybean and maize in Northeast China (You et al. 2021). All selected bands were resampled to 10 m and collocated with SAR data in a time-series sequence. Finally, a global min/max normalization approach was applied to all input features using the scikit-learn package to hasten the convergence of deep learning algorithms.
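A minimal sketch of global min/max normalization follows, assuming that "global" means a single minimum and maximum per feature taken over all pixels and all dates. The study used scikit-learn; this plain-Python version and its sample values are only illustrative.

```python
def global_min_max(series):
    """Scale one feature to [0, 1] using a single min and max over all dates.

    series: list of per-date lists of pixel values for ONE input feature.
    """
    flat = [v for date in series for v in date]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant feature carries no information; map it to zeros.
        return [[0.0 for _ in date] for date in series]
    return [[(v - lo) / span for v in date] for date in series]


# Hypothetical VH backscatter (dB) for two dates and two pixels.
vh_db = [[-22.0, -18.0], [-20.0, -12.0]]
scaled = global_min_max(vh_db)
```

Using one range for the whole time series, rather than one range per date, preserves the seasonal change in magnitude that the temporal layers rely on.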

Ground truth and partitioning
Ground surveys of the study area were conducted during July, August, and September 2017 by the Chinese Academy of Agricultural Sciences (CAAS). During the 2017 period, 21257 fields were surveyed to calculate the area of crop parcels, record crop categories and retrieve annual statistics during the agricultural household survey. Cropland parcels with various crop types were manually digitized and labelled based on 5-metre resolution RapidEye satellite imagery (Near-infrared, Red Edge, and Red composite), while Sentinel-2A images (SWIR, Narrow NIR, and Red composite) were used for drawing relatively large cropland parcel areas with uniform crop types. In total, the classes of interest for major crops were assigned unique labels, including maize and soybean. In the in-situ dataset provided by CAAS, a small number of polygons were also identified for wheat and unknown crops. The proportions of ground sample pixels for each class in 2017 and the distribution of crop parcel size are displayed in Figure 2. A cropland mask layer, produced for Bei'an in 2017, is used to exclude non-cropland areas in this study during the model inference stage for the generation of an annual crop map. The cropland distribution and extent barely changed during 2017-2019 due to cropland protection by policies in Northeast China (Liu et al. 2014; Ning et al. 2018).
Since the cropping and managing system for each crop parcel would be different, the pixels within the same crop polygon are strongly correlated and need to be isolated when assigned to training, validation, and testing data sets, i.e. pixels in each set should be mutually exclusive and not from the same crop parcels. Additionally, the class distributions in all sets should be identical (Rußwurm and Körner 2017). In most croplands, pixels in the same parcel are very homogenous and highly correlated. Allocating pixels in a parcel to different sets will violate the principle of independence. The model generalization on truly unseen data would be affected because it is likely that models have seen at least parts of the image patches used for validation (Audebert, Le Saux, and Lefèvre 2019). Although the study area can be split into relatively large sub-regions, the crop types are not usually distributed evenly in the study area, which cannot ensure sub-regions with similar class distributions (Zhong, Hu, and Zhou 2019). In this study, each crop polygon was regarded as an entity. The parcels are further grouped using grids at 10-kilometre intervals so that crop parcels in the same grid are considered as a whole in dataset partitioning. Ten per cent of total ground samples are randomly selected to be used for model training, validation, and testing based on stratified sampling with the ratio of 60%, 20%, and 20%.
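The grid-based grouping can be sketched as below. This is a simplified illustration: the parcel record layout (`id`, `x`, `y` in metres), the seed, and the rounding rule are assumptions, and the class stratification used in the study is not enforced here.

```python
import random
from collections import defaultdict


def grid_id(x_m, y_m, cell_m=10_000):
    """Index of the 10 km grid cell that contains a parcel centroid."""
    return (int(x_m // cell_m), int(y_m // cell_m))


def split_by_grid(parcels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole grid cells, and therefore whole parcels, to one split."""
    groups = defaultdict(list)
    for p in parcels:
        groups[grid_id(p["x"], p["y"])].append(p["id"])
    cells = sorted(groups)
    random.Random(seed).shuffle(cells)
    n_train = round(fractions[0] * len(cells))
    n_val = round(fractions[1] * len(cells))
    parts = {
        "train": cells[:n_train],
        "val": cells[n_train:n_train + n_val],
        "test": cells[n_train + n_val:],
    }
    return {name: [pid for c in cs for pid in groups[c]]
            for name, cs in parts.items()}


# Hypothetical parcels: 20 centroids spread over 10 grid cells.
parcels = [{"id": i, "x": (i % 10) * 10_000 + 500.0, "y": 500.0}
           for i in range(20)]
splits = split_by_grid(parcels)
```

Because the unit of assignment is the grid cell, no parcel can contribute pixels to more than one of the three sets.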

Methodology framework
The entire workflow in this study is depicted in Figure 3, outlining the four stages designed to evaluate deep learning approaches for crop mapping using the fusion of multi-source, time-series satellite data. The initial stage focuses on the pre-processing of multi-temporal satellite acquisitions, specifically targeting the overlapping area in Bei'an (Figure 1). The processed image data are then subdivided into small patches, aligning with the ground truth labels. These image patches then serve as input for the CNN models considered in this study. The experimental stage compares the performance of the proposed model with other state-of-the-art methods, given multiple model input scenarios. Particularly, an ablation experiment is conducted during model training and testing to determine the key input scenario for crop identification in Bei'an 2017. Following this, this study assesses the efficacy of implementing data augmentation techniques. The final stage involves generating a county-level crop map using the best performer and analysing the model learning outcomes. In the subsequent sections, the specifics of the experiment are introduced, presenting aspects such as classification algorithms employed, the environment in which the models are deployed, compact polarimetric parameters, and augmentation techniques, to provide a comprehensive understanding of the methodology in this study.

3D-CNN
Convolutional Neural Networks (CNNs), inspired by the animal visual cortex, are a deep learning technique that extracts features by considering spatial context between pixels instead of operating on a single flattened vector as typical multilayer neural networks do (Sharma et al. 2017). CNNs can therefore also be regarded as a dimensionality reduction method for multidimensional inputs, owing to the feature extraction pattern performed by convolutional kernels. Conventional two-dimensional CNNs (2D-CNN), however, are limited to spatial features and may produce an overwhelming number of parameters if the multidimensional inputs have many channels (spectral information) or time steps (temporal information) (Mäyrä et al. 2021). Conversely, one-dimensional CNNs (1D-CNN) extract features from single-pixel temporal or spectral profiles of the input data without considering the spatial relationship between features. Although 2D-CNN can be combined with 1D-CNN to extract spatial-spectral or spatial-temporal information for improved results compared to dealing with information in only one dimension (Audebert, Le Saux, and Lefèvre 2019), the 2D-CNN component still requires a large number of model parameters. An alternative method to extract features simultaneously on both dimensions is the three-dimensional CNN (3D-CNN). The convolutional kernels in 3D-CNN are cubes and produce a feature map with volume rather than a two-dimensional image derived by 2D-CNN or a single vector by 1D-CNN. The three-dimensional convolving process can be written as follows:
$$Y_{i,j}^{x,y,d} = b_{i,j} + \sum_{n} \sum_{t=0}^{T-1} \sum_{p=0}^{S-1} \sum_{q=0}^{S-1} w_{(i-1),n}^{t,p,q} \, X_{(i-1),n}^{x+p,\,y+q,\,d+t}$$
where $Y_{i,j}^{x,y,d}$ is the output value at 3D coordinate $(x, y, d)$ on the $j$-th feature cube in the $i$-th layer, $(x, y)$ is the spatial position and $d$ indicates the temporal index. $w_{(i-1),n}^{t,p,q}$ is the 3D kernel value at location $(t, p, q)$ from the $n$-th feature cube in the previous layer. Similarly, $(p, q)$ is the spatial position, and $t$ denotes the temporal indicator of the kernel. $X_{(i-1),n}^{x+p,\,y+q,\,d+t}$ is the input at the correspondingly offset position from the $n$-th feature cube in the previous layer. $b_{i,j}$ is the bias on the $j$-th feature cube in the $i$-th layer. The size of the kernel is $T \times S \times S$, which is equivalent to length, height, and width. Empirically, the height equals the width in CNNs.
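The triple sum over a cubic kernel can be made concrete with a deliberately naive sketch for a single input cube and a single kernel. It uses 'valid' borders (no zero padding) and plain nested lists, both of which are simplifying assumptions; a deep learning framework would perform the same operation far more efficiently.

```python
def conv3d_valid(x, w, b=0.0):
    """Naive single-channel 3D convolution (cross-correlation form).

    x[d][row][col] is the input cube, w[t][p][q] is the kernel and b is
    the bias. Only positions where the kernel fits entirely are returned.
    """
    depth, height, width = len(x), len(x[0]), len(x[0][0])
    kt, kh, kw = len(w), len(w[0]), len(w[0][0])
    out = []
    for d in range(depth - kt + 1):
        plane = []
        for r in range(height - kh + 1):
            row = []
            for c in range(width - kw + 1):
                acc = b
                for t in range(kt):
                    for p in range(kh):
                        for q in range(kw):
                            acc += w[t][p][q] * x[d + t][r + p][c + q]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# A 4 x 4 x 4 cube of ones convolved with a 3 x 3 x 3 kernel of ones.
cube = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
feature = conv3d_valid(cube, kernel)
```

Each output value here sums 27 inputs, and the output shrinks to 2 x 2 x 2 because no padding is applied.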

ConvSTAR
A convolutional recurrent neural network (ConvRNN) is a variant of sequence modelling that is built with convolutional operations in state transitions instead of matrix multiplications for handling spatio-temporal data. A typical convolutional Long Short-Term Memory (ConvLSTM) is designed to capture the spatio-temporal correlations for precipitation forecasting (Shi et al. 2015), which outperforms the general LSTM structure in which spatial information is not considered. Given that stacking multiple ConvRNN layers contributes to feature extraction, a novel recurrent cell, namely the STAckable Recurrent cell (STAR), was developed to reduce exploding gradient effects and the number of trainable parameters (Turkoglu et al. 2021). The convolutional version of STAR (ConvSTAR) is a modification of ConvLSTM in which the input and output gates are removed. In its formulation, σ is the sigmoid activation function, ∗ denotes the convolution operator and ⊙ indicates the Hadamard product (elementwise multiplication).
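For orientation, the STAR update is commonly stated in the following form, written here with convolutions in place of matrix products. This is a sketch based on the general description of the cell by Turkoglu et al. (2021); the symbol names ($K_t$ for the gate, $Z_t$ for the candidate, $H_t$ for the hidden state, and the weight and bias names) are assumptions rather than the notation of this study.

```latex
\begin{aligned}
K_t &= \sigma\left(W_x * X_t + W_h * H_{t-1} + b_K\right) \\
Z_t &= \tanh\left(W_z * X_t + b_Z\right) \\
H_t &= \tanh\left((1 - K_t) \odot H_{t-1} + K_t \odot Z_t\right)
\end{aligned}
```

With a single gate $K_t$ and no separate cell state, the cell carries fewer parameters per layer than a ConvLSTM, which is consistent with the motivation for stacking several such layers.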

Synergic use of 3D-CNN and ConvSTAR
Recent studies have applied the combination of 3D-CNN and convolutional recurrent networks for univariate and multivariate time series forecasting. The 3D-CNN layers and attention ConvLSTM layers are utilized sequentially for multispectral soybean prediction (Nejad et al. 2022). The 3D-CNN layer can also be fed with features produced by ConvLSTM, serving as the model output layer for predicting urban expansion in image segmentation (Boulila et al. 2021). Enhanced performance on human action recognition was also demonstrated by the combined use of these two deep learning approaches (Wang et al. 2021). In this study, a similar hybrid feature learning framework, 3D-ConvSTAR, is proposed to improve crop classification performance (Figure 5). The proposed network consists of three stages. The first stage is a three-layer 3D-CNN with the optimal kernel size 3 × 3 × 3, considering that three convolutional layers demonstrated effectiveness over two-layer and four-layer networks (Ji et al. 2018). The layers have 32, 32, and 64 filters, respectively. Zero padding is applied so that the feature maps are not shrunk after each 3D convolutional operation. The convolutional kernels are moved with a stride of one. As previously introduced in Section 4.2.1, 3D-CNN is used to extract spatio-temporal features simultaneously. Next, the output tensors from the 3D-CNN module are reshaped and fed into a three-layer bidirectional ConvSTAR unit. Bidirectional recurrent cells preserve temporal information from both the future and past, alleviating temporal bias towards data in later time steps (Rußwurm and Körner 2018). The kernel size for ConvSTAR is 3 × 3 and the number of kernels for each layer is set to 64. The ConvSTAR layers are followed by a shallow one-layer 2D-CNN with 64 kernels of size 3 × 3, which takes in the final hidden state from the previous layer to further extract discriminative feature maps on the spatial dimension. It can also perform dimensionality reduction to some extent so that the number of model parameters can be optimized, since it reduces the size of the feature maps and preserves the main information captured by the previous layers (Mäyrä et al. 2021; Roy et al. 2019). The last part of 3D-ConvSTAR is constructed with three fully connected (FC) layers. The last two dense layers have 256 and 128 units respectively and both are followed by a dropout layer with a factor of 0.4 to prevent networks from overfitting. The activation function Rectified Linear Unit (ReLU) (Nair and Hinton 2010) is applied after CNN, convolutional recurrent, and FC layers to augment the nonlinearity of outputs and control model systematic errors (bias). Pooling layers are not applied after any layer in the proposed model since they could cause loss of information at multiple dimensions (Li, Huang, and Ji 2019).

M-chi decomposition
A similar methodology was developed for single-transmitted dual-receive polarization data that transmits at circular polarization and receives at horizontal and vertical polarization (Raney et al. 2012). This decomposition methodology, originally for compact polarimetric radar data, is based on the 2 × 2 covariance matrix, which is not applicable to general quad-pol data. This method is often characterized by the form composed of four-element Stokes parameters, which provides potential for hybrid polarimetric and dual-pol data. One application is the m-chi decomposition, in which the observed field is characterized by the degree of polarization (m) as:
$$m = \frac{\sqrt{S_2^2 + S_3^2 + S_4^2}}{S_1}$$
where $S_1, S_2, S_3, S_4$ represent the four Stokes parameters in the total power over an image field, and $m$ refers to the degree of polarization. Chi ($\chi$), a Poincaré variable, denotes the field's ellipticity and circularity, which can be expressed through:
$$\sin 2\chi = -\frac{S_4}{m S_1}$$
Based on these two variables calculated from the Stokes parameters, three target scattering parameters for the m-chi decomposition are derived, each as the square root of a share of the total power $S_1$: $P_s$ (single-bounce scattering), $P_d$ (double-bounce scattering) and $P_v$ (volume scattering). Previous studies have shown that the analysis of hybrid polarimetric data is close to, and occasionally equivalent to, the analysis of quad-pol data, e.g. using the conventional Freeman-Durden decomposition (Ainsworth, Kelly, and Lee 2009; Nord et al. 2008).
Even though all four required Stokes parameters can be derived from dual-polarized data such as Sentinel-1 SLC products, the challenge in this usage is the separability between single-bounce and double-bounce targets, because the ellipticity angle from dual-pol data is close to zero, which could decrease the difference between polarimetric scattering types. Despite this challenge, one scattering type among the mix of all scattering mechanisms often dominates agricultural targets at a certain growth stage, and the dominant scattering type could change with the development of the canopy (McNairn et al. 2009). Therefore, all scattering types of the m-chi decomposition are considered in this study as input predictors to classify crops across full growth stages.
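A minimal sketch of the m-chi computation from the four Stokes parameters is given below. It follows the general form attributed to Raney et al. (2012), but the sign convention for sin 2χ and the assignment of the (1 + sin 2χ) term to single-bounce scattering depend on the transmit handedness, so both should be treated as assumptions here. The study itself obtained these parameters with SNAP, not with this code.

```python
import math


def m_chi(s1, s2, s3, s4):
    """Return (m, sin_2chi, p_s, p_d, p_v) from Stokes parameters.

    Assumed convention: sin(2*chi) = -s4 / (m * s1), with single-bounce
    power tied to (1 + sin_2chi) and double-bounce to (1 - sin_2chi).
    """
    m = math.sqrt(s2 ** 2 + s3 ** 2 + s4 ** 2) / s1
    m = min(m, 1.0)  # guard against rounding above full polarization
    sin_2chi = -s4 / (m * s1) if m > 0 else 0.0
    sin_2chi = max(-1.0, min(1.0, sin_2chi))
    p_s = math.sqrt(m * s1 * (1.0 + sin_2chi) / 2.0)  # single bounce
    p_d = math.sqrt(m * s1 * (1.0 - sin_2chi) / 2.0)  # double bounce
    p_v = math.sqrt(s1 * (1.0 - m))                   # volume
    return m, sin_2chi, p_s, p_d, p_v


# Hypothetical Stokes vector for one pixel.
m, sin_2chi, p_s, p_d, p_v = m_chi(1.0, 0.3, 0.0, -0.4)
```

Whatever sign convention is chosen, the squared components add up to the total power S1, which makes a convenient sanity check on any implementation.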

Model implementation
Considering the model inputs, we adopted the typical remote sensing scene classification approach and extracted square image patches centred around a labelled crop category. Square patches for each crop class are generated with the size of 9 × 9, 11 × 11, 15 × 15 and 21 × 21. Larger image patches may cover multiple polygons of different crop types and lead to more model parameters. Therefore, after examination, the optimal size of image patches for classification is set to 11 × 11. The complete shape of input image patches for the proposed model is thus 11 × 11 × 26 × 3 (width × height × sequence × channel). Patches without any ground truth information and patches with missing pixel values were eliminated.
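Patch extraction around a labelled pixel can be sketched as follows. The nested-list layout `cube[t][c][row][col]` and the function name are assumptions made for illustration; note that this layout orders the axes as sequence, channel, rows, columns rather than the width × height × sequence × channel order quoted above.

```python
import math


def extract_patch(cube, row, col, size=11):
    """Cut a size x size window centred on (row, col) from every date and band.

    Returns None when the window leaves the image or contains a NaN,
    mirroring the elimination of incomplete patches.
    """
    half = size // 2
    n_rows, n_cols = len(cube[0][0]), len(cube[0][0][0])
    if (row - half < 0 or col - half < 0
            or row + half >= n_rows or col + half >= n_cols):
        return None
    patch = [[[r[col - half:col + half + 1]
               for r in band[row - half:row + half + 1]]
              for band in date]
             for date in cube]
    for date in patch:
        for band in date:
            for r in band:
                if any(math.isnan(v) for v in r):
                    return None
    return patch


# Hypothetical stack: 26 dates, 3 channels, 15 x 15 pixels of zeros.
stack = [[[[0.0] * 15 for _ in range(15)] for _ in range(3)]
         for _ in range(26)]
patch = extract_patch(stack, 7, 7)
```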
The models are trained with the Adam optimizer (Kingma and Ba 2014) and a cross-entropy loss function (Botev et al. 2013). The learning rate for the optimizer is set to 0.001, and weight decay is set to 0.0001. A batch size of 128 is used during the training stage, and the model with the best validation accuracy is saved for later model inference. Another regularization strategy, namely early stopping, is used to prevent the models from overfitting, since it terminates model training once the validation performance has stabilized during the iterative training process.
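The early stopping rule and best-model checkpointing can be sketched independently of any framework. The patience value and the sample accuracy curve below are hypothetical, since the study does not report them; only the learning rate, weight decay and batch size are taken from the text.

```python
# Hyperparameters stated in the text.
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.0001
BATCH_SIZE = 128


def early_stopping_epoch(val_scores, patience=10, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores.

    Training stops once the score has not improved by more than min_delta
    for `patience` consecutive epochs; best_epoch marks the saved model.
    """
    best, best_epoch, wait = float("-inf"), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Hypothetical validation accuracy per epoch.
history = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.82]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=3)
```

In this example training would halt at epoch 5 and the weights from epoch 2 would be kept for inference.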
The proposed 3D-ConvSTAR is compared with deep learning architectures applied in other studies: a 1D-CNN-based architecture, namely temporal CNN (TCNN) (Pelletier, Webb, and Petitjean 2019), a typical 3D-CNN (Ji et al. 2018), a hybrid 3D-2D CNN (Roy et al. 2019) and a 3-layer ConvSTAR (Turkoglu et al. 2021). This study reproduced the implementations of the models in the aforementioned studies. The modelling environment is implemented in Python 3.7.15 with Tensorflow backend (2.5.0) and Keras library (2.1.1) for model construction and generalization under two graphic devices of NVIDIA Quadro P4000 (8 GB RAM per GPU), and two processors of Intel(R) Xeon(R) Silver 4114 CPU (2.20 GHz/2.19 GHz).

Data augmentation
Imbalanced class distribution in real-world label datasets often leads to bias in supervised classification approaches, where machine learning models, under typical model training schemes, are prone to weigh importance in favour of majority classes (Dong, Gong, and Zhu 2018; Ren et al. 2018). Underrepresented crop classes in certain places may have the same or even higher value (either financial or ecological) as the crops that occur more frequently (Turkoglu et al. 2021). The annual statistics record that the total sown area for wheat is 6,309 hectares compared to 64,564 and 135,401 hectares for maize and soybean in Bei'an 2017 (Heihe Social and Economic Statistics Yearbook 2018). Consequently, class-balanced model evaluation is crucial for agricultural mapping applications, as fine-structured agricultural systems in local areas could lead to crop classes with imbalanced distribution.
One approach to address this issue is typically called inverse frequency weighting (Cui et al. 2019). The training samples of minority classes are assigned higher weighting factors than majority classes for calculating training loss, which neutralizes the model bias towards majority classes. The underlying side effect of weighting samples is that it could weaken model performance on majority classes, to which lower weights are given. Another approach to balance the dataset is either via oversampling minority classes (Ling and Sheng 2008) or undersampling majority classes (He and Garcia 2009). Both techniques depend on a trade-off for the number of samples across all classes. Missing data for dominant classes due to undersampling could severely affect model performance, considering that deep learning and machine learning methods are data-driven and data-hungry. Therefore, undersampling is not used in this study.
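Inverse frequency weighting can be illustrated with a short sketch. The normalization used here (total divided by the number of classes times the class count) is one common heuristic and is an assumption, as the study does not state its exact formula. The sown areas in hectares are taken from the statistics quoted earlier and stand in for per-class sample counts, which is also an assumption.

```python
def inverse_frequency_weights(counts):
    """Weight each class inversely to its frequency.

    Uses total / (n_classes * count), so the weighted counts of all
    classes are equal and sum to the original total.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


# Sown area (ha) used as a proxy for class frequency.
area_ha = {"maize": 64564, "soybean": 135401, "wheat": 6309}
weights = inverse_frequency_weights(area_ha)
```

Under this scheme a wheat sample contributes far more to the loss than a soybean sample, which is exactly the source of the side effect on majority classes noted above.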
This study applied an oversampling technique followed by a rotation of image patches. Each resampled sample patch is randomly rotated within 180 degrees and flipped horizontally. Oversampling minority classes, however, is not always a panacea, as it might not significantly improve the mapping of minority classes when the sample size is too small. A recent data augmentation method called mix-up, designed to linearly combine labelled images for model training, has proven successful in mapping tree species (Mäyrä et al. 2021). In this study, the input image patches of different crop classes and their corresponding labels (categorical encoding) are each mixed to generate blended, synthetic datasets; for example, an image patch may contain 20% soybean and 80% wheat. The class distribution of the mix-up output is also balanced. This study proposes a joint learning structure that combines deep learning models to counter imbalanced class distribution and compares it with the aforementioned data augmentation techniques, including oversampling, inverse weighting, and mix-up, for mapping minority classes.
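The linear blending behind mix-up can be sketched as follows. This is an illustration rather than the implementation used in this study: the helper names `mix_pair` and `mixup_batch`, the Beta(0.2, 0.2) sampling of the mixing ratio, the class order, and the patch shape (time steps and channels) are all assumptions.

```python
import numpy as np


def mix_pair(x_a, y_a, x_b, y_b, lam):
    """Blend two patches and their one-hot labels with mixing ratio lam."""
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b


def mixup_batch(x, y, alpha=0.2, rng=None):
    """Blend every sample in a batch with a randomly chosen partner."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha, size=x.shape[0])    # one ratio per sample
    perm = rng.permutation(x.shape[0])               # partner indices
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))   # broadcast over patch dims
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    y_mix = lam[:, None] * y + (1.0 - lam[:, None]) * y[perm]
    return x_mix, y_mix


# Hypothetical class order: maize, soybean, wheat, other crops.
soybean = np.array([0.0, 1.0, 0.0, 0.0])
wheat = np.array([0.0, 0.0, 1.0, 0.0])
rng = np.random.default_rng(0)
patch_a = rng.normal(size=(4, 11, 11, 3))  # assumed (time, 11, 11, channels)
patch_b = rng.normal(size=(4, 11, 11, 3))

# The example from the text: 20% soybean and 80% wheat.
x_mixed, y_mixed = mix_pair(patch_a, soybean, patch_b, wheat, lam=0.2)

# Batch version on random data.
x_batch = rng.normal(size=(8, 4, 11, 11, 3))
y_batch = np.eye(4)[rng.integers(0, 4, size=8)]
x_aug, y_aug = mixup_batch(x_batch, y_batch, rng=rng)
```

The blended label keeps the mixing proportions, so the loss is computed against a soft target instead of a single class.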

Model interpretation
The interpretability of deep learning models in crop mapping tasks is still limited because the extracted higher-level features are produced by a hidden learning process, often called a 'black box' (Heo, Joo, and Moon 2019). Explaining deep learning methods helps users understand the intricate patterns of crop growth and evaluate model reliability for crop mapping (Xu et al. 2021; Zhang and Hu 2019).
Previous studies investigated deep learning methods by visualizing intermediate layers of the networks to monitor the model learning process and the temporal learning patterns of certain crops (Xu et al. 2020; Zhong, Hu, and Zhou 2019). Another approach to interpreting deep learning models relies on gradient-based explanations (Bastings and Filippova 2020; Mäyrä et al. 2021; Rußwurm and Körner 2020; Xu et al. 2021). Typically, the deep learning model input and corresponding labels are fitted with neural-network-based functions that are differentiable and perform nonlinear transformations. Following the gradient descent algorithm, the model weights are iteratively updated during feature extraction to minimize the difference between the predicted outputs and the corresponding true labels. This study computed the gradients of the predicted scores for each crop type with respect to the input image patches for the proposed model via vanilla backpropagation. The gradient for each crop is an array of partial derivatives that signifies how changes in the input features relate to changes in the corresponding prediction score. The highest gradient magnitudes indicate the pixels most influential in identifying a given crop. The prediction score is the model soft output derived by the softmax function at the last layer of the proposed model (Figure 5), and it indicates the confidence of the proposed model in the classification result for each crop category. The gradients for each class can be visualized via saliency maps. Because the spatial dimension of the sample patches used in this study is only 11 × 11, assessing the important input features along the spatial dimension may not decisively or accurately reveal the locations that contribute to spatial importance for crop mapping. Therefore, the results serve more as a performance check for the proposed model.
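To make the gradient-based idea concrete, the sketch below computes the gradient of one softmax score with respect to an input patch. It deliberately swaps the proposed network for a toy linear-softmax classifier, for which the gradient has a closed form; the weights `W`, bias `b`, patch shape, and class index are hypothetical. For a deep network the same quantity would be obtained by automatic differentiation.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def class_score_gradient(x, W, b, c):
    """Gradient of the softmax score p_c with respect to the input patch x.

    Toy model: p = softmax(W @ x.ravel() + b). By the chain rule,
    dp_c/dx = p_c * (W[c] - sum_k p_k * W[k]).
    """
    p = softmax(W @ x.ravel() + b)
    grad = p[c] * (W[c] - p @ W)
    return grad.reshape(x.shape)


rng = np.random.default_rng(0)
n_classes, patch_shape = 4, (11, 11, 3)     # assumed: 4 classes, 3 channels
W = rng.normal(scale=0.1, size=(n_classes, int(np.prod(patch_shape))))
b = np.zeros(n_classes)
x = rng.normal(size=patch_shape)

# Saliency for class index 2: largest absolute gradient across channels.
gradient = class_score_gradient(x, W, b, c=2)
saliency = np.abs(gradient).max(axis=-1)    # shape (11, 11)
```

Averaging such maps over many test patches of one class gives a per-class picture of which pixel positions influence the score most.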

Evaluation metrics
For the accuracy assessment of each network, overall accuracy (OA), Cohen's kappa coefficient (Kappa), and F1 score were selected as the performance indicators in this study. The OA evaluates the overall model performance. It is calculated by summing the number of correctly classified samples $n_i^{\mathrm{correct}}$ over the $C$ classes and dividing by the total number of samples $N$, as in Eq. (10):

$$\mathrm{OA} = \frac{1}{N}\sum_{i=1}^{C} n_i^{\mathrm{correct}} \tag{10}$$

The Kappa coefficient, also known as Cohen's Kappa, is a widely used metric in deep learning and remote sensing studies to assess the performance of classification models, particularly in the context of crop classification (Congalton 1991; Foody 2004). It is computed from the empirical probability of observed agreement, i.e. the OA, and the expected agreement $p_e$, as in Eqs. (11) and (12). $p_e$ is calculated from $n_i^{p}$, the total number of predicted labels for class $i$, and $n_i^{t}$, the total number of ground truth labels for class $i$. Kappa values range from −1 to 1, with values closer to 1 interpreted as a high level of agreement between the predicted and ground truth labels and values closer to 0 indicating that the agreement is no better than chance. In remote sensing and crop classification studies, the Kappa coefficient is often used alongside other performance metrics, such as the overall accuracy, producer's accuracy, and user's accuracy, to provide a comprehensive evaluation of the classification model (Congalton 1991):

$$\mathrm{Kappa} = \frac{\mathrm{OA} - p_e}{1 - p_e} \tag{11}$$

$$p_e = \frac{1}{N^2}\sum_{i=1}^{C} n_i^{p}\, n_i^{t} \tag{12}$$

The F1 score is used to measure per-class classification performance, since the sample data were imbalanced in this study. It is the harmonic mean of the producer's accuracy (Recall) and the user's accuracy (Precision) (Stehman 2001) and is determined as follows:

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
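The three metrics can all be derived from a single confusion matrix, as in the following sketch. The function name `classification_metrics` and the example matrix are hypothetical; rows are taken to be ground truth classes and columns predicted classes, and every class is assumed to occur at least once in both, so that no division by zero arises.

```python
import numpy as np


def classification_metrics(cm):
    """Return (OA, Kappa, per-class F1) from a confusion matrix.

    cm[i, j] is the number of samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()                      # total number of samples N
    correct = np.diag(cm)             # correctly classified samples per class
    n_true = cm.sum(axis=1)           # ground truth labels per class
    n_pred = cm.sum(axis=0)           # predicted labels per class

    oa = correct.sum() / n                       # overall accuracy
    p_e = (n_pred * n_true).sum() / n ** 2       # expected agreement
    kappa = (oa - p_e) / (1.0 - p_e)             # Cohen's kappa

    recall = correct / n_true         # producer's accuracy
    precision = correct / n_pred      # user's accuracy
    f1 = 2.0 * precision * recall / (precision + recall)
    return oa, kappa, f1


# Hypothetical counts for four classes (not results from this study).
example_cm = [
    [90, 8, 0, 2],
    [6, 85, 1, 8],
    [0, 2, 17, 1],
    [3, 9, 1, 27],
]
oa, kappa, f1 = classification_metrics(example_cm)
print(f"OA={oa:.3f}  Kappa={kappa:.3f}  mean F1={f1.mean():.3f}")
```

Because F1 is computed per class and then averaged without weighting, a weak minority class lowers the mean F1 much more than it lowers the OA.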

Classification results
The deep learning models using m-chi decomposition features yielded better classification results than those using backscatter and its cross-ratio (Table 1). TCNN in particular, as a 1D-CNN-based architecture, benefited significantly from polarimetric features, which increased its OA and Kappa by >20% and >30%, respectively. For the models that consider both spatial and temporal dimensions, m-chi decomposition features only slightly improved classification accuracy over backscatter, although these models still outperform TCNN. Compared with the other models, the proposed method, 3D-ConvSTAR, achieved the highest OA with both backscatter (82.0%) and m-chi decomposition (89.4%). From another perspective, incorporating a few optical acquisitions into the sequential radar dataset improved performance for all models compared to using standalone multi-temporal SAR data, and the deep learning models combining multispectral bands and polarimetric features performed best among all scenarios. In this scenario, the proposed method, 3D-ConvSTAR, outperformed the other approaches with the highest OA (91.7%) and Kappa (85.7%). The second-best performer is the standalone 3D-CNN under the same scenario.
Table 2 presents the F1 scores yielded by the deep learning models for each crop type, based on different combinations of input features. Overall, the polarimetric parameters are better than backscatter according to the F1 scores derived by each model for the crop types. In particular, m-chi decomposition features lead to a significant improvement for TCNN in differentiating crops compared to backscatter. For the models that consider the spatio-temporal dimension, m-chi decomposition features remain more reliable predictors than backscatter for identifying maize and soybean, and 3D-ConvSTAR shows its advantage on the minority class, producing an F1 score of 81.8% for wheat, the highest among the models using m-chi decomposition. With the combination of multispectral bands and SAR features, ConvSTAR produced the highest F1 scores for maize (93.8%) and soybean (92.3%), which are only 0.1 percentage points higher than the same measurements derived by 3D-ConvSTAR. However, 3D-ConvSTAR also yielded the leading performance on less frequent classes, including wheat (90.9%) and other crops (74.0%), resulting in the highest mean F1 (87.7%) among all types of input features used. Therefore, the best-performing model for crop classification with an imbalanced dataset is the proposed 3D-ConvSTAR with the combined features of m-chi decomposition and multispectral bands.
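As a quick consistency check, the mean F1 reported for 3D-ConvSTAR with combined optical and m-chi features can be reproduced from the per-class scores quoted above (maize and soybean being 0.1 percentage points below the ConvSTAR values), assuming it is the unweighted average over the four classes:

```latex
\overline{F1}_{\text{3D-ConvSTAR}}
  = \frac{93.7 + 92.2 + 90.9 + 74.0}{4}
  = \frac{350.8}{4}
  = 87.7\%
```

The unweighted average gives the 'other crops' class the same influence as maize or soybean, which is why the mean F1 sits well below the OA of 91.7%.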
The classification performance using 'Optical+m-chi' as input features varies across data augmentation techniques, with some models performing better or worse than the baseline where no data augmentation is applied to the training data (Table 3). The oversampling method generally performed better than the other augmentation methods. With oversampling, TCNN improved significantly, from 64.8% to 83.5% in mean F1. In contrast, balanced loss and mix-up reduced the overall performance of 3D-2D CNN and 3D-ConvSTAR. For certain models, such as 3D-ConvSTAR, some data augmentation methods marginally increased identification performance for the majority crops, maize and soybean, by 0.05% and 0.01%, respectively, while lowering the performance for minority classes. Furthermore, the methods are compared based on model predictions for different geographical locations within Bei'an, taking for each model the data augmentation technique that achieved the highest mean F1 score (Figure 6). A close examination of the classification accuracies reveals that the proposed 3D-ConvSTAR method outperforms the other models in most cases. At Site A, the 3D-ConvSTAR model yields an accuracy of 89.4%, which is higher than the accuracies derived by TCNN with oversampling (84.3%), 3D-CNN with balanced loss (85.4%), 3D-2D CNN with oversampling (89.0%) and ConvSTAR with oversampling (87.1%). At Site B, the 3D-ConvSTAR method and ConvSTAR with oversampling both achieve a classification accuracy of 95.9%. While the 3D-2D CNN with oversampling exhibits a slightly higher accuracy (96.5%), the 3D-ConvSTAR model outperforms the other two methods, namely TCNN with oversampling (90.3%) and 3D-CNN with balanced loss (96.0%). Finally, at Site C, the proposed 3D-ConvSTAR model demonstrates the second-highest classification accuracy (97.3%), only surpassed by the 3D-2D CNN with oversampling (97.5%). Nevertheless, the difference is marginal and the 3D-ConvSTAR method outperforms the rest of the scenarios. In summary, the proposed 3D-ConvSTAR demonstrates competitive performance in comparison to the other deep learning models fed with augmented data for crop mapping across all three sites. It also achieves the highest mean F1 score, 87.7%, on the testing dataset compared with the models using augmented data (Table 3). Therefore, the proposed method was used to predict unseen data for the creation of a thematic map.
As shown in Figure 2, ten per cent of the total ground truth samples were used for the dataset split, of which sixty per cent was used for model training. The pre-trained 3D-ConvSTAR then produced the annual crop map for Bei'an in 2017 during model inference, shown in Figure 7. The predicted results were also compared with the total ground truth dataset in the confusion matrix illustrated in Figure 8. The cell in the top left corner shows that 96.58% of maize instances were correctly classified as maize, while 2.92% of them were misclassified as soybean and 0.00% as wheat. Similarly, the cell in the bottom right corner shows that 86.87% of instances of other crops were correctly classified, while 11.18% of them were misclassified as soybean. In general, the model performs relatively well in distinguishing maize, soybean, and wheat, while it struggles more with identifying other crops.

Model interpretation
The prediction scores, as outlined in Section 4.5, are soft outputs produced by the final layer of the 3D-ConvSTAR model (illustrated in Figure 5). Figure 9 visually demonstrates the confidence level of the proposed model in its crop classification performance on the testing dataset. The results indicate that 3D-ConvSTAR exhibits a higher level of confidence in its mapping of maize, soybean, and wheat than in its mapping of other crops. This is evident from the concentration of prediction scores for most samples of the three classes, which on average hover around 90%. The proposed model is less confident in identifying other crops, as shown by the relatively lower mean prediction score (71%), which is consistent with the misclassification reflected by the F1 score in Table 2, despite the fact that the 'other crops' category has a larger number of training samples than wheat (see Figure 2). The average gradient magnitudes are represented by saliency maps, as shown in Figure 10. These maps represent the most important spatial locations for each crop type in the image samples of the testing dataset. The shape of the pixel chunks with the highest importance for maize may complement the part of the lowest importance for soybean and vice versa, indicating that maize and soybean in the samples are mostly intercropped. The pixel importance for wheat shows that the field shapes in the samples are mostly separated. However, the pixel importance for other crops is scattered without forming a clear shape, which corresponds to the relatively lower mapping accuracy for those crops.
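The per-class average prediction score can be sketched as below. This assumes the score of interest is the softmax probability that the model assigns to the ground truth class of each sample; whether Figure 9 groups samples by true or by predicted class is not stated here, so the grouping, the function name `mean_prediction_score`, and the toy values are assumptions.

```python
import numpy as np


def mean_prediction_score(probs, labels, n_classes):
    """Average softmax score given to the true class, computed per class.

    probs:  (n_samples, n_classes) softmax outputs.
    labels: (n_samples,) integer ground truth class indices.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Score that each sample received for its own ground truth class.
    true_class_score = probs[np.arange(len(labels)), labels]
    return np.array(
        [true_class_score[labels == c].mean() for c in range(n_classes)]
    )


# Toy softmax outputs for two classes (not results from this study).
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [0, 0, 1]
scores = mean_prediction_score(toy_probs, toy_labels, n_classes=2)
```

A class whose mean score is clearly lower than the others, as reported for 'other crops', is one where the model spreads probability across several categories.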

Discussion
In this study, various deep learning architectures for patch-based crop mapping are evaluated using SAR and fused SAR-optical data. The proposed method functions as an ensemble deep learning model by synergistically connecting 3D-CNN and ConvSTAR, reaching the highest overall performance among all models in terms of OA (91.7%) and Kappa (85.7%) for the identification and classification of different crops (see Table 1). This study also validated the effectiveness of the m-chi SAR polarimetric decomposition parameters over SAR backscatter for identifying certain crops, which confirms the findings of previous studies (e.g. De, Kumar, and Rao 2014; Sonobe 2019). The integration of a few optical acquisitions with time-series SAR image data does improve the overall classification performance compared to standalone SAR data, but the F1 scores for maize and soybean did not increase significantly (<10% on average) for any model (Table 2). These results are likely explained by the fact that maize and soybean are the dominant crops in the county, which provides an enormous number of labelled ground truth samples (Figure 2), so both crops can be learned well by these data-driven models. In contrast, the fusion of optical and m-chi decomposition features substantially improved the mapping of minority classes for all models, especially for wheat. The proposed method with multi-source fusion yields the highest F1 score for wheat (90.9%) compared to using backscatter (69.4%) or polarimetric features (81.8%). This indicates that multispectral information contributes most to the enhancement of mapping for minority crops.
We discovered that crop mapping performance can be enhanced not only by fusing SAR-optical datasets, which provide spatio-temporal, polarimetric, and spectral characteristics related to different crop structures (Gao et al. 2018; Van Tricht et al. 2018), but also by utilizing multispectral information as a reliable complementary source owing to the synergistic nature of SAR and optical data. This study demonstrates that Sentinel-1 and Sentinel-2 imagery exhibit a mutually complementary effect, increasing the sensitivity of both sensors to specific crop class characteristics throughout the growing season. Sentinel-2 data can be associated with the quantitative analysis of chlorophyll and moisture content in crop leaves, with spectral bands such as the Vegetation Red Edge being particularly useful for differentiating certain crops, confirming previous findings by Guerschman et al. (2003) and You and Dong (2020). Sentinel-1 data is sensitive to morphological variations, as it provides biophysical, structural, and agronomic characteristics, and is strongly correlated with the structural development of crops during the growing season (Adrian, Sagan, and Maimaitijiang 2021; Sonobe 2019).
In addition to SAR backscatter, polarimetric parameters can reflect in-depth scattering properties of crops because scattering mechanisms have robust physical interpretability, making them useful for crop mapping (Gao et al. 2018; He et al. 2020; Liao et al. 2020; Xie et al. 2019). Polarimetric features directly relate to the underlying physical properties of the crops. These parameters can be used to quantify the contribution of different scattering mechanisms and provide insights into the crops' biophysical, structural, and agronomic properties, such as crop type, plant density, leaf area index, and growth stage. In this study, we capitalized on SAR scattering diversity, such as surface scattering, volume scattering, and double-bounce scattering (see Section 4.2), which are responsible for the interactions between the electromagnetic waves emitted by SAR sensors and the various components of a crop's structure. Sentinel-1, however, is a dual-polarized SAR sensor, which restricts the available decomposition methods. While quad-polarimetric data can be analysed using fully polarimetric decomposition algorithms, this study found that m-chi decomposition, originating from compact-pol platforms, effectively maps crops using dual-polarization data and synergizes well with optical bands. It is important to note that fully polarimetric SAR systems often have reduced swath coverage and relatively inconsistent temporal frequency, posing challenges for crop mapping across extensive areas (Sonobe 2019).
In the ablation study on different input features, the proposed network outperformed the deep learning methods with standalone architectures (TCNN, 3D-CNN, and ConvSTAR) and combined architectures such as the 3D-2D CNN proposed in other studies for crop mapping under the same input predictors, as detailed in Table 2. The model performance varies significantly across crop classes, with wheat and other crops generally exhibiting lower F1 scores than maize and soybean. The 3D-ConvSTAR improved overall performance, in particular the per-class performance in separating maize, soybean, and wheat, providing a beneficial method for local industries given the commercial interest in these crops. We also compared various deep learning models with data augmentation techniques for crop mapping to deal with imbalanced class distribution (Table 3). Examining the F1 scores for individual crop classes, the 3D-ConvSTAR model with mix-up outperforms other combinations for the majority crops, achieving 94.2% for maize and 92.3% for soybean, but with reduced performance for wheat and other crops. All models produced similar results for maize and soybean after data augmentation techniques were applied, and oversampling generally outperforms other augmentation methods. This finding, while preliminary, may imply that mix-up could be more useful for enriching training samples when data collection is a major challenge in certain research fields. For example, it is particularly well-suited to augmenting imagery data collected from the human nervous system (Smucny et al. 2022) and to increasing the airborne training sample size for mapping species (Mäyrä et al. 2021). However, it may not be useful for overcoming imbalanced class distribution.
The proposed network, in terms of average F1 score, outperforms other approaches that rely on data augmentation techniques, primarily because it effectively integrates the temporal nature of remote sensing data into a more sophisticated input space while accounting for the spatial relationships between features along the time series. This leads to a better separation of crop types with homogeneous representations (Figures 7 and 8). However, the proposed method has a higher number of trainable parameters than alternative methods, resulting in increased model training time. This is mainly due to the connections between the learned features produced by the 3D-CNN, ConvSTAR, and the subsequent shallow 2D-CNN layers. All classifiers exhibit suboptimal performance for the 'other crops' category, which can be attributed to its mixture of various crop types. Each of these unknown crop types is represented by only a relatively small sample size in the training data, thereby limiting the model's ability to identify it. Although the 'other crops' category has more training samples than wheat (Figure 2), it may still be underrepresented compared to the samples for maize, soybean, and wheat in the dataset. This could lead the model to focus on the distinctive class labels and consequently perform poorly in the 'other crops' category, with its mixed labels for unknown crop types. The 'other crops' category may encompass a wide range of crop types, each with distinct spectral, polarimetric, and temporal signatures. For example, there is only one field parcel for rice in the ground truth data collection, so it was labelled as other crops in this study. This increased diversity may make it more challenging for the deep learning model to accurately identify and classify these crops. In contrast, maize, soybean, and wheat may have more consistent and easily distinguishable characteristics, allowing for a higher F1 score. Additionally, the features extracted from the combination of SAR and optical data might be more informative and discriminative for maize, soybean, and wheat than for the 'other crops' category. The complexity of the features for the 'other crops' category might be higher, making it more difficult for the model to learn and correctly classify these samples, which leads to the lower F1 score observed.
The results overall demonstrate the advantages of combining polarimetric and multispectral data from Sentinel-1 and Sentinel-2. Data fusion provides additional information for classification algorithms to exploit crops' structural details, while also offering supplementary polarimetric and spectral properties. The discrepancy in F1 scores between the wheat and 'other crops' categories highlights the need for further research into optimizing model architectures and methodologies to enhance crop mapping across all classes. Moreover, future work should focus on integrating fully polarimetric data with optical data to further improve crop mapping accuracy by applying popular deep learning architectures, such as Fully Convolutional Neural Networks (FCN), which perform pixel-wise segmentation on images. For instance, the 3D U-Net architecture can be employed to extract the spatio-temporal features of crops (Adrian, Sagan, and Maimaitijiang 2021; Ji et al. 2018). The contribution of SAR texture information combined with optical data to semantic segmentation for crop mapping also merits further exploration. Recent studies have employed a self-attention-based convolutional recurrent network to learn temporal dependencies of multivariable time series (Fu et al. 2022) and combined 3D-CNN with an attention-based recurrent network for crop yield prediction (Nejad et al. 2023). Both studies assessed the feasibility of attention mechanisms in extracting attentive spatio-temporal features. Consequently, future research could involve the integration of 3D-CNN with attention-based convolutional recurrent networks, such as ConvSTAR, for crop mapping and comparison with architectures for semantic segmentation. More importantly, the model's robustness should be further evaluated for predicting crop types in different years. Model behaviours may be influenced by interannual variability within the same region, and recurrent structures have shown promise in capturing crop phenological characteristics and enhancing model generalization (Xu et al. 2021). Assessing the model's spatial transferability is also a critical aspect of future research, given the potential application of the model in diverse geographical contexts. This could facilitate the design of efficient strategies for improved applicability, potentially contributing to the optimization of agricultural practices and crop mapping on a global scale. One such strategy is to train the model with representative crop datasets that accurately reflect the complexity and heterogeneity of the agricultural landscape. Alternatively, the model's parameters could be adjusted to accommodate the unique conditions of specific locations.

Conclusions
In this research, we proposed a workflow for multi-temporal crop mapping based on the fusion of Sentinel-1 polarimetric parameters and Sentinel-2 multispectral reflectance, combined with various deep learning architectures. The proposed 3D-ConvSTAR, which connects 3D-CNN layers and convolutional recurrent layers, delivers enhanced classification performance for crop mapping in comparison to the architectures designed in previous studies. Additionally, the designed architecture is robust when trained on a dataset with imbalanced class distribution and outperforms the data augmentation techniques examined. This study demonstrates that crop mapping can be conducted with high accuracy using the proposed 3D-ConvSTAR in terms of overall accuracy and F1 score for each crop class. Although the implemented architecture is likely not the optimal solution, given its large number of trainable parameters, it still produces accurate and valuable results for separating the crops with significant commercial value in Bei'an. While the proposed network exhibits superior performance in terms of crop type separation and accounting for the temporal and spatial relationships in remote sensing data, it is essential to address the challenges posed by the increased number of trainable parameters and the inherent limitations in classifying underrepresented crop types. Future research should focus on optimizing the network architecture and exploring alternative approaches to improve classification accuracy across all crop types while minimizing the computational cost of training the model. The model's generalization for crop mapping needs further assessment based on interannual and spatial transferability.

Figure 1. The study area in Bei'an. The multi-temporal Sentinel-1 and Sentinel-2 data are overlapped to capture the area that is covered by complete time-series acquisitions.

Figure 2. The sample class distribution with the number of pixels (y-axis) at the logarithmic scale for the Bei'an dataset collected in 2017 (left), and 10% of the dataset is split into subsets for training, validation, and testing (left). The overall distribution of crop parcel size (right). The parcels larger than 9 hectares are accumulated in the last bin of the histogram. The average parcel size is 1.39 hectares.

Figure 3. The overall workflow of the experiments.

Figure 4. The structure of a ConvSTAR cell.

Figure 6. Comparison of classification performance between the models with data augmentation techniques applied and the proposed method across various sites within Bei'an. Percentages indicate the proportion of correctly classified samples with respect to ground truth labels.

Figure 7. The annual crop map for Bei'an 2017. It was produced by 3D-ConvSTAR, weakly supervised with ten per cent of all ground truth samples. The areas not designated as cropland were excluded using a cropland mask introduced in Section 3.1.

Figure 8. The confusion matrix for the comparison between predicted labels derived by 3D-ConvSTAR and all ground truth labels.

Figure 9. The prediction score distribution for each crop, derived by the last dense layer of 3D-ConvSTAR. The red dashed lines indicate average prediction scores.

Figure 10. The saliency maps represented by the average magnitude of gradients for each crop. 1500 image patches were randomly extracted from the testing dataset and fed into 3D-ConvSTAR to generate saliency maps for illustration.
The input $X_t$ is first non-linearly projected through the activation function in $Z_t$. In addition, the previous state $H_{t-1}$ and the new inputs are linearly combined in the gating module, which determines the state-to-state flow, to create a new hidden state. $W$ and $B$ are matrices for weight and bias, respectively. The hidden state $H_t$ is the output of a single ConvSTAR layer, which can be used for classification or as the new input for the next layer or other decoders.

Table 1. The comparison of model performance based on multiple composite features. The best scores for each metric are highlighted in bold, and the second best are underlined.

Table 2. The comparison of model performance in each class. The best score for each column is highlighted in bold and the second best is underlined.

Table 3. The comparison of model performance with data augmentation techniques applied. The best measurements for each column are highlighted in bold, and the second-best are underlined.