Objective evaluation-based efficient learning framework for hyperspectral image classification

ABSTRACT Deep learning techniques with remarkable performance have been successfully applied to hyperspectral image (HSI) classification. Due to the limited availability of training data, earlier studies primarily adopted the patch-based classification framework, which divides images into overlapping patches for training and testing. However, this framework results in redundant computation and possible information leakage. This study proposes an objective evaluation-based efficient learning framework for HSI classification. It consists of two main parts: (i) a leakage-free balanced sampling strategy and (ii) an efficient fully convolutional network (EfficientFCN) optimized for the accuracy-efficiency trade-off. The leakage-free balanced sampling strategy first generates balanced and non-overlapping training and test data by partitioning the HSI and its ground truth image into non-overlapping windows. The generated data are then used to train and test the proposed EfficientFCN. EfficientFCN exhibits a pixel-to-pixel architecture with modifications for faster inference and improved parameter efficiency. Experimental results demonstrate that the proposed sampling strategy provides objective performance evaluation and that EfficientFCN outperforms many state-of-the-art approaches in terms of the speed-accuracy trade-off. For instance, compared to the recent efficient models EfficientNetV2 and ConvNeXt, EfficientFCN achieves 0.92% and 3.42% higher accuracy and 0.19 s and 0.16 s faster inference, respectively, on the Houston dataset. Code is available at https://github.com/xmzhang2018.


Introduction
Hyperspectral images (HSIs) contain hundreds of narrow bands spanning from the visible to the infrared spectrum, forming a 3-D hypercube. With abundant spectral information, each material possesses a specific spectral signature, like a unique fingerprint, serving as its identification. Because of their strong representability, HSIs have become economical, rapid, and promising tools for various applications, such as medical imaging (Mok and Chung 2020), environmental monitoring (Stuart et al. 2019), and urban development observation (Alamús et al. 2017). Semantic segmentation (also called pixel-level classification) is one of the most fundamental tasks for these applications.
Many HSI classification methods have been developed over the past few decades. Earlier approaches mainly focused on spectral information mining using machine learning methods, including unsupervised algorithms (e.g. clustering (Haut et al. 2017)) and supervised algorithms (e.g. support vector machines (Cortes and Vapnik 1995) and random forest (Breiman 2001)). Unsupervised algorithms do not rely on labeled data; however, supervised algorithms are generally preferred because of their superior performance. Nevertheless, the inherent high dimensionality and nonlinearity of HSIs limit the performance of supervised algorithms, especially when labeled samples are limited. Several dimensionality reduction techniques, such as band selection (Paul et al. 2015), feature selection (Quan et al. 2023), and manifold learning (Huang et al. 2015), have been introduced to project hypercube data into lower-dimensional subspaces by capturing the essential information in HSIs. Given the spectral heterogeneity and complex spatial distribution of objects, spatial feature mining has attracted considerable attention (Gao and Lim 2019). Spatial feature extraction methods, such as the gray-level co-occurrence matrix (Pesaresi, Gerhardinger, and Kayitakire 2008), guided filtering (Wang et al. 2018), and morphological operators (Bao et al. 2016), have been employed to extract spatial features of HSIs. Other studies adopted kernel-based methods (Lin and Yan 2016), 3-D wavelets (Cao et al. 2017; Tang, Lu, and Yuan 2015), and 3-D Gabor filters (Jia et al. 2018) to learn joint spectral-spatial information for better classification. Although these traditional methods have achieved considerable progress, they are limited to shallow features and prior knowledge, resulting in poor robustness and generalization.
Deep learning (DL) can automatically learn high-level representations, overcoming the limitations of traditional feature extraction methods. It has achieved high performance in many challenging tasks, including object detection (Hou et al. 2019), scene segmentation (Fu et al. 2019), and image classification (Tan and Le 2021). Subsequently, various DL techniques have been adopted for HSI classification. A multilayer perceptron was designed as an encoder-decoder structure to extract the deep semantic information of HSIs (Lin et al. 2022). Chen et al. (Chen, Zhao, and Jia 2015) introduced a deep belief network to HSI classification and designed three architectures based on this network for spectral, spatial, and spectral-spatial feature extraction. In (Hao et al. 2018), a stacked autoencoder and a convolutional neural network (CNN) were employed to encode spectral and spatial features, respectively, which were then fused for classification. Recurrent neural networks (RNNs) (Mou, Ghamisi, and Zhu 2017) and long short-term memory (LSTM) (Xu et al. 2018) have been applied to analyze hyperspectral pixels as sequential data. Moreover, graph convolutional networks have been employed to model long-range spatial relationships in HSIs because they can handle graph-structured data by modeling topological relationships between samples (Jiang, Ma, and Liu 2022). In (He, Chen, and Lin 2021; Hong et al. 2022; Sun et al. 2022), transformers were introduced to capture long-range sequence spectra in HSIs. Among these DL algorithms, CNNs generally outperform the others in HSI classification because of their ability and flexibility to aggregate spectral and spatial contextual information (Sothe et al. 2020). The properties of local connections and shared weights allow CNNs to achieve higher accuracy with fewer parameters.
Many CNN-based methods have been proposed for HSI classification, including patch-based classification and fully convolutional network (FCN)-based segmentation. Previous studies (Paoletti et al. 2018; Zhang et al. 2021) mainly focused on patch-based classification, which assigns the category of a pixel by extracting features from the spatial patch centered on this pixel. However, redundant computation is inevitable with this method because overlap occurs between adjacent patches, as shown in Figure 1(a). Many FCN-based approaches (Wang et al. 2021; Xu, Du, and Zhang 2020; Zheng et al. 2020) have been proposed to reduce computational complexity. They feed the initial HSI cube into the network, perform pixel-to-pixel classification, and output the entire classification map. Compared to patch-based classification, FCN-based segmentation usually produces competitive or superior results with less inference time.
However, unlike computer vision datasets containing thousands of labeled images, HSI datasets often include only one partially labeled image. Almost all of the aforementioned methods employ the random sampling strategy, where the training and test samples are randomly selected from the same image, causing the feature extraction spaces of the training and test data to overlap, as shown in Figure 1(b). Consequently, in the training stage, information from the test data is used to train the network, leading to exaggerated results (Liang et al. 2017). Similarly, the existing FCN-based approaches that take the same entire HSI as input for training and testing also lead to higher training-test information leakage. Therefore, their performance and generalizability results are questionable because they violate the fundamental assumption of supervised learning (Liang et al. 2017). Although several new sampling strategies (Liang et al. 2017; Zou et al. 2020) have been proposed to avoid training-test information leakage, other limitations may emerge; e.g. imbalanced sampling results in certain categories for which all data are selected as the test or training set. In addition, the existing FCN-based segmentation networks that take an entire HSI as input incur significant memory consumption and a limited batch size, dramatically slowing down training.
To address these limitations, we propose an objective evaluation-based efficient learning (OEEL) framework for HSI classification and objective performance evaluation. First, to ensure balanced sampling without training-test information leakage, a leakage-free balanced sampling strategy is proposed to generate training and test samples. Then, the EfficientFCN is designed to learn discriminative spectral-spatial features from the generated samples for effective and efficient classification. Therefore, the proposed framework not only ensures that the feature extraction spaces of the training and test data are independent of each other but also improves classification accuracy and efficiency.
The main contributions of this study are summarized as follows: (1) The OEEL framework is proposed for HSI classification to achieve fast classification and objective evaluation.

Patch-based classification
Most previous studies (Paoletti et al. 2018; Zhang et al. 2021) employed the patch-based classification framework to facilitate feature extraction and classifier training. An end-to-end network takes 3-D patches as input and outputs a specific label for each patch in its last fully connected (FC) layer (Paoletti et al. 2018). Another end-to-end 2-D CNN (Yu, Jia, and Xu 2017) uses 1 × 1 convolutional kernels to mine spectral information and uses global average pooling in place of FC layers to prevent overfitting. Santara et al. (Santara et al. 2017) proposed a band-adaptive spectral-spatial feature learning neural network to address the curse of dimensionality and the spatial variability of spectral signatures. It divides 3-D patches into sub-cubes along the channel dimension to extract band-specific spectral-spatial features. To enhance learning efficiency and prevent overfitting, a deeper and wider network with residual learning was proposed (Lee and Kwon 2017), which employs a multi-scale filter bank to jointly exploit spectral-spatial information.
Two-branch CNN-based architectures (Hao et al. 2018; Liang et al. 2017; Xu et al. 2018) employ 2-D CNNs and other algorithms (e.g. 1-D CNN, stacked autoencoder, and LSTM) to encode spatial and spectral information, respectively, and then fuse the outputs for classification. Another type of spectral-spatial CNN architecture employs 3-D CNNs to extract joint spectral-spatial features for HSI classification (Li, Zhang, and Shen 2017; Paoletti et al. 2018). For instance, the spectral-spatial residual network (SSRN) (Zhong et al. 2018) uses spectral and spatial residual blocks consecutively to learn spectral and spatial information from raw 3-D patches. A fast, dense spectral-spatial convolution framework (Wang et al. 2018) uses residual blocks with 1 × 1 convolution kernels to learn spectral and spatial information sequentially.
Recently, attention mechanisms have been introduced to adaptively emphasize informative features (Zhang et al. 2021). The squeeze-and-excitation (SE) module (Hu, Shen, and Sun 2018), which uses global pooling and FC layers to generate channel attention vectors, was adopted in (Fang et al. 2019; Huang et al. 2020) to recalibrate spectral feature responses. The convolutional block attention module (Woo et al. 2018) was adopted in (Zhu et al. 2020), where the spatial branch appends a spatial-wise attention module and the spectral branch appends a channel-wise attention module to extract spectral and spatial features in parallel. Similarly, the position self-attention module and the channel self-attention module proposed in (Fu et al. 2019) were introduced into a double-branch dual-attention mechanism network (DBDA) (Li et al. 2020) to refine the extracted features of HSIs. In (Zhang et al. 2021), a spatial self-attention module was designed for patch-based CNNs to enhance the spatial feature representation related to the center pixel.
Although the above patch-based classification methods achieved high performance, it is unclear whether this is attributable to genuine methodological improvements or to training-test information leakage (Liang et al. 2017). Furthermore, redundant computation over the overlapping regions of adjacent patches is inevitable in these methods.

FCN-based segmentation
Many FCN-based frameworks have been developed to mitigate the redundant computation caused by overlap between adjacent patches. The spectral-spatial fully convolutional network (SSFCN) (Xu, Du, and Zhang 2020) takes the original HSI cube as input and performs classification in an end-to-end, pixel-to-pixel manner. A deep FCN with an efficient nonlocal module (Shen et al. 2021) was proposed that takes an entire HSI as input and uses the nonlocal module to capture long-range contextual information. To exploit global spatial information, Zheng et al. (Zheng et al. 2020) proposed a fast patch-free global learning framework that includes a global stochastic stratified sampling strategy and an encoder-decoder-based FCN (FreeNet). However, this framework does not perform well with imbalanced sample data. A spectral-spatial dependent global learning (SSDGL) framework (Zhu et al. 2021) was developed to handle imbalanced and insufficient HSI data.
Although these FCN-based frameworks alleviate redundant computation and achieve significant performance gains, they may lead to higher training-test information leakage. This is because they use the same image for both training and testing, leading to overlap and interaction between the feature extraction spaces of the training and test data.

Sampling strategy
The aforementioned training-test information leakage not only leads to a biased evaluation of spatial classification methods but may also distort the boundaries of objects, as shown in Figure 1(c). Therefore, the pixel-based random sampling strategy inadvertently affects both feature learning and performance evaluation.
Several new sampling strategies have been proposed to address these limitations. A controlled random sampling strategy was designed to reduce the overlap between training and test samples (Liang et al. 2017). Specifically, this strategy randomly selects a labeled pixel from each unconnected partition as a seed and then grows a region from the seed pixel. Pixels in the grown regions are selected as training data, and the remaining pixels are used as test data. This strategy dramatically reduces the overlap between training and test data but cannot eliminate it, because pixels at the boundaries of each training region still overlap with the test data. Nalepa et al. (Nalepa, Myller, and Kawulok 2019) proposed dividing the HSI into fixed-size, non-overlapping patches and then randomly selecting some patches as the training set. The method proposed in (Zou et al. 2020) selects training samples only from multiclass blocks following a specific order. Nevertheless, both methods may suffer from severe sample imbalance, i.e. there may be certain categories for which all data are selected as the test or training set. The former causes the trained model to fail to recognize these categories, while the latter leaves no test samples for evaluation. Furthermore, these methods disregard boundary pixels, where a patch cannot be defined. The resulting loss of samples, together with the scarcity of training samples, can cause overfitting.

Method
This section presents the OEEL framework. As shown in Figure 2, it comprises two main steps. First, the proposed leakage-free balanced sampling strategy divides the HSI cube into non-overlapping training and test data. Second, the generated training and test data are used to train and test the proposed EfficientFCN for feature extraction and classification. The relevant details of both steps are described below.

Leakage-free balanced sampling strategy
As discussed in Section 2.3, the commonly used sampling strategy exaggerates classification results because of training-test information leakage. Although several new sampling strategies have been proposed to address this problem, other limitations may emerge. Based on these observations and empirical studies (Liang et al. 2017; Zou et al. 2020), we derived four basic principles for effective sampling strategy design: P1) balanced sampling, to ensure that all categories are present in both the training and test sets; P2) maximal utilization of the available samples; P3) regions that contribute to feature extraction from the training data must not be used for testing, to satisfy the independence assumption; and P4) random sampling, to avoid biased estimates.
Following these principles, we designed a leakage-free balanced sampling strategy, as shown in Figure 3. Since many spatial-based methods require square patches as input, the HSI and its ground truth are first divided into square windows of equal size. To satisfy P1, the window size should ensure that each class appears in at least two windows; there is thus a trade-off between the window size and the number of windows. If the width and height of the image are not divisible by the window size, the pixels on the right and bottom borders are mirrored outward, as shown in the first step of Figure 3. This allows all border pixels to be fed into the network and used like any other pixels in the image. Once the border pixels are mirrored, the HSI and its ground truth are split into disjoint windows.
The next step is to divide these windows into training and test windows according to a predefined order to satisfy P1, P3, and P4. The predefined order can be either by category or by the number of samples within each category. Here, we perform window-based random sampling within each category in order. As shown in the dotted box of Figure 3, the windows containing the first class are collected; then, a predefined proportion of these windows is randomly selected for training, while the remaining windows are used for testing (P4). To satisfy P3, the positions corresponding to the windows containing the first class are set to zero in the HSI and its ground truth, which are then used to collect the windows containing the next category. This process is repeated until sampling is complete for all categories. Note that each window is selected only once, as either a training or a test window, to avoid repeated sampling, so the feature extraction spaces of the two sets remain independent of each other.
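The window-assignment step above can be sketched as follows. This is a minimal, hypothetical helper (not the authors' implementation): it mirror-pads the ground truth, collects the still-unassigned windows containing each class in turn, and randomly splits them into training and test windows. For brevity, the P3 zeroing of selected windows is approximated here by simply marking windows as assigned.

```python
import numpy as np

def leakage_free_split(gt, win, train_ratio=0.1, rng=None):
    """Sketch of the leakage-free balanced sampling idea.

    gt: 2-D ground truth map (0 = unlabeled); win: window size.
    Returns lists of (row, col) window indices for training and testing.
    """
    rng = np.random.default_rng(rng)
    h, w = gt.shape
    # Mirror the right/bottom borders so the size divides evenly.
    ph, pw = (-h) % win, (-w) % win
    gt = np.pad(gt, ((0, ph), (0, pw)), mode="reflect")
    H, W = gt.shape
    assigned = np.zeros((H // win, W // win), dtype=bool)
    train, test = [], []
    for cls in np.unique(gt[gt > 0]):         # categories in order (P1)
        wins = []
        for i in range(H // win):
            for j in range(W // win):
                block = gt[i*win:(i+1)*win, j*win:(j+1)*win]
                if not assigned[i, j] and (block == cls).any():
                    wins.append((i, j))
        rng.shuffle(wins)                     # random selection (P4)
        # At least one training window; leave the rest for testing.
        k = max(1, min(len(wins) - 1, round(train_ratio * len(wins))))
        for idx, (i, j) in enumerate(wins):
            assigned[i, j] = True             # each window used once (P3)
            (train if idx < k else test).append((i, j))
    return train, test
```

Because whole windows, not pixels, are assigned, the training and test feature extraction spaces cannot overlap.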
Data augmentation is necessary to avoid overfitting given the limited number of training windows. As in most previous studies (Xu et al. 2018; Zhang et al. 2021), each training window is randomly rotated between 0° and 360° and flipped horizontally or vertically. We also add noise or change the brightness of training windows to enhance robustness under various conditions, such as different sensors, illumination changes, and atmospheric interference.
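A toy sketch of these augmentations, with assumed details (90° rotation steps, Gaussian noise of fixed scale, and a multiplicative brightness factor; the paper does not specify these parameters):

```python
import numpy as np

def augment_window(x, rng=None):
    """Randomly rotate, flip, perturb, and rescale one training window.

    x: window of shape (H, W, B) with H == W.
    """
    rng = np.random.default_rng(rng)
    x = np.rot90(x, k=int(rng.integers(4)), axes=(0, 1))  # random rotation
    if rng.random() < 0.5:
        x = x[::-1, :, :]                                 # vertical flip
    if rng.random() < 0.5:
        x = x[:, ::-1, :]                                 # horizontal flip
    x = x + rng.normal(0.0, 0.01, x.shape)                # sensor-like noise
    return x * rng.uniform(0.9, 1.1)                      # brightness change
```

Each call returns a window of the same spatial size, so the augmented data can be fed to the network unchanged.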
A summary of the proposed sampling strategy is provided in Algorithm 1. It follows all of the abovementioned principles, enabling accurate and objective performance evaluation of approaches.

EfficientFCN
Prior works mainly sought to make very deep models converge with reasonable accuracy or to design complicated models for better performance. Consequently, the resulting models were neither simple nor practical, limiting real-world applications. Therefore, this subsection proposes EfficientFCN, which is optimized for faster inference and higher parameter efficiency. It includes two main blocks, the efficient feature extraction (EFE) block and the fused efficient feature extraction (fused EFE) block, which are described as follows.

EFE block
Because depthwise convolution (Chollet 2017) has fewer parameters and floating-point operations (FLOPs) than regular convolution, it was introduced into MBConv (Tan and Le 2021) to achieve higher parameter efficiency. MBConv is defined by a 1 × 1 expansion convolution followed by a 3 × 3 depthwise convolution, an SE module, and a 1 × 1 projection layer. Its input and output are connected by a residual connection when they have the same number of channels. MBConv attaches batch normalization (BN) and a sigmoid linear unit (SiLU) activation function to each convolutional layer.
To improve network efficiency, we first replace SiLU with the scaled exponential linear unit (SELU). SELU exhibits self-normalizing properties, which are faster than external normalization, helping the network converge faster. The SELU activation function is defined as SELU(x) = λx for x > 0 and SELU(x) = λα(e^x − 1) for x ≤ 0, where x is the input, α and λ (λ > 1) are hyperparameters, and e denotes the exponential. SELU reduces the variance for negative inputs and increases it for positive inputs, thereby preventing vanishing and exploding gradients. Moreover, it produces outputs with zero mean and unit variance. Therefore, SELU converges faster and more accurately than SiLU, leading to better generalization (Madasu and Rao Vijjini 2019). Layer normalization (LN) has been used in ConvNeXt (Liu et al. 2022) and slightly outperformed BN in various application scenarios. Following the same optimization strategy as (Liu et al. 2022), we substitute LN for BN in our network.
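The SELU definition above can be checked numerically. The constants below are the standard self-normalizing values from Klambauer et al. (2017); the text itself only states that λ > 1:

```python
import numpy as np

# Standard SELU constants (Klambauer et al. 2017).
LAM, ALPHA = 1.0507009873554805, 1.6732632423543772

def selu(x):
    """SELU(x) = lam * x for x > 0, lam * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, LAM * x, LAM * ALPHA * np.expm1(x))
```

Note that for strongly negative inputs the output saturates at −λα, which is what bounds the variance of negative activations.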
Considering that LN and activation function operations take considerable time (Ma et al. 2018), ConvNeXt uses fewer LN layers and activation functions and achieves better results. We therefore also use fewer LN layers and SELU activations to improve accuracy and efficiency. As shown in Figure 4(a), LN and the activation function are attached only after the expansion convolution and the depthwise convolution, respectively. Furthermore, the SE module is removed because of the high computational cost of its FC layers. The results in Section 5.2 demonstrate that these modifications improve not only training speed and parameter efficiency but also classification performance.
Figure 4(a) shows the detailed architecture of the EFE block. It comprises a 1 × 1 expansion convolution with LN, followed by a 3 × 3 depthwise convolution with the SELU activation function and a 1 × 1 projection layer. The expansion ratio of the first 1 × 1 convolution is set to 2. As in MBConv, the input and output of the EFE block are connected via a residual connection when they have the same number of channels.
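A minimal PyTorch sketch of the EFE block as described (not the authors' code): expansion + LN, depthwise convolution + SELU, projection, and an optional residual. GroupNorm with one group is used here as a channel-wise LayerNorm stand-in for NCHW feature maps, which is an assumed implementation detail.

```python
import torch
import torch.nn as nn

class EFEBlock(nn.Module):
    """Sketch of the EFE block: 1x1 expansion conv + LN, 3x3 depthwise
    conv + SELU, 1x1 projection, residual when channel counts match."""
    def __init__(self, c_in, c_out, expand=2):
        super().__init__()
        mid = c_in * expand                       # expansion ratio of 2
        self.expand = nn.Conv2d(c_in, mid, 1)
        self.ln = nn.GroupNorm(1, mid)            # LayerNorm stand-in
        self.dw = nn.Conv2d(mid, mid, 3, padding=1, groups=mid)  # depthwise
        self.act = nn.SELU()
        self.project = nn.Conv2d(mid, c_out, 1)
        self.residual = (c_in == c_out)

    def forward(self, x):
        y = self.project(self.act(self.dw(self.ln(self.expand(x)))))
        return x + y if self.residual else y
```

Only two normalization/activation operations appear per block, matching the "fewer LN and SELU" design above.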

Fused EFE block
Since depthwise convolutions cannot fully utilize modern accelerators, Fused-MBConv replaces the 3 × 3 depthwise convolution and 1 × 1 expansion convolution in MBConv with a single regular 3 × 3 convolution (Tan and Le 2021). We follow Fused-MBConv and replace the 1 × 1 expansion convolution and 3 × 3 depthwise convolution in the EFE block with a single regular 3 × 3 convolution to improve training speed, as shown in Figure 4(b).
Similarly, LN and SELU are appended only after the 3 × 3 convolution and the 1 × 1 convolution, respectively. As in the EFE block, the expansion ratio is set to 2.
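Under the same assumptions as the EFE sketch above, the fused variant replaces the expansion and depthwise convolutions with one regular 3 × 3 convolution:

```python
import torch
import torch.nn as nn

class FusedEFEBlock(nn.Module):
    """Sketch of the fused EFE block: one regular 3x3 convolution + LN,
    then a 1x1 projection + SELU, with a residual when channels match."""
    def __init__(self, c_in, c_out, expand=2):
        super().__init__()
        mid = c_in * expand                   # expansion ratio of 2
        self.fused = nn.Conv2d(c_in, mid, 3, padding=1)  # fused 1x1 + dw 3x3
        self.ln = nn.GroupNorm(1, mid)        # LayerNorm stand-in
        self.project = nn.Conv2d(mid, c_out, 1)
        self.act = nn.SELU()
        self.residual = (c_in == c_out)

    def forward(self, x):
        y = self.act(self.project(self.ln(self.fused(x))))
        return x + y if self.residual else y
```

The regular 3 × 3 convolution has more FLOPs than the depthwise pair but maps better onto accelerator hardware, which is the trade-off Fused-MBConv exploits.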

EfficientFCN architecture
It has been demonstrated that depthwise convolutions are slow in early stages but effective in deep layers (Tan and Le 2021). Thus, the EFE block is placed in the deep layers. After incorporating the EFE and fused EFE blocks, the EfficientFCN architecture can be developed, as shown in Figure 4(c), where the number of repetitions and the number of output channels are presented to the left and right of each block, respectively. The network aims to learn a mapping from X_i ∈ ℝ^(h×w×B) to Y_i ∈ ℝ^(h×w×K) for classification, where h × w and B are the spatial size and the number of bands of X, respectively, and K is the number of categories to be classified.
In our network, the number of channels starts at its maximum value and decreases as the layers deepen. We refer to this design as inverted channels. HSIs with abundant spectral information inevitably contain a high degree of redundancy between bands. Inverted channels allow the network to learn additional valuable information from the redundant bands.
There are no pooling layers in the network, for two main reasons. First, pooling operations aggregate rather than preserve positional features, making the network more invariant to spatial transformations; spatial invariance, in turn, limits the accuracy of semantic segmentation. Second, pooling operations are primarily used to reduce computational complexity by shrinking the spatial dimensions of feature maps, which causes a significant loss of spatial information and may blur land cover boundaries, especially when the input size is small. Moreover, our task is pixel-wise classification, so the network output must have the same spatial dimensions as the input. Therefore, we do not perform any downsampling operations. Note that EfficientFCN still retains the capability to process images of arbitrary spatial size. We extract patches and feed them to the network to generate the final full classification map for two main reasons: 1) it ensures that the feature extraction spaces of the training and test data are independent of each other, and 2) smaller input sizes require fewer computations and allow larger batch sizes, thus improving training speed.
After the EfficientFCN is constructed, its parameters are initialized and the network is trained end to end. The performance of the proposed FCN is presented in Section 4.

Experiments
This section describes the experimental datasets and settings, including the comparison methods, evaluation metrics, and parameter settings. Quantitative and qualitative analyses of the experimental results are also presented.

Description of datasets
We conducted experiments on four datasets of different sizes: Indian Pines (IP), Pavia University (PU), Salinas (SA), and University of Houston (UH).
The IP dataset was collected in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over northwestern Indiana, USA, an agricultural area with irregular forest regions and crops of regular geometry. The dataset has 145 × 145 pixels with a spatial resolution of 20 m. Each pixel has 224 spectral bands ranging from 0.4 to 2.5 µm. After discarding 24 noise and water absorption bands, 200 bands were used for classification. The ground truth has 16 land cover classes. The UH dataset covers an urban area that includes the University of Houston campus and neighboring areas. It was collected by the National Center for Airborne Laser Mapping in June 2012. It has 144 spectral bands in the wavelength range of 0.38-1.05 µm. The spatial dimensions and resolution of this scene are 349 × 1905 and 2.5 m, respectively. There are 15 classes in this scene, and detailed information about this dataset is presented in Figure 8.
Before the experiments, we normalized all datasets to [−1, 1] to unify the data magnitude and promote network convergence.
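This normalization can be sketched as a per-band min-max scaling to [−1, 1]; per-band (rather than global) scaling is an assumed detail, since the paper only states the target range:

```python
import numpy as np

def normalize_cube(x):
    """Scale an HSI cube of shape (H, W, B) to [-1, 1] per band."""
    x = x.astype(np.float64)
    mn = x.min(axis=(0, 1), keepdims=True)   # per-band minimum
    mx = x.max(axis=(0, 1), keepdims=True)   # per-band maximum
    return 2 * (x - mn) / (mx - mn) - 1
```

A constant band would divide by zero here; a practical implementation would guard against that case.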

Experimental settings
We compared the performance of the proposed network with that of state-of-the-art DL architectures (SSRN, DBDA, SS3FCN, FreeNet, SSDGL, ConvNeXt, and EfficientNetV2). There are many parameters related to these DL architectures. In EfficientFCN, the convolutional stride and spatial padding size are set to 1, and the dropout rate is set to 0.2. Other hyperparameters are presented in Figure 4. These hyperparameters can be adjusted for different situations; for example, the number of output channels can be halved for the PU dataset, which has fewer channels. The same hyperparameter setting was used for all four datasets in the following experiments for a fair comparison. The proposed network adopted the AdamW optimizer (Loshchilov and Hutter 2019), with the learning rate, weight decay, and number of training epochs set to 1 × 10−4, 1 × 10−2, and 150, respectively. The hyperparameters of the comparison methods were set according to the recommended values and then fine-tuned to achieve the best performance. For EfficientNetV2 and ConvNeXt, we adopted their minimum model settings (i.e. EfficientNetV2-S and ConvNeXt-T) and proportionally reduced their numbers of stages and layers to match those of our EfficientFCN. All methods were implemented on the PyTorch platform and were trained and tested on the same sample sets generated by the proposed sampling strategy. The batch size was set to 64 for all methods. All experiments were conducted on a workstation with an AMD Ryzen 7 5800X 8-core processor (3.40 GHz) and an NVIDIA GeForce RTX 3060 GPU.
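The stated optimizer configuration can be reproduced as follows; the toy one-layer model is a placeholder standing in for EfficientFCN:

```python
import torch

# Placeholder model; in practice this would be EfficientFCN.
net = torch.nn.Conv2d(200, 16, 1)

# AdamW with the stated learning rate and weight decay.
opt = torch.optim.AdamW(net.parameters(), lr=1e-4, weight_decay=1e-2)
EPOCHS, BATCH_SIZE = 150, 64
```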
Classification performance was evaluated by the producer accuracy (PA) of each class, overall accuracy (OA), average accuracy (AA), and kappa coefficient (Kappa). All experiments were repeated 10 times to avoid biased estimation, and the mean values were calculated for comparison, as presented in Section 4.3.
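These metrics all derive from the confusion matrix; a minimal sketch:

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Return OA, AA (mean per-class producer accuracy), and Cohen's kappa."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (y_true, y_pred), 1)           # confusion matrix
    n = cm.sum()
    oa = np.trace(cm) / n                        # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)            # producer accuracy per class
    aa = pa.mean()                               # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

OA weights every pixel equally, whereas AA weights every class equally, which is why both are reported for imbalanced HSI data.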

Quantitative evaluation
Table 1-Table 4 summarize the classification accuracy of all compared methods. From these tables, we can observe that the performance of all methods was considerably lower on the IP dataset than on the other datasets, especially for the 4th, 9th, and 15th categories. This may be due to the lack of training data and the low spatial resolution of this dataset. Nevertheless, on all four datasets, our network achieved the highest OA, AA, and Kappa and exhibited the best or near-best accuracy in most classes. For example, on the IP dataset, the proposed method obtained the highest OA of 84.72%, exceeding that of SSRN, DBDA, SS3FCN, FreeNet, SSDGL, ConvNeXt, and EfficientNetV2 by approximately 7.04%, 6.35%, 8.05%, 2.96%, 2.65%, 5.54%, and 3.64%, respectively. Although some comparison methods achieved satisfactory results in previous studies, they failed to perform well on certain datasets under the proposed sampling strategy. Among these methods, SS3FCN generally exhibited the worst performance because it uses a 3-D FCN and a 1-D FCN to learn spectral-spatial features and spectral features, respectively, resulting in high spectral redundancy and increased model complexity.
Regarding the FCN-based methods, FreeNet and SSDGL performed better on the IP dataset but worse on the other datasets. A possible reason is that the scarcity of labeled data makes these methods difficult to optimize, as they are more complex than the others. Compared with FreeNet and SSDGL, the patch-based methods (i.e. SSRN and DBDA) performed worse on the IP dataset but better on the other three datasets. In contrast, ConvNeXt and EfficientNetV2 performed well on all four datasets, indicating superior generalization performance. Note that the proposed network exhibited significant improvement over all of the above comparison methods on all four datasets, demonstrating its effectiveness and generalizability. The proposed network classified the corresponding test data with relatively high accuracy, even for certain indistinguishable classes (e.g. Gravel in the PU dataset and railways in the SA dataset). These results confirm the robustness of the designed network under challenging conditions.

Qualitative evaluation
Figures 9-12 visualize the corresponding classification maps alongside the false color images and ground truth maps. As can be seen, the classification maps are consistent with the reported quantitative results. For example, the classification maps produced by SS3FCN contained more noise and speckles than those produced by the other methods on the IP, PU, and SA datasets, which is consistent with the quantitative results in Table 1-Table 4. Among these methods, the proposed network produced the least noise and the most accurate classification maps on all datasets. In addition, objects covered by shadows could be identified using the proposed framework. As illustrated by the black rectangles in Figure 12, parts of buildings, roads, and vegetation were covered in shadows. SS3FCN, EfficientNetV2, ConvNeXt, and the proposed network could detect shadow regions more effectively than SSRN, DBDA, FreeNet, and SSDGL.
Furthermore, using the proposed sampling strategy, the class boundaries of the classification maps produced by the spectral-spatial methods are more consistent with those of the false color images, especially for the IP dataset. However, there are many square-like artifacts in the classification maps. This phenomenon has two main causes: 1) the input window size is too small to provide sufficient spatial information, resulting in inconsistent segmentation across window boundaries, and 2) artifacts are introduced when the predicted windows are stitched back together. Therefore, selecting a larger window size is preferable, provided the basic principles of designing an effective sampling strategy are met, as described in Section 3.1. Furthermore, the overlay inference strategy (Zheng et al. 2021) can alleviate this problem. In summary, the experimental results demonstrate the superiority of the proposed network and indicate that the performance of spectral-spatial methods can be more accurately reflected and evaluated using the proposed sampling strategy.
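The overlay idea can be sketched as follows. This is a minimal illustration, not the implementation of Zheng et al. (2021): windows are slid with a stride smaller than the window size, per-pixel class scores are accumulated, and overlapping predictions are averaged, which softens seams at window boundaries. `predict_fn` is a placeholder for any per-window classifier, and full coverage assumes the image size is compatible with the chosen window and stride.

```python
import numpy as np

def overlay_inference(image, predict_fn, win=32, stride=16, n_classes=5):
    """Slide an overlapping window over the image, accumulate per-pixel
    class scores, and average where windows overlap to soften the
    square-like seams produced by non-overlapping stitching.
    Assumes (h - win) and (w - win) are multiples of stride."""
    h, w = image.shape[:2]
    scores = np.zeros((h, w, n_classes))
    counts = np.zeros((h, w, 1))
    for r in range(0, h - win + 1, stride):
        for c in range(0, w - win + 1, stride):
            # predict_fn returns a (win, win, n_classes) score map
            scores[r:r + win, c:c + win] += predict_fn(image[r:r + win, c:c + win])
            counts[r:r + win, c:c + win] += 1
    # average overlapping scores, then take the per-pixel argmax
    return np.argmax(scores / np.maximum(counts, 1), axis=-1)
```

With stride equal to `win` this degenerates to plain non-overlapping stitching; halving the stride doubles the compute but averages every interior pixel over several windows.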

Leakage-free balanced sampling strategy analysis
As can be seen from Figures 5-8, there is no overlap between the training and test data, and all classes are present in both sets, demonstrating that the proposed sampling strategy avoids information leakage while achieving balanced sampling.
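The window-level partitioning underlying this property can be sketched as follows. This is a simplified sketch with a hypothetical window size and split ratio; the per-class balancing step of the actual strategy is omitted. Because whole windows are assigned to one set or the other, no pixel can appear in both sets.

```python
import numpy as np

def window_split(gt, win=6, train_frac=0.2, seed=0):
    """Partition a ground-truth map into non-overlapping win x win
    windows and assign whole windows to either the training or the
    test set, so no pixel belongs to both sets (no leakage)."""
    h, w = gt.shape
    rng = np.random.default_rng(seed)
    origins = [(r, c) for r in range(0, h - win + 1, win)
                      for c in range(0, w - win + 1, win)]
    order = rng.permutation(len(origins))  # random window assignment
    n_train = max(1, int(len(origins) * train_frac))
    train_mask = np.zeros_like(gt, dtype=bool)
    test_mask = np.zeros_like(gt, dtype=bool)
    for i, k in enumerate(order):
        r, c = origins[k]
        mask = train_mask if i < n_train else test_mask
        mask[r:r + win, c:c + win] = True
    return train_mask, test_mask
```

The masks are disjoint by construction, which is exactly the leakage-free property; the balanced variant would additionally constrain the assignment so every class appears in both masks.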
In addition, we observed a trade-off between the window size and the number of windows: overly small windows provide limited spatial information for spatial-based methods to learn from, whereas excessively large windows cause certain classes with few samples to appear only in the training or only in the test set. Therefore, we analyzed the effect of window size on the performance of the proposed EfficientFCN. Due to the limited number of labeled samples for specific classes in the IP dataset (e.g. the Oats category has only 20 labeled pixels), we set its window size to the minimum value of 4. For the other datasets, we conducted experiments to select the optimal window size, varying the window size while fixing all other parameters. Unlike patch-based classification, where accuracy improves as patch size increases, in our experiments accuracy did not increase with window size, as illustrated in Figure 13; it even decreased as the window size grew. Moreover, the difference in accuracy across window sizes was minor, again demonstrating that the proposed sampling strategy can eliminate the spatial dependence between training and test data.
Although the smallest window size achieved the highest accuracy on certain datasets, it failed to provide sufficient spatial information for methods with strong spatial information extraction ability.
Moreover, smaller window sizes resulted in lower inference efficiency and more scattered points (Figure 9). Therefore, it is preferable to choose a larger window size with comparable accuracy. Weighing efficiency against accuracy, we set the window size to 6 for the PU dataset, 9 for the SA dataset, and 9 for the UH dataset.
Our sampling strategy applies not only to HSI data, but also to other real-world remote sensing data, especially data with imbalanced categories.However, it is unsuitable for large datasets containing hundreds or thousands of labeled images, such as computer vision datasets.

EfficientFCN analysis
We then analyzed the proposed network design by following a trajectory from EfficientNetV2 to the EfficientFCN. Experiments were conducted on the IP and UH datasets, and the results are summarized in Table 5.
The normalization layer and activation function are important components of a network. We first evaluated the performance of the proposed network with different activation functions, including SiLU, SELU, and GELU. SiLU and GELU are used in EfficientNetV2 (Tan and Le 2021) and ConvNeXt (Liu et al. 2022), respectively. SELU possesses self-normalizing properties that make neural network learning highly robust. As shown in Table 5, the network trained with GELU achieved the best results, with an OA of 81.39% on the IP dataset and 89.56% on the UH dataset. Therefore, the proposed network adopted GELU as the activation function, and it was used in the following experiments.
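For reference, the three activation functions compared here can be written in a few lines (numpy sketch; the GELU uses the common tanh approximation, and the SELU constants are its standard self-normalizing values):

```python
import numpy as np

def silu(x):
    """SiLU (swish): x * sigmoid(x), used in EfficientNetV2."""
    return x / (1.0 + np.exp(-x))

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    """SELU with its standard self-normalizing constants; the
    exponential branch is clipped at 0 to avoid overflow warnings."""
    return scale * np.where(x > 0, x,
                            alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def gelu(x):
    """GELU (tanh approximation of x * Phi(x)), used in ConvNeXt."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))
```

All three are smooth, non-monotone-free variants of ReLU near zero; SELU's fixed constants are what give it the self-normalizing behavior discussed below.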
For normalization, as illustrated in Table 5, our network trained with LN obtained a higher OA than with BN, with gains of 3% and 1% on the IP and UH datasets, respectively. Therefore, we use LN for normalization in our proposed network.
The number of activation and normalization layers also affects the performance of networks. As shown in Table 5, after reducing the number of LN and SELU activation layers, the classification accuracy on both datasets did not decline but slightly improved. This may be because SELU induces self-normalizing properties, making additional normalization unnecessary.
To avoid overfitting and reduce the number of trainable parameters, we used the channel attention module (Fu et al. 2019), which has no trainable parameters, to replace the SE module, which did not contribute to accuracy. We then tried removing the attention module from our network entirely. Interestingly, this led to marginal improvements on both datasets (from 82.47% to 82.72% on the IP dataset and from 89.81% to 89.84% on the UH dataset). Thus, our final network does not contain an attention module.
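A parameter-free channel attention in the spirit of the channel attention module of Fu et al. (2019) can be sketched as follows: channel affinities are computed from the feature map itself, so nothing is learned. This is an illustrative sketch, not the paper's exact module; the original also applies a learnable scale to the attended term, which is omitted here precisely because it would introduce a trainable parameter.

```python
import numpy as np

def channel_attention(feat):
    """Parameter-free channel attention over a (C, H, W) feature map:
    channel-to-channel affinities come from the features themselves,
    so the module has no trainable weights."""
    c, h, w = feat.shape
    x = feat.reshape(c, -1)                          # (C, N) with N = H*W
    energy = x @ x.T                                 # (C, C) channel affinity
    energy = energy.max(axis=-1, keepdims=True) - energy
    attn = np.exp(energy - energy.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)         # row-wise softmax
    out = (attn @ x).reshape(c, h, w)                # re-weighted channels
    return out + feat                                # residual connection
```

Because every quantity is derived from `feat`, swapping this in for an SE module removes the SE block's learned fully connected layers entirely.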
As detailed in Table 5, the inverted channels significantly increased the OA from 82.72% to 84.72% on the IP dataset and from 89.84% to 91.43% on the UH dataset. This demonstrates that the inverted channels setting helps the network excavate additional discriminative spectral information. Due to this setting, our EfficientFCN only applies to HSI data and is unsuitable for data with fewer bands, such as multispectral data.
In addition, we replaced the proposed EFE and Fused EFE blocks with normal convolutions in our EfficientFCN, respectively, for further comparison. The corresponding results are summarized in Table 6. After replacing the EFE and Fused EFE blocks separately with normal convolutions, the classification accuracies on the IP and UH datasets all decreased to varying degrees. This further demonstrates that the EFE and Fused EFE blocks consistently improve performance by enhancing the discriminative feature learning ability of the network.
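The trade-off between the two block styles mirrors the MBConv / Fused-MBConv pattern of EfficientNetV2, from which this design trajectory starts. The following weight counts are an illustrative sketch under that assumption; the actual EFE and Fused EFE block definitions are those in Figure 4 and may differ in detail.

```python
def mbconv_weights(c_in, c_out, expand=4, k=3):
    """Weights in an MBConv-style block: 1x1 expand, kxk depthwise,
    1x1 project (biases and normalization parameters ignored)."""
    c_mid = c_in * expand
    return c_in * c_mid + k * k * c_mid + c_mid * c_out

def fused_weights(c_in, c_out, expand=4, k=3):
    """Fused variant: the 1x1 expand and kxk depthwise stages are
    replaced by a single kxk full convolution, then a 1x1 project."""
    c_mid = c_in * expand
    return k * k * c_in * c_mid + c_mid * c_out
```

For a 64-channel input and output with expansion 4, the depthwise-separable form needs 35,072 weights versus 163,840 for the fused form; the fused form trades parameters for better accelerator utilization, which is why such blocks are typically placed only in the early stages.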
The modified network improved classification results for both datasets.The superior performance of our network is attributed to its better ability to capture valuable information from redundant spectral bands.

Model complexity and speed analysis
To comprehensively analyze the complexity of the proposed network, we calculated the number of trainable parameters (Params) and FLOPs, as well as the training (Trn) and inference (Infer) time, of the comparison methods on the IP and UH datasets. Params and FLOPs are indirect measures of computational complexity, while runtime is a direct measure.
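As a concrete illustration of how the indirect measures arise, the parameter and operation counts of a single convolution layer follow directly from its shape. This is a sketch; FLOPs are counted here as multiply-accumulates (MACs), the convention many profilers use, and bias and normalization terms are ignored.

```python
def conv2d_cost(h, w, c_in, c_out, k=3, stride=1):
    """Trainable weights and multiply-accumulate operations for one
    kxk convolution applied to an h x w x c_in input with 'same'
    padding: every output pixel reuses the same k*k*c_in*c_out weights."""
    params = k * k * c_in * c_out
    h_out, w_out = (h + stride - 1) // stride, (w + stride - 1) // stride
    macs = params * h_out * w_out
    return params, macs
```

Note that MACs scale with the spatial size of the output while the parameter count does not, which is why Params and FLOPs can rank methods differently, and why neither alone predicts measured runtime.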
As shown in Table 7, the proposed method generally achieved the best results, especially in training and inference time, and near-best results in Params and FLOPs. SS3FCN, FreeNet, and SSDGL have more Params than the other methods. Although SS3FCN has the fewest FLOPs, its training and inference times are the longest, as it not only employs 3-D networks with many Params but also uses a triple-prediction averaging strategy. Compared with the patch-based methods (i.e. SSRN and DBDA), the FCN-based methods (except SS3FCN) took less time for inference. Note that the time-consuming training process is conducted offline, while the inference speed is the main factor determining whether a method is practical; thus, the pixel-to-pixel classification strategy is more suitable for practical applications. The proposed network had the fastest inference speed among the compared networks.

Impact of the number of training samples
To test the robustness and stability of the proposed method, we performed experiments with fewer training samples per class on the IP dataset. This dataset is a typical unbalanced dataset with extremely few labeled samples, thus posing significant challenges to supervised methods.

Extended experiments
For further performance assessment of the proposed method, we compared it with MobileNetV2, EfficientNetV2, the spatial-spectral transformer (SST), and the spectral-spatial feature tokenization transformer (SSFTT) on the DFC2018 and Chikusei datasets. Specifically, MobileNetV2 and EfficientNetV2 are high-efficiency networks that are more parameter-efficient and much faster for image recognition. SST and SSFTT are transformer-based networks designed for HSI classification. DFC2018 and Chikusei are two large real-world datasets; detailed information can be found in (Xu et al. 2019) and (Yokoya and Iwasaki 2016), respectively. We adopted the minimum model settings of EfficientNetV2 and MobileNetV2 and reduced their numbers of stages and layers in equal proportion to keep their totals the same as those of our EfficientFCN. The parameters of SST and SSFTT were kept the same as in the original papers.
According to the dataset and memory size, we set the window sizes of the DFC2018 and Chikusei datasets to 32 × 32 and 48 × 48, respectively. The predefined proportion of training windows is set to 8% for both datasets. Again, OA, AA, and Kappa are used for quantitative performance evaluation, and the results are summarized in Table 8 for comparison.
Table 8 shows that the proposed method yields the best results on both datasets. EfficientNetV2 and MobileNetV2 are superior to SST and SSFTT, confirming their better generalization. Although SST and SSFTT are transformer-based networks specifically designed for HSI classification, they still follow the patch-based classification framework. In this framework, a large patch size is effective in capturing spatial information for center-pixel classification; however, an excessively large patch size decreases accuracy, mainly because pixels from other classes are included in learning. The proposed EfficientFCN shows better results than EfficientNetV2 and MobileNetV2, demonstrating its superiority and generalizability. Although our EfficientFCN can extract information from a larger receptive field by stacking multiple layers, it still lacks global connectivity. Therefore, in the future, we will introduce the transformer into FCN-based networks to capture long-range dependencies in both the spatial and spectral dimensions.

Conclusion
This study proposes an OEEL framework for HSI datasets to facilitate efficient classification and objective performance evaluation. In this framework, the proposed leakage-free balanced sampling strategy generates balanced training and test samples without overlap or information leakage, enabling objective performance evaluation. Based on the generated samples, the EfficientFCN is proposed to avoid redundant computation while exhibiting a favorable accuracy-speed trade-off. Both quantitative and qualitative experimental results show that the proposed EfficientFCN outperforms many state-of-the-art methods. However, the experimental results in this study may fail to identify the most suitable DL-based architectures, because the lack of large HSI datasets prevents some of these architectures from realizing their full potential. Therefore, future work should construct large benchmark datasets to facilitate research on HSI analysis. Furthermore, we will consider weakly supervised approaches to relieve the demand for expensive pixel-level image annotation.

Figure 1 .
Figure 1. Demonstration of the traditional sampling strategy, which results in (a) overlap between adjacent patches, (b) overlap between the training and test data, and (c) blurred boundaries of the classification map. In (a), dots represent the central pixels of the corresponding patches with white borders. In (b), green and red dots represent the training and test pixels of the corresponding patches, respectively.

Figure 2 .
Figure 2. Overview of the proposed OEEL framework. The framework includes two core components: a leakage-free balanced sampling strategy and an EfficientFCN.

Figure 3 .
Figure 3. Flowchart of the proposed leakage-free balanced sampling strategy. The operation process for an HSI is the same as that for the ground truth; for convenience, only the ground truth operation process is presented.

Figure 4 .
Figure 4. EfficientFCN architecture designed for HSI classification. (a) EFE block. (b) Fused EFE block. (c) EfficientFCN embedded with EFE blocks and Fused EFE blocks, where the number of output channels and repetitions per block is listed on the left and right sides, respectively.
Figure 5(a) summarizes the class names and numbers of samples. The spatial distribution of training data, produced using the proposed sampling strategy, is provided in Figure 5(b). The PU dataset, covering the University of Pavia, Northern Italy, was collected by the Reflective Optics System Imaging Spectrometer sensor in 2001. The dataset is a 610 × 340 × 115 data cube with a spatial resolution of 1.3 m and a wavelength range of 0.43-0.86 µm. Before the experiments, the number of spectral bands was reduced to 103 by removing water absorption bands. The scene is an urban environment characterized by natural objects and shadows, where nine land-cover classes are labeled. Detailed information about this dataset is provided in Figure 6. The SA dataset was recorded by the AVIRIS sensor over several agricultural fields in Salinas Valley, California, USA. It contains 512 × 217 pixels with a spatial resolution of 3.7 m per pixel. Each pixel has 224 spectral bands in the spectral range of 0.36-2.5 µm. As in the case of the IP dataset, 20 noisy and water absorption bands were discarded before the experiments. As summarized in Figure 7(a), 16 land-cover classes were defined. Figure 7(b) shows the spatial distribution of training data.

Figure 6 .
Figure 6. PU dataset. (a) Land cover type and sample settings. (b) Spatial distribution of training samples (white windows).
including SSRN (Zhong et al. 2018), DBDA (Li et al. 2020), the spectral-spatial 3-D fully convolutional network (SS3FCN) (Zou et al. 2020), FreeNet (Zheng et al. 2020), SSDGL (Zhu et al. 2021), ConvNeXt (Liu et al. 2022), and EfficientNetV2 (Tan and Le 2021). Both SSRN and DBDA are patch-based 3-D CNN networks. SSRN uses consecutive spectral and spatial residual blocks to learn spectral and spatial representations, respectively, followed by an average pooling layer and an FC layer. DBDA includes a dense spectral branch with the channel attention module and a dense spatial branch with the position attention module. The outputs of both branches are concatenated and fed to an average pooling layer, followed by an FC layer for classification. SS3FCN considers small patches of the original HSI as input and performs pixel-to-pixel classification, where parallel 3-D and 1-D FCNs are used to learn joint spectral-spatial features and spectral features, respectively.

Figure 7 .
Figure 7. SA dataset. (a) Land cover type and sample settings. (b) Spatial distribution of training samples (white windows).

Figure 8 .
Figure 8. UH dataset. (a) Land cover type and sample settings. (b) Spatial distribution of training samples (white windows).

Figure 13 .
Figure 13. Variation of test accuracy with input window size on the IP dataset.
Figure 14 shows the OA of different methods with different numbers of training samples, where the training percent represents the proportion of training samples relative to Figure 5(a). For example, 100% corresponds to the total number of training samples listed in Figure 5(a). For all methods, accuracy decreased with fewer training samples, especially when the training percent was < 50%. Nevertheless, the proposed network consistently outperformed the other methods in accuracy, demonstrating its robustness.

Figure 14 .
Figure 14. The classification accuracy of different methods with a varying number of training samples on the IP dataset.

Table 1 .
Comparison of classification accuracy of different methods on the IP dataset.

Table 2 .
Comparison of classification accuracy of different methods on the PU dataset.

Table 3 .
Comparison of classification accuracy of different methods on the SA dataset.

Table 4 .
Comparison of classification accuracy of different methods on the UH dataset.

Table 5 .
Ablation analysis of the proposed EfficientFCN on the IP and UH datasets.

Table 6 .
Effects of the EFE and Fused EFE blocks on the performance of the proposed EfficientFCN.

Table 7 .
Comparison of Params, FLOPs, training (abbreviated as Trn), and inference (abbreviated as Infer) time of different methods on the IP and UH datasets.