TrmGLU-Net: transformer-augmented global-local U-Net for hyperspectral image classification with limited training samples

ABSTRACT In recent years, deep learning methods have been widely used for the classification of hyperspectral images. However, their limited effectiveness under small-sample conditions remains a serious issue. Moreover, the current mainstream approaches based on convolutional neural networks perform well in local feature extraction but are restricted by their limited receptive fields. Hence, these models are unable to capture long-distance dependencies in either the spatial or the spectral dimension. To address these issues, this paper proposes a global-local U-Net augmented by transformers (TrmGLU-Net). First, whole hyperspectral images are input to the model for end-to-end training to capture the contextual information. Then, a transformer-augmented U-Net is designed with alternating transformers and convolutional layers to perceive both global and local information. Finally, a superpixel-based label expansion method is proposed to expand the labels and improve the performance under small-sample conditions. Extensive experiments on four hyperspectral scenes demonstrate that TrmGLU-Net performs better than other advanced patch-level and image-level methods with limited training samples. The relevant code will be made available at https://github.com/sssssyf/TrmGLU-Net


Introduction
Hyperspectral remote sensing images have the combined characteristics of an image and a spectrum, and they contain both spectral and spatial information of surface objects. Different objects have unique spectral characteristics (S. Li et al., 2019). Because of this correlation, hyperspectral images (HSIs) have been widely used in various geological tasks, such as recognition of tree species, precise extraction of water boundaries, and statistical analyses of land use (Y. Zhong et al., 2018). Such applications are inseparable from HSI classification methods. Although rich spectral and spatial information offers great possibilities for fine-scale differentiation between objects, it also poses great challenges to classification tasks. Early HSI classification research focused on mining spectral features and on classification algorithms. During that time, machine learning algorithms, such as support vector machines (SVMs) (Melgani & Bruzzone, 2004), random forests (L. Zhang et al., 2018), linear discriminant analysis (C. Li et al., 2011), and neural networks, were successively applied to HSI classification problems. Facing limited computing resources, researchers explored different methods of data dimensionality reduction (Agarwal et al., 2007; Luo et al., 2020; Villa et al., 2010) and band selection (Cai et al., 2020; B. Fang et al., 2020; Q. Wang et al., 2020; Zeng et al., 2019) to deal with the high dimensionality of hyperspectral images. Although early studies improved the classification of hyperspectral images, three problems still must be solved for HSI classification: (1) the limited availability of labelled training samples, which leads to overfitting; (2) the high dimensionality of HSI data and the high correlation between adjacent bands, which further aggravate overfitting; and (3) the limited capability to distinguish different objects by spectral characteristics alone, because different objects may have the same spectral characteristics.
These issues severely limit classification accuracy. To improve HSI classification accuracy, researchers have incorporated spatial information into classification. Many methods use spatial information, and they can be broadly classified into three types. The first is represented by methods of spatial feature extraction, such as extended morphological profiles (EMPs) (Benediktsson et al., 2005), Gabor filters (Jia et al., 2019), and local binary patterns (W. Li et al., 2015). These methods account for the information of the neighboring pixels of sample points in feature extraction and input the extracted features into the classifier to complete the classification process. The second type considers the distance between samples. The closer the samples, the higher the probability that they belong to the same class. Such correlations are used to constrain the classifier during classification to improve accuracy. The third type focuses on the problem of classification noise in the initial classification results of hyperspectral images, which are mainly denoised by using morphological filtering and other methods. The introduction of spatial information has improved the accuracy of HSI classification. However, the limited availability of labelled training samples continues to restrict its development and application. To this end, researchers are exploring semi-supervised learning methods, such as label propagation (J. Zhang et al., 2020), transductive SVM (Bruzzone et al., 2005), collaborative training (Wan et al., 2015), and active learning (Z. Wang et al., 2017). This activity has resulted in the rapid development of HSI classification methods, giving rise to spatial-spectral classification and semi-supervised classification. These methods are effective in improving the classification accuracy of hyperspectral images while addressing the availability issue of labelled samples to some extent. However, classification performance with these methods relies heavily on expert experience, and it usually requires the manual design of complex feature extraction rules as well as the setting of different hyperparameters for different data.
CONTACT Yifan Sun sincere_sunyf@163.com Information Engineering University, 62 Science Avenue, High-tech Zone, Zhengzhou, Henan Province, China
In recent years, as computing capabilities and data volumes continue to grow, data-driven methods represented by deep learning have gained great popularity in many tasks, including image recognition (K. He et al., 2016), object detection (Ren et al., 2017), semantic segmentation (Shelhamer et al., 2017), and three-dimensional (3D) reconstruction (Yao et al., 2019). Deep learning can automatically extract features from data for downstream tasks. Hence, researchers have introduced deep learning to HSI classification to develop more versatile classification methods. Commonly used deep learning models include autoencoders (Xing et al., 2016), deep belief networks (Y. Chen et al., 2015), convolutional neural networks (CNNs) (Hu et al., 2015), and recurrent neural networks (RNNs) (Mou et al., 2017). Among these, CNNs have shown good performance in dealing with high-dimensional images and have therefore received wide attention in HSI classification. Inspired by spatial-spectral classification, researchers take a sample point as the center and slice out a local image patch of a certain size from the hyperspectral image as the feature of that sample, which is then input to a two-dimensional (2D) CNN (Haut et al., 2019; B. Liu et al., 2018; L. Zhang et al., 2018), a 3D CNN (Y. Chen et al., 2016), or other models for classification. To further improve classification accuracy, more recent deep learning techniques have been adopted, such as residual learning (Liu, Yu, Zhang, et al., 2021), DenseNet (Huang et al., 2017), and attention mechanisms (Kuiliang Gao, Yu, et al., 2021; Xue et al., 2021). Considering that the different bands of hyperspectral images can be treated as time-series data, RNNs, long short-term memory models, and others have been proposed for the classification of hyperspectral images. In addition, transformer-based models have recently made progress in HSI classification. Bidirectional encoder representations from transformers for HSI classification (HSI-BERT) was an early attempt to use the transformer to capture the global dependence among pixels (J. He et al., 2020). A spatial-spectral transformer (SST) was proposed to use the transformer to extract the dependencies of long-distance sequential spectra (X. He et al., 2021). Spectral and spatial transformer blocks replacing convolution were developed in a novel spectral-spatial transformer network (SSTN) to extract features, which achieved superior performance (Z. Zhong et al., 2022). SpectralFormer (Hong et al., 2022) rethought HSI classification from a sequential perspective and relied purely on the transformer to complete the task.
Although these methods can effectively lower the difficulty of training deep learning models with hyperspectral images, the limited availability of labelled training samples for HSI classification remains a problem. Besides, these explorations and attempts still belong to the methods that use local image patches as input. Therefore, numerous works have been proposed in recent years to solve the problem of small-sample HSI classification. For instance, data augmentation can alleviate sample scarcity, which allows the model to be more fully trained for HSI classification when there are not enough labels (Nalepa et al., 2020a). Transfer learning methods have been widely explored to give the model a higher generalization ability for target data by training on source data, so that the model can quickly converge on the target domain and depend less on training samples (Yifan Sun, Bing Liu, Xuchu Yu, Anzhu Yu, Kuiliang Gao et al., 2022). Practice has verified the effectiveness of transfer learning in terms of both convergence speed and improvement with limited samples (J. He et al., 2020; Lee et al., 2022; Nalepa et al., 2020b). Few-shot learning specializes in solving problems with fewer labels and has also been explored to improve model performance in small-sample HSI classification (Liu, Yu, Yu, et al., 2019). As a prevalent and effective scheme of few-shot learning, meta-learning mainly gives the model the capacity of learning to learn and has been explored to improve the performance of small-sample HSI classification (Kuiliang Gao et al., 2020, 2022; Kuiliang Gao, Liu, et al., 2021). Besides, due to the massive growth of labelled samples, in addition to the supervised learning paradigm, semi-supervised learning and unsupervised learning are defined according to the participation of unlabelled samples during training. Among them, semi-supervised learning methods can combine unlabelled samples and labelled samples for joint learning, represented by the generative adversarial network (GAN) (L. Zhu et al., 2018) and the graph convolution network (GCN) (Hong et al., 2021). The pseudo-label strategy is also a representative method for semi-supervised learning and has been explored widely to use unlabelled samples to expand the amount of training data (B. Fang et al., 2020). Unsupervised learning is explored to enable the network to learn feature representations for HSI classification with unlabelled samples only. Reconstruction-based methods with an encoder-decoder architecture are representative and have been widely used to accomplish spectral-spatial feature learning (Mei et al., 2019; S. Zhang et al., 2022). A novel unsupervised learning framework has also been constructed to track the spectral variation information and extract the spectrum motion feature (Sun, Liu, Yu, Yu, Gao, & Ding, 2022). As an influential branch of unsupervised learning, self-supervised learning aims to construct the loss function by using some attributes of the data instead of labelled samples, and it involves contrastive learning and generative learning. Contrastive learning uses a special contrastive loss function to cluster the positive samples and push away the negative samples, which enables networks to acquire a robust feature-learning capacity without labels (Hou et al., 2022; B. Liu et al., 2021; M. Zhu et al., 2022). Generative learning usually enables networks to be trained by generating samples, and a relevant method has been proposed that enables the network to learn spectral-spatial features by recovering the information of masked samples (Xue et al., 2022). The numerous works mentioned above are significant explorations of the problem of HSI classification with limited training samples, and they effectively improve the performance of deep models under small-sample conditions (Sun et al., 2022). However, almost all of these methods are designed at the patch level, so how to improve the classification performance of image-level classification methods with limited training samples is still worth exploring.
Moreover, deep learning methods that use local image patches as input have been successfully used for HSI classification tasks (Sun et al., 2023). However, these methods suffer from two major problems: one is that a model with local image patches as input fails to perceive the contextual information of the whole scene; the other is that these highly overlapping image patches generate a lot of redundant computations, which reduces classification efficiency. To address these problems, semantic segmentation models have been introduced to the classification of hyperspectral images. Specifically, whole hyperspectral images are input to the segmentation model, and the classification result for the whole scene is produced. With this approach, contextual information is used to improve classification accuracy while unnecessary computations are reduced. Classical semantic segmentation models include fully convolutional networks (FCNs) (Shelhamer et al., 2017), U-Net (Ronneberger et al., 2015), SegNet (Badrinarayanan et al., 2017), and DeepLab (L. C. Chen et al., 2018). These models usually require a large number of densely labelled samples to optimize thousands of network parameters in a semantic segmentation task. Simultaneously, the global learning framework, which specializes in capturing contextual information, is gradually being applied in the remote sensing field to obtain better performance (M. Zhu et al., 2022). However, the labelled samples tend to be sparse in HSI classification tasks. More importantly, the number of these labelled samples is usually small, which leads to poor performance when these classical segmentation models are applied directly. For this reason, DSSNet (Pan et al., 2020), a segmentation network containing four convolutional layers, was proposed for HSI classification. To improve the classification accuracy by using the contextual information in HSIs more fully, researchers have developed segmentation models that consider the characteristics of HSIs, such as the deep fully convolutional network-based spatial distribution prediction (Jiao et al., 2017), the spectral-spatial fully convolutional network (SSFCN) (Xu et al., 2020), the fast patch-free global learning (FPGA) framework based on a fully convolutional network (Zheng et al., 2020), the fully convolutional network with channel and spatial attention (FCN-CSA) (Jiang et al., 2021), PBiNet (B. Liu & Yu, 2021), FOctConvPA (Yifan Sun, Bing Liu, Xuchu Yu, Anzhu Yu, Zhixiang Xue et al., 2022), and FullyContNet-Pyramid (D. Wang et al., 2022).
Owing to the use of local connections for reducing the number of parameters, CNNs tend to have restricted receptive fields and cannot perceive long-distance dependencies. This may be improved by enlarging the convolutional kernel or using dilated convolution, but that can result in blurred feature boundaries. A transformer regards images as sequential one-dimensional data and learns features using the self-attention mechanism. Thus, the transformer structure has a global receptive field and can be used to perceive contextual information in hyperspectral images.
However, in addition to the contextual information, the local details are also crucial because the purpose of HSI classification is to assign a class label to each pixel. To model both contextual information and local details, this paper proposes a global-local U-Net augmented by transformers (TrmGLU-Net) for small-sample classification of hyperspectral images. Specifically, the hyperspectral image is input as a whole. Then, alternating transformers and convolutional layers are used to perceive both its contextual and local information to improve the accuracy of HSI classification. This also allows the transformer's global receptive field to work on the whole image rather than just on local patches.
The originality of this paper is shown in the following aspects:
• TrmGLU-Net is proposed, which takes a whole hyperspectral image as input and performs end-to-end training. TrmGLU-Net comprises alternating transformers and convolutional layers. Skip connections are used between encoders and decoders, which allow the model to improve classification accuracy by making full use of contextual and local information.
• A superpixel-based label expansion method is proposed to improve the performance of image-level methods under small-sample conditions. With this method, the original image is partitioned into segments, and the results of superpixel segmentation are used to expand labels to effectively increase supervision information and obtain higher classification accuracy.
• The validity of the proposed method for small-sample classification of hyperspectral images is verified using four sets of hyperspectral images with manually labelled samples. Quantitative and qualitative experiments suggest that the combination of TrmGLU-Net and the superpixel-based label expansion method can obtain better classification results than those of semi-supervised methods, providing better adaptability when only a small number of labelled samples are available for each class of objects.
The remainder of this paper is organized as follows. Section 2 introduces the proposed classification method. Section 3 describes the validation of the proposed method for small-sample classification and its adaptability to sample size through classification experiments on four sets of hyperspectral images. Section 4 concludes the paper.

TrmGLU-Net and superpixel-based label expansion
The proposed method for HSI classification combines TrmGLU-Net and superpixel-based label expansion. The following subsections introduce the architecture of TrmGLU-Net, then its components and the method for label expansion based on superpixel segmentation.

TrmGLU-Net architecture
The architecture of the proposed TrmGLU-Net is illustrated in Figure 1. The size of the model input is B × H × W, where B, H and W refer to the band number, height and width of the whole image, respectively. To fully use the information of the whole HSI data, an embedding layer composed of one convolutional layer is deployed to project the HSI data into a standard dimension (64). The embedding layer guarantees that all spectral information can be modelled while avoiding a huge computational cost. The network is very concise and generally adopts the encoder-decoder scheme of U-Net. The difference is that only two alternating transformers and convolutional layers are used for encoding and decoding, in response to the small sample size. First, feature transformation is performed on the input image using a conventional 2D convolution with a step size of one. During the first stage of encoding, the feature map is input into the transformer layer. Then, the output features are aggregated through two 3 × 3 convolutional layers, where the step size of the first convolutional layer is set to one and the step size of the second convolutional layer is set to two. After the second convolutional layer, the number of feature map channels, and hence the dimension of the feature space, is doubled.
Next, the second stage of encoding begins. In this process, the original input is concatenated with the downsampled features, which supplements the missing details. Finally, the features output from the second stage of encoding are fed to the decoder through two transformers. The decoder and the encoder are similar in structure. However, the encoder downsamples the feature map through a convolutional layer with a step size of two to aggregate features, whereas the decoder upsamples the feature map by using a deconvolution operation with a step size of two. The following subsections introduce the components of TrmGLU-Net and the superpixel-based label expansion method.
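To make the downsampling and upsampling steps concrete, the spatial sizes through one encoding stage and the matching decoding step can be traced as in the sketch below. This is only an illustration: the padding of 1, the `output_padding` of 1 and the example size of 103 × 610 × 340 are assumptions that are not stated in the text, and the function names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Spatial size after a transposed convolution (assumed settings)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_shapes(bands, height, width, embed_dim=64):
    """List (name, channels, height, width) for one encoder stage and its decoder step."""
    shapes = [("input", bands, height, width)]
    c, h, w = embed_dim, height, width
    shapes.append(("embedding", c, h, w))  # projection to 64 channels
    # The transformer layer keeps the shape; the first 3x3 convolution uses stride 1.
    h, w = conv_out(h, stride=1), conv_out(w, stride=1)
    shapes.append(("conv s=1", c, h, w))
    # The second 3x3 convolution uses stride 2 and doubles the channels.
    c, h, w = 2 * c, conv_out(h, stride=2), conv_out(w, stride=2)
    shapes.append(("conv s=2", c, h, w))
    # The decoder upsamples with a stride-2 transposed convolution.
    c, h, w = c // 2, deconv_out(h), deconv_out(w)
    shapes.append(("deconv s=2", c, h, w))
    return shapes


for row in trace_shapes(103, 610, 340):
    print(row)
```

With even spatial sizes the decoder recovers the input resolution under these settings. For odd sizes, the output padding (or a crop) would need to be adjusted so that the skip connections line up.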

Transformer
Transformers have self-attention mechanisms, allowing them to have global receptive fields and capture long-distance dependencies. However, to use transformers to process the feature map in Figure 1, it is first necessary to convert the feature map into a sequential form. The transformer-based method originally used to solve image problems simply cut the image into fixed-size patches (Dosovitskiy et al., 2020). This is clearly not conducive to high-resolution visual tasks, especially hyperspectral image classification, which is a dense prediction task. For hyperspectral image classification, it is necessary to establish pixel-level dependencies when using a transformer, but this leads to high computational complexity, and the original transformer is not computationally efficient in this setting.
As shown in Figure 2, the feature map is first divided into non-overlapping windows of size M × M along the spatial dimension, and then it is evenly divided into k non-overlapping parts along the channel dimension. Thus, we obtain a total of $\frac{H \times W}{M \times M} \times k$ independent windows. Next, the feature map in each window is expanded into a 2D matrix upon which the self-attention operation is performed. Here, k denotes the number of heads of the transformer. For the feature map X, the transformer operates on each window separately: $X_k^i$ refers to the i-th window for the k-th head, and $W_k^Q$, $W_k^K$, $W_k^V$ denote the query, key, and value projections for the k-th head, respectively. In the W-MSA model, the attention calculations are confined to the inside of each window, so different windows can be treated as different batches during training. The attention calculation applies the softmax normalization function to the query-key similarities together with a relative position encoding B. Like the original transformer, a fully connected layer follows W-MSA to enhance the nonlinearity of the features. Note that each window in the feature map must be flattened to a 2D matrix before being input to the transformer; it is reverted to a feature map after the transformer and before being input to the convolutional layer. With this window partition, the computational complexity of the attention is $O(M^2 \times H \times W \times k)$. Besides, as the size of the feature map decreases with the deepening of the network, the receptive field of each window effectively increases, which allows the model to attend better to both contextual and local information. Therefore, it can be used to effectively improve the accuracy of HSI classification.
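The window bookkeeping and the attention inside one window can be sketched in plain Python as follows. The sketch assumes the commonly used scaled dot-product form, softmax(QKᵀ/√d + B)V, where d is the per-head feature dimension; the scaling factor and all function names are assumptions made for illustration and are not taken from the text.

```python
import math


def window_count(height, width, window, heads):
    """Number of independent windows: (H * W) / (M * M) * k."""
    assert height % window == 0 and width % window == 0
    return (height // window) * (width // window) * heads


def attention_pairs(height, width, window, heads):
    """Query-key pairs evaluated by windowed attention: M^2 * H * W * k."""
    tokens_per_window = window * window
    return window_count(height, width, window, heads) * tokens_per_window ** 2


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(q, k, v, bias):
    """Attention for one window and one head (assumed scaled dot-product form).

    q, k, v: lists of n tokens, each a list of floats; bias: n x n list (B).
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + bias[i][j]
            for j, kj in enumerate(k)
        ]
        weights = softmax(scores)
        out.append([
            sum(wt * vj[c] for wt, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


print(window_count(8, 8, 4, 2))      # windows in an 8 x 8 map, M = 4, k = 2
print(attention_pairs(8, 8, 4, 2))   # M^2 * H * W * k query-key pairs
```

Counting query-key pairs in this way gives M² × H × W × k for the windowed scheme, compared with (H × W)² × k if every pixel attended to every other pixel.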

Convolutional layers
TrmGLU-Net employs three types of 3 × 3 convolutions. A convolutional layer with a step size of one is used to extract local features. A convolutional layer with a step size of two is used to reduce the spatial dimension of the feature map for feature aggregation. An inverse convolution layer (ConvT2D) with a step size of two is used to upsample the feature map. The convolution operation can be described with the following equation.
In the j-th feature map of the i-th convolutional layer, the value $v_{ij}^{xy}$ at position (x, y) can be obtained using the following equation:

$$v_{ij}^{xy} = f\left(b_{ij} + \sum_{k=1}^{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijk}^{pq}\, v_{(i-1)k}^{(x+p)(y+q)}\right)$$

where $P_i$ and $Q_i$ denote the kernel sizes, m is the number of feature maps of the (i − 1)-th layer, $v_{(i-1)k}^{(x+p)(y+q)}$ is the value of the k-th feature map in the (i − 1)-th layer at (x + p, y + q), $w_{ijk}^{pq}$ is the convolution kernel connected to the k-th feature map of the (i − 1)-th layer, $b_{ij}$ denotes the bias, and f(·) is the activation function. The bidirectional encoder representations from transformers (BERT) model uses Gaussian error linear units (GELUs) as the activation function. Therefore, the model in this paper also adopts GELUs as the activation function. ConvT2D involves a process in which the feature map is first interpolated before the convolution operation.

Superpixel-based label expansion
In small-sample HSI classification, only a few labelled samples are available for training, and they are also sparsely distributed. Such weak supervision is likely to cause noise in the classification results and blurred boundaries between different classes. Considering that the sample points within a superpixel are very likely to have consistent labels, we propose a superpixel-based label expansion method. Specifically, superpixel segmentation is performed on the hyperspectral image, and the result of superpixel segmentation is used as a mask. For a superpixel containing labelled samples, the samples within that superpixel are labelled with the classes of the known samples, which greatly increases the number of labelled samples. The expansion of labels significantly strengthens the supervision information of the image-level task, and thus the network can be trained more adequately.
The accuracy of superpixel segmentation is the key to the result of the proposed label expansion method. Therefore, the segmentation results of simple linear iterative clustering (SLIC) are used to expand a small number of labels. The SLIC algorithm is simple and efficient in that it only requires the number of superpixels, K, as input to obtain the segmentation results. The SLIC algorithm is shown in Alg. 1.
Extensive evidence suggests that good superpixel segmentation results can be obtained with 10 iterations for most image data. Therefore, the number of iterations in SLIC is set to 10 for later experiments. Note that the superpixel-based label expansion here is different from the common superpixel segmentation post-processing method: it uses the result of superpixel segmentation to extend the labelled samples before model training. Therefore, the superpixel-based label expansion is not a post-processing step, and it can be understood as a pseudo-label technique that addresses the problem of insufficient samples.
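The expansion step itself can be sketched as follows, given a superpixel map that has already been computed (for example by SLIC). The text does not say how a superpixel containing samples of more than one class is handled; the majority vote used here, the convention that 0 marks an unlabelled pixel, and the function name are assumptions for illustration only.

```python
from collections import Counter, defaultdict

UNLABELLED = 0  # assumed convention: 0 marks pixels without a label


def expand_labels(segments, labels):
    """Spread sparse labels over the superpixels that contain them.

    segments: 2D list of superpixel ids; labels: 2D list of class ids.
    Original labels are kept; unlabelled pixels receive the most frequent
    label found in their superpixel (majority vote is an assumption).
    """
    votes = defaultdict(Counter)
    for seg_row, lab_row in zip(segments, labels):
        for seg, lab in zip(seg_row, lab_row):
            if lab != UNLABELLED:
                votes[seg][lab] += 1
    winner = {seg: count.most_common(1)[0][0] for seg, count in votes.items()}
    return [
        [lab if lab != UNLABELLED else winner.get(seg, UNLABELLED)
         for seg, lab in zip(seg_row, lab_row)]
        for seg_row, lab_row in zip(segments, labels)
    ]


segments = [[0, 0, 1],
            [0, 1, 1],
            [2, 2, 2]]
labels = [[3, 0, 0],
          [0, 0, 5],
          [0, 0, 0]]
for row in expand_labels(segments, labels):
    print(row)
```

Superpixels without any labelled sample (segment 2 in this toy example) stay unlabelled, so they contribute nothing to the training loss.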

Experimental results and analysis
The validity of the proposed method was tested by classifying four HSI datasets. In terms of hardware settings, an Intel(R) Xeon(R) Gold 6152 central processing unit, an Nvidia A100 PCIe GPU with 40 GB of GPU memory, and 128 GB of RAM were used. The algorithms were all implemented in Python and PyTorch. The overall accuracy (OA), average accuracy (AA), and Cohen's kappa (κ) were selected as the evaluation criteria. All trials were run 10 times with different samples, and the average results were reported to reduce random errors as far as possible. Besides, an additional two-tailed Wilcoxon test was conducted to show whether the experimental conclusions were statistically significant (i.e. reporting the p-value).
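For reference, the three evaluation criteria can be computed from a confusion matrix as in the sketch below. It uses the standard definitions of OA, AA and Cohen's kappa; the function name and the toy matrix are illustrative, and every class is assumed to have at least one test sample.

```python
def classification_scores(confusion):
    """Return (OA, AA, kappa) for confusion[i][j] = true class i predicted as j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total
    # AA is the mean of the per-class accuracies (recall of each class).
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
    aa = sum(per_class) / n
    # Chance agreement from the row (true) and column (predicted) marginals.
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(sum(confusion[i]) * col_sums[i] for i in range(n)) / total ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


oa, aa, kappa = classification_scores([[40, 10], [5, 45]])
print(f"OA={oa:.3f} AA={aa:.3f} kappa={kappa:.3f}")
```

OA weights every test sample equally, whereas AA weights every class equally, so the two can diverge noticeably on scenes with strongly unbalanced classes.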

Experimental data
To test the validity of the proposed method, four HSI scenes were used for classification (i.e. a University of Pavia scene, an Indian Pines scene, a Salinas scene and a Houston scene). The scenes were obtained by the Reflective Optics Spectrographic Imaging System (ROSIS-03), the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS, for both the Indian Pines and Salinas scenes), and the ITRES CASI-1500 sensor, respectively. To quantitatively evaluate the performance of different classification algorithms, some regions of the images were manually labelled. Specifically, data from the University of Pavia scene were classified into 9 classes with a total of 42,776 samples; data from the Indian Pines scene were classified into 16 classes with a total of 10,776 samples; data from the Salinas scene were classified into 16 classes with a total of 54,129 samples; and data from the Houston scene were classified into 15 classes with a total of 15,029 samples. The properties of these HSIs are listed in Table 1. To verify the efficacy of the proposed method for small-sample classification, samples were randomly selected from each class as training samples (i.e. 20, 10, 10 and 20 samples per class from the University of Pavia data, the Indian Pines data, the Salinas data, and the Houston data, respectively). The remaining samples were used as test samples. For the four HSIs, the number of samples used for training was 180, 160, 160 and 300, respectively. The proportion of training samples to the total number of samples was approximately 0.4%, 1.5%, 0.3% and 2.0%, respectively.
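The per-class random selection described above can be sketched as follows. The function name, the flat list of class ids and the fixed seed are illustrative assumptions; in the experiments the selection is repeated with different samples in each of the 10 trials.

```python
import random


def split_per_class(labels, per_class, seed=0):
    """Pick `per_class` random training indices from every class; the rest are test.

    labels: flat list with the class id of each labelled sample.
    Raises ValueError if a class has fewer than `per_class` samples.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train = []
    for idxs in by_class.values():
        train.extend(rng.sample(idxs, per_class))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


toy_labels = [1] * 30 + [2] * 25
train_idx, test_idx = split_per_class(toy_labels, per_class=10)
print(len(train_idx), len(test_idx))
```

As a worked example of the counts in the text, the University of Pavia scene has 9 classes with 20 samples each, giving 9 × 20 = 180 training samples, or 180 / 42,776 ≈ 0.4% of the labelled samples.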

Parameter settings and analysis
The Adam optimization algorithm was used for network training with a learning rate of 0.00005 and 150 epochs. The expansion of superpixel labels directly affects the accuracy of classification, and SLIC requires the number of superpixels as a prior input. Therefore, how to determine the number of segments becomes a key problem. To make this process more rational, we use a soft tuning mechanism. Concretely, we first define a space of candidate segment numbers, Θ = {100, 500, 1000, 2000, 5000, 10000}, to test the effect of different numbers of superpixels on the classification accuracy and observe the variation trend. Each experiment was repeated 10 times. Figure 4 presents the box plots of overall accuracy with different numbers of segments used for the four HSIs under the defined space. The experimental results in Figure 4 show that both too few (100, 500) and too many (5000, 10000) superpixels lead to a decline in classification accuracy, except for the Houston dataset. The lower the number of superpixels and the higher the number of image points contained within each superpixel, the higher the probability of introducing noise in the labels, thus causing a decrease in classification accuracy. On the other hand, the higher the number of superpixels, the lower the number of image points contained in each superpixel. Thus, the expanded labels were more accurate, but the number of expanded labels was smaller.
Hence, there was a slight decrease in classification accuracy. As for the Houston scene, a possible explanation is that it is too complex and contains a large number of linear objects, so it is not suitable for each superpixel to contain too many points. Second, we further narrow the range of optimal values according to the results of the previous step. We narrow the range to (500, 5000) for the first three scenes and to (5000, 13000) for the Houston scene, taking one value at intervals of 100 within this range as a new experimental number of segments. To facilitate intuitive comparison, we fix the training samples, and the variation trend of classification performance with the number of segments is shown in Figure 5. Thus, the optimal range is further compressed. Taking the University of Pavia scene as an example, its range is narrowed to (1200, 1300). Next, to improve accuracy while also considering efficiency, the interval is further reduced to 5 to determine the final optimal range. The final optimal ranges of the number of segments for the four scenes are thus determined to be (1200, 1300), (1100, 1300), (530, 565) and (11010, 12030), respectively. The minimum of each range is selected as the final number of segments for each scene.
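The coarse-to-fine tuning of the number of segments can be sketched as below. This is a simplified illustration rather than the exact procedure: it keeps the single best candidate at each stage instead of a range, and `evaluate` is a hypothetical callback that would return the mean overall accuracy obtained with a given number of superpixels.

```python
def best_bracket(evaluate, candidates):
    """Return (lower neighbour, best candidate, upper neighbour) by score."""
    candidates = sorted(candidates)
    scores = [evaluate(c) for c in candidates]
    i = max(range(len(candidates)), key=scores.__getitem__)
    lower = candidates[max(i - 1, 0)]
    upper = candidates[min(i + 1, len(candidates) - 1)]
    return lower, candidates[i], upper


def coarse_to_fine(evaluate, coarse_grid, steps=(100, 5)):
    """Search the coarse grid first, then refine with intervals of 100 and 5."""
    lower, best, upper = best_bracket(evaluate, coarse_grid)
    for step in steps:
        grid = list(range(lower, upper + 1, step))
        lower, best, upper = best_bracket(evaluate, grid)
    return best


# Toy stand-in for "train and measure OA": the score peaks at 1235 segments.
def toy_score(num_segments):
    return -abs(num_segments - 1235)


print(coarse_to_fine(toy_score, [100, 500, 1000, 2000, 5000, 10000]))
```

In practice each call to `evaluate` stands for several training runs, so the number of candidates per stage is the main cost of this search.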

Ablation study
To demonstrate the enhancing effect of transformers and the effectiveness of superpixel-based label [...] improve classification accuracy (p < 0.005), showing the effectiveness of the proposed method.
Besides, as TrmGLU-Net is proposed for small-sample classification, we considered that inputting data with high dimensionality may cause overfitting. In addition, a 3-channel input is the standard form for conventional transformer models. Therefore, the original hyperspectral images were transformed through principal component analysis (PCA), and the first three principal components were taken as input. The comparative results are shown in Table 2. As can be observed, reducing the dimensionality of the data with PCA does not improve the performance. The PCA step even weakens the performance to an extent because of potential information loss. By contrast, using the full data as input obtains higher overall accuracy on all HSI data, which shows the necessity of keeping the full-dimension information. Therefore, we retain the setting of inputting the full-dimension HSI data.

Comparison of methods
Tables 3-6 show the per-class accuracy, overall accuracy, average accuracy, and Cohen's kappa of different methods on the four HSIs. The experiment was repeated 10 times for each method. The mean and standard deviation for the 10 experiments are listed in the tables, and the p-values of the two-tailed Wilcoxon test are also reported.
Combined with the kernel method, SVM (Melgani & Bruzzone, 2004) can address the overfitting issue caused by small sample sizes to a certain extent, and it has been widely used for HSI classification. Therefore, the SVM with spectral features as the direct input was chosen as the benchmark. The classical EMP (Benediktsson et al., 2005) was then selected as a representative method for manual feature extraction. Given the large number of patch-level deep learning methods, we selected three popular methods for comparison, i.e. 2D-CNN (Yue et al., 2016), 3D-CNN (Y. Chen et al., 2016) and the double-branch multi attention mechanism network (DBAM) (Ma et al., 2019). Additionally, deep few-shot learning (DFSL) (Liu, Yu, Yu, et al., 2019) and the convolutional neural network with model-agnostic meta-learning algorithm (CNN-MAML) (Kuiliang Gao, Liu, et al., 2021), both specifically designed for small-sample classification, were also selected. DFSL uses multiple previously collected HSIs to pretrain the model, and the trained network can then be applied to target HSIs as a feature extractor. CNN-MAML optimizes the CNN model on a large number of different tasks to make it more general, so that the model can better adapt to target tasks with only small samples. The last method was the latest image-level method, FCN-Pyramid (D. Wang et al., 2022) with an attention mechanism, which also takes the whole hyperspectral image as model input and directly outputs the classification results of the whole scene.
As shown in the tables, the SVM, which used only spectral features, had the lowest classification accuracy for the four HSIs, whereas EMP with spatial information achieved higher classification accuracy. In addition, the deep learning methods based on local image blocks (e.g. 2D-CNN, 3D-CNN and DBAM) improved the classification accuracy, with a significant increase observed for DBAM. A high classification accuracy was obtained with both DFSL and CNN-MAML on the University of Pavia, Salinas and Houston scenes, whereas their performance was relatively poor on the Indian Pines scene, which had lower resolution. By contrast, FCN-Pyramid performed well only on the Salinas and Indian Pines scenes, which consist mainly of planar objects, and it did not offer significant advantages over DBAM and DFSL owing to the small number of labelled samples. For all four HSIs, the highest overall accuracy was obtained by using the method proposed in this paper (p < 0.005). For example, on the University of Pavia scene, the average overall accuracy of TrmGLU-Net+Aug was more than 7.52% higher than that of SVM and 3.57% higher than that of FCN-Pyramid, which is also an image-level classification method. The advantage of TrmGLU-Net+Aug became more obvious when classifying the more challenging Indian Pines scene: its accuracy was nearly 23.9% higher than that of SVM and 5.03% higher than that of FCN-Pyramid.
Figures 7-10 show the maps obtained with the different classification algorithms. The maps show that isolated noise points often occur in the results of pixel-based or patch-level HSI classification methods. In contrast, classification maps with better visual results were obtained by using image-level classification methods (e.g. FCN-Pyramid and TrmGLU-Net+Aug), though their misclassified samples were often distributed in blocks. The results of the qualitative evaluation in Figures 7-10 are consistent with the results of the quantitative evaluation in Tables 3-6. In both cases, the proposed TrmGLU-Net+Aug obtained the best classification results, demonstrating its effectiveness.

Conclusion
By utilizing the contextual information in a hyperspectral image, this paper obtained the classification results of the whole scene directly by inputting the whole image into the segmentation model. A non-local TrmGLU-Net is proposed for the small-sample classification of hyperspectral images. The proposed network has a wider receptive field and can attend to both local and global features, which greatly improves the accuracy of HSI classification. To obtain high-accuracy classification with small sample sizes, a superpixel label expansion method is also proposed, which increases the number of labelled samples and further improves classification accuracy. The classification performance of the proposed method on four popular HSI datasets is compared with that of other classification methods, including 3D-CNN, DBAM, DFSL, CNN-MAML and FCN-Pyramid. The results show that the proposed method achieves higher classification accuracy than the others.
However, inputting the whole image places heavy demands on hardware, particularly GPU memory. We will propose corresponding strategies to address this issue in subsequent studies.
efficiency, and both contextual and local information must be used for HSI classification. Considering these facts, a window-based multi-head attention (W-MSA) model (B. Liu et al., 2021) was adopted in this study.

Figure 2. Schematic of the W-MSA model.
the model can be called a global multi-head attention model, which attempts to establish the dependency pixel by pixel in the whole image. Compared with global multi-head attention models, the W-MSA model in Figure 2 can significantly reduce computational efforts from $O(H^2 \times W^2 \times k)$ to

Figure 3.
Figure 3 shows that, in the image-level HSI classification task, only a small number of labelled samples

Figure 5. Variation of the overall accuracy with the number of segments.

Table 1. The properties of the three HSIs.

Table 3. Classification results with different methods on the University of Pavia dataset (%).

Table 4. Classification results with different methods on the Indian Pines dataset (%).