DSS-TRM: deep spatial–spectral transformer for hyperspectral image classification

ABSTRACT In recent years, the wide use of deep learning based methods has greatly improved the classification performance of hyperspectral images (HSIs). As an effective way to improve the performance of deep convolutional networks, the attention mechanism is also widely used for HSI classification tasks. However, the majority of existing attention mechanisms for HSI classification are built on convolution layers, and the classification accuracy still has room for improvement. Motivated by the latest self attention mechanisms in natural language processing, a deep transformer is proposed for HSI classification in this paper. Specifically, deep transformers along the spectral dimension and the spatial dimension are explored respectively. Then, a deep spatial-spectral transformer (DSS-TRM) is proposed to improve the classification performance of HSI. The contribution of this paper is to make full use of the self attention mechanism, that is, to use transformer layers instead of convolution layers. More importantly, the proposed DSS-TRM realizes end-to-end HSI classification. Extensive experiments are conducted on three HSI data sets. The experimental results demonstrate that the proposed DSS-TRM outperforms traditional convolutional neural networks and attention based methods.


Introduction
Hyperspectral images (HSIs) provide the spectral and spatial information of ground objects at the same time (L. Zhang et al., 2016). This abundant information can be used to distinguish different classes of ground objects. However, it also leads to high-dimensional data, which aggravates the "curse of dimensionality" problem. In this context, the critical issue of HSI classification is how to use the abundant spatial-spectral information (Liu et al., 2018).
To take advantage of the rich spectral information, numerous traditional machine learning methods have been introduced into HSI classification tasks, including k-nearest neighbor, support vector machine (SVM) (Fauvel et al., 2008), logistic regression (J. Li et al., 2013), extreme learning machine (W. Li et al., 2015) and random forest (Peerbhay et al., 2015). Simultaneously, spectral dimension reduction methods such as principal component analysis (PCA) (Yang et al., 2017), independent component analysis (ICA) and linear discriminant analysis (LDA) (C. H. Li et al., 2011) are utilized to improve the classification efficiency of HSI.
In remote sensing images, the closer two pixels are, the more likely they are to belong to the same class of object (Guo & Zhu, 2019). This means that considering spatial neighborhood information in the classification process helps to improve the classification accuracy. One of the most common ways to consider spatial information is to introduce neighborhood information into the feature extraction process. Feature extraction methods that consider both spectral and spatial information are called spatial-spectral feature extraction methods, and classification based on spatial-spectral features has become the mainstream approach in HSI classification. Classic spatial-spectral feature extraction methods include local binary patterns (LBP) (Sen Jia, Hu et al., 2017), Gabor features (S. Jia, Deng et al., 2017) and morphological profiles. Compared with using spectral features alone, these methods greatly improve the classification performance of HSI. However, their biggest disadvantage is that they rely on hand-crafted feature extraction rules.
Deep learning methods can automatically learn hierarchical feature representations in an end-to-end manner without hand-crafted feature engineering, so HSI classification methods based on deep learning have become a research hotspot in recent years (Audebert et al., 2019; S. Li et al., 2019; L. Ma et al., 2019). Typical deep learning methods include the deep belief network (DBN) (T. Li et al., 2014), stacked auto-encoder (SAE) (Chen et al., 2014), recurrent neural networks (RNNs) (Mou et al., 2017a) and the convolutional neural network (CNN) (Guo & Zhu, 2019; Lee & Kwon, 2017; Yu et al., 2020; M. Zhang et al., 2018). These deep learning based methods have been widely used for HSI classification, and CNNs in particular have achieved great success. 1D-CNNs were first used to extract spectral features in HSI, but they require the input to be a one-dimensional vector. To make fuller use of spatial-spectral information, many researchers designed 2D-CNN models to extract discriminative features. However, a 2D-CNN, usually combined with dimension reduction methods such as PCA, tends to miss channel relationships and lacks detailed spectral information, so 3D-CNNs were proposed to extract spectral-spatial features. In addition, residual networks (Mou et al., 2017b; Xue et al., 2021), densely connected networks and other modern network structures have been introduced to make networks for HSI classification easier to train.
Although the aforementioned deep learning based methods make great progress on HSI classification performance, how to obtain higher classification accuracy with fewer labeled samples has always been a goal of HSI classification research (Wang et al., 2021). Attention is an inherent signal processing mechanism of the human brain: it quickly selects the areas that need attention from visual signals, commonly known as the attention focus, and then concentrates on the details of those areas. This attention mechanism of human vision greatly improves the efficiency and accuracy of visual information processing (Huang et al., 2021; Xu et al., 2021). Inspired by this, researchers introduced attention mechanisms into deep learning models for visual and natural language processing tasks to improve model performance. As an effective way to improve the performance of CNNs, the attention mechanism is also widely used for HSI classification tasks. For example, an attention mechanism has been introduced into a ResNet to make the model learn more discriminative spatial-spectral features (Haut et al., 2019). A spectral-spatial attention network (SSAN) was designed to capture discriminative features from HSI cubes. A 3D attention module has also been introduced to enhance the expressiveness of the features.
Attention mechanisms for HSI classification have achieved significant improvement. However, the majority of existing attention mechanisms for HSI classification are built on convolution layers, and the classification accuracy still has room for improvement because CNNs are not good at modeling long-distance dependencies and obtaining global context information (Dosovitskiy et al., 2020; Tan et al., 2021). By contrast, the transformer model can better utilize global context information over a large range by treating the input image as a sequence of patches. Based on the self attention mechanism, transformers were first proposed for machine translation and have since become the state-of-the-art method in many natural language processing (NLP) tasks (Vaswani et al., 2017). On account of these NLP successes, multiple works have tried combining CNN-like architectures with self attention. Motivated by the latest self attention mechanisms in NLP, a deep transformer is proposed for HSI classification in this paper. Specifically, deep transformers along the spectral dimension and the spatial dimension are explored respectively. Then, a deep spatial-spectral transformer (DSS-TRM) is proposed to improve the classification performance of HSI.
The main contributions of this article are as follows. Firstly, we explore the application of the self attention mechanism to improve the classification accuracy of HSI, which provides a new method for HSI processing and analysis. Secondly, a DSS-TRM is proposed to attend to the discriminative features of the spectral dimension and the spatial dimension respectively, which enables the model to obtain higher classification accuracy. Thirdly, extensive experiments are conducted on three HSI data sets. As far as we know, this is the first time that the transformer model has been used to extract features along the spectral and spatial dimensions respectively, which provides a new direction for the study of hyperspectral image classification. The experimental results demonstrate that the proposed DSS-TRM outperforms traditional convolutional neural networks and attention based methods.

Proposed framework
In this work, we develop a novel classification framework (DSS-TRM) for HSI classification. As shown in Figure 1, the proposed transformer-based framework consists of a spectral self attention model (SpecSAM) and a spatial self attention model (SpatSAM). The SpecSAM learns to attend to features along the spectral dimension, and the SpatSAM learns to attend to features along the spatial dimension. The features extracted by SpecSAM and SpatSAM are fused and input to the classifier, which enables the proposed framework to make better use of spectral and spatial information to improve the classification accuracy.

Transformer
The transformer is a deep learning model based on the self attention mechanism and a feed-forward neural network. Different from the convolution layers commonly used in HSI classification, a transformer layer computes feature representations entirely through self attention, which yields rich and robust feature representations. A trainable deep learning model based on the transformer can be built by stacking several transformer layers. More specifically, this deep model can be divided into three parts: positional encoding, self attention and the feed-forward network, as shown in Figure 2.

Positional encoding
Different from CNN or RNN models, transformer layers contain no convolution and no recurrence. Therefore, in order to enable a model based on transformer layers to make use of the order of the sequence and position information, positional encoding is introduced. The positional encoding operation outputs a positional vector with the same dimension as the input feature vector. Formally, sine and cosine functions of different frequencies are used to produce the positional vectors:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),   (1)

where pos denotes the position index of the feature vector. Let the length of the feature sequence be L; then the range of pos is 0, 1, . . ., L-1. d_model is the feature dimension, and i = 0, 1, . . ., d_model/2 - 1 denotes the index of the feature dimension. After positional encoding, information about the relative or absolute position is injected into the original feature vectors. It should be noted that the positional encoding operation is performed only before the first transformer layer, and the sum of the positional vectors and the original feature vectors is the input of the first transformer layer.
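As a concrete sketch, the sinusoidal encoding of Eq. (1) can be computed as follows. This is a NumPy illustration rather than the authors' implementation; the names `seq_len` and `d_model` are our own.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even feature indices use sine,
    odd indices use cosine, as in Eq. (1)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) position indices
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) frequency indices
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)              # PE(pos, 2i+1)
    return pe
```

The resulting matrix is simply added to the embedded feature sequence before the first transformer layer.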

Self attention
The self attention mechanism is the core part of a transformer layer. It can be regarded as a mapping function that maps a query vector and a set of key-value pairs to an output. The output of the self attention mechanism is the weighted sum of the value vectors, and the weight assigned to each value vector is calculated from the query vector and the corresponding key vector. In this work, scaled dot-product attention is used:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,   (2)

where Q = XW^Q, K = XW^K and V = XW^V are the query matrix, key matrix and value matrix, respectively. X is the feature vector sequence, W^Q, W^K and W^V are learned parameter matrices, and d_k is the dimension of the feature vectors.
Furthermore, the multi-head attention (MHA) mechanism is introduced to improve the performance of the self attention model. Specifically, in a transformer layer, multiple query, key and value matrices are generated simultaneously, and multiple output features are produced according to Eq. (2). This means that one input vector corresponds to multiple output vectors, which enables the model to extract richer feature representations. The multiple output vectors are then concatenated and multiplied by a parameter matrix to obtain the final output vectors. Formally, the multi-head attention mechanism is:

MultiHead(X) = Concat(head_1, . . ., head_h) W^O,
head_i = Attention(XW_i^Q, XW_i^K, XW_i^V),

where W^O is a learned parameter matrix and W_i^Q, W_i^K, W_i^V are the parameter matrices of the ith head. Just as different convolution kernels extract different features, different heads of the MHA mechanism in a transformer layer can learn different attentions.
As shown in Figure 2, the self attention mechanism maps a feature vector sequence X to another feature vector sequence Z containing information about the original words or pixels and the relationships between them.
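The mapping from X to Z can be sketched in NumPy as follows. This is an illustrative single-sequence version of Eq. (2) and multi-head attention, not the authors' PyTorch code; `heads` and `W_o` are placeholder names for the per-head and output projection matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (2): each output row is a weighted sum of the value vectors."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq, seq) attention map
    return weights @ V

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) projection matrices, one triple per head.
    The head outputs are concatenated and multiplied by W^O."""
    outs = [scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
            for (Wq, Wk, Wv) in heads]
    return np.concatenate(outs, axis=-1) @ W_o
```

Each row of the softmax weight matrix sums to 1, so every output vector is a convex combination of the value vectors of the whole sequence, which is what gives the transformer its global receptive field.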

Feed-forward network
The feed-forward neural network is the last part of a transformer layer. In the proposed method, two fully connected layers are used to build the feed-forward network, and the features are calculated by:

FFN(z_i) = max(0, z_i W_1 + b_1) W_2 + b_2,

where i is the feature vector index and W_1, b_1, W_2, b_2 are the weights and biases of the two fully connected layers. It should be noted that the parameters of the feed-forward network are shared across all features in the corresponding transformer layer. In addition, as shown in Figure 2, residual connections are introduced around each self attention layer and each feed-forward network layer to improve the trainability of the model and make full use of the features extracted at different levels.
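A minimal sketch of the position-wise feed-forward block with its residual connection follows. The ReLU between the two fully connected layers is our assumption, following the standard transformer formulation; the paper only states that two fully connected layers are used.

```python
import numpy as np

def feed_forward_block(Z, W1, b1, W2, b2):
    """Position-wise FFN with a residual connection.

    The same parameters (W1, b1, W2, b2) are shared across all
    positions in the sequence Z of shape (seq_len, d_model).
    """
    hidden = np.maximum(0.0, Z @ W1 + b1)  # first FC layer (ReLU assumed)
    return Z + hidden @ W2 + b2            # second FC layer plus residual
```

Because of the residual path, the block reduces to the identity when the weights are zero, which is part of what makes deep stacks of transformer layers trainable.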

Deep spatial-spectral transformer
Compared with a convolutional neural network, the transformer has fewer training parameters and can learn more abstract features by using the self attention mechanism, so it can improve the accuracy of classification and recognition tasks. HSIs provide not only spectral information but also abundant spatial information. Consequently, we apply the transformer model along both the spectral and spatial dimensions. As shown in Figure 1, we call the proposed framework DSS-TRM.

Pixel embedding
The transformer takes a feature sequence as input, so the hyperspectral data cube needs to be transformed into a feature sequence. Each sample of an HSI is a pixel, so we call this conversion process pixel embedding. DSS-TRM consists of a SpecSAM and a SpatSAM. For the spectral dimension, we use a convolution layer to transform the image block of each band into a one-dimensional feature vector, so the length of the feature sequence is equal to the number of bands. The feature sequence is then input into the SpecSAM to learn to attend to features along the spectral dimension. Note that the SpecSAM is a model with several stacked transformer layers.
For the spatial dimension, PCA is applied to the hyperspectral data cube and the first three principal components are selected. Referring to related research on transformers for image processing (Dosovitskiy et al., 2020), the image block (the selected three bands) is divided into 16 patches of the same size along the spatial dimension, from top to bottom and from left to right. Similar to the spectral dimension, a convolution layer maps the 16 patches into 16 one-dimensional feature vectors, whose length is equal to the number of convolution kernels. In this way, the feature vector sequence along the spatial dimension is obtained, and the sequence is input to another stacked transformer model (SpatSAM).
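The patch-splitting step for SpatSAM can be sketched as follows. This is a NumPy illustration under our reading of the text: a flattening reshape stands in for the convolutional embedding layer, and `n_per_side` is our own name (4 per side gives the 16 patches described above).

```python
import numpy as np

def spatial_pixel_embedding(block, n_per_side=4):
    """Split an H x W x C block into n_per_side^2 patches, ordered
    top-to-bottom and left-to-right, and flatten each into a vector."""
    H, W, C = block.shape
    ph, pw = H // n_per_side, W // n_per_side
    patches = []
    for r in range(n_per_side):          # top to bottom
        for c in range(n_per_side):      # left to right
            p = block[r*ph:(r+1)*ph, c*pw:(c+1)*pw, :]
            patches.append(p.reshape(-1))
    return np.stack(patches)             # (16, ph*pw*C) for the defaults
```

For the 32 × 32 × 3 input used in this work, each of the 16 patches is 8 × 8 × 3 before embedding.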
It should be noted that SpecSAM and SpatSAM actually focus on different features, even though both inputs are image blocks during pixel embedding. In SpecSAM, the image block of each band is converted into a feature vector, and the feature vectors are arranged in band order. In SpatSAM, the image block composed of the three principal components is divided into patches in spatial order before pixel embedding, so the generated feature sequence is arranged in spatial order. In addition, PCA makes SpatSAM more suitable for learning spatial features, while the input of SpecSAM retains all the spectral information.

Feature fusion
The SpecSAM and SpatSAM are responsible for extracting important spectral and spatial features, respectively. The extracted spectral and spatial features are each input into a multilayer perceptron (MLP) with two fully connected layers. DSS-TRM then fuses the spectral and spatial features to further improve the classification performance. In this work, we tested three feature fusion methods: concatenating the spectral and spatial features, point-wise addition, and point-wise multiplication. The fused features are input into an MLP to output the label.
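The three fusion options can be written in a few lines. This is a sketch for illustration; the function and mode names are our own.

```python
import numpy as np

def fuse_features(spec_feat, spat_feat, mode="mul"):
    """The three feature fusion methods compared in this work."""
    if mode == "concat":   # first method: concatenation
        return np.concatenate([spec_feat, spat_feat], axis=-1)
    if mode == "add":      # second method: point-wise addition
        return spec_feat + spat_feat
    if mode == "mul":      # third method: point-wise multiplication
        return spec_feat * spat_feat
    raise ValueError(f"unknown fusion mode: {mode}")
```

Note that concatenation doubles the feature dimension seen by the final MLP, while addition and multiplication preserve it.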
The proposed DSS-TRM is an end-to-end framework. It could be trained by the back propagation algorithm. In this work, the widely used Adam (adaptive moment estimation) optimizer is adopted to train the framework.

Experiments
In this section, extensive experiments are carried out and the results are analyzed in detail to demonstrate the effectiveness of the proposed method. All the algorithms are implemented with the PyTorch library, and all experimental results are generated on a computer equipped with an Intel(R) Xeon(R) Gold 6152 CPU, an NVIDIA A100 PCIe GPU and 256 GB of memory.

Data description and experimental design
Three widely used hyperspectral data sets are selected for the experiments to verify the effectiveness of the proposed method in HSI classification. The data sets are described in detail below.
The first data set, University of Pavia (UP), was collected by the Reflective Optics Spectrographic Imaging System (ROSIS) over the city of Pavia, Italy. The spatial size of the data set is 610 × 340 pixels with a 1.3 m/pixel spatial resolution, and the spectral range covers from 430 to 860 nm with 103 bands after removing 12 noisy bands. Besides the unlabeled pixels, the data set contains 9 manually labeled classes. The number of training and testing samples used in the experiments is listed in Table 1.
The second data set, Salinas (SA), was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the Salinas Valley, CA, USA. The spatial size of the data set is 512 × 217 pixels with a 3.7 m/pixel spatial resolution, and the spectral range covers from 400 to 2500 nm with 204 bands after removing 20 noisy bands. Besides the unlabeled pixels, the data set contains 16 manually labeled classes. The number of training and testing samples used in the experiments is listed in Table 2.
The third data set, Indian Pines (IP), was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over north-western Indiana. The spatial size of the data set is 145 × 145 pixels with a 20 m/pixel spatial resolution, and the spectral range covers from 400 to 2500 nm with 200 bands after removing 20 noisy bands. Besides the unlabeled pixels, the data set contains 16 manually labeled classes. Referring to related research, however, only 9 labeled classes are used in this paper to avoid classes with very few training samples. The number of training and testing samples used in the experiments is listed in Table 3.
In the experiments, the data cube within the pixel neighborhood is used for pixel embedding. Specifically, for the spectral dimension, 16 × 16 × C cubes around each pixel (C denotes the number of bands in the HSI) are selected as input data. For the spatial dimension, 32 × 32 × 3 cubes around each pixel after dimensionality reduction are selected as input data. Large neighborhoods are selected to make full use of the spectral and spatial information in the HSI and further improve the classification accuracy. It should be noted that the pixel to be classified is located at the center of the patch; since the patch sizes are even, the right and bottom sides have one more row or column of pixels than the left and top sides.
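The cube extraction with this even-size centering convention can be sketched as follows. This is an illustration under our reading of the text; the image is assumed to be padded so the window never leaves its bounds, and the function name is our own.

```python
import numpy as np

def extract_cube(image, row, col, size):
    """Cut a size x size neighborhood around (row, col).

    For even sizes the right/bottom sides get one extra row/column
    of pixels relative to the left/top sides, matching the paper's
    convention. The caller is assumed to have padded the image.
    """
    off = (size - 1) // 2  # left/top offset from the center pixel
    return image[row - off: row - off + size,
                 col - off: col - off + size, :]
```

For size = 16, the center pixel sits 7 pixels from the top/left edge and 8 from the bottom/right edge of the patch.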

Hyperparameter settings
Similar to other related research, the classification performance of the proposed method is further improved by searching for the optimal hyperparameters. The influence of three main hyperparameters (model depth, learning rate and feature fusion method) on classification accuracy is explored, and the detailed experimental results and analyses are as follows.
The proposed method, DSS-TRM, mainly consists of transformer layers. The number of transformer layers directly determines the depth of the model and mainly affects its feature representation ability. To simplify the search space of model structures, the depths of SpecSAM and SpatSAM are kept consistent in the experiments. The number of transformer layers is denoted by the variable L, with its range set to 2, 4, 6, 8, 10. Figure 3 shows the relationship between model depth and overall classification accuracy. As can be seen, as the number of transformer layers increases, the overall classification accuracy first rises and then declines on all three HSI data sets. For the UP and IP data sets, the optimal value of L is 6; for the SA data set, the optimal value is 8.
The value of the learning rate directly affects the training of the model. An appropriate learning rate enables the model to train well with limited labeled samples and thus achieve higher classification accuracy. Since the loss value is the most direct index of training quality, the influence of different learning rates on the training loss is explored, and the results are presented in Figure 4. It can be seen that a larger learning rate (lr = 0.0001, where lr abbreviates learning rate) always gives the model a smaller and more stable training loss, which means the model has stronger abstract representation ability. Therefore, the learning rate is uniformly set to 0.0001 in the experiments.

Next, we analyze the influence of different feature fusion methods on the classification accuracy; Table 4 summarizes the results. In Table 4, SpecSAM means that only the features along the spectral dimension extracted by the SpecSAM model are used for classification, and SpatSAM means that only the features along the spatial dimension extracted by the SpatSAM model are used. The symbol - denotes concatenation of features, + denotes point-wise addition, and × denotes point-wise multiplication; these three operations jointly utilize the features extracted along the spatial and spectral dimensions. The statistics show that, on the whole, the classification accuracy of SpecSAM or SpatSAM alone is lower than that obtained by using spectral and spatial features simultaneously. This indicates that using both spatial and spectral information further improves the classification accuracy, and it also verifies the effectiveness of the structure of the proposed model.
Among the three feature fusion methods (concatenation, addition and multiplication), the multiplication operation always enables the model to obtain higher overall classification accuracy. Therefore, point-wise multiplication is selected as the feature fusion method in the subsequent experiments.
In addition, other hyperparameters and basic experimental settings follow the relevant literature. In the training process, the number of iterations is set to 600, the batch size is set to 64, and the Adam algorithm is used for optimization to ensure that the network parameters are fully updated and optimized. In the self attention mechanism, the number of heads is 8. In the pixel embedding process, the number of convolution kernels is uniformly set to 128, so the dimension of the feature vectors extracted along both the spectral and spatial dimensions is 128. At the end of the model, the MLP that serves as the classifier is composed of two fully connected layers with 128 and K neurons, respectively (K denotes the number of classes in the HSI). The cross entropy loss is used as the loss function for model training.

Results and analysis
To verify the advantages of the proposed method in HSI classification, two machine learning methods, two classical CNN-based methods and two advanced methods based on attention mechanism are selected for comparative experiments. These methods are briefly described below.
RBF-SVM (Radial Basis Function-SVM): A classical classifier widely used in HSI classification. When processing high-dimensional data, RBF-SVM can achieve better classification performance compared with other machine learning classifiers.
EMP+SVM (Extended Morphological Profiles + SVM): EMP features are first extracted from the HSI, and then an RBF-SVM is used to complete the classification. Compared with RBF-SVM alone, the introduction of EMP features makes better use of the spatial features in the HSI and thus yields higher classification accuracy. In the experiments, the parameters of EMP are set by referring to the relevant literature.
3D-CNN (Y. Li et al., 2017): A classical supervised deep learning model that fully extracts the spatial-spectral features in HSI utilizing 3D convolution. Specifically, this method consists of two 3D convolutional layers and one fully connected layer.
S-CNN (Liu et al., 2018): In this method, pixel pairs are taken as inputs and, based on the CNN model, the loss function is modified to realize metric learning, ensuring that samples of the same class cluster together and different classes separate from each other in the deep metric space.
DBMA: Short for the double-branch multi-attention mechanism network. In the DBMA model, two network branches are built to extract spectral and spatial features in the HSI respectively. In addition, two different types of attention mechanism are applied in the two branches respectively, so as to further improve the classification accuracy.
CACNN: This method first extracts spectral-spatial features in the HSI using 2D and 3D CNNs respectively, then utilizes a NonLocalBlock, a typical attention mechanism, to combine the two kinds of features. Finally, a deep multilayer feature fusion strategy combines the features of different hierarchical layers to further improve the classification accuracy.
For a fair comparison, all methods are trained with 200 labeled samples. The hyperparameters and basic experimental settings of the comparison methods are consistent with the relevant literature. Overall classification accuracy (OA), average classification accuracy (AA) and the kappa coefficient are selected as the quantitative evaluation metrics. In addition, to reduce the fluctuation of the classification results caused by the randomness of sample selection, the average of 10 experiments is used as the final result to measure the classification performance.
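The three metrics can be computed from a confusion matrix as follows. This is a self-contained sketch with our own function name; it follows the standard definitions of OA, AA and Cohen's kappa.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Overall accuracy, average (per-class) accuracy and kappa."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                                  # OA
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))             # AA
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

Kappa corrects OA for the agreement expected by chance, which is why it is a stricter measure on data sets with unbalanced class sizes.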
Tables 5-7 list the experimental results of different methods on the three HSI data sets. Several observations can be obtained from the statistical results in the tables.
(1) The classification performance of traditional classification methods is worse than that of deep learning based methods. The traditional methods are all shallow models that cannot make full use of the deep features in HSI, so they cannot obtain satisfactory classification results. More specifically, the classification accuracy of EMP+SVM is significantly higher than that of SVM on all three HSI data sets, indicating that the introduction of EMP features effectively improves the classification performance.
(2) Both 3D-CNN and S-CNN use convolution operations to extract the spatial-spectral features in HSI, so they effectively improve the classification accuracy compared with traditional methods. Furthermore, S-CNN achieves metric learning by improving the network structure and loss function, so its classification performance is generally better than that of 3D-CNN.
(3) The introduction of attention mechanisms can further improve the classification performance of deep learning models. On the three HSI data sets, the classification performance of the methods equipped with attention mechanisms is better than that of the general deep learning methods. For example, the OA of DBMA, CACNN and DSS-TRM on the SA data set is above 98.5%, which is about 2.5-3.2% higher than that of 3D-CNN and S-CNN.
(4) Among all the classification methods in Tables 5-7, the proposed DSS-TRM obtains the best classification results. Compared with other deep learning models, the advantages of DSS-TRM lie in its full use of spatial-spectral information and the self attention mechanism. On the one hand, the proposed method first extracts deep features along the spectral and spatial dimensions respectively and then performs feature fusion, which makes full use of the spatial-spectral information in HSI. On the other hand, building a backbone network by stacking transformer layers containing the self attention mechanism, and introducing residual connections for feature reuse, enables the model to focus on the deep features beneficial to the classification task, thereby obtaining better classification performance.
The average classification accuracy shows the classification performance of the different methods from a statistical point of view. To compare the stability of the classification results, three box plots are drawn based on OA. In Figure 5, different colors distinguish different classification methods, and circles (○) represent outliers in the experimental results. In general, the stability of the classification results of deep learning models is better than that of traditional classification methods, and the introduction of attention mechanisms further improves the stability. The box corresponding to the proposed method has the smallest length, indicating that its classification results are the most stable among all the methods.
Finally, we draw classification maps using the label predictions of the different methods and compare the classification results from a visual perspective. Compared with quantitative measurements, classification maps display the results more intuitively. As shown in Figures 6-8, as OA increases, the misclassification and noise in the classification maps gradually decrease. The classification map of the proposed method is the closest to the ground truth, indicating that its classification results best restore the real distribution of surface features.

Influence of the number of training samples
Deep learning models need enough labeled samples for network optimization, while in practice it is very difficult to accurately label the pixels in an HSI. Therefore, deep learning models for HSI classification should be adaptable to changes in the number of training samples. To explore the classification performance of different methods as training samples decrease, the number of training samples is reduced from 200 to 100 at intervals of 20. As can be seen from Figure 9, with the reduction of training samples, the classification accuracy of all methods gradually decreases. The accuracy curves of the three attention based deep learning models, DBMA, CACNN and DSS-TRM, change relatively smoothly, indicating that they possess better adaptability to changes in the number of training samples. In addition, the accuracy curve of the proposed method is always higher than those of the other methods, indicating that it possesses the best classification performance as the training samples are gradually reduced.

Influence of the spatial size of the input cubes
In the experiments, cubes of a certain size around the center pixels are selected as the inputs of the model, so as to make full use of the spatial-spectral information in HSIs. Obviously, the spatial size of the input cubes affects the classification accuracy to a certain extent. Therefore, by combining cubes of different sizes along the spectral and spatial dimensions, the influence of the spatial size of the input cubes on the classification results of the proposed method is explored. It can be seen from Table 8 that when the size of the cubes along the spectral dimension is fixed, the classification accuracy rises as the size of the cubes along the spatial dimension increases. When the size of the cubes along the spatial dimension is fixed, the classification accuracy first rises and then declines with the increase of the size of the cubes along the spectral dimension.

Conclusion
In this work, we introduce the transformer model, widely used in natural language processing, into HSI classification. Based on the transformer, we build a self attention model of the spectral dimension and a self attention model of the spatial dimension, and then fuse the features of the two models into spatial-spectral features for final classification. The proposed framework (DSS-TRM) uses the self attention mechanism to extract important features in the spectral and spatial dimensions; therefore, compared with CNNs, DSS-TRM improves the classification accuracy. Experimental results on three real HSI data sets demonstrate that DSS-TRM outperforms CNNs and CNN based attention models. Moreover, this work provides a novel means for HSI processing and analysis.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Table 8. OA of the proposed method when the spatial size of the input cubes is changed.