Spectral–Spatial Features Exploitation Using Lightweight HResNeXt Model for Hyperspectral Image Classification

Abstract Hyperspectral image classification is vital for various remote sensing applications; however, it remains challenging due to the complex and high-dimensional nature of hyperspectral data. This paper introduces a novel approach to address this challenge by leveraging spectral and spatial features through a lightweight HResNeXt model. The proposed model is designed to overcome the limitations of traditional methods by combining residual connections and cardinality to enable efficient and effective feature extraction from hyperspectral images, capturing both spectral and spatial information simultaneously. Furthermore, the paper includes an in-depth analysis of the learned spectral–spatial features, providing valuable insights into the discriminative power of the proposed approach. The extracted features exhibit strong discriminative capabilities, enabling accurate classification even in challenging scenarios with limited training samples and complex spectral variations. Extensive experimental evaluations are conducted on four benchmark hyperspectral data sets: Pavia University (PU), Kennedy Space Center (KSC), Salinas scene (SA), and Indian Pines (IP). The performance of the proposed method is compared with state-of-the-art methods. The quantitative and visual results demonstrate the proposed approach's superiority in classification accuracy, noise robustness, and computational efficiency. The HResNeXt obtained overall accuracies on PU, KSC, SA, and IP of 99.46%, 81.46%, 99.75%, and 98.64%, respectively. Notably, the lightweight HResNeXt model achieves competitive results while requiring fewer computational resources, making it well-suited for real-time applications.


Introduction
Hyperspectral images contain hundreds of continuous spectral bands that can be utilized to distinguish between various substances. As a result, hyperspectral images are now widely recognized as a crucial data source in remote sensing for object recognition and classification. Numerous classification strategies, notably supervised models, have been developed for labeling hyperspectral data. Supervised classification methods have been applied to many classification tasks using the random forest (Sun et al. 2019; Joelsson et al. 2005; Gadekallu et al. 2023) and the support vector machine (SVM; Ravi et al. 2022; Saab et al. 2022). A random forest is an ensemble method: a collection of decision trees is generated from randomly selected subsamples of the training data, and the final class of a test sample is chosen either by majority vote or by the maximum a posteriori (MAP) rule. In contrast, an SVM seeks a hyperplane that maximizes the separation between classes. However, "shallow" models such as the random forest and SVM (Melgani and Bruzzone 2004; Waske et al. 2010) are considered inferior to "deep" networks that can obtain hierarchical, deep feature representations (Guo et al. 2022; Mou et al. 2020).
Many supervised approaches for HSI classification (Audebert et al. 2019) have been proposed over the past 20 years. In the early days of HSI classification, mainly spectral information was exploited. Standard spectral classification using the SVM was reported in He and Chen (2021). In addition, several SVM-based classifiers (Deng et al. 2018; Ghamisi et al. 2017) have been proposed to manage the land cover classification of HSI (Chen et al. 2022). However, SVM-based methods handle very high dimensionality poorly. The spatial features of HSI have been extracted using a variety of morphological operations, including morphological profiles (MPs; Chen et al. 2022), extended morphological profiles (EMPs; Benediktsson et al. 2005), extended multi-attribute profiles (EMAPs; Dalla Mura et al. 2011), and extinction profiles (EPs; Fang et al. 2018).
The use of deep learning algorithms in processing remote sensing images, particularly in HSI classification (Wang et al. 2023; Xu et al. 2022; Ji et al. 2023), has the potential to radically transform the field. Depending on the features utilized in the classification process, deep learning-based HSI classification strategies can be grouped into three primary categories: networks based on spatial information, networks based on spectral properties, and hybrid networks. Both spectral-spatial feature-based networks and spatial feature-based networks have received increasing attention in recent years (Zhuo et al. 2022; Fu et al. 2023), because HSI classification depends not only on spatial information but also on spectral information.
The R-VCANet (Pan et al. 2017), Bayesian 2D convolutional neural networks (CNNs; Cao et al. 2018), and the squeeze multi-bias network (SMBN; Fang et al. 2019) are a few examples that have been used for land cover classification using spatial features. On the other hand, an HSI has an excessive number of channels, which frequently results in overly deep two-dimensional convolution kernels and a significantly increased number of parameters. Consequently, HSI classification methods can also be based on three-dimensional CNNs. A deep contextual CNN (Bashir et al. 2023) employs several three-dimensional local convolutional filters of various sizes and enables simultaneous utilization of an HSI's spatial and spectral components. To enhance the extraction of the essential spectral-spatial aspects of HSIs, Chen et al. (2016) created a three-dimensional CNN-based feature extraction model with regularization.
Ben Hamida et al. (2018) created a new three-dimensional deep learning technique to process spectral and spatial data concurrently using less computing power (i.e., fewer floating point operations, FLOPs). Even though the overall number of parameters of a 3D CNN may be smaller, it still needs more processing resources than a 2D CNN, due to the depths it must penetrate and the absence of a bird's-eye perspective of the spectral data. A more recent approach developed by Roy et al. (2020) combined 3D and 2D CNN layers in a deep learning model called HybridSN. The HybridSN model improved classification accuracy through joint exploitation of the spectral and spatial features.
However, despite their impressive HSI classification performance, deep learning models are difficult to train because labeled pixels are hard to obtain and expensive to annotate. In addition, the pixels among these labeled data are not distributed equitably. When dealing with irregularly dispersed data and a limited number of samples, it is extremely challenging to construct reliable deep learning models (Hang et al. 2019; Mou and Zhu 2020) that perform well while needing few processing resources. The asymmetric inception network (AINet; Fang et al. 2022) is a novel lightweight 3D convolutional neural network. It concentrates on spectral characteristics rather than spatial context and utilizes a data-fusion transfer learning approach to speed up training and improve model initialization; however, its performance degrades on smaller sample sizes. The double-branch dual-attention (DBDA) network described in Li et al. (2020) simultaneously utilizes spectral and spatial properties and achieves high accuracy across a broad range of HSI data sets using channel and spatial attention mechanisms that enhance the feature maps.
While CNN models have advanced the state of the art in HSI classification, certain limitations remain. First, a CNN-based technique ignores some aspects of the input HSI that deserve thorough exploration. The CNN technique is vector-based; it reads the inputs as a set of pixel vectors (Linzen et al. 2016), whereas the data structure of an HSI in the spectral domain is fundamentally a sequence. Because of this, a CNN might cause information loss while processing hyperspectral pixel vectors (Vallathan et al. 2021). Second, the long-range sequential dependence across band positions is challenging to model. As the kernel size and the number of layers restrict the receptive field of CNNs, they are poorly suited to capturing long-range relationships in the input data (Peng et al. 2022); the convolutional operations focus on a small area around the input point. Because an HSI (Vaswani et al. 2017) often consists of hundreds of spectral bands, understanding its long-range correlations is challenging.
The Transformer (Glorot and Bengio 2010; Jiang and Chen 2022; Hong et al. 2022) paradigm was recently proposed for use in natural language processing. The concept of self-attention serves as the foundation of this approach. Through self-attention, the Transformer (El-Assal et al. 2022; Xie et al. 2017; Yadav et al. 2022) can infer global dependencies among a set of inputs. While training, deep learning models (Saab et al. 2022; Arikumar et al. 2022) such as Transformers frequently experience the vanishing-gradient problem, which hinders or even prevents convergence. Even though these backbone networks and their modifications (Garg et al. 2022; Grupo de Inteligencia Computacional (GIC) 2023) have shown promising classification accuracy, they still fail to adequately characterize spectral-sequence information (Sharma and Biswas 2018; Zhao et al. 2022), particularly regarding minor spectral disparities along the spectral dimension. Several recent methods are discussed in Table 1.
The main contributions of this work are as follows.
1. To reduce computational resources, spectral features are obtained through a single layer of three-dimensional convolution.
2. A modified ResNeXt network with fewer trainable parameters is utilized to improve classification accuracy and reduce computational resources.
3. Four distinct data sets and six different state-of-the-art methodologies are used to assess the model's performance.

Proposed method
We assume spectral-spatial hyperspectral data cubes, where $I_m \in \mathbb{R}^{W \times H \times D}$ is the input, $W$ is the width, $H$ is the height, and $D$ is the total number of spectral bands. A single HSI pixel in $I_m$ has $D$ spectral bands and forms a one-hot label vector $L = (L_1, L_2, \ldots, L_c)$, where $c$ stands for the number of land cover types. However, the mixed land cover classes shown in the hyperspectral pixels produce large within-class variability and numerous similarities across classes. The proposed model architecture is shown in Figure 1.
For the classification model to work, it must be reliable and effective. First, to eliminate spectral redundancy, principal component analysis (PCA) is applied along the spectral bands of the initial HSI data ($I_m$). The PCA reduces the number of spectral bands from $D$ to $S$ while the spatial dimensions (width $W$ and height $H$) remain unchanged; we minimize the number of spectral bands while preserving the essential spatial information. The reduced data cube is then broken up into small, overlapping patches, and the pixel at the center of each patch determines its ground-truth label. The 3D data are then sent to a 3D CNN. Within this network, convolution is carried out with a 3D kernel (El-Assal et al. 2022), which records spectral characteristics from contiguous bands.
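The PCA reduction and center-labeled patch extraction described above can be sketched as follows. This is a minimal NumPy sketch; the helper names, reflect padding, and default patch size are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pca_reduce(cube, n_components):
    # cube: (H, W, D) hyperspectral cube; flatten spatial dims to pixels x bands
    H, W, D = cube.shape
    X = cube.reshape(-1, D).astype(np.float64)
    X -= X.mean(axis=0)                              # center each band
    cov = X.T @ X / (X.shape[0] - 1)                 # band covariance matrix
    vals, vecs = np.linalg.eigh(cov)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return (X @ top).reshape(H, W, n_components)

def extract_patches(cube, labels, patch=15):
    # overlapping patches; the centre pixel supplies the ground-truth label
    m = patch // 2
    padded = np.pad(cube, ((m, m), (m, m), (0, 0)), mode="reflect")
    Xs, ys = [], []
    for r in range(cube.shape[0]):
        for c in range(cube.shape[1]):
            if labels[r, c] > 0:                     # skip unlabelled background
                Xs.append(padded[r:r + patch, c:c + patch, :])
                ys.append(labels[r, c] - 1)
    return np.stack(Xs), np.array(ys)
```

Each patch retains the full (reduced) spectral depth, so the subsequent 3D convolution can operate over contiguous bands.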
The activation value at spatial position $(x, y, z)$ of the $i$th feature map in the $j$th layer is calculated as

$$v_{j,i}^{x,y,z} = \phi\left(b_{j,i} + \sum_{m}\sum_{p=0}^{P-1}\sum_{q=0}^{Q-1}\sum_{r=0}^{R-1} w_{j,i,m}^{p,q,r}\, v_{(j-1),m}^{(x+p),(y+q),(z+r)}\right), \quad (1)$$

where $P$, $Q$, and $R$ are the kernel's width, height, and depth, $m$ indexes the feature maps of the $(j-1)$th layer, $b_{j,i}$ is the bias, $\phi$ is the activation function, and $w_{j,i}$ is the weight parameter of the $i$th feature map of the $j$th layer.
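Eq. (1) can be translated almost literally into code. The naive NumPy loop below computes one output feature map, assuming valid padding and a ReLU activation (an illustrative choice; not an optimized or framework implementation):

```python
import numpy as np

def conv3d_activation(prev_maps, weights, bias):
    # prev_maps: (M, X, Y, Z) feature maps of layer j-1
    # weights:   (M, P, Q, R) kernel producing one output feature map
    M, P, Q, R = weights.shape
    _, X, Y, Z = prev_maps.shape
    out = np.full((X - P + 1, Y - Q + 1, Z - R + 1), bias, dtype=np.float64)
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # quadruple sum over m, p, q, r from Eq. (1)
                out[x, y, z] += np.sum(weights * prev_maps[:, x:x+P, y:y+Q, z:z+R])
    return np.maximum(out, 0.0)   # ReLU nonlinearity
```

In practice this loop is replaced by a framework's batched 3D convolution, but the index arithmetic is the same.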

Modified ResNeXt
The features obtained from the 3D CNN layer are first reshaped and fed into the modified ResNeXt for spatial feature extraction. The original ResNeXt includes 23 × 10⁶ parameters, resulting in substantial computation costs (Xie et al. 2017). We reduced the number of trainable parameters to 9 × 10⁶ by lowering the first-layer filter size from 7 × 7 to 5 × 5, because the larger filter dampened the intensity of edge pixels and thereby increased the number of false negatives. The sizes of conv2, conv3, conv4, and conv5 are decreased similarly, as shown in Table 2. The modified ResNeXt model architecture is shown in Figure 2. The modified ResNeXt contains grouped convolution, ReLU activation, and residual blocks. The pooling layer maps the features of the CNN block obtained from the cluster of adjacent neurons. Pixels are separated, and neighboring pooling units rarely overlap, reducing overfitting (Saab et al. 2022; Yadav et al. 2022). Further, convolution is performed as the inner dot product of neurons in the convolution layer to generate the aggregated transform as follows.
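The effect of shrinking the first-layer filter from 7 × 7 to 5 × 5, and of grouped convolution, on parameter counts can be illustrated with simple arithmetic. The channel widths below are assumed for illustration only and are not the paper's exact configuration:

```python
def conv_params(c_in, c_out, k, groups=1):
    # a grouped k x k convolution holds (c_in/groups)*k*k weights per output
    # channel, plus one bias per output channel
    return (c_in // groups) * k * k * c_out + c_out

# first-layer filter shrunk from 7x7 to 5x5 (example: 3 -> 64 channels)
stem_7x7 = conv_params(3, 64, 7)    # 3*49*64 + 64 = 9472
stem_5x5 = conv_params(3, 64, 5)    # 3*25*64 + 64 = 4864

# grouped 3x3 with cardinality 32 vs. a plain 3x3 at the same width
plain   = conv_params(128, 128, 3)
grouped = conv_params(128, 128, 3, groups=32)
```

The 5 × 5 stem roughly halves the first-layer parameters, and grouped convolution divides the per-layer weight count by the cardinality, which is the mechanism behind the 23 × 10⁶ to 9 × 10⁶ reduction described above.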
$$F(x) = \sum_{i=1}^{N} w_i x_i, \quad (2)$$

where $x = (x_1, x_2, \ldots, x_N)$ is the input vector over the $N$ channels and $w_i$ is the filter weight of the $i$th neuron. To reduce the dimension depth, the elementary transform $w_i x_i$ is replaced by a more generic function called the aggregated transformation, as shown in Eq. (3):

$$F(x) = \sum_{i=1}^{N} T_i(x), \quad (3)$$
where $T_i(x)$ can be an arbitrary function. Analogous to a simple neuron, $T_i$ should project $x$ into an (optionally low-dimensional) embedding and then transform it.
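Eq. (3) is the split-transform-merge idea behind ResNeXt's cardinality. A minimal sketch, assuming each branch $T_i$ is a project-ReLU-expand pair (an illustrative choice of $T_i$, not the exact block used in the paper):

```python
import numpy as np

def aggregated_transform(x, projections, expansions):
    # split-transform-merge: each branch embeds x into a low dimension,
    # transforms it back, and the branch outputs are summed (Eq. 3);
    # the number of branches is the cardinality
    out = np.zeros(expansions[0].shape[1])
    for P, E in zip(projections, expansions):
        low = np.maximum(x @ P, 0.0)   # low-dimensional embedding + ReLU
        out += low @ E                 # transform and accumulate
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
cardinality = 8
projections = [rng.standard_normal((64, 4)) for _ in range(cardinality)]
expansions  = [rng.standard_normal((4, 64)) for _ in range(cardinality)]
y = aggregated_transform(x, projections, expansions)
```

With cardinality 1 this collapses to a single bottleneck branch; increasing the cardinality widens the set of transformations without deepening the network.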

Local response normalization (LRN)
With ReLU, normalization of the input is not required to avoid saturation of the deep learning model, and ReLU enables the neurons in the model to learn with fewer positive training examples (Arikumar et al. 2022). Despite this, we normalize the activity of neurons at a given location $(x, y)$ by employing a kernel $(k)$ to facilitate generalization. Afterward, the ReLU nonlinearity is applied in the HResNeXt. The LRN over $N$ adjacent kernel maps is determined as follows:

$$b_{x,y}^{i} = a_{x,y}^{i} \Big/ \left(t + a \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a_{x,y}^{j}\right)^{2}\right)^{b},$$

The feature map generated by the flatten layer is passed to the Softmax layer. Finally, the Softmax layer converts the probabilities into their corresponding classes.
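A sketch of the LRN computation above, using the paper's constants $t = 2$, $a = 10^{-4}$, and $b = 0.65$ as defaults. The argument names `alpha`/`beta` and the default window size `n` are our assumptions:

```python
import numpy as np

def local_response_norm(acts, n=5, t=2.0, alpha=1e-4, beta=0.65):
    # acts: (N, H, W) activations of N feature maps at each spatial position;
    # each map is normalized by the squared activity of its n neighbours
    N = acts.shape[0]
    out = np.empty_like(acts)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (t + alpha * np.sum(acts[lo:hi + 1] ** 2, axis=0)) ** beta
        out[i] = acts[i] / denom
    return out
```

Because $t = 2 > 0$, the denominator never vanishes, which is the division-by-zero safeguard the paper describes.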

Data set
The proposed method is evaluated on four standard open-access data sets: Salinas scene (SA), Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC) (Grupo de Inteligencia Computacional (GIC) 2023). The SA image was captured by the 224-band AVIRIS sensor over the Salinas Valley in California and has a high spatial resolution (3.7-m pixels). The studied area covers 512 lines by 217 samples, and 20 water-absorption bands (108-112, 154-167, and 224) were discarded. This image was available only as at-sensor radiance data. It comprises both wilderness and agricultural terrain, including vineyards. The ground truth of Salinas contains 16 classes. The IP collection consists of AVIRIS images taken over the Indian Pines test site in northwest Indiana. It has a 145 × 145 pixel size and covers wavelengths between 0.4 and 2.5 × 10⁻⁶ m. Cropland makes up two-thirds of IP, while forests and other types of perennial natural vegetation comprise the other third.
Along with low-density housing, other buildings, and smaller roads, the region contains a rail line and two major dual-lane highways. As the image was acquired in June, many crops are still in their early phases of development, with coverage of less than 5%. The ground truth contains 16 classes. The number of spectral bands utilized in this investigation was reduced to 200 after 20 water-absorption bands (104-108, 150-163, and 220) were removed from consideration. The ROSIS sensor gathered the PU data during a flight over Pavia in northern Italy. The image contains 103 spectral bands over 610 × 610 pixels, although some samples contain no information and must be discarded. Its geometric resolution is 1.3 m. The ground truth distinguishes 9 classes.
The Kennedy Space Center (KSC) in Florida was imaged by the NASA-operated AVIRIS airborne sensor on March 23, 1996. AVIRIS collects data in 224 bands of 10-nm width with center wavelengths ranging from 400 nm to 2500 nm. The KSC data have an 18-m spatial resolution. After removing water-absorption and low-SNR bands, the study utilized 176 bands. Comprehensive land cover maps were created using color infrared images captured by the Kennedy Space Center. Because many species in this area have similar spectral signatures, discriminating the vegetation is challenging. The diverse land cover in this area is organized into thirteen distinct categories. A detailed description of the data sets is given in Tables 3-6.
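As an illustration of the band-removal step applied to IP and KSC, the commonly cited water-absorption bands for Indian Pines (104-108, 150-163, and 220, 1-indexed) can be dropped as follows; the zero-filled cube is a stand-in for the real data:

```python
import numpy as np

cube = np.zeros((145, 145, 220))   # IP-sized dummy cube (rows, cols, bands)
# commonly cited water-absorption bands, 1-indexed: 104-108, 150-163, 220
drop = [b - 1 for b in list(range(104, 109)) + list(range(150, 164)) + [220]]
keep = [b for b in range(220) if b not in drop]
reduced = cube[:, :, keep]         # 220 bands -> 200 usable bands
```

The same pattern, with a different index list, yields the 176 usable KSC bands.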

Experimental setup
In the proposed study, the experiments are conducted using Python 3.8 on an NVIDIA Quadro RTX 4000 GPU (8 GB) with 128 GB of RAM. For each data set, the initial learning rate was set to 0.0001 and the model was trained for 100 epochs using the Adam optimizer with a mini-batch size of 64.
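For reference, one Adam update with the stated learning rate of 10⁻⁴ can be sketched as follows. This is a minimal implementation with the standard default moment coefficients, not the framework optimizer used in the experiments:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam update: biased first/second moment estimates, bias correction,
    # then a scaled gradient step with the paper's learning rate of 1e-4
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In a training loop, one such update is applied per mini-batch of 64 patches for 100 epochs.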

Quantitative result analysis
SVM, 1D CNN, 2D CNN, 3D CNN, HybridSN, and SpectralFormer (SF) are machine learning and deep learning-based approaches compared to gauge the proposed method's performance [42]. All parameters are kept at their literature-referenced values for consistency. The PU, KSC, and SA data sets are split into 5% for training and 95% for validation; because several classes of the IP data set contain few instances, IP is split into 10% for training and 90% for validation. For HSI classification, the SVM uses the libsvm toolbox with its two RBF parameters tuned. The 2D CNN consists of a softmax layer and three 2D convolutional blocks. The convolutional blocks of the 2D CNN use the same arrangement of convolutional layer, BN layer, max-pooling layer, and ReLU activation function as their 1D counterparts. Each 2D convolutional layer includes separate spatial and spectral feature extractors of sizes 3 × 3 (32), 3 × 3 (64), and 1 × 1 (128). The 3D CNN has two convolutional layers that use 3D max-pooling and batch normalization. HybridSN combines three 3D convolutional layers with one 2D convolutional layer. SpectralFormer uses a cross-layer skip connection to extract features in both patch- and pixel-based ways.
Local and global attention mechanisms also improve HSI classification precision. We summarize the classification performance of each method and the proposed method in Tables 7-10.
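The per-class train/validation split described above (5% for PU, KSC, and SA; 10% for IP) can be sketched as a stratified split. This is a minimal NumPy sketch; the seeding and rounding rules are illustrative assumptions:

```python
import numpy as np

def per_class_split(labels, train_frac, seed=0):
    # stratified split: take train_frac of the labelled pixels of every class,
    # so rare classes are represented in the training set
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels[labels > 0]):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        k = max(1, int(round(train_frac * idx.size)))
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return np.array(train_idx), np.array(test_idx)
```

The `max(1, ...)` guard ensures that even a class with very few labelled pixels contributes at least one training sample, which matters for the sparsely labelled IP classes.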
The SVM classification performance is lower in several classes due to a lack of high-dimensional features, whereas the 1D CNN improves classification accuracy through one-directional convolution. Further, the 2D CNN calculates spatial features via convolution in both spatial directions. The 3D CNN has a high computation cost but is capable of extracting high-dimensional spectral features. To exploit spectral and spatial characteristics and enhance classification performance, HybridSN uses 3D and 2D CNN layers. The SF network provides global and local attention over the features through a transformer, which improves accuracy; however, this requires high computation costs and large volumes of data.
The proposed HResNeXt utilizes spectral and spatial features via one 3D CNN layer and a modified 2D convolutional block of the ResNeXt network to improve classification. In addition, the computation cost is lower due to fewer parameters.

Performance evaluation on different patch sizes
Patch size plays an essential role in the computation model for HSI data. Figure 3 shows that the HResNeXt accuracy is lower at 9 × 9 and 11 × 11 patch sizes, whereas the highest classification accuracy is achieved at a 15 × 15 patch size. Increasing the patch size further reduces classification accuracy.

Visual analysis
We present class visual maps of the PU, KSC, SA, and IP data sets in Figures 4-7. In Figure 4, we can see that the land cover classification map using SVM is less close to the ground truth (GT). In contrast, the HybridSN visual map of the Asphalt class is similar to the GT. The SF utilizes global feature attention to improve the land cover classification map. The HResNeXt visual map and the GT are very close in the Trees, Bare Soil, Bitumen, and Shadows classes. Similarly, in Figure 5, we can observe that the classification maps of SVM, 1D CNN, and 2D CNN suffer from noise in several land covers. However, the 3D CNN has a better Graminoid marsh class classification map than the other methods. The HybridSN and SF land cover maps of the Oak and Scrub classes are much better. Furthermore, the proposed method's classification maps are very close to the GT in several other classes.
In Figure 6, we can see that the classification map of the land covers using SVM is further from the ground truth (GT) in several classes. In contrast, the 1D CNN has improved visual maps in several classes. Much better object visualization can be seen in the 2D CNN approach, which achieved a very close map in the Celery class compared to the GT. The visual map of the Stubble class using 3D CNN is much better than that of the other methods. Overall, a variety of machine learning and deep learning techniques have significantly improved classification accuracy; nevertheless, few of them are computationally efficient enough for real-time applications. We compared these methodologies with the proposed HResNeXt system using quantitative results and visual maps. The SVM-based approach cannot extract high-dimensional features due to its design limitations. The 1D CNN extracts features through one-directional convolution. The 2D CNN is capable of calculating spatial features in both directions, but it does not exploit spectral features. The 3D CNN has a high computation cost but is capable of extracting high-dimensional spectral features.
The training loss of the proposed method on different data sets
We have performed several experiments on the data sets. We observed no significant change in the training accuracy after 100 epochs; therefore, the model was trained for 100 epochs only, which reduces computation costs. The training loss curves of the proposed method on PU, KSC, SA, and IP are shown in Figure 8.

The computation time of the proposed method on training and validation data
We summarize the training and validation times of the proposed method on the PU, KSC, SA, and IP data sets. A detailed summary of the training time in minutes and test time in seconds is shown in Table 11.

Conclusion
Hyperspectral image classification is challenging and requires a sophisticated method to better utilize the rich spatial and spectral features. Many machine learning and deep learning techniques enhance classification accuracy; nevertheless, few such methods are efficient enough for real-time applications. The lightweight HResNeXt model is specifically designed to overcome the limitations of traditional methods, and it successfully captures spectral and spatial information concurrently. In the proposed study, we utilized only one 3D convolution block for spectral features and a modified 2D residual block to capture spatial features. The original ResNeXt has many trainable parameters, which increases the computation cost; hence, we first reduced the trainable parameters to reduce this cost.
After that, we jointly extracted spectral and spatial features to improve the quantitative and visual performance, enabling efficient and effective feature extraction from hyperspectral images and resulting in competitive classification accuracy. The HResNeXt obtained overall accuracies (OA) on PU, KSC, SA, and IP of 99.46%, 81.46%, 99.75%, and 98.64%, respectively. In future studies, we will explore more advanced and lightweight graph CNNs and vision transformers. In addition, the integration of handcrafted features and deep features can be used to improve classification accuracy. Further, the high-dimensional features extracted by the model can be optimized using nature-inspired algorithms to enhance classification performance. The computation cost of the algorithm is still a big challenge that needs further reduction. In addition, PCA was applied for dimension reduction in the proposed study; other dimension reduction algorithms may yield slightly different performance. Finally, the model can be implemented on real-time data sets.

Figure 1 .
Figure 1. The model architecture of the proposed method.
where t, a, and b are hyperparameter constants and n is the number of adjacent kernel feature maps. The efficiency of deep CNN models depends on their architecture, and hyperparameters are a major component of a deep CNN model; through their proper use, CNN classification accuracy can be improved. Division by zero is avoided by setting t = 2. The number of consecutive input maps that undergo normalization is defined by n. The normalization constant a is set to 10⁻⁴, and the contrast constant b is set to 0.65.
Table 2. Parameter comparison of the original ResNeXt and the modified ResNeXt.

Figure 3 .
Figure 3. Effect of patch size on classification performance.

Figure 8 .
Figure 8. The training loss of the proposed model on (a) PU, (b) KSC, (c) SA, and (d) IP.

Table 1 .
Summary of the recent methods used for HSI classification.

Table 4 .
PU data set description with land cover color map.
* Color code details for Table 4.

Table 8 .
Performance evaluation on the KSC data set.

Table 5 .
KSC data set description with land cover color map.
* Color code details for Table 5.

Table 6 .
SA data set description with land cover color map.
* Color code details for Table 6.

Table 7 .
Performance evaluation on the PU data set.

Table 9 .
Performance evaluation on the SA data set.

Table 10 .
Performance evaluation on the IP data set.

Table 11 .
Computation time on training and validation data.