Spatial-spectral feature classification of hyperspectral image using a pretrained deep convolutional neural network

ABSTRACT Deep learning based methods have recently been successfully explored in the field of hyperspectral image classification. However, training a deep learning model still requires a large number of labeled samples, which is usually impractical for hyperspectral images. In this paper, a simple but effective feature extraction method is proposed for hyperspectral image classification. Specifically, a deep convolutional neural network pretrained on the ImageNet dataset is used to extract the spatial features of a hyperspectral image. Pretrained convolutional neural networks are now easy to obtain on the Internet. Because the pretrained models are trained on the ImageNet dataset, the proposed method does not need labeled hyperspectral samples to train the deep model. Therefore, the proposed method alleviates the problem of lacking labeled samples and avoids the manual design of feature extraction rules. Finally, the extracted features are stacked with spectral features as the input of a support vector machine classifier. The proposed method is evaluated on three widely used hyperspectral image datasets. The experimental results demonstrate that the proposed method outperforms both conventional feature extraction methods and deep learning based methods.


Introduction
Hyperspectral image (HSI) classification has become a hot topic in the field of remote sensing and has been widely used in many applications. In general, the complex characteristics of hyperspectral data make the accurate classification of such data challenging for traditional machine learning methods. (S. Li et al., 2019) Recently, deep learning based methods have been successfully explored for HSI classification and have demonstrated good performance. (Bing Liu, Yu, Zhang, Yu et al., 2018) (B. Liu et al., 2019) (Bing Liu, Gao et al., 2020) Deep learning based pixelwise classifiers, including the one-dimensional convolutional neural network (1D-CNN), (Hu et al., 2015) deep belief network (DBN) (Chen et al., 2015) and recurrent neural network (RNN), (Mou et al., 2017) were first used for supervised classification of HSIs. A disadvantage of pixelwise classifiers is that they do not consider spatial information in the classification procedure. In this context, 2D-CNN (Yue et al., 2015) (Bing Liu et al., 2017) (Y. Chen et al., 2017), 3D-CNN (Y. Chen et al., 2016) (Bing Liu, Yu, Zhang, Tan et al., 2018) and 2D-RNN (Bing Liu, Yu, Yu et al., 2018) (Hang et al., 2019) models have been used to mine the spatial information of HSIs, which greatly improves the classification performance. Meanwhile, to further improve the classification accuracy, deep residual networks (Zilong et al., 2017) (Haut et al., 2019) and deep dense networks have also been used for HSI classification. Supervised deep learning classifiers can improve the classification performance of HSIs, but the aforementioned methods require a certain number of labeled samples to ensure ideal classification results. To deal with the lack of labeled samples, researchers have explored unsupervised deep learning of HSIs and obtained many meaningful research results. For example, an autoencoder network (Koda et al., 2019) was designed to extract spectral-spatial features from HSIs.
Furthermore, an unsupervised spatial-spectral feature learning strategy (Mei et al., 2019) is proposed for HSIs using 3-Dimensional (3D) convolutional autoencoder (3D-CAE). A wasserstein generative adversarial network (WGAN) (M. Zhang et al., 2018) is designed to train a deep learning based feature extractor without supervision.
The above unsupervised deep learning methods can improve the classification performance. However, they usually require complex training strategies and long training times. With the rapid development and application of deep learning, it is now easy to obtain a pretrained deep convolutional neural network on the Internet. More importantly, a large body of research and practice shows that networks pretrained on large datasets such as ImageNet can effectively improve model performance on other tasks. In particular, some studies (Zeiler & Fergus, 2014) (Yu et al., 2014) have shown that the lower layers of a deep convolutional neural network learn to extract general features that transfer across different tasks. Motivated by this, a simple but effective feature extraction method is proposed to improve the classification accuracy of HSIs. Specifically, a VGG19 model (Simonyan & Zisserman, 2014) pretrained on the ImageNet dataset is used as the spatial feature extractor for an HSI. This means that neither the training of a deep learning model nor the manual design of feature extraction rules is needed. The proposed method not only alleviates the lack of labeled training samples but also effectively improves the classification accuracy of HSIs. The main contribution of this paper is a simple but effective feature extraction method for hyperspectral image classification.

Proposed method
In this section, we explain the architecture of VGG19 and the proposed feature extraction method in detail.

Architecture of VGG19
VGG19 (Simonyan & Zisserman, 2014) was first designed for the large-scale image recognition task on the ImageNet dataset. As shown in (Figure 1), VGG19 contains 16 convolutional layers and 3 fully connected layers with learnable parameters, plus 5 pooling layers. In (Figure 1), Conv 3-n denotes a convolutional layer with n kernels of size 3 × 3, Block*n denotes repeating this convolutional layer n times, Max-pool denotes a max pooling layer, and FC-n denotes a fully connected layer with n neurons. The input of VGG19 is a fixed-size 224 × 224 RGB image. In the $l$-th convolutional layer, the input size is $w_l \times h_l \times c_l$, where $w_l$ and $h_l$ are the numbers of rows and columns in the spatial dimensions, respectively, and $c_l$ is the feature dimension. Given the inputs $x_i^l$ of the $l$-th convolutional layer, the output $x_j^{l+1}$ is defined as

$$x_j^{l+1} = f\Big(\sum_{i \in M_l} x_i^l * K_{ij}^{l+1} + b_j^{l+1}\Big), \quad (1)$$

where $M_l$ is the set of input maps, $K_{ij}^{l+1}$ is the convolution kernel of layer $l+1$ that connects the $i$-th map in layer $l$ and the $j$-th map in layer $l+1$, $b_j^{l+1}$ is the bias of the $j$-th map of layer $l+1$, and $f(\cdot)$ is a nonlinear activation function. Each output map of a convolutional layer is thus a combination of convolutions of the input maps.
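As a concrete illustration of Eq. (1), the following minimal NumPy sketch computes one output map of a convolutional layer by summing the convolutions of all input maps, adding a bias and applying a ReLU activation. The function name, the "valid" padding and the loop-based implementation are our own simplifying assumptions for clarity, not details of VGG19 itself.

```python
import numpy as np

def conv_layer_output(x, K, b):
    """One output map of Eq. (1): f(sum_i x_i^l * K_ij^{l+1} + b_j^{l+1}).

    x: input maps, shape (c_l, h, w); K: kernels, shape (c_l, k, k); b: scalar bias.
    'Valid' convolution (no padding), f(.) = ReLU, for illustration only.
    """
    c, h, w = x.shape
    k = K.shape[1]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(c):                       # sum over the set of input maps M_l
        for r in range(out.shape[0]):
            for s in range(out.shape[1]):
                # cross-correlation of the i-th input map with its kernel
                out[r, s] += np.sum(x[i, r:r + k, s:s + k] * K[i])
    return np.maximum(out + b, 0.0)          # nonlinear activation f(.)
```

For example, with two 4 × 4 input maps of ones and 3 × 3 kernels of ones, each output position sums 18 values before the bias and activation are applied.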
Compared with a conventional CNN, VGG19 has a deeper architecture, which makes the network harder to train. The Rectified Linear Unit (ReLU) function is selected as the nonlinear activation in Eq. (1) to avoid the vanishing gradient problem. To prevent overfitting, dropout is also introduced. Dropout is a technique in which some hidden nodes are randomly discarded during training. This can be seen as temporarily deleting part of the network structure while retaining the corresponding weights (they are simply not updated), so that they remain available for subsequent training iterations.
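Dropout can be sketched in a few lines of NumPy. The inverted-dropout rescaling shown here is the variant used by modern frameworks and is an illustrative assumption, not a detail stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, rate=0.5, training=True):
    """Inverted dropout: randomly zero activations at train time and rescale,
    so nothing needs to change at test time. Dropped units keep their weights;
    they are simply skipped for this forward/backward pass."""
    if not training:
        return a
    mask = rng.random(a.shape) >= rate       # keep each unit with prob. 1 - rate
    return a * mask / (1.0 - rate)           # rescale to preserve the expected sum
```

At test time the function is an identity, which is why the retained-weight interpretation in the text holds: the network structure is only thinned temporarily.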

Spatial-spectral feature extraction
For the HSI classification task, one of the greatest challenges is determining what types of features should be extracted from the pixels. In this paper, a pretrained VGG19 model is selected as the spatial feature extractor. The extracted spatial features are then concatenated with the original spectral features to generate the spatial-spectral features.
The proposed spatial-spectral feature extraction method is shown in (Figure 2). Note that the input size of the pretrained VGG19 is 224 × 224 × 3, which generally differs from the dimensions of an HSI. First, we use PCA to reduce the HSI to 3 dimensions. Then we resample the reduced image to 224 × 224. In this way, the processed HSI can be fed into the VGG19 model to extract features. Each convolutional layer of VGG19 generates features, whose dimension equals the number of convolution kernels in that layer. VGG19 contains 16 convolutional layers; their numbers of convolution kernels are listed in (Table 1). Some studies have shown that the lower-layer features of a deep convolutional neural network are easier to transfer to other tasks. Therefore, the outputs of the first 5 convolutional layers are stacked as the spatial features. Note that the output size of a convolutional layer differs from the HSI size, so we upsample the output of each convolutional layer to match the HSI. Since we use the first 5 convolutional layers to extract features, the feature dimension is 640 (64 + 64 + 128 + 128 + 256).
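The preprocessing described above (PCA to 3 components, then resampling to 224 × 224) can be sketched as follows. The nearest-neighbour resampling and the function names are our own simplifying assumptions, since the paper does not specify the interpolation method:

```python
import numpy as np

def pca_to_rgb(cube, n_components=3):
    """Project an HSI cube of shape (H, W, B) onto its first 3 principal components."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b).astype(np.float64)
    flat -= flat.mean(axis=0)                # center each band
    # principal directions via SVD of the centered pixel-by-band matrix
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:n_components].T).reshape(h, w, n_components)

def resample_nn(img, size=224):
    """Nearest-neighbour resampling to size x size, a stand-in for a proper
    bilinear resize."""
    h, w, _ = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]
```

The resulting 224 × 224 × 3 image can then be passed to any ImageNet-pretrained backbone, and the per-layer feature maps upsampled back to the HSI size.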
Finally, the extracted spatial features and the original spectral features are stacked to form the spatial-spectral features. After feature extraction, a traditional classifier (e.g., SVM) can be used to complete the HSI classification.
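A minimal sketch of this final stacking-and-classification step, using scikit-learn's SVC on randomly generated stand-in features (the 640-D spatial and 103-D spectral dimensions follow the University of Pavia setting described later; all data here are synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Hypothetical stand-ins: 200 labeled pixels, 640-D VGG19 spatial features
# (first five conv layers, upsampled to the HSI size) and 103 spectral bands.
spatial  = rng.normal(size=(200, 640))
spectral = rng.normal(size=(200, 103))
labels   = rng.integers(0, 9, size=200)

# Stack per pixel along the feature axis to form the spatial-spectral vector.
features = np.concatenate([spatial, spectral], axis=1)   # shape (200, 743)

clf = SVC(kernel="rbf", C=2.0, gamma="scale").fit(features, labels)
pred = clf.predict(features)
```

In practice, feature scaling (e.g., standardizing each dimension) is usually advisable before an RBF-kernel SVM, since the spatial and spectral features live on different scales.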

Experimental results
The experimental results were generated on a PC equipped with an Intel 2.59 GHz Core i7-9750H CPU and an Nvidia GeForce RTX 2070 GPU, with 16 GB of memory. The VGG19 model is implemented in Keras.

Experimental data
To evaluate the efficacy of the proposed feature extraction method, classification experiments were conducted on three well-known data sets. The first one is the University of Pavia data set acquired by the Reflective Optics System Imaging Spectrometer sensor over the campus at the University of Pavia, Northern Italy. This data set mainly contains an urban environment with multiple solid structures (asphalt, gravel, metal sheets, bitumen, and bricks), natural objects (trees, meadows, and soil), and shadows. After discarding the noisy bands, the considered scene contains 103 spectral bands, with a size of 610 × 340 pixels in the spectral range from 0.43 to 0.86 µm and with spatial resolution of 1.3 m.
The second dataset is the Salinas dataset collected by the 224-band AVIRIS sensor over the Salinas Valley, USA. This dataset is characterized by a spatial resolution of 3.7 m. The image size is 512 × 217 pixels. 20 water absorption bands are discarded. This dataset includes a total of 16 ground-truth classes, such as vegetables, bare soils, and vineyard fields.
The third data set is the Indian Pines data set. This data set is gathered by AVIRIS sensor over the Indian Pines test site in North-western Indiana and consists of 145 × 145 pixels and 224 spectral reflectance bands in the wavelength range 0.4-2.5 µm. 24 bands covering the region of water absorption are removed, resulting in 200 bands for classification.
The class names and the numbers of labeled training and testing samples are listed in (Tables 2-4).

Parameter analysis
Training of VGG19 was carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent with momentum. According to the original paper, the batch size was set to 256 and the learning rate was initially set to 0.01, then decreased by a factor of 10 whenever the validation accuracy stopped improving. Training was regularised by weight decay (L2 penalty multiplier set to 0.0005) and by dropout for the first two fully connected layers (dropout ratio set to 0.5).
At present, all major open source deep learning frameworks provide classical deep models (e.g., VGG19, ResNet50) trained on the ImageNet dataset, so the VGG19 model is easy to obtain. Once the pretrained model is obtained, VGG19 can be used directly as a spatial feature extractor. The FLOPs of the spatial feature extractor (VGG19) are 11,276.35, and its number of trainable parameters is 555,328.
Since the VGG19 model does not need training, the proposed feature extraction method is very simple: we only need to choose the number of convolutional layers used for feature extraction. To analyze the influence of this number on the classification accuracy, the number of convolutional layers was set to 1, 2, 3, 4, 5, 6, 7 and 8, respectively. Note that we perform both spatial feature classification and spatial-spectral feature classification to demonstrate the effectiveness of the spatial-spectral features. The classification results on the three HSI data sets are shown in (Figure 3). First, the classification accuracy increases with the number of convolutional layers but tends to stabilize when the number of layers is greater than 5. This means that using more convolutional layers to extract spatial features does not further improve the classification accuracy; moreover, more convolutional layers greatly increase the feature dimension, which leads to a sharp increase in training time. Considering these factors, we finally use the first five convolutional layers of VGG19 to extract spatial features. Second, the classification accuracy using spatial-spectral features is better than that using only spectral features, which demonstrates the necessity of using spatial-spectral features.
In this paper, support vector machine (SVM) and random forest (RF) are selected as the classifiers. The SVM uses an RBF kernel. The hyperparameters C (the penalty term of the SVM optimization) and K (the spread of the RBF kernel) were searched over the ranges C = 2^{-2}, 2^{-1}, ..., 2^7 and K = 2^{-2}, 2^{-1}, ..., 2^7 using cross validation. For the RF classifier, the number of decision trees is set to 500.
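The hyperparameter search described above can be reproduced with scikit-learn's GridSearchCV. The dataset below is synthetic, and scikit-learn's gamma parameter plays the role of the kernel spread denoted K in the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; in the paper, the spatial-spectral features are used.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Search both RBF hyperparameters over {2^-2, 2^-1, ..., 2^7} with cross validation.
grid = {"C": 2.0 ** np.arange(-2, 8), "gamma": 2.0 ** np.arange(-2, 8)}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
```

After fitting, `search.best_params_` holds the selected (C, gamma) pair and `search.best_estimator_` the refitted SVM.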
The spatial feature extractor (VGG19) takes a 3-band image as input. In this paper, PCA is used to reduce the dimension of the HSI. We also tested other band selection strategies, such as randomly selecting three bands and band selection via an adaptive subspace partition strategy. The mean classification accuracies of 20 runs are shown in (Figure 4). "Band Selection" obtains a higher classification accuracy than "Random", and "PCA" obtains the best classification result. This is because the band selection method selects more representative bands than random selection, while PCA concentrates the information into the first three components and thus retains the most information.
To better observe the extracted spatial features, we take the University of Pavia data set as an example and visualize the features. The spatial feature visualization results are shown in (Figure 5). To aid comparison, we also show the HSI after dimensionality reduction. The first convolutional layer extracts the edge information of the image, and the deeper the layer, the more abstract the extracted features become. These abundant features are helpful for further processing.

Classification performance
To demonstrate the effectiveness of the proposed method, VGG19+SVM is compared with two traditional methods and several deep learning based methods. The two traditional methods are the support vector machine (SVM) and spectral-spatial classification with extended morphological profiles (EMPs). (Fauvel et al., 2008) In addition, we also apply PCA to SVM and EMPs. The deep learning based methods include 3D-CAE (Mei et al., 2019), 3D-CNN and CNN-PPF. To further demonstrate the effectiveness of the proposed method, we also test VGG19 with a random forest classifier (VGG19+RF). 200 samples per class are randomly selected as the training data set. Note that the training data set for the different methods is exactly the same.
To compare the different classification methods more comprehensively, class-specific accuracy, overall accuracy (OA), average accuracy (AA) and κ are used as evaluation criteria in this paper. The results are listed in (Tables 5-7). PCA+SVM and EMPs+PCA achieve the lowest overall classification accuracy, which shows that although PCA reduces the feature dimension, it can also lower the classification accuracy. EMPs, 3D-CAE, 3D-CNN and CNN-PPF can all improve the classification accuracy of HSIs; however, 3D-CAE, 3D-CNN and CNN-PPF require complex training procedures, and the improvement from EMPs is limited. In contrast with the compared methods, the proposed method (VGG19+SVM) achieves state-of-the-art results while bypassing the complex training procedure of deep learning models. Meanwhile, VGG19+RF also provides competitive classification results, which further demonstrates the effectiveness of the proposed method. (Figures 6-8) show the ground truth maps and the classification maps obtained by the different methods.

To further validate the classification performance of VGG19+SVM, we repeated the experiments of the different methods 20 times, randomly selecting 200 samples per class as the training data set in each experiment; the training data set for the different methods is the same in each experiment. (Figure 9) shows the distribution of κ of the different algorithms over the 20 experiments as box plots. In each box plot, the horizontal line inside the box is the median, the upper and lower edges of the box are the upper and lower quartiles, the two horizontal lines connected to the box are the maximum and minimum values, and the rhombus shapes are outliers. From (Figure 9), we can see that VGG19+SVM outperforms the other compared methods.
We also use the paired t-test (Pan et al., 2017) to statistically evaluate the results; the paired t-test is a widely used statistical method to verify whether there is a significant difference between two groups of related samples. We accept the hypothesis that the mean κ of VGG19+SVM is larger than that of a compared method only if Equation (2) is valid.
$$t = \frac{(a_1 - a_2)\sqrt{n_1 + n_2 - 2}}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\left(n_1 s_1^2 + n_2 s_2^2\right)}} > 3.57, \quad (2)$$

where $a_1$ and $a_2$ are the mean κ of VGG19+SVM and of a compared method, $s_1$ and $s_2$ are the corresponding standard deviations, and $n_1$ and $n_2$ are the numbers of experiment realizations, set to 20 in this paper. t values larger than 3.57 mean that the two results are statistically different at the 99.9% confidence level. As listed in (Table 8), all the values are much larger than 3.57, which indicates that the increases in κ are statistically significant.

To further illustrate the effectiveness of the VGG19 network structure for spatial feature extraction from HSIs, we compare VGG19 with other pretrained networks, again randomly selecting 200 training samples per class: ResNet50 (He et al., 2015), which has a deeper network structure, and InceptionV3 (Chollet, 2016), which has a wider receptive field. ResNet50 uses its first two convolutional layers for spatial feature extraction, and InceptionV3 uses its first five. The classification results are shown in (Table 9). Among the three pretrained networks, whether SVM or RF is used as the classifier, ResNet50 yields the lowest classification accuracy, InceptionV3 is slightly lower than VGG19, and VGG19 obtains the highest classification accuracy on all three data sets. The experimental data in (Table 9) therefore illustrate the effectiveness of VGG19 for feature extraction. In addition, we conduct an ablation study to demonstrate the effectiveness of combining spatial and spectral features.
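The t statistic used for this comparison can be computed as below. This is a sketch based on the standard two-sample t formula with pooled variance, consistent with the quantities a, s and n defined in the text; the exact form printed in the original is our reconstruction:

```python
import numpy as np

def paired_t(k1, k2):
    """t statistic for two groups of kappa scores:
    t = (a1 - a2) * sqrt(n1 + n2 - 2) / sqrt((1/n1 + 1/n2) * (n1*s1^2 + n2*s2^2))."""
    a1, a2 = np.mean(k1), np.mean(k2)
    s1, s2 = np.std(k1), np.std(k2)      # standard deviations of the two groups
    n1, n2 = len(k1), len(k2)            # number of realizations (20 in the paper)
    num = (a1 - a2) * np.sqrt(n1 + n2 - 2)
    den = np.sqrt((1.0 / n1 + 1.0 / n2) * (n1 * s1 ** 2 + n2 * s2 ** 2))
    return num / den
```

Values above the 99.9% critical threshold (3.57 for 38 degrees of freedom) indicate a statistically significant difference in mean κ.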
Concretely, we record the classification results produced using only the spectral features and only the VGG features, respectively; these simplified models are denoted "VGG19-Spectral" and "VGG19-Spatial". The comparative results are shown in (Table 10). Whether SVM or RF is selected as the classifier, the overall classification accuracy of VGG19 is higher than that of VGG19-Spectral and VGG19-Spatial, which fully demonstrates the effectiveness of combining the spatial and spectral features in the proposed method.

The feature extraction times of the different methods are listed in (Table 11). Note that deep learning based methods learn to extract features from data, so (Table 11) shows the training times of 3D-CAE, 3D-CNN and CNN-PPF. The drawback of the long training time of deep neural networks is becoming less decisive with the rapid development of hardware technology, especially GPUs; however, training a deep learning model is still very time-consuming. From (Table 11), the feature extraction time of the deep learning models is measured in minutes or hours, whereas the traditional spatial-spectral feature extraction method takes much less time. The proposed method does not need to train a deep learning model, so it also extracts features quickly.

Conclusion
In this paper, a simple but effective spatial-spectral feature extraction method is proposed for HSI classification. A pretrained VGG19 model is used to extract the spatial features of an HSI, so the method only needs to set the number of convolutional layers used for feature extraction. Experiments and analysis show that using the first 5 convolutional layers is the most reasonable choice. The proposed method does not need to train a deep learning model, so feature extraction is fast. The extracted spatial features are stacked with the original spectral features to form the spatial-spectral features. Experiments on three HSI datasets show that the extracted spatial-spectral features greatly improve the classification accuracy.

Disclosure statement
No potential conflict of interest was reported by the author(s).