A new kernel method for hyperspectral image feature extraction

Abstract Hyperspectral images provide abundant spectral information for the remote discrimination of subtle differences in ground covers. However, the increasing number of spectral dimensions, together with information redundancy, makes the analysis and interpretation of hyperspectral images a challenge. Feature extraction is therefore a very important step in hyperspectral image processing. Feature extraction methods aim to reduce the dimension of the data while preserving as much information as possible. In particular, nonlinear feature extraction methods (e.g. the kernel minimum noise fraction (KMNF) transformation) have been reported to benefit many applications of hyperspectral remote sensing because they preserve the high-order structures of the original data well. However, conventional KMNF and its extensions are limited in how they estimate the noise fraction during feature extraction, which leads to poor performance in post-applications. This paper proposes a novel nonlinear feature extraction method for hyperspectral images. Instead of estimating the noise fraction from nearest-neighborhood information (within a sliding window), the proposed method explores the use of image segmentation. The approach benefits both noise fraction estimation and information preservation, and enables a significant improvement in classification. Experimental results on two real hyperspectral images demonstrate the efficiency of the proposed method. Compared with conventional KMNF, the proposed method improves the classification of the two hyperspectral images by 8 and 11%. This nonlinear feature extraction method can also be applied in other disciplines where high-dimensional data analysis is required.


Introduction
Hyperspectral images capture the spatial and spectral information of earth targets simultaneously and have extremely high spectral resolution, which makes it possible to identify target classes precisely (Plaza et al. 2009; Li et al. 2012). However, hyperspectral data contain a large number of bands, often up to hundreds, which poses challenges for the analysis and application of hyperspectral data. On the one hand, the large data size and high computational complexity of hyperspectral images place high demands on hardware. On the other hand, the high dimensionality of hyperspectral data hinders the precise classification of hyperspectral images. The more bands the data have, the more samples are required to achieve the desired classification accuracy. Yet labeled samples are difficult to obtain and usually demand significant human and material resources (Hughes 1968).
In this study, dimensionality reduction is considered to be an effective method to solve these problems (Jia, Kuo and Crawford 2013; Benediktsson, Palmason and Sveinsson 2005). Hyperspectral dimensionality reduction aims to simplify and optimize image features. It can effectively express high-dimensional data information, greatly reduce data size, and enable rapid and precise extraction of target information. Therefore, dimensionality reduction is important to the effective use of hyperspectral images. The common methods of dimensionality reduction fall into two major categories: band selection and feature extraction. Band selection chooses band subsets from the original data according to relevant evaluation indicators. The criteria for band selection are mainly based on information theory and spectral variance, such as the information entropy method proposed by Bajcsy and Groves (2004), the use of mutual information to evaluate the information content of different bands proposed by Guo et al. (2006), and the information covariance method proposed by Ball et al. (2007). Feature extraction transforms the original data into an optimized subspace by a mathematical transformation, thereby generating new spectral features. Traditional feature extraction methods include linear and nonlinear methods.

Principal component analysis (PCA) (Roger 1996) and minimum noise fraction (MNF) (Green et al. 1988) are the two commonly used linear dimensionality reduction methods. PCA uses information content as the evaluation index for feature extraction and sorts the components in descending order of information content after transformation. However, some studies found that, when the noise is unevenly distributed across bands, the PCA transformation cannot guarantee that the image components are sorted by quality (Green et al. 1988). MNF uses the noise fraction (NF) as its evaluation index. After MNF transformation, the components are sorted by image quality and are not affected by the noise distribution (Lee, Woodyatt and Berman 1990). However, MNF is unable to achieve reliable performance in real applications (Gao et al. 2013b). Some work has reported that the fundamental factor constraining the MNF transformation is inaccuracy in the calculation of NF (Greco 2006; Liu et al. 2009; Gao et al. 2011; Zhao, Gao and Zhang 2016; Gao et al. 2017). Because the information content of a particular hyperspectral image is fixed, the accuracy of the NF calculation depends mainly on the noise estimation results. The original MNF transformation uses spatial neighborhood information to estimate noise. However, hyperspectral images usually have low spatial resolution with severely mixed pixels, so using spatial information alone to calculate NF leads to relatively large errors, and the estimation results are unstable. Therefore, Gao et al. (2011), Gao, Du, et al. (2013), and Gao, Zhang, et al. (2013) introduced the optimized minimum noise fraction (OMNF) method. This method adopts the spectral and spatial decorrelation (SSDC) method to estimate noise by considering the high correlation between bands. This improves the noise estimation results and increases the accuracy of NF. Ultimately, the performance of the original MNF transformation is improved. However, the above-mentioned methods can only extract linear features of hyperspectral data and are unable to find nonlinear features.
Currently, the widely used nonlinear feature extraction methods are primarily kernel-based. The main advantage of these methods is that nonlinear features can be extracted effectively. Based on a kernel function, these algorithms transform the original data into a higher dimensional feature space. Thus, data that are linearly inseparable in the original space can be separated in the high-dimensional feature space. Nielsen (2011), Gómez-Chova and Nielsen (2011), and Nielsen and Vestergaard (2012) proposed the kernel-based minimum noise fraction (KMNF) method based on this principle. This method has achieved great results in change detection on remote sensing images with high spatial resolution, but it fails to obtain the desired effect for hyperspectral images in practical applications. Theoretical analysis shows that KMNF, like MNF, adopts spatial neighborhood information to estimate noise; thus, its NF is also inaccurate. To overcome this problem, the optimized kernel minimum noise fraction (OKMNF) method was proposed (Zhao, Gao and Zhang 2016). This method uses SSDC to estimate noise, which greatly improves the performance of KMNF. However, in practical applications, the performance of OKMNF was found to be still unstable for hyperspectral images with complex surface features. The main reason is that SSDC is not fully adequate for evaluating noise in hyperspectral images. In SSDC, the minimum region of the multiple linear regression (MLR) is determined by an empirical value, such as 6 × 6 pixels or 8 × 8 pixels (Gao et al. 2013a). When hyperspectral images contain complex earth objects, such minimum regions inevitably include other surface features, for example at borders. Hence, the consistency of classes cannot be ensured within all divided areas, which produces errors in the noise estimation and decreases the performance of feature extraction.
To overcome the above limitations, we introduce image segmentation into the feature extraction algorithm. Image segmentation divides an image into several subregions that do not intersect each other. The features are homogeneous within each subregion, while there are significant differences between different subregions. Traditional image segmentation methods can be divided into four categories: threshold-based, region-based, edge-based, and theory-based image segmentation (Huang 1998; Kanungo et al. 2002). The key to threshold-based image segmentation is to determine a suitable threshold to divide images. Common threshold processing techniques include the adaptive threshold method, the global threshold method, and the optimal threshold method. Region-based image segmentation searches for regions directly. This method has two different search modes: region growing, and region splitting with merging. Edge-based image segmentation attempts to divide images by detecting the edge information of different regions. However, this method is extremely sensitive to noise and is thus only suitable for low-noise images. Theory-based image segmentation draws on new theories and methods developed in various disciplines. In recent years, common theory-based image segmentation methods have been based primarily on mathematical morphology, fuzzy theory, gene coding, wavelet transform, machine learning, and clustering analysis. These image segmentation methods, based on different specific theories, have their own applications. In particular, image segmentation based on clustering analysis considers not only spatial information but also spectral information and is therefore more suitable for the processing of hyperspectral images. This paper proposes a new kernel method (KM-KMNF) for hyperspectral image feature extraction. Figure 1 shows the flowchart of the proposed KM-KMNF method. The method uses the homogeneous regions generated by image segmentation as the minimum regions for multiple linear regression. By integrating the spatial information of the homogeneous regions with spectral decorrelation, KM-KMNF improves noise estimation for feature extraction, enabling better performance in feature extraction and in post-applications (e.g. classification).
The remainder of this paper is organized as follows: Section 2 introduces the KM-KMNF algorithm. Section 3 evaluates the classification performance of KM-KMNF in comparison with other established dimensionality reduction methods on two real hyperspectral data sets. Finally, Section 4 concludes the paper.

Optimized K-means clustering method
Conventional K-means clustering is an effective unsupervised clustering method. Using a given similarity measure as the criterion, this method divides hyperspectral images into different subregions. The characteristics are homogeneous within the same subregion, while there are significant differences between different subregions. For a given hyperspectral image X containing n pixels and b bands, the algorithmic flow of K-means clustering is as follows: at the beginning, k pixels are randomly chosen from the hyperspectral image as initial cluster centers. The other pixels are then assigned to the most similar clusters by computing their similarity with all cluster centers. The next step is to recalculate the centers of the new clusters and to repeat the process until the criterion function converges. After K-means clustering, each cluster is a subregion of the image segmentation. However, as shown in Figure 2(c) and Figure 4(c), pixels within the same subregion are distributed discretely. Thus, not all pixels within the same subregion form a connected region. Moreover, homogeneity of the internal features cannot be guaranteed within the same subregion. Therefore, the segmentation results of conventional K-means clustering cannot meet the requirements of the algorithm. To solve these problems, we optimize the K-means clustering algorithm (OK-means). Firstly, in the segmentation of hyperspectral images, to ensure homogeneous features within the subregions, the number of subregions k should be far higher than that of normal segmentation. Then, to ensure connectivity within each subregion, the search area in which the similarity between a pixel and its adjacent cluster centers is measured should be limited. Finally, pixels are assigned to the most similar clusters. With OK-means image segmentation, all pixels within the same subregion form a connected region, and homogeneity of the internal features can be guaranteed in each subregion. Therefore, the OK-means image segmentation algorithm is more suitable for the processing of hyperspectral images. The segmentation results of the OK-means algorithm are shown in Figure 2(d) and Figure 4(d).
The flow of the algorithm is as follows:
Step 1: Initialize k cluster centers. For a hyperspectral image X with n pixels and b bands, the ideal subregion size is $\sqrt{n/k} \times \sqrt{n/k}$. With $\sqrt{n/k}$ as the step length, k pixels are selected as the initial cluster centers $(c_1, c_2, \ldots, c_k)$.
Step 2: Assign pixel $x_i$ to the most similar cluster. The similarity between pixel $x_i$ and each cluster center $c_v$ located in its spatial neighborhood is calculated, and the pixel is assigned to the cluster with the greatest similarity. In other words, the pixel is labeled with the cluster that is most similar to it. Here, the cluster centers located in the spatial neighborhood of $x_i$ are those covered by the area of $2\sqrt{n/k} \times 2\sqrt{n/k}$ centered on $x_i$.
Step 3: Correct the cluster centers. The centers of the pixels sharing the same cluster label are regarded as the new cluster centers.
Step 4: Calculate the distance D. This distance is the difference between the pixels and their cluster centers. The algorithm terminates if D converges. Otherwise, it returns to Step 2.
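Steps 1 to 4 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `ok_means`, the use of squared Euclidean spectral distance as the (inverse) similarity, the relative tolerance `tol`, and the iteration cap `max_iter` are all hypothetical choices that the text above does not specify.

```python
import numpy as np


def ok_means(image, k, max_iter=20, tol=1e-6):
    """Locally constrained K-means on a (rows, cols, bands) image.

    Hypothetical helper: similarity is taken to be the negative squared
    Euclidean distance between spectra, which the source does not specify.
    Returns an integer label map of shape (rows, cols).
    """
    image = np.asarray(image, dtype=float)
    rows, cols, _ = image.shape
    step = max(1, int(round(np.sqrt(rows * cols / k))))  # sqrt(n/k)

    # Step 1: initial centers on a regular grid with spacing `step`.
    # The grid may yield slightly more or fewer than k centers.
    ys = np.arange(min(step // 2, rows - 1), rows, step)
    xs = np.arange(min(step // 2, cols - 1), cols, step)
    pos = np.array([(y, x) for y in ys for x in xs], dtype=float)
    spec = np.array([image[y, x] for y in ys for x in xs])

    labels = np.full((rows, cols), -1, dtype=int)
    prev_total = np.inf
    for _ in range(max_iter):
        best = np.full((rows, cols), np.inf)
        # Step 2: a center only competes for pixels within +/- step of it,
        # i.e. the pixel sees centers inside a 2*step x 2*step window.
        for v, (cy, cx) in enumerate(pos):
            y0, y1 = max(0, int(cy) - step), min(rows, int(cy) + step + 1)
            x0, x1 = max(0, int(cx) - step), min(cols, int(cx) + step + 1)
            dist = ((image[y0:y1, x0:x1] - spec[v]) ** 2).sum(axis=2)
            closer = dist < best[y0:y1, x0:x1]
            best[y0:y1, x0:x1][closer] = dist[closer]
            labels[y0:y1, x0:x1][closer] = v
        # Step 3: move each center to the mean position and mean spectrum
        # of the pixels currently assigned to it.
        for v in range(len(pos)):
            mask = labels == v
            if mask.any():
                yy, xx = np.nonzero(mask)
                pos[v] = (yy.mean(), xx.mean())
                spec[v] = image[mask].mean(axis=0)
        # Step 4: stop once the total pixel-to-center distance D settles.
        # Pixels outside every window keep their previous label.
        total = best[np.isfinite(best)].sum()
        if abs(prev_total - total) <= tol * max(total, 1.0):
            break
        prev_total = total
    return labels
```

Restricting each center to a local window is what keeps the resulting subregions spatially compact. A large `k` relative to the number of classes gives the small, homogeneous subregions that the method relies on.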

KM-KMNF algorithm
Remote sensing images are inevitably degraded by certain types of noise due to sensor instrument errors and other environmental factors. The hyperspectral image X, which contains n pixels and b bands, is composed of two parts, signal and noise (Green et al. 1988; Landgrebe and Malaret 1986):

$$x(p) = x_S(p) + x_N(p)$$

where $x(p)$ is the pixel vector at position p, and $x_S(p)$ and $x_N(p)$ are the signal and noise contained in $x(p)$, respectively. For optical images, the signal and noise are commonly considered independent of each other (Landgrebe and Malaret 1986; Green et al. 1988). Thus, the covariance matrix S of image X can be expressed as the sum of the noise covariance matrix $S_N$ and the signal covariance matrix $S_S$:

$$S = S_N + S_S$$

Using $\bar{x}_k$ to represent the mean value of all pixels in the k-th band of the image, the mean matrix $X_{mean}$ with n rows and b columns can be obtained. Then, the centralization matrix Z of X can be expressed as follows:

$$Z = X - X_{mean}$$

The covariance matrix of image X can be expressed as

$$S = Z^T Z / (n - 1)$$

Similarly, taking $\bar{x}_{Nk}$ as the average of the noise in the k-th band, the mean noise matrix $X_{N,mean}$ with n rows and b columns can be obtained. The centralization matrix $Z_N$ of the noise matrix $X_N$ can be expressed as:

$$Z_N = X_N - X_{N,mean}$$

The covariance matrix $S_N$ of the noise matrix $X_N$ can be expressed as:

$$S_N = Z_N^T Z_N / (n - 1)$$

The NF is defined as the ratio of the noise covariance matrix to the total covariance matrix of image X. Therefore, for the linear combination $a^T Z(p)$,

$$NF = \frac{a^T S_N a}{a^T S a} = \frac{a^T Z_N^T Z_N a}{a^T Z^T Z a} \tag{9}$$

where a is the eigenvector of NF. In NF, it is essential that the noise is estimated reliably. The original KMNF method primarily uses the 3 × 3 spatial neighborhood information of a hyperspectral image to estimate the noise $Z_N$, as shown below:

$$n_{i,j,k} = z_{i,j,k} - \hat{z}_{i,j,k}$$

where $z_{i,j,k}$ is the pixel value in the i-th row, j-th column, and k-th band of hyperspectral image Z, $\hat{z}_{i,j,k}$ is the estimated value of this pixel, and $n_{i,j,k}$ represents the estimated noise value of $z_{i,j,k}$. Many works have reported that the estimated results have a relatively large error and are unstable when the noise is estimated using the spatial neighborhood information alone (Gao et al. 2013a; 2013b). In comparison, the results are more accurate using the SSDC method in OKMNF, as shown in Equation (11):

$$n_{i,j,k} = z_{i,j,k} - \hat{z}_{i,j,k} = z_{i,j,k} - (a + b z_{i,j,k-1} + c z_{i,j,k+1} + d z_{p,k}) \tag{11}$$

where the parameters a, b, c, and d are the coefficients of the MLR, and $z_{p,k}$ represents the pixel value of the adjacent spatial position in the same band. To estimate the noise, the parameters of the MLR must be solved, which amounts to identifying the minimum region over which the MLR is fitted. In OKMNF, a region covering 6 × 6 pixels is set as the minimum region for obtaining the MLR parameters, based on an empirical value. However, this fixed partition inevitably places different classes in the same minimum region, for example at boundaries, which makes it difficult to guarantee a homogeneous category within the region. In comparison, in the KM-KMNF algorithm, the subregion $X_{sub}$ generated by OK-means clustering is used as the minimum region to estimate noise, ensuring the homogeneity of the pixel features in the minimum region. Meanwhile, within each minimum region, the noise is estimated by spectral decorrelation, which considers only the spectral information of the hyperspectral image; the correction made by the spatial neighborhood pixel $z_{p,k}$ to $z_{i,j,k}$ in Equation (11) is eliminated. Thus, the impact of heterogeneous information in the image spatial neighborhood on the noise estimation results can be avoided. The equation for the spectral decorrelation excluding the spatial neighborhood information is as follows:

$$n_{i,j,k} = z_{i,j,k} - \hat{z}_{i,j,k} = z_{i,j,k} - (a + b z_{i,j,k-1} + c z_{i,j,k+1}) \tag{12}$$

The adopted MLR model can be expressed as:

$$X_{sub} = B\mu + \varepsilon$$

where B represents the spectral neighborhood matrix, μ is the coefficient matrix, and ε is the residual value. Then, μ can be estimated by:

$$\hat{\mu} = (B^T B)^{-1} B^T X_{sub}$$

The signal value can be estimated through:

$$\hat{X}_{sub} = B\hat{\mu}$$

The noise value can be obtained from the difference between the true and estimated values:

$$N_{sub} = X_{sub} - \hat{X}_{sub}$$

Noise estimation is then performed in all subregions produced by image segmentation, so that the noise estimation result N of the whole hyperspectral image can be obtained accurately. To acquire the kernel minimum noise fraction (KNF), the noise estimation result is inserted into the NF calculation Equation (9), and NF is reformulated with the dual transformation $a \propto Z^T b$ and the kernel transformation (Nielsen and Vestergaard 2012):

$$KNF = \frac{b^T Z Z_N^T Z_N Z^T b}{b^T Z Z^T Z Z^T b} = \frac{b^T \kappa(Z, Z_N)\,\kappa(Z, Z_N)^T b}{b^T \kappa(Z, Z)^2\, b}$$

where κ is the Gaussian radial basis function (RBF):

$$\kappa(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$

where σ is the scale parameter, which can be obtained by calculating the mean distance between the observed values $x_i$ and $x_j$ (Nielsen 2011). To sort the image components by quality after dimensionality reduction, KNF needs to be minimized, which yields the transformation matrix for feature extraction. Minimizing KNF is equivalent to solving a symmetric generalized eigenvalue problem, which can be solved by maximizing the Rayleigh quotient (Nielsen 2011); the detailed solution process is not elaborated here. It should be noted that, as far as KNF is concerned, the kernel noise value is calculated in the kernel space.
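The per-subregion spectral decorrelation of Equation (12) and the Gaussian RBF kernel can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' code: the function names `estimate_noise` and `rbf_kernel` are hypothetical, the first and last bands (which lack one spectral neighbor) are assumed to receive a zero noise estimate, the kernel is assumed to use the $2\sigma^2$ normalization, and σ is assumed to default to the mean pairwise distance including zero self-distances.

```python
import numpy as np


def estimate_noise(image, labels):
    """Spectral decorrelation within each segment, as in Equation (12).

    image  : (rows, cols, bands) array
    labels : (rows, cols) integer segment map, e.g. from a clustering step
    Returns the residuals n = z - z_hat with the same shape as `image`.
    Assumption: bands 0 and bands-1 are left with a zero noise estimate.
    """
    rows, cols, bands = image.shape
    pixels = np.asarray(image, dtype=float).reshape(-1, bands)
    flat_labels = np.asarray(labels).ravel()
    noise = np.zeros_like(pixels)
    for seg in np.unique(flat_labels):
        idx = np.nonzero(flat_labels == seg)[0]
        block = pixels[idx]  # all pixels of one homogeneous subregion
        for k in range(1, bands - 1):
            # Regressors: intercept a, previous band (b), next band (c).
            B = np.column_stack(
                [np.ones(len(idx)), block[:, k - 1], block[:, k + 1]]
            )
            mu, *_ = np.linalg.lstsq(B, block[:, k], rcond=None)
            noise[idx, k] = block[:, k] - B @ mu
    return noise.reshape(rows, cols, bands)


def rbf_kernel(A, B, sigma=None):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B.

    Assumption: sigma defaults to the mean pairwise Euclidean distance.
    """
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :]
    sq = np.maximum(sq - 2.0 * A @ B.T, 0.0)  # clip rounding negatives
    if sigma is None:
        sigma = np.sqrt(sq).mean()
    return np.exp(-sq / (2.0 * sigma ** 2))
```

Fitting one regression per band inside each subregion means that the coefficients a, b, and c are shared only by pixels assumed to belong to the same surface type. That shared fit is the motivation for replacing fixed square blocks with segmentation-derived regions.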

In the kernel formulation, the real X of the hyperspectral image and the X estimated by the MLR are converted to the kernel space at the same time. Subsequently, the noise value is obtained in the kernel space instead of the original space (Nielsen and Vestergaard 2012). Finally, the feature extraction results are obtained by applying the solved transformation to the kernelized data.

Experiment on Indian Pines data
The Indian Pines hyperspectral data were acquired by the Airborne Visible/Infrared Imaging Spectrometer in 1992 and contain 145 × 145 pixels and 220 bands. The spectrum ranges from 0.4 μm to 2.5 μm, and the spatial resolution is 20 m. Only 200 bands of the image are taken into account in this experiment; the noise bands and the atmospheric vapor absorption bands are excluded. Nine large classes are considered in this experiment (Figure 4(a) and (b)). About 25% of the samples are randomly selected as training samples. The remaining 75% of the samples are used as testing samples. The sample categories and quantities are shown in Table 3.

Experiment and analysis
We compare KM-KMNF with OMNF, KMNF, and OKMNF on two real hyperspectral remote sensing images. The features extracted by each method are used as the input of maximum likelihood (ML) classification and a support vector machine (SVM) to validate the performance. The SVM classifier with RBF kernels from the MATLAB SVM toolbox LIBSVM (Chang and Lin 2011; Liao et al. 2012) is applied in our experiments. Fivefold cross-validation is used to find the best SVM parameters (Liao et al. 2012). Each experiment is run ten times, and the average over these ten runs is reported for comparison.
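The evaluation protocol for the ML classifier (random per-class split, classification, overall accuracy averaged over ten runs) can be sketched as follows. This is an assumption-laden illustration, not the experimental code: the function names are hypothetical, the ML classifier is taken to be the usual Gaussian maximum likelihood rule with equal class priors, a small ridge is added to each covariance for numerical stability, and the LIBSVM branch with fivefold cross-validation is not shown.

```python
import numpy as np


def fit_gaussian_ml(X, y):
    """Estimate mean, inverse covariance and log-determinant per class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        # Small ridge keeps the covariance invertible (an assumption).
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        _, logdet = np.linalg.slogdet(cov)
        model[c] = (mean, np.linalg.inv(cov), logdet)
    return model


def predict_gaussian_ml(model, X):
    """Assign each row of X to the class with the highest log-likelihood."""
    classes = sorted(model)
    scores = []
    for c in classes:
        mean, inv_cov, logdet = model[c]
        diff = X - mean
        maha = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
        scores.append(-0.5 * (maha + logdet))  # equal priors assumed
    best = np.argmax(np.stack(scores, axis=1), axis=1)
    return np.array(classes)[best]


def repeated_overall_accuracy(X, y, train_fraction=0.5, runs=10, seed=0):
    """Mean and standard deviation of overall accuracy over random splits."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(runs):
        train = np.zeros(len(y), dtype=bool)
        for c in np.unique(y):
            # Draw the same fraction of training samples from every class.
            idx = rng.permutation(np.nonzero(y == c)[0])
            train[idx[: int(round(train_fraction * len(idx)))]] = True
        model = fit_gaussian_ml(X[train], y[train])
        pred = predict_gaussian_ml(model, X[~train])
        accuracies.append(np.mean(pred == y[~train]))
    return float(np.mean(accuracies)), float(np.std(accuracies))
```

Here `X` would hold the extracted features of the labeled pixels (one row per pixel) and `y` their class labels. Reporting the standard deviation alongside the mean allows the stability of a feature extraction method to be compared across runs.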

Experiment on Pavia University data
The hyperspectral data of Pavia University, Italy, were collected using the reflective optical system imaging spectrometer in 2001. The data contain 610 × 340 pixels and 103 spectral bands. The spectrum ranges from 0.43 μm to 0.86 μm, and the data have a spatial resolution of 1.3 m. The image contains nine different surface classes (Figure 2(a) and (b)). About 50% of the samples are randomly selected as training samples. The remaining 50% of the samples are used as testing samples. The sample categories and quantities are shown in Table 1. The classification accuracy of the different feature extraction methods is shown in Table 2 and Figure 3. The classification results are shown in Figure 2(e-h). In Table 2, the classification accuracy of the four feature extraction methods increases with the number of features. In particular, KM-KMNF produces a higher accuracy than the other three methods. Compared with KMNF and OKMNF, KM-KMNF improves the accuracy by 10.95% and 4.69%, respectively, which indicates that incorporating spatial information through image segmentation, as the minimum region of the MLR in the KM-KMNF algorithm, benefits post-applications. The accuracies of KM-KMNF and OKMNF are much higher than that of KMNF, and KM-KMNF performs better than OKMNF. The results show that, for noise estimation in the KMNF algorithm, the spectral information is more important than the spatial information of the hyperspectral image. Moreover, Figure 2(e-h) shows that the classification results after KM-KMNF transformation are more consistent with the actual distribution of the surface features than those of the other methods.

Table 1. Training and testing samples of the Pavia University data.

| Classes | Training | Testing |
| --- | --- | --- |
| Asphalt | 3315 | 3316 |
| Meadows | 9324 | 9325 |
| Gravel | 1049 | 1050 |
| Trees | 1532 | 1532 |
| Metal sheets | 672 | 673 |
| Bare soil | 2514 | 2515 |
| Bitumen | 665 | 665 |
| Bricks | 1841 | 1841 |
| Shadows | 473 | 474 |
| Total | 21,385 | 21,391 |

For the Indian Pines data (sample quantities in Table 3), the classification accuracy of the different classifiers (ML and SVM) using different feature extraction methods is shown in Tables 4 and 5. Tables 4 and 5 show that KM-KMNF achieves a higher accuracy than the other algorithms as the number of features increases, similar to the results on the Pavia University data. This further demonstrates the efficiency of KM-KMNF for feature extraction. Figure 5 also compares the accuracies of the different methods. By taking both the spatial and the spectral information of the image into consideration, KM-KMNF produces higher accuracy than the KMNF method, which considers only the spatial information for noise estimation. This confirms that accurate calculation of the noise is crucial to the performance of feature extraction for KMNF- and MNF-based methods. In Table 4, we can also see that most standard deviations of KM-KMNF are smaller than those of the other methods, indicating the stable performance of KM-KMNF.

From Figure 3 and Figure 5, as the number of extracted features increases, the performance of each feature extraction method first increases and then decreases. This confirms that feature extraction can improve the classification performance on hyperspectral images, and that most information can be preserved even with a few extracted features. With the SVM classifier, the findings are similar to those with the ML classifier: KM-KMNF produces better results than the other methods. In particular, KM-KMNF with the ML classifier outperforms KM-KMNF with the SVM classifier. We also conduct experiments with an increasing number of training samples, as shown in Table 6. As the number of training samples increases, the classification accuracy first increases and then remains relatively stable. To compare the efficiency of the feature extraction methods, we take the Indian Pines data as an example (extracting 20 features): the time consumed by KM-KMNF, OKMNF, OMNF, and KMNF is 21.57 s, 19.99 s, 20.71 s, and 1.50 s, respectively. KM-KMNF thus consumes relatively more time but delivers better performance. When fewer features are extracted, KM-KMNF consumes less time, and it achieves its best performance with around 14 features.

Table 5. Overall accuracies of the SVM classification on the Indian Pines image using different dimensionality reduction methods.

Figure 5. Comparison of the accuracies of the SVM classification on the Indian Pines image after using different dimensionality reduction methods.

Conclusions
This paper proposes a new kernel minimum noise fraction transformation for the feature extraction of hyperspectral data. Our method uses spectral-dimension decorrelation to calculate the NF, and further improves the accuracy of the NF by introducing spatial information through image segmentation to determine the minimum region of the MLR. Moreover, the conventional K-means clustering method has been improved so that the resulting OK-means algorithm is more suitable for hyperspectral image segmentation. Two real hyperspectral image data sets are used for the experiments, and the classification accuracy is used as the index to evaluate the performance of the feature extraction algorithms.
The results show that the accuracy of the proposed segmentation-based method is much higher than that of the original KMNF algorithm. It is also much higher than that of the OMNF and OKMNF algorithms. Better noise estimation brings better performance in both feature extraction and its post-application to hyperspectral image classification for MNF- and KMNF-based methods.