Spectral–spatial feature extraction using orthogonal linear discriminant analysis for classification of hyperspectral data

ABSTRACT Hyperspectral image classification is among the most frequent topics of research in recent publications. This paper proposes a new supervised linear feature extraction method for classification of hyperspectral images using orthogonal linear discriminant analysis in both spatial and spectral domains. In fact, an orthogonal filter set and a spectral data transformation are designed simultaneously by maximizing the class separability. The important characteristic of the presented approach is that the proposed filter set is supervised and considers the class separability when extracting the features, thus it is more appropriate for feature extraction compared with other filters such as Gabor. In order to compare the proposed method with some existing methods, the extracted spatial–spectral features are fed into a support vector machine classifier. Some experiments on the widely used hyperspectral images, namely Indian Pines, Pavia University, and Salinas data sets, reveal that the proposed approach leads to state-of-the-art performance when compared to other recent approaches.


Introduction
The availability of remotely sensed hyperspectral images acquired in the adjacent bands of the electromagnetic spectrum, makes it necessary to propose techniques by which one is able to interpret such high-dimensional data in various applications (Aguilar, Fernández, Aguilar, Bianconi, & Lorca, 2016Nagabhatla & Kühle, 2016Plaza et al., 2009;Zehtabian & Ghassemian, 2016). One of the most important applications of hyperspectral images is classification in which land covers are distinguished from each other. For this purpose, several techniques have been developed using the rich source of spectral information along with spatial information buried in the hyperspectral images (Su, 2016).
Various experiments have proven the sparseness of high-dimensional data spaces such that the data structure involved exists primarily in a subspace (Green et al., 1998). As a consequence, there is a need for feature extraction methods by which the dimensionality of the data can be reduced to the right subspace without losing the essential information that allows for the separation of classes. These techniques can be mainly classified into two categories (Li et al., 2015). (1) Several methods use the linear transforms to extract the spectral or spatial information from hyperspectral data. Widely used linear feature extraction methods in the spectral domain include the principal component analysis (PCA) (Plaza, Martinez, Plaza, & Perez, 2005), independent component analysis (ICA) (Bayliss, Gualtieri, & Cromp, 1997), and linear discriminant analysis (LDA) (Bandos et al., 2009), while those in the spatial domain include Gabor filter bank (Mirzapour & Ghassemian, 2015) and wavelets. (2) Several techniques exploit spectral or spatial features obtained through nonlinear transformations. Examples of these methods are morphological analysis (Benediktsson, Palmason, & Sveinsson, 2005;Plaza et al., 2005), kernel methods (Camps-Valls & Bruzzone, 2005), and manifold regularization (Ma, Crawford, & Tian, 2010).
On the other hand, feature extraction methods can be classified into either supervised or unsupervised ones. For instance, PCA, Gabor filter bank, ICA, and morphological analysis are unsupervised, i.e. these methods do not use the training samples available for the classification procedure. But, LDA is an example of supervised methods whose features are extracted by considering training samples. Some state-of-the-art supervised methods are: (1) Partletsbased methods which use a library of pretrained part detectors (Cheng et al., 2015). (2) kernel-based feature selection which maximizes the separability of the feature space with respect to the radial basis function (RBF) kernel (Kuo, Ho, Li, Hung, & Taur, 2014). (3) rotation-invariant and fisher discriminative convolutional neural networks which introduces and learns a rotation-invariant layer and a Fisher discriminative layer, respectively, on the basis of the existing highcapacity CNN architectures (Cheng, Zhou, & Han, 2016). (4) collection of part detectors which uses a set of part detectors to detect objects or spatial patterns within a certain range of orientation (Cheng, Han, Zhou, & Guo, 2014). This paper focuses on using linear transformations for feature extraction and feature reduction. Linear transformations are attractive for image processing owing to the fact that they have a very low computational burden, while at the same time they are successful in extracting relevant information from hyperspectral images.
To linearly extract the spectral and spatial features, a possible solution is LDA, which not only has an easy implementation but also has a clear physical interpretation. In addition, it provides high classification accuracy. These good capabilities make it very useful in practical applications such as hyperspectral feature extraction. However, LDA has not been yet used to design a filter able to extract the spatial features of hyperspectral imagery. Therefore, this paper proposes a new orthogonal LDA-based filter set extracting the spatial features of hyperspectral data and maximizing the class separability.
In spite of the fact that LDA provides good performance, conventional LDA cannot work in illposed problems, when the number of training samples is less than the number of features (Bandos et al., 2009). To overcome the singularity problem, many approaches have been proposed, including uncorrelated LDA (ULDA), orthogonal linear discriminant analysis (OLDA), regularized LDA (RLDA), etc. (Bandos et al., 2009;Ye & Xiong, 2006a).
The transformation used in the OLDA method is based on the simultaneous diagonalization of both between-class and within-class scatter matrices, by which the singularity problem is solved implicitly (Ye & Xiong, 2006a). On the other hand, the performance of OLDA is better than that of ULDA (Ye, 2005). This is due to the effect of the noise removal property inherent in OLDA (Ye, 2005). In addition, because OLDA does not search over the regularization parameter, the computational burden of it is very low in comparison with that of RLDA (Bandos et al., 2009). Therefore, OLDA is an appropriate feature extraction method.
One of the main disadvantages of LDA-based methods, particularly when used for extracting spatial features, is that the number of LDA features is equal to the number of classes minus one. Due to this limitation, the sufficient number of features cannot be extracted. For example, in the Pavia University data, there are only nine classes of materials. So, only eight features can be extracted by LDA-based methods. These features do not contain the essential information that allows for the separation of classes. To cope with this problem, we have used the solution proposed in Okada and Tomita (1985). We will describe the solution in the corresponding section of the paper.
The paper is structured in seven sections. Section 2 presents the LDA and OLDA feature reduction methods. Extracting the spatial features based on OLDA and PCA is proposed in Section 3. The proposed methodology is presented in Section 4. In Section 5, the support vector machine (SVM) classifier is reviewed. Section 6 is devoted to the experimental results and discussions. Eventually, our conclusions are given in Section 7.

Linear discriminant analysis and orthogonal linear discriminant analysis
The following subsections review the classical LDA and OLDA as presented in Bandos et al. (2009), Ye andXiong (2006a).

Linear discriminant analysis
Given a data matrix X ¼ ðx ij Þ 2 R mÂn , where each column corresponds to a training sample and each row corresponds to a particular feature, we consider finding a linear transformation matrix G that maps each column In classical LDA, the transformation matrix G is computed so that low-dimensional feature space fulfills a given maximization criterion of separability among class distributions: (1) n k ðm k À mÞðm k À mÞ T is the between-class scatter matrix, S w ¼ 1 n P K k¼1 P i2I k ðx i À μÞ ðx i À μÞ T is the within-class scatter matrix, and m k , I k and n k are the mean vectors of class k, the index set, and the number of training samples of class k, respectively. In addition, m is the mean vector of the data, K is the number of classes, and n is the number of training samples. The maximization criterion in Equation (1) can be rewritten as the following maximization problem: where S ¼ S b þ S w denotes the estimate of the covariance matrix. It is worth mentioning that the solution of Equations (1) and (2) can be obtained if S w and S are nonsingular. The scatter matrices S w , S b and S can be redefined as: ffiffiffiffi ffi n 1 p ðm 1 À mÞ; :::; ffiffiffiffi ffi n k p ðm k À mÞ ! ; (5) in which O is a column vector of n ones, O k is a column vector of n k ones, and X k denotes the data matrix of elements in class k. Note that matrices S and S w are singular when the number of observations is smaller than the number of features, and consequently, the solution cannot be obtained. The OLDA method, described in the following subsection, can alleviate this problem.

Orthogonal linear discriminant analysis
OLDA has been proposed by Ye, et al. as an extended version of classical LDA (Ye, 2005). Unlike LDA, OLDA provides orthogonal features. In addition, OLDA can be used even if all scatter matrices are singular, and as a consequence, it solves the singularity problem to some extent. The OLDA algorithm can be decomposed into three steps (Ye & Xiong, 2006b): (1) removing the null-space of the estimated covariance matrix S; (2) using classical ULDA as an intermediate step; and (3) applying an orthogonalization step to the output of the ULDA transformation. The optimal transformation in OLDA is obtained by solving the following optimization problem: where † denotes pseudo-inverse. On the one hand, the pseudo-inverse of a matrix is well defined, when the inverse of a matrix does not exist. On the other hand, pseudo-inverse of a matrix coincides with its inverse, when the matrix is invertible. The orthogonality condition is then imposed on the constraint (step 3 of the algorithm). The main advantage of OLDA is that it is less susceptible to overfitting and more robust against noise than other LDA-based methods e.g. ULDA (Ye, 2005). The overfitting phenomenon generally occurs when a model is excessively complex, such that there are too many parameters relative to the number of training samples. Thus, to work well, when the number of training samples is limited, the proposed method uses OLDA for feature extraction. In summary, we prefer OLDA for the following reasons: (1) the performance of OLDA is better than that of ULDA (Ye, 2005); (2) overfitting is not a serious problem in OLDA; (3) the computational burden of OLDA is less than that of RLDA; (4) the eigenvectors of OLDA are orthogonal to each other which are in harmony with the Gram-Schmidt procedure used in our methodology (see Sections 3 and 4), such that all of the extracted features are orthogonal to each other (see Okada and Tomita (1985), Ye, Janarda, Li, and Park (2006), Ye and Xiong (2006b) for more details).

Designing a filter bank based on OLDA and PCA
This section proposes a new filter bank which is able to extract the spatial information of a single band image. The proposed filer bank aims at maximizing the Fisher score. Let the single band image be denoted by I 2 R r 1 Âr 2 , where r 1 and r 2 are the number of rows and columns in the image, respectively, and the two-dimensional (2-D) kernel be denoted by h: ::: To extract the spatial information of the image I, we should obtain A = I*h, where * denotes the convolution operation and the matrix A represents the extracted features. To obtain A(i, j), i.e. the (i, j)th element of matrix A, one should perform the following steps. Firstly, we rotate the matrix h 180°about its center. Secondly, we slide the center of the rotated kernel so that it lies on the (i, j)th element of I (see Figure 1). Finally, we multiply each element of the rotated kernel by the corresponding element of I and sum these values (see Figure 1). The convolution operation can also be represented by matrix multiplication. To obtain A(i, j), the element I(i, j) and its neighboring elements are inserted into a vector denoted by J ij 2 R f 2 Â1 . In addition, the elements of the rotated kernel are inserted into a vector denoted by H 2 R 1Âf 2 . Multiplying H by J ij gives A(i, j). This procedure is also shown at the bottom of Figure 1 (f is equal to 3 in this figure).
In fact, J ij represents the initial feature space of the (i, j)th element of I and the kernel vector Haims at extracting the relevant feature from the initial feature space J ij . To obtain all of the elements of the matrix A, one should multiply the kernel vector Hby the matrix M representing the initial feature space of all of the pixels (J ij 2 R f 2 Â1 ): M ¼ ½J 11 ;J 12 ; :::;J 1r 2 ;J 21 ; :::;J r 1 r 2 !initial feature space of different pixels 2 R f 2 Âr 1 r 2 : (9) Each column of M represents the initial feature space of the corresponding pixel. Using the matrix M, one can easily obtain the elements of the matrix A: ;J 12 ; :::;J 1r 2 ;J 21 ; :::;J r 1 r 2 ¼ ½HJ 11 ;HJ 12 ; :::;HJ 1r 2 ;HJ 21 ; :::;HJ r 1 r 2 ¼ ½Að1; 1Þ;Að1; 2Þ; :::;Að1; r 2 Þ;Að2; 1Þ; :::;Aðr 1 ; r 2 Þ: To use LDA-based methods, the data matrix X ¼ ðx ij Þ 2 R mÂn should be formed, where each column corresponds to a training sample and each row corresponds to a particular feature (see Section 2. A). So, the new matrix V is defined representing the initial feature space of only training samples: V ¼ ½J ij ; ::: 2 R f 2 Ân ði; jÞ 2 training samples; (11) and H Â V ¼ H Â ½J ij ; ::: ¼ ½HJ ij ; ::: ¼ ½Aði; jÞ; ::: ði; jÞ 2 training samples: (12) In fact, matrix V is the data matrix to which one can apply LDA-based methods and the kernel vector H is the linear transformation which is denoted by G in the previous section. From these equations, one can see that the data matrix V is available and the kernel vector H maximizing the Fisher score should be obtained. So, the kernel vector H is equal to the eigenvector obtained by applying OLDA to the matrix V. Note that the 2-D kernel h can be easily obtained from the vector H, because H is a row-wise version of h (see Figure 1).
As mentioned in Section 1, the number of eigenvectors in LDA-based methods is equal to the number of classes minus one (K−1). So, a filter bank with K−1 kernels is formed in which the kernels are orthogonal to each other. Using this filter bank, one can extract K−1 features. However, the number of features (i.e. K−1) is not sufficient for the classification purpose, because essential information that allows for the separation of classes may be lost. To tackle this problem, the technique proposed in Okada and Tomita (1985) is used in which S w and S b are projected onto a subspace orthogonal to the computed eigenvectors, and the OLDA analysis is repeated in this subspace. So, after designing a number of filters (i.e. K−1) by the OLDA method, the matrices S w and S b are projected onto the subspace orthogonal to the space of the designed filters. Then, a new set of filters is designed. This process is iterated until the sufficient number of features is provided. The Gram-Schmidt method is employed in the process by which the orthogonal subspace is determined (Okada & Tomita, 1985). Note that the maximum number of orthogonal filters which can be designed is f 2 (because the dimension of H is f 2 ), and consequently, the maximum number of features for each pixel is f 2 .
Besides, one can extract the spatial features of a single band image by the PCA transform. To this end, we should apply PCA to the matrix M containing initial feature space of all of the samples. Note that PCA is an unsupervised method. So, we should consider all of the samples and that is why we do not apply the PCA to the matrix V. The principal components corresponding to most of the cumulative variance are the spatial features extracted by PCA. This technique is called spatial PCA (SPCA) by which one can extract the spatial features of a single band image in an unsupervised manner. If the PCA transform is used for extracting the spatial features, the set of orthogonal filters H is equal to eigenvectors of this transform.

Proposed methodology
The methods proposed in the previous section are efficient for extracting the spatial features of a single band image. Because hyperspectral sensors provide a large number of spectral bands, a feature extraction method should be able to extract spatial and spectral features from these images.
When hyperspectral images are used, the initial feature space of (i, j)th pixel is composed of not only the neighboring pixels in the spatial domain but also neighboring pixels in the spectral domain (see Figure 2). So, one should use a 3-D kernel to extract the relevant features (see Figure 2). Similar to the previous section, the kernel vector H 2 R 1ÂNf 2 is the row-wise version of the rotated 3-D kernel, the vector J ij 2 R Nf 2 Â1 represents the initial feature space of the element (i, j), the matrix M ¼ ½J 11 ;J 12 ; :::; J r 1 r 2 2 R Nf 2 Âr 1 r 2 represents the initial feature space of all samples, and the matrix V ¼ ½J ij ; ::: 2 R Nf 2 Ân represents the initial feature space of training samples, where N is the number of spectral bands. In addition, multiplying H by V yields A, i.e. H Â V ¼ ½Aði; jÞ; ::: ði; jÞ 2 training samples. However, estimating the coefficients of the kernel by the reasonable number of training samples is not usually tractable (if OLDA is applied to the matrix V, overfitting occurs). For example, the Indian Pines data set has N = 200 spectral bands. If the length of the kernel f is set to 30, it is essential to estimate 200 × 30 × 30 = 180000 coefficients of the kernel vector H. Therefore, an initial dimensionality reduction stage such as PCA should be used to reduce the dimensionality of V. Using PCA along with an LDAbased method is a popular framework used to extract features from high-dimensional data (Yang & Yang., 2003).
To reduce the dimensionality of the matrix V, it should be multiplied by the eigenvectors obtained from applying PCA to M (because PCA is unsupervised, it should be applied to M). However, applying the PCA transform to the matrix M 2 R Nf 2 Âr 1 r 2 is not possible, because of the high dimensionality of this matrix. On the one hand, using PCA in high-dimensional matrices is computationally expensive. On the other hand, to accurately estimate the coefficients of PCA, the value of r 1 r 2 should be large enough in comparison to Nf 2 (to accurately estimate the covariance matrix of PCA, the number of samples should be several times larger than that of dimension/features). The latter condition is not necessarily satisfied by the matrix M. So, the proposed strategy is as follows: (1) Apply the PCA transform to the spectral bands of hyperspectral data and retain the principal components corresponding to 90% of the cumulative variance. This stage reduces the dimension of spectral bands to few principal components. In fact, this step extracts the spectral information of the hyperspectral data. Suppose that the number of retained principal components is Nu, where Nu ≪ N (see Figure 3). including the spectral and spatial features, extracted in the steps 1 and 2, respectively. Subsequently, apply OLDA to the matrix V t to obtain K−1 eigenvectors of this matrix. The kernel vector H t is set equal to these eigenvectors. So, the vector H t can extract the best spectral-spatial features of the total matrix V t . After obtaining K−1 kernel vectors H t and extracting the corresponding features, the matrices S w and S b are projected onto the subspace orthogonal to these K−1 vectors. Then, a new set of kernel vectors H t is designed and a new set of features is extracted. This process may be iterated several times.
In fact, step 1 provides Nu spectral bands (Nu spectral features) for each pixel. Moreover, step 2 extracts Nu d spatial features from each PC d . Each PC d is a single band image. So, one can easily apply the SPCA technique to it (see section 3). Now, there are Nu spectral features and P Nu d¼1 Nu d spatial features for each pixel. In step 3, the total matrix V t is constituted. Each column of this matrix represents Nu+ P Nu d¼1 Nu d spectral-spatial features of a training sample. Subsequently, OLDA is applied to this matrix for obtaining K−1 kernel vectors H t . The block diagram representing the feature extraction methodology is shown in Figure 3. To compare the proposed feature extractor with some recently proposed spectral-spatial feature extraction methods, the extracted features are fed into an SVM classifier.

Support vector machine
SVMs, introduced by Vapnik, have attracted a great deal of interest within the context of statistical learning theory, classification, and function estimation (Vapnik, 1998). This classifier aims at finding the best hyperplane that correctly separates the points of classes and provides maximum margin among them (see Figure 4). SVMs belong to the general group of kernel methods in which the dependency between the kernel and data is described via dot products (Ben-Hur & Weston, 2010). The idea of kernel-based SVM makes it possible to map the training data into a higher dimensional feature space via some mapping and to construct a separating hyperplane there. This hyperplane is a nonlinear decision boundary in the input space. By use of a kernel function Kðx i ; y i Þ, in which x i and y i are the input vectors, one can compute the separating hyperplane without mapping the data into a higher dimensional feature space (Howley & Madden, 2004;Zehtabian et al., 2015). Typical choices for kernels are: polynomial kernel, RBF kernel, and sigmoid kernel. In order to classify feature vectors extracted in the previous section, a standard SVM classifier with a polynomial kernel of degree 3 is used: in which c 0 , γ and degree of the kernel (which is three) are the kernel parameters. More details on the SVM classifier can be found in Ben-Hur and Weston (2010), Chang and Lin(2001), Howley and Madden (2004), Vapnik (1998).

Experimental results
The goal of the experimental validation using real data sets is to compare the performance of the proposed method with that reported by the stateof-the-art competitors. The proposed method is applied to three well-known real hyperspectral data sets and the obtained features are used to classify these images. As mentioned before, the classifier used in our experiments is SVM with a polynomial kernel of degree 3.
The SVM classifier is implemented using LIBSVM in which the default kernel parameter values, i.e. γ = 1/(Number of features), and c 0 ¼ 0 are used (Chang & Lin, 2001). So, very time-consuming tuning techniques such as cross validation or other methods have not been used. A number of labeled samples of each class are randomly selected to train the SVM classifier. In order to study the effect of training sample size on the proposed feature extraction method, we have considered four different proportional schemes: (a) 1%, (b) 5%, (c) 10%, and (d) 12.5% of the labeled samples of each class are used. A point that should be mentioned here is that there are some classes with a small number of labeled samples in the Indian Pines data. To ensure that we have sufficient training samples for the SVM classifier, a minimum of three samples per class is considered in schemes (a) and (b). In other words, n i ¼ maxð3; 0:01N i Þ for scheme (a) and n i ¼ maxð3; 0:05N i Þ for scheme (b), where N i is the number of labeled samples of class i.
Although the parameter f is dependent on the spatial structure and texture of the remotely sensed image, this parameter should be large enough to capture the spatial information of the image. The size of f is set to 35 in our experiments (H t 2 R N35 2 Â1 ), because the improvement is not significant for a feature extractor with bigger size.

Indian Pines data set
The Indian Pines hyperspectral data set was acquired on 12 June 1992 by the AVIRIS sensor, covering a 2.9 km × 2.9 km portion of Northwest Tippecanoe County, Indiana, USA. Two-thirds of this scene is agriculture, and one-third of it is forest or other natural perennial vegetation. These data include 220 bands with 145 × 145 pixels and a spatial resolution of 20 m. Twenty bands of this data set have been removed because they cover water absorption spectrum band (104-108, 150-163, 220). The corrected data set including 200 bands has been used in our experiments. There are 16 different land-cover classes in the original ground-truth image. Table 1) lists the number of labeled pixels of each class.
Due to the presence of mixed pixels in all available classes and because of the unbalanced number of available labeled pixels per class, this data set constitutes a challenging classification problem (Song et al., 2014).
Several combinations of spectral and spatial features (16 cases) achieving high performance (Mirzapour & Ghassemian, 2015) are fed into the SVM classifier and the classification results are given in Table 2: overall accuracy (OA), average accuracy (AA), and kappa statistics (κ). Note that the results of tables are obtained by averaging the values obtained in 10 Monte Carlo runs. The abbreviations used in these tables are as follows: HS is the hyperspectral image, MP is the morphological profile, Gab is the Gabor features, Gl is the GLCM (gray-level co-occurrence matrix) features, and SGl is the segmentation-based GLCM features. In addition, Figure 5 shows classification maps for the nine best methods (see Table 2).

Pavia university data set
In this experiment, the ROSIS Pavia University scene is used to evaluate the proposed approach. This scene was acquired by the ROSIS optical sensor during a flight campaign over the urban area of the University of Pavia, Pavia, Italy. This flight was sponsored by the European Union. It has 103 spectral bands covering the spectral range from 0.43 to 0.86 μm. Each spectral band comprises 610 × 340 pixels with spatial resolution of 1.3 m per pixel. Nine thematic landcover classes were identified in this scene which are given in Table 3. The OA, AA, and κ obtained by the SVM classifier are reported in Table 4. Figure 6 shows classification maps for the nine best methods (see Table 4).

Salinas data set
The Salinas data set includes 224 spectral bands. These data were collected by the AVIRIS sensor over Salinas Valley, California, USA. The image size in pixels is 512 × 217, while the geometric resolution of each pixels is 3.7 m. As with the Indian Pines scene, we discarded the 20 water absorption bands (108-112, 154-167, 224) to obtain a corrected image containing 204 spectral bands. These data include 16 agricultural land covers with very similar spectral signatures (Plaza et al., 2005). The number of labeled pixels of each class is listed in Table 5. The mentioned indices for different schemes are shown in Table 6. The classification maps for the nine best methods are shown in Figure 7 (see Table 6).

Discussing the results
In this subsection, we will discuss the classification results for various combinations of spatial and spectral features used in the experiments. Then, we compare the performance of the proposed method with that of competing methods. From these tables, one can see that the best accuracies for the same data set with different training schemes correspond to different features. In other words, there may not be found a unique feature extraction method which gives the best results for different number of training samples even for the same data set. Besides, for different data sets, the best cases corresponding to the same number of training samples are different. So, it seems that finding a unique feature extraction method which gives the best results for different training schemes and for different data sets, is difficult. In fact, it is not far from the truth to say that there does not exist such a feature extraction method (Kuo, Li, & Yang, 2009). However, because the coefficients of the proposed method depend on the training samples of the data, it should be more consistent with the characteristics of the images. In other words, it is less sensitive to the different data sets compared with unsupervised methods such as PCA, Gabor, and morphological filters.
As can be seen from the tables, the proposed features are the best ones in the schemes (b), (c), and (d) for the Indian Pines data, and in schemes (a), (b), (c), and (d) for the Salinas data. However, for the Pavia University data set, the proposed features are the best ones only in schemes (c) and (d) (except for AA criterion). These results reveal that the proposed features are almost certainly the best ones when there are enough training samples. Because the coefficients of the proposed feature extractor are estimated from the training samples, the more the Table 2. OA, AA, AND KAPPA Statistics (κ) obtained by the SVM classifier on the Indian Pines data. Different combinations of the extracted features are used. the symbol "+" denotes stacking. The number of features is different for different schemes in the proposed method.      1  BBrocoli_green_weeds_1  2009  9  Soil_vinyard_develop  6203  2  BBrocoli_green_weeds_2  3726  10  Corn_senesced_weeds  3278  3  Fallow  1976  11  Lettuce_romaine_4 wk  1068  4  Fallow_rough_plow  1394  12  Lettuce_romaine_5 wk  1927  5  Fallow_smooth  2678  13  Lettuce_romaine_6 wk  916  6  Stubble  3959  14  Lettuce_romaine_7 wk  1070  7  Celery  3579  15  Vinyard_untrained  7268  8  Grapes_untrained  11271  16 Vinyard_vertical_trellis 1807 Figure 7. Classification results obtained by the SVM classifiers for the Salinas data set (using 10% of the available labeled data for the scene).
filters, have better performance. This is because of the overfitting that typically occurs in unsupervised methods. So, in parametric feature extraction methods, e.g. the proposed method, the number of training samples plays a vital role. However, because the OLDA is used in our methodology, which is less sensitive to the overfitting and limited number of training samples, the proposed method has the best performance in schemes (a) and (b) for the Salinas data set and in scheme (b) for the Indian Pines data set. At the end of this section, we provide a comparative assessment. The proposed algorithm is compared with some recently proposed spectral-spatial classification methods. To do a reliable comparison, the classification results for SVM-CK (Camps-Valls, Gomez-Chova, Muñoz-Marí, Vila-Francés, & Calpe-Maravilla, 2006) are obtained by implementing the algorithm using the publicly available codes. But, the results for other methods are reported from the corresponding papers. Thus, some results are absent. The results of this comparison are given in Table 7. As can be seen from this table, the proposed method outperforms other methods in schemes (c) and (d), i.e. when the number of training samples is large. Moreover, the proposed method outperforms other methods in scheme (b) for the Salinas data, i.e. when the number of training samples is moderate. In addition, Table 8 reports the computational time of the proposed method using a standard notebook (Intel Core i5, 2.66 GHz and 4 GB of RAM) for different data sets (each experiment is repeated 10 times and the average is reported). As can be seen from this table, the computational complexity of the proposed method is moderate.

Conclusion
This paper first proposes a new filter bank able to extract the spatial information of hyperspectral   images. Then, a new supervised feature extraction technique is proposed in Section 4, which takes into account both the spatial and spectral information simultaneously. To obtain the coefficients of the feature extractor, the Fisher score is maximized. To this end, and to overcome the singularity problem, the OLDA method is used which is less susceptible to overfitting and is more robust against noise. To reduce the dimensionality of the feature matrix, the PCA transform is used as a preprocessing step. The steps of the proposed method are (1) applying PCA in the spectral domain, (2) applying PCA in the spatial domain, (3) extracting the best spectral-spatial features in terms of class separability using OLDA. Due to the fact that the number of extracted features is not sufficient for classification, the orthogonalization process is iterated (by the Gram-Schmidt algorithm) until the sufficient number of features is provided. To compare the performance of the proposed method with that of other methods, the extracted features were classified using an SVM classifier with a polynomial kernel of degree 3. Three real hyperspectral data sets, namely Indian Pines, Pavia University, and Salinas were used in our experiments.
In addition, to investigate the influence of the size of training set on the feature extraction methods, we adopted four different schemes for the number of training samples. The experimental results have demonstrated that the proposed feature extraction method can provide excellent features for the classification purpose when the number of training samples is large enough. Finally, the proposed method was compared with some recently proposed spectralspatial classification methods. The comparative results confirm the good capabilities of the proposed method.