Fully polarimetric synthetic aperture radar data classification using probabilistic and non-probabilistic kernel methods

ABSTRACT The classification of fully polarimetric synthetic aperture radar (PolSAR) data is one of the favourite topics in the remote sensing community. To date, a wide variety of algorithms have been utilized for PolSAR data classification, and among them kernel methods are the most attractive for this purpose. The most famous kernel method, i.e., the support vector machine (SVM), has been widely used for PolSAR data classification. However, until now, no studies have classified PolSAR data using certain extended SVM versions, such as the least squares support vector machine (LSSVM), the relevance vector machine (RVM) and the import vector machine (IVM). Therefore, this work employed and compared these four kernel methods for the classification of three PolSAR data sets. The methods were compared in two groups: the SVM and LSSVM as non-probabilistic kernel methods vs. the RVM and IVM as probabilistic kernel methods. In general, the results demonstrated that the SVM was marginally more efficient, more accurate and more stable than the probabilistic kernels. Furthermore, the LSSVM performed much faster than the probabilistic kernel methods and its associated version, the SVM, with comparable accuracy.

Among the algorithms mentioned above, there has been a strong inclination to use the SVM, thanks to its high efficiency and generalizability in the classification process. However, this method has certain drawbacks, such as a time-consuming training phase and non-probabilistic outputs, which have been addressed by researchers. A great deal of effort has been made over the last few decades to extend the SVM's original formulation, in order to speed up its processing, produce probabilistic outputs or provide sparser models (Mountrakis et al., 2011). For example, the least squares SVM (LSSVM) was recommended with the aim of expediting the training phase of the SVM (Wang & Hu, 2005). In another example, the relevance vector machine (RVM) was proposed as a sparser Bayesian kernel alternative to the SVM (Tipping, 2001). In addition to the RVM, the import vector machine (IVM) is another probabilistic extension of the SVM (Zhu & Hastie, 2005). Although these three extended algorithms have been used widely for optical remote sensing data classification (Braun et al., 2011, 2012; Tipping, 2001; Zhu & Hastie, 2005), to date, there are no studies using them for PolSAR data classification.
Hence, this work aims at using and comparing these four kernel methods for the classification of three PolSAR data sets, in terms of accuracy and speed. These methods have been divided into two groups: the SVM and LSSVM as non-probabilistic kernel methods vs. the RVM and IVM as probabilistic kernel methods. Section 2 presents the procedures of these algorithms for classification. The data sets are documented in Section 3 and the results of the implementation of the methods are explained in detail for the three data sets in Section 4. The conclusion of the paper is presented in Section 5.

Methodology
Suppose that $X$ is the training data, with $n$ samples $x_i$ and labels $y_i$, $i \in \{1, 2, \ldots, n\}$. Kernel methods solve the classification problem $y_i = w^{T}\phi(x_i) + b$, where $w$ is a weight vector, $\phi(x_i)$ is a non-linear mapping function and $b$ is a bias. A kernel function can accordingly be defined as $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. The most well-known kernel in remote sensing applications is the Gaussian radial basis function (RBF), defined as (Pal et al., 2013):

$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}}\right)$$

The Gaussian width parameter ($\sigma$) is the main parameter of the RBF kernel. The SVM, LSSVM, RVM and IVM are briefly described in the following sections.
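As a concrete illustration, the RBF kernel above can be evaluated for a pair of feature vectors in a few lines of NumPy (the vectors and the $\sigma$ value below are arbitrary toy values, not data from this paper):

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma):
    """Gaussian RBF kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_dist = np.sum((x_i - x_j) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Identical vectors give K = 1; distant vectors give K close to 0.
x = np.array([0.2, 0.5, 0.1])
print(rbf_kernel(x, x, sigma=1.0))         # 1.0
print(rbf_kernel(x, x + 10.0, sigma=1.0))  # ~0
```

Smaller $\sigma$ values make the kernel decay faster with distance, which is why $\sigma$ is the main tuning parameter of the RBF kernel.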

SVM
The SVM, as a binary supervised learning algorithm, is based on statistical learning theory (Hasani et al., 2017). This method implicitly maps the data into a high-dimensional space, in which the data have a simpler representation, using a kernel function. After mapping, the SVM estimates a separating hyperplane between the two classes in the kernel space, ensuring it is at a maximum distance from the nearest samples of each class, i.e., the support vectors (SVs). The optimization problem can be expressed mathematically as the following quadratic programming problem:

$$\min_{w, b, \zeta} \; J(w, \zeta) = \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{n}\zeta_i$$

$$\text{subject to: } y_i\left(w^{T}\phi(x_i) + b\right) \geq 1 - \zeta_i, \quad \zeta_i \geq 0$$

where $J$ is the loss function, $C$ is a regularization parameter and $\zeta_i \geq 0$ is a slack variable that tolerates falsely assigned training samples in favour of generalization. The position of the samples with respect to the separating hyperplane is then used for their classification. For a multi-class task, the SVM uses either the one-against-one or the one-against-all strategy.
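For intuition only, the hinge-loss objective above can be minimized by plain subgradient descent; the sketch below handles the linear-kernel case on toy 2-D data (the paper's experiments instead solve the kernelized quadratic program with an RBF kernel, and all numbers here are invented):

```python
import numpy as np

def linear_svm_fit(X, y, C=1.0, lr=0.01, epochs=1000):
    """Subgradient descent on (1/2)||w||^2 + C * sum(max(0, 1 - y_i*(w.x_i + b)))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1  # samples inside the margin contribute to the hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy linearly separable data standing in for PolSAR feature vectors.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = linear_svm_fit(X, y)
print(np.sign(X @ w + b))  # matches y
```

Only the margin-violating samples enter the gradient, which mirrors the fact that the final model depends only on the support vectors.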

LSSVM
The LSSVM, whose main idea is similar to that of the SVM, employs the least squares technique to solve the objective function (Wang & Hu, 2005). In mathematical terms, the loss function is changed to a quadratic loss function and the inequality constraints are replaced by equality constraints. Thus, the optimization problem is modified as follows (Suykens & Vandewalle, 1999):

$$\min_{w, b, e} \; J(w, e) = \frac{1}{2}\lVert w \rVert^{2} + \frac{\gamma}{2}\sum_{i=1}^{n}e_i^{2}$$

$$\text{subject to: } y_i\left(w^{T}\phi(x_i) + b\right) = 1 - e_i$$

where $\gamma$ plays the same role as $C$ in the SVM, controlling the loss term of the LSSVM, and $e_i$ is the error variable.
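The practical consequence of the equality constraints is that training reduces to solving a single linear system instead of a quadratic program, which explains the speed advantage reported later. A minimal NumPy sketch of the LSSVM dual system, following the formulation of Suykens & Vandewalle (1999), with invented toy data and parameter values:

```python
import numpy as np

def rbf(X1, X2, sigma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the dual system [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(y)
    Omega = (y[:, None] * y[None, :]) * rbf(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)  # one linear solve replaces the SVM's QP
    return sol[0], sol[1:]         # bias b, dual weights alpha

def lssvm_predict(X_train, y, b, alpha, X_test, sigma=1.0):
    return np.sign(rbf(X_test, X_train, sigma) @ (alpha * y) + b)

# Toy two-class data standing in for PolSAR feature vectors.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
b, alpha = lssvm_fit(X, y)
print(lssvm_predict(X, y, b, alpha, X))  # reproduces y on the training set
```

Note that, unlike the SVM, every training sample receives a non-zero dual weight here, so the LSSVM trades the SVM's sparsity for a much cheaper training phase.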

RVM
The RVM is a Bayesian sparse kernel alternative to the SVM, which retains many of the characteristics of the SVM but avoids its principal limitations (Tipping, 2001). Assuming a Bernoulli distribution, a likelihood function is introduced, and a weight vector $\lambda$ is sought that maximizes the conditional probability of the targets $t$ given the samples $x_i$ and $\lambda$:

$$p(t \mid X, \lambda) = \prod_{i=1}^{n} \sigma\left(f(x_i)\right)^{t_i}\left(1 - \sigma\left(f(x_i)\right)\right)^{1 - t_i}$$

where $\sigma(\cdot)$ is the logistic sigmoid function. The weights depend on two classes of hyperparameters, $\alpha$ and $\beta$, under a Gaussian prior, and these hyperparameters are iteratively optimized to maximize the likelihood. Finally, the model depends only on the subset $RV \subset X$ corresponding to the non-zero elements of $\lambda$, which are called relevance vectors. However, unlike the SVs, the RVs are not the points closest to the other class but samples from the centre of the class distribution, considered most typical for their own class (Braun et al., 2012). It is claimed that the RVM needs considerably fewer kernel basis functions, i.e., RVs (Bishop, 2006).

IVM
The IVM was proposed based on the similarity in shape between the loss term of the SVM, i.e., the hinge loss $\max(0,\, 1 - y_i f(x_i))$, and the negative log-likelihood (NLL) of the binomial distribution, i.e., $\ln\left(1 + e^{-y_i f(x_i)}\right)$. By replacing the loss term in the loss-penalty form of the SVM with the binomial NLL, a new classifier is obtained that directly provides probability outputs:

$$\min_{f} \; \frac{1}{n}\sum_{i=1}^{n}\ln\left(1 + e^{-y_i f(x_i)}\right) + \frac{\lambda}{2}\lVert f \rVert_{\mathcal{H}_K}^{2}$$

As in the case of the SVM, only a subset $IV \subset X$, called the import vectors, is needed to solve this problem, and kernel functions are introduced to work in high-dimensional feature spaces. This algorithm needs considerably fewer kernel basis functions (Braun et al., 2011).
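The similarity between the two loss curves that motivates the IVM can be checked numerically: both the hinge loss and the binomial NLL decrease monotonically in the margin $y_i f(x_i)$, penalize negative margins heavily and fade for large positive margins (the grid of margin values below is arbitrary):

```python
import numpy as np

margin = np.linspace(-3.0, 3.0, 121)   # values of y_i * f(x_i)
hinge = np.maximum(0.0, 1.0 - margin)  # SVM loss term
nll = np.log(1.0 + np.exp(-margin))    # binomial negative log-likelihood

# Both losses are large for misclassified samples (negative margin)
# and close to zero for confidently correct ones (large positive margin).
print(hinge[0], nll[0])    # both large at margin = -3
print(hinge[-1], nll[-1])  # both near 0 at margin = +3
```

The key difference is that the NLL is smooth and never exactly zero, which is what allows the IVM to output class probabilities directly.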

Experiment data sets
Three data sets, i.e., Flevoland, Foulum and Winnipeg, acquired by the L-band AIRSAR, L-band EMISAR and L-band UAVSAR sensors, respectively, were examined in this paper for comparative evaluation of the kernel methods (see Figure 1). These data can be freely accessed from the European Space Agency (ESA) website and NASA JPL website.
The first data set covered a large agricultural area in Flevoland, the Netherlands, acquired in 1989, and the third covered a large agricultural area southeast of Winnipeg, Manitoba, Canada, acquired on 29 June 2012. The second data set covered forest, agricultural and urban areas around a village in Denmark, acquired on 17 April 1989. The classes and the number of samples for these data sets are provided in Tables 1, 2 and 3.
The three data sets were pre-processed using multi-looking and speckle filtering. A 2 × 3 multi-looking mask was applied to the third data set; the other two data sets were not multi-looked. In addition, a 5 × 5 Boxcar filter was applied to the third data set, and a 7 × 7 Lee filter was applied to the other two data sets, to alleviate the speckle effect. The spatial resolutions of the data sets after pre-processing were 10 m, 5 m and 12 m, respectively.
The polarimetric features extracted for the classification in this paper were similar to those outlined by Khosravi et al. (2017). These features and their symbols are listed in Table 4.
Generally, each of the above features may be useful in distinguishing between certain land-cover classes and less useful in distinguishing between others. Thus, all these features were stacked together in this work to obtain high efficiency and accuracy in distinguishing between all land covers. In total, 48 polarimetric features were extracted from each data set.

Assessment methods
For accuracy assessment, the kappa coefficient and overall accuracy (OA) were applied. They can be computed as follows (Congalton & Green, 2019):

$$OA = \frac{1}{n}\sum_{i=1}^{k} n_{ii}$$

$$\kappa = \frac{n\sum_{i=1}^{k} n_{ii} - \sum_{i=1}^{k} n_{i+}\,n_{+i}}{n^{2} - \sum_{i=1}^{k} n_{i+}\,n_{+i}}$$

where $n_{ii}$ is the diagonal element of the confusion matrix for the $i$th class, $i \in \{1, 2, \ldots, k\}$, $k$ is the number of classes, $n$ is the total number of samples, $n_{i+}$ is the number of samples classified as class $i$ (row total) and $n_{+j}$ is the number of reference samples of class $j$ (column total).
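Both metrics follow directly from the confusion matrix; a minimal sketch is given below (the 2 × 2 matrix is an invented example, not a result from this paper):

```python
import numpy as np

def overall_accuracy(cm):
    """OA = sum of the diagonal / total number of samples."""
    return np.trace(cm) / cm.sum()

def kappa(cm):
    """Kappa = (p_o - p_e) / (1 - p_e), with chance agreement p_e from the marginals."""
    n = cm.sum()
    p_o = np.trace(cm) / n
    p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n ** 2
    return (p_o - p_e) / (1.0 - p_e)

cm = np.array([[50, 10],
               [5, 35]])
print(overall_accuracy(cm))  # 0.85
print(kappa(cm))             # ~0.694
```

Kappa is lower than OA here because it discounts the agreement expected from a random assignment, which is why it is often the preferred metric.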
Kappa is computed to determine whether the values in a confusion matrix are significantly better than those of a random assignment. OA is another general metric for accuracy assessment; however, the kappa coefficient is often preferred to OA (Negri et al., 2016). To evaluate the significance of the differences between the method results, a two-sided test, i.e., McNemar's test, was employed. Its statistic is calculated as follows:

$$z = \frac{c_{12} - c_{21}}{\sqrt{c_{12} + c_{21}}}$$

where $c_{12}$ is the number of pixels correctly classified by the first method but misclassified by the second method, and vice versa for $c_{21}$. At the 5% significance level, the difference can be considered statistically significant if $|z| \geq 1.96$ (Braun et al., 2011; Kavzoglu et al., 2015). A radial basis function kernel was used for all the methods. Before classification, its parameters were optimized using a grid search, based on a five-fold cross-validation. The search range for $C$ was $[10^{-5}, 10^{5}]$ with a multiplicative step of 10, and the range for the RBF width parameter ($\sigma$) was $[2^{-5}, 2^{5}]$ with a multiplicative step of 2.
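The test statistic can be computed in a few lines (the discordant pixel counts below are invented values for illustration):

```python
import math

def mcnemar_z(c12, c21):
    """z = (c12 - c21) / sqrt(c12 + c21), with c12/c21 counting the discordant pixels."""
    return (c12 - c21) / math.sqrt(c12 + c21)

# Example: 40 pixels correct only under method 1, 20 correct only under method 2.
z = mcnemar_z(40, 20)
print(abs(z) >= 1.96)  # True: significant at the 5% level
```

Only the discordant pixels (classified correctly by one method and not the other) enter the statistic; pixels on which both methods agree carry no information about which method is better.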
For the experiment, we considered four training sizes, i.e., 5%, 10%, 50% and 90% of all samples; the remainder were used for quality control. For each size, the training samples were selected randomly from all samples in each of 10 runs.
The training and test samples were normalized before the implementation of the methods. The average of the kappa coefficients and the standard deviation (STD) of the OA values over the 10 runs were then calculated for each training size. In addition, the average kappa and the STD of the OA across training sizes were computed (see Table 5). Moreover, the |z| values of McNemar's test, comparing the SVM results with those of the other methods, are provided in Table 6.
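The sampling-and-normalization protocol above can be sketched as follows; the synthetic feature matrix, the seed and the classifier placeholder are all assumptions standing in for the real PolSAR data and kernel methods:

```python
import numpy as np

def random_split(n_samples, train_frac, rng):
    """One random draw of training/test indices for a given training size."""
    idx = rng.permutation(n_samples)
    n_tr = int(train_frac * n_samples)
    return idx[:n_tr], idx[n_tr:]

def normalize(X_train, X_test):
    """Standardize both sets using the training statistics only."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    return (X_train - mu) / sd, (X_test - mu) / sd

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 48))  # stand-in for the 48 polarimetric features

for frac in (0.05, 0.10, 0.50, 0.90):  # the four training sizes
    for run in range(10):              # 10 random draws per size
        tr, te = random_split(len(X), frac, rng)
        X_tr, X_te = normalize(X[tr], X[te])
        # <train each kernel method on X_tr and evaluate kappa/OA on X_te here>
```

Normalizing with the training statistics only, as here, avoids leaking test-set information into the model.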

Analysis and discussion
As shown in Table 5, the kappa values of the IVM and RVM were higher than those of the SVM and LSSVM for the Flevoland data set, both on average and at the individual training sizes. However, the STD values of the LSSVM and SVM over the 10 runs at each training size were lower than those of their competitors, and their STD values across the different training sizes were also lower. These results implied a higher stability of the non-probabilistic kernel methods, compared with the probabilistic versions, in classifying the Flevoland data set. In fact, the non-probabilistic kernel methods were less sensitive to the training size for this data set. Nevertheless, as the number of training samples increased, the stability of the SVM and LSSVM decreased, whereas that of the RVM and IVM increased.
For the Foulum data set, the kappa values of the RVM and IVM were higher than those of the SVM and LSSVM at the smaller training sizes. However, as the training size increased, the kappa values of the SVM exceeded those of its competitors. In addition, the mean kappa value of the SVM across training sizes was higher than that of the other three methods. Although the kappa value of the LSSVM was lower than that of the IVM, it was approximately equal to, or in some cases even better than, that of the RVM for this data set. The stability of the SVM and LSSVM over the 10 runs at each training size was higher than that of the two probabilistic methods. However, the stability of the RVM and IVM across training sizes was higher than that of the SVM for the Foulum data set.
In the case of the Winnipeg data set, the SVM had higher kappa values than the RVM and IVM at almost all training sizes. In addition, the mean kappa value of the SVM across training sizes was higher than those of the RVM and IVM. The stability of all the methods across training sizes was almost the same for this data set, indicating that all the methods had an equal sensitivity to the training size in the case of the Winnipeg data set. At each training size, the stability of the SVM and the LSSVM over the 10 runs was again higher than that of their competitors. Unlike for the first data set, the SVM and LSSVM became more stable as the training size grew, and vice versa for the probabilistic kernel methods. Table 6 confirmed that all the differences between the SVM results and those of the other methods were significant for all three PolSAR data sets. Moreover, Figure 2 illustrates the total time taken for training and testing all the methods (on a log10 scale) at the different training sizes. This figure clearly indicates that the LSSVM performed much faster than the other methods, even the closely related SVM, in classifying all three data sets. This results from the use of the least squares idea in the LSSVM algorithm, which accelerates the solution of the optimization problem.

Table 5. The average kappa coefficient and the STD of OA (in parentheses) of the kernel methods over 10 runs. The last column indicates the average kappa and the STD of OA across training sizes.
In the case of all three data sets, with a training size of 5%, the RVM was faster than the SVM. However, as the training size increased, the speed of the RVM decreased; for instance, at a size of 90%, it was around 15 times slower than the SVM. The main reason lies in the training and testing phases of these two kernel methods. The RVM usually had a longer training phase than the SVM, but a shorter testing phase, owing to its sparser set of kernel basis functions. For a smaller training size, i.e., a bigger test size, the total time of the RVM was consequently dominated by the test time, so it ran much faster. However, when the training size was bigger, i.e., the test size was smaller, its total time was dominated by the training time, and it then ran slower than the SVM. Unlike the RVM, the IVM took less time than both the RVM and the SVM across almost all training sizes for the classification of the three PolSAR data sets.

Conclusion
In this paper, four kernel-based methods, namely two non-probabilistic versions, i.e., the SVM and LSSVM, and two probabilistic versions, i.e., the RVM and IVM were compared in relation to the classification of L-band PolSAR data. Three PolSAR data sets with cropland and non-cropland classes were examined for this analysis. In addition, several different sizes of training samples were used for the experiment.
This paper substantiated that the SVM, as a non-probabilistic kernel method, was more efficient, more accurate and more stable than its probabilistic competitors, i.e., the RVM and IVM, in classifying two of the three data sets. Moreover, the LSSVM had an efficiency and accuracy comparable with those of the probabilistic kernel methods. This conclusion contrasts with the results reported by Braun et al. (2011) and Braun et al. (2012), which indicated that the RVM and IVM were more efficient than the SVM for hyperspectral data classification.
Another conclusion drawn from this paper concerned the remarkable speed of the LSSVM compared with the other kernel methods. The LSSVM obtained almost comparable accuracy at a speed around 12 times faster than the associated SVM and around 15 times faster than the probabilistic kernel methods, i.e., the RVM and IVM. Therefore, the LSSVM can be considered a qualified successor to the SVM in the classification of PolSAR data, offering suitable accuracy at greater speed.
Finally, it should be taken into account that the presented results are valid only for L-band PolSAR data sets. Examining these kernel methods for the classification of PolSAR data sets acquired in other frequency bands is therefore suggested as future work, towards a more general conclusion.