Ensemble of features for efficient classification of high-resolution remote sensing image

ABSTRACT Feature extraction is one of the important steps in the classification of high-resolution remote sensing images. A good feature set results in an efficient classification process. The recent trend is to extract features from the image using neural networks with no human intervention. Our approach uses a deep convolutional neural network (CNN) for extracting deep features. To further raise the efficiency of the extracted features, the proposed system combines the deep features with other features, namely Gabor features and novel reformed local binary pattern (RLBP) features. The combined features are then used to classify the image. The proposed system introduces two novel ideas in its feature extraction implementation, namely (1) initialisation of the filter values for the CNN and (2) a change in the local binary pattern feature extraction process. Experiments are carried out on the LISS IV Madurai image, and an evaluation is performed to verify the results. The proposed system is found to produce good results compared with other existing methods.


Introduction
The remote sensing (RS) image contains a lot of hidden details, such as land cover and land mapping of the earth's surface. Classifying each pixel in the RS image thus helps reveal various useful details, and the feature extraction process plays an important role in this classification. Many feature extraction techniques are applied before classifying RS images. In earlier days, classification depended on low-level feature extraction processes which extract shape, context or local features, based on methods such as the histogram of oriented gradients and the scale-invariant feature transform (SIFT) (Fan et al., 2015; Yang & Newsam, 2010). Developments in technology slowly drove the feature extraction process from manually selected features to automatic feature learning by the so-called deep learning techniques (LeCun et al., 2015).
As another step towards improvement in feature learning, the perceptron was introduced. Zhouhan Lin et al. classified images using a model that utilised autoencoders for extracting features in an unsupervised manner (Lin et al., 2013). Stacked autoencoders were also used to extract features from RS images, and those features were used for classification. At some point, the perceptron was replaced by deep learning concepts because of its limitations, such as disregarding spatial information and the need for many parameters. Meanwhile, the CNN evolved as one of the most efficient neural networks for feature extraction because of its various benefits (Yue et al., 2015; Zhang et al., 2016b; Zhao et al., 2015). The main benefit of the CNN compared with its predecessors is its automatic feature detection capability without any human direction. The network is computationally efficient through the use of convolution and pooling operations, and it allows parameter sharing, which reduces the number of weights to be learned. The CNN is used in many applications, namely image classification (He et al., 2015; Krizhevsky et al., 2017), target detection (Ouyang et al., 2015; Zhang et al., 2014), face recognition (Sun et al., 2014), pedestrian detection (Ouyang & Wang, 2013), hyperspectral image (HSI) denoising (Shi et al., 2021b), change detection in RS images (Shi et al., 2021a), building footprint extraction (Guo et al., 2021), RS image classification and so on.
There are many different CNN architectures proposed at different times. In 1998, LeNet was invented by Yann LeCun for optical character recognition (LeCun et al., 1998). The architecture was tiny and simple; although it was a major breakthrough at the time, it does not perform well on RGB images. After a decade's gap, AlexNet was introduced in 2012 (Krizhevsky et al., 2017). It consists of eight layers with the ReLU activation function, but compared with later models it needed more training time to attain high accuracy. VGG was then introduced to show the effect of deep architectures on accuracy for image classification and recognition tasks (Simonyan & Zisserman, 2015). The network groups multiple convolution layers with small kernels, reducing the number of features at the output; it comprises multiple 3 × 3 convolution layers, ReLU, 2 × 2 max-pooling layers, fully connected (FC) layers and a softmax layer. Although it had many advantages, it suffered from the vanishing gradient problem and from huge computational requirements in terms of both memory and time; the large width of its convolutional layers also makes it inefficient. GoogLeNet/Inception therefore devised the inception module, which approximates a sparse CNN with a normal dense construction (Szegedy et al., 2015). In the following year, ResNet was introduced as a very deep, 152-layer network (He et al., 2016). Its key idea is a shortcut that bypasses two or more stacked convolutional layers, whose output is added to the output of the stacked convolutions. Some of these ResNet models help in image enhancement and classification.
Transfer learning concepts are applied for the CNN-based classification of RS images. For instance, scene classification for different datasets (Pires de Lima & Marfurt, 2020; Yosinski et al., 2014) and high-resolution RS scene classification (Hu et al., 2015) were done through transfer learning, which increases the classification accuracy of the CNN. The main advantage is that pre-training a model on a larger and more general dataset improves its performance on smaller datasets in comparison with the same model trained with randomly initialised weights. Meanwhile, many neural network architectures were introduced for various purposes. A system for scene classification was introduced using SceneNet, where the network architecture coding and searching were achieved using an evolutionary algorithm (Ma et al., 2021). Another system with multi-view deep neural networks was introduced for HSI classification using deep neural network-based relevant latent representation learning; it uses both autoencoders and a semi-supervised graph convolutional network (Sellami & Tabbone, 2022). Similarly, a multitask deep learning method was introduced for HSI classification with unknown classes. A semantic-information-modulated deep subpixel mapping network was also developed for the mixed pixel problem in RS images (He et al., 2020).
In the system proposed by Hu et al. (2015), a five-layer deep CNN was used to abstract the features of HSI from a spectral perspective, leading to improved classification performance. The work by Yue combines principal component analysis (PCA), a deep convolutional neural network (DCNN) and logistic regression for RS image classification. Another combination with a CNN relied on a blend of morphological profiles for feature extraction from HSI (Aptoula et al., 2016). A further method combines a neural network with object-oriented classification for RS images, where objects are detected using a multi-resolution segmentation algorithm (Zhao et al., 2017).
Some works combine the deep features obtained from a CNN with other features. Features from a DCNN were combined with Gabor features and tested on the input image (Chen et al., 2017). As a next step, deep features were combined with LBP features (Juefei-Xu et al., 2017). Deep features have also been combined with SIFT features for image retrieval (Yue et al., 2018). Another work (Tan et al., 2019) combines grey-level co-occurrence matrix features with deep features for image classification. Is it enough to consider only the deep features, or to combine them with other features as in these existing methods? There is still room for accuracy improvement: even though these works perform classification with deep features, they leave a gap by not considering other important characteristics such as texture, edge, scale and orientation within the same model. When these characteristics are considered, the classification accuracy increases, and our system tries to fill this research gap.
In the proposed method, the system mainly concentrates on the extraction and combination of different features, namely deep features, Gabor features and reformed local binary pattern (RLBP) features. The deep feature extraction is performed using a DCNN. Since our system deals with RS images, the texture details of the image cannot be neglected, so texture features are extracted using the RLBP and Gabor methods. The Gabor features also enhance the resistance of the deep features to orientation and scale changes. In the literature, the Gabor feature has been used as the input of a CNN, as has the LBP feature. Our proposed system, however, claims that the performance of the CNN can be greatly enhanced by a two-stage consideration of textural features based on both Gabor and RLBP features. The experimental results show that the combination of features in our proposed feature extraction method is best suited to our dataset. The major steps of our system are as follows:
• The proposed system pre-processes the input image with an image enhancement technique, so the image is enhanced and its clarity improved before it is sent to the next step.
• The proposed system uses multiple fused features for efficient classification. The Gabor and RLBP features are extracted, combined and given as input to the DCNN, so the feature set obtained consists of Gabor features, RLBP features and deep features.
• The extracted features are then classified using the CNN, and the results are obtained.
The rest of the paper is arranged as follows. The earlier works associated with our system are reviewed in Section 2. The proposed methodology is described in Section 3. The experimental results and a discussion of the system's performance are given in Section 4, and finally, the conclusion is provided in Section 5.

Related work
Certain earlier methods related to our proposed system are discussed in this section. For the extraction of spatial features, the Gabor filter was applied in different directions and at different scales (Mirzapour & Ghassemian, 2015). Similarly, the LBP method was used to extract spatial features (Song et al., 2010). Both methods demonstrated an improved impact on the classification accuracy of the input images. However, the main drawback of these shallow spatial features was diminished accuracy on complex scenes and diverse land covers. To overcome this issue, the CNN was employed.
Some proposed systems introduced the method of integrating spectral features with shallow spatial features and used this combination as input for deep neural networks. Gabor features with different scales and directional changes were combined with PCA spectral features as input to a deep model (Chen et al., 2017). Another study proposed combining the features extracted from Gabor filters with deep features from a CNN; even with limited training samples, the CNN was able to extract features and update its weights (Chen et al., 2017; Luan et al., 2018). The classification results of these previous works improved when Gabor features were added to the spectral features. The work by Hattikatti (2017) combined LBP texture features with the deep features extracted by a CNN for image classification, which also improved the classification accuracy. In our proposed method, the extracted deep features are combined with other features, namely Gabor and RLBP features, to improve the classification accuracy.

Proposed work
Overview
RS plays a major role in trending research due to its wide application areas. Classification of the RS image reveals a lot of useful information, so the proposed system aims to produce the best possible classification of the RS image. This section is separated into the following sub-sections: (1) image enhancement, (2) feature extraction and (3) classification. Feature extraction is a crucial step in classifying RS images. If the RS input image is enhanced using an enhancement technique prior to extraction, it yields even greater classification accuracy (Sharma et al., 2018). The main idea of the proposed system is to create multiple sets of features using different methodologies and combine them into a single feature set. Since the features extracted from the DCNN are combined with extra features, the classification accuracy is enhanced. The architecture of our proposed system is shown in Figure 1.

Image enhancement
There is a considerable difference between an image that is classified directly and one that is classified after enhancement, so enhancing the image before classification plays a major role. Many methods exist for enhancing an image. An earlier method applies a bilateral filter to obtain the detail image (Kaplan et al., 2017), but our proposed system tries out a more effective method: it applies a combination of filters, namely a Gaussian filter, a bilateral filter and a median filter. The three main objectives of the image enhancement are to (1) increase contrast, (2) emphasise edges and (3) preserve colour. The input RS image is decomposed into L detail images across L detail layers. To obtain the detail image, the first step is to construct the approximation image using Equation (1):

A_1 = FILT(I) (1)

Here, A_1 represents the approximation image and I the original input image. The function FILT represents the filters applied. Using the approximation image, the detail layer is obtained as the difference between the input and the approximation image, as given in Equation (2):

D_1 = I − A_1 (2)

The same idea is applied recursively for higher-level decomposition, with the same filters applied to the approximation layers. The higher-level approximation and detail images are obtained using Equations (3) and (4):

A_l = FILT(A_{l−1}) (3)
D_l = A_{l−1} − A_l (4)

The enhanced image is obtained by adding the final-level approximation layer to every detail layer output, with the detail layers weighted to give them additional importance. The enhanced image is formulated as in Equation (5):

E = A_L + Σ_{l=1}^{L} w_l · D_l (5)

The applied weight w_l is set to 2.
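The decomposition and recombination described by Equations (1)-(5) can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: FILT is approximated here with a Gaussian followed by a median filter, and the bilateral filter the paper also uses is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def enhance(image, levels=3, weight=2.0):
    """Detail-layer enhancement sketch following Eqs. (1)-(5)."""
    approx = image.astype(float)           # A_0 = I
    details = []
    for _ in range(levels):
        # FILT approximated by Gaussian + median smoothing
        filtered = median_filter(gaussian_filter(approx, sigma=1.0), size=3)
        details.append(approx - filtered)  # D_l = A_{l-1} - A_l
        approx = filtered                  # A_l feeds the next level
    # E = A_L + sum_l w_l * D_l, with w_l = 2 as in the paper
    return approx + weight * sum(details)

img = np.random.rand(16, 16)
out = enhance(img)
```

With a constant weight w_l = 2 the result simplifies to 2I − A_L, i.e. the original image plus an amplified copy of all the detail it lost during smoothing.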

Feature extraction
The proposed method uses three steps for extracting features, namely (i) feature extraction using the Gabor filter, (ii) feature extraction using the RLBP method and (iii) deep feature extraction using the DCNN. The methods are explained below.

Feature extraction using Gabor filter
The Gabor filter is a multi-scale and multi-directional filter used to capture the frequency and orientation representation of the spatial domain. To extract informative features, different frequencies and directions are applied. The Gabor filter parameters f and θ indicate frequency and orientation, respectively, and the filter is defined using a Gaussian kernel function. It analyses a specific frequency content in a particular direction within a localised area, i.e. around a point or region. The two-dimensional Gabor filter is given by Equations (6) and (7):

G_c(i, j) = B · exp(−(i² + j²)/(2σ²)) · cos(2πf(i·cosθ + j·sinθ)) (6)
G_s(i, j) = C · exp(−(i² + j²)/(2σ²)) · sin(2πf(i·cosθ + j·sinθ)) (7)

Here, G_c and G_s are the Gaussian functions modulated by the cosine and sine functions, respectively, and B and C denote the normalising factors. The texture variation in a particular direction is obtained by varying θ. Also, i and j represent the vertical and horizontal dimensions of the Gaussian kernel, which is convolved over the image, and σ represents the standard deviation of the Gaussian distribution.
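A pair of Gabor kernels in the form of Equations (6) and (7) can be generated as below. This is an illustrative sketch: the normalising factors B and C are folded into the Gaussian term (set to 1), since the paper does not specify them.

```python
import numpy as np

def gabor_kernel(size, f, theta, sigma):
    """Even (cosine) and odd (sine) Gabor kernels, cf. Eqs. (6)-(7)."""
    half = size // 2
    # i runs over rows (vertical), j over columns (horizontal)
    j, i = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    gauss = np.exp(-(i**2 + j**2) / (2.0 * sigma**2))
    arg = 2.0 * np.pi * f * (i * np.cos(theta) + j * np.sin(theta))
    return gauss * np.cos(arg), gauss * np.sin(arg)

# One kernel pair at the frequency and first orientation used in the paper
gc, gs = gabor_kernel(7, f=0.2, theta=0.0, sigma=2.0)
```

Convolving the image with such kernel pairs at the four orientations θ = 0°, 45°, 90° and 135° yields the Gabor feature maps.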

Feature extraction using RLBP method
The proposed system also extracts features using a new method which works similarly to the LBP method of Hattikatti (2017). The feature is computed as the element-wise product of two matrices, namely the binary pattern matrix (BNPM) and the reference matrix (RM); the resultant matrix is shown in Figure 2(d). Unlike LBP, RLBP differs in the construction of the BNPM matrix by considering the mean value of the pixels. In LBP, the central pixel is directly used as the comparison threshold, which makes it sensitive to noise (Sudha Sree, 2015). The proposed system therefore considers the neighbouring pixels in addition to the central pixel, so that texture feature extraction becomes more accurate for RS images.
The values present in Figure 2(d) are added together to obtain a single value, which represents the feature value for that particular pixel. Figure 2 shows a sample calculation for a single pixel and its neighbours. The same calculation is performed over the whole image.
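The per-pixel RLBP computation can be sketched as follows. The reference matrix RM is assumed here to hold the usual LBP powers of two (the paper's exact RM layout in Figure 2 is not reproduced), and the threshold is the mean of the full 3 × 3 neighbourhood rather than the centre pixel, as the text describes.

```python
import numpy as np

# Assumed reference matrix: one power of two per neighbour, centre unused.
RM = np.array([[1,   2,  4],
               [128, 0,  8],
               [64, 32, 16]])

def rlbp_value(patch):
    """RLBP code for one 3x3 patch: threshold against the patch mean
    (not the centre pixel, as plain LBP would)."""
    bnpm = (patch >= patch.mean()).astype(int)  # binary pattern matrix
    bnpm[1, 1] = 0                              # centre carries no bit
    return int((bnpm * RM).sum())               # sum of element-wise product

patch = np.array([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90]])
code = rlbp_value(patch)  # mean = 50; bits set for 60, 70, 80, 90
```

Sliding this over every pixel of the image yields the RLBP feature map.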

Deep feature extraction using DCNN
Convolutional neural network. The deep features are extracted using a CNN, a deep neural network that can effectively extract deep features from an image without human intervention. A CNN usually initialises its filter values randomly. The first layer of the CNN is convolution; its purpose is to extract features from the input RS image and obtain output feature maps by applying the convolution operation. The input image is convolved with a matrix consisting of filter values, and each position on the input yields an output value after convolution. The resultant matrix is referred to as a convolutional feature map, and the filters help in extracting the necessary details from the image. After convolution, pooling is performed. The purpose of the pooling layer is to reduce the number of parameters and compress the data while preserving important features; it reduces the dimensionality of the convolved features. Pooling comes in different types, namely max, average, sum and so on; our system uses max pooling. Our system increases the number of CNN layers to extract more discriminative features.
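The convolution, ReLU and max-pooling steps described above can be sketched in plain numpy. This is a didactic illustration of the operations, not the paper's seven-layer network.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most CNNs)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = (image[r:r + kh, c:c + kw] * kernel).sum()
    return out

def max_pool(feature, size=2, stride=2):
    """Downsample a feature map by taking the max over each window."""
    h = (feature.shape[0] - size) // stride + 1
    w = (feature.shape[1] - size) // stride + 1
    return np.array([[feature[r * stride:r * stride + size,
                              c * stride:c * stride + size].max()
                      for c in range(w)] for r in range(h)])

img = np.arange(36, dtype=float).reshape(6, 6)
fmap = np.maximum(conv2d(img, np.ones((3, 3)) / 9.0), 0)  # conv + ReLU
pooled = max_pool(fmap)                                   # 4x4 -> 2x2
```

Stacking such conv/ReLU/pool triplets gives the layer pattern the DCNN in this paper repeats seven times.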

Initialising filters.
A CNN usually runs with randomly initialised filter values. The proposed system instead introduces a novelty that improves the efficiency of the overall system by manually initialising the filter values. Filters such as mean, disc, Laplacian-of-Gaussian (log), Prewitt (for both vertical and horizontal edges), Gaussian, Laplacian, average, Sobel and Scharr (for both vertical and horizontal edges) are used in our DCNN.
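Such an initialisation could be sketched by stacking classical hand-crafted kernels into the first-layer weight tensor. Only a few of the filters named above are shown, and the exact bank and ordering the authors use is not given, so this listing is illustrative.

```python
import numpy as np

# Classical 3x3 kernels (vertical variants; transposes give horizontal).
SOBEL_V   = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
PREWITT_V = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
AVERAGE   = np.full((3, 3), 1.0 / 9.0)

def init_filters():
    """Stack hand-crafted kernels into a (n_filters, 3, 3) weight tensor,
    instead of drawing the first-layer weights at random."""
    bank = [SOBEL_V, SOBEL_V.T,      # vertical + horizontal Sobel
            PREWITT_V, PREWITT_V.T,  # vertical + horizontal Prewitt
            LAPLACIAN, AVERAGE]
    return np.stack(bank)

weights = init_filters()
```

The resulting tensor would replace the random first-layer weights of the DCNN before training begins.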
Deep feature extraction using DCNN. The proposed system feeds the extracted Gabor features and RLBP features as input to the DCNN. The combined input features are fed into the first convolutional layer. The DCNN consists of seven layers, each comprising a convolutional layer followed by a ReLU layer and then a pooling layer. The DCNN model is shown in Figure 3. The final pooling result represents the deep feature.

Combined feature set
The proposed system provides the Gabor and RLBP features as input to the DCNN model, so the feature extraction process results in a combined feature set with the benefits of the deep, Gabor and RLBP features. This feature set is sent directly to the classification step. A schematic view of the feature combination is shown in Figure 4.

Classification
The classification of the RS image is done using the feature set obtained. The proposed system does not apply a separate classification technique; rather, it classifies the RS image using the CNN itself. The DCNN architecture contains three classification layers, namely the FC layer, the softmax layer and the classification layer. The FC layer is used as a bridge between the layers preceding and following it, connecting every neuron in one layer to every neuron in the other. Next is the softmax layer, whose purpose is to assign a probability value to each class; the number of neurons in the softmax layer equals the number of neurons in the output layer. Our system uses a Gaussian activation function in this layer, represented in Equation (8). The final layer of the network is the output layer, where the neuron with the highest activation gives the model's predicted class. The results of the proposed system are discussed in detail in Section 4.
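The probability-assignment and prediction steps can be sketched as below. A standard softmax is shown; the Gaussian activation variant the paper mentions is not specified in enough detail to reproduce, and the logit values are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: turns class scores into probabilities."""
    z = logits - logits.max()   # subtract max to avoid overflow
    e = np.exp(z)
    return e / e.sum()

# Hypothetical FC-layer outputs for the five land-cover classes
# (urban, vegetation, water body, waste land, saline land).
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])
probs = softmax(logits)
predicted = int(np.argmax(probs))  # neuron with the highest activation
```

The class index with the highest probability is the pixel's predicted land-cover label.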

Dataset description
The effectiveness of the proposed system is demonstrated using the LISS IV image of the Madurai region, shown for reference in Figure 5. The Indian RS Satellite P6 (IRS P6) is utilised for LISS IV data acquisition. The spatial resolution of the LISS IV image is 5.8 m. The spectral range of the red band is 0.62-0.68 μm, the green band 0.52-0.59 μm and the near-infrared band 0.76-0.86 μm. The input image shows the city view of Madurai, located in the Tamil Nadu state of India. The land cover of the city is approximately 23.5 km × 23.5 km, with longitude from 78°04′47″E to 78°11′23″E and latitude from 9°50′59″N to 9°57′36″N. The land cover mainly includes urban area, vegetation, saline land, waste land and water area.

Architecture of the proposed system
Our proposed system is implemented using MATLAB 2020a. The implementation runs on Windows 10 with an Intel(R) Core(TM) i5 1.80 GHz CPU and 8 GB RAM. As the first step, the image is enhanced, and then the Gabor and RLBP features are extracted. The Gabor filter is implemented with a frequency of 0.2 and four orientations, θ = 0°, 45°, 90° and 135°. The Gabor and RLBP features are given as input to the DCNN model, and the deep features are then extracted using the DCNN with the pre-defined filter values. The layers of the CNN are convolution, ReLU, max pooling, FC, softmax and classification; the CNN contains seven layers. The convolution weight matrix is of size 3 × 3 pixels with stride and padding of 1. The pooling layer uses a filter of size 2 × 2 pixels with stride 1 and padding 2, and is then followed by the ReLU and classification layers. The DCNN specification is shown in Table 1 along with its output feature sizes, and the options for training the deep neural network are given in Table 2. The deep features are thus obtained using the DCNN. In the DCNN, the train set refers to the subset of samples used for training and the test set to the subset used for testing the CNN model. Here, the training samples are of size 7 × 7 pixels; 250 samples (50 per class) are considered for training and 500 samples for testing, selected by random sampling. The 500 sample pixels chosen comprise 131, 216, 66, 59 and 4 pixels of urban, vegetation, water body, waste land and saline land samples, respectively. These 500 ground control points (GCPs) were used to build the reference data set for the assessment, using an eTrex Venture global positioning system device. Some of the GCPs were collected during the field survey of Madurai city.
To analyse the performance of the extracted features, different combinations of features are tested with the test samples.

Experimental results
The input image is passed through the combination of filters to produce the enhanced version of the original image. The bilateral filter removes noise while preserving edges, and the median filter also removes noise. Classification improves when it operates on the enhanced input. The input image is decomposed into L layers of detail images; the proposed system decomposes to 1000 layers, i.e. L = 1000, and the value of w_l in Equation (5) is 2. The results of applying the image enhancement technique to the input image are shown in Figure 6.
After the image enhancement, the pre-processed image is sent to the feature extraction process. As a first step, the Gabor filter is used for extracting features; it is implemented with a frequency of 0.2 and four orientations, θ = 0°, 45°, 90° and 135°. Likewise, the RLBP is used for extracting features. Both of these feature sets are sent as input to the DCNN for the extraction of deep features. The depth of the CNN was chosen after several trial runs, the results of which are shown as a graph in Figure 7. It is evident from the graph that as the layer depth increases, the classification accuracy also increases, but at some point it reaches a peak value and then starts to decline. The highest classification accuracy is obtained with seven layers in the DCNN, so the proposed work uses a seven-layer CNN architecture in its implementation, following the layer specification given in Table 1. The number of epochs must also be chosen properly for training the DCNN, since the classification accuracy varies with it. Trial runs with different numbers of epochs and their corresponding classification accuracies are shown as a graph in Figure 8. At 800 epochs per run, the classification accuracy reaches its peak and then remains stable, so 800 epochs per run are recommended. Thus, the deep features are obtained with a seven-layer CNN run for 800 or more epochs. The classified image of this proposed system is shown in Figure 9(e).
The classification results for different combinations of feature sets are shown in Figure 9. The pre-processed image is given as input to the CNN model and the features are extracted; the classification results obtained are shown in Figure 9(b). When the Gabor features are combined with the deep features obtained from the CNN, a new feature set is obtained; the classification executed with those combined features is shown in Figure 9(c). Similarly, the pre-processed image is used for extracting features with the help of RLBP, and these RLBP features are combined with the deep features from the DCNN; the classification results are shown in Figure 9(d). Finally, the classification result of our proposed system is shown in Figure 9(e). To show the local details of the classified image clearly, Figure 10 zooms in on a small area of the image.

Evaluation of results
The confusion matrix is used for expressing the accuracy of the classification. The accuracy indices through which the proposed system is evaluated are user accuracy (UA), producer accuracy (PA), kappa coefficient (KC) and overall accuracy (OA). They are calculated using Equations (9)-(12):

UA = (number of correctly classified pixels of a class) / (total number of pixels classified into that class) (9)
PA = (number of correctly classified pixels of a class) / (total number of reference pixels of that class) (10)
OA = (total number of correctly classified pixels) / (total number of pixels) (11)
KC = (OA − P_e) / (1 − P_e), where P_e is the expected chance agreement (12)

Based on the above metrics, the classification evaluation is performed and the results are listed in Tables 3 and 4. The correctness of the classified image is evaluated based on a direct field visit, with 500 samples taken for examination. Table 3 presents the class-wise accuracies using different feature extraction methods, and Table 4 presents the classification accuracies of the different feature extraction methods under the different accuracy evaluation metrics. A graphical representation of the results is also given in Figure 11, from which it is clear that the proposed method beats the performance of the other methods employed. To demonstrate the effectiveness of the image enhancement in increasing the classification accuracies, the accuracies of the method with and without image enhancement are shown as a graph in Figure 12; it can be clearly interpreted that enhancement before feature extraction increases the classification accuracies. The proposed system is also compared with existing methods such as AlexNet, VGG, ResNet, Inception, CNN + Gabor and CNN + LBP. The comparison is made with respect to their overall classification accuracies and is shown in Figure 13. The graph results show that our proposed method outperforms these state-of-the-art methods.
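The four accuracy indices can be computed from a confusion matrix as below. The small 2 × 2 matrix is synthetic, purely to illustrate the formulas; it is not the paper's result.

```python
import numpy as np

def accuracy_metrics(cm):
    """UA, PA, OA and kappa from a confusion matrix whose rows are
    classified labels and whose columns are reference labels."""
    cm = cm.astype(float)
    total = cm.sum()
    diag = np.diag(cm)
    ua = diag / cm.sum(axis=1)   # correct / pixels classified into class
    pa = diag / cm.sum(axis=0)   # correct / reference pixels of class
    oa = diag.sum() / total      # overall accuracy
    # expected chance agreement P_e from row/column marginals
    pe = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return ua, pa, oa, kappa

# Illustrative 2-class confusion matrix (not from the paper)
cm = np.array([[50, 2],
               [3, 45]])
ua, pa, oa, kappa = accuracy_metrics(cm)
```

For the real evaluation the matrix would be 5 × 5, one row and column per land-cover class.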

Conclusion
The proposed system presented a new method of combining three different types of features, namely deep features, Gabor features and RLBP features. The combined feature set is found to be very efficient for classification. A small drawback of obtaining many features from three feature extraction methods is the prolonged training time of the neural network model, but the greatest advantage of the proposed system is that it succeeds in detecting even small land covers, which is evident from the classification results. In future, it can be further improved by applying different classification methodologies and feature selection/reduction techniques.

Disclosure statement
No potential conflict of interest was reported by the author(s).