A survey of remote sensing image classification based on CNNs

ABSTRACT With the development of earth observation technologies, the volume of acquired remote sensing images is increasing dramatically, and a new era of big data in remote sensing is coming. How to effectively mine these massive volumes of remote sensing data is a new challenge. Deep learning provides a new approach for analyzing these data. As one class of deep learning models, convolutional neural networks (CNNs) can directly extract features from massive amounts of imagery data and are good at exploiting the semantic features of imagery. CNNs have achieved remarkable success in computer vision. In recent years, quite a few researchers have studied remote sensing image classification using CNNs, which can be applied to realize rapid, economical and accurate analysis and feature extraction from remote sensing data. This paper aims to provide a survey of the current state of the art in applying CNN-based deep learning to remote sensing image classification. We first briefly introduce the principles and characteristics of CNNs. We then survey developments and structural improvements that make CNN models more suitable for remote sensing image classification, available datasets for remote sensing image classification, and data augmentation techniques. Next, three typical CNN application cases in remote sensing image classification are presented: scene classification, object detection and object segmentation. We also discuss the problems and challenges of CNN-based remote sensing image classification and propose corresponding measures and suggestions. We hope that this survey can facilitate the advancement of remote sensing image classification research and help remote-sensing scientists tackle classification tasks with state-of-the-art deep learning algorithms and techniques.


Introduction
With the development of earth observation technologies, an integrated space-air-ground global observation platform, consisting of satellite constellations, unmanned aerial vehicles (UAVs) and ground sensor networks, has been gradually established. Traditional RS image classification methods face two problems. First, feature design relies primarily on prior knowledge, and the designed features are often shallow (e.g., the edges or local textures of a ground object); they cannot describe the complex changes of the objects in the image. Second, the machine learning models used in classification (e.g., SVM) are shallow-structure models (Cortes & Vapnik, 1995) with weak modeling capacity, and they are often unable to sufficiently learn highly nonlinear relationships.
The emergence of deep learning (Hinton & Salakhutdinov, 2006) provides a new approach to solving these problems. Deep learning employs models with multiple hidden layers together with effective parallel learning algorithms (Chang et al., 2016). Deep learning models have more powerful abilities to express and process data and have shown excellent accuracy and precision in applications. In 2012, AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), a convolutional neural network (CNN) deep learning model, achieved remarkable accuracy in the computer vision field and won the ImageNet Challenge, a top-level competition in image classification. The CNN model is developed from ordinary neural networks; it directly extracts features from massive amounts of imagery data and abstracts the features layer by layer. It learns the boundary and color features of the objects in an image in the relatively shallow layers. As the number of network layers increases, the information in the neurons of the network is continuously combined. Eventually, the network extracts deep concepts and expresses abstract semantic features. AlexNet reduced the error rate for image classification from 25.8% to 16.4%. After that, networks that competed in the ImageNet Challenge continuously reduced the error rate. In 2015, ResNet (He, Zhang, Ren, & Sun, 2015) reduced the error rate to 3.6%, whereas the error rate of the human eye was approximately 5.1% in the same experiment. Evidently, computer accuracy in image classification has surpassed that of humans on this task. Apart from image classification (Lin, Chen, & Yan, 2013), CNNs have also achieved satisfactory accuracy in object detection (Simonyan & Zisserman, 2014) and image segmentation (Tai, Xiao, Zhang, Wang, & Weinan, 2015).
Remote sensing data are essentially digital images, but they record richer and more complex characteristics of the earth's surface. Parallel to the enormous success of CNNs in computer vision, geoscientists have discovered that CNNs can be applied in the remote sensing field for rapid, economical and accurate feature extraction. Some articles have reviewed the current state of the art of deep learning for remote sensing (Zhu, Tuia, & Mou et al., 2017; Ball, Anderson, & Chan, 2017). However, they tend to cover quite broad issues and topics in remote sensing and give limited treatment to RS image classification, which plays a key role in earth science, e.g., in land cover classification, scene interpretation and monitoring of the earth's surface. Quite a few researchers have studied RS image classification based on CNN models in recent years. Systematically analyzing and summarizing these studies is desirable and significant for advancing deep learning in remote sensing. Thus, this article focuses on surveying CNN-based RS image classification. We hope that our work is helpful for remote-sensing scientists getting involved in CNN-based RS image classification. In the following sections, the principles of CNNs are introduced. Then, based on an extensive literature survey, studies of CNN model improvements and CNN training data for RS image classification are systematically analyzed, and CNN application cases in scene classification, object detection and object segmentation are presented and summarized. Finally, the problems and challenges of CNN-based RS image classification are elaborated, and inspiration for addressing these challenges is drawn.

Convolutional neural network (CNN)
CNNs, as one type of deep learning network, have the following advantages over shallow-structure models: (1) CNNs directly apply convolution operations to the pixels of an image to extract abstract features. This feature extraction can be applied to various scenarios and has a powerful generalization ability. (2) CNNs are able to represent image information in a distributed manner and rapidly acquire image information from massive volumes of data. The structure of CNNs can effectively handle complex nonlinear problems (e.g., the rotation and translation of an image). (3) CNNs are characterized by sparse connections, weight sharing and spatial subsampling, which result in a simpler network structure that is more adaptable to image structures. To better understand CNN-based image classification, this section briefly introduces the structure of CNNs and their training method, followed by several popular CNN models from the computer vision field.

Structure of CNN
CNNs are multilayer perceptrons that are specially designed to identify two-dimensional (2D) shapes and can be used to establish a mapping from the original input to the desired output. In a CNN, each neuron is connected to the neurons in a local area of the previous layer of the network, thereby reducing the number of weights in the network. Similar to ordinary neural networks, a hierarchical connection structure is also used in CNNs. In other words, a CNN consists of components stacked layer by layer: convolutional, pooling and fully connected layers and an output layer, as shown in Figure 1. In a typical CNN, the convolutional and pooling layers alternate as the first few layers, followed by the fully connected layers. The final output layer generates the classification results.

Convolutional layer
A convolutional layer is the basic layer of a CNN. The convolution operation works on a small local area of the image with a convolutional kernel of a certain size. The convolutional kernel is a learnable weight matrix. The output of the convolutional layer goes through an activation function to obtain a convolved feature map. The feature map can be the input of a subsequent convolutional layer; therefore, more sophisticated features can be extracted as several convolutional layers are stacked. Moreover, the neurons in each feature map share the weights of a convolutional kernel in a convolution operation, which ensures that the number of parameters in the network does not increase significantly even if the number of convolutional layers continues to increase, thereby reducing the storage requirement for the model. Consequently, this design facilitates the establishment of a deeper network structure.
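As a minimal illustration of the convolution operation described above, the following NumPy sketch slides a small weight matrix over a toy single-channel image (the 4 × 4 image and the horizontal-difference kernel are hypothetical examples, not from any dataset in this survey); like most deep learning frameworks, it actually computes cross-correlation:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in most deep
    learning frameworks) of a single-channel image with a kernel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # weighted sum over one local patch of the image
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)   # toy 4x4 "image"
edge = np.array([[1.0, -1.0]])        # horizontal-difference kernel
fmap = conv2d(img, edge)              # 4x3 feature map
```

Because the same kernel weights are reused at every spatial position, the parameter count is the kernel size, not the image size, which is the weight-sharing property noted above.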

Pooling layer
A pooling layer generally comes after a convolutional layer. Common pooling operations include maximum, average and random pooling. Maximum pooling and average pooling take the maximum and average values of the neighborhood neurons, respectively, and random pooling selects values for the neurons according to a certain probability. There are other forms of pooling, often built on these general forms, including overlapping pooling and spatial pyramid pooling. Regardless of which form is used, a pooling layer aims to capture features while being insensitive to their precise locations, which ensures that the network can still learn effective features even if a small shift occurs in the input data. In addition, a pooling layer does not alter the number of feature maps of the previous layer, but it reduces the spatial dimensionality of the feature maps while preserving their important information, thereby further reducing the computation of network training.
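The max-pooling operation described above can be sketched as follows (the 4 × 4 feature map is a made-up example):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    """Non-overlapping max pooling over a single-channel feature map."""
    h = (fmap.shape[0] - size) // stride + 1
    w = (fmap.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = patch.max()   # keep only the strongest response
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 6., 1., 1.],
              [0., 2., 9., 8.],
              [1., 1., 7., 5.]])
pooled = max_pool(x)   # 2x2 output: spatial size halved
```

Note that a small shift of the peak value within a 2 × 2 window leaves the pooled output unchanged, which is the location insensitivity mentioned above.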

Fully connected layer and output layer
A fully connected layer is composed of multiple hidden layers. Each hidden layer contains multiple neurons, and each neuron is fully interconnected with the neurons of the subsequent layer. One-dimensional (1D) feature vectors obtained by flattening feature maps after operations in the convolutional and pooling layers are used as the input for a fully connected layer. The objective of a fully connected layer is to map these features into a linearly separable space and coordinate with the output layer in classification. The output layer primarily uses a classification function to output the classification results. Currently, the Softmax function (Liu, Wen, Yu, & Yang, 2016) and SVMs are the common classification functions used in CNNs.
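The flattening-plus-classification step above can be sketched with a single fully connected layer feeding a Softmax output; the dimensions (8 features, 3 classes) and random weights are purely illustrative:

```python
import numpy as np

def softmax(logits):
    """Numerically stable Softmax: shift by the max before exponentiating."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal(8)   # flattened 1D feature vector
W = rng.standard_normal((3, 8))     # fully connected weights, 3 classes
b = np.zeros(3)                     # bias of the output layer
probs = softmax(W @ features + b)   # class probabilities summing to 1
```

The Softmax output is a probability distribution over classes, so the predicted class is simply the index of the largest entry.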

Activation and loss functions
In addition, activation and loss functions are also essential modules in a CNN. The activation function is generally nonlinear, which enables the network to learn layer-wise nonlinear mappings. Common activation functions include the Sigmoid, Rectified Linear Unit (ReLU) (Hara, Saito, & Shouno, 2015) and Maxout (Goodfellow, Warde-Farley, Mirza, Courville, & Bengio, 2013) functions. A loss function, also referred to as a cost function or an objective function, represents the extent of the inconsistency between the value predicted by the model and the actual value. Furthermore, extra terms, such as L1 and L2 regularization, can be added to the loss function to prevent model overfitting. L1 and L2 regularization can be treated as constraints on some parameters in the loss function.
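As a small sketch of these modules, the following defines two common activations and a cross-entropy loss with an added L2 penalty (the regularization strength `lam` is an arbitrary illustrative value):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy_l2(probs, label, weights, lam=1e-3):
    """Cross-entropy of the true class plus an L2 penalty on the weights."""
    data_loss = -np.log(probs[label])
    reg_loss = lam * np.sum(weights ** 2)   # constrains weight magnitudes
    return data_loss + reg_loss
```

The L2 term grows with the squared weight magnitudes, so minimizing the total loss trades data fit against small weights, which is how the regularization discourages overfitting.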

Training of CNN
A CNN is trained primarily through backpropagation (Hecht-Nielsen, 1989). First, the input data are forward-propagated through a network composed of stacked convolutional layers, pooling layers, fully connected layers, an output layer and activation functions. The errors between the network output and the ground-truth values are calculated by a predefined loss function, and the errors are then backpropagated based on partial derivatives. Under the preset learning rate, each weight is adjusted according to its corresponding error term. This process is performed iteratively until the network model converges. For a CNN, the parameters that need to be obtained through training include the weights of the convolutional kernels, the weights of the fully connected layers, and the bias for each layer. Before a network is trained, all the parameters need to be initialized. Good network parameter initialization makes network training more efficient. Common initialization methods include Gaussian distribution initialization, uniform distribution initialization and so on.
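The iterative loop described above (forward pass, loss, backpropagated gradient, weight update) can be sketched for a single linear layer with a Softmax output, standing in for a full CNN; the synthetic data, Gaussian initialization scale and learning rate are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((16, 4))        # 16 samples, 4 features
y = rng.integers(0, 3, size=16)         # 3 class labels
W = rng.standard_normal((4, 3)) * 0.01  # Gaussian weight initialization
lr = 0.1                                # preset learning rate
losses = []

for step in range(100):
    # forward pass: linear layer + numerically stable Softmax
    logits = X @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    losses.append(-np.log(probs[np.arange(16), y]).mean())
    # backward pass: gradient of cross-entropy w.r.t. logits, then weights
    grad = probs
    grad[np.arange(16), y] -= 1.0
    W -= lr * (X.T @ grad) / 16         # gradient-descent weight update
```

Each iteration nudges the weights against the loss gradient, so the recorded training loss decreases until the model converges.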

Typical CNN models
In early image classification tasks, AlexNet, a CNN with five convolutional layers and two fully connected layers, is widely regarded as one of the most influential CNNs. AlexNet competed in the ImageNet Large Scale Visual Recognition Challenge in 2012 and achieved a top-5 error of 15.3%. AlexNet was the first model to perform so well on the historically difficult ImageNet dataset. After AlexNet, Network-In-Network (NIN) (Lin et al., 2013), VGG-Net (Simonyan & Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), ResNet (He et al., 2015) and DenseNet (Huang, Liu, Maaten, & Weinberger, 2016) emerged successively. The NIN improves the network structure unit: a multilayer perceptron is added to each convolutional layer to replace the simple linear convolution, thereby increasing the nonlinearity of the convolutional layers of the network. In addition, in the NIN, global average pooling (GAP) is used to replace the fully connected layers, thereby alleviating the overfitting problem that can easily be caused by the excess parameters of fully connected layers. Other CNN-based networks, including VGG-Net, GoogLeNet and ResNet, focus on increasing model depth to improve the network structure. VGG-Net is a 19-layer network with smaller (3 × 3) convolutional kernels. GoogLeNet is a 22-layer network with inception modules that better use the computational resources in the network and increase the network depth and width without increasing computation. The residual connections used in ResNet further increase the network depth (>1,000 layers); they primarily solve the problem that very deep networks cannot be trained. In DenseNet, a new dense cross-layer connection is used to improve the network structure. This connection allows each layer of the network to be directly connected to all the previous layers, thereby allowing the features in the network layers to be reused.
In addition, this connection also effectively alleviates the vanishing gradient problem that occurs during network training and enables faster model convergence.
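The cross-layer connection idea behind ResNet can be reduced to an identity shortcut: a block outputs its learned transform plus its unchanged input, so gradients have a direct path through very deep stacks. A minimal sketch, with a stand-in transform in place of real convolutional layers:

```python
import numpy as np

def residual_block(x, transform):
    """Identity shortcut: the block learns a residual F(x) and outputs
    F(x) + x, keeping gradient flow intact in very deep networks."""
    return transform(x) + x

x = np.ones(4)
# stand-in for a stack of convolution + activation layers
out = residual_block(x, lambda v: 0.1 * v)
```

DenseNet generalizes this idea by concatenating, rather than adding, the outputs of all previous layers.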
After several years of rapid development, CNNs have achieved tremendous success in the computer vision field, and they are increasingly being applied to RS image classification. In the following sections, we survey and analyze these applications in detail.

CNN model developments for RS image classification
Compared to ordinary images, RS images contain richer spectral information as well as spatial information reflecting structure, shape and texture. Therefore, quite a few studies focus on CNN model improvements that enable CNN models to better capture the features in RS images. In this section, we survey and analyze CNN model improvements in terms of the main CNN components, as well as parameter initialization and optimization for CNN training.

Input layers of CNNs
In the computer vision field, image information in the red, green and blue (RGB) bands is used as the input for a CNN. In the remote sensing field, images often contain information in more bands, as well as rich multi-scale and texture information. Moreover, multi-source images from different sensors are often utilized to better analyze geographic features of the earth. Therefore, taking full advantage of multi-source and multi-spectral information is essential in CNN-based RS image classification, and several studies focus on this point. For example, Zhang et al. (2018) merged two types of high-resolution RS images (GF-2 and WorldView satellite images) as the input for a CNN to extract roads, and achieved a high accuracy of 99.2%. Similarly, Xu, Mu, Zhao, and Ma (2016) used low- and high-frequency sub-bands reflecting multiscale image features, which were obtained by a contourlet transform on two training datasets (the UC Merced land use dataset and the WHU-RS19 dataset), as the input for a CNN and obtained a scene classification accuracy of above 90%. Furthermore, Xia, Cao, Wang, and Shang (2017) added four types of texture information, namely the mean value, standard deviation, consistency and entropy, to RS images in the RGB bands, and used the resulting data as the input for a CNN to extract roads, vehicles and vegetation based on the CNN and conditional random field methods. This approach greatly improved the classification accuracy. Also, Xu, Wu, Xie, and Chen (2018) used the normalized difference vegetation index (NDVI), the normalized digital surface model (NDSM) and the first component of principal component analysis (PCA1), together with the original bands (R-G-B-IR), as the input for a CNN to extract buildings and vegetation.
The above studies show that index, texture and spectral information from multisource data has increasingly been used in CNN-based RS image classification, and their experimental results indicate that this approach can improve the accuracy of extracting geographic objects from RS images.
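The input-stacking idea above can be sketched as follows: a derived index (here NDVI, computed from the red and near-infrared bands) is appended as an extra input channel alongside the original bands. The band values are synthetic random data for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
h, w = 64, 64
# synthetic reflectance bands in (0.01, 1.0) standing in for real imagery
red   = rng.uniform(0.01, 1.0, (h, w))
green = rng.uniform(0.01, 1.0, (h, w))
blue  = rng.uniform(0.01, 1.0, (h, w))
nir   = rng.uniform(0.01, 1.0, (h, w))

# normalized difference vegetation index, bounded in (-1, 1)
ndvi = (nir - red) / (nir + red)

# channels-first input tensor for a CNN: R, G, B, IR plus the NDVI channel
stacked = np.stack([red, green, blue, nir, ndvi], axis=0)
```

A CNN whose first convolutional layer accepts five input channels can then consume `stacked` directly, letting the network exploit the extra spectral information.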

Fully connected layers of CNNs
CNNs usually contain more parameters to be trained than shallow learning models, and more parameters often require much more training data to ensure CNN model convergence. In computer vision, there are abundant and varied datasets available for training, such as ImageNet, COCO and PASCAL VOC, but few of them are suitable for training CNNs for RS image classification. In the remote sensing field, open datasets suitable for training CNNs for RS image classification exist, but they are very limited. Therefore, it is necessary to reduce the number of parameters when the training data are limited so that a CNN model converges smoothly. In a CNN, the parameters of the fully connected layers account for the majority of the parameters; thus, reducing the parameters of the fully connected layers can make a CNN model converge on relatively limited training data. We select two typical cases to demonstrate this approach. Li, Fu, et al. (2017) added a dropout and a Restricted Boltzmann Machine (RBM) to the fully connected layers, which reduced the number of parameters. They used the UC Merced land use dataset to evaluate the performance of the method, which achieved the best performance compared to other methods, reaching an overall accuracy of 95.11%. Zhong, Fei, and Zhang (2016) proposed a global average pooling (GAP) layer to replace the fully connected network as the classifier. This greatly reduces the total number of parameters in the improved CNN model (large patch convolutional neural network, LPCNN) and makes it easier to train the model with limited training data. LPCNN was evaluated on three different HSR remote-sensing datasets: the IKONOS land-use dataset, the UC Merced land-use dataset, and the SIRI-WHU Google Earth land-use dataset. The improved CNN model achieved the best performance on the three datasets, with overall accuracies (OA) of 92.54%, 89.90% and 89.88%, respectively.
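The global average pooling idea used in LPCNN can be sketched in a few lines: each channel's feature map is collapsed to its mean, producing one value per channel with no trainable parameters, unlike a fully connected head. The tiny 2 × 3 × 3 tensor below is a made-up example:

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Collapse each channel of a C x H x W tensor to its spatial mean.
    Unlike a fully connected classifier head, GAP has no parameters."""
    return feature_maps.mean(axis=(1, 2))

fmaps = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
vec = global_average_pooling(fmaps)   # one scalar per channel, shape (2,)
```

By matching the number of final channels to the number of classes, the GAP output can feed a Softmax directly, which is why this design converges on limited training data.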

Classifiers of CNNs
In the computer vision field, a Softmax or SVM classifier is commonly used in the output layer of a CNN to divide the feature space and obtain classification results. However, a Softmax classifier is sometimes inadequate for dividing the feature space in the remote sensing field, since training data in remote sensing can have relatively high noise levels or relatively large differences between features. Therefore, a few studies concern more powerful classifiers for obtaining better classification results on RS images. For example, to address the problem of speckle noise in synthetic aperture radar (SAR) images used to detect ships, Bai, Jiang, Pei, Zhang, and Bai (2018) substituted a fuzzy SVM for a conventional classifier, thereby reducing the impact of noisy sample points on the division of the feature space. As a result, they achieved an excellent detection accuracy of 98.6% for the ships in the SAR images. In addition, Xu et al. (2016) established a multi-kernel SVM classifier using a linear combination of multiple kernel functions based on the UC Merced land use dataset and the WHU-RS19 scene classification dataset. This classifier adaptively selects a kernel function for classification based on the differences in image features; consequently, it has a greater ability to divide complex feature spaces and thus achieves improved classification accuracy.

Loss functions of CNNs
Unlike common images produced with close-range photographic techniques, RS images are generally produced using aerial photographic techniques from above. As a result, the same geographic object can appear in any orientation in an RS image. This multidirectionality leads to unsatisfactory classification accuracy with the loss functions commonly used in the computer vision field. In addition, common loss functions are good at separating different classes but struggle to differentiate the features of individuals within the same class. A few studies therefore focus on improving the loss functions. For example, Cheng, Zhou, and Han (2016) added an L2 regularization and a regularization constraint term that restricts rotation variation in objects to the original loss function. They evaluated the performance of this method on the NWPU VHR-10 object detection dataset, which includes aircraft, ships, bridges and so on, and achieved a relatively high detection accuracy. Li, Qu, and Peng (2018) designed a loss function in which intraclass compactness and interclass separability are maximized simultaneously for ship detection in SAR images. They designed a dense residual CNN model based on ResNet50 (He et al., 2016) and used the OpenSARShip (Huang, Liu, et al., 2017) dataset to evaluate ship classification accuracy. The accuracy averaged over classes is 77.2%, which is higher than that of the original ResNet50.
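One common way to encode intraclass compactness as a loss term, sketched below, is to pull each feature vector toward its class center (a center-loss-style penalty). This is an illustrative formulation only, not the exact loss of the studies cited above; `lam` and the center matrix are assumptions:

```python
import numpy as np

def center_style_loss(features, labels, centers, lam=0.5):
    """Illustrative intraclass-compactness term: mean squared distance of
    each feature vector from its class center, scaled by lam. Added to a
    standard classification loss, it tightens same-class clusters."""
    diffs = features - centers[labels]   # per-sample offset from its center
    return lam * np.mean(np.sum(diffs ** 2, axis=1))
```

The term is zero when every feature sits exactly on its class center and grows as same-class features spread out, so minimizing it alongside cross-entropy encourages compact, well-separated clusters.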

Network structure of CNNs
The number of layers of CNNs has been increased to achieve higher classification accuracy. However, increasing the number of network layers increases the number of model parameters to be trained, which requires much more training data, while training data are seriously insufficient in the remote sensing field. Consequently, several studies seek to improve the network structure of CNNs. The main idea is to design several independent CNN models that differ in the depth of the convolutional layers or the number of neurons in the fully connected layers, and then combine them through feature fusion or model integration. This idea does not cause a significant increase in the number of network parameters, so the network model can converge easily with limited training data. For example, in order to extract the objects in a built-up area from SAR images, Li, Zhang, and Li (2016) extracted features at three different scales using three independent CNN models that differed in the depth of the convolutional layers. They then imported the extracted features of various scales into the fully connected layers to fuse them, and finally classified the fused features using a Softmax classifier. This network structure is capable of learning rich features of buildings in a built-up area. These features contain relatively detailed and abstract information on the buildings and help effectively improve the learning ability of CNNs. Using the UC Merced land use scene classification dataset, Li, Fu, et al. (2017) first trained several independent CNN models that differed in the number of neurons in the fully connected layers and then integrated the classification results using a voting strategy during the test stage. On this basis, they obtained the final scene classification results. The experimental results showed that this structure yields more accurate scene classification results.
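The model-integration step of the voting strategy reduces to a majority vote over the class predictions of the independently trained models. A minimal sketch, with hypothetical class labels:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class predictions of several independently trained
    models for one sample by taking the most frequent label."""
    return Counter(predictions).most_common(1)[0][0]

# hypothetical predictions of three independent CNNs for one image patch
label = majority_vote(["forest", "urban", "forest"])
```

Feature fusion, by contrast, would concatenate the models' intermediate feature vectors before a shared classifier rather than combining their final decisions.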

Parameter initialization and optimization for CNN training
Assigning random values to the parameters is a simple initialization method, but it has low training efficiency when applied to remote sensing images and may yield unstable training results. Therefore, parameter initialization is often a concern in the remote sensing field. The surveyed studies show that parameter values are generally obtained from CNN models pretrained in the computer vision field (e.g., AlexNet and VGG-Net). This pretrained-network method rapidly transfers the parameter values learned from visual image features, thereby making training for RS image classification more efficient and reducing its complexity and cost. According to the literature (Zhou, Shao, & Cheng, 2016; Castelluccio, Poggi, Sansone, & Verdoliva, 2015; Nogueira, Penatti, & Santos, 2017), there are currently two approaches to initializing parameters for CNN training in RS image classification. The first selects several layers of a pretrained network and fine-tunes them on a remote sensing image dataset, so that the CNN adapts to achieve satisfactory RS image classification accuracy. For example, one study compiled a remote sensing image sample set of five types of urban functional land, namely commercial land, residential land, factory land, land for educational purposes and public land, based on Google Earth remote sensing images. The authors then trained a prediction model by fine-tuning a pretrained AlexNet CNN on this sample set. Subsequently, they used the prediction model to classify images into the five types of urban land in the cities of Shenyang and Beijing. The results demonstrated that the fine-tuned pretrained AlexNet could effectively classify urban functional land. The second approach directly uses a pretrained network as an extractor of remote sensing image features, and the extracted features are then used to train a classifier.
For example, Weng, Mao, Lin, and Guo (2017) used the last convolutional layer of a pretrained AlexNet to extract remote sensing image features and train an extreme learning machine classifier. This classifier achieved a classification accuracy of 95.62% on the UC Merced land use dataset. Marmanis, Datcu, Esch, and Stilla (2016) converted the 1D remote sensing image features extracted by the fully connected layers of a pretrained network into a 2D feature matrix. This matrix was used to train a CNN model containing two convolutional layers, two fully connected layers and one Softmax classifier. They used this CNN model to classify the scenes in the UC Merced land use dataset and achieved an overall classification accuracy of 92.4%. Besides, Lu et al. (2015) used the network parameters obtained by training on a linear land elimination task as the initial parameters of a proposed CNN model, and then used the eigenvectors from the trained CNN model as the input for an SVM to extract farmland from UAV images.
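The feature-extractor approach can be sketched as follows: a frozen "pretrained" backbone (simulated here by a fixed random projection with a ReLU, an assumption standing in for real pretrained convolutional layers) maps images to feature vectors, and only a lightweight classifier would then be trained on those vectors:

```python
import numpy as np

rng = np.random.default_rng(7)
# frozen weights standing in for a backbone pretrained on natural images;
# these are never updated during downstream training
W_backbone = rng.standard_normal((256, 32))

def extract_features(flat_image):
    """Frozen feature extractor: a fixed projection followed by ReLU."""
    return np.maximum(0.0, flat_image @ W_backbone)

features = extract_features(rng.standard_normal(256))
# `features` would now be used to train an SVM / ELM classifier
```

Because the backbone is frozen, only the small classifier's parameters are learned, which is what makes this approach viable with limited remote sensing training data.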
Parameter optimization is also a key step during the training of CNNs and can be achieved with the help of a parameter optimizer. Common training optimizers include stochastic gradient descent (SGD), RMSprop, Adam, Adadelta and Adagrad. In several of the surveyed studies, SGD is employed as the parameter optimizer, together with a momentum technique that can prevent the model from getting stuck in local minima and help it approach the global minimum quickly.
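The SGD-with-momentum update mentioned above can be written in a few lines; the learning rate and momentum coefficient below are typical illustrative values:

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    """One SGD-with-momentum update: the velocity accumulates a decaying
    sum of past gradients, damping oscillation and carrying the weights
    through shallow local minima."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

w = np.zeros(2)
v = np.zeros(2)
g = np.array([1.0, -2.0])
w, v = sgd_momentum_step(w, g, v)   # first step: velocity = -lr * grad
```

With zero initial velocity the first step equals plain SGD; on later steps the accumulated velocity keeps the update moving in the prevailing gradient direction.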

CNN training data for RS image classification
Like other supervised learning algorithms, CNN-based deep learning needs to learn features from a large amount of training data to achieve satisfactory model generalization. Training data from remote sensing images are much scarcer than data from natural images. As described in Sec. 3, some improvements of CNNs address the model overfitting that results from insufficient training data. In practice, there are also studies on data augmentation and the moderate use of weakly labeled training data. In this section, we review these studies and summarize the open datasets available for model training in remote sensing.

Open datasets
Although open datasets in the remote sensing field are limited, they play a very important role in training CNN models for RS image classification. Table 1 summarizes the open datasets used for training and validating CNN models for RS image classification. These datasets are categorized into three groups based on three kinds of classification tasks: scene classification, object detection and object segmentation. Table 1 shows that most datasets are from Google Earth, so the bands of most datasets provide only red, green and blue (R-G-B). Some datasets are from satellite or airborne sensors and provide an extra near-infrared band. Table 1 also shows that the datasets have high resolution, since they normally come from high-resolution sensors. These datasets also differ in the number and definition of categories. Most studies use a single dataset as training data; few use multiple datasets for training.

Data augmentation techniques
As shown above, although there are several datasets available for training CNN-based RS image classification models, the category types and numbers of labels in the datasets are still extremely limited and often fail to meet the data scale requirements of model training. Acquiring samples by manual visual interpretation has very low efficiency and a relatively high cost. Therefore, some studies focus on data augmentation techniques and the use of weakly labeled samples. For example, when using the RSOD-Dataset to detect oil barrels, Long et al. (2017) employed three operations, specifically translation, scaling and rotation, to augment the training data. The augmented training data were 60 times the original data in volume. After the data augmentation, the detection accuracy for oil barrels reached as high as 96.7%. Their results demonstrate that properly increasing the sample size can effectively improve a CNN model's performance. Lacking open datasets, Zhou, Shi, and Ding (2017) also augmented a small volume of manually labeled aircraft training data using three processing methods: mirroring, rotation and bit-plane decomposition. The bit-plane method merged the eight bit-plane images obtained from the decomposition of each grayscale image at a new ratio. The size of the training set was increased 32-fold, and the test accuracy increased from 72.4% (based on the original training set) to 97.2% (based on the augmented training set). In addition, Zhong et al. (2016) proposed an augmentation method applicable to datasets for classifying scenes in RS images. Taking into consideration the random and multi-scale distribution of spatial objects in a scene, this method increases the sample size by adjusting the sampling window size and sampling from a scene with a sliding scheme. This method was evaluated on the IKONOS land use dataset, the UC Merced land use dataset and the SIRI-WHU dataset.
It was found to have effectively improved the scene classification accuracy.
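The geometric augmentations discussed above (mirroring and rotation; bit-plane decomposition is omitted here) can be sketched for a single labeled patch, multiplying the training set without new labeling effort:

```python
import numpy as np

def augment(patch):
    """Generate flipped and 90-degree-rotated copies of a training patch,
    as in the mirroring/rotation augmentations described above. The label
    of the original patch carries over to every variant."""
    variants = [patch, np.fliplr(patch), np.flipud(patch)]
    variants += [np.rot90(patch, k) for k in (1, 2, 3)]
    return variants

patch = np.arange(9).reshape(3, 3)   # toy 3x3 labeled sample
augmented = augment(patch)           # 6 variants from one sample
```

Because RS images are captured from above, an object is equally plausible in any orientation, which is why rotation and mirroring are safe label-preserving augmentations in this domain.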

Moderate use of weakly labeled samples
Accurate sample labeling requires considerable amounts of labor and time, whereas there are large amounts of application-related, weakly labeled datasets, e.g., coarsely labeled RS image data for object detection tasks and inaccurately labeled map data for object extraction tasks (e.g., OpenStreetMap (OSM)). The moderate use of these weakly labeled samples meets the basic quality requirements of training data for most RS image classification, and it is an effective means of increasing the amount of training data. For example, aircraft detection needs to distinguish aircraft from complex and diverse backgrounds. Zhang, Du, Zhang, and Xu (2016) first obtained a CNN model by training on aircraft sample data with simple backgrounds. Based on this model, they obtained sample data that were misclassified as aircraft from the UC Merced land use dataset, a weakly labeled dataset representing background information. They then treated the misclassified sample data as samples with complex backgrounds and added them to the training sample set. The resulting training sample set was more extensive and more representative. In addition, to address the lack of accurate training data for a building extraction task, Maggiori, Tarabalka, Charpiat, and Alliez (2017) obtained a preliminary prediction model by pretraining on the OSM dataset, a weakly labeled dataset containing errors, and then fine-tuned and corrected the prediction model with a small number of accurately labeled building samples. With the corrected model, they obtained extraction results with higher accuracy. This result demonstrates that weakly labeled samples can effectively alleviate the problem of insufficient training data in some cases.

Application cases
Application cases of CNN-based RS image classification are classified into scene classification, object detection and object extraction. Scene classification is the process of determining the type of a remote sensing image based on its content. Object detection is the process of determining the locations and types of the targets to be detected in a remote sensing image and labeling their locations and types with bounding boxes. Object extraction is the process of determining the accurate boundaries of the objects to be extracted in a remote sensing image. In this section, we summarize these application cases.

Scene classification
Scene classification is a mapping process of learning and discovering the semantic content tags of image scenes (Bosch & Zisserman, 2008). Generally, an image scene is a collection of multiple independent geographic objects. These objects have different structures and contain different texture information, and through different combinations and spatial arrangements they form different types of scenes. In scene classification studies in the remote sensing field, the UC Merced land use dataset is commonly viewed as the reference dataset for validating scene classification methods. Methods of scene classification are summarized in Table 2. The LPCNN method is characterized by a specific data augmentation technique to enhance CNN training and by global average pooling to reduce parameters, while the MS-DCNN method is characterized by multi-source inputs and a multi-kernel SVM classifier. The next two methods use a pretrained CNN model to learn deep and robust features, but the CNN-ELM method replaces the fully connected layers of the CNN with an extreme learning machine (ELM) classifier to obtain excellent results. The fifth method combines a pretrained network and an RBM-retrained network into a two-stage training network, which also obtains good results. The latest studies on scene classification have proposed methods called GCF-LOF CNN, the deep local-global feature fusion framework (DLGFF) and the deep random-scale stretched convolutional neural network (SRSCNN). The GCF-LOF CNN is a novel CNN that integrates global-context features (GCFs) and local-object-level features (LOFs). Similarly, the DLGFF establishes a framework integrating multi-level semantics from a global texture feature-based method, the bag-of-visual-words (BoVW) model and a pre-trained CNN. The SRSCNN applies random-scale stretching to force the CNN model to learn a feature representation that is robust to object scale variation.
As the classification accuracies in Table 2 show, recent studies have achieved very high classification accuracy.
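Global average pooling, used in LPCNN to reduce parameters, simply averages each feature map to a single value, so the parameter-heavy fully connected layers can be replaced by a compact per-channel descriptor. A minimal NumPy sketch:

```python
import numpy as np

def global_average_pooling(feature_maps):
    # Collapse a (C, H, W) stack of feature maps to a length-C vector by
    # averaging over the spatial dimensions, replacing the parameter-heavy
    # fully connected layers.
    return feature_maps.mean(axis=(-2, -1))

fmaps = np.random.rand(16, 7, 7)            # 16 channels of 7x7 activations
descriptor = global_average_pooling(fmaps)  # one value per channel
```

Because the pooling has no weights of its own, the layer adds zero parameters regardless of the spatial size of the feature maps.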

Object detection
Object detection from remote sensing images determines the locations and types of objects. The object detection application cases from remote sensing images use the candidate region-based object detection method. The method involves three steps: generation of candidate regions, feature extraction by the CNN and classification of candidate regions. Candidate regions are a series of locations in the image at which objects may appear; all of these locations are used as input to the CNN for feature extraction and classification. Based on the candidate region generation method, existing studies can be classified into two categories: those that use the sliding window method and those that use a region proposal method. The sliding window method is a type of exhaustive method: a sliding window is used to extract candidate regions, and the presence of target objects is determined window by window. A region proposal method establishes regions of interest for object detection: bounding boxes that may contain target objects are first generated, and the presence of target objects is then determined for each bounding box. Depending on whether they rely on an external method for candidate region proposals, region proposal CNNs can further be classified into region-based CNNs (R-CNNs) and Faster R-CNNs (Ren, He, Girshick, & Sun, 2017). The common region proposal methods used in R-CNNs include selective search and edge boxes. In a Faster R-CNN, an internal region proposal network is used to generate candidate regions, replacing the external region proposal method. Table 3 compares the candidate region-based object detection methods in terms of input images for the CNN as well as their advantages and disadvantages.
As Table 3 shows, the exhaustive sliding window method generates a large number of candidate regions, causing a significant volume of repeated calculation in subsequent operations, and therefore has relatively low efficiency when used to process remote sensing images of large scenes. R-CNNs require a relatively long time to compute candidate regions, and their per-region feature extraction also results in repeated calculations. The Faster R-CNN offers relatively high object identification capability and end-to-end object detection, is independent of an external region proposal method, and achieves very high detection speed.
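The exhaustive sliding window scheme described above can be sketched as a simple enumeration of fixed-size boxes (the window size and stride below are arbitrary illustrative values):

```python
def sliding_windows(img_h, img_w, win=64, stride=32):
    # Enumerate candidate regions (x, y, w, h) with a fixed-size window
    # slid over the image -- the exhaustive scheme described above.
    boxes = []
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            boxes.append((x, y, win, win))
    return boxes

boxes = sliding_windows(256, 256)   # every box is then fed to the CNN
```

Even this small 256 x 256 example yields dozens of overlapping windows, which illustrates why per-window CNN evaluation becomes costly for large-scene remote sensing images.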

Object segmentation
To extract objects from a remote sensing image, it is necessary to segment the objects of interest in the image and to produce a pixel-level image classification map. Two types of methods are primarily used in the existing CNN-based studies on object segmentation from remote sensing images, namely patch-based CNN methods and end-to-end CNN methods. A patch-based CNN method generally first obtains a prediction model by training a CNN on a training dataset, and then, based on the prediction model, it generates image patches using a sliding window pixel by pixel and predicts the type of each pixel of the image. The fully convolutional networks (FCN) method, a common end-to-end CNN method, substitutes deconvolutional layers for the fully connected layers of a CNN, allowing the network to accept input images of any size and directly generate pixel-level object extraction results. Table 4 compares the two existing types of deep learning methods in terms of input images as well as their advantages and disadvantages.
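The patch-based method described above can be sketched as follows; the thresholding "classifier" is a stand-in for a trained CNN's per-patch prediction, and the tiny synthetic image is illustrative only.

```python
import numpy as np

def classify_patch(patch):
    # Stand-in for a trained CNN's per-patch prediction: here, a simple
    # threshold on mean intensity.
    return int(patch.mean() > 0.5)

def patch_based_segmentation(img, half=2):
    # Predict a label for every pixel from the (2*half+1)-sized patch
    # centred on it, generated with a pixel-by-pixel sliding window.
    padded = np.pad(img, half, mode="reflect")
    out = np.zeros(img.shape, dtype=int)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = classify_patch(padded[i:i + 2 * half + 1,
                                              j:j + 2 * half + 1])
    return out

img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0            # a square "object" on a dark background
mask = patch_based_segmentation(img)
```

The nested per-pixel loop makes the repeated computation explicit: neighbouring patches overlap almost entirely, which is precisely the redundancy that end-to-end FCN methods remove by producing the full pixel-level map in one forward pass.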

Challenges and conclusions
The emergence of deep learning has provided an opportunity for mining and analyzing big remote sensing data. CNNs, a type of deep learning model, play an important role in RS image classification research. In this paper, we surveyed the current state of the art of CNN-based deep learning for RS image classification. Unlike images in the computer vision field, RS images are difficult to extract features from because of the complexity of the objects they contain. Thus, there have been many studies addressing CNN-based RS image classification issues, and they have achieved certain breakthroughs in CNN models, training data and training methods for RS image classification. However, these studies are just the beginning of CNN-based RS image classification research. RS image classification still faces unprecedented and significant challenges, and a number of issues need to be considered and investigated in depth, which we summarize as follows.

Insufficient training datasets

According to our investigation, far fewer RS training datasets exist than image datasets in the computer vision field. This is understandable, as the preparation of RS training data is much more time-consuming. The number of remote sensing scientists devoted to deep-learning-based research is still limited, and few of them put much effort into producing RS training datasets. Some studies address insufficient training data with data augmentation techniques, which increase the training sample size and sample diversity. However, these techniques are inadequate for training complex or large deep learning models. A substantial increase in remote sensing training datasets is indispensable. The insufficient-training-dataset issue requires attention from worldwide remote sensing communities, which may promote and sponsor initiatives to develop RS training datasets.
The training datasets could come from multisource remote sensing data, including visible-light, SAR, hyperspectral and multispectral images. In addition, studies of CNN training with inaccurately labeled data, including weakly supervised, semi-supervised and unsupervised approaches, are expected to develop further. Such studies will complement the work on developing accurately labeled training datasets.

RS image-specific CNN models
RS images are different from images in the computer vision field. They involve sensors with multiple imaging methods, including optical imaging, thermal imaging, LiDAR and radar. They can also come from satellite or airborne platforms and thus vary in spatial scale. At the same time, unlike the objects that cover most of a natural image, objects in RS images are generally small and scattered. Furthermore, the viewing angle of RS images, unlike that of natural images, is usually top-down, which makes feature extraction from RS images difficult. Therefore, CNN models developed in the computer vision field are not fully adequate for RS image classification. As investigated in this review, existing studies have improved CNNs from various perspectives, including input data, fully connected layers, classifiers, loss functions and network structures, to achieve better RS image classification accuracy, and they have achieved considerable success. There is no doubt that RS image-specific CNN models need to be studied further. Future research directions that deserve more attention include: (a) the study of CNNs with multiple RS image inputs, where the multiple inputs refer to multiple sensors with the same or very similar spatial scales; such CNNs could exploit many more features of spectrum, shape and texture, and studies on handling multiple RS image inputs at different spatial scales are also required; and (b) the study of a general structure of CNNs specific to remote sensing images; CNNs have a flexible structure, but there is a lack of sufficient theory for designing CNN structures, and existing studies on CNN structure are based on empirical knowledge, so a general CNN structure informed by remote sensing theory is desirable.

The CNN's time efficiency
The majority of CNN-based RS image classification studies focus on classification accuracy; very few focus on a CNN's time efficiency during training. To meet the requirements of processing big remote sensing data in practical production, high-performance computing devices (e.g., GPUs) and advanced model training techniques can be used to accelerate model training and testing. Transfer learning, a machine learning method in which a model developed for one task is reused as the starting point for a model on a second task, is an effective approach to speeding up the training of CNNs. CNN training time efficiency could be addressed in future studies.
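The idea behind transfer learning, reusing frozen pretrained layers and training only a small task-specific head, can be sketched as follows. This is a toy illustration: a fixed random projection stands in for pretrained convolutional layers, and the dataset is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" layers: a fixed random projection here, standing in
# for convolutional layers trained on a large source dataset.
W_pretrained = rng.normal(size=(64, 16)) / 8.0

def extract_features(x):
    # Frozen feature extractor: no parameters are updated here.
    return np.maximum(x @ W_pretrained, 0.0)

def train_head(x, y, lr=0.1, epochs=200):
    # Train only a small logistic-regression head on the target task;
    # reusing the frozen layers is what makes training fast.
    f = extract_features(x)
    w = np.zeros(f.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-f @ w))
        w -= lr * f.T @ (p - y) / len(y)
    return w

# Synthetic stand-in for a small labeled target-task dataset.
x = rng.normal(size=(100, 64))
y = (x[:, 0] > 0).astype(float)
w_head = train_head(x, y)
preds = 1.0 / (1.0 + np.exp(-extract_features(x) @ w_head)) > 0.5
accuracy = (preds == y).mean()
```

Only the 16 head weights are optimized here; in a real CNN the analogous step fine-tunes a few top layers while the costly early layers stay fixed, which is why training converges much faster than from scratch.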

High-level CNN-based applications in RS image classification
Current CNN-based applications in RS image classification resemble the classification tasks in computer vision: scene classification, object detection and object segmentation, with the first two accounting for the majority. More attention could be paid to higher-level CNN-based applications, e.g., high-accuracy extraction of semantic information on scenes, extraction of more complex objects, super-resolution reconstruction, multi-label remote sensing image retrieval and so on.

CNN-based classification is a state-of-the-art approach to extracting geographic features from remote sensing images. This paper reviewed the literature on CNN-based remote sensing image classification. We summarized the improvements on CNN models for remote sensing classification, which helps in understanding how CNNs can be better applied to remote sensing image classification. Because training data are always key to deep learning methods, the available open datasets and data augmentation techniques for remote sensing classification were comprehensively surveyed. We also summarized the methods for three typical remote sensing image classification tasks, scene classification, object detection and object segmentation, with specific applications of CNN-based models. Finally, the challenges of CNN-based remote sensing image classification research were listed, and corresponding suggestions were proposed. We hope that this paper can facilitate the advancement of remote sensing image classification and help remote-sensing scientists further explore and discover more remote sensing image classification methods.

Data availability statement
Data sharing is not applicable to this article as no new data were created or analysed in this study.