Mapping and analyzing the local climate zones in China’s 32 major cities using Landsat imagery based on a novel convolutional neural network

ABSTRACT The Local Climate Zone (LCZ) scheme provides researchers with a standard method to monitor the Urban Heat Island (UHI) effect and conduct temperature studies. How to generate reliable LCZ maps has therefore become a research focus. In recent years, researchers have attempted to use Landsat imagery to delineate LCZs and generate maps worldwide based on the World Urban Database and Access Portal Tools (WUDAPT). However, the mapping results obtained by the WUDAPT method are not satisfactory. In this paper, to generate more accurate LCZ maps, we propose a novel Convolutional Neural Network (CNN) model (namely, LCZ-CNN), which is designed to cope with the issues of LCZ classification using Landsat imagery. Furthermore, in this study, we applied the LCZ-CNN model to generate LCZ mapping results for China’s 32 major cities distributed in various climatic zones, achieving a significantly better accuracy than the traditional classification strategies and a satisfactory computational efficiency. The proposed LCZ-CNN model achieved satisfactory classification accuracies in all 32 cities, and the Overall Accuracies (OAs) of more than half of the cities were higher than 80%. We also designed a series of experiments to comprehensively analyze the proposed LCZ-CNN model, with regard to the transferability of the network and the effectiveness of multi-seasonal information. It was found that the first convolutional stage, corresponding to low-level features, shows better transferability than the second and third convolutional stages, which extract high-level and more image- or task-oriented features. It was also confirmed that the multi-seasonal information can improve the accuracy of LCZ classification. The thermal characteristics of the different LCZ classes were also analyzed based on the mapping results for China’s 32 major cities, and the experimental results confirmed the close relationship between the LCZ classes and the magnitude of the Land Surface Temperature (LST).


Introduction
The Urban Heat Island (UHI) effect, i.e. the atmospheric warmth of a city compared with its surrounding countryside, is a significant climatic response to the disruptions caused by urbanization (Stewart and Oke 2012). Measurements of the UHI intensity of the thermal environment have become the key content of recent studies (Streutker 2003;Fisher, Mustard, and Vadeboncoeur 2006;Li et al. 2011). In these studies, most of the researchers have measured the UHI effect through a simple "urban-rural" classification scheme, to show the temperature difference between the urban area and its rural surroundings (Streutker 2003;Fisher, Mustard, and Vadeboncoeur 2006;Li et al. 2011). However, urban-rural systems are typically complex, and the temperature data from various sites can have significant differences due to the distinct thermodynamic characteristics of local landscapes. For these reasons, it is difficult to compare results across cities and to obtain quantitative meta data of the environment around a site using this simple binary system. In order to address the above problems, the Local Climate Zone (LCZ) scheme was introduced by (Stewart and Oke 2012) as a climate-based classification scheme, and is now regarded as a standard way to monitor the UHI effect and analyze urban temperatures . Compared with simple urbanrural division, the LCZ system is segmented into "built" types (LCZ 1-10) and "natural" types (LCZ A-G), based on the regional landscape patterns (Stewart and Oke 2012). LCZs can be used to facilitate inter-site comparison and effectively measure the UHI magnitude among different cities ). The classification system can be applied to monitoring the UHI effect, establishing climatic models, and designing urban landscapes.
There have been a number of studies of computerbased approaches to interpret LCZs. One of the widely used techniques is the Geographic Information System (GIS)-based technique (Lelovics et al. 2014;Geletič and Lehnert 2016), in which a large amount of input parameters about the urban surface are needed, such as the Sky View Factor (SVF), Impervious Surface Fraction (ISF), and building height . However, detailed urban geographical data are not always available in many cities, especially in the developing countries Xu et al. 2017). The other common technique is the use of freely available satellite imagery to delineate LCZs and generate maps worldwide. The World Urban Database and Access Portal Tools (WUDAPT) project is a community-based initiative to produce LCZ maps around the world. The WUDAPT method  involves conducting urban classification based on the LCZ concept with Landsat images, the Google Earth platform, and a random forest classifier. To classify LCZs based on the WUDAPT framework, it is first necessary to create training areas using the Google Earth platform, then collect the spectral features of the Landsat images, and finally achieve LCZ classification via the random forest classifier.
The WUDAPT method has been widely adopted in relevant research Xu et al. 2017) because the input data and tools can be conveniently and freely employed . However, although the WUDAPT method has been applied to LCZ classification in more than 50 cities around the world (Danylo et al. 2016;Bechtel et al. 2019), the results for only a few cities have been quantitatively validated to have satisfactory mapping accuracies Danylo et al. 2016;Wang et al. 2018). For instance, LCZ classification based on the WUDAPT method achieved an OA of 64% in Kyiv in the Ukraine (Danylo et al. 2016), and 62% in Guangzhou in China . These results indicate that the WUDAPT-based LCZ classification approach is not satisfactory, especially for the dense and highly compact Asian cities, and the use of the approach can significantly affect the subsequent applications . To obtain more accurate mapping results, researchers have attempted to incorporate other satellite data sources, e.g. the addition of Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) and Synthetic Aperture Radar (SAR) data, which improved the OA by 1-2% and 0.3%, respectively Bechtel et al. 2015). Xu et al. (2017) also attempted to take into account Gray-level Co-occurrence Matrix (GLCM) textural features, increasing the OA of the classification results for Wuhan and Guangzhou by 2%. However, to sum up, neither the combined use of other satellite images nor the addition of textural features can substantially strengthen LCZ mapping results. Moreover, the accuracy of the mapping results produced by the WUDAPT method over a large number of other cities still requires further validation .
LCZs are defined by a combination of surface cover, urban structure, material, and human activities, and can be regarded as complex scenarios with multiple land-cover types and multiple ground objects, spanning scales of hundreds of meters to several kilometers (Stewart and Oke 2012). Although the addition of textural features did not improve the classification accuracy significantly in the existing literature, these features can actually provide important information on urban structure and can contribute to LCZ interpretation. A possible explanation for the unsatisfactory performance of the WUDAPT method in the current studies is that spectral and textural features are still low-level features, which cannot capture detailed information about the urban landscapes and semantics. In summary, LCZ classification is a very complex problem of remote sensing scene interpretation. For example, the open high-rise (LCZ 4) category is made up of open high-rise buildings, low plants, scattered trees, and soil, and the algorithms based on the WUDAPT method might not be well adapted to such a complicated scene classification task. Therefore, it is necessary to design a more suitable method for LCZ classification from the perspective of scene interpretation, and to substantially improve the quality of the current LCZ mapping products.
Convolutional Neural Networks (CNNs), as a commonly used deep learning method, have the ability to represent higher-level features, and have shown their superiority in semantic understanding . CNNs have demonstrated great potential in many remote sensing tasks, e.g. scene classification (Gao et al. 2021), water body extraction (Yu et al. 2018), semantic segmentation (Kemker, Salvaggio, and Kanan 2018), and land-use mapping . These studies showed that CNN models can substantially outperform the traditional image interpretation methods. Nevertheless, CNN models have rarely been applied to the classification of Landsat images, although they do have the potential to mine the spectral-spatial information.
In general, there are two strategies used to exploit CNN models for remote sensing classification: 1) the use of a pre-trained or fine-tuned CNN; and 2) fully training a CNN from scratch. The first strategy relies on the classical pre-trained CNN networks transferred from an auxiliary domain with natural images. These classical CNN models are directly used as feature extractors, and the number of categories can be modified according to the current sample dataset, and then fine-tuned to conduct the classification. Although some classical pre-trained models have been used in previous studies, e.g. LeNet (Lecun et al. 1998), AlexNet (Krizhevsky, Sutskever, and Hinton 2017), GoogLeNet (Szegedy et al. 2015), and VGGNet (Simonyan and Zisserman 2014), their input channels only refer to one or three spectral channels (RGB). However, Landsat images contain many more spectral bands, i.e. seven multispectral bands with a resolution of 30 m as well as a panchromatic band with a resolution of 15 m. Therefore, the existing CNN models cannot make full use of the rich spectral information in Landsat images. Moreover, the network structures of the existing CNNs are constructed from high-resolution natural images, and are often configured with a large input window, with a lot of convolutional layers and parameters. The collection of massive training samples from remote sensing images also requires a huge amount of labor, and is very timeconsuming (Li, Huang, and Gong 2019). In summary, the aforementioned issues restrict the application of the current CNN models to the mapping of Landsat multispectral images. In contrast, the second strategy, i.e. networks based on training from scratch, allows the network architecture to be customized and parameters set according to the specific requirements, leading to greater flexibility and expandability, especially for Landsat images, which have a relatively low spatial resolution, and for which it is difficult to collect training samples. Some studies have attempted to design and train a small task-specific architecture with the aid of the current remote sensing datasets and the typical filters (Li, Huang, and Gong 2019). These small and fully trained CNNs can avoid the dilemmas of the high cost and risk in training the existing "large-scale" networks with high redundancy and over-parameterization (Li, Huang, and Gong 2019). Therefore, it is interesting and meaningful to build an adaptive CNN model dedicated to the task of LCZ classification using Landsat imagery.
The difference of the thermal properties among the different LCZ categories also deserves comprehensive investigation, from the application viewpoint. However, to date, the relevant studies are inadequate. Alexander and Mills (2014) found that the LCZ D category (low plants) was generally the coolest, while the LCZs with higher impervious and building fractions were found to be warmer in Dublin, Ireland. Geletič and Lehnert (2016) calculated daytime Land Surface Temperature (LST) differences for the LCZs in Prague and Brno in the Czech Republic. The highest LST was found in LCZ 2 (compact mid-rise), LCZ 3 (compact low-rise), and LCZ 10 (heavy industry), while the lowest LST was observed in LCZ A (dense trees) and LCZ G (water) in both cities. Lelovics et al. (2014) found that the high-rise and mid-rise LCZs were warmer than the low-rise types in Szeged in Hungary, while Wang et al. (2018) obtained a different result, in that the low-rise and mid-rise LCZs exhibited higher LST than the high-rise types in Phoenix and Las Vegas in the U.S. Most of the recent research (Geletič and Lehnert 2016;Cai et al. 2018;Lelovics et al. 2014) has been aimed at typical cases of one to two cities, and their conclusions about the thermal properties of the different LCZs have not been entirely consistent, owing to the differences of the urban landscapes and background climates. Although some studies obtained analysis results in multiple cities, they were either based on urban agglomeration from a specific climatic zone (Cai et al. 2018) or a separate analysis of each city (Bechtel et al. 2019), and they did not comprehensively analyze the characteristics and differences of the climatic zones. Stewart, Oke, and Krayenhoff (2014) encouraged more LCZ evaluation studies for urban environments under various climatic conditions. It is therefore necessary to explore the relationship between LCZs and the thermal environment, and to evaluate the LCZ classification scheme performance based on a large number of cities distributed in different climatic zones.
In this context, from the technological aspect, we propose a novel CNN framework (namely, LCZ-CNN), which is designed to cope with the issues of LCZ classification using Landsat imagery. The proposed LCZ-CNN model adequately considers the multi-spectral and multi-seasonal information of the Landsat images. Meanwhile, the traditional classification strategies, including WUDAPT , the Bag of Visual Words (BoVW) model (Yang and Newsam 2010), and the Multi-Layer Perceptron (MLP) (Hu 2011), were also applied for comparison. Furthermore, we applied the LCZ-CNN model to generate the LCZ mapping results for China's 32 major cities distributed in different climatic zones, and the thermal characteristics of the different LCZ classes were then analyzed.

Study area
China is a country with a vast territory and a huge population, located in the east of Asia and the west coast of the Pacific Ocean. China's terrain is high in the west and low in the east. Mountains, plateaus, and hills account for about 67% of the land area, while basins and plains account for the rest. Owing to the tremendous differences in geographical conditions, the climate of China is diverse, ranging from tropical in the south to subarctic in the north, and alpine in the higher elevations of the Tibetan Plateau. The diversity of climate and the uneven distribution of population also lead to large differences in the landscapes of the various cities in China.
As the largest developing country in the world, China has experienced a rapid urbanization process in recent decades, exacerbating the UHI effects in the large cities (Gong et al. 2012). Hence, this study focused on the 32 major cities of China, with regard to their accurate LCZ mapping, as well as a thermal environmental analysis ( Figure 1). All of these cities are municipality or provincial capitals, which are located in different climatic zones and exhibit complex urban forms and landscapes. Therefore, the LCZ mapping in these selected cities was both challenging and meaningful, allowing us to not only test the robustness of the proposed deep neural network from the methodology point of view, but also to investigate the thermal characteristics of the megacities of China. Figure 2 shows examples of the representative LCZ classes in the 32 major cities, with detailed descriptions, as well as the high-resolution images obtained from Google Earth. The numbers of LCZ classes in the different cities are different, due to the differences of their development levels and urban landscapes. In this research, the size of the input window for the LCZ classification was set to 240 m, which was identified according to the definition of LCZs and the previous experimental results obtained for high-density Asian cities (Stewart and Oke 2012;Lau et al. 2015).

Data
In total, 128 Landsat 8 level-1 image scenes from different seasons acquired mainly in 2015 with minimum cloud coverage were downloaded from the U.S. Geological Survey Earth Explorer website for the 32 cities ( Figure A1). Since the image quality for several cities in 2015 was poor, the available images of the most recent years were selected as supplements. We adopted the Additive Wavelet Luminance Proportional (AWLP) method to incorporate the spatial details in the highresolution (i.e. 15 m) panchromatic bands of the Landsat images, allowing us to better capture the spatial structure information of the LCZs. AWLP is a widely used and robust pan-sharpening algorithm (as suggested by the large-scale comparison in (Meng et al. 2020)), which utilizes the electromagnetic spectrum responses of the sensors to improve the spatial resolution of multispectral images (Otazu et al. 2005).

Architecture of the proposed LCZ-CNN model
In this research, we developed the novel LCZ-CNN model, adapting deep learning to Landsat-based LCZ classification. The model architecture is shown in Figure 3. The proposed LCZ-CNN model consists of one input layer, three convolutional layers, two maxpooling layers, one global average pooling layer, and one softmax layer (Figure 3(c)). A stage of the LCZ-CNN model (Figure 3(b)) includes a convolution operation, an elementwise non-linear function, and a pooling operation. As such, the main structure of the LCZ-CNN model can be summarized into three typical convolutional stages. Examples of representative LCZs from Chinese cities. This classification system is segmented into "built" types (LCZ 1-10) and "natural" types (LCZ A-G), based on the regional landscape patterns (Stewart and Oke 2012).
The input layer is intended to integrate Landsat images from four seasons to provide temporal information that facilitates LCZ classification (Wicki and Parlow 2017). The input window size (16 × 16) of a sample patch is determined according to the scale (240 m × 240 m) of the LCZs, which is set according the scale of the LCZ classification of high-density cities (Lau et al. 2015). The experimental results based on the proposed method in Table A1 also confirm that 240 m is a suitable window size for LCZ classification. The number of input channels is 32, consisting of eight Landsat spectral bands with four seasons. A major advantage of the proposed LCZ-CNN model is that it can make full use of the rich temporal and spectral information embedded in the Landsat images. The sample dataset was further divided into a training dataset and a test dataset (Figure 3(a)). The former was used to construct the network, and the latter was used to evaluate the mapping accuracy.
The purpose of the convolutional layers is to extract feature maps by applying a convolution operation to the input bands. Each convolution neuron processes data only for its receptive field. An elementwise non-linear activation function is then applied to these feature maps to complete a non-linear transform. The operation of a convolutional layer can be expressed as: whereX l denotes the activation value of layer l, X lÀ 1 represents the input activation value produced by layer ðl À 1Þ, f ð�Þcorresponds to the activation function, andw l and b l represent the weights and biases at layer l, respectively.
The Rectified Linear Unit (ReLU) is chosen as the non-linear activation function, due to its ability to alleviate the problem of gradient disappearance, and to speed up the training : where x represents the output values for the neurons. The first convolutional layer has 64 small convolutional filters (each of 3 × 3). Its function is not only to extract the low-level visual features (Zhang, He, and Lu 2019), but also to output the linear combination of the spectral bands from the multi-seasonal images to sufficiently exploit the spatio-temporal information from the Landsat data. Please note that, in the proposed LCZ-CNN model, the size of an LCZ patch (16 × 16) is much smaller than that for natural images, e.g. 224 × 224 in GoogLeNet (Szegedy et al. 2015) and VGGNet (Simonyan and Zisserman 2014). With regard to the coarse spatial resolution of Landsat imagery (15 m after pan-sharpening), large convolutional filters cannot capture the small spatial variation within hundreds of meters. Thus, small convolutional filters (3 × 3) are also adopted in the fourth and sixth layers, which also make more nonlinear changes (i.e. by activation function), as well as reducing the parameters.
The purpose of the pooling layers is to achieve translation invariance, making the architecture less susceptible to small variations in location, and allowing the network to represent the more abstract features of the input as the network depth increases (Lai et al. 2019). The operation of a pooling layer can be written as: where Nand Wdenote the sizes of the output maps and input maps, respectively, Frepresents the size of the pooling filters. P corresponds to the number of padding pixels, and S represents the stride value. Feature maps obtained from the last convolutional layer are usually fed into the fully connected layers (Figure 4(a)), followed by a softmax classification layer with the same number of classification categories. However, the redundancy of the parameters produced by the fully connected layers can result in overfitting, thus restricting the generalization ability of the whole network (Xu et al. 2019). Moreover, the previously extracted features cannot maintain their spatial structure after passing through the fully connected layers. Therefore, in the LCZ-CNN model, we adopt the strategy of Global Average Pooling (GAP) to replace the fully connected layers (Hsiao et al. 2019). The idea of GAP is to compute the average value of each feature map and input the result vector directly into the soft-max layer (Figure 4(b)). There is no parameter to tune in the GAP layer, so that overfitting can be avoided, and the spatial information is summed to make the spatial translations of the input more robust.
To summarize, the LCZ-CNN model is a lightweight network, specially designed for the task of LCZ mapping using multi-spectral and multi-seasonal Landsat images. The input window size is 16 × 16, and there are only three convolutional layers in this framework. The employment of the GAP layer also avoids the parameter redundancy caused by the fully connected layers. The proposed lightweight network can obtain satisfactory results with only a moderate amount of training samples.

LCZ-CNN model training
In the training process, we adopted two strategiesdropout and batch normalization -to prevent overfitting. Batch Normalization (BN) is an algorithm for accelerating the convergence and stability of neural networks . When training the LCZ-CNN model, we added a BN operation after each convolutional layer. This can prevent the training from getting stuck in the "saturated regimes of nonlinearities" by normalizing the activation of the whole network ). In addition, a batchnormalized architecture allows for faster learning rates, thus generating models with better generalization capabilities.
Dropout is a strategy used to reduce overfitting and to prevent complex co-adaptations on the training data (Srivastava et al. 2014). The specific operation of this method is to set the output of some hidden neurons to zero, which means that each hidden neuron is randomly omitted from the network, with a probability of p. Thus, the dropped neurons do not participate in the forward-pass and are not used in the back-propagation process. In the different training epochs, different neural networks are formed by dropping neurons randomly. Finally, the neural networks produced in the different training epochs are approximately combined. Therefore, the dropout method prevents overfitting, and the neurons can learn more representative features (Srivastava et al. 2014). In this study, the probability of dropout was set to 0.5, at which point the most random network structures are generated (Srivastava et al. 2014).
Stochastic Gradient Descent (SGD) was selected as the optimization algorithm for LCZ-CNN in the training process. In the phase of model training, we randomly selected 70% of the training data as the training samples, and the remaining 30% were used as the validation samples. The training data were collected from the Google Earth high-resolution images of the same period. Full details of the production process can be found at http://www.wudapt.org/.

Accuracy assessment
In the phase of inference, we used completely independent ground-truth data to evaluate the test accuracy of the different models. The ground-truth production process was the same as the training sample collection process, but was conducted independently. Table A2 shows the training and test data for the 32 cities. The confusion matrix was then calculated from the test samples for the accuracy assessment. Five quality indices calculated from the confusion matrix were then adopted to assess the classification performance (Foody 2002;Stehman 2013), in terms of the Overall Accuracy (OA), Kappa, User's Accuracy (UA), Producer's Accuracy (PA), and F1-score (F).

Analysis of the thermal properties of the LCZ classes
The previous studies (Geletič and Lehnert 2016;Lelovics et al. 2014;Wang et al. 2018) have usually been conducted on a small number of cities (usually one to two cities), and their conclusions about the relationship between LCZs and LST have not been entirely consistent, owing to the different urban landscapes and background climates. Thus, in this study, we further investigated the thermal characteristics of the LCZ classes in the summer daytime for China's 32 cities, based on our mapping results. The LST data were derived from the thermal infrared images obtained from the Landsat 8 Thermal Infrared Sensor (TIRS). Please note that the thermal infrared bands of the Landsat 8 TIRS were not considered in the previous LCZ classification since they would be used in the subsequent LST analysis. To estimate the LST, we adopted the method of Li et al. (2011), which is a single-window algorithm based on Band 10 of the Landsat 8 TIRS.
To explore the relationship between LCZs and the thermal environment, we calculated the departure from the mean LST for all the LCZ classes: where i denotes an LCZ class, ΔLST i is the departure from the mean LST for the i th LCZ class, LST i is the mean LST for the i th LCZ class, and LST mean is the average LST of all the LCZ classes. We summarized the ΔLST i frequency distribution in the 32 cities to evaluate the heat/cold island effects for each LCZ class. Box charts of the ΔLST i of the cities in the different climatic zones were calculated separately, to analyze the LST pattern of the LCZs. Figure 5 shows the entire mapping and analysis flowchart of this research.

LCZ classification of China's 32 major cities
In this section, we describe how the methods of WUDAPT , BoVW (Yang and Newsam 2010), and MLP (Hu 2011) for LCZ classification with Landsat imagery were compared with the proposed LCZ-CNN model.
The WUDAPT method was implemented with two different input features, using only the "spectral" features (i.e. WUDAPT (Spec)) and using both the "spectral" and "textural" features (i.e. WUDAPT (Spec +GLCM)), respectively. The spectral features included the eight Landsat bands in four seasons, which was consistent with the input of the LCZ-CNN model. The textural features included four GLCM features, i.e. contrast, homogeneity, energy, and correlation. The window sizes used for calculating the spectral features (average value and standard deviation) and GLCM features were 16 × 16 and 17 × 17, respectively.
The BoVW model was also implemented with two different input features, using only the "spectral" feature (i.e. BoVW (Spec)) and both the "spectral" and "textural" features (i.e. BoVW (Spec+GLCM)) to create the visual dictionaries (Yang and Newsam 2010). Considering the classification accuracy and the computational efficiency, the size of the visual words for the "spectral" dictionary and "textural" dictionary was set to 128 and 16, respectively, by manual parameter tuning. Random forest was selected as the classifier for WUDAPT and BoVW, as suggested by the other LCZ classification experiments Xu et al. 2019), and its robustness to the training set size and noise. The parameters of the MLP were also tuned manually. The learning rate was chosen as 0.001, the number of hidden layers was set to 3, and the number of nodes in each hidden layer was set to {300,200,100}.
The OAs of the experiments are shown in Figure 6. The proposed LCZ-CNN model achieves satisfactory classification accuracies in all 32 cities, ranging from 65.2% to 99.4%. In fact, the OA for more than half of the cities is higher than 80%. The best results are achieved for Kunming, Beijing, Changchun, and Lhasa, with OA values of higher than 85%. The worst results are obtained for Guiyang and Changsha, with OA values of 65.2% and 66.6%, respectively. The increase in OA is very significant when compared with the other benchmark methods, and the proposed LCZ-CNN model achieves the best classification performance in all 32 cities. In more than 20 cities, the OA values of the LCZ classification obtained by WUDAPT are less than 80%, indicating that the LCZ mapping results for some cities (especially the dense and highly compact Asian cities) obtained with the traditional WUDAPT method are not satisfactory, and may affect the subsequent applications . Both BoVW and MLP achieve better OA values than WUDAPT, but the improvements are not as significant as for LCZ-CNN.
Based on all the test samples from the 32 cities, a confusion matrix was calculated for each method, which are presented in Tables A3-A8. Although considerable confusion among the built types (LCZ 1-10) exists, the confusion between LCZ 1 (compact highrise) and LCZ 4 (open high-rise), and LCZ 2 (compact mid-rise) and LCZ 5 (open mid-rise), for example, can be effectively alleviated by the proposed LCZ-CNN model. Figure 7 shows the OA and Kappa values of the various methods, and the ΔPA, ΔUA, and ΔF values for each LCZ class. To compare the results, we calculated the differences of the per-class accuracies (PA, UA, and F1-score) between each method and WUDAPT (Spec), which are recorded as ΔPA, ΔUA, and ΔF, respectively. It can be seen that the advantage of the proposed LCZ-CNN model is obvious, while the other benchmark methods (e.g. WUDAPT (Spec + GLCM), BoVW (Spec), BoVW (Spec + GLCM), and MLP) do not achieve significant accuracy improvements. The superiority of the proposed LCZ-CNN model is further demonstrated by the per-class mapping accuracies (Figure 7(c-e)), where the ΔPA, ΔUA, and ΔF values of all the LCZ classes are greatly increased compared with WUDAPT (Spec). For most built types (LCZ 1-10), the per-class mapping accuracies of WUDAPT (Spec+GLCM), BoVW (Spec), and BoVW (Spec + GLCM) are also improved, compared with WUDAPT (Spec), which proves that adding texture information and extracting spatial features through the BoVW model can help to improve the classification accuracies for complex scenarios consisting of multiple land-cover types. In addition, it is also interesting that BoVW (Spec) improves the per-class mapping accuracies of most built types, compared with WUDAPT (Spec), and the same situation is also apparent when comparing BoVW (Spec+GLCM) with WUDAPT (Spec+GLCM). This phenomenon indicates that mapping spectral and texture features into sparse mid-level features through the BoVW model can also help to improve the classification accuracy for complex urban scenes. The classification results obtained by MLP are inferior to the  results of LCZ-CNN, overall, revealing the great potential of machine learning algorithms in Landsat image scene classification.

LCZ mapping results for China's 32 major cities
To explore the relationship between LCZ distribution and climatic zones, the average area proportion of each LCZ class in the different climatic zones is shown in Figure 8. There is only one city (Lhasa) in climatic zone TS (tundra climate and snow climate with cool summer and cold winter). Since climatic zone A (climate of arid steppe and desert) and TS are both located in Northwest China, the cities in these two climatic zones are merged in the subsequent analysis.
Concerning climatic zones A and TS, the natural type with the highest area proportion is LCZ F (bare soil or sand), since the arid climate is detrimental to vegetation growth, and there are many deserts around these cities (Hu, Su, and Zhang 1988). Compared with the other climatic zones, the area proportion of LCZ 1 (compact high-rise buildings) in climatic zones A and TS is also significantly higher. The surrounding topographic conditions (e.g. surrounding deserts, narrow valleys, mountains) and harsh climatic conditions (e.g. scarce precipitation, hot weather) limit the speed and scale of urban expansion in these zones, and hence the buildings are usually compact and high (Roeser et al. 2012).
As for the cities in climatic zone W (warm temperature climate with dry winter), LCZ 10 (heavy industry) and LCZ A (dense trees) occupy a larger area than in the other climatic zones. Most of the cities in climatic zone W (e.g. Tianjin, Taiyuan, Chengdu, Shijiazhuang, and Xi'an) are China's important industrial cities, with prosperous industrial construction. Furthermore, the mild climate in this zone is suitable for plant growth.
LCZ 5 (open mid-rise) is the predominant built type in climatic zone S (snow climate with dry winter), and it occupies a larger proportion than in the other climatic zones, while the area proportion of LCZ 6 (open low-rise) in climatic zone EW (equatorial climate and warm and fully humid temperate climate) is notably high. Most cities in climatic zones S and EW are located on plains or basins with agriculture, and hence there is a lot of space to satisfy the horizontal urban sprawl, forming a lot of open mid-or low-rise types. Moreover, most of the cities in climatic zone EW are located in southern China, with abundant precipitation and large lakes or rivers, and the area of LCZ G (water) in this zone is generally much larger than that in the other climatic zones. Figure 9 presents the mapping results for four representative cities in different climatic zones. To show more details of the LCZ distribution, zoomedin regions of 5 km ×5 km for each result are also shown in Figure 9, with the original Landsat images and the results obtained by the proposed LCZ-CNN model. It can be observed that that the maps are consistent with the characteristics shown in Figure 8. In general, LCZ 1 (compact high-rise), LCZ 2 (compact mid-rise), LCZ 4 (open high-rise), and LCZ 5 (open mid-rise) are the predominant built types in the downtown areas, most of which belong to residential and commercial areas. LCZ 8 (large low-rise) and LCZ 10 (heavy industry) are generally located in suburbs or rural areas. In summary, the LCZ categories and the urban structures in the complicated and highdensity Chinese megacities are effectively delineated by the proposed LCZ-CNN model.

Comparison with other benchmark methods
In the previous sections, we compared the proposed LCZ-CNN model with other benchmark methods (WUDAPT, BoVW, and MLP) in terms of classification accuracy. In this section, we further compare these methods from the perspective of theory and computational efficiency.
The LCZ built types usually correspond to complex urban scenes with man-made structures and other spectrally heterogeneous features. Therefore, the results obtained by WUDAPT (Spec) based only on the image spectral features are not satisfactory, as expected, since adequate spatial information is not utilized. The GLCM texture counts the frequency distribution, which describes how often two gray levels occur in a given spatial relationship, capturing the spatial relationships and contextual information in the neighborhood of the current pixel (Huang et al. 2020). In this way, the complex scenes composed of multiple land-cover types in the LCZ categories can be more effectively represented by the addition of the spatial context relations, which explains why the classification results of WUDAPT (Spec + GLCM) are better than those of WUDAPT (Spec). The experimental results obtained by  in Wuhan and Guangzhou in China also verified that the quality of LCZ mapping results can be improved with the addition of textural information.
The BoVW model can extract mid-level semantic features of complex scenes from the low-level visual features such as the spectra and texture (Aslam et al. 2019), and is better able to mine the spatial information in remote sensing images. As a result, the classification accuracies for the representative built types (e.g. LCZ 1, 4, 5, and 6) obtained by the BoVW model show significant improvements when compared with the results of the WUDAPT method. Please note that the BoVW model is a scene classification method, which has been successfully applied to land-use scene classification (Muhammad et al. 2019) and image retrieval (Arun, Govindan, and Kumar 2020) from high-resolution remote sensing images. However, to date, few studies have applied the BoVW model to information extraction from Landsat imagery. Our experimental results indicate that the BoVW model can achieve a satisfactory accuracy for Landsat image scene classification. However, the BoVW model calculates the histogram of all the visual words in an image scene, and thus it runs the risk of losing the spatial arrangement information (Yang and Newsam 2010). In addition, the model construction and feature mapping of BoVW is actually an unsupervised process, which may lead to a reduction in the classification accuracy.
The MLP is one of the most popular neural network models in remote sensing thematic mapping. It has been widely used in land-cover mapping (Jamali 2020), land-use classification (Subiyanto et al. 2019), and change detection (Wang, Lu, and Qin 2020). The MLP has the advantage of data-driven and automatic learning, which is conducive to specific classification tasks. However, the MLP usually learns the non-linear spectral feature space at the pixel level, without considering the spatial context implicit in the images .
Differing from all the aforementioned classifiers, the LCZ-CNN model is a data-driven feature learning and classification method (Levine et al. 2016). It also does not need us to set handcrafted image descriptors (e.g. GLCM, Gabor, Scale-Invariant Feature Transform (SIFT)) in advance. The network parameters can be adjusted during the training process to learn the spectral and spatial features applicable to a specific interpretation task. Therefore, the convolutional layers of the proposed LCZ-CNN model can extract the spatial context information from Landsat images, and can learn more robust and adaptive features than the handcrafted descriptors . Our experimental results demonstrate the superiority of the proposed LCZ-CNN model in mining and utilizing the spatial information contained in Landsat images. Table 1 compares the computation time of the LCZ-CNN model with that of WUDAPT, BoVW, and MLP, for predicting 10,000 and 50,000 samples, respectively. These experiments were implemented using MATLAB 2016b with an Intel(R) Core(TM) i5-7500 CPU and 24.0 GB memory. The computation time of BoVW (Spec + GLCM) is the longest among all the methods. Its time cost mainly arises from the process of assigning local features to the corresponding visual words, the calculation of the frequency histogram of visual words used for the scene classification, and the GLCM feature extraction. It can be said that the efficiency of the LCZ-CNN model is significantly higher than that of BoVW (Spec), BoVW (Spec + GLCM), and WUDAPT (Spec + GLCM), when considering both the computation time and accuracy. The time complexity of a CNN is mainly related to the number of convolutional layers, the parameters of the convolutional kernels, and the channels of the feature maps generated by each convolutional layer (Wu et al. 2018). In this research, the proposed LCZ-CNN model was made up of only three convolutional stages and a GAP layer as a replacement for the fully connected layers in the conventional CNNs. Therefore, compared with the commonly used classical networks, e.g. AlexNet, Google Net, and VGG-16, the complexity of LCZ-CNN is greatly reduced. In conclusion, the LCZ-CNN model is a lightweight CNN that can achieve both a satisfactory classification accuracy and an acceptable computational efficiency. Table 2 records the Floating Point Operations (FLOPs), which are proportional to the model complexity, and the average OA (OA avg ) of the classification results from all 32 cities. Several recent commonly used networks, i.e. LeNet-5 (Lecun et al. 1998), DenseNet-121 (Huang et al. 2017, and ResNet-50 (He et al. 2016), are compared with LCZ-CNN in Table 2. LeNet-5 requires only one input channel, while DenseNet-121 and ResNet-50 require three input channels. In order to meet the requirements of these networks with regard to input channels, the 32channel images were compressed into one or three channels through the use of the Principal Component Analysis (PCA) algorithm, according to the operation in Xiao et al. (2017), the purpose of which was to compare the performance of each network fairly. We adopted two strategies to exploit the CNN models: 1) fine-tuning the pre-trained network; and 2) fully training the network from scratch. It can be clearly seen from Table 2 that the complexity of LCZ-CNN is far less than that of DenseNet and ResNet, and the classification accuracy of the LCZ-CNN model is much higher than that of all three commonly used networks, which proves the superiority of LCZ-CNN. Several other studies Ma et al. 2018;Song et al. 2019) have also concluded that directly applying these original classical CNNs to remote sensing tasks might not obtain satisfactory results.

Function and transferability of the convolutional stages
The proposed LCZ-CNN model is composed of three typical convolutional stages named Stage-1, Stage-2, and Stage-3, respectively (Figure 3(c)). A typical convolutional stage of a CNN includes a convolution operation, an elementwise non-linear function (e.g. ReLU non-linearity), and a pooling operation ( Figure  3(b)). To explore the function and transferability of each convolutional stage in the LCZ-CNN model, we designed a number of experiments (Figure 10), as follows. Specifically, the city used to construct the original network was called City A, and the LCZ-CNN model trained by the samples from City A was denoted as A-CNN. Furthermore, the A-CNN model was finetuned by the samples from City B, in order to test the function and transferability of the different stages of the LCZ-CNN model. We therefore set a series of experiments: Experiment (a): The parameters of Stage-1 of A-CNN were fixed, but the parameters of Stage-2 and Stage-3 were tuned according to the training samples from City B, and the fine-tuned model was applied to classify the images of city B. The purpose of this experiment was to explore the function and transferability of Stage-1.

Experiment (b):
The parameters of Stage-1 and Stage-2 of A-CNN were fixed, and the parameters of Stage-3 were tuned with the training samples from City B. The fine-tuned model were then used for City B. This experiment was aimed at investigating the transferability of the features extracted by Stage-2.

Experiment (c):
The parameters of all three convolutional stages (Stage-1, Stage-2, and Stage-3) of A-CNN were fixed. This is equivalent to using A-CNN as a feature extractor to classify the test samples from City B, without consideration of the samples from City B. In this way, the transferability and function of Stage-3 could be analyzed.

Experiment (d):
A new LCZ-CNN model, i.e. B-CNN, was trained from scratch using training samples from City B. This experiment was conducted as a comparison.
The purpose of the above experiments was to analyze the transferability of the parameters in the various convolutional stages for LCZ mapping. Specifically, we chose three city groups for transfer learning: Guangzhou to Beijing, Hohhot to Hefei, and Shenyang to Fuzhou (the former is City A, and the latter is City B). In the experiments, City A and City B were chosen from different climatic zones. Table 3  compares the results obtained by the three CNN models, i.e. the cases of (a), (b), and (c), with the result of the self-trained network (the case of (d)).
In Experiment (a), although the classification performances are inferior to those of Experiment (d), the OAs are greater than 60%. This implies that the first convolutional stage can extract low-level visual features (e.g. spectra, texture), which are generally invariant between different cities, so that the parameters of Stage-1 demonstrate transferability, to some degree (Alshehhi et al. 2017). In Experiment (b), the LCZ mapping accuracy is significantly decreased when compared to Experiment (a), with OA values of <60%. However, the accuracy of Experiment (b) is superior to that of Experiment (c). The results of Experiment (b) and Experiment (c) indicate that the parameters of Stage-2 and Stage-3 possess poor transferability. The objects, as well as their spectral-spatial characteristics, present considerable differences between the different cities, and although they belong to the same LCZ category, there are notable  distinctions in the image acquisition conditions (e.g. view angle, weather, atmosphere) and urban landscapes among the various cities (Yang and Lo 2000;Vogelmann et al. 2001). Consequently, the features extracted from Stage-2 and Stage-3 are probably more closely related to the specific image characteristics of the LCZs from the target area (Traviglia and Torsello 2017), but are not suitable for transfer to other cities. The function of Stage-2 and Stage-3 is to further extract and highlight the task-oriented image features on the basis of the low-level visual features extracted by Stage-1, which actually limits the generalization ability of the extracted features obtained in Stage-2 and Stage-3. The representative features extracted from all three convolutional layers of LCZ-CNN are displayed in Figure 11. The three convolutional layers in LCZ-CNN are denoted as Conv-1, Conv-2, and Conv-3, respectively. Figure 11(b) highlights the shape and edges of the buildings in Figure 11(a), and demonstrates that the Conv-1 features retain the original contours of the ground objects. Figure 11(c) retains the fuzzy edges and geographical spatial relationships of the buildings in Figure 11(a,d) indicates the semantics of the buildings (or no buildings) in the urban scenes. The features extracted by Conv-3 are more abstract, and can be considered to be more likely to express higher-level semantic information . Figure 12 presents the intra-class similarity and the inter-class separability in the three stages. The intraclass similarity was measured with Pearson correlation coefficients (Hauke and Kossowski 2011), and the inter-class separability between the different classes was calculated with the Jeffries-Matusita distance (Dabboor et al. 2014). Some LCZ categories (e.g. LCZ 1/LCZ 2 and LCZ 4/LCZ 5) that are difficult to distinguish were chosen in the analysis. With the increase of the number of convolutional stages, the receptive field becomes larger and the extracted features become more global and abstract ( Figure 11).
Meanwhile, the inter-class separability exhibit a trend of growth (Figure 12(b)). It can be seen that the intraclass similarity is smaller in Stage-1 compared to the original images, but is gradually increased, and in the final stage, the intra-class similarity is larger than in the original images (Figure 12(a)). In summary, the above results illustrate that the features extracted by Stage-1 and Stage-2 are not as robust and discriminative as those of Stage-3. Therefore, it is necessary to set up Stage-3 to further convolute and integrate the features extracted in the first two convolutional stages.

The contribution of seasonal information
Some studies (Zhao et al. 2016;Zhu and Liu 2014) have pointed out that seasonal information can optimize the mapping results of thematic land-cover classification tasks based on Landsat images. However, the LCZ scheme is a newly developed concept and has intrinsic differences with the traditional thematic mapping tasks such as land-cover mapping, land-use mapping, and object detection (Wang, Lu, and Qin 2020). To investigate the contribution of seasonal information to the proposed LCZ-CNN model, we compared the OA of the LCZ classification obtained using multi-seasonal and single-seasonal inputs, respectively, for five representative cities (Shanghai, Taiyuan, Tianjin, Yinchuan and Lhasa) chosen across different climatic zones ( Figure 13). The results show that, with the addition of multi-seasonal images, the OAs for the five representative cities are improved by 2.6%, 4.0%, 3.9%, 6.8%, and 10.9%, respectively, compared with the classification results based on only spring images. Specifically, taking Shanghai as an example, the OAs of the results for spring, summer, autumn, and winter are 70.9%, 68.9%, 53.4%, and 66.2%, respectively, but the accuracy obtained by considering all four seasonal images is increased significantly to 73.6%. The reason for this is that multi-temporal information can introduce the phenological characteristics of plants (Peng et al. 2009), and  Figure 13 also reveals that the contribution of each season is different in the different cities and climatic zones, and hence it is difficult to clearly identify which season is optimal in LCZ classification. To sum up, our experimental results show that it is appropriate to make full use of the multi-seasonal images and capture the temporal characteristics for LCZ classification.

Thermal characteristics of the LCZs in China's 32 major cities
The LCZ scheme is a standard framework used to describe urban structures, which can reflect the local temperature and microclimate of a city (Stewart and Oke 2012). For instance, Figure 14 shows the LCZ mapping results and the corresponding LST in the two cities of Shanghai and Xining, where it is apparent that the spatial morphology of the LCZs is closely correlated to that of the LST.
According to formula (4), ΔLST i can be calculated to represent the departure from the mean LST for the i th LCZ class. The range of ΔLST i is divided into four classes at an interval of 1°C. The values of ΔLST i > 0 refer to a heat island, and ΔLST i < 0 represents a cold island. Figure 15 shows that the built types generally show a significant heat island effect (ΔLST i � 2 °C), while the cold island effect in most cities is concentrated in LCZ A (dense trees), B (scattered trees), D (low plants), and G (water). Figure 16 presents box plots showing the departure from the mean LST for all the LCZ classes in China's 32 major cities, which indicate significant thermal differences among the various LCZ classes. From the figure, it can be found that LCZ A (dense trees) and LCZ G (water) have the lowest LST, while LCZ 2 (compact mid-rise), LCZ 3 (compact low-rise), and LCZ 8 (large low-rise) are among the highest. LCZ 2 (compact mid-rise) and LCZ 3 (compact low-rise) are usually located in bustling commercial or residential areas, while the majority of LCZ 8 (large low-rise) corresponds to industrial areas. The LST of most built types (LCZ 1-10) and LCZ F (bare soil or sand) is higher than the mean LST of all the LCZ classes, and the LST of the natural types (LCZ A, B, D, and G) is lower than the mean LST. This can be attributed to the large amount of heat generated from the consumption and re-radiation of solar radiation from urban structures and anthropogenic heat sources (Rizwan, Dennis, and Liu 2008). In particular, it is interesting to see that the compact built types (LCZ 1-3) generally have a higher LST than the open types (LCZ 4-6), since the former are mainly located in commercial and residential areas with high population density and frequent human activities. Moreover, the area proportion of the permeable layer, i.e. vegetation, water, and soil, is low in these compact built types, and the effect of evapotranspiration and shading by plants can significantly reduce the amount of heat that is re-radiated by urban constructions. Furthermore, compact buildings and narrow streets can form canyon structures, which increase the absorption of solar radiation, and lead to the trapping of long-wave radiation from the ground (Yuan and Ng 2012). The walls of buildings can also reduce wind speed and prevent air flow from cooling the street canyons, contributing to further heating of the land surface (Oke 1988). It is also found that the LST of the high-rise built types (LCZ 1, 4) is lower than that of the low-and mid-rise built types (LCZ 2, 3, 5, and 6). A reasonable explanation is that high-rise buildings can generate larger shaded areas than low-and mid-rise buildings, leading to lower LST in the neighborhood (Perini and Magliocco 2014). Another possible reason is that highrise built types usually have a lower building density, which actually provides more open or green space, and hence decreases the LST . In general, our results indicate that the spatial configuration of LCZs can significantly affect the magnitude of the LST.  Figure 16, indicating that the built types generally have a higher LST than most of the natural types. However, the LST pattern of climatic zones A and TS is clearly different from the other three (Figure 17(d)), where LCZ F (bare soil or sand) generally has a higher LST than the mean LST for all the LCZ classes, while most built types have a lower LST than the average LST. As shown in Figure 8, the area proportion of LCZ F (bare soil or sand) is high in the cities of climatic zones A and TS. This is due to the fact that, in these climatic zones, desert occupies a large amount of space around the built-up areas of these cities (see Figure  9). Owing to the lower heat capacity of the desert, it warms up faster than the land cover in the built-up areas. The warm air heated over the desert is transported over the built-up areas through local circulation, forming an inversion layer of upper heat and lower cold, which keeps the lower cold air stable, thus forming a relatively cool and humid microclimate in the urban areas (Hu, Su, and Zhang 1988). Therefore, the LST of the built-up LCZ classes in the A and TS climatic zones is significantly lower than that in the other zones. Similar phenomena have emerged in studies of Phoenix in the Southwestern U.S. ) and Dubai in the Middle East (Nassar, Blackburn, and Whyatt 2016), both of which are typical cities surrounded by sand and soil, with an arid climate. However, differing from our results, there are still some LCZ classes in Phoenix with a higher LST than LCZ F (soil or sand), such as LCZ 8 (large low-rise). Our results for Western China are more similar to those in Dubai (Nassar, Blackburn, and Whyatt 2016), where LCZ F maintains a much higher LST in the daytime than all the other LCZ classes, which is perhaps due to the fact that both study areas are located in Asia.

Conclusion
This study was aimed at generating more accurate LCZ maps and investigating the relationship between LCZs and the thermal environment in China's 32 major cities. To this aim, in this paper, we have proposed a novel CNN framework (namely, LCZ-CNN) designed to cope with the issues of LCZ classification  using Landsat imagery. The LCZ-CNN model adequately considers the multi-spectral and multiseasonal information contained in Landsat images. To evaluate its performance, the proposed method was tested over China's 32 major cities distributed in different climatic zones, and the traditional classification strategies of WUDAPT, BoVW, and MLP were also applied for comparison. The proposed LCZ-CNN model produced the most accurate mapping results and a satisfactory computational efficiency in all the study areas. A systematic analysis of the proposed LCZ-CNN model was conducted, including the transferability of the network and the effectiveness of multiseasonal information. It was found that the first convolutional stage, corresponding to low-level features, shows better transferability than the second and third convolutional stages, which correspond to high-level and task-oriented features, It was also found that multi-seasonal information can significantly improve the accuracy of LCZ classification.
The relationship between LCZs and their thermal properties was investigated in detail over China's 32 major cities. To the best of our knowledge, this is the first time that this relationship has been analyzed on the basis of a large number of cities across multiple climatic zones. Overall, it was found that the compact built types have higher LSTs than the open classes, and the LSTs of the high-rise built types are lower than those of the low-and mid-rise built types. Our experimental results confirmed the close relationship between the LCZ classes and the magnitude of the LST. For most cities in climatic zones S (snow climate with dry winter), W (warm temperature climate with dry winter), and EW (equatorial climate and warm and fully humid temperate climate), most built types have a significantly higher LST than their neighborhoods, and hence show the heat island effect. In contrast, the LCZs in climatic zones A (climate of arid steppe and desert) and TS (tundra climate and snow climate with cool summer and cold winter) demonstrate distinct thermal characteristics, i.e. LCZ F (bare soil or sand) has the highest LST, while most built types have a lower LST than the average. This phenomenon can be attributed to the unique urban landscape and climate conditions in the arid cities of China.
In the future, we will work toward constructing more powerful deep networks which have superior transfer capabilities and can obtain higher accuracies. The interpretation of LST from the LCZ viewpoint should also be further investigated by taking into account more cities across the globe. In addition, although Landsat images have been widely used for LCZ mapping in the existing literature, thanks to their free and convenient access and adequate data archives, high-resolution remotely sensed imagery should be considered in the future, since such imagery can provide more spatial and structural information, and has the potential to be more effective in delineating the composition and configuration of the various LCZ classes.

Data availability statement
The data that support the findings of this study are available from the corresponding author [X. Huang] at http://irsip. whu.edu.cn/resources/resources_en_v2.php, upon reasonable request.