Deep learning-based local climate zone classification using Sentinel-1 SAR and Sentinel-2 multispectral imagery

ABSTRACT As a newly developed classification system, the Local Climate Zone (LCZ) scheme provides a research framework for Urban Heat Island (UHI) studies and standardizes worldwide urban temperature observations. With the growing popularity of deep learning, deep learning-based approaches have shown great potential in LCZ mapping. In this study, we design a deep convolutional neural network architecture, named Residual combined Squeeze-and-Excitation and Non-local Network (RSNNet), that combines Squeeze-and-Excitation (SE) blocks and non-local blocks to classify LCZs using freely available Sentinel-1 SAR and Sentinel-2 multispectral imagery. Three major cities in China are selected as the study areas. Overall Accuracies (OA) of 0.9202, 0.9524 and 0.9004 are obtained for the three selected cities when RSNNet is trained on data from each individual city, and an OA of 0.9328 is obtained when RSNNet is trained with data from all three cities. RSNNet outperforms other popular Convolutional Neural Networks (CNNs) in terms of LCZ mapping accuracy. We further design a series of experiments to investigate the effect of different characteristics of Sentinel-1 SAR data on the performance of RSNNet in LCZ mapping. The results suggest that the combination of SAR and multispectral data can improve the accuracy of LCZ classification. The proposed RSNNet achieves an OA of 0.9425 when integrating the three decomposed components with Sentinel-2 multispectral images, which is 0.0244 (2.44 percentage points) higher than using Sentinel-2 images alone.


Introduction
Cities play an important role in the development of human society. Urbanization is one of the most prominent phenomena in the world today (Shen et al. 2020;Wu, Gui, and Yang 2020;Zhou, Zhai, and Yu 2020). In the process of urbanization, cities expand at the expense of agricultural land and green open space (Hadeel, Jabbar, and Chen 2009;Huang and Wang 2020), and large populations concentrate in cities (Li, Zhao, and Li 2016;Trinder and Liu 2020;Shao et al. 2021). Cities are under pressure from increasing population and ecological threats as a result of rapid urbanization (Alsaaideh et al. 2017). The UHI effect is one of the most significant environmental issues caused by urbanization. UHI refers to the phenomenon that urban areas are warmer than the surrounding suburban areas. Most researchers have assessed UHI intensity through a classical urban-rural temperature difference (Jiang, Chen, and Jing 2006;Zeng et al. 2010;Oke and Stewart 2012;Yang et al. 2020;Zhou et al. 2020a;Huang, Liu, and Li 2021). However, due to the absence of universal definitions for urban or rural, it is not easy to qualify a site as urban or rural, and UHI research has long been limited by the urban-rural classification. The LCZ classification scheme was introduced to address this problem (Oke and Stewart 2012). The LCZ is a classification scheme that provides a standardization framework to present the characteristics of urban forms and functions (Oke and Stewart 2012;Bechtel et al. 2015;Qiu et al. 2020). LCZs are defined as regions of uniform surface cover, structure, material, and human activity that span hundreds of meters to several kilometers on a horizontal scale (Oke and Stewart 2012). The LCZ scheme consists of ten built types and seven land cover types. Each LCZ type represents a culturally neutral description of a specific urban landscape based on its effect on the local air temperature. Illustrations of the LCZ classes are displayed in Figure 1.
The LCZ scheme was originally designed for UHI research (Oke and Stewart 2012), but has shown an increasing impact on various climatological studies, including estimating the nocturnal cooling effect (Leconte, Bouyer, and Claverie 2020), climate-sensitive street design (Maharoof, Emmanuel, and Thomson 2020), and analyzing urban ventilation (Yang et al. 2019;Zhao et al. 2020).
Recent efforts have focused on the development of LCZ mapping techniques. The World Urban Database and Portal Tool (WUDAPT) proposed a method that employs Landsat data and open source software for worldwide LCZ mapping (Oke and Stewart 2012;Bechtel et al. 2015;Mills et al. 2015;Cai et al. 2016;Danylo et al. 2016;Ching et al. 2018;Bechtel et al. 2019). The WUDAPT method, which has been applied in many studies (Cai et al. 2016;Danylo et al. 2016;Cai et al. 2018;Shi et al. 2019;Zhou et al. 2020b;Shi et al. 2021), requires experts with local knowledge of each city to build reference polygons using Google Earth. These polygons are applied to train and test LCZ classification models with Landsat images (resampled to 100 m resolution). Random Forest (RF), a rule-based machine learning approach, is then used for classification in WUDAPT. Despite its popularity, WUDAPT is a pixel-based classification method that largely ignores spatial information, which leads to relatively low accuracy (Qiu et al. 2020).
To achieve high accuracy, other LCZ mapping methods have been investigated, one notable effort of which is the Geographic Information System (GIS) stream. GIS-based methods use GIS datasets, such as building footprints and high-resolution digital surface models, to obtain parameters that are applied to define LCZ types (Zheng et al. 2018;Oliveira, Lopes, and Niza 2020). GIS-based methods are able to improve the LCZ classification accuracy but require massive input datasets that are not always available to the public (Quan and Bansal 2021).
In recent years, deep learning approaches have been widely adopted in remote sensing image scene classification and have achieved state-of-the-art classification accuracy (Cheng et al. 2020). Liu and Shi (2020) pointed out that LCZ mapping can be regarded as a scene classification task to fully exploit the contextual information from remote sensing images. With the growing popularity of deep learning, many scholars have investigated the potential of deep learning algorithms in LCZ mapping. Deep learning methods, especially in the form of CNNs, are expected to further boost LCZ classification accuracy. Huang, Liu, and Li (2021) proposed a novel CNN model to generate LCZ classification results using Landsat imagery for China's 32 major cities and achieved satisfactory classification accuracies in all 32 cities. Zhu et al. (2020) developed a large benchmark dataset, So2Sat LCZ42, that contains Sentinel-1 and Sentinel-2 image patch pairs and LCZ labels from 42 cities in different countries. This dataset is openly available and is regarded as a standard dataset for deep learning-based LCZ mapping. Qiu et al. (2020) proposed a CNN framework, termed Sen2LCZ-Net, to classify LCZs using Sentinel-2 images from the So2Sat LCZ42 dataset. Liu and Shi (2020) selected 15 cities in three economic regions of China and used Sentinel-2 data to classify LCZs by employing the proposed LCZNet, which is composed of residual learning and SE blocks. The effect of image size was investigated, and the results showed that an image size of 48 × 48 (corresponding to 480 × 480 m²) obtained the highest accuracy (Liu and Shi 2020). The aforementioned studies used multispectral images only. SAR, another typical remote sensing sensor, is sensitive to moisture and geometric characteristics and can provide useful information different from multispectral images (Li and Zhang 2014;Shao, Wu, and Guo 2020). The potential of SAR data for LCZ mapping has been studied in recent work. Bechtel et al. (2016) found that individual SAR amplitude and range helped improve LCZ classification accuracy slightly. Demuzere, Bechtel, and Mills (2019) compared Sentinel-1 backscatter, entropy, and Geary's C using random forest and found that Sentinel-1 backscatter was most informative (via feature importance ranking). Hu, Ghamisi, and Zhu (2018) compared numerous Sentinel-1 SAR components using canonical component analysis and found that features related to VH polarized data contributed the most to LCZ classification. Feng et al. (2019) employed both SAR and multispectral data of the So2Sat LCZ42 dataset for CNN-based LCZ mapping and achieved improved accuracy. Jing et al. (2019) revealed the contributions of the SAR data and the multispectral data of the So2Sat LCZ42 dataset to the LCZ classification performance. Their study suggested that the combination of SAR and multispectral data contributes to improved LCZ classification accuracy, although only marginally.
Sentinel-1 data of the So2Sat LCZ42 dataset contain eight channels, including four elements of VH and VV intensity images and four elements of the Refined Lee filtered result. La, Bagan, and Yamagata (2020) pointed out that decomposed components of SAR data can also enhance the LCZ mapping performance using supervised pixel-based methods. To the best of our knowledge, few efforts have focused on the effect of backscattering characteristics of SAR data on the performance of deep learning-based LCZ mapping.
In this study, we employ openly available Sentinel-1 SAR data and Sentinel-2 multispectral imagery to classify LCZs. We propose a deep CNN architecture, termed as RSNNet, for LCZ mapping in three large cities (Beijing, Tianjin, and Wuhan) in China and further investigate the LCZ classification results. Finally, we analyze the effect of different backscattering characteristics of SAR data on the accuracy of LCZ mapping. The rest of this article is organized as follows. Section 2 introduces the study area and Sentinel data. Section 3 elaborates on the proposed network, RSNNet, and the details of network training. Section 4 presents the classification results. Section 5 discusses the effect of input bands on classification accuracy and compares the proposed network with other networks. Finally, Section 6 concludes this article.

Study areas
Three large cities in China, Beijing, Tianjin and Wuhan, were selected for our study. Their geographical locations are shown in Figure 2. Population, area and Gross Domestic Product (GDP) in 2019 of the selected three cities are displayed in Table 1.
The city of Beijing, covering 16,410 km², is the capital of China. It is located at the northern end of the North China Plain. Beijing has a typical monsoon-driven semi-humid to humid continental climate, characterized by hot and humid summers and cold and dry winters. It is a highly urbanized city with preserved historic buildings, such as Siheyuan, which can be categorized as compact low-rise (LCZ 3). As heavy industrial factories (LCZ 10) were relocated from Beijing to its neighboring cities, Beijing now has only a few LCZ 10 areas.
The city of Tianjin, one of the megacities in China, is located in the northeast of the North China Plain and borders the Bohai Sea in the east. It covers an area of 11,966 km² (plains account for 93%). Tianjin has a large number of factories with strong industrial activities (LCZ 8 and LCZ 10).
Wuhan is the capital city of Hubei Province, situated in central China. The Yangtze, the world's third longest river, and its largest tributary Hanshui meet in Wuhan and cut the city into three parts: Hankou, Hanyang and Wuchang. Wuhan has a humid subtropical climate with four distinct seasons. Wuhan contains many rivers and lakes (LCZ G), with water bodies covering 2217 km² (accounting for 26.1% of the total area of Wuhan).

Sentinel-1 data
Sentinel-1 is a C-band synthetic aperture radar mission comprising a constellation of two polar-orbiting satellites (Sentinel-1A and Sentinel-1B) (Abdel-Hamid, Dubovyk, and Greve 2021). The Sentinel-1 mission provides a public global SAR dataset. We acquired Sentinel-1 VV-VH dual-pol Single-Look Complex (SLC) level 1 data covering the three selected cities from the Copernicus Open Access Hub. Sentinel-1 images were acquired on 14 August 2019 (Beijing), 21 August 2019 (Tianjin) and 12 August 2019 (Wuhan). The European Space Agency's Sentinel Application Platform was used for the preprocessing of Sentinel-1 data. Figure 3 displays the flowchart of Sentinel-1 data preprocessing. The specific steps are listed as follows.
(1) The implementation of orbit file: This is the first step of any SAR preprocessing. A precise orbit file is used to improve the geocoding of the product.
(2) Radiometric calibration: The operator computes the backscatter intensity using sensor calibration parameters in the metadata.
(3) TOPSAR deburst: The Sentinel-1 IW products contain three sub-swaths for each polarization channel. Each sub-swath image has a series of bursts. The TOPSAR deburst operator can remove the seamlines between the single bursts and merge these bursts and sub-swaths into a SLC image.
(4a) Polarimetric speckle reduction: Speckle filters aim to reduce speckle noise. The Refined Lee speckle filter was selected to conduct the speckle reduction.
(4b) Polarimetric decomposition: The Sentinel-1 data used here have only two polarizations: VV and VH. H-Alpha Dual Pol decomposition was applied to obtain Alpha, Anisotropy, and Entropy.
(5) Terrain correction: Terrain correction geocodes the image by correcting SAR geometric distortions using a Digital Elevation Model (DEM) and producing a projected product. The SRTM DEM was selected as input DEM to accomplish the correction. The WGS84/UTM coordinate system was applied to geocode the product, and the images were upsampled to 10 m Ground-Sampling Distance (GSD).
After preprocessing, the outputs include two intensity images (Intensity_VH and Intensity_VV), three polarimetric decomposition components (Alpha, Anisotropy, and Entropy) and four polarimetric speckle reduction components (C11, C22, C12_img, and C12_real). The Sentinel-1 data in the dataset built in this paper therefore contain nine bands, which are listed in Table 2.
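As an illustration of what the decomposition step computes, the following is a minimal NumPy sketch of the standard eigenvalue-based H-Alpha formulation applied to a 2 × 2 dual-pol covariance matrix. It assumes that C11, C22, C12_real and C12_img are the covariance matrix elements per pixel; the function name and argument names are hypothetical, and the exact conventions of the SNAP operator used in this study may differ.

```python
import numpy as np

def h_alpha_dual_pol(c11, c22, c12_real, c12_imag):
    """Entropy, anisotropy and mean alpha angle (degrees) per pixel.

    Illustrative sketch of the eigenvalue-based dual-pol decomposition;
    inputs are arrays of identical shape holding covariance elements.
    """
    c12 = c12_real + 1j * c12_imag
    # Assemble the Hermitian 2x2 covariance matrix for every pixel.
    c2 = np.empty(np.shape(c11) + (2, 2), dtype=complex)
    c2[..., 0, 0] = c11
    c2[..., 0, 1] = c12
    c2[..., 1, 0] = np.conj(c12)
    c2[..., 1, 1] = c22

    # eigh returns eigenvalues in ascending order, eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(c2)
    eigvals = np.clip(eigvals, 1e-12, None)          # guard against log(0)
    p = eigvals / eigvals.sum(axis=-1, keepdims=True)

    entropy = -(p * np.log2(p)).sum(axis=-1)         # in [0, 1] for 2x2
    anisotropy = (eigvals[..., 1] - eigvals[..., 0]) / eigvals.sum(axis=-1)
    # Alpha angle of each eigenvector from its first component.
    alpha_i = np.arccos(np.clip(np.abs(eigvecs[..., 0, :]), 0.0, 1.0))
    alpha = np.degrees((p * alpha_i).sum(axis=-1))
    return entropy, anisotropy, alpha
```

Under these assumptions, a pixel with a single dominant scattering mechanism yields entropy close to 0 and anisotropy close to 1, whereas two equally strong mechanisms yield entropy 1 and anisotropy 0.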

Sentinel-2 data
Google Earth Engine (GEE) was employed to obtain cloud-free Sentinel-2 images (Schmitt et al. 2019). The overall workflow, implemented via the GEE Python Application Programming Interface (API), included three main modules.
(1) Query Module: loading images from the catalog.
(2) Quality Score Module: calculating a quality score for each image.
(3) Image Merging Module: mosaicking selected images based on the meta-information generated in the preceding modules.
Consistent with the Sentinel-1 data, the acquisition time of the Sentinel-2 data was set to summer (1 June to 31 August). The GEE-based procedure for cloud-free Sentinel-2 image generation uses multitemporal information from comparably short time periods, which means that the Sentinel-1 and Sentinel-2 imaging dates do not match exactly. Although the acquisition dates of the Sentinel-1 and Sentinel-2 sets differ, they are very close. Generally, the form and structure of urban buildings do not change greatly within a few months, although the state of vegetation may change. In this study, we consider the difference between the Sentinel-1 and Sentinel-2 imaging dates acceptable.

Preparing label data
Label data were collected on Google Earth following the standard procedure defined in the WUDAPT project (Ching et al. 2018). First, a region of interest within each selected city was defined by drawing a rectangle of about 50 × 50 km² around the city center in Google Earth. We then delineated polygons that enclosed different LCZ types. We ensured that each LCZ type had more than five polygons and that there were enough samples for each category. Each polygon had to be wider than 200 m to avoid interference from small landscape elements, such as an individual building.

Sen12LCZ dataset
The dimension of the image patch has a great impact on classification accuracy. It has been shown that a large scene is beneficial for LCZ mapping, as a large input size can provide additional urban environmental features (Liu and Shi 2020). Sentinel-1 and Sentinel-2 images were used to create a dataset named "Sen12LCZ", and the dimension of the Sentinel-1 and Sentinel-2 image patches was defined as 48 × 48, corresponding to an area of 480 × 480 m².
The labeled polygons were sampled using a 480 m by 480 m fishnet, and the center of each grid in the fishnet corresponds to the center of each image patch. When the center of a grid fell within a polygon, we labeled the corresponding grid as the category to which the polygon belongs. We extracted Sentinel-1 and Sentinel-2 image patch pairs with the corresponding LCZ labels by projecting the sampled label data to the registered Sentinel-1 and Sentinel-2 images. Finally, a total of 4083 pairs of image patches were obtained. Table 3 displays the specific number of image pairs for each selected city corresponding to different LCZ types. The dataset was further split into a training set (60%), a testing set (20%), and a validation set by adopting a stratified sampling strategy. These three sets consist of 2611, 818 and 654 pairs of image patches, respectively.

Table 3. Number of image pairs per LCZ type for each selected city.
LCZ    [...]   [...]   Tianjin   Sum
1      28      19      9         56
2      57      56      21        134
3      112     40      121       273
4      96      80      60        236
5      120     83      93        296
6      75      41      16        132
7      29      6       3         38
8      113     153     89        355
9      40      12      7         59
10     42      90      75        207
A      140     91      3         234
B      78      38      38        154
C      40      68      62        170
D      150     153     248       [...]
[...]
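The fishnet sampling described above can be sketched as follows. This is an illustrative simplification, not the GIS workflow used in the study: it assumes the label polygons have already been rasterized onto the 10 m image grid (`label_mask`, with a `nodata` value outside all polygons), and the function and parameter names are hypothetical.

```python
import numpy as np

PATCH = 48  # pixels; 480 m at 10 m ground-sampling distance

def extract_labeled_patches(image, label_mask, patch=PATCH, nodata=0):
    """Tile the scene into non-overlapping patches and keep the labeled ones.

    image:      (H, W, C) stacked Sentinel-1/Sentinel-2 channels
    label_mask: (H, W) integer raster of LCZ polygons (nodata elsewhere)
    A grid cell is kept when its center pixel falls inside a polygon.
    """
    patches, labels = [], []
    height, width = label_mask.shape
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            center_label = label_mask[top + patch // 2, left + patch // 2]
            if center_label != nodata:
                patches.append(image[top:top + patch, left:left + patch])
                labels.append(center_label)
    return np.array(patches), np.array(labels)
```

Each returned patch covers one fishnet cell, so the patch center and the cell center coincide, as in the description above.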

The proposed network
In this study, we propose a network named RSNNet for LCZ mapping. Figure 4 illustrates the architecture of the proposed RSNNet, which includes several Res_SE blocks and non-local blocks. A Res_SE block consists of a building block of ResNeXt and an SE block. The SE block integrates channel-wise features by suppressing less important features and exciting the useful feature maps (Hu et al. 2020). Every three consecutive Res_SE blocks form a stage. Features extracted from the first and the second stage go through a non-local block. At the end of the last block, a global average pooling layer is applied, followed by a fully connected layer with a softmax classifier for the final prediction. Note that samples in each LCZ type are imbalanced (Table 3). Studies have shown that sample imbalance tends to have negative impacts on classification performance (Chakraborty et al. 2020;Zhang et al. 2021). To address this issue, we applied focal loss to the proposed CNN (details in Section 3.4).

Res_SE block
Res_SE block in the proposed network includes a building block of ResNeXt and a SE block. Figure 5 depicts the structure of the Res_SE block. ResNeXt adopts the repeating-layer strategy from VGG and ResNets and exploits the split-transform-merge strategy. The concept of cardinality (the size of the set of transformations) is also introduced in ResNeXt. The basic module of ResNeXt consists of the following operations.
(1) Splitting: The first 1 × 1 layer of each branch produces the low-dimensional embedding.
(2) Transforming: The low-dimensional representation is transformed via 3 × 3 layers.
(3) Merging: The transformed outputs of all branches are aggregated.
As a key unit in SENet, the SE block, which achieves dynamic channel-wise feature enhancement by selectively emphasizing informative features and suppressing less useful ones (Hu et al. 2020), is stacked in the building block of ResNeXt. During the squeezing process, global average pooling is applied to a feature map $U \in \mathbb{R}^{H \times W \times C}$ to generate channel-wise statistics $z \in \mathbb{R}^{C}$:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} u_c(x, y)$$

where $u_c$ refers to the 2D feature map of channel $c$ with a spatial dimension $H \times W$, $F_{sq}$ is the squeeze process, $u_c(x, y)$ is the value at position $(x, y)$ and $z_c$ is the channel-wise statistic of the $c$-th feature map. The excitation process aims to fully capture channel-wise dependencies. A simple gating mechanism with a sigmoid activation is employed to fulfill this objective:

$$s = F_{ex}(z) = \sigma\left(L_2\left(\delta\left(L_1(z)\right)\right)\right)$$

where $F_{ex}$ is the excitation process, $\sigma$ refers to the sigmoid function, $\delta$ is the ReLU function, and $L_1$ and $L_2$ refer to fully connected layers. The final output of the SE block is obtained by multiplying the scalar $s_c$ and the original feature map $u_c$:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

where $F_{scale}$ is the channel-wise multiplication between the scalar $s_c$ and the feature map $u_c \in \mathbb{R}^{H \times W}$.
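The squeeze, excitation and scaling steps can be illustrated with a small NumPy forward pass. This is a sketch under stated assumptions rather than the authors' Keras implementation: the weight matrices `w1`, `w2` and biases `b1`, `b2` stand in for the fully connected layers L1 and L2, and the size of the hidden layer (the reduction ratio) is not given in the text and is chosen freely here.

```python
import numpy as np

def se_block(u, w1, b1, w2, b2):
    """Forward pass of a squeeze-and-excitation block for one feature map.

    u:  (H, W, C) input feature map
    w1: (C, C_hidden), b1: (C_hidden,)  -> stands in for layer L1
    w2: (C_hidden, C), b2: (C,)         -> stands in for layer L2
    """
    z = u.mean(axis=(0, 1))                          # squeeze: global average pooling
    hidden = np.maximum(0.0, z @ w1 + b1)            # L1 followed by ReLU
    s = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))    # L2 followed by sigmoid
    return u * s                                     # scale: channel-wise reweighting
```

Because each gate value lies between 0 and 1, the block can only attenuate channels relative to the input; the residual connection of the surrounding Res_SE block then adds the block input back.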

Non-local block
A non-local block is capable of capturing long-range dependencies. The structure of the non-local block is displayed in Figure 6. Non-local operations maintain the variable input sizes and can be easily applied to other networks. In this study, we employ non-local blocks to capture long-range contextual information, aiming to further improve classification accuracy.
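To make the idea of long-range dependencies concrete, the sketch below implements one common form of the non-local operation (the embedded Gaussian variant, in which every position attends to every other position through a softmax over pairwise similarities). The text does not state which variant RSNNet uses, so this choice, as well as the function name and the projection matrices `w_theta`, `w_phi`, `w_g` and `w_out`, are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local operation with a residual connection.

    x: (H, W, C); w_theta, w_phi, w_g: (C, C_inner); w_out: (C_inner, C)
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                 # one row per spatial position
    theta = flat @ w_theta                     # query embedding
    phi = flat @ w_phi                         # key embedding
    g = flat @ w_g                             # value embedding
    attention = softmax(theta @ phi.T, axis=-1)  # (N, N) pairwise weights
    y = attention @ g                          # aggregate over all positions
    z = y @ w_out + flat                       # project back and add the input
    return z.reshape(h, w, c)
```

The output has the same shape as the input, which is why such a block can be inserted between stages of an existing network without changing the surrounding layers.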

Focal loss
Proposed by Lin et al. (2017), the focal loss function considers the extreme imbalance between easy and hard samples as well as between positive and negative samples. A modulating factor is added to the Cross-Entropy (CE) loss to focus training on hard negatives. The original focal loss was designed for binary classification and has since been adopted to handle multi-class tasks. The CE loss for multi-class cases is defined as (Liu, Chen, and Chen 2018):

$$CE = -\sum_{i=1}^{M} t_i \log(y_i)$$

where $M$ refers to the number of categories, $t_i$ is the real probability distribution, and $y_i$ denotes the predicted probability distribution. $t_i$ is defined as:

$$t_i = \begin{cases} 1, & \text{if } i \text{ is the true class} \\ 0, & \text{otherwise} \end{cases}$$

To address the issue of class imbalance, focal loss adds a modulating factor $(1 - y_i)^{\gamma}$ and a weighting factor $\alpha_i \in [0, 1]$ to the CE loss, with a tunable focusing parameter $\gamma \geq 0$. The multi-class focal loss is defined as:

$$FL = -\sum_{i=1}^{M} \alpha_i (1 - y_i)^{\gamma} t_i \log(y_i)$$

In this study, $\alpha$ is set by inverse class frequency. The focusing parameter $\gamma$ controls the rate at which easily classified examples are down-weighted. When $\gamma = 0$, FL is equivalent to CE. The effect of the modulating factor increases along with the increase of $\gamma$. Here, we set $\gamma = 2$ for RSNNet after a trial-and-error process, and we display the experimental results in Section 5.3.

Figure 6. The structure of the non-local block.
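A minimal NumPy sketch of the multi-class focal loss is given below. The text states only that α is set by inverse class frequency; the particular normalization used in `inverse_frequency_alpha` (scaling so that the rarest class receives weight 1) is an assumption, and both function names are hypothetical.

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha, gamma=2.0, eps=1e-7):
    """Mean multi-class focal loss over a batch.

    y_true: (B, M) one-hot labels; y_pred: (B, M) softmax probabilities
    alpha:  (M,) per-class weights; gamma: focusing parameter
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_class = -alpha * (1.0 - y_pred) ** gamma * y_true * np.log(y_pred)
    return per_class.sum(axis=-1).mean()

def inverse_frequency_alpha(class_counts):
    """Per-class weights proportional to 1 / count, scaled into (0, 1]."""
    inv = 1.0 / np.asarray(class_counts, dtype=float)
    return inv / inv.max()
```

With γ = 0 and all weights equal to 1, the function reduces to the ordinary cross-entropy. For a sample predicted with probability 0.5 for its true class, γ = 2 scales the cross-entropy term by (1 − 0.5)² = 0.25, which shows how well-classified samples are down-weighted.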

Metrics for accuracy assessment
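We adopted OA, Average Accuracy (AA), Kappa coefficient, Producer's Accuracy (PA) and User's Accuracy (UA) for performance evaluation in this study. These metrics are calculated as:

$$OA = \frac{1}{N} \sum_{i=1}^{M} x_{ii}$$

$$AA = \frac{1}{M} \sum_{i=1}^{M} \frac{x_{ii}}{x_{i+}}$$

$$Kappa = \frac{N \sum_{i=1}^{M} x_{ii} - \sum_{i=1}^{M} x_{i+} x_{+i}}{N^2 - \sum_{i=1}^{M} x_{i+} x_{+i}}$$

$$PA_i = \frac{x_{ii}}{x_{i+}}, \qquad UA_i = \frac{x_{ii}}{x_{+i}}$$

where $N$ refers to the number of samples used for accuracy measurement, $M$ is the number of categories, $x_{ii}$ refers to the number of samples that come from class $i$ and are predicted as class $i$, $x_{i+}$ denotes the number of samples in class $i$, and $x_{+i}$ is the number of samples predicted as class $i$.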
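These metrics can all be derived from a confusion matrix, as in the following sketch. It assumes rows index the reference class and columns the predicted class, consistent with the definitions of x_i+ and x_+i given here; the function name is hypothetical.

```python
import numpy as np

def accuracy_metrics(confusion):
    """Return OA, AA, Kappa, per-class PA and per-class UA.

    confusion[i, j] = number of samples of reference class i predicted as j.
    """
    cm = np.asarray(confusion, dtype=float)
    n = cm.sum()
    diag = np.diag(cm)            # x_ii
    row_totals = cm.sum(axis=1)   # x_i+ : samples in class i
    col_totals = cm.sum(axis=0)   # x_+i : samples predicted as class i

    oa = diag.sum() / n
    pa = diag / row_totals
    ua = diag / col_totals
    aa = pa.mean()
    chance = (row_totals * col_totals).sum()
    kappa = (n * diag.sum() - chance) / (n ** 2 - chance)
    return oa, aa, kappa, pa, ua
```

For the two-class matrix [[8, 2], [1, 9]], this gives OA = 0.85, AA = 0.85 and Kappa = 0.70.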

Network training strategy
The experiments were conducted in Python 3.5 using Keras with the TensorFlow backend. The Nesterov Adam optimizer was applied to train the network. We set the batch size to 16 and the initial learning rate to 0.002 (halved after every five epochs). To avoid overfitting and to control the training time, we employed early stopping. Validation loss was chosen as the monitored metric, with a patience of 50 epochs.
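The learning-rate schedule and early-stopping rule can be written down in a few lines of plain Python. This is a framework-independent sketch with hypothetical names; in Keras the same behavior would typically be obtained with the `LearningRateScheduler` and `EarlyStopping` callbacks, but the exact configuration used by the authors is not given in the text.

```python
def learning_rate(epoch, initial_lr=0.002, drop=0.5, epochs_per_drop=5):
    """Step decay: the rate is halved after every five epochs (epoch is 0-based)."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


class EarlyStopper:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss   # new best: reset the counter
            self.wait = 0
        else:
            self.wait += 1         # no improvement this epoch
        return self.wait >= self.patience
```

Under this schedule, epochs 0 to 4 use a rate of 0.002, epochs 5 to 9 use 0.001, and so on.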

LCZ classification results in three selected cities
RSNNet was trained and tested using the label data obtained from the same city to classify the three cities separately. All parameters were kept the same; only the input training and testing data differed. The confusion matrix for Beijing is displayed in Figure 7(a). The proposed classification model, RSNNet, achieves an OA of 0.9202 and a Kappa coefficient of 0.9138. Urban classes are generally well classified. However, some confusion still exists. LCZ A (dense trees) and LCZ G (water) are easily identified. LCZ B (scattered trees) is confused with LCZ C (bush, scrub). The confusion matrices for Tianjin and Wuhan are presented in Figure 7(b) and Figure 7(c), respectively. We notice that LCZ 2 is confused with LCZ 3 and LCZ 1. LCZ G is the easiest LCZ type to classify. We also trained the proposed network using data from all three cities. The confusion matrix of this experiment is presented in Figure 8. Our proposed RSNNet achieves an OA of 0.9328 and a Kappa coefficient of 0.9257. The main misclassifications are between LCZ 1 and LCZ 4, LCZ 2 and LCZ 3, LCZ 4 and LCZ 5, LCZ C and LCZ D, and LCZ B and LCZ D. We observe that LCZ G is the most distinct LCZ type, as RSNNet achieves 100% accuracy in classifying this LCZ type.

LCZ maps
LCZ classification maps of Beijing, Tianjin, and Wuhan and the corresponding Sentinel-2 images are shown in Figure 9. In general, these classification maps present the unique urban fabric of each city. For Beijing, LCZ 3 (compact low-rise), LCZ 2 (compact mid-rise), and LCZ 4 (open high-rise) are the dominant types in the urban regions. A large area of LCZ A (dense trees) is identified in Figure 9(a) due to the existence of mountains in northeastern Beijing. As for Tianjin, LCZ 1 (compact high-rise) and LCZ 3 (compact low-rise) are observed in the central urban area. The entire urban region is mostly covered by LCZ 4 (open high-rise) and LCZ 5 (open mid-rise). There are many LCZ 8 (large low-rise) areas in the suburbs. The existence of idle farmland in Tianjin leads to the identification of LCZ F (bare soil or sand) on the west side of Tianjin. A large area of LCZ D (low plants) is also notable in eastern Tianjin.
For Wuhan, the Yangtze River and many lakes can be easily identified in the classified LCZ map. The major LCZ types in urban regions of Wuhan are LCZ 4 and LCZ 5. The heavy industrial area in the northern suburban region is clearly presented. Hills covered by forests in the urban region are classified as LCZ A.
In summary, we observe that the urban structure of these three study areas is well identified and clearly presented using the Sen12LCZ dataset via the proposed network. UHI magnitude is defined as an "urban-rural" difference in most previous studies:

$$UHI_{u-r} = T_{urban} - T_{rural}$$

where $UHI_{u-r}$ refers to the magnitude of UHI, and $T_{urban}$ and $T_{rural}$ refer to the temperature of urban and rural areas, respectively. The LCZ scheme was proposed to redefine UHI magnitude as an LCZ temperature difference (Oke and Stewart 2012):

$$UHI_{LCZ\,x} = T_{LCZ\,x} - T_{LCZ\,D}$$

where $UHI_{LCZ\,x}$ denotes the magnitude of UHI for LCZ x, $T_{LCZ\,x}$ refers to the temperature of LCZ x, and $T_{LCZ\,D}$ refers to the temperature of LCZ D (low plants). Due to the complex land surface components, it is difficult to accurately discriminate urban and rural areas. LCZ temperature differences are more conducive to analysis, because the standardized description of surface structure and cover is highlighted in this climate-based classification scheme.
To derive UHI intensity as an LCZ temperature difference, the design of the thermometric network is important. There should be thermometers in each LCZ type, and the number of sensors in each type should be proportional to the area of that type. To avoid the effect of changes in airflow and stability conditions on air temperature, thermometers should not be located on or near the border between two zones. Thermometric networks in different cities should be designed with reference to their LCZ maps. For instance, there are many LCZ 8 and LCZ 10 areas in Tianjin, so many thermometers should be uniformly placed in these areas. In contrast, LCZ 8 and LCZ 10 areas are few in Beijing, so only a few sensors are needed for them.

Effect of input bands
The Sen12LCZ dataset consists of a total of 19 channels, including nine features obtained from Sentinel-1 SAR data and ten bands of Sentinel-2 multispectral imagery. In this study, seven datasets, designated as D1-D7, were set up to explore the effect of different combinations of multi-source data on LCZ classification (Table 4). D1 includes the nine features extracted from Sentinel-1 data. D2 consists of the 10 bands of Sentinel-2 images alone. D3 adds the three decomposed components (Alpha, Anisotropy, and Entropy) to D2. D4 adds the four elements obtained from the Refined Lee filter operation (C11, C12_img, C12_real, and C22) to D2. D5 adds the intensity channels of VH and VV to D2. D6 includes all 19 channels from Sentinel-1 and Sentinel-2 data. D7 consists of the three Sentinel-1 components in D3, the two Sentinel-1 components in D5 and the 10 bands of D2.
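The seven band combinations can be expressed as a small configuration that selects channels from a 19-channel patch array. The Sentinel-1 channel names follow the text; the Sentinel-2 band names (`S2_1` to `S2_10`) and the channel ordering within the array are placeholders assumed for this sketch, since neither is specified here.

```python
import numpy as np

S1_INTENSITY = ["Intensity_VH", "Intensity_VV"]
S1_DECOMPOSED = ["Alpha", "Anisotropy", "Entropy"]
S1_FILTERED = ["C11", "C12_img", "C12_real", "C22"]
S2_BANDS = ["S2_%d" % i for i in range(1, 11)]   # placeholder band names

# Assumed channel order of the stacked 19-channel patches.
CHANNELS = S1_INTENSITY + S1_DECOMPOSED + S1_FILTERED + S2_BANDS

DATASETS = {
    "D1": S1_INTENSITY + S1_DECOMPOSED + S1_FILTERED,   # Sentinel-1 only
    "D2": S2_BANDS,                                     # Sentinel-2 only
    "D3": S1_DECOMPOSED + S2_BANDS,
    "D4": S1_FILTERED + S2_BANDS,
    "D5": S1_INTENSITY + S2_BANDS,
    "D6": CHANNELS,                                     # all 19 channels
    "D7": S1_DECOMPOSED + S1_INTENSITY + S2_BANDS,
}

def select_channels(patches, dataset):
    """Return the channels of `patches` (..., 19) that belong to `dataset`."""
    indices = [CHANNELS.index(name) for name in DATASETS[dataset]]
    return patches[..., indices]
```

For example, `select_channels(patches, "D3")` keeps 13 of the 19 channels, so the only change needed in the network between experiments is the number of input channels.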
The proposed RSNNet was separately trained on these datasets, and the classification accuracy metrics are presented in Table 4. The results suggest that SAR data do not contain enough information for LCZ classification, evidenced by the low accuracy of D1 (Sentinel-1 only). D2 achieves an OA of 0.9181, a Kappa coefficient of 0.9095 and an AA of 0.9055. As expected, the addition of SAR data to multispectral bands enhances the LCZ classification performance. As shown in Table 4, D3 achieved the highest accuracy, with OA, Kappa coefficient, and AA of 0.9425, 0.9366 and 0.9060, respectively. The above results show that the decomposition method is an efficient tool to extract structure and scattering information which can be combined with multispectral data to improve classification performance (Lee and Pottier 2009).
The classification result of D4 is quite close to that of D2, suggesting that the information contained in the Refined Lee filter components benefits LCZ mapping only marginally. The OA, Kappa, and AA of D5 are 0.9340, 0.9270, and 0.9155, respectively, which are higher than those of D2, suggesting that the VV and VH intensities contribute substantially to LCZ mapping. D6 achieves an OA of 0.9328, which is lower than D3 and D5 but higher than D2. As D6 contains information from all 19 channels, its data redundancy leads to reduced accuracy in the LCZ classification compared to D3 or D5. We combined the Sentinel-1 components in D3 and D5 with the ten bands of D2 as a new dataset, designated as D7, and trained and tested RSNNet on it. D7 achieves an OA of 0.9315, which is lower than D3 and D5, suggesting that combining the components of the two best-performing datasets does not further improve accuracy. Considering the strong performance of D3, we conclude that the decomposed components contribute more than the filtered components and intensity channels.
To analyze the influence of SAR characteristics on individual class, we present PA and UA obtained with different datasets in Figure 10. The results indicate that LCZ A and LCZ G can be easily identified, evidenced by their high PAs and UAs. Compared with natural land cover types, the classification of built-up types benefits more from SAR characteristics. For LCZ 2, the PAs obtained from D3, D4, D5, and D6 are higher than D2. Similarly, for LCZ 3, the PAs and UAs obtained from D3, D5, and D6 are higher than D2.
The Sentinel-1 SAR data used in this study are in dual (VV-VH) polarization. Thus, only the H-Alpha Dual Pol decomposition can be used to extract decomposed components. Fully polarimetric SAR data can provide more backscattering information than Sentinel-1 data, and many target decomposition methods have been developed to transform backscattering information into basic backscattering mechanisms (Touzi, Boerner, and Lueneburg 2004; Yamaguchi, Yajima, and Yamada 2006). As fully polarimetric SAR data offer more capacity in terrain classification (Kajimoto and Susaki 2013; Angelliaume et al. 2018), future efforts can be made to integrate fully polarimetric SAR data with multispectral data in deep learning-based LCZ mapping.

Comparison with other CNNs
Table 5 presents the classification results of our proposed RSNNet and other popular CNNs: ResNet, DenseNet, Sen2LCZ-Net proposed by Qiu et al. (2020), MSCNN built by Kim et al. (2020), and the CNN constructed by Yoo et al. (2019). ResNet, which won first place in the ILSVRC 2015 classification task, introduces shortcut connections to alleviate vanishing gradients (He et al. 2016). DenseNet connects each layer to every other layer in a feedforward fashion and obtains significant improvements in various image segmentation and classification tasks (Huang et al. 2017). Sen2LCZ-Net is a simple CNN that considers multilevel feature fusion for LCZ mapping and achieves great performance in LCZ classification on the So2Sat LCZ42 dataset (Qiu et al. 2020). MSCNN was used to evaluate LCZ classification accuracy with custom LCZ training data (Kim et al. 2020). Yoo et al. (2019) constructed a CNN to compare it with a random forest classifier for LCZ classification. D6, which includes all 19 channels from the Sentinel-1 and Sentinel-2 data, was used to train the proposed RSNNet and the other CNNs. Although D6 contains more information than the other datasets, it is not the best-performing dataset in Section 5.1 because of its data redundancy. Training the models on this non-optimal dataset allows a better comparison of the ability of different CNNs to extract representative features from all 19 bands of Sentinel-1 and Sentinel-2 data.
The results suggest that the proposed RSNNet, which integrates a spatial attention module and a channel attention module, achieves the best performance in all three metrics, with OA, Kappa, and AA of 0.9328, 0.9257, and 0.9184, respectively. In comparison, the CNN constructed by Yoo et al. (2019) yields the worst classification result (OA: 0.7885; Kappa: 0.7649; AA: 0.6363). These results demonstrate the superiority of RSNNet over the selected widely adopted CNNs.
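For readers who wish to reproduce the three summary metrics used in this comparison, the sketch below computes OA, Kappa, and AA from a confusion matrix. It is an illustration under stated assumptions: rows are reference labels and columns are predictions, and AA is taken as the mean of the per-class producer's accuracies, which is a common definition but may differ from the exact one used for Table 5. The function name `overall_metrics` and the matrix `toy_cm` are hypothetical.

```python
import numpy as np

def overall_metrics(cm):
    """Overall Accuracy (OA), Cohen's Kappa, and Average Accuracy (AA).

    Assumptions: cm[i, j] counts reference class i predicted as class j,
    and AA is the mean per-class producer's accuracy over classes that
    have at least one reference sample.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    correct = np.diag(cm)
    row_totals = cm.sum(axis=1)
    col_totals = cm.sum(axis=0)

    oa = correct.sum() / n                         # observed agreement
    pe = (row_totals * col_totals).sum() / n ** 2  # agreement expected by chance
    kappa = (oa - pe) / (1.0 - pe)

    present = row_totals > 0                       # skip classes without reference samples
    aa = np.mean(correct[present] / row_totals[present])
    return oa, kappa, aa

# Hypothetical 3-class confusion matrix, for illustration only
toy_cm = [[8, 2, 0],
          [1, 6, 3],
          [0, 0, 10]]
oa, kappa, aa = overall_metrics(toy_cm)
print(f"OA={oa:.4f}, Kappa={kappa:.4f}, AA={aa:.4f}")
```

Because AA weights every class equally, it is more sensitive than OA to poor performance on rare classes, which may explain why the gap between networks is largest in AA.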

Effect of focal loss
As mentioned in Section 2.3, the numbers of samples per category in the Sen12LCZ dataset are imbalanced (see Table 3). To mitigate the impact of this imbalance on classification accuracy, we implement focal loss in this study. The focusing parameter γ affects the performance of focal loss. Table 6 displays the effect of different γ values on LCZ classification accuracy when focal loss is implemented in ResNet, DenseNet, Sen2LCZ-Net, and RSNNet. D6 was used to train the models. The experimental results suggest that different networks have different optimal γ values. The best result for RSNNet is obtained when γ is set to 2. For ResNet-50, the optimal γ value is 3.
When γ=0, focal loss is equivalent to the cross-entropy (CE) loss. We can observe that focal loss with the optimal γ value improves the classification performance. For example, RSNNet with focal loss achieves the highest accuracy, with an OA of 0.9328, whereas the OA of RSNNet with CE loss is 0.8900.
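The role of γ can be made concrete with a small numerical sketch of the focal loss, FL = -(1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below is a plain NumPy illustration, not the training code of this study: it omits the optional class-weighting factor α, and the function names `focal_loss` and `cross_entropy` as well as the probabilities in `probs` are hypothetical.

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0, eps=1e-12):
    """Mean focal loss without the alpha weighting term (an assumption).

    probs:  (N, C) array of predicted class probabilities (rows sum to 1)
    labels: (N,) array of integer class indices
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    p_t = probs[np.arange(len(labels)), labels]   # probability of the true class
    p_t = np.clip(p_t, eps, 1.0)                  # avoid log(0)
    # (1 - p_t)^gamma down-weights well-classified (easy) samples
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy loss, used as the gamma = 0 reference."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    p_t = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(np.mean(-np.log(p_t)))

# Hypothetical predictions: one easy, one moderate, and one hard sample
probs = [[0.90, 0.05, 0.05],
         [0.20, 0.70, 0.10],
         [0.30, 0.30, 0.40]]
labels = [0, 1, 2]

for gamma in (0, 1, 2, 3):
    print(f"gamma={gamma}: focal loss = {focal_loss(probs, labels, gamma):.4f}")
print(f"CE loss = {cross_entropy(probs, labels):.4f}")
```

With γ=2, the easy sample (p_t = 0.9) is scaled by (1 - 0.9)^2 = 0.01, whereas the hard sample (p_t = 0.4) is scaled by 0.36, so the gradient signal is dominated by hard and typically under-represented samples.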

Conclusions
In this study, we propose a novel CNN architecture, RSNNet, that considers channel attention and position attention for LCZ mapping. We test the proposed RSNNet on three large cities in China: Beijing, Tianjin, and Wuhan. The results suggest that RSNNet outperforms other state-of-the-art networks. We also find that combining SAR and multispectral imagery can improve the accuracy of LCZ classification. For the designed experiments, the Sen12LCZ dataset, which contains LCZ labels, is established from Sentinel-1 SAR data and Sentinel-2 multispectral data. We further analyze the influence of different SAR features on the classification results. The results reveal that decomposed components contribute more to classification accuracy than intensity images (VV and VH). In addition, combining Refined Lee speckle filter components with multispectral imagery reduces classification accuracy compared with using multispectral imagery alone. We also notice that the imbalance of LCZ labels in the Sen12LCZ dataset can be addressed by implementing focal loss.
As this study focuses on city-scale LCZ mapping, future efforts can be directed toward large-scale LCZ mapping. Only three large cities are studied in this paper, whereas there are many more developing cities in the world. The characteristics of urban form and surface differ between developed and developing cities. To achieve the goal of global LCZ mapping, more attention should be paid to developing cities in future research. Furthermore, studies have shown that fully polarimetric SAR data contain more backscattering information than Sentinel-1 data. Thus, the potential of fully polarimetric SAR data in CNN-based LCZ mapping deserves further investigation.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the National Natural Science Foundation of China [grant numbers 42090012, 41890820, 41771452, and 41771454].

Notes on contributors
Lin Zhou is currently pursuing the MS degree at the School of Remote Sensing and Information Engineering, Wuhan University. Her research interests include local climate zone classification and urban heat island.
Zhenfeng Shao received the PhD degree in photogrammetry and remote sensing from Wuhan University in 2004. He is a Full Professor with LIESMARS, Wuhan University. His research interests are high-resolution image processing, pattern recognition, and urban remote sensing applications.
Shugen Wang received the PhD degree in photogrammetry and remote sensing in 2003 from Wuhan University, Wuhan, China. Since 2001, he has been a Full Professor with the School of Remote Sensing and Information Engineering, Wuhan University. His research interests include digital photogrammetry and aerial image processing.

Data availability statement
The data that support the findings of this study are available from the corresponding author, upon reasonable request.