Snow detection in alpine regions with Convolutional Neural Networks: discriminating snow from cold clouds and water bodies

ABSTRACT Accurately monitoring the variation of snow cover from remote sensing is vital since it assists various fields, including the prediction of floods, the control of runoff values, and the study of the ice regime of rivers. Spectral indices methods are traditional ways to perform snow segmentation; the most common one, the Normalized Difference Snow Index (NDSI), utilizes the combination of the green and short-wave infrared (SWIR) bands. However, the accuracy of spectral indices methods heavily depends on the choice of threshold, making it time-consuming to find optimal values for different places. In this work, an ensemble model of convolutional neural networks, with DeepLabV3+ employed as sub-models, was built for snow segmentation on Sentinel-2 imagery, aiming to distinguish clouds and water bodies from snow. The imagery dataset generated in this article covers sites in alpine regions around the globe, such as the Tibetan Plateau in China, the Alps in Switzerland, Alaska in the United States, the Southern Patagonian Icefield in Chile, and Tsylos Provincial Park, Tatsamenie Peak, and Dalton Peak in Canada. To overcome the limitation of DeepLabV3+, which only accepts three channels as input features, while six features were needed (green, red, blue, near-infrared (NIR), SWIR, and NDSI), 20 three-channel DeepLabV3+ sub-models were constructed with different combinations of three features and then ensembled together. The proposed ensemble model outperformed the benchmark spectral indices method, with mIoU values ranging from 0.8075 to 0.9538 in the different test sites. The results of this project contribute to the development of automated snow segmentation tools to assist Earth observation applications.


Background and evidence synthesis of previously published research
Monitoring the variation of snow cover from remote sensing is vital since snow cover is an important indicator of hydrologic budgets, surface radiation, and mountain ecosystems (KONGOLI, Romanov, and Ferraro 2012; DHARPURE et al. 2020). Monitoring the snow cover assists remote sensing applications in the prediction of floods, the control of runoff values, and the study of the ice regime of rivers (L.S. KRAMAREVA, Simonenko, and Sorokinb 2019). Besides, observations of snow cover also provide data support for weather forecasting and military operations (VARSHENY 2019). One of the fundamental issues in obtaining accurate snow areas in continuous long-term monitoring is that manually labeling snow is time-consuming (ZHAN et al. 2017). In addition, as the colors of clouds and snow are similar, it is sometimes difficult to manually distinguish them in true color images. Spectral indices methods (or threshold methods) are traditional ways to solve the above problems, including the Snow Grain Size Index (SGI) and the Normalized Difference Snow Index (NDSI) used to map snow areas, as well as the Normalized Difference Water Index (NDWI) used to map water bodies (YAN et al. 2020). These spectral indices utilize the band information from satellites. For example, the NDSI is calculated through the combination of the green and short-wave infrared (SWIR) bands. The NDSI exploits a specific difference in the reflectance of the SWIR band (1.6 to 2.2 μm) to distinguish snow from surrounding areas and especially clouds (MILLER, LEE, and Fennimore 2005). There are other combinations of spectral indices and active remote sensing used for snow detection (AWASTHI and VARADE 2021, and references therein). However, these indices are calculated independently for each pixel of the satellite image and then compared with certain threshold values (KRAMAREVA, Simonenko, and Sorokinb 2019).
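As a minimal sketch of the per-pixel index computation described above (the band values are illustrative reflectances, and the 0.4 threshold is only an example):

```python
import numpy as np

def ndsi(green, swir):
    """Per-pixel Normalized Difference Snow Index from green and SWIR bands.

    Inputs are arrays of reflectance values; a small epsilon guards against
    division by zero in very dark pixels.
    """
    green = green.astype(np.float64)
    swir = swir.astype(np.float64)
    return (green - swir) / (green + swir + 1e-12)

# Pixels whose NDSI exceeds a chosen threshold are flagged as snow.
green = np.array([[0.8, 0.1], [0.6, 0.05]])
swir = np.array([[0.1, 0.09], [0.1, 0.04]])
snow_mask = ndsi(green, swir) >= 0.4
```

This is exactly the per-pixel, threshold-based computation whose limitations (no contextual information) are discussed next.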
The shortcoming of this method is that it does not take the contextual features of an image into consideration, and it may be affected by the occlusion of other features (JAMES, SCHILLACI, and LIPANI 2021). Clouds are the main issue when utilizing spectral indices methods, since they come in various types (thick clouds, cirrus clouds, etc.) with different spectral reflectance values. In addition, the shape and texture of a cloud also have an obvious impact on distinguishing them (VARSHENY 2019). Therefore, spectral indices methods may perform poorly in some complex situations. Besides clouds, water bodies are also a challenge for spectral indices methods in alpine regions, since ice in the water affects the spectral signature of the water.
Deep learning architectures are able to work with each pixel of the image separately, as well as with textures containing spatial information about their mutual arrangement and representing a picture of a small surface (L.S. KRAMAREVA, Simonenko, and Sorokinb 2019). Convolutional neural networks (CNNs) are suitable for feature detection and semantic segmentation since they exploit the spatial dependency of nearby pixels, i.e. pixels that are close to each other behave similarly. As CNNs take textures into consideration during learning, the model is able to fully exploit spatial information efficiently without performing redundant operations on adjacent pixels (DRÖNNER et al. 2022). CNNs have been widely used in semantic segmentation (FREISLEBEN and SEEGER 2018). For instance, SYRRIS et al. (2019) evaluated the performance of four deep learning variants, i.e. standard neural networks, fully convolutional networks, U-Net, and SegNet, for multi-class segmentation of Sentinel-2 imagery. They found that the CNN-5x5 models (especially the one trained on samples from both winter and spring seasons) are the best architecture among the CNN variants mentioned above, with relatively low architectural complexity and robustness to varying conditions. However, they only focused on the segmentation of low-density vegetation, grassland, cropland, forest, built-up, water, and others, but not clouds. Besides, most researchers have mainly focused on CNN implementations for high-resolution (30 cm to 5 m/pixel) and low-resolution (over 60 m/pixel) satellite imagery, lacking exploration of medium-resolution imagery (10-30 m/pixel). Medium-resolution imagery plays an important role in mapping activities due to its typically higher spatial and temporal resolution (JAMES, SCHILLACI, and LIPANI 2021). Therefore, we seek to utilize CNN architectures on medium-resolution satellite imagery in this article.
This work aims to effectively segment snow from clouds and water bodies in medium-resolution satellite imagery. The novelty of this work is to build an ensemble CNN model that outperforms the spectral indices method for snow segmentation on Sentinel-2 images, with DeepLabV3+ employed as a backbone. To overcome the limitation of DeepLabV3+, which only accepts three channels as input features, while six features were needed (green, red, blue, NIR, SWIR, and NDSI), 20 three-channel DeepLabV3+ sub-models were constructed with different combinations of three features and then ensembled together. Twenty is the number of possible combinations of size 3 when six features are available (6 choose 3). The final ensemble model is based on the weighted mean of these 20 sub-models, which takes advantage of the different combinations of the six features. In addition, we also explored the performance of DeepLabV3+ sub-models with different input channels, which is valuable to reveal the feature combinations that most influence model performance.
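The 20 feature combinations can be enumerated directly; a short sketch using Python's standard library (the feature names simply mirror the six features listed above):

```python
from itertools import combinations

FEATURES = ["red", "green", "blue", "NIR", "SWIR", "NDSI"]

# Each sub-model receives one of the C(6, 3) = 20 three-feature combinations.
sub_model_inputs = list(combinations(FEATURES, 3))
print(len(sub_model_inputs))  # 20
```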
The specific aims of this work are: the generation of a global semantic segmentation dataset containing images exhaustively annotated with pixel-level objects (snow); the analysis of a traditional spectral snow segmentation method based on the NDSI and NDWI as a baseline; the development of an ensemble CNN model as a novel method to obtain more accurate snow segmentation results than the traditional spectral snow segmentation method in terms of Intersection over Union (IoU); and finally, the exploration of the influence of different feature combinations, used as input channels of the DeepLabV3+ sub-models, on model performance.

Available Earth observation data for snow detection
Earth Observation (EO) can be defined as information collected about the physical, chemical, and biological fields of the Earth. EOs can be acquired through remote sensors and ground-level monitoring instruments (RAIYANI et al. 2021). With the fast development of satellite technology, it is becoming much easier to obtain remote sensing images, and they have been widely used in various fields including agriculture, geology, hydrology, transportation, and disaster monitoring (ZHENG et al. 2021). The first satellite employed to map snow cover was the Television and Infra-Red Observation Satellite (TIROS-1), active from 1 April 1960 (LUCAS and HARRISON 1990). After that, various mature satellite technologies with different spatial and temporal resolutions have been gradually developed in this field (Table 1).
One of the commonly used satellites for snow detection is the Moderate Resolution Imaging Spectroradiometer (MODIS), available since 2000 (VERMOTE 2015). MODIS provides 36 bands with wavelengths ranging from 0.4 to 14.4 µm every 1 to 2 days. The spatial resolutions for bands 1-2, bands 3-7, and bands 8-36 are 250, 500, and 1000 m, respectively (VERMOTE 2015). However, this spatial resolution is not sufficient for snow mapping in alpine regions, as snow cover mainly varies from 10 to 100 m in places with relatively high altitudes (GASCOIN et al. 2019). The Landsat satellites solve this coarse spatial resolution issue. The Landsat 4-5 Thematic Mapper (TM) has been available since 1982 and provides 7 bands, with a spatial resolution of 30 m for bands 1-5 and 7, and 120 m for band 6. In addition, Landsat 7, launched in 1999, carries the Enhanced Thematic Mapper Plus (ETM+) sensor, which generates 8 bands of information. The spatial resolution for bands 1-5 and 7 is still 30 m, while bands 6 and 8 have resolutions of 60 m and 15 m, respectively. The Operational Land Imager (OLI) instrument onboard Landsat 8 (launched in 2013) provides 11 bands, with a spatial resolution of 30 m for bands 1-7, 15 m for band 8, and 100 m for bands 10-11 (ANDREAS et al. 2012). However, Landsat can only provide images every 16 days, which may limit high-temporal-resolution applications (GASCOIN et al. 2019). In addition to the above two kinds of satellites, the Copernicus Sentinel-2 mission aims to provide systematic global acquisitions of high-spatial-resolution multispectral images with a high revisit frequency (DRUSCH et al. 2012). The Copernicus Sentinel-2 mission provides 13 bands at different resolutions with a revisit time of five days, covering all of the Earth's land surface with the same view. The highest spatial resolution of the Sentinel-2 mission is 10 m, which can provide more accurate information than the previous two satellites.
The relatively high spatial-temporal resolution of Sentinel-2 can be combined with Landsat-8 to provide more frequent observations, with a revisit time of 2.9 days (LI and ROY 2017). Besides, some more recently launched satellites are also in use. For instance, the Gaofen-1 Multispectral Wide Field of View (GF-1 WFV) satellite imagery, launched by the China Aerospace Science and Technology Corporation (CASC) in April 2013, has a 16 m spatial resolution and a 4-day revisit cycle. However, GF-1 WFV only contains three visible bands and one near-infrared band. Since it lacks the SWIR band, which is important for differentiating clouds and snow, it cannot be used to distinguish between snow and clouds.

Spectral signature for snow detection
To discriminate snow from clouds, water, and other types of land surface, their spectral signatures are used (Figure 1). A spectral signature is the variation of the reflectance of an object as a function of wavelength, i.e. the radiation reflected at each wavelength (ESA 2009). In Figure 1, we observe that clouds have a higher reflectance than snow at the wavelengths of 1.61 and 2.19 μm, corresponding to the SWIR bands, and snow has a higher reflectance than water in the visible range (0.49, 0.56, and 0.665 μm).
One of the biggest challenges for snow detection is to accurately distinguish clouds from snow, since they have a similar appearance in satellite images, making them difficult to tell apart at first glance. Moreover, clouds and snow have similar reflectance in both the visible and infrared spectra, which also makes the detection of snow from this spectral information difficult (ZHAN et al. 2017); for this reason, geophysical parameters such as wetness, density, and permittivity are a significant input in hydrological models (Varade and Dickshit 2020). However, as 66.7% of the Earth's surface is covered by clouds, cloud occlusion is a primary issue to be solved in order to obtain an accurate snow cover map (ZHENG et al. 2021).
Traditional methods for acquiring accurate snow cover maps include manual annotation and spectral indices methods. Manual annotation is the most accurate, but it is time-consuming when applied to continuous real-time monitoring. Spectral indices methods exploit the different optical properties of snow, water, and clouds (LEE et al. 2017). Using the Spinning Enhanced Visible and InfraRed Imager (SEVIRI), the Land Surface Analysis Satellite Applications Facility (LSA-SAF) can distinguish snow and clouds through 22 threshold tests using visible and IR radiance observations (SILJAMO and HYVÄRINEN 2011). In addition, the NDSI is a frequently used method since it takes advantage of the spectral features of snow, i.e. the strong reflection at visible and NIR wavelengths and the nearly total absorption at middle infrared wavelengths (YIN et al. 2013). For example, the NDSI, together with the Normalized Difference Vegetation Index (NDVI) and two spectral bands from MODIS, has been used to separately generate cloud and snow cover, which improves the MODIS cloud mask product (THOMPSON, Paull, and Lees 2015); another promising multispectral snow mapping approach is based on Geostationary-Korean Multi-Purpose SATellite-2A (GEO-KOMPSAT-2A, GK-2A) imagery (LEE and CHOI 2022). The NDSI threshold value is usually set to 0.4 for the MODIS snow cover product (HALL, Riggs, and Salomonson 1995). However, it is adjusted according to the location of the test. For instance, it is suggested to be 0.57-0.72 for Landsat TM images at Tianshan Station, China (WANG and LI 2003), and 0.3-0.4 for MODIS images around the Tibetan Plateau (FANG Moren, 2005). These studies suggest that the optimal NDSI threshold varies across sites and needs to be manually adjusted according to their unique characteristics.
In addition, considering that the NDSI still contains water body information, some variants of the NDSI have been proposed, including the NDSI with no water information (NDSINW), which extracts snow cover/glacier (SCG) and suppresses lake water by taking advantage of the NIR and SWIR bands (YAN et al. 2020). However, these static spectral indices methods for snow detection will probably generate errors when used to discriminate between snow cover and clouds, since they do not consider that the reflectance of snow measured by satellites can vary because of the sun-target-sensor geometry and the different characteristics of the snow. The reflectance of snow cover will also change with the variation of the solar position and with contamination (LEE et al. 2017). The discrimination of water bodies is also a difficult problem for spectral indices methods, especially in alpine regions. Water in alpine regions is often covered by ice, which affects the spectral signature of the water, contributing to the inaccurate results of spectral indices methods. Suitable fine-tuning in each specific site may solve this problem, but fine-tuning every site is inefficient for continuous long-term monitoring.

Semantic segmentation and convolutional neural networks
In computer vision, semantic segmentation aims to divide the image into several disjoint regions so that each region is homogeneous but the union of any two adjacent regions is not (PAL and PAL 1993). Since semantic segmentation is able to give categorical information at the pixel level, it has been widely used in various fields including self-driving, pedestrian detection, and land use type detection (HAO, Zhou, and GUO 2020). Unlike the simple image classification task, the image segmentation task needs to output not only the categories of the segmented objects in the image but also their positions. A limitation of pixel-level semantic information is that it does not discriminate instances of the same class. In other words, semantic segmentation does not distinguish two objects of the same class in the input image as separate objects, a task that is instead solved by instance segmentation (HAO, Zhou, and GUO 2020).
CNNs are feedforward neural networks, similar to the more traditional Multi-Layer Perceptrons (MLPs), but with the obvious difference that CNNs are mainly designed for image-related tasks. This is because an image classification task greatly increases the number of parameters and calculations for an MLP, while a CNN, by using convolutional layers, greatly reduces the number of parameters required by the model, thus also reducing the amount of calculation required (O'SHEA and NASH 2015). The main characteristics of CNNs are local connectivity and weight sharing, which help reduce both the number of parameters and the possibility of over-fitting. CNNs are usually composed of three elements: convolution, pooling, and fully connected layers. The function of the convolution and pooling layers is feature extraction and dimensionality reduction. The purpose of the fully connected layers is to organize the features extracted by the first two kinds of layers into the final output (O'SHEA and NASH 2015). After stacking these layers, a complete CNN architecture is formed.
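To make the parameter-reduction argument concrete, a back-of-the-envelope comparison for a 244 × 244 RGB tile (the layer sizes, 100 hidden units and 16 filters of size 3 × 3, are chosen purely for illustration):

```python
# A single dense layer on a flattened 3 x 244 x 244 image: one weight per
# input pixel per hidden unit, plus one bias per hidden unit.
in_pixels = 3 * 244 * 244
hidden = 100
mlp_params = in_pixels * hidden + hidden

# A convolutional layer with 16 filters of size 3x3 over 3 input channels:
# thanks to weight sharing, the count is independent of the image size.
conv_params = 16 * (3 * 3 * 3) + 16

print(mlp_params, conv_params)  # 17860900 448
```

Even this toy comparison shows the roughly four-orders-of-magnitude gap that motivates convolutional layers for image inputs.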
Fully convolutional networks (FCNs) were proposed especially for semantic segmentation tasks (LONG, Shelhamer, and Darrell 2015). An FCN classifies images at the pixel level, thus solving the semantic segmentation problem. Unlike the classical CNN, which uses fully connected layers after the convolutional layers to obtain fixed-length feature vectors for classification, an FCN is able to accept input images of any size. In addition, a deconvolution layer is employed by the FCN to up-sample the feature map of the last convolutional layer, so that the output can be restored to the same size as the input image. Thus, a prediction is generated for each pixel while preserving the spatial information of the original input image, and finally, pixel-by-pixel classification can be performed on the up-sampled feature map (SULTANA, Sufian, and Dutta 2020). The capability of FCNs to detect the presence of avalanches in SAR products at pixel granularity has been explored, showing superior performance compared to the current state-of-the-art avalanche detection algorithm (BIANCHI et al. 2021).
The U-Net, originally developed for segmenting biomedical images, is an extension of the FCN architectures. It is a U-shaped semantic segmentation network containing a contracting path and an expansive path. In the contracting path, each down-sampling module is composed of two 3 × 3 convolution layers (each followed by a ReLU) and one 2 × 2 max pooling layer. The expansive path includes the up-sampling of the feature map followed by a 2 × 2 up-convolution; during this step, the number of feature channels is halved. The feature maps from the two paths are then concatenated, followed by two consecutive 3 × 3 convolutions, each with a ReLU activation function (SULTANA, Sufian, and Dutta 2020). Besides its application to biomedical images, U-Net has also been employed for multi-class land use segmentation on Sentinel-2 imagery (SYRRIS et al. 2019).
In this work, we use DeepLab. DeepLab is a state-of-the-art model for semantic segmentation and has been updated to its fourth version (CHEN et al. 2014; LIANG-CHIEH et al. 2018). DeepLab is a combination of deep CNNs (DCNNs) and probabilistic graphical models (DenseCRFs). The application of DCNNs to semantic segmentation faces two technical problems: down-sampling and spatial invariance (SULTANA, Sufian, and Dutta 2020). The first problem is due to the resolution degradation caused by repeated max pooling and down-sampling in DCNNs. In the first version, DeepLabV1, atrous convolution was utilized to solve this problem. The second problem, spatial invariance, is a property required to obtain object-centered decisions, which limits the spatial accuracy of DCNNs. The introduction of DenseCRFs aims to solve this issue. In DeepLabV2, atrous spatial pyramid pooling (ASPP) was introduced. In addition, the backbone was changed from VGG-16 to ResNet (Residual Network). The ResNet architecture was proposed by the Microsoft Research team in 2015 and utilizes identity connections to avoid the vanishing gradient problem. DeepLabV3 uses an augmented ASPP module to incorporate multiple contexts. The latest version, DeepLabV3+, uses an encoder-decoder structure, which makes it possible to use atrous convolution to arbitrarily control the resolution of the extracted encoder features in order to balance precision and runtime. In addition, a modified Xception architecture is employed as the backbone in this version (ZHENG et al. 2021). In this study, DeepLabV3+ with ResNet as backbone was employed as the pre-trained model since it has superior performance on image segmentation.

Data source description
In this article, we use the Sentinel-2 product from the European Space Agency (ESA). The Copernicus Sentinel-2 mission consists of two identical satellites in the same sun-synchronous orbit, with Sentinel-2A launched on 23 June 2015 and Sentinel-2B on 7 March 2017. The wide-swath high-resolution multispectral imager onboard each satellite provides 13 spectral bands to monitor the variation of land surface conditions. With a swath width of 290 km, Sentinel-2 has a revisit time of 10 days at the equator for one satellite, 5 days with two satellites under cloud-free conditions, and 2 or 3 days at mid-latitudes (DRUSCH et al. 2012). Each satellite orbits at 785 km altitude and obtains observations between 56° S and 84° N latitude. As the best compromise between minimizing cloud cover and appropriate sun illumination, the descending node is at 10:30 am local time (DRUSCH et al. 2012). The two satellites work at opposite positions of the sun-synchronous orbit.

Study sites
To build a global model with superior generalization ability, we considered several possible sites containing snow and water bodies in alpine regions all around the world (Figure 2). Thereafter, we selected multiple sites from China, Chile, Canada, Switzerland, and the United States. In China, we selected four sites from Tibet and the Xinjiang Uygur Autonomous Region (XUAR), both in the southwestern part of China. In Chile, a country in western South America, we selected three sites gathered around the Isla Ofhidro region, which contains an outlet glacier of the Southern Patagonian Icefield (PELTO 2019). We selected two sites from Switzerland in the Lenzerhorn region, a mountain of the Plessur Alps. Another site is in Bifertenstock, a mountain in the Glarus Alps with an elevation of 3,419 m. In the United States, we selected two sites, both in Alaska. The sites in Canada are from Tsylos Provincial Park, Tatsamenie Peak, and Dalton Peak. The sites from the United States and Chile were selected because the mountains there contain relatively many gullies, where snow piles up but satellites cannot easily capture it. In the sites in Canada and Switzerland, snow accumulates from the summit to the mountainside, forming strips along the mountain, which requires accurate detection to map. The sites from China were collected from relatively flat areas that include large water bodies, in order to explore model performance when there is much ice in the water. To capture temporal variation, we also considered images from different seasons within 5 consecutive years (2016-2021).

Data acquisition and annotation
We employed Google Earth (https://www.google.com/earth/index.html) combined with Sentinel Playground (https://apps.sentinel-hub.com/sentinel-playground/) to determine the suitable study areas. Google Earth was used to select alpine regions with both snow and water bodies, from which we acquired the coordinates of the target areas. Sentinel Playground was used to check the NDSI values for each specific study area. Thereafter, the Sentinel-hub application programming interface (API), which supports programmatic processing of satellite data in a Python environment, was used to collect data for each study area during a specific period of time (from January 2016 to May 2021). We chose images from different time phases to verify the generalization performance of the proposed method. In addition, to avoid too much cloud occlusion, a parameter named "maxcc" was set to 0.2 to filter out images with a cloud coverage rate larger than 20%. All images are from the Level-1C (L1C) and Level-2A (L2A) products, which represent top-of-atmosphere (TOA) observations and the atmospherically corrected bottom-of-atmosphere (BOA) product, respectively. In addition, 12 bands were collected from both the L1C and L2A products, along with their true color imagery. The information for each band is shown in Table 2. The true color imagery has a spatial resolution of 10 m, while the resolutions of the individual bands vary.
We annotated each image to obtain the snow mask as ground truth. The annotation was done using Photoshop with the help of the magic wand tool, as done in the literature (JAMES, SCHILLACI, and LIPANI 2021). The magic wand tool is a selection tool which clusters pixels based on tone and color. Therefore, large snow areas can be easily obtained, but details need to be carefully selected. One of the most difficult things we encountered during the labeling process was distinguishing the boundary between clouds and snow when they overlap. To solve this, false color images, i.e. images constructed by replacing any of the red, green, and blue (RGB) channels with NIR, SWIR, or NDSI, were used instead of RGB images because they make it easier to distinguish the boundaries between clouds and snow.

Methodology
In Figure 3, we present the methodology employed in this work. Considering the aim of this study, the performance of the 20 DeepLabV3+ sub-models built with different input channels was compared in order to explore the impact of different spectral information. In addition, the advantages of the CNNs were emphasized through comparison with the spectral indices method. The mean intersection over union (mIoU) is used to evaluate the performance of each model. The mIoU is a common criterion for semantic segmentation (LIU et al. 2019; ZHANG et al. 2020), which first calculates the IoU for each class and then averages over the classes.
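A minimal sketch of this metric (the class labels and 2 × 2 arrays are purely illustrative):

```python
import numpy as np

def miou(pred, target, num_classes=2):
    """Mean Intersection over Union: per-class IoU averaged over classes."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
score = miou(pred, target)  # IoU(class 0) = 2/3, IoU(class 1) = 1/2
```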

Training, validation, and test splits
Considering the limitation of the input size of the selected CNN architecture, all images and labels were split into small pieces of 244 × 244 pixels. Therefore, the input size of each CNN here is 3 × 244 × 244. At this stage, the images were first resized into squares and then split into smaller tiles. After splitting the large images of the dataset, we obtained a total of 2960 samples. However, due to limited computing power, the following experiments were only conducted on a smaller subset. The subset was formed by randomly selecting two large images from the same site of each country, for a total of 770 samples. Of these 770 samples, 52% were used for the training set, 13% for the validation set, and the remaining 35% for the test set. Thus, 400 samples were used for training, 100 samples were used to validate the hyper-parameters of the model, and 270 samples were held out to assess the performance of the model on unseen input. Considering that the main aims of this study are to explore whether CNN models perform better than the spectral indices method, as well as to compare the performance of CNN models built with different input channels, the selected small subset is deemed sufficient.
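The tiling step can be sketched as follows (assuming, as described above, that the resized image dimensions are multiples of the tile size):

```python
import numpy as np

def tile(image, size=244):
    """Split a (C, H, W) array into non-overlapping (C, size, size) tiles."""
    c, h, w = image.shape
    tiles = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            tiles.append(image[:, top:top + size, left:left + size])
    return tiles

image = np.zeros((3, 488, 488))  # a resized image covering 2 x 2 tiles
patches = tile(image)
print(len(patches), patches[0].shape)  # 4 (3, 244, 244)
```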

Spectral indices-based baseline
To generate the baseline based on the spectral indices, we followed the flow chart in Figure 4. To compare the traditional spectral indices method against our proposed CNNs, we calculated the snow-only mask by removing the NDWI mask from the NDSI mask. The optimal thresholds for the NDSI and NDWI masks were obtained as described below. The density plot of NDSI values in the training set was first visualized. Then, the point between the two peaks was taken as the optimal threshold of the NDSI, which is 0.35. Therefore, the pixels with NDSI ≥ 0.35 in each testing image were classified as snow, while the pixels with NDSI < 0.35 were classified as background. As the NDSI mask also contains water information, we employed the NDWI mask and removed it from the NDSI mask in order to obtain the snow-only mask. The NDWI mask is obtained in a similar way to the NDSI mask. But since the NDWI includes information for water bodies, snow, and background, the optimal threshold is located at the lowest point before the third peak. Based on what we observe in Figure 5, the optimal threshold for the NDWI was 0.6; thus, the pixels with NDWI ≥ 0.6 were classified as water body. Thereafter, these two optimal thresholds were utilized for testing. The snow-only mask was obtained by subtracting the NDWI mask from the NDSI mask. For the sake of convenience, the results obtained with this method are referred to as SI.
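A compact sketch of this baseline, using the thresholds derived above (the index arrays are illustrative):

```python
import numpy as np

def snow_only_mask(ndsi, ndwi, t_ndsi=0.35, t_ndwi=0.6):
    """SI baseline: NDSI-thresholded snow minus pixels flagged as water by NDWI."""
    snow = ndsi >= t_ndsi
    water = ndwi >= t_ndwi
    return np.logical_and(snow, ~water)

ndsi = np.array([0.9, 0.5, 0.1])
ndwi = np.array([0.1, 0.8, 0.7])  # the middle pixel is water despite high NDSI
mask = snow_only_mask(ndsi, ndwi)  # [True, False, False]
```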

The proposed ensemble model
DeepLabV3+ with a ResNet101 backbone was employed as the sub-model. The pre-trained DeepLabV3+ with ResNet101 expects three-channel images as input. The shape of the input is (N, 3, A, B), where N represents the number of input images, and A and B are the width (244 pixels) and height (244 pixels) of each channel. In order to explore the influence of different feature combinations as input channels of the DeepLabV3+ sub-models, six features were selected as input, i.e. red, green, blue, NIR, SWIR, and NDSI. The first three features (red, green, and blue) are traditional channels used in computer vision (ZHAN et al. 2017; SYRRIS et al. 2019; ZHENG et al. 2021). These three bands are beneficial for discriminating snow and water bodies. The SWIR feature is used for discriminating snow from clouds, since snow has a lower reflectance than clouds in the SWIR band. The combination of blue and SWIR can be used for detecting wispy opaque clouds (with low reflectance) in high-altitude locations (COLUZZI et al. 2018). The NDSI is calculated through the combination of the green and SWIR bands and can be used to identify snow. Even though the individual SWIR and green bands are already included, the NDSI can still provide extra information as a feature combination, giving more accurate information at the pixel level. The NIR was utilized to provide additional information to distinguish snow and water bodies, since snow has a higher reflectance than water in this band. The ensemble model was developed as a hybrid of 20 DeepLabV3+ sub-models. Considering that different combinations of spectral features have different impacts on model performance, the six features red, green, blue, NDSI, SWIR, and NIR were prepared in order to train different models.
Each DeepLabV3+ sub-model was trained with three input channels chosen from the six features, giving a total of 20 sub-models built with different combinations of three features. The abbreviation and specific features for each model used in this work are shown in Table 3. The final model output is based on a weighted mean. The weights are based on the performance of each sub-model on the validation set in terms of mIoU. Note that these weights are based on the validation set and not the test set. The weighted mean of the probability of snow in each pixel is then thresholded, i.e. if it is more than 50%, the pixel is considered snow. The formula used to determine the weighted mean of the probability of snow in each pixel is defined as:

P(c) = (∑_{i=1}^{20} mIoU_i · P_i(c)) / (∑_{i=1}^{20} mIoU_i)

where c represents the class to be predicted, P_i(c) is the output probability of the i-th sub-model, and mIoU_i is the mIoU of the i-th sub-model calculated on the validation set.
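The weighted-mean formula above can be sketched as follows. The function name and the toy probabilities/mIoU values are illustrative, not from the paper:

```python
import numpy as np

def ensemble_predict(probs, mious, threshold=0.5):
    """probs: (n_models, H, W) per-pixel snow probabilities from each sub-model.
    mious: (n_models,) validation mIoU of each sub-model, used as weights.
    Returns a boolean snow mask after thresholding the weighted mean."""
    probs = np.asarray(probs, dtype=float)
    w = np.asarray(mious, dtype=float)
    # Contract the model axis: weighted sum of probabilities, then normalize.
    weighted = np.tensordot(w, probs, axes=1) / w.sum()
    return weighted > threshold  # snow if weighted probability exceeds 50%

# Three toy sub-models, one 1x2 image.
probs = [[[0.9, 0.2]], [[0.8, 0.6]], [[0.7, 0.1]]]
mious = [0.85, 0.80, 0.90]
print(ensemble_predict(probs, mious))  # [[ True False]]
```

Note that better-validated sub-models pull the weighted mean toward their prediction, which is the point of weighting by validation mIoU rather than averaging uniformly.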
To increase the model's generalization ability, the data was augmented by performing horizontal flips, random Gaussian blur, and rotations. An illustration of these methods can be found in (GANDHI 2021; NELSON 2020).
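A minimal numpy sketch of the geometric part of this augmentation is shown below (the paper's actual pipeline, and the Gaussian blur step, are not specified here; `augment` and the random seed are assumptions). The key design point is that geometric transforms must be applied identically to the image and its segmentation mask:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, mask):
    """Apply the same random flip/rotation to a 2-D image and its snow mask.
    (Sketch only; the paper also applies random Gaussian blur to the image.)"""
    if rng.random() < 0.5:  # horizontal flip with probability 0.5
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = rng.integers(0, 4)  # rotation by 0/90/180/270 degrees
    return np.rot90(image, k), np.rot90(mask, k)

img = np.arange(16).reshape(4, 4)
msk = (img > 7).astype(np.uint8)
aug_img, aug_msk = augment(img, msk)
# The mask still matches the transformed image pixel-for-pixel.
assert (aug_msk == (aug_img > 7)).all()
```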

Hyper-parameter tuning.
The model hyper-parameters were tuned with 500 samples, focusing on the learning rate and batch size. Models were trained with batch sizes of 16 and 20, and the tested learning rates were 0.1, 0.01, and 0.001. According to the hyper-parameter tuning results, the final combination for sub-models trained with 500 samples was a batch size of 20 and a learning rate of 0.01. To speed up the search for the best hyper-parameter combination, an early stopping strategy was employed, with a maximum of 60 epochs: if the validation results over ten consecutive epochs did not exceed the previous best value, training was automatically stopped. In Figure 6, we show the variation of training loss and validation loss during the training process for a sub-model. The early stopping strategy stopped the training at the 47th epoch, as the best validation result appeared at the 36th epoch.
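The early stopping rule can be sketched as below. This is an illustrative implementation under the stated patience of 10 and cap of 60 epochs; the authors' exact bookkeeping (e.g. whether the stop is checked before or after evaluation) may differ slightly, so the 47th-epoch figure from the text is not reproduced here:

```python
def train_with_early_stopping(val_scores, patience=10, max_epochs=60):
    """Stop when the validation metric has not improved for `patience` epochs.
    `val_scores` stands in for one validation evaluation per training epoch.
    Returns (best_epoch, stopping_epoch)."""
    best, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # patience exhausted: stop here
    return best_epoch, min(len(val_scores), max_epochs)

# Best score at epoch 3, then a plateau: training stops 10 epochs later.
print(train_with_early_stopping([0.1, 0.2, 0.3] + [0.3] * 20))  # (3, 13)
```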

Feature contribution analysis.
In order to analyze the feature contribution, we evaluated the performance of each sub-model. In each experiment, all sub-models containing a specific feature were removed and the remaining sub-models were used to form a new weighted average ensemble model. For example, the 10 sub-models containing the red band as one of their input channels were removed, and the remaining 10 sub-models were used to construct a new ensemble model; comparing the new and original ensemble models then gives the contribution of the red band.
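This ablation can be sketched as follows (the `ablate` helper is illustrative; the counts follow from the combinatorics of the 20 feature triples):

```python
from itertools import combinations

FEATURES = ["red", "green", "blue", "NIR", "SWIR", "NDSI"]
SUB_MODELS = list(combinations(FEATURES, 3))  # the 20 feature triples

def ablate(feature):
    """Sub-models left after removing every one that uses `feature`."""
    return [m for m in SUB_MODELS if feature not in m]

# Removing a feature drops the 10 sub-models containing it;
# the C(5, 3) = 10 triples drawn from the other five features remain.
print(len(ablate("red")))  # 10
```

The remaining 10 sub-models would then be re-ensembled with the same validation-mIoU weighting, and the change in test mIoU attributed to the removed feature.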

Results
The predicted masks from several sub-models and the proposed ensemble model are visualized in Figure 7. As the true color imagery in Figure 7a shows, the test site in Chile has many clouds, mainly on the right side of the image. The commonly used R_G_B sub-model has the worst performance in distinguishing clouds from snow among the sub-models shown in Figure 8, leaving many false positive pixels there (i.e. non-snow pixels incorrectly identified as snow pixels). In other words, the R_G_B sub-model incorrectly identified cloud pixels as snow. The sub-models trained with B_NIR_SWIR and G_NIR_SWIR, as well as the proposed ensemble model, can solve this problem to some extent. Note that the B_NIR_SWIR sub-model has the best mIoU among the 20 tested sub-models. In addition, none of the models in Figure 7 are good at detecting snow on the shaded side of mountains, which was mostly incorrectly identified as non-snow area.

Comparison between the ensemble model and the SIs
The predicted masks from the ensemble model can remove the majority of cloudy areas from the snow mask in the sites of Chile (Figure 7) and Canada (Figure 9). In addition, all water bodies and ice-covered regions can also be completely identified as non-snow areas by the proposed ensemble model, especially in the Chinese site (Figure 10). In contrast, the SI method exhibited better performance in identifying cloudy areas as non-snow regions in the sites of Chile (Figure 7) and Switzerland (Figure 11), but it failed in the Canadian site (Figure 9). In addition, it is worth noting that the SI method is not as good as the ensemble model at identifying large ice areas as non-snow areas in the Chinese site (Figure 10), with the whole ice-covered lake identified as snow area by the SI method. Moreover, when applied to identify water bodies as non-snow areas, the SI method always failed to detect the correct boundaries between water and ground. The proposed ensemble model also exhibited superior performance in detecting snow on the mountainside of the Chile site, which was incorrectly identified as non-snow area by the SI method (Figure 7). Also, the SI method is not good at detecting areas with relatively little snow, leading it to incorrectly identify non-snow pixels as snow (blue pixels in the middle-upper part of Figure 9b and the upper left corner of Figure 12b). We can also note that the proposed ensemble model performs better than the SI method when distinguishing snow and non-snow pixels in the gully area of Switzerland, with the SI method wrongly identifying the whole gully area as snow pixels (the large blue area in the upper left part of Figure 11b). Figure 8 shows the performance of the 20 DeepLabV3+ sub-models in the Canadian test site. The G_NDSI_SWIR sub-model has the best mIoU, equal to 0.8279, followed by the B_NDSI_SWIR sub-model with an mIoU of 0.8223.
By contrast, the B_NIR_SWIR sub-model has the lowest mIoU, equal to 0.7811. It is also noticeable that the R_G_B sub-model showed a relatively low mIoU (0.7902) in this site. All other sub-models exhibited mIoU values from 0.80 to 0.81, except for the B_NIR_SWIR, G_NIR_SWIR, G_B_SWIR, and R_G_B sub-models. Figure 9 shows the predicted masks from several sub-models and the proposed ensemble model in the test site of Canada. The true color imagery for this Canadian test site is in Figure 9a. It is worth noting that all DeepLabV3+ sub-models and the ensemble model shown in Figure 9 cannot perfectly detect the cloudy area in the upper left corner of the imagery and incorrectly identify this cloudy area as snow pixels. Except for the R_G_B sub-model, the other DeepLabV3+ sub-models in Figure 9 can distinguish the cloudy area right in the heart of the image. Note that the G_NDSI_SWIR sub-model has the best mIoU among the 20 sub-models, while the B_NIR_SWIR sub-model has the lowest one. Besides, it is obvious that the B_NIR_SWIR sub-model is particularly bad at detecting boundaries between snow and non-snow areas compared to the other sub-models, contributing to incorrectly identified non-snow pixels as snow (the blue pixels in the right-middle part of Figure 9f). However, all DeepLabV3+ sub-models and the ensemble model can perfectly remove water body areas from the predicted masks, especially the boundaries between water and ground. Figure 13 shows the performance of the 20 DeepLabV3+ sub-models in the test site of the United States. The mIoU values for these sub-models range from 0.8439 to 0.894, with the best score from the G_B_SWIR sub-model and the worst from the R_NDSI_SWIR sub-model.
The G_B_NIR and R_G_NIR sub-models also performed better than the others, with mIoU values of 0.8857 and 0.8832, respectively. Except for the worst sub-model, the mIoU values of the others are all above 0.85. Even so, the R_G_B sub-model is still not the most outstanding one among these sub-models. Figure 12 shows the predicted masks from several sub-models and the proposed ensemble model in the test site of the United States. The true color imagery for this test site is in Figure 12a. All the DeepLabV3+ sub-models and the ensemble model are good at correctly identifying water body areas as non-snow. As for the ensemble model and the G_B_SWIR sub-model, it is noticeable that they can correctly identify the non-snow parts of the mountain cliffs; furthermore, some of the false positive and true negative pixels there may consist of shadowed snow. As in the previous sites, false positive pixels tend to occur at the boundaries between snow and ground areas. On the other hand, the DeepLabV3+ sub-models exhibited outstanding performance in detecting non-snow pixels in areas with sparse, loosely distributed snow (upper left part in Figure 12c-f). Figure 14 shows the performance of the 20 DeepLabV3+ sub-models and the ensemble model in the Chinese test site. The B_NIR_NDSI sub-model has the best mIoU, equal to 0.954, followed by the R_NDSI_SWIR sub-model with an mIoU of 0.9508. By contrast, the NIR_NDSI_SWIR sub-model has the lowest mIoU, equal to 0.9303. It is also noticeable that the R_B_NIR sub-model showed a relatively low mIoU (0.9352) in this site. All other sub-models exhibited mIoU values from 0.94 to 0.95. The performance of all models in this site is clearly better than in the previous sites. Figure 10 shows the predicted masks from several sub-models and the proposed ensemble model in the test site of China. The true color imagery for this test site is in Figure 10a. The aim of this site is to examine the model performance when there is a large ice-covered area.
It is worth noting that all DeepLabV3+ sub-models and the ensemble model shown in Figure 10 can detect the ice-covered lake as non-snow area, but the SI method fails to do so. In the NIR_NDSI_SWIR sub-model, however, there is still a small ice-covered area that is incorrectly identified as snow. Figure 15 shows the performance of the 20 DeepLabV3+ sub-models and the ensemble model in the Switzerland test site. The sub-model trained with the green, blue, and SWIR bands (G_B_SWIR) has the best mIoU, equal to 0.8107, while the worst one is the R_G_B sub-model with an mIoU of 0.78. Besides, the G_NDSI_SWIR, B_NDSI_SWIR, R_G_SWIR, and G_B_NIR sub-models also perform better than the others, with mIoU values of 0.8083, 0.8066, 0.8029, and 0.8021, respectively. However, the overall performance of the 20 DeepLabV3+ sub-models in the Switzerland test site is worse than in the previous sites: twelve of the 20 sub-models in this site have an mIoU value lower than 0.8. Figure 11 exhibits the predicted masks from several sub-models and the ensemble model in the test site of Switzerland. The true color imagery for this test site is in Figure 11a. It is noticeable that all DeepLabV3+ sub-models and the ensemble model shown in Figure 11 cannot perfectly detect the cloudy area and incorrectly identify those cloudy pixels as snow pixels (large blue areas). However, the ensemble model and the G_B_SWIR and B_NIR_SWIR sub-models perform better than the R_G_B model in detecting cloudy pixels. All models appearing in Figure 11 perform badly within heterogeneous topography areas. For example, among the orange pixels in the middle of the imagery there is a large false-negative area, which means that the snow in this gully was incorrectly identified as non-snow area by the models. On the other hand, all DeepLabV3+ sub-models and the ensemble model can perfectly remove water body areas. Figure 16a shows the performance of the 20 DeepLabV3+ sub-models across the five test sites.
The G_NDSI_SWIR sub-model has the best mIoU, equal to 0.8873, followed by the G_B_SWIR sub-model with an mIoU of 0.8872. By contrast, the R_G_B sub-model has the lowest mIoU, 0.8683, among the 20 sub-models. All other sub-models exhibited mIoU values from 0.87 to 0.88.

Overall performance
The performance of the 20 DeepLabV3+ sub-models in the Chile test site is shown in Figure 7b. The mIoU values of the 20 sub-models range from 0.8208 to 0.8704. The B_NIR_SWIR sub-model has the best mIoU, equal to 0.8704, while the worst one is the R_G_B sub-model with an mIoU of 0.8208. Besides, the G_NIR_SWIR, B_NDSI_SWIR, G_NDSI_SWIR, and R_B_NDSI sub-models also perform better than the others, with mIoU values of 0.8692, 0.8684, 0.8669, and 0.8655, respectively. However, the R_G_NIR, G_B_NIR, and R_B_NIR sub-models perform relatively worse, with mIoU values of 0.8473, 0.8395, and 0.8347.

Feature contribution
The results of the feature contribution analysis are shown in Table 5. In terms of the overall performance across the five test sites, the mIoU value of the ensemble model decreased when sub-models including green, blue, NDSI, or SWIR were removed; the decrease was largest when removing SWIR, from 0.8861 to 0.8850. In the test site of Canada, the ensemble model performance only got worse after removing the sub-models including NDSI, with the mIoU value decreasing from 0.8158 to 0.8106. For the Chile site, if sub-models including the NDSI or SWIR bands were removed, the mIoU value would decrease from 0.8683 to 0.8655 and 0.8612, respectively.
In Table 5, each column represents the ensemble result of the remaining sub-models after removing those sub-models containing the specific feature; figures in bold represent the largest mIoU value in each site.
However, the situation in the United States site is different: the model performance increased from 0.8732 to 0.8776 and 0.8739, respectively, if NDSI or SWIR was not included, whereas removing the sub-models that contained the green or blue bands decreased the model performance in this site. In the Switzerland site, removing any band except red always decreased the model performance. As for the Chinese site, excluding the sub-models containing red, blue, NIR, NDSI, or SWIR always decreased the model performance. It is clear that NDSI is especially important in the China and Canada sites, since the mIoU values decreased most when the sub-models containing NDSI were deleted. SWIR is the most important feature in Chile and Switzerland. It is noticeable that green is the feature that most influences the ensemble model in the United States.

Comparison between the ensemble model and the SI baseline
The proposed ensemble model outperforms the SI baseline in all test sites except for Switzerland, where the heterogeneous topography makes the definition of the ground truth more difficult. The proposed ensemble model exhibited superior performance in identifying ice in water as non-snow areas. In particular, in the China site, the mIoU of the proposed ensemble model is 0.9538, while it is only 0.4733 for the SI method. This difference in performance is due to a large ice-covered lake in this site, which the ensemble method, in contrast to the SI method, is capable of recognizing as non-snow pixels. The predicted mask of the SI method came from removing the NDWI mask from the NDSI mask. The ice in the water affects the spectral signature of the water and thus the accuracy of the NDWI mask, contributing to the difficulty of identifying ice-covered areas as water pixels. A possible way to address this issue is to alter the formula of the NDWI by adding a coefficient to the NIR band to modify the index's sensitivity to the NIR channel. However, as the optimal thresholds for the NDWI and NDSI masks were calculated using the whole training set, which may not contain large ice-covered areas in every image, the NDWI calculation was not redone, to ensure a fair comparison across the tested methods. This shows that the generalization ability of the SI method is poor to a certain extent: for a global model, it is inefficient to alter the formula for each target site. The Chile site is characterized by many small cloudy areas, which have a color similar to snow. Here, both the ensemble model and the SI method demonstrated a good ability to classify clouds. However, in the site of Canada, both methods failed to classify the cloudy area in the upper left corner as non-snow pixels, since the cloudy area there completely overlaps with snow. It is difficult to distinguish the boundary in this kind of situation.
Therefore, the proposed ensemble model showed a good capability to identify clouds when they do not completely overlap with snow. This issue could also be improved in Switzerland, where other approaches have been developed using Sentinel-1 (LIEVENS et al. 2022). In the test site of the United States, the proposed ensemble model exhibited a better ability to identify thin snow. In those thin snow areas, the contrast between ground and snow is not obvious. The SI method only takes the value of each pixel into consideration when detecting snow and ignores the contextual information of the image, producing plenty of false positive pixels in the predicted mask of the United States. This test site demonstrates the merit of using our ensemble model, which, by considering the neighborhood information of each pixel, achieves a better performance. In the Switzerland test site, however, the mIoU value of the proposed ensemble model is lower than that of the SI method. This is because the proposed ensemble model failed to identify snow pixels in very heterogeneous topography areas, which contributed to a lot of false negative pixels. The illumination conditions are poor in heterogeneous topography areas, so the reflectance in the red, green, and blue bands may be affected. This is supported by the slight improvement appearing in the predicted mask from the NIR_NDSI_SWIR model, which excludes the red, green, and blue bands (Figure 17). According to the feature contribution analysis, NDSI is an important feature for segmenting snow, since the ensemble model performance in four out of five sites would decrease if sub-models containing NDSI were removed, especially for the sites in China and Canada. Similarly, SWIR is the most important feature in Chile and Switzerland, since these two sites contain many cloudy areas.
The proposed ensemble model was built by performing the weighted mean of 20 sub-models so that it has good robustness and generalization ability compared with each sub-model. The three-channel DeepLabV3 + model was utilized in this study but can be extended to more channels in further study.

Ensemble model
The proposed ensemble was built by taking the weighted mean of 20 sub-models, so it has good robustness and generalization ability compared with each individual sub-model. It is clear from the above results that the proposed ensemble model does not outperform the best sub-model in every site. However, which sub-model performs best varies from site to site.
Considering that the proposed ensemble model consistently exhibited strong performance in every site, the benefit of employing an ensemble model is that relatively accurate predictions can still be obtained in any new place. Comparing our ensemble model with the SI method, the test site in the United States demonstrates the merit of considering the neighborhood information of each pixel: the proposed model achieves a better performance in detecting thin snow areas, where the contrast between ground and snow is not obvious.

Exploration of different input channels for DeepLabV3 +
The commonly used input channels for CNNs in image segmentation are the three channels of RGB imagery (ZHAN et al. 2017; SYRRIS et al. 2019; ZHENG et al. 2021), but the results are sometimes not ideal. It is clear from our results that RGB always showed worse performance on snow segmentation than other feature combinations, since clouds and snow have similar color and shape in RGB images. From the analysis performed in this work, we have demonstrated that the R_G_B sub-model is not good at identifying clouds as non-snow pixels. Other implementations of RGB images in snow and cloud segmentation also support this point. For example, DeepLabV3+ models with Xception and ResNet50 backbones were employed for cloud and snow segmentation and only achieved mIoU values of 0.769 and 0.805 on 23,520 RGB images from the TH-1 satellite dataset (ZHENG et al. 2021). Even though those authors proposed an end-to-end DCNN framework, its mIoU value is still only 0.811 when using RGB images. It is also worth noting that the best feature combination differs in each site. For instance, the best sub-model is G_B_SWIR in the United States and Switzerland, but the best ones for Chile, Canada, and China are B_NIR_SWIR, G_NDSI_SWIR, and B_NIR_NDSI, respectively. This is due to the different characteristics of each site. Clouds exist in the sites of Switzerland, Chile, and Canada, so their best sub-models all contain SWIR. Introducing SWIR as one of the channels used to build DeepLabV3+ was found to be advantageous, since it helps discriminate clouds from snow (VARSHENY 2019). In the SWIR band, snow has lower reflectance than clouds, making it easier to separate them. The advantage of SWIR is also supported by the results of the feature contribution analysis: the ensemble model performance would decrease in the Chile and Switzerland sites if all sub-models containing the SWIR channel were removed.
It is also clear that NDSI is an important feature for CNN-based snow segmentation, since the ensemble model's performance in four out of five sites would decrease if the sub-models containing NDSI were deleted. Especially in China and Canada, the mIoU values decreased the most in this case. The China and Canada sites are characterized by large areas of thick snow but relatively few clouds compared with the other sites. Therefore, the advantage of the NDSI channel is that it can especially improve model performance on large continuous areas of snow.

Limitations
The proposed ensemble model is unable to discriminate cloudy areas when clouds and snow completely overlap, as seen in the Canada and Switzerland sites. This is probably because it is difficult to find the boundary between clouds and snow when they overlap, since they have very similar color and spectral signatures. Some sub-models built with different input channels can alleviate this to some extent, but they still cannot perfectly identify such clouds as non-snow pixels, because the reflectance in those overlapping areas resembles that of snow pixels. Another limitation of this work is that we only conducted the experiment on a small subset due to limited computing resources. A larger dataset may lead to a better performance for our proposed model, which we leave to future work.

Conclusion
In this study, an ensemble CNN model with DeepLabV3+ sub-models was proposed for snow segmentation on medium resolution satellite imagery (Sentinel-2), which especially aims to distinguish clouds and water bodies from snow. To overcome the limitation of DeepLabV3+, which only accepts three channels as input features, and the need to use six features (green, red, blue, NIR, SWIR, and NDSI), 20 three-channel DeepLabV3+ sub-models were constructed with different combinations of three features and then ensembled together. Twenty is the number of possible combinations of size 3 when six features are available (6 choose 3). The final ensemble model was based on the weighted mean of the 20 sub-models, which takes advantage of different combinations of the six features. According to the presented results, the proposed ensemble model outperforms the traditional spectral indices method when segmenting snow: it outperforms the SI method in all test sites except for Switzerland. In addition, the proposed ensemble model exhibited superior performance in identifying ice in water as non-snow area, where the traditional spectral indices method failed. Besides, our ensemble model can also detect cloudy areas as non-snow pixels if the clouds do not completely overlap with snow. It was found that the performance of each sub-model varies among the sites. The best sub-model is G_B_SWIR in the United States and Switzerland, but the best ones for Chile, Canada, and China are B_NIR_SWIR, G_NDSI_SWIR, and B_NIR_NDSI, respectively. It is also clear from our results that the commonly used RGB combination always showed worse performance on snow segmentation than other feature combinations.

Disclosure statement
No potential conflict of interest was reported by the author(s).