An improved deep learning network for AOD retrieval from remote sensing imagery focusing on sub-pixel clouds

ABSTRACT Following the success of MODIS, several widely used algorithms have been developed for different satellite sensors to provide global aerosol optical depth (AOD) products. Despite progress in improving the accuracy of satellite-derived AOD products, the presence of sub-pixel clouds and the corresponding cloud shadows still significantly degrades AOD products, because sub-pixel clouds are difficult to identify, which inevitably leads to overestimation of AOD. To overcome these problems, we propose an improved deep learning network for retrieving AOD from remote sensing imagery that focuses specifically on sub-pixel clouds, which we call the Sub-Pixel AOD network (SPAODnet). Two specific improvements addressing sub-pixel clouds have been made: a spatially adaptive bilateral filter is applied to top-of-atmosphere (TOA) reflectance images to remove the noise induced by sub-pixel clouds and the corresponding shadows in the first place, and a channel attention mechanism is added to the convolutional neural network to further emphasize the relationship between the uncontaminated pixels and the ground-measured AOD from AERONET sites. In addition, a compositive loss function, the Huber loss, is used to further improve the accuracy of the retrieved AOD. The SPAODnet model is trained using ten AERONET sites within the Beijing-Tianjin-Hebei (BTH) region in China, along with their corresponding MODIS images from 2011 to 2020; the trained network is then applied over the whole BTH region, and AOD images over the region from 2011 to 2020 are retrieved. Based on a comprehensive validation against ground measurements, the MODIS products, and AOD retrieved from another neural network, the proposed network significantly improves the overall accuracy, spatial resolution, and spatial coverage of the AOD, especially for cases with sub-pixel clouds and cloud shadows.


Introduction
AOD is a crucial parameter in meteorological observation and represents aerosol extinction effects in the vertical atmospheric column. It is a fundamental optical property of aerosol derived from satellite imagery (Guo et al. 2020; Li, Carlson, and Lacis 2015; Lin et al. 2015; Martins et al. 2002; Su et al. 2017, 2020; Van Donkelaar, Martin, and Park 2006). Following the success of MODIS, several algorithms have been developed to retrieve AOD from satellite imagery. The most well-known are the Dark Target (DT) algorithm (Gupta et al. 2016; Jackson et al. 2013; Kaufman et al. 1997; Levy, Remer, and Dubovik 2007) and the Deep Blue (DB) algorithm (Hsu et al. 2004, 2006, 2013; Lyapustin et al. 2018). The DT algorithm estimates the reflectance of "dark targets" such as dense vegetation and dark soils to decouple the ground and atmosphere and derive AOD. However, it generally overestimates AOD and is limited to "dark target" regions. The DB algorithm uses the minimum reflectivity method to build a surface reflectivity database from which AOD is derived; it is stable and can derive AOD over bright surfaces. However, it is not designed to handle surface anisotropy, which increases errors in surface reflectivity determination, particularly in urban areas with complex surfaces. Additionally, the DB algorithm employs variance statistics to further reduce the impact of cloudlets on retrievable pixels (Bian et al. 2018, 2022; Wang et al. 2020). Other algorithms, such as MAIAC, can retrieve aerosol properties over both dark and bright surfaces (Lyapustin et al. 2018). Multispectral lidar measurements are also utilized in aerosol studies (Pérez-Ramírez et al. 2019, 2020; Young et al. 2018).
Based on the above-mentioned algorithms, long-term AOD products with extensive coverage from polar-orbiting satellites have been released and widely used (Su et al. 2020). However, satellite-based AOD retrieval relies on clear skies and appropriate surface conditions (Kim et al. 2008, 2020). Cloud cover is the most significant factor affecting the retrieval of satellite-derived aerosols (Winker et al. 2009; Winker, Pelon, and Patrick McCormick 2003). It should not be overlooked that the cloud screening process may, to a certain extent, miss cirrus or cloud edges, where thin clouds appear as sub-pixel clouds in low-resolution sensors (Lin et al. 2016). Despite improved detection of clouds and snow through quality screening in data pre-processing, residual clouds and sub-pixel clouds can still lead to an overestimation of AOD from the MAIAC algorithm (Li, Carlson, and Lacis 2015).
To better illustrate the impact of sub-pixel clouds on low-resolution images and the difficulties involved in their identification, we present Figure 1. This figure shows an example of sub-pixel clouds in low-resolution imagery (Figure 1(c)) and the corresponding high-resolution imagery (Figure 1(b)) at the same location and time. The clouds are easily recognizable in high-resolution imagery, but low-resolution cloud products (Figure 1(d)) cannot identify them accurately. As a result, pixels with unidentified clouds in low-resolution imagery are treated as "clear" pixels, leading to overestimation of AOD.
Sub-pixel clouds are especially difficult to identify and remove (Motohka et al. 2011). Noise due to irremovable sub-pixel clouds is one of the most severe problems for AOD retrieval using satellite sensors with low-to-medium spatial resolution such as MODIS. Although cloud masks have been produced and demonstrated to be effective (Platnick et al. 2003), they are not always able to completely offset the contamination caused by clouds (Nagai et al. 2010). Residual sub-pixel clouds potentially contaminated about 40% of MODIS data after cloud screening by the state flag of the surface reflectance product (Motohka et al. 2011). Despite the development of an efficient method suitable for widespread applications, which utilizes spatiotemporal optimization algorithms for gap filling in products (Zhang et al. 2022), the key challenge in low-resolution satellite aerosol retrieval remains how to enhance retrieval accuracy and ensure spatiotemporal continuity, especially under sub-pixel cloud conditions. Until now, the strong influence of sub-pixel clouds on AOD retrieval has remained an unsolved problem. Data screening based on variances has improved the accuracy of AOD products through strict threshold setting, but a large number of retrievable pixels have been excluded, greatly reducing the spatial coverage of retrieved AOD. Consequently, increasing the coverage of retrieved AOD while maintaining accuracy becomes a dilemma.
With the accumulation of satellite data and long-term ground observations, big data analytics (Chen et al. 2022) and machine learning algorithms (Kang et al. 2022) can be employed to construct a better fit between satellite data and ground observations and thus resolve the above dilemma. We present an improved deep learning model incorporating a bilateral filter and a channel attention mechanism to robustly retrieve AOD over the BTH region. The proposed model is specially designed to focus on sub-pixel clouds, and we call it SPAODnet. Two specific improvements addressing sub-pixel clouds have been made: a bilateral filter is applied to the TOA reflectance images to remove the noise induced by sub-pixel clouds and the corresponding shadows in the first place, and a channel attention mechanism is added to the convolutional neural network to further emphasize the relationship between the uncontaminated pixels and the ground-measured AOD from AERONET sites. In addition, a compositive loss function, the Huber loss function, is used to further improve the accuracy of the retrieved AOD. Finally, daily AOD over the BTH region for ten years (2011 to 2020) has been retrieved and is used for model validation. The retrieved AODs are evaluated against AERONET AOD, aerosol products, and AODs derived from other neural network approaches in terms of accuracy and spatial coverage.

Study area
The study area is the BTH region in China, located within geographical coordinates 113.27°E-119.50°E, 36.05°N-42.4°N, covering an area of 0.216 million square kilometers (Figure 2). As an important economic core area with heavy industry in northern China, it has experienced prolonged heavy-pollution weather in recent decades, and its aerosol properties vary dramatically across seasons. In addition, well-maintained AERONET stations have operated there for many years, so abundant AOD samples can be collected over a long period to support the proper training of SPAODnet and the validation of AODs retrieved from the trained network. The diversity of aerosols and the data availability therefore make the BTH region an excellent study area.

Data
In this study, TOA reflectance and surface reflectance from MODIS, the corresponding solar-viewing geometries (solar zenith, solar azimuth, viewing zenith, and viewing azimuth angles), and the ground-measured AOD from AERONET sites are selected as the training data and the corresponding labels, respectively. Multiple pre-processing steps are applied to regularize the training data fed into the SPAODnet model, including variable transformation, spatial correlation correspondence, and collocation at different temporal and spatial resolutions.

AERONET AOD
The labels for training SPAODnet over the study area come from the ground-based observations of AERONET. AERONET is a global network that observes aerosol parameters on the ground using sun photometers and has been commonly used for validating satellite-derived AOD products (Bilal et al. 2013; Gao et al. 2021; van Donkelaar et al. 2013). AOD data are taken from the AERONET Version 3 Direct Sun Algorithm product (Giles et al. 2019), which provides observations at three quality levels: level 1.0 (L1, unscreened), level 1.5 (L1.5, cloud-screened and quality controlled), and level 2.0 (L2, quality-assured) (Wang et al. 2020). To better illustrate the data available for this study, the available period of the ten sites and their corresponding land cover types are listed in Table 1. The Beijing and Beijing-CAMS sites lie within the 5th Ring Road and share the same urban surface type (Wang et al. 2020; Wei et al. 2018). XiangHe is a typical suburban site located between the two megacities of Beijing and Tianjin. This site is surrounded by cropland and experiences both natural and anthropogenic aerosols of urban, rural, or mixed origin (Li et al. 2008).
The level 1.5 AOD measurements from 2011 to 2020 are employed as the ground truth in this study. To accord with the satellite-retrieved AOD, AOD ground measurements from the adjacent bands at 500 nm and 675 nm were interpolated to 550 nm using the Ångström exponent (Ångström 1929; Eck et al. 1999), defined as Eqs. (1)-(3):

τ(λ) = β λ^(−α) (1)

α = −ln[τ(λ_1)/τ(λ_2)] / ln(λ_1/λ_2) (2)

τ(λ_0) = τ(λ_1) (λ_0/λ_1)^(−α) (3)

where τ(λ) is the AOD at wavelength λ, α is the Ångström wavelength exponent, β is the Ångström turbidity coefficient, and λ_0, λ_1, and λ_2 are the wavelengths 550 nm, 500 nm, and 675 nm, respectively.
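The interpolation in Eqs. (1)-(3) can be sketched as follows (a minimal Python sketch; the function name and inlined wavelength constants are our own, not part of the original processing code):

```python
import numpy as np

def interpolate_aod_550(tau_500, tau_675):
    """Interpolate AOD to 550 nm from the 500 nm and 675 nm channels
    via the Angstrom power law tau(lambda) = beta * lambda**(-alpha)."""
    lam0, lam1, lam2 = 0.550, 0.500, 0.675  # wavelengths in micrometers
    # Angstrom exponent from the two adjacent wavelengths (Eq. 2)
    alpha = -np.log(tau_500 / tau_675) / np.log(lam1 / lam2)
    # Power-law interpolation from 500 nm to 550 nm (Eq. 3)
    return tau_500 * (lam0 / lam1) ** (-alpha)
```

For typical continental aerosol, the interpolated 550 nm value lies between the 675 nm and 500 nm measurements, which provides a quick sanity check on the implementation.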

Satellite data
MODIS is a widely used sensor aboard the Terra and Aqua polar-orbiting satellites (Wang et al. 2020), providing rich spectral information, a wide viewing swath, and frequent global coverage. MODIS data were obtained from the Goddard Space Flight Center (GSFC) Level-1 and Atmosphere Archive and Distribution System (LAADS). Many data products convenient for scientific research have been derived from MODIS data (Chen et al. 2020; Levy et al. 2009; Liang, Sun, and Haoxin 2021; Wei et al. 2019), so abundant related research is available for comparison. In this research, TOA reflectance, surface reflectance, and viewing angles are the main feature vectors in the SPAODnet model.
The M×D04 aerosol products, which combine the dark target and deep blue algorithms, are used for comparative verification of SPAODnet. Table 2 summarizes the characteristics of the MODIS data used and the spectral measurement information of the MODIS sensor.
During the study period, band 5 (B5) damage was observed on both MODIS sensors, while band 6 (B6) damage was observed only on MODIS/Terra (Chen et al. 2020); therefore, only bands 1, 2, 3, 4, and 7 of MOD02HKM and MOD09 are extracted as inputs. MOD03 provides the solar-viewing geometry, including solar zenith, solar azimuth, viewing zenith, and viewing azimuth angles, which are also necessary input features for AOD retrieval. In addition, to retrieve AODs under clear-sky conditions, the cloud mask MOD35 is applied to screen out pixels with thick clouds and snow. To obtain a reasonable sub-region size of input data, a cloud-free threshold test is performed. First, the MOD35 cloud mask product is used to identify MODIS pixels containing ice, snow, ocean, and clouds, and these pixels are set to 0. Then, the cloud-free ratio is computed as the ratio of cloud-free land pixels to the total number of pixels. If the ratio is less than 90%, the entire sample is removed from the dataset. To determine the cloud-free threshold in SPAODnet, samples with 80%, 90%, 95%, and 100% cloud-free pixels were tested. There was no significant difference between the results obtained with 90%, 95%, and 100% cloud-free pixels, but when the threshold was set below 90%, the validation results showed lower accuracy. Therefore, a cloud-free threshold of at least 90% is recommended in SPAODnet for optimal results. Finally, the features of cloud-free pixels are extracted to construct the nonlinear relationship between satellite observations and ground observation sites.
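The cloud-free threshold test described above can be sketched as follows (an illustrative Python sketch, assuming contaminated pixels have already been set to 0 per the MOD35 screening; the function name is hypothetical):

```python
import numpy as np

def passes_cloud_free_test(cloud_mask, threshold=0.90):
    """Keep a sub-region sample only if the fraction of clear land pixels
    meets the cloud-free threshold (the paper recommends >= 90%).

    cloud_mask: 2-D array where contaminated pixels (ice, snow, ocean,
    cloud) have been set to 0 and clear land pixels are nonzero.
    """
    clear_ratio = np.count_nonzero(cloud_mask) / cloud_mask.size
    return clear_ratio >= threshold
```

For a 12 × 12 sub-region (144 pixels), this admits samples with up to 14 contaminated pixels at the 90% threshold.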

Training set constructing
Prior to performing the AOD retrieval, the MODIS data and ground-measured AOD data are combined into a training dataset for the SPAODnet model through spatial-temporal collocation, which ensures spatial and temporal consistency. Assuming that aerosol is relatively homogeneous within a certain time-space boundary (Anderson et al. 2003), the geo-location (geographical latitude and longitude) of each AERONET site is used as the center point to reproject and subset the corresponding TOA reflectance (MOD02), solar-viewing geometry (MOD03), surface reflectance (MOD09), and cloud mask (MOD35) images, aligning all the satellite data. The AERONET AOD data were averaged over a ±30 min window around the MODIS imaging time to reduce anomalous perturbations.
After the above processing, the inputs, comprising sub-regions of the TOA reflectance (MOD02), solar-viewing geometry (MOD03), surface reflectance (MOD09), and cloud mask (MOD35) images, together with the ground truth (AOD ground measurements), compose a cloud-free, temporally and spatially consistent dataset (Ichoku et al. 2002; Remer et al. 2005). In this study, 4176 data records are collected for SPAODnet from ten AERONET sites over a ten-year period. In deep learning, common methods for dividing a dataset include hold-out, cross-validation, and bootstrap sampling. We use the hold-out method, which randomly divides the pre-matched dataset into three sets: training, validation, and testing. The training set is used to train the model, the validation set is used to adjust the model's hyperparameters during training and prevent overfitting, and the testing set is used to evaluate the final performance of the model. The size of each set can vary depending on the size of the dataset and the complexity of the model. We shuffle and split the dataset into training, validation, and testing sets at a ratio of 8:1:1; the independent test set, randomly selected across the overall spatial-temporal range, is fed into SPAODnet for stringent validation.
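The 8:1:1 hold-out split can be sketched as follows (a minimal Python sketch; the function name, fixed seed, and use of index arrays are our own choices, not the authors' code):

```python
import numpy as np

def holdout_split(n_samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle sample indices and split them 8:1:1 into
    training / validation / testing index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)  # random shuffle of all indices
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Applied to the 4176 collected records, this yields roughly 3340 training, 417 validation, and 419 testing samples.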

Determining the sub-region size of the input satellite images
Since aerosols change slowly in space (Zhong et al. 2017) while a single pixel may be affected by cloudlets and sub-pixel clouds, sub-region images centered on AERONET sites are extracted to obtain spatially autocorrelated observational information as input. However, sub-regions smaller than 3 × 3 pixels, the size of the convolution kernel, are not recommended. A sub-region size of 12 × 12 pixels (12 × 12 km²) is selected for two reasons. First, this size is large enough to capture a variety of spatial variability scales. Second, it reduces the number of model parameters, following the assumption that aerosol is uniformly distributed over a certain range. The sub-region size is a hyperparameter of the SPAODnet model that directly affects performance; it is assessed using two metrics, coverage and accuracy, to determine the optimal value. The comparison results are listed in Table 3, and the best performance is obtained with a sub-region size of 12 × 12 km² for the satellite images. The detailed results are shown in Section 4. Overall, the windows of 12 × 12 km² and 8 × 8 km² have higher AOD site coverage, and the ratio of Within EE performs well when the sub-region size is set to 12.
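Extracting a sub-region centered on a site pixel can be sketched as follows (illustrative only; the function name `extract_subregion` and the centering convention are assumptions):

```python
import numpy as np

def extract_subregion(image, center_row, center_col, size=12):
    """Extract a size x size sub-region (default 12 x 12 pixels, i.e.
    12 x 12 km^2 at 1 km resolution) around an AERONET site pixel."""
    half = size // 2
    r0, c0 = center_row - half, center_col - half
    return image[r0:r0 + size, c0:c0 + size]
```

In practice the same window is cut from every input layer (TOA bands, surface reflectance, angles, cloud mask) so the channels stay co-registered.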

Methodology
The flowchart of SPAODnet for retrieving AOD from MODIS data is shown in Figure 3. A convolutional neural network (CNN), which has excellent performance on nonlinear fitting problems, is used to model the relationship between the TOA reflectance of satellite images and AOD for successful AOD retrieval (Chen et al. 2020). Additionally, the proposed SPAODnet model incorporates two core components into the CNN: spatially adaptive bilateral filtering and a channel attention mechanism. These components aim to reduce the impact of sub-pixel clouds and the corresponding cloud shadows.

The theoretic basis of AOD retrieval
The SPAODnet is theoretically based on the atmospheric radiative transfer equation (Kaufman et al. 1997). The radiation received by a satellite is a combination of scattering from the Earth's atmosphere and reflection from the surface. Assuming that the land surface is a uniform Lambertian surface and that the atmosphere attenuates uniformly in the vertical direction, the radiance L received by the satellite can be depicted as Eq. (4):

L(μ_s, μ_v, φ) = L_0(μ_v) + μ_s F_0 ρ T↓(θ_s) T↑(θ_v) / [π (1 − ρS)] (4)

where μ_s F_0 is the radiation flux density at the top of the atmosphere in the direction perpendicular to the sunlight; L_0(μ_v) represents the path radiance in the observation direction; ρ and S denote the surface reflectance and the atmospheric single-scattering albedo, respectively; μ_s = cos θ_s and μ_v = cos θ_v are the cosines of the solar and viewing zenith angles, respectively; T↓(θ_s) and T↑(θ_v) represent the downward and upward total atmospheric transmittance in the solar and satellite directions, respectively; and τ is the atmospheric AOD. With this equation, any one parameter can be solved for when the other factors are known. The atmospheric radiative transfer model solves for the radiation flux density with the above equation, normalized to give the top-of-atmosphere reflectance ρ_TOA.
The TOA reflectance is given by Eq. (5):

ρ_TOA(θ_s, θ_v, φ) = ρ_0(θ_s, θ_v, φ) + ρ_s T↓(θ_s) T↑(θ_v) / (1 − ρ_s S) (5)

where φ is the relative azimuth of the sensor; μ_s and μ_v are the cosines of the solar zenith angle and the satellite viewing angle; ρ_0 is the equivalent reflectance of atmospheric radiation along its transport path; and ρ_s is the surface reflectance of a uniform Lambertian surface. This shows that ρ_TOA is a function not only of the aerosol optical thickness but also of the apparent reflectance. If ρ_s is known, AOD can be obtained for the corresponding image element; this is the principle of AOD retrieval and the basis for the construction of SPAODnet. It is worth mentioning that abnormal values, i.e. pixels whose surface reflectance ρ_s is greater than 0.3 at all wavelengths, are treated as cloudy or snowy weather. The radiative transfer equation describes how solar radiation is transferred through the atmosphere and land surface into the satellite's sensor; in its general form it is a nonlinear integro-differential equation with no analytic solution. Owing to its excellent performance on nonlinear fitting problems (Su et al. 2020; Wang et al. 2019), a CNN is selected for this research as one of the best tools.
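As a numeric illustration of the Lambertian coupling in Eq. (5) (a sketch under the stated assumptions; the function and argument names are ours):

```python
def toa_reflectance(rho_0, rho_s, t_down, t_up, s):
    """TOA reflectance from path reflectance rho_0, Lambertian surface
    reflectance rho_s, downward/upward total transmittances, and
    atmospheric albedo s, per the standard Lambertian coupling formula."""
    return rho_0 + rho_s * t_down * t_up / (1.0 - rho_s * s)
```

The sketch makes the monotonicity explicit: with the atmospheric terms held fixed, a brighter surface (larger ρ_s) always raises ρ_TOA, which is why sub-pixel cloud contamination of the surface term biases the retrieval high.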

Spatial adaptive bilateral filtering for input TOA imagery cleaning
Despite quality control and cloud screening of satellite imagery, residual sub-pixel clouds may contaminate approximately 40% of MODIS data (Motohka et al. 2011), which significantly reduces the accuracy of AOD retrieval from satellite imagery. The presence of sub-pixel clouds increases the TOA reflectance and leads to overestimation of the derived AODs. To eliminate the effect of sub-pixel clouds on TOA reflectance as much as possible while retaining edge features, adaptive bilateral filtering is adopted in this study. Bilateral filtering, defined by Eqs. (6)-(10), takes both spatial closeness and grey-level similarity into consideration while filtering the TOA image. Gaussian functions are used by default as the kernels determining the closeness ω_s(i, j) in the spatial domain and the similarity ω_r(i, j) in the grey-scale domain in Eqs. (7) and (8).
For a pixel (x, y) in a TOA image,

Ĩ(x, y) = (1/ω_p) Σ_{(i,j)∈Ω} ω_s(i, j) ω_r(i, j) I(i, j) (6)

ω_s(i, j) = exp(−[(i − x)² + (j − y)²] / (2σ_s²)) (7)

ω_r(i, j) = exp(−[I(i, j) − I(x, y)]² / (2σ_r²)) (8)

ω_p = Σ_{(i,j)∈Ω} ω_s(i, j) ω_r(i, j) (9)

I = f + n (10)

where Ĩ is the denoised image, ω_s(i, j) is the spatial-domain weight, ω_r(i, j) is the grey-scale-domain weight, ω_p is the normalization parameter, I(i, j) is the pixel value of the noisy image I at (i, j), and Ω is the neighborhood around pixel (x, y). As shown in Eq. (10), f denotes the noise-free image and n is the noise, assumed in this study to be caused by the presence of sub-pixel clouds. From Eqs. (6)-(8), it can be seen that the weight coefficients of the bilateral filter are jointly determined by two parameters, the spatial variance σ_s and the grey-scale variance σ_r. Thus, to eliminate the influence of discrete sub-pixel cloud points on TOA images as much as possible while retaining image-edge features, adaptively adjusting σ_s and σ_r becomes the key issue. In this study, a locally adaptive mechanism dynamically determines the spatial variance σ_s of each pixel by calculating its target scale R_xy, which depicts the extent of smoothness around that pixel through the highest similarity of the pixel with its neighbors. The target scale R_xy characterizes the maximum neighborhood radius of a connected circle within the same target region, as shown in Eq. (11).
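The per-pixel filtering step of Eqs. (6)-(9) can be sketched in plain Python (a naive, unoptimized sketch with fixed variances; a production implementation would vectorize the loops and adapt σ_s per pixel as described below):

```python
import numpy as np

def bilateral_filter_pixel(image, x, y, sigma_s, sigma_r, radius=2):
    """Denoise one pixel with a bilateral filter: weights combine spatial
    closeness (sigma_s) with grey-level similarity (sigma_r), so smooth
    regions are averaged while sharp edges are preserved."""
    h, w = image.shape
    center = image[x, y]
    num, wp = 0.0, 0.0
    for i in range(max(0, x - radius), min(h, x + radius + 1)):
        for j in range(max(0, y - radius), min(w, y + radius + 1)):
            # Spatial-domain Gaussian weight (Eq. 7)
            ws = np.exp(-((i - x) ** 2 + (j - y) ** 2) / (2 * sigma_s ** 2))
            # Grey-scale-domain Gaussian weight (Eq. 8)
            wr = np.exp(-((image[i, j] - center) ** 2) / (2 * sigma_r ** 2))
            num += ws * wr * image[i, j]
            wp += ws * wr  # normalization parameter (Eq. 9)
    return num / wp
```

On a flat region the filter returns the local value unchanged, while a small σ_r keeps pixels on opposite sides of a reflectance step from mixing, which is the edge-preserving behaviour the text relies on.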
For a pixel (x, y) in the image, U_xy(r) defines the similarity between the pixel and its neighbors within the neighborhood boundary region, and T_s is the similarity threshold, set to 0.75 in this study (Saha, Udupa, and Odhner 2000).
In edge and texture regions, a smaller target scale is derived and a smaller spatial variance σ_s is used to retain more detail in the TOA image; conversely, a larger σ_s is derived for better smoothing and denoising. The value of σ_s in the bilateral filter determines the spread of the Gaussian curve in the filter window: the larger σ_s is, the more slowly the Gaussian curve decreases and the more pronounced the smoothing effect; σ_s is defined in Eq. (12). From Eq. (8), it can be seen that when σ_r is set large, the TOA image becomes more blurred; when σ_r is set small, edges and details are clearer and more sensitive to σ_s. From Eq. (13), it can be seen that obtaining the range variance σ_r first requires estimating the noise variance σ_n of the image, and σ_n can be quickly estimated from the Laplace transform as shown in Eq. (14).
W and H are the width and height of the image, ∗ represents the convolution operation, I(x, y) ∗ N denotes the convolution of the image with the mask N, and N is the mask obtained from the discrete Laplace transform, as shown in Eq. (15).
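A fast Laplacian-based noise estimate of the kind referenced by Eqs. (14)-(15) can be sketched as follows (an Immerkær-style sketch assuming the standard mask N = [[1, −2, 1], [−2, 4, −2], [1, −2, 1]]; this is our reading of the cited method, not the authors' exact code):

```python
import numpy as np

def estimate_noise_sigma(image):
    """Estimate the noise standard deviation from the response of the
    discrete Laplacian mask N = [[1,-2,1],[-2,4,-2],[1,-2,1]], which
    suppresses smooth image structure and isolates the noise."""
    h, w = image.shape
    i = image
    # Valid 3x3 convolution with the Laplacian mask (mask is symmetric)
    resp = (i[:-2, :-2] - 2 * i[:-2, 1:-1] + i[:-2, 2:]
            - 2 * i[1:-1, :-2] + 4 * i[1:-1, 1:-1] - 2 * i[1:-1, 2:]
            + i[2:, :-2] - 2 * i[2:, 1:-1] + i[2:, 2:])
    # sqrt(pi/2) / (6 (W-2)(H-2)) * sum |I * N|
    return np.sqrt(np.pi / 2) * np.abs(resp).sum() / (6 * (w - 2) * (h - 2))
```

The normalization works because the mask's coefficients have an L2 norm of 6, so for zero-mean Gaussian noise of standard deviation σ the mean absolute response is 6σ·√(2/π).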
In general, this method allows adaptive correction of TOA images and is effective in regions with sub-pixel clouds: it preserves edge information while denoising TOA images contaminated by sub-pixel clouds, thereby greatly improving the TOA images input to the network.
To better demonstrate the improvement brought by the bilateral filter, Sentinel-2 (1 July 2020, 03:16:49 UTC) and MODIS (1 July 2020, 03:15:00 UTC) images are shown with a sub-pixel cloud in Figure 4(b-c). Figure 4(a) (6 July 2020, 03:16:46 UTC) shows clear weather for comparison with Figure 4(b), demonstrating that the red circle in Figure 4(b) is a sub-pixel cloud that is not observable in MODIS (Figure 4(c)). Figure 4(d) shows the results of bilateral filtering, Gaussian filtering, and thresholding, respectively. Gaussian filtering, based only on distance, removes a large number of edge features, and the thresholding methods hardly eliminate the noise induced by sub-pixel clouds. In contrast, the adaptive bilateral filtering smooths the sub-pixel cloud region while retaining most of the edge information. The red boxes in Figure 4(b) show the sub-pixel clouds and the effects after adaptive bilateral filtering, which provides more accurate TOA feature information to the network for better prediction.

Convolutional neural network architecture combining channel attention
The input features of the SPAODnet model fall into two modalities: the TOA and surface reflectances are images, and the solar-viewing angles are numerical. To feed the matrices (remotely sensed images) and numerical data (observed angles) into a nonlinear regression network, the branches of input 1 and input 2 are designed for multimodal feature extraction, and their feature maps are concatenated; the output of the network, the AOD value, is obtained through dense layers, as shown in Figure 5. To prevent overfitting during training, a dropout layer is employed, providing a degree of regularization.
The network architecture and the vector dimensions of each block are shown in Figure 5, where the input dimension is (N, H, W, C). Here, N is the number of samples, H × W is the sub-region size of the input satellite images, and C is the number of TOA and surface reflectance bands. The output dimension is (N, 1), a single variable, AOD. The network contains five basic blocks to extract input-data features. Given MODIS TOA reflectance and surface reflectance images as input feature vectors, shallow features are first extracted through convolutional layers and then passed into five consecutive blocks incorporating channel attention to extract effective feature information; the final feature map is passed into the concatenating layer via a global pooling layer. Both average pooling and global average pooling (GAP) are employed in different blocks of the network: the average pooling layer behind each convolutional layer reduces dimensionality and removes redundant information, while global average pooling replaces the flatten layer to regularize the whole network and prevent overfitting, eliminating the black-box character of the fully connected (FC) layer and thus enhancing the interpretability of the network.
The core component of the SPAODnet framework is the UpSample layer with the Channel Attention module (Up-CA). This module is specially designed to further exclude the influence of noise such as sub-pixel clouds and the corresponding shadows. Deep learning networks usually employ a global average pooling or global max pooling layer to aggregate global features; these are extremely susceptible to noise points, such as sub-pixel clouds in low-resolution remote sensing imagery, and lead to weights that are susceptible to outliers. Therefore, the Up-CA module is proposed in this study to replace the global pooling layer, aggregating spatial information with a dynamic pooling layer to lower the influence of sub-pixel clouds and the corresponding shadows. It assigns higher weights to uncontaminated pixels by learning the agreement between the pixels and the label; several examples of the expected effect are presented in Section 4.3. Our implementation is shown diagrammatically in Figure 6, and the lower part of the frame shows the detailed structure of the CA for AOD module.
The feature maps F ∈ ℝ^(C×2H×2W) serve as the input to the UpSampling layer and the CA module. In the CA module, a parallel branch with sequential operations, a squeeze operation (1 × 1 convolutional layer), an excitation operation (ReLU), and a bottleneck structure (a fully connected layer with a sigmoid function), is combined with the general convolutional operation to obtain the weights of the different channels. The CA weights are element-wise multiplied with the input feature map F to obtain the weighted feature map T_d. Here, depth-wise separable convolution is applied to significantly reduce the number of convolution parameters. In the CA mechanism, Conv denotes the convolution operation with a filter size of 1 × 1. The CA reduces the impact of sub-pixel clouds on global features by learning the importance of spatial features without sub-pixel clouds and assigning different weights to different spatial information.
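The squeeze-excitation-style gating at the heart of the CA module can be sketched in NumPy (a minimal sketch, not the exact SPAODnet layer: the weight matrices w1 and w2, the reduction ratio, and the use of global average pooling as the squeeze are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feature_map, w1, w2):
    """Channel-attention gate: squeeze each channel to a descriptor,
    pass it through a bottleneck (w1, ReLU) and expansion (w2, sigmoid),
    then rescale the channels of the input feature map.

    feature_map: (H, W, C); w1: (C, C//r); w2: (C//r, C)."""
    # Squeeze: global average pooling over the spatial dimensions
    desc = feature_map.mean(axis=(0, 1))              # shape (C,)
    # Excitation: bottleneck with ReLU, then a sigmoid gate in (0, 1)
    gates = sigmoid(np.maximum(desc @ w1, 0) @ w2)    # shape (C,)
    # Reweight channels: informative channels keep larger gates
    return feature_map * gates
```

Because the gates are sigmoid outputs in (0, 1), the module can only attenuate channels, never amplify them, which is how it suppresses channels dominated by contaminated pixels.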

Model evaluation metrics and loss functions
Based on the assumption of a uniform aerosol distribution within a certain space, the center pixel can represent the whole input image (12 × 12). Assuming a d-dimensional input x, an m-dimensional output y, a weight matrix w, a bias vector b, and the parameter set θ_{w,b}, the process of AOD retrieval can be written as ŷ = f(x; θ_{w,b}), where θ_{w,b} is obtained by minimizing the loss function between the ground truth (y) and the predictions (ŷ) over the training data. MSE and MAE are two of the most widely used loss functions (Chicco, Warrens, and Jurman 2021), and each has its advantages. Instead of using one of them, the Huber loss function, which incorporates the advantages of both MSE and MAE, is employed in this study; it is defined as Eq. (20).
L_δ(y, ŷ) = 0.5 (y − ŷ)² if |y − ŷ| ≤ δ; δ|y − ŷ| − 0.5δ² otherwise (20)

Compared with MAE and MSE, the Huber loss is less sensitive to outliers while converging quickly. Here L_δ denotes the Huber loss, δ is a hyperparameter, y is the true value, ŷ is the predicted value of the model, and |y − ŷ| is the prediction bias. As δ → 0, the Huber loss tends to MAE; as δ → ∞, it tends to MSE. The linear branch δ|y − ŷ| − 0.5δ² makes the MAE-like and MSE-like parts consistent at |y − ŷ| = δ, ensuring that the Huber loss is continuously differentiable, including at 0. The choice of δ is an adaptive selection process rather than a fixed setting; when the research window size is 12, training yields δ = 0.3. As shown in Figure 7, the results suggest that the Huber loss gives better overall predictive ability for AOD retrieval.
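Eq. (20) can be implemented directly (a minimal NumPy sketch; δ = 0.3 follows the trained value reported above):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=0.3):
    """Huber loss: quadratic (MSE-like) for residuals <= delta,
    linear (MAE-like) beyond it, so outliers are down-weighted."""
    r = np.abs(y_true - y_pred)
    quadratic = 0.5 * r ** 2                  # |y - yhat| <= delta branch
    linear = delta * r - 0.5 * delta ** 2     # |y - yhat| > delta branch
    return np.where(r <= delta, quadratic, linear).mean()
```

For a residual of 0.2 (below δ) the loss is the quadratic 0.5·0.2² = 0.02, while a residual of 1.0 incurs only the linear penalty 0.3·1.0 − 0.045 = 0.255 instead of the MSE value 0.5.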
The evaluation metrics used in this study include the root-mean-square error (RMSE) and the coefficient of determination (R²), shown in Eqs. (21) and (22). The RMSE is widely accepted as an evaluation metric for the difference between satellite retrievals and ground-based measurements (Che et al. 2018).

RMSE = √[(1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²] (21)

R² = 1 − Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} (y_i − ȳ)² (22)

where ŷ_i and y_i represent the AODs from satellite retrievals and ground-based sites, respectively, and ȳ is the mean of the ground-based AODs. The network architecture of SPAODnet is flexible and can accommodate different types of variables to model complex associations in general spatial-regression applications. Given the input x and output y variables, normalization (e.g. standardization) is required for both. Network training aims to optimize the objective θ^opt_{w,b} = argmin_{θ_{w,b}} L(y, ŷ), where θ^opt_{w,b} denotes an optimal solution for the network parameters and the total loss function L is given in Eq. (20), depending on which interval the loss value falls into. We use Adam as the optimizer. A sensitivity analysis was conducted to find an optimal structure (the type of network structure and the number of nodes per layer). Grid search (Li et al. 2020) is used to obtain the optimal hyperparameters, including the initial learning rate and mini-batch size. Normalization is used to initialize the parameters. After the optimal model is obtained, it can be used to make predictions. All deep learning models are evaluated on a Linux server with an Intel Core i7 CPU, 16 GB RAM, and Nvidia TITAN RTX (TU102) GPUs. The full pipeline is implemented with TensorFlow 2.6.0.
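The two evaluation metrics can be sketched as follows (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root-mean-square error between retrievals and ground truth."""
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

def r_squared(y_obs, y_pred):
    """Coefficient of determination R^2: 1 minus the ratio of residual
    to total sum of squares around the observed mean."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect retrieval gives RMSE = 0 and R² = 1; predicting the mean of the observations gives R² = 0.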

Results and discussion
To better present the results and validate our method, we compared the accuracy of AOD retrieval and the number of retrieval points. Finally, we used the Grad-CAM algorithm to visualize the effectiveness of channel attention in aerosol retrieval under sub-pixel cloud conditions, and we conducted ablation experiments on each contributing component.

The spatial-temporal variations of the retrieved AOD from SPAODnet model
The SPAODnet is an end-to-end network, which implies that the network can directly output an AOD image. Consequently, once the SPAODnet has been adequately trained, it can predict daily AOD over the entire study area, covering both dark and bright surfaces. In the retrieval process, SPAODnet addresses the issue of removing isolated noise points in the AOD images retrieved under extreme conditions such as snowy weather, thick clouds, and shimmering water surfaces. To achieve this, it employs a Laplacian operator with a second derivative to compute image gradients and subsequently executes an eight-connected edge detection on the gradient image. Since the presence of clouds or water areas in the image can lead to biased high AOD values, calculating the gradient differences around pixels assists in identifying the noise points, which correspond to areas occupied by clouds or water. Following this, the eight-connected edge detection is utilized to pinpoint edge regions, effectively eliminating isolated noise points in the image. Compared to DB and DT, MAIAC has better spatial resolution and higher spatial coverage. However, comparing (r) and (e) in Figure 8, MAIAC exhibits higher AOD values in pixels near clouds and water in urban areas.
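The noise-removal step described above (Laplacian gradients followed by eight-connected detection of isolated clusters) could be sketched roughly as follows, assuming SciPy; the gradient threshold, the component-size cutoff, and the use of connected-component labelling to drop small clusters are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy import ndimage

def remove_isolated_noise(aod, grad_thresh=0.5, min_size=8):
    """Flag pixels with large Laplacian (second-derivative) responses,
    group the flagged pixels with 8-connected labelling, and mask small
    isolated clusters (likely cloud/water-induced spikes) as NaN."""
    grad = np.abs(ndimage.laplace(aod))           # second-derivative gradient image
    mask = grad > grad_thresh                     # candidate noise pixels
    eight = np.ones((3, 3), dtype=int)            # 8-connected neighbourhood
    labels, n = ndimage.label(mask, structure=eight)
    cleaned = aod.astype(float).copy()
    for lab in range(1, n + 1):
        comp = labels == lab
        if comp.sum() <= min_size:                # small isolated cluster -> remove
            cleaned[comp] = np.nan
    return cleaned
```

A single biased-high spike in an otherwise smooth AOD field produces a small high-gradient cluster, which this sketch masks out while leaving the smooth background untouched.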
To demonstrate the difference quantitatively, the AODs obtained through MAIAC, SPAODnet, the MODIS DB product, and the major AERONET sites were tabulated in Table 4. Table 4 quantitatively compares the prediction accuracy of the SPAODnet model with DB (QA = 3) and MAIAC (QA = 3), which shows that the SPAODnet-retrieved AODs are more numerous and closer to the observed AODs from the AERONET sites. For all three dates, the SPAODnet retrieved AODs for all the sites, but the MODIS DB product missed five of them. For the AODs missed by the MODIS DB product, although the difference between the observed AOD and the AOD from the SPAODnet was larger, the retrieved AODs from the SPAODnet were not strongly biased. From the perspective of accuracy, the DB product had only one retrieval, at Beijing-CAMS on 2019267, that was better than that from SPAODnet; therefore, SPAODnet outperformed it overall. The discrepancy between SPAODnet and DB is prominent in the northeastern portion, where SPAODnet showed more details and much higher aerosol loading, which were visually obvious. The SPAODnet model had a higher spatial coverage, even slightly higher than that of the DB algorithm with QA = 1; furthermore, the prediction accuracy of the SPAODnet model achieved the highest value of 83.49%. Therefore, SPAODnet can retrieve AOD with better spatial and temporal coverage than the MODIS DB method, and it also has better accuracy. Most of the AOD predictions from MAIAC were similar to those from the SPAODnet model, but the AOD values for the Beijing-PKU and Beijing-CAMS sites were significantly higher, indicating that the MAIAC algorithm tends to overestimate AOD values in urban areas.

Accuracy evaluation and comparison to the MODIS AOD products
The SPAODnet model was trained and tested using the 4176 collocated data records over ten years from 2011 to 2020. The validation results indicate that over 91% of the AOD retrieved using all data falls within the EE envelopes, as shown in Figure 9 (Chen et al. 2020; Wei et al. 2019). To further demonstrate the accuracy of the SPAODnet model, Level 1.5 daily swath products with 3 km spatial resolution (MOD04_L2 for Terra) were used. The scatterplots of the MODIS AOD products from the MAIAC, DB, DT, and DBDT algorithms against the observed AOD from AERONET were plotted in Figure 9(a)~(e). Based on major indicators including the ratio within EE, R², and RMSE, the SPAODnet exhibited the highest accuracy. In particular, the ratio above EE was reduced to 5.51%. To match the number of validation points with the MODIS AOD products, all 4176 collocated data records, including both training and testing records, were used for this validation, which makes the SPAODnet's accuracy appear overly optimistic. To be more reasonable, another comparison was conducted using independent samples, which included only 1/10 of the 4176 collocated data records, and the corresponding AODs of the MODIS AOD products were picked out for comparison. The results are plotted in Figure 10. In this case, over 83% of the retrieved AOD was within the EE envelopes, which is still much better than that obtained using the MODIS products. However, the R² and RMSE of the DT method improved significantly because the strict screening criterion for dark surfaces excluded many points with higher error, such as bright surfaces and higher aerosol loading. Nevertheless, the DT method had a ratio above EE of 19.18%, indicating that it overestimated AODs in urban areas. Figure 10(a) indicates that the R² and RMSE metrics of the MAIAC algorithm are slightly higher than those of SPAODnet, whereas the within-EE and above-EE metrics exhibit better performance for SPAODnet. This suggests that the MAIAC algorithm has relatively small prediction deviations but a tendency for overestimation in urban areas. Conversely, the SPAODnet model demonstrates lower overestimation rates in independent testing and performs better in urban areas. Therefore, the SPAODnet model has competitive accuracy in AOD retrieval under sub-pixel cloud conditions compared to the widely used MODIS products.

Comparison to the state-of-the-art method
Compared to the widely used traditional methods such as DB and DT, MAIAC and deep learning methods have recently been developed and are considered state-of-the-art in this study. Among these methods, the Neural

Retrievable number of aerosols
In addition to the accuracy of retrieved AOD, we also evaluate the performance of the SPAODnet model based on the number of retrievable aerosols over ten years. The more AODs that can be retrieved, the better the usability and capability of the SPAODnet. Figure 12(a) displays the retrievable number of AOD over the ten AERONET sites from the DB algorithm with QA = 1, 2, and 3, respectively, alongside the proposed SPAODnet, while Figure 12(b) displays the corresponding AOD site coverage. For most sites, the SPAODnet's retrievable number is higher than that from the DB algorithm, especially with QA = 1, where an increase of ~10% in retrievable numbers can be achieved. The SPAODnet performs best over the XiangHe and Beijing-CAMS sites, where the retrievable number is significantly higher than that from the DB algorithm with QA = 2 and 3; in particular, the SPAODnet's retrievable number is 50% more than that from DB with QA = 3. In terms of the retrievable number of aerosols within ten years, our proposed method retrieves more available AODs while reducing the impact of sub-pixel clouds in low-resolution images, resulting in better usability and capability for applications.

The visualization of the attention mechanism in the SPAODnet network
A channel attention mechanism is introduced to reinforce the weights of the channels with important features and suppress the weights of the channels with minor features. The effectiveness of the channel attention mechanism for AOD retrieval was visualized using Grad-CAM. Figure 13 displays the attention maps generated by the Grad-CAM algorithm for four scenes with different appearances of clouds or sub-pixel clouds.
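To illustrate the channel-reweighting idea in isolation, a toy squeeze-and-excitation-style gate in NumPy might look like the following; note this is a simplified sketch, not the paper's Up-CA module, which (per Figure 6) replaces global pooling with a 1×1 convolution, and the weight shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feature_map, w1, w2):
    """Toy channel attention: globally pool each channel to a scalar,
    pass it through a small two-layer gate, and rescale the channels
    by the resulting (0, 1) weights, so informative (cloud-free)
    channels are emphasized and minor ones suppressed.
    feature_map: (H, W, C); w1: (C, C//r); w2: (C//r, C)."""
    squeeze = feature_map.mean(axis=(0, 1))             # global average pool -> (C,)
    excite = sigmoid(np.maximum(squeeze @ w1, 0) @ w2)  # per-channel gate in (0, 1)
    return feature_map * excite                         # reweight channels
```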
The feature map depicts the distribution of weights after training the model, where dark blue pixels represent lower weights and yellow regions represent higher weights, indicating the greater importance of the corresponding regions for AOD retrieval.
It can be seen that the channel attention module Up-CA effectively extracts the important information in these scenes. The regions with possible cloud contamination are identified as minor features and are given lower weights. Additionally, the scattered bright points that may represent sub-pixel clouds are also identified as minor features by the SPAODnet network, which reflects the effectiveness of the incorporated channel attention mechanism.
Bilateral filtering improves the distribution of sub-pixel clouds in TOA images, and the Up-CA attention mechanism is used to extract cloud-free features in the model that correspond with the observation stations. The "above EE" evaluation metric is used to quantify overestimated AOD, where a retrieved AOD value exceeding the confidence interval range is considered an overestimation. The effect of bilateral filtering was evaluated through an ablation experiment, with the quantification of overestimated AOD shown in the "above EE" column of Table 5. The relative contribution of each component to the overall improvement is also presented in Table 5.
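As a rough illustration of the pre-processing step, a plain-NumPy bilateral filter might look like the following; the kernel radius and the two sigmas are fixed illustrative values, whereas the paper uses a spatially adaptive variant:

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_s=1.5, sigma_r=0.1):
    """Bilateral filter sketch: each pixel becomes a weighted mean of
    its neighbours, with weights combining spatial closeness (sigma_s)
    and radiometric similarity (sigma_r). Bright sub-pixel-cloud
    outliers are radiometrically dissimilar, so they contribute little
    to the smoothed TOA reflectance of surrounding pixels."""
    h, w = img.shape
    pad = np.pad(img, radius, mode='reflect')
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng = np.exp(-(patch - img[i, j]) ** 2 / (2 * sigma_r ** 2))
            wgt = spatial * rng
            out[i, j] = (wgt * patch).sum() / wgt.sum()
    return out
```

Unlike a plain Gaussian blur, the range term keeps a bright outlier from leaking into the reflectance of its neighbours, which is the behaviour the ablation experiment measures via the "above EE" column.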

Conclusions
In this study, the SPAODnet model is proposed to retrieve AOD in the BTH region by integrating satellite observations and ground-measured AOD from AERONET sites. To improve AOD retrieval accuracy, the SPAODnet model addresses three essential issues while constructing a nonlinear regression prediction network. The first issue is addressed by using a bilateral filter to pre-process TOA reflectance in order to reduce the noise induced by sub-pixel clouds and cloud shadows, which significantly improves the quality of the inputs to the model. The second issue is addressed by employing the Huber loss function. The last issue is addressed by adding a channel attention mechanism to the backbone network, which slightly improves performance by reinforcing the contribution of important features and suppressing the minor features. Compared to other CNN-based models such as NNAero, the SPAODnet model achieves a 10~15% improvement in accuracy. Additionally, compared to traditional methods such as DB and DT from the MODIS AOD products, the SPAODnet model achieves an over 80% within-EE ratio, with an accuracy that is 5-10% higher than those of the MODIS products. Moreover, the predicted AOD values from the SPAODnet have a spatial resolution of 1 km, obtained by moving the window pixel by pixel over the entire image. In terms of removing sub-pixel clouds, employing bilateral filtering to process TOA reflectance reduced the above-EE metric by 4%, and using the attention mechanism Up-CA improved the above-EE metric by a further 3%. Removing sub-pixel clouds also increased the spatial coverage of the AOD. Therefore, through the simultaneous consideration of nonlinearity and spatial correlation, the SPAODnet model achieved excellent performance in spatial resolution, spatial coverage, and AOD retrieval accuracy. Furthermore, the SPAODnet model is trained using over ten years of collocated data from both satellite and ground-measured AOD, making it a viable option for predicting AODs within a certain region and their spatial distribution without additional work. Based on these results, we believe that the proposed SPAODnet model can be considered an effective method for retrieving AODs from satellite observations. In future work, we intend to employ the SPAODnet model to produce global AOD with higher accuracy, better spatial coverage, and higher resolution. Additionally, we plan to incorporate an additional feature extraction branch along the time dimension to capture the spatio-temporal properties of aerosols simultaneously. Coupling a neural network that focuses on sequential data, such as a Transformer model, may further improve the accuracy of the retrieved AOD.
Figure 2(a,b) show the detailed information on land cover data near the AERONET stations.The ranges represented in Figure 2(a) and (b) are located in the red boxes in Figure 2(c).The blue dots in Figure 2 indicate the ten AERONET sites within the study area.

Figure 3 .
Figure 3.The pipeline of the SPAODnet algorithm.

Figure 4 .
Figure 4. The visualization results of bilateral filtering in removing sub-pixel clouds. (a) and (b) show Sentinel-2 satellite images under clear and cloudy conditions, respectively. (c) shows a low-resolution MODIS image under cloudy conditions. (d) demonstrates the results of Gaussian filtering, thresholding, and bilateral filtering in removing sub-pixel clouds from MODIS images.

Figure 5 .
Figure 5. A schematic representation of the channel-attention-based convolutional neural network architecture used to retrieve AOD. The same architecture was optimized multiple times with different loss functions to compare the sensitivity of the model. Convolution, channel attention, average pooling, up-sampling plus channel attention, global pooling, dense, and dropout layers are shown in purple, white, blue, green, blue, pink, and sky blue, respectively. "Concatenated" denotes the concatenation operation between feature maps.

Figure 6 .
Figure 6. Diagram of the Up-CA module. As illustrated, this module utilizes a 1*1 convolutional layer instead of global pooling. The green and yellow blocks represent the depthwise separable convolutions for conv 3*3 and conv 1*1, respectively. σ represents the sigmoid function, ⊙ is the matrix dot product, and ⊗ denotes matrix multiplication. The convolution kernel size is 3*3 and the stride is 1 in the convolution operations.
A 12 × 12 km² window (12 × 12 pixels) is used for training and retrieving AOD to reduce noise, and the SPAODnet model retrieves AOD with a spatial resolution of 1 km by moving the window pixel by pixel over the entire image. As shown in Figure 8, AOD with 1 km spatial resolution provides much more detail than the MODIS products with a resolution of 3 or 10 km. To more accurately represent the spatial and temporal variability of the SPAODnet-retrieved AODs, the AODs across the entire study area on three different dates, 2019072 (YYYYDDD), 2019267, and 2020158, are plotted in Figure 8. True color composites of MODIS TOA reflectance at 0.5 km resolution are also shown in Figure 8 to visualize the spatial distribution of the AOD, along with the corresponding AODs from the MODIS DT and DB algorithms. The high-precision MAIAC algorithm is also compared in Figure 8, as are the AODs retrieved through the general CNN model, NNAero. From Figure 8, the spatial and temporal variations of the SPAODnet-retrieved AODs closely match those of the TOA reflectance.
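The pixel-by-pixel window retrieval described here can be sketched as follows; `predict` is a hypothetical stand-in for the trained SPAODnet forward pass, and the reflect padding at the image border is an assumption:

```python
import numpy as np

def sliding_window_retrieval(toa, predict, win=12):
    """Extract a win x win patch centred on every pixel of the TOA
    image, pass it through `predict` (stand-in for the trained model),
    and assign the result to that pixel, yielding an AOD map at the
    full 1 km pixel resolution rather than one value per window."""
    h, w = toa.shape[:2]
    half = win // 2
    spec = ((half, half), (half, half)) + ((0, 0),) * (toa.ndim - 2)
    padded = np.pad(toa, spec, mode='reflect')   # handle image borders
    aod = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + win, j:j + win]
            aod[i, j] = predict(patch)
    return aod
```

Because the window advances one pixel at a time, adjacent output pixels share most of their input patch, which smooths the retrieval while preserving the 1 km grid.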

Figure 7 .
Figure 7.Comparison of the performance of AOD retrieval on SPAODnet using different loss functions.

Figure 8 .
Figure 8. (a-c) The daily MODIS TOA reflectance true color maps over the BTH in China with 0.5 km resolution on 2019072, 2019267, and 2020158, respectively. (d-f) The daily maps of AOD retrieval as derived over the BTH in China using the SPAODnet model applied to MODIS-Aqua data at 0.550 μm with 1 km resolution. (g-l) The daily maps of AOD retrieval using the deep blue algorithm and dark target algorithm applied to MODIS-Aqua with resolutions of 10 km and 3 km, respectively. (m-r) The daily maps of AOD retrieval using the deep learning model NNAero and MAIAC with 1 km resolution. The spatial distribution of the AERONET sites is represented by yellow triangles; the white area in the maps represents no data.

Figure 9 .
Figure 9. Density scatterplots of total samples for (a) the MAIAC algorithm, (b) the deep blue algorithm, (c) the dark target algorithm, (d) the deep blue algorithm and the dark target algorithm (DBDT) and (e) SPAODnet algorithm.The colour bars represent the density of retrieved values that rely on the grid points (his2D).The black solid line is the 1:1 line, and the red dashed lines represent the within EE lines.

Figure 10 .
Figure 10.Scatter plots with independent test for the retrieval of AOD.(a) the MAIAC algorithm, (b) the deep blue algorithm, (c) the dark target algorithm, (d) the deep blue algorithm and the dark target algorithm (DBDT) and (e) the SPAODnet algorithm.

Figure 12 .
Figure 12. The histogram of the retrievable number and the AOD site coverage from the DB algorithm and SPAODnet in the BTH region during 2011-2020. (a) shows the number of retrievable aerosols. (b) shows the corresponding AOD site coverage.

Figure 13 .
Figure 13. The daily MODIS TOA reflectance true colour maps and the attention maps derived from Grad-CAM. A lighter colour indicates a higher weight, and yellow indicates the highest weight.

Table 1 .
Detailed information about the used AERONET sites.Lat: latitude; Lon: longitude.

Table 2 .
Detailed information of MODIS/Terra data used in this study.

Table 3 .
Performance of AOD retrievals under different subregion sizes.In the collection 6 AOD validation, EE is defined as ±0.05 or ±0.15×AOD, and the AERONET site coverage represents effective coverage of AERONET AOD for the period 2011-2020.

Table 4 .
Test performances for the AODs from the MAIAC, SPAODnet, MODIS DB products, and the major AERONET sites.QA = 1, 2, and 3 represent the quality assurance levels of DB products.

Table 5 .
The ablation experiment of SPAODnet. "Above EE" is the metric used to quantify the AOD overestimation caused by sub-pixel clouds.