For-backward LSTM-based missing data reconstruction for time-series Landsat images

ABSTRACT Reconstructing the missing data of cloud/shadow-covered optical satellite images is of great significance for enhancing data availability and multi-temporal analysis. In this study, we propose a deep-learning-based method for reconstructing cloud/shadow-covered missing data in time-series Landsat images. Central to this method is the combined use of autoencoder long short-term memory (AE-LSTM)-based similar-pixel clustering and for-backward LSTM-based time-series prediction. First, manually delineated cloud/shadow masks were overlaid onto multi-temporal satellite images to produce pixel-wise time-series data with masking values. Second, these pixel-wise time series were clustered by an AE-LSTM-based unsupervised method into multiple clusters to search for similar pixels. Third, for each cluster of the target images, a for-backward-LSTM-based model was established to restore missing values in the time-series data. Finally, the reconstructed data were merged with cloud-free image regions to produce cloud-free time-series images. The proposed method was applied to three datasets of multi-temporal Landsat-8 OLI images to restore cloud/shadow-covered images. The reconstruction results, showing an improvement of greater than 10% in normalized mean-square error over state-of-the-art methods, demonstrate the effectiveness of the proposed method for time-series reconstruction of satellite images.


Introduction
As one of the most powerful tools for monitoring the Earth's surface, remote-sensing images have been applied in a variety of applications, such as land-cover classification (Gomez, White, and Wulder 2016; Ma et al. 2017), waterbody mapping (Zhou et al. 2014), urbanization monitoring (Feng et al. 2017), and crop forecasting (Zhou et al. 2019). In particular, multi-temporal images can further reveal changes of the Earth's surface and the evolution of ground objects (Zhu and Woodcock 2014). However, due to the influence of weather, optical images acquired by satellite sensors are often contaminated (or covered) by clouds and shadows, especially in humid tropical areas (Tseng, Tseng, and Chien 2008). In cloudy and shadowed regions, the true ground information is occluded and difficult to obtain, which strongly limits the application of optical images (Pouliot and Latifovic 2018). Further, for multi-temporal remote-sensing applications (such as monitoring deforestation, agricultural planting, urbanization, and wetland loss), cloud- and shadow-covered (for convenience, hereafter simply grouped as cloud-covered) data produce irregular time intervals, increasing the difficulty of further time-series analysis (Wu et al. 2018). Therefore, reconstructing the missing data (regions) of such cloud-covered images is of great significance for enhancing data availability and multi-temporal analysis (Gao and Gu 2017).
Much research effort has been devoted to reconstructing missing data in cloud-covered images; existing methods can be classified into three major categories (Chen, Huang, and Chen et al. 2016): self-complementation-based, reference-complementation-based, and multi-temporal-complementation-based methods. In self-complementation-based approaches, without incorporating other complementary image data, missing data in cloud-covered regions are reconstructed using the remaining cloud-free regions of the same image. Based on the assumption that the missing regions and the remaining regions of the image share the same (or similar) statistical and geometrical structures, the cloud-covered regions are reconstructed by propagating geometrical structure from local or non-local cloud-free regions. Under theories of spatial interpolation and error propagation, missing-pixel interpolation methods (Van der Meer 2012; Pringle, Schmidt, and Muir 2009; Zhu, Liu, and Chen 2012; Zhang, Li, and Travis 2009), image inpainting methods (Lorenzi, Melgani, and Mercier 2011; Cheng et al. 2013), and model-fitting methods (Maalouf et al. 2009) have been widely used for reconstructing cloud-covered regions. Although these methods yield visually plausible results, they are sensitive to the land-cover types underneath the clouds (Zhang, Wen, and Gao et al. 2019), making the results unsuitable for data analysis in further applications (Cheng et al. 2014). Furthermore, as uncertainty and error accumulate with propagation, these methods can hardly deal with large-region or heterogeneous data missing (Zhang, Wen, and Gao et al. 2019).
To overcome the bottleneck of the self-complementation approaches, reference-complementation methods reconstruct missing data by modeling the mapping relationship from auxiliary cloud-free images (or bands) to cloud-covered images (or bands). This category of methods relies on the fact that there is usually high correlation between different spectral data. When cloud-insensitive spectral bands are available in multispectral or hyperspectral images, they are utilized to reconstruct the missing data of cloud-covered bands. Based on the close correlation between band 7 and band 6 of Moderate Resolution Imaging Spectroradiometer (MODIS) data, various fitting methods (Wang et al. 2006; Shen, Zeng, and Zhang 2010; Rakwatin, Takeuchi, and Yasuoka 2008; Gladkova et al. 2011) were proposed to reconstruct the missing data of MODIS band 6 using MODIS band 7. Ji (2008) utilized the near-infrared band to estimate the spatial distribution of haze intensity in the visible bands of Landsat images, through a linear regression model established over a deep-water area. Some studies attempted to employ cloud-free images from different sensors to reconstruct cloud-covered images. For example, Roy et al. (2008) proposed predicting the missing data of Landsat images by incorporating the corresponding MODIS image (with a close acquisition date). With the advantage of insensitivity to cloud and rain, synthetic aperture radar (SAR) data were employed to reconstruct cloud-covered optical images (Hoan and Tateishi 2009; Huang et al. 2015; Grohnfeldt, Schmitt, and Zhu 2018). Reference images provide much information for reconstructing missing data, but these approaches are usually constrained by spectral compatibility (Lin et al. 2012), spatial resolution, and temporal coherence, and are incapable of restoring quantitative products and modeling change information over time (Cheng et al. 2014; Malambo and Heatwole 2015).
With or without reference images, the previous two approaches are limited in that they are formulated under the assumption of no, or only gradual, change in the reconstructed images (Mariethoz, McCabe, and Renard 2012). This inherent stationarity assumption is clearly a weakness in time-series applications such as documenting land-cover change and monitoring crop growth (Mariethoz, McCabe, and Renard 2012). Since remote-sensing satellites with a fixed repeat cycle regularly acquire images of the same area, which is not cloud-covered all the time, multi-temporal images of the same region are easy to obtain. These multi-temporal (cloud-covered or cloud-free) images offer the potential to address cloud-covered missing data for time-series analysis (Cheng et al. 2014). For reconstructing time-series images, multi-temporal-complementation approaches (the most relevant to this work) involve two main steps: searching for cloud-free pixels whose land-cover types are similar to those of cloud-covered pixels, and predicting the cloud-covered pixels using these cloud-free pixels. For searching similar pixels, the incorporation of spatial, spectral, and temporal relationships has been widely investigated to calculate the similarity between cloud-free and cloud-covered pixels (Gao and Gu 2017; Melgani 2006; Benabdelkader and Melgani 2008; Zhang et al. 2018). For reconstructing cloud-covered pixels, various models have been developed. Assuming small change between cloud-covered and cloud-free pixels, Tseng, Tseng, and Chien (2008) simply and directly replaced cloud-covered pixels with cloud-free pixels at the same location in reference images. By solving a group of constrained Poisson equations, Lin et al. (2012) cloned information from cloud-free regions to fill the corresponding cloud-covered regions.
Based on geostatistical techniques, Chen, Huang, and Chen et al. (2016) introduced a spatially and temporally weighted regression model to reconstruct the missing data of a target scene from both the target scene itself and multi-temporal reference scenes. Padhee and Dutta (2019) proposed a moving-offset method that restores missing values in a time-series NDVI at one pixel by using linear-regression coefficients with the reference NDVI at another pixel with identical conditions. Furthermore, some multi-temporal methods designed for filling gaps caused by sensor failure can be used for reconstructing cloud-covered pixels, such as the neighborhood similar pixel interpolator (NSPI) method (Zhu et al. 2011) and the weighted linear regression (WLR) approach (Zeng, Shen, and Zhang 2013).
Recently, deep-learning-based methods have been developed to reconstruct cloud-covered remote-sensing images. Grohnfeldt, Schmitt, and Zhu (2018) used generative adversarial networks to fuse SAR data and optical images to generate cloud-free optical images. Malek et al. (2018) used an autoencoder neural network to find the essential mapping function from cloud-free reference images to cloud-covered target images. Dai, Ji, and Zhang (2020) proposed a gated convolutional neural network (CNN) for cloud removal from bi-temporal images. Zhang et al. (2018) constructed and improved CNN-based frameworks for reconstructing missing data in multi-temporal satellite images caused by sensor failure and thick cloud. To further exploit the temporal tendency of multi-temporal observations for series reconstruction, some studies introduced recurrent neural networks (RNNs) into geoscience for temporal-spatial interpolation (Ma et al. 2019; Liu et al. 2019; Yoon, Yeeh, and Byun 2020) and prediction (Chang et al. 2020; Qi et al. 2019; Zhou et al. 2020).
Although encouraging reconstruction results have been produced in previous studies, these approaches have several limitations. (1) In addition to spectral and spatial similarity, the temporal tendency in multi-temporal images, which represents land-cover change, was underutilized and hard to exploit in previous methods (especially the traditional ones).
(2) Available deep-learning-based methods focused mainly on CNNs to capture spectral and spatial information; only a few studies employed RNNs to learn the temporal tendency across multi-temporal images.
(3) Since clouds may appear anywhere and at any time, pixel-wise time series always have irregular time intervals, which makes them hard to use in time-series analysis.
The objective of this article is to develop a deep-learning-based method for time-series reconstruction of Landsat images. Inspired by the promising results of RNNs, we use RNNs to learn multi-scale temporal features and tendencies for reconstruction, and, to overcome the problems of unequal time intervals and unaligned time steps, we introduce masking layers for time-series reconstruction. We implemented this method through the combined use of autoencoder-LSTM (long short-term memory, an implementation of RNN)-based similar-pixel clustering and for-backward LSTM-based missing data reconstruction. Compared to previous approaches, this study makes two principal contributions. The first is establishing a for-backward LSTM-based framework to incorporate temporal tendencies from both previous and subsequent images for reconstruction. The second is introducing masking layers to utilize incomplete and unaligned time series to improve reconstruction results.
The proposed method is further discussed and validated through missing data reconstruction on datasets of multi-temporal cloud-covered Landsat-8 OLI images in two study areas, in comparison with state-of-the-art methods.

Study areas and datasets
The first study area is in the southwestern Iberian Peninsula, Europe, straddling Spain and Portugal, with central coordinates of 7°22ʹW and 38°54ʹN (as presented in Figure 1); it is covered by World Reference System 2 (WRS2) Path 203 and Row 33. The region has a Mediterranean climate with hot, dry summers and mild, rainy winters. The vegetation is mainly subtropical evergreen hardwood forest. The study area covers a variety of landforms and land-use types, including waterbody, forest, grassland, farmland, and urban areas, making it suitable for testing the proposed method.
All 22 available Landsat-8 OLI images (hereafter referred to as Landsat images) with P/R 203/33 during 2015 (as listed in Figure 2) were employed to validate the proposed method. Landsat images, with a 16-day revisit cycle, comprise seven multi-spectral bands at a spatial resolution of 30 m, from which the widely used blue, green, red, and near-infrared (NIR) bands (bands 2-5) were selected for the experiments. All images were obtained from the USGS website at Level 1G (the image with DOY 42 was unavailable from the USGS website). Fifteen images are contaminated by clouds and shadows (cloud-covered percentages from 1% to 100%), with a mean cloud-covered percentage of approximately 33%. In particular, the image with DOY 362 is totally covered by clouds and shadows. We therefore excluded the images with DOY 42 and 362 from reconstruction in the experiments, since they provide no reference information.
The second study area is in Haiyang County, Shandong Province, China. Unlike the first study area, it has a typical temperate monsoon climate and is covered by typical land-cover types including artificial areas, water, wetland, farmland, forest, and grassland. In this area, all 23 available Landsat images with P/R 120/34 during 2017 were employed, with a mean cloud-covered percentage of 38%.

Methodology
A for-backward LSTM-based reconstruction method was proposed to restore cloud-covered missing data in time-series Landsat images, as illustrated in Figure 3. The method involves three main steps: (1) producing pixel-wise time-series data with masking values, (2) autoencoder-LSTM-based time-series pixel clustering, and (3) for-backward-LSTM-based time-series reconstruction.
Before the main process, data pre-processing was conducted, involving geographic registration, pixel alignment, and common-region clipping of the multi-temporal images. (1) First, all cloud-covered regions were manually delineated to produce cloud-covered masks for each Landsat image (Subsection 3.1); these masks were then overlaid onto the corresponding satellite images to generate pixel-wise time-series data with masking values (Subsection 3.2). (2) These pixel-wise time series were clustered by an AE-LSTM-based unsupervised method into multiple clusters for searching similar pixels (Subsection 3.3). (3) The pixel-wise time-series set was split into a training sample set and a predicting sample set (Subsection 3.4.1); a for-backward LSTM-based reconstruction model trained on the training samples was then employed to predict the missing time-series values in the predicting samples (Subsection 3.4.2). Finally, the reconstructed (predicted) data were merged with cloud-free image regions to produce cloud-free time-series Landsat images, which were verified against simulated actual images (Subsection 3.5).

Data pre-processing and cloud mask
Since Landsat data have well-characterized radiometric quality, data pre-processing consisted of geographic registration, pixel alignment, and common-region clipping. First, a second-degree polynomial model with nine manually selected, evenly distributed ground control points (GCPs) was employed to geometrically correct each Landsat image to a Google satellite map (produced from the Google Tile Map Service at a spatial resolution of 2.5 m), with registration errors of less than 10 m. Second, taking the image with DOY 10 as reference, we resampled the other images onto the same pixel grid for coordinate-wise pixel alignment. Finally, the common part of the multi-temporal images was clipped for time-series analysis.
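As an illustration of the registration step, the following minimal numpy sketch fits a second-degree polynomial transform from GCP pairs by least squares; the function names and GCP values are our own illustration, not the paper's actual registration tooling.

```python
import numpy as np

def poly2_design(xy):
    """Second-degree polynomial design matrix: [1, x, y, x^2, x*y, y^2]."""
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

def fit_poly2_transform(src_gcps, dst_gcps):
    """Least-squares fit of a 2nd-degree polynomial mapping src GCPs to dst GCPs.

    With nine GCPs and six coefficients per output coordinate, the system
    is overdetermined and solved in the least-squares sense.
    Returns two coefficient vectors, one per output coordinate.
    """
    A = poly2_design(src_gcps)
    coef_x, *_ = np.linalg.lstsq(A, dst_gcps[:, 0], rcond=None)
    coef_y, *_ = np.linalg.lstsq(A, dst_gcps[:, 1], rcond=None)
    return coef_x, coef_y

def apply_poly2(coef_x, coef_y, xy):
    """Apply the fitted polynomial transform to an array of (x, y) points."""
    A = poly2_design(xy)
    return np.column_stack([A @ coef_x, A @ coef_y])
```

In practice the fitted transform is applied to the image grid and the result resampled onto the reference grid; the residuals at the GCPs give the registration error reported above.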
To reconstruct the cloud-covered pixels of time-series images, cloudy and shadowed regions were delineated for each image. First, based on the aerosol band and the NIR-red-green band composite (as presented in Figure 2), we manually delineated cloudy and shadowed regions to generate initial cloud-covered masks for the Landsat images. Then, the cloud-covered masks were extended to also cover the ambiguous pixels around the delineated regions, which were hard to determine as cloud-contaminated or not. Finally, each cloud-covered mask was converted to a mask band and appended to the corresponding Landsat image for the subsequent reconstruction.

Pixel-wise time series with missing values
Each Landsat image carries a mask band m indicating whether a pixel p is cloud-covered. To produce pixel-wise time-series data, we defined the masked pixel value mp at time step $j$ ($1 \leq j \leq T$, where T is the number of time steps), band $i$ ($1 \leq i \leq B$, where B is the number of bands), and coordinates (x, y) as

$$mp^{i,j}_{x,y} = \begin{cases} mv, & m^{j}_{x,y} = 1 \text{ (cloud-covered)} \\ p^{i,j}_{x,y}, & \text{otherwise} \end{cases}$$

where mv is the masking value (a value much larger than the largest possible pixel value; we set it to 19,999 in this study), indicating a missing value caused by clouds. Then, we constructed the pixel-wise time series TS for band i at coordinates (x, y) as

$$TS^{i}_{x,y} = \left[ mp^{i,1}_{x,y},\; mp^{i,2}_{x,y},\; \ldots,\; mp^{i,T}_{x,y} \right].$$

This pixel-wise time series with masking values was the basic element of reconstruction. The masking value fills the missing data and aligns the time steps in the pixel-wise time series, preparing the data for the subsequent AE-LSTM-based pixel clustering and LSTM-based prediction.
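The masking scheme above can be sketched as follows; this is a minimal numpy illustration whose array layout and function name are our own, not from the paper.

```python
import numpy as np

MASK_VALUE = 19_999.0  # masking value mv from the paper (larger than any pixel value)

def masked_time_series(images, cloud_masks):
    """Build pixel-wise time series with masking values.

    images: array of shape (T, B, H, W) -- T time steps, B spectral bands.
    cloud_masks: boolean array of shape (T, H, W), True where cloud-covered.
    Returns an array of shape (B, H, W, T): one time series per band and pixel,
    with cloud-covered time steps replaced by MASK_VALUE.
    """
    T, B, H, W = images.shape
    ts = images.astype(np.float64)
    # broadcast the per-time-step mask across all bands, then overwrite
    ts[cloud_masks[:, None, :, :].repeat(B, axis=1)] = MASK_VALUE
    # reorder to (B, H, W, T) so the last axis is the time dimension
    return np.transpose(ts, (1, 2, 3, 0))
```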

Autoencoder-LSTM-based pixel clustering
An autoencoder-LSTM-based network was employed for clustering time-series data with masking values (masked pixels), as shown in Figure 4. In the AE-based framework for time-series clustering, we first fed the pixel-wise time series from all bands $\left[ TS^{1}_{x,y}, TS^{2}_{x,y}, \ldots, TS^{B}_{x,y} \right]$ into a masking layer and a stacked LSTM network (with four LSTM layers) to produce the latent high-level features z of the input time-series data (the encoder). Then, on one hand, the latent features z were fed into another stacked LSTM network (with four LSTM layers) to produce reconstructions (the decoder), under the constraint of a reconstruction loss; on the other hand, the latent features z were fed into a clustering layer to calculate the clustering probabilities of the latent features (the clustering branch), under the constraint of a clustering loss. Finally, the reconstruction loss and the clustering loss were combined for joint training as

$$L = L_r + \alpha L_c$$

where $L_r$ and $L_c$ are the reconstruction loss and clustering loss, respectively, and $\alpha \geq 0$ is the coefficient of the clustering loss, balancing $L_r$ and $L_c$. The clustering loss (Xie, Girshick, and Farhadi 2016) is defined as the Kullback-Leibler (KL) divergence (measuring the non-symmetric difference) between distributions P and Q, where Q is the distribution of soft labels measured by Student's t-distribution and P is the target distribution derived from Q:

$$L_c = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where $p_{ij}$ is the target distribution and $q_{ij}$ is the similarity between latent feature $z_i$ and cluster center $\mu_j$ measured by Student's t-distribution. The reconstruction loss is measured by the mean-squared error (MSE):

$$L_r = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - g\left(f(x_i)\right) \right\|^2$$

where f and g are the encoder and decoder mappings, respectively. More details can be found in Xie, Girshick, and Farhadi (2016), Guo et al. (2017), and Min et al. (2018).
In detail, we set the number of hidden neurons in the LSTM layers to 32, and the coefficient of the clustering loss α to its default value of 0.1.
The number of clusters is an open parameter. On one hand, more clusters may improve the homogeneity of each cluster, facilitating reconstruction; on the other hand, too many clusters would reduce the number of pixels per cluster, hampering the training of reconstruction models. When determining this parameter, we took the practical land-cover types of the study areas into consideration, that is, how many land-cover types can be interpreted from Landsat-8 images. For example, there are 10 types in the GLOBELAND30 dataset (Chen, Ban, and Li 2015), and a specific region often contains fewer than 10 land-cover types. In this study, we set the number of clusters to 10, which was also found to produce better results in simple preliminary experiments with various numbers of clusters.
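For reference, the soft assignments $q_{ij}$, target distribution $p_{ij}$, and KL clustering loss of Xie, Girshick, and Farhadi (2016) can be sketched in numpy as follows; this is a standalone illustration of those formulas, not the paper's implementation.

```python
import numpy as np

def soft_assignments(z, centers, nu=1.0):
    """q_ij: Student's t similarity between latent features z (n, d) and
    cluster centers (k, d); nu is the degrees of freedom (1.0 in DEC)."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / nu) ** (-(nu + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """p_ij: sharpened target distribution derived from q (DEC, Eq. for P)."""
    w = q ** 2 / q.sum(axis=0)          # emphasize confident assignments
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(p, q):
    """L_c = KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij)."""
    return float(np.sum(p * np.log(p / q)))
```

During joint training, this loss would be weighted by the coefficient α (0.1 here) and added to the autoencoder's MSE reconstruction loss.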

Sample sets for model training and prediction
Clouds may appear in any region of the image at any time step, so missing values must be reconstructed at any time step t, for any band b. For a specific time-series curve $TS^{b}_{x,y}$ at pixel (x, y) and band b, we grouped it into the predicting sample set $PSS^{b,t}$ when $mp^{b,t}_{x,y}$ was equal to the masking value mv, or into the training sample set $TSS^{b,t}$ otherwise. Pixel-wise time series in the training sample set $TSS^{b,t}$ were used to train the reconstruction model at time step t for band b, and the predicting sample set $PSS^{b,t}$ contained the time-series data to be reconstructed.
Further, for the for-backward reconstruction model, a time-series curve $TS^{b}_{x,y}$ from the training or predicting sample set was split at the target time step t into the forward time-series sub-curve $fTS^{b}_{x,y}$ with a length of t, the corresponding backward time-series sub-curve $bTS^{b}_{x,y}$ with a length of $T - t - 1$, and the target value $mp^{b,t}_{x,y}$. Thus, each sample for the reconstruction models took the form $\left( fTS^{b}_{x,y},\; bTS^{b}_{x,y},\; mp^{b,t}_{x,y} \right)$.
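The sample construction above can be sketched as follows; this is a minimal numpy illustration with a hypothetical function name.

```python
import numpy as np

def split_sample(ts, t):
    """Split a pixel-wise time series of length T at target step t into
    (forward sub-series of length t,
     backward sub-series of length T - t - 1 in reverse order,
     target value), as used by the for-backward model."""
    forward = ts[:t]
    backward = ts[t + 1:][::-1]  # reversed, so the step nearest t comes first
    target = ts[t]
    return forward, backward, target
```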

For-backward LSTM-based model structure
Using a traditional LSTM network, the forward tendency in time-series images can be easily captured to predict cloud-covered pixels; however, it fails to utilize the backward trend information when the target lies in the middle of the series. Thus, incorporating both forward and backward trend information, a forward-backward-fusion (for-backward) LSTM network was constructed for better time-series reconstruction, as presented in Figure 5. First, a pixel-wise time series TS is split into a forward time series fTS and a backward time series bTS, as described in the above subsection. Then, in the forward part, the forward time series fTS is successively fed into a masking layer, a stacked LSTM network, and a dense layer to produce a forward time-series feature ff; following a similar network structure, the backward time series bTS (in reverse order) produces a backward time-series feature bf. Finally, the forward time-series feature ff and the backward time-series feature bf are concatenated and fed into a dense layer to predict the missing value.
In detail, in the stacked LSTM network, four LSTM layers with 32 hidden neurons each were stacked to transform the raw time-series curves into high-level features. The lengths of the forward and backward time-series features are eight times those of the forward and backward time series, respectively.
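A sketch of this architecture in tf.keras, under the hyperparameters stated in the text (masking value 19,999, four stacked LSTM layers with 32 units each, feature length eight times the sub-series length); this is our reconstruction of the described design, not the authors' released code, and the layer names are our own.

```python
import tensorflow as tf
from tensorflow.keras import layers

MASK_VALUE = 19_999.0  # masking value from the paper

def branch(length, name):
    """Masking -> four stacked LSTM layers (32 units) -> dense feature."""
    inp = layers.Input(shape=(length, 1), name=f"{name}_in")
    x = layers.Masking(mask_value=MASK_VALUE)(inp)
    for i in range(4):
        # return full sequences for all but the last LSTM layer
        x = layers.LSTM(32, return_sequences=(i < 3))(x)
    feat = layers.Dense(8 * length, activation="relu", name=f"{name}_feat")(x)
    return inp, feat

def build_for_backward_lstm(t, T):
    """For-backward model predicting the missing value at target step t
    from a forward sub-series (length t) and a backward sub-series
    (length T - t - 1, fed in reverse order)."""
    f_in, f_feat = branch(t, "forward")
    b_in, b_feat = branch(T - t - 1, "backward")
    merged = layers.Concatenate()([f_feat, b_feat])
    out = layers.Dense(1, name="target")(merged)  # predicted missing value
    model = tf.keras.Model([f_in, b_in], out)
    model.compile(optimizer="adam", loss="mse")
    return model
```

The Masking layer causes the LSTM layers to skip time steps equal to the masking value, which is how incomplete, unaligned time series can still be used for training and prediction.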

Comparative methods
For comparative analysis, the proposed LSTM-based method was compared with three state-of-the-art methods: the spatially and temporally weighted regression (STWR-based) method (Chen, Huang, and Chen et al. 2016), the spectral-temporal patch-based missing-area reconstruction (STP-based) method (Wu et al. 2018), and the pixel-based contextualized AE neural network (PCAE-based) method (Zeng, Shen, and Zhang 2013). The STWR-based method makes full use of cloud-free pixels in the reference time-series cloud-covered scenes and optimally integrates spatial and temporal information (via spatially and temporally weighted regression) to recover missing pixels (via an inverse-distance-weighted interpolator) in the target scene; it was selected as a traditional geostatistical method for comparison. The STP-based method takes spectral-temporal patches (retrieved from multi-temporal image segmentation) as basic units for searching similar structures from cloud-free regions for cloud-covered regions, and injects textural information to preserve spatial and spectral continuity and suppress salt-and-pepper noise and false edges; it was selected as a patch-based method for comparison. The PCAE-based method employs an autoencoder neural network to calculate the transformation (mapping) function from the cloud-free reference to the target images to restore missing pixels in multispectral images; it was selected as a deep-learning-based approach for comparison.

Validation datasets and evaluation metrics
Missing data reconstruction performance was assessed using simulated cloud-covered regions, which have the advantage that reconstructions can be evaluated against actual observed data. First, we randomly removed patches (of varied shapes, areas, and positions) from the target images to generate simulated cloud-covered images. Then, we compared the reconstructed pixels with the actual pixels in the simulated cloud-covered regions pixel by pixel to calculate the reconstruction accuracy.
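The patch-removal simulation can be sketched as follows; this is a simplified numpy illustration using rectangular patches only, whereas the paper used patches of varied shapes, and the parameter values are hypothetical.

```python
import numpy as np

def simulate_cloud_mask(height, width, n_patches=5, max_size=40, seed=0):
    """Randomly 'remove' rectangular patches of varied sizes and positions
    to simulate cloud-covered regions.

    Returns a boolean mask, True where pixels are treated as cloud-covered;
    the masked-out actual values serve as ground truth for evaluation.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((height, width), dtype=bool)
    for _ in range(n_patches):
        h = int(rng.integers(5, max_size))
        w = int(rng.integers(5, max_size))
        r = int(rng.integers(0, height - h))
        c = int(rng.integers(0, width - w))
        mask[r:r + h, c:c + w] = True
    return mask
```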
For pixel i with an actual value $x_i$ in a given band, we restored it as the predicted value $\hat{x}_i$; its reconstruction error is $e_i = |x_i - \hat{x}_i|$. The smaller the value of $e_i$, the higher the accuracy. For the total reconstruction accuracy over the cloud-covered regions, we adopted the normalized mean-square error (NMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and correlation coefficient (CC) indices for quantitative evaluation:

$$\mathrm{NMSE} = \frac{\sum_{i \in R} (x_i - \hat{x}_i)^2}{\sum_{i \in R} x_i^2}, \qquad \mathrm{MAE} = \frac{1}{N} \sum_{i \in R} |x_i - \hat{x}_i|,$$

$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i \in R} \frac{|x_i - \hat{x}_i|}{x_i}, \qquad \mathrm{CC} = \frac{\sum_{i \in R} \left(x_i - u(x)\right)\left(\hat{x}_i - u(\hat{x})\right)}{\sqrt{\sum_{i \in R} \left(x_i - u(x)\right)^2 \sum_{i \in R} \left(\hat{x}_i - u(\hat{x})\right)^2}},$$

where N is the total number of missing pixels in the cloud-covered region R, and $u(x)$ and $u(\hat{x})$ are the mean values of the actual and the reconstructed values in the cloud-covered regions, respectively.
NMSE indicates the deviation between the actual observations and the reconstructed results; MAE reflects the actual errors of the predicted values; MAPE provides a relative measure of error even when data vary in magnitude; and CC depicts the textural similarity between the actual observations and the recovered results. For NMSE, MAE, and MAPE, lower values indicate better reconstruction. A larger CC indicates closer consistency between the two sets of data, which are identical when CC equals 1.
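The four metrics can be computed over the simulated cloud-covered pixels as follows; this is a minimal numpy sketch, and the NMSE normalization by $\sum x_i^2$ is a common convention that we assume here.

```python
import numpy as np

def reconstruction_metrics(actual, predicted):
    """NMSE, MAE, MAPE (%), and CC over the simulated cloud-covered pixels.

    actual, predicted: 1-D arrays of the actual and reconstructed values
    at the missing pixels of one band (actual values assumed nonzero for MAPE).
    """
    x = np.asarray(actual, dtype=float)
    xh = np.asarray(predicted, dtype=float)
    nmse = np.sum((x - xh) ** 2) / np.sum(x ** 2)
    mae = np.mean(np.abs(x - xh))
    mape = np.mean(np.abs(x - xh) / x) * 100.0
    cc = np.corrcoef(x, xh)[0, 1]  # Pearson correlation coefficient
    return {"NMSE": nmse, "MAE": mae, "MAPE": mape, "CC": cc}
```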
Furthermore, an improvement ratio (IR) index is introduced to evaluate the degree of improvement of the proposed method with respect to the comparative methods:

$$\mathrm{IR} = \frac{V_{sota} - V_{prop}}{V_{sota}} \times 100\%,$$

where $V_{sota}$ is the best evaluation value among the comparative (state-of-the-art) methods and $V_{prop}$ is the evaluation value of the proposed method.
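As a minimal illustration of the IR index, assuming the error-metric convention in which lower values are better:

```python
def improvement_ratio(v_sota, v_prop):
    """IR (%) of the proposed method relative to the best comparative value.

    For error metrics (NMSE, MAE, MAPE), a lower v_prop than v_sota
    yields a positive IR, i.e. an improvement.
    """
    return (v_sota - v_prop) / v_sota * 100.0
```

For example, reducing the best comparative NMSE of 0.75 to 0.60 corresponds to an IR of 20%.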

Experiments and discussion
Using multi-temporal Landsat images, experiments were carried out to test the effectiveness and efficiency of the proposed reconstruction method against the three state-of-the-art methods. Accuracy assessments were conducted on the reconstruction results of the time-series images to discuss the improvement of the proposed LSTM-based model over the comparative methods, the reconstruction performance across cloud-covered percentages and target time steps, and the reconstruction error distributions over various land-cover types.

Reconstruction results in Iberian Peninsula
The time-series reconstructed images (in the NIR-red-green band composition) produced by the proposed LSTM-based method are presented in Figure 6. Since the images of DOY 170 and 330 have small cloud-covered regions (cloud-covered percentages below 0.01), we do not show their reconstruction results. On the whole, the multi-temporal reconstructed images capture the landscape change process of the study area. For every image in the time series, the proposed method reconstructed the missing data in cloud-covered regions with various cloud-covered percentages and distributions. Taking a close look at the textural and spectral presentation of the reconstructed images, we found that the proposed method achieved better reconstruction performance on the central images of the time series (such as those with DOY 186, 218, 234, and 250) than on the beginning (e.g. DOY 90) and end (e.g. DOY 346) images. We argue that the for-backward network can incorporate temporal tendency information from both previous and subsequent time-series images, so a central image can exploit more tendency and change information than the beginning and end images when reconstructing missing data.
Furthermore, we observed serious spectral distortion in the reconstructed images with DOY 106, 122, and 298. The likely reason is that their cloud-covered percentages were too large (approximately 0.92, 0.89, and 0.91, respectively), leaving cloud-free regions too small to provide enough information for restoring the cloud-covered regions.

Figure 6. Reconstruction results of cloud-covered Landsat images by the proposed method. Subfigures (a-l) show the reconstructed images in the NIR-red-green band composition; the text below each image indicates the imaging date.

Compared results
For comparison, the three state-of-the-art reconstruction methods (the STWR-based, STP-based, and PCAE-based methods) were applied to the multi-temporal images to restore the missing data of cloud-covered regions. Since the STWR-based and STP-based methods are multi-temporal based, the whole time series was used to reconstruct the target images. In the PCAE-based method, the image with the closest acquisition date to the image to be reconstructed was taken as the reference (source) image; for example, the image with DOY 266 was referenced to the image with DOY 314. Reconstructed images with DOY 74, 202, and 266, and error maps for DOY 202, are presented in Figure 7. The error maps indicate the sum of the absolute differences between the predicted and actual values over the four bands.
Generally, both the comparative methods and the proposed method yielded visually plausible images compared to the actual observations. On close comparison with the actual cloud-free images and the error maps, the proposed method captured the landscape change over time (especially for the vegetation-covered regions in the yellow boxes), and our restored results were quite close to the actual ones in both spatial detail and spectral comparability.

Quantitative evaluation
To quantitatively evaluate the reconstruction performance of the proposed LSTM-based approach (against the STWR-based, STP-based, and PCAE-based methods), we carried out four experiments to reconstruct the image with DOY 186 (cloud-covered percentage 0.69). In the STWR-based, STP-based, and LSTM-based experiments, we employed the whole time series as reference images to restore the image with DOY 186; in the PCAE-based experiment, the image with DOY 202 was taken as the reference image. The accuracy assessment of the reconstruction results is listed in Table 1.
Overall, the LSTM-based method achieved the best reconstruction accuracy among the four methods. This was expected: compared with the PCAE-based method (which uses only one reference image), the LSTM-based method incorporated more landscape-change information from the time-series images to reconstruct cloud-covered regions; compared with the STWR-based and STP-based methods, the LSTM-based method used the for-backward network to learn multi-scale and nonlinear time-series information and thus capture the landscape-change patterns.
Furthermore, the reconstruction accuracies of the LSTM-based method varied greatly across spectral bands. The green band presented the best reconstruction accuracies (0.6029 in NMSE, 32.1560 in MAE, 4.88% in MAPE, and 0.9572 in CC), followed by the blue and red bands, while the NIR band produced the worst results. This is consistent with previous studies (Chen, Huang, and Chen et al. 2016). On one hand, clouds have the greatest impact on the blue band, which has the shortest wavelength; on the other hand, the red and NIR bands mainly reflect the more complex temporal tendency of vegetation growth (widespread in the study area), which is hard for reconstruction models to capture precisely. The LSTM-based method achieved its greatest improvement in the red band (IR of 20%, 27%, 7%, and 3% in NMSE, MAE, MAPE, and CC, respectively), further indicating that the proposed method can extract and model vegetation tendency and change patterns better than the comparative methods.

Computational complexity
Since the four reconstruction methods differ in principle, technological process, and data preparation, it is hard to theoretically evaluate and compare the computational complexity of their whole procedures. Therefore, in this study, we focused on their key steps and compared their time consumption and their occupancy of memory and GPU resources. In the PCAE-based method, a global mapping function (an AE network) from reference images to target images was established for reconstruction. In the proposed LSTM-based method, there were 10 global time-series prediction functions (LSTM networks), one for each of the 10 clusters. Thus, the training and prediction computation of the AE network and of the 10 LSTM networks represented the computational complexity of the PCAE-based and LSTM-based methods, respectively. In the STWR-based and STP-based methods, by contrast, each cloud-covered pixel or patch was reconstructed by a local regression function, so we summed the training and prediction consumption of all cloud-covered pixels or patches as the total.
Following the same experimental settings as in the above subsection, we applied the four methods to reconstruct the missing data of the image with DOY 186. Table 2 presents their time consumption and resource occupancy on a computer with 32 GB of memory and 4 GB of GPU memory. Values before and after "+" are the training and prediction consumption, respectively. Note that the training times of the PCAE-based and LSTM-based methods are for one epoch; approximately 20 and 30 epochs, respectively, were required to achieve stable performance.
Since the PCAE-based and LSTM-based methods employ deep-learning algorithms, they used more time (especially training time), memory, and GPU resources, whereas the STWR-based and STP-based methods did not use the GPU. Furthermore, to reconstruct a cloud-covered pixel, the PCAE-based method utilized pixels from both spatial neighbors and multi-temporal images, whereas the LSTM-based method used pixels only from multi-temporal images. Thus, the PCAE-based method occupied more resources than the LSTM-based method.

Discussion on reconstruction performance
The for-backward LSTM network was used to reconstruct cloud-covered images. To further understand its performance, we conducted a series of experiments examining (1) whether the for-backward structure was useful, (2) the minimum cloud-free percentage (or the maximum cloud-covered percentage) that allows missing pixels to be reconstructed for an image at a specific time step, (3) whether the center image of the time series presented better reconstruction results, and (4) the reconstruction performance on various land-cover types.

Forward vs. backward vs. for-backward LSTM
In this study, a for-backward LSTM model was designed to extract both forward and backward time-series information to improve reconstruction performance. Here, we carried out a comparative experiment to test the advantage of the for-backward model over a forward model and a backward model (as presented in Figure 3, using only the forward or backward part). To reconstruct the cloud-covered image with DOY 186 (this time step is at the center of the time series), we fed a forward time series from DOY 10 to DOY 170 into the forward LSTM model, a backward time series from DOY 362 to DOY 202 into the backward LSTM model, and the whole time series into the for-backward LSTM model. The accuracy assessment of the reconstruction results produced by the three models is presented in Table 3.
Overall, the for-backward LSTM model achieved the best performance in all evaluation indices and all spectral bands, compared with the forward and backward models. This was expected, since the for-backward model can incorporate both the previous tendency information (from the forward time series) and the subsequent tendency information (from the backward time series) to model the landscape change pattern for reconstructing missing pixels.
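As an illustration of the for-backward idea (not the paper's trained networks), the sketch below fuses a forward prediction from the preceding observations with a backward prediction from the succeeding ones. The linear extrapolators standing in for the two LSTMs, the function names, and the averaging weight `w` are all hypothetical:

```python
import numpy as np

def forward_predict(past):
    # Stand-in for the forward LSTM: linearly extrapolate
    # from the last two observations of the preceding series.
    return 2.0 * past[-1] - past[-2]

def backward_predict(future):
    # Stand-in for the backward LSTM: extrapolate backwards
    # from the first two observations of the succeeding series.
    return 2.0 * future[0] - future[1]

def for_backward_predict(past, future, w=0.5):
    # Fuse the two directional predictions. The paper's network learns
    # this fusion; a fixed weighted average is used here for illustration.
    return w * forward_predict(past) + (1.0 - w) * backward_predict(future)
```

For a pixel whose reflectance changes smoothly, both directional predictions agree and the fused estimate matches either one; when the two directions disagree (e.g. an abrupt change on one side), the fusion tempers the error of the weaker direction.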

Cloud-covered percentage vs. reconstruction performance
Landsat images may be covered by clouds of various extents (cloud-covered percentages from 0% to 100%), and we employed cloud-free regions to reconstruct cloud-covered regions. Here, we carried out a series of experiments (taking the cloud-free image with DOY 170 as the target image) to examine whether different cloud-covered percentages affected the reconstruction performance, and to determine the maximum cloud-covered percentage that can be restored. Since the image with DOY 170 is cloud-free, we first generated a series of simulated cloud-covered images with cloud-covered percentages from 10% to 100% at an interval of 10% (and of 5% to better present the change in performance). Then, we used the cloud-free regions to restore the cloud-covered regions. Finally, we compared the reconstructed image with the actual image to test the reconstruction performance. Figure 8 presents the change in reconstruction accuracy (NMSE and MAPE) in the blue, green, and red bands as the cloud-covered percentage increases (since the NIR band produced the worst reconstruction results, we do not discuss it here). As the cloud-free percentage decreased, the reconstruction performance gradually degraded. When the cloud-free percentage decreased from 90% to 40%, accuracies remained stable (for example, the MAPE of the green band stayed around 5.0) and showed only a very small decline (less than 0.2 in MAPE for the green band). When the cloud-free percentage fell below 40% (from 40% to 5%), reconstruction accuracies dropped sharply (from 5.2 to 14.0 in MAPE for the green band). A closer look at the accuracy change of the red band shows that its decline appeared earlier (at approximately 50% cloud-free percentage) and was sharper from 20% to 5%, compared with the green and blue bands. This further verified the conclusion in subsection 3.2 that overly small cloud-free regions cannot provide enough information for reconstructing cloud-covered regions.
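The simulated cloud masks used in this experiment can be generated as sketched below. A purely random pixel mask is an assumption for illustration; the paper's simulated clouds may instead be contiguous regions:

```python
import numpy as np

def simulate_cloud_mask(shape, covered_pct, seed=0):
    """Mark `covered_pct` of the pixels as cloud-covered (True).
    Randomly scattered pixels are an illustrative assumption; real
    simulated clouds would typically be contiguous patches."""
    rng = np.random.default_rng(seed)
    n = int(np.prod(shape))
    k = int(round(n * covered_pct))
    mask = np.zeros(n, dtype=bool)
    # Choose k distinct pixel indices to cover.
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask.reshape(shape)
```

Applying such masks at 5% steps to the cloud-free DOY 170 image yields the series of simulated targets whose restorations are compared against the actual image in Figure 8.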
Here, we concluded that the maximum cloud-covered percentage for producing good reconstruction was 60%. When the cloud-covered percentage was greater than 60%, it was hard for the proposed method to achieve satisfactory results. Thus, in the following discussion, we focused on reconstruction of images with cloud-free percentages greater than 40%.

Reconstruction performance along time steps
The proposed for-backward method employed a forward sub-series and a backward sub-series to predict cloud-covered missing data. The number of images was 22 in this study, so the total number of forward and backward steps was 21. For a specific step, the numbers of forward and backward time-series images were t and 21 − t, respectively. How, then, does the reconstruction accuracy of the proposed method vary across time steps with different numbers of forward and backward images? Here, we carried out a series of experiments to reconstruct images at different time steps. Based on the conclusion from the previous experiments that the minimum cloud-free percentage of the target image for reconstruction is 0.4, we selected images with cloud-free percentages greater than 0.4 as target images, including those with DOY 26, 58, 74, 90, 138, 154, 170, 186, 202, 218, 234, 250, 266, 314, 330, and 346. Figure 9 presents the reconstruction performance at the different time steps.
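The split of the 22-image series around a target step t (t forward images and 21 − t backward images) can be sketched as follows; the function name and the reversal of the backward sub-series for backward-direction processing are illustrative assumptions:

```python
def split_series(series, t):
    """Split a time series around target index t: the forward sub-series
    holds the t preceding images, and the backward sub-series the
    len(series) - 1 - t following images (22 images -> t forward
    and 21 - t backward steps)."""
    forward = series[:t]             # earlier images, in time order
    backward = series[t + 1:][::-1]  # later images, reversed for the backward LSTM
    return forward, backward
```

For an image near either end of the series, one of the two sub-series shrinks toward zero, which is exactly the regime where the for-backward model degenerates into a purely forward or backward predictor.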
Overall, when the time step was close to the center of the time series (such as DOY 186 and 234), the for-backward model achieved better reconstruction performance; when the step was far from the center (such as DOY 74 and 330), the reconstruction accuracy worsened. We argue that the change of the ground surface is continuous and gradual, and the for-backward model can incorporate the trend information from both forward and backward steps in the time-series images. When the target image was at the beginning or the end of the time series, the for-backward model degenerated into a backward or forward model and produced worse reconstruction results. Thus, when using the for-backward model to reconstruct cloud-covered images, it is better to place the target image at the center of the time series. For example, to reconstruct images in October, multi-temporal images from April to March of the next year are preferable to those from January to December. Taking a closer look at the reconstruction performance before and after DOY 186 (the center of the time series), we found that reconstructions after DOY 186 (from DOY 202 to 234) were slightly better than those before DOY 186 (from DOY 138 to 170). One possible reason is that there were more continuous and usable images (cloud-free percentage greater than 0.4) after DOY 186.
Furthermore, determining the minimum number of time steps required for stable results would improve the repeatability of the proposed method. The pattern of performance change in Figure 9 suggested a suitable length for the time series. In the proposed method, we only utilized the trend relationship (without the periodic component of multiple-year observations) in the time-series images. Taking the minimum of the forward and backward sub-series lengths as the effective length, the accuracy changes diminished once this length exceeded 8 (DOY from 170 to 234) in Figure 9. From this perspective, we argue that the minimum length for good performance is 17 (8 + 1 + 8) in the Iberian Peninsula. Note, however, that land-cover types, cloud-covered percentages, and spatial-temporal distributions can also affect the reconstruction performance, in addition to the length of the time series.

Error distributions vs. land cover types
Various land-cover types present different change patterns in multi-temporal images and different sensitivities to cloud cover (Wu et al. 2018). In this experiment, we assessed the reconstruction performance of the proposed method for various land-cover types (water, forest, grassland, farmland, and urban area). To reconstruct the image with DOY 202, we first randomly removed grid patches (each covering a predominant land cover, such as forest) to simulate cloud-covered regions (approximately 20% of the area for each land-cover type), and then employed the remaining cloud-free regions to reconstruct them. Finally, we compared the actual image with the reconstructed results pixel by pixel to calculate the error distributions (the sum of errors over the four bands) across the images. Figure 10 presents the actual sub-images, the corresponding reconstruction results, and the error distributions normalized to the range [0, 255] using the maximum and minimum error values.
The proposed method produced more and greater reconstruction errors for artificial objects (farmland and urban area) than for natural objects (water, forest, and grassland). Specifically, water areas presented small and uniform reconstruction errors across the images, which is consistent with the conclusion of Wu's study (Wu et al. 2018): since water bodies show low spectral radiance in satellite images and change little over a year, the proposed method produced smaller absolute errors for water areas. Being less affected by human activities, forest (especially in alpine and gorge regions, as presented in Figure 10) is spectrally homogeneous; furthermore, forest in satellite images is mainly dominated by the green band, which showed better reconstruction performance than the other bands. Thus, forest shows smaller errors, and grassland followed rules similar to forest. In theory, farmland, being dominated by vegetation, should present an error distribution similar to that of forest. However, farmland is artificial, and agricultural activities (such as multiple cropping) can change or even destroy the temporal tendency of vegetation; thus, farmland presented greater reconstruction errors. Moreover, its error distribution was parcel-based: pixels from one parcel tended to present similar error values. Compared with water areas and forest, urban areas in medium-resolution Landsat images contain more mixed pixels (with water, vegetation, or bare land) and more land-cover changes resulting from human activities. Consequently, urban areas (especially developing areas) presented greater errors than the other land-cover types.
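The normalized error maps of Figure 10 can be computed as in this sketch; summing the absolute errors over the four bands is an assumption about how the per-band errors are combined:

```python
import numpy as np

def error_map(actual, restored):
    """Sum absolute errors over the spectral bands and stretch the
    result to [0, 255] by min-max normalization, as for Figure 10.
    `actual` and `restored` are (bands, rows, cols) arrays."""
    err = np.abs(restored.astype(float) - actual.astype(float)).sum(axis=0)
    lo, hi = err.min(), err.max()
    if hi == lo:  # avoid division by zero on a flat error map
        return np.zeros_like(err)
    return 255.0 * (err - lo) / (hi - lo)
```

Rendering this map as a grayscale image makes the parcel-based error pattern of farmland and the high-error developing urban areas directly visible.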

Compared results in Haiyang County
To further verify the performance of the proposed LSTM-based method, an additional experiment was carried out in Haiyang County, Shandong Province, China. The proposed method and the three comparative reconstruction methods were applied to the multi-temporal images to restore cloud-covered images. The reconstruction results and the corresponding actual images with DOY 10, 250, and 298 are presented in Figure 11.
Generally, all four methods restored the missing data in the cloud-covered regions, and the proposed LSTM-based method produced the results most similar to the actual images. Furthermore, the proposed method captured the time-series changes of water areas, in the middle of the DOY 10 image and at the bottom right of the DOY 298 image. We argue that using the whole time series can effectively capture the dynamic changes of the ground surface, which led to the better reconstructions and demonstrated the performance of the proposed method.
Also, following the procedure of the quantitative evaluation above, we used MAPE to conduct a simple accuracy assessment of the restored green bands from the four methods, as listed in Table 4.
Similarly, the proposed LSTM-based method achieved a great improvement over the three comparative methods, which was consistent with the study in the Iberian Peninsula and further proved the performance of the proposed method. Across time steps, the reconstruction at DOY 250 achieved the best accuracy, followed by DOY 298, while DOY 10 performed worst. This was expected, since DOY 10 is at the beginning of the time series. Comparing these results with those in the Iberian Peninsula, this experiment achieved worse performance; a possible reason is that this study area had a greater mean cloud-covered percentage than the Iberian Peninsula.

Conclusion
Reconstructing the missing data of cloud/shadow-covered optical images is of great importance for enhancing data availability and multi-temporal analysis. We proposed a deep-learning-based method for reconstructing cloud/shadow-covered missing data in time-series images, through the combined use of AE-LSTM-based similar pixel clustering and for-backward LSTM-based missing data reconstruction. The method was applied to a dataset of multi-temporal cloud-covered Landsat-8 OLI images. The reconstruction results, showing a considerable improvement relative to the state-of-the-art methods, demonstrated the effectiveness of the proposed method for missing data reconstruction in time-series satellite images. From further discussions of the performance of the proposed framework with various directional tendencies, with various cloud/shadow-covered percentages, at different time steps, and for various land-cover types, this study concludes: (1) the for-backward reconstruction model presents better results than the forward and backward models; (2) the cloud/shadow-covered percentage of the target image should be less than 60% for a good reconstruction; (3) a target image near the center of the time series achieves better reconstruction results than one at the beginning or end; and (4) the reconstruction method achieves better performance for natural ground objects (such as water areas and forest) and the blue band.
Despite these encouraging results, more work is needed to further explore the use of deep learning for time-series reconstruction of satellite images, such as (1) incorporating satellite data from other sensors to reconstruct completely cloud-covered images (such as the image with DOY 362); and (2) designing an integrated deep-learning-based network that joins the similar pixel clustering and the missing data reconstruction together.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported in part by the National

Data availability statement
The data that support the findings of this study are available upon request from the corresponding author, or can be accessed through https://earthexplorer.usgs.gov.