An object-based spatiotemporal fusion model for remote sensing images

ABSTRACT Spatiotemporal fusion combines the complementary temporal and spatial resolutions of different images to achieve continuous monitoring of the Earth's surface, and is a feasible solution to the trade-off between the temporal and spatial resolutions of remote sensing images. In this paper, an object-based spatiotemporal fusion model (OBSTFM) is proposed to produce spatiotemporally consistent data, especially in areas experiencing non-shape changes (including phenology changes and land cover changes without shape changes). Considering the different changes that might occur in different regions, multi-resolution segmentation is first employed to produce segmented objects, and a linear injection model is then introduced to produce a preliminary prediction. In addition, a new optimized strategy for selecting similar pixels is developed to obtain a more accurate prediction. The performance of the proposed OBSTFM is validated using two remotely sensed datasets experiencing phenology changes in heterogeneous areas and land cover type changes. Experimental results show that the proposed method is advantageous in areas with non-shape changes, and offers satisfactory robustness and reliability in blending large-scale abrupt land cover changes. Consequently, OBSTFM has great potential for monitoring highly dynamic landscapes.


Introduction
In recent years, spatiotemporal data fusion technology has been explored in many studies to promote the applications of multi-source remote sensing data. For example, the images with high spatial and temporal resolution generated by spatiotemporal data fusion can be used to monitor crop growth (Gao et al., 2017; Zheng et al., 2016) and ecosystem dynamics (Gao et al., 2015; Nagendra et al., 2013), estimate crop yields (Allen et al., 2011) and biomass, and improve land cover classification (Kong et al., 2016). Many multi-satellite data fusion algorithms have been developed to integrate information from different spatial and temporal resolutions to produce data with both high spatial and high temporal resolution. According to the techniques used to model the relationship between two or more input data sources, existing spatiotemporal data fusion methods can be classified into five categories: weight function-based, unmixing-based, Bayesian-based, learning-based, and hybrid methods (Zhu et al., 2018).
Among the weight function-based methods, the spatial and temporal adaptive reflectance fusion model (STARFM) proposed by Gao et al. is the earliest spatiotemporal fusion model, and most subsequent weight function-based methods follow its principle. STARFM assumes that the central pixel and its spectrally similar pixels within a moving window have similar temporal increments, establishes the relationship between high- and low-resolution images, and estimates the fused result by weighting the searched similar pixels (Gao et al., 2006). However, this method is not suitable for heterogeneous landscapes or areas where land cover changes exist. Several weight function-based methods have since been developed to overcome the limitations of STARFM. For example, the spatial temporal adaptive algorithm for mapping reflectance changes (STAARCH) was developed to improve the ability to capture disturbance phenomena (Hilker et al., 2009); an enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM) shows better prediction accuracy in heterogeneous landscapes than STARFM (Zhu et al., 2010); the spatial and temporal nonlocal filter-based fusion model proposed by Cheng et al. further improves the performance in heterogeneous areas (Cheng et al., 2017); and Wang et al. developed an algorithm named Fit-FC to improve the prediction accuracy in different landscapes (Q. Wang & Atkinson, 2018). Furthermore, some studies focus on optimizing the selection strategy of similar pixels by integrating an initial prediction (Lu et al., 2016), and a recently proposed fusion method, the robust adaptive spatial and temporal fusion model (RASTFM), employs a non-local redundancy strategy to search for similar pixels and achieve a more robust prediction (Zhao et al., 2018).
With respect to the unmixing-based methods, the first unmixing-based spatiotemporal method is arguably the multisensor multiresolution technique (MMT) proposed by Zhukov et al. (1999), which faced challenges in spectral unmixing. Several unmixing-based fusion methods were subsequently proposed to address these challenges by introducing constraints (Zurita-Milla et al., 2008) or incorporating regularization into the unmixing process (Amorós-López et al., 2013), and the object-based spatial and temporal vegetation unmixing model (OB-STVIUM) applied segmentation to better estimate the endmembers (Lu et al., 2016). Some studies focused on acquiring the prediction by estimating endmember changes (Wu et al., 2012), and STDFA was further improved by incorporating sensor differences and spatial variability (Wu et al., 2018). Maselli et al. proposed the spatial enhancer of vegetation index image series, which produces spatially enhanced NDVI data series based on linear spectral mixing theory (Maselli et al., 2019).
Bayesian-based methods have shown their effectiveness in the spatial-spectral fusion of remotely sensed images. For example, the Bayesian maximum entropy method (BME) employs covariance functions to link images of the advanced microwave scanning radiometer for Earth observing system (AMSR-E) and MODIS, and blends synthetic sea surface temperature data effectively (A. Li et al., 2013). The NDVI-BSFM method uses a linear mixing model to link MODIS and Landsat NDVI with multi-year average NDVI time series data and, as a result, shows strength and robustness in providing fine NDVI datasets (Liao & Wang, 2016). A Bayesian data fusion approach that incorporates the temporal correlation information in the image time series was proposed by Xue et al. (2017) and showed better performance than STARFM and ESTARFM in blending heterogeneous landscapes.
Recently, learning-based methods have grown considerably (Tan et al., 2018). Huang and Song (2012) proposed the sparse-representation-based spatiotemporal reflectance fusion model (SPSTFM), which is perhaps the first dictionary-pair learning method in the spatiotemporal fusion field. A new sparse coding method with dictionary perturbation regularization was later introduced into sparse-representation-based spatiotemporal reflectance fusion and can provide a more accurate prediction than STARFM. Wei et al. (2017) proposed a new dictionary-based fusion method using compressed sensing to integrate remote sensing images of different spatial and temporal resolutions and obtained satisfactory reconstruction results. Song et al. (2018) established a novel spatiotemporal fusion method using a deep convolutional neural network (STFDCNN), which can effectively extract and express information in large-scale remote sensing data. Tan et al. (2018) proposed a deep convolutional spatiotemporal fusion network (DCSTFN) and achieved satisfactory performance. A two-stage spatiotemporal fusion method has also been proposed, which showed advantages in capturing phenological changes and retrieving land cover changes. H. Zhang et al. (2020) proposed a remote sensing image spatiotemporal fusion method using a generative adversarial network (STFGAN) and showed its satisfactory effectiveness and generalization ability.
To take advantage of different algorithms, hybrid spatiotemporal fusion methods have been developed to integrate the ideas of the methods listed above. For example, the spatial and temporal reflectance unmixing model (STRUM) (Gevaert & García-Haro, 2015) and unmixing-based STARFM (USTARFM) combine the strengths of STARFM and unmixing-based methods; the improved STRUM (ISTRUM) further integrates spectral mixture analysis to better capture spectral variability and spatial details (Ma et al., 2018); and the recently proposed virtual image pair-based spatio-temporal fusion (VIPSTF) approach provides a flexible framework for improving existing spatial weighting- and spatial unmixing-based methods (Q. Wang et al., 2020). The flexible spatiotemporal data fusion (FSDAF) method is a representative spatiotemporal fusion framework integrating linear unmixing, spatial interpolation and weight-based functions. Specifically, FSDAF uses a classifier and unmixing theory to estimate the temporal change of each class, distributes the residuals with the help of a thin plate spline interpolator, and finally obtains the prediction by introducing neighborhood information. As a result, FSDAF obtains a more robust and accurate prediction in different landscapes, especially heterogeneous ones. Moreover, FSDAF has the ability to predict both gradual change and land cover type change (Liu et al., 2019b). Following FSDAF, the enhanced FSDAF (EFSDAF), an improved FSDAF (IFSDAF), an enhanced FSDAF that incorporates subpixel class fraction change information (SFSDAF), and FSDAF 2.0 were later developed to overcome the limitations of FSDAF and improve the prediction accuracy (Liu et al., 2019b; Guo et al., 2020; X. Li et al., 2020; Shi et al., 2019).
While the above spatiotemporal fusion algorithms can produce effective fusion results to a certain extent, they still face some challenges in practice. First, many existing spatiotemporal fusion methods, such as ESTARFM, STAARCH and the Bayesian-based methods, need more than one low- and high-resolution image pair as input, which greatly limits the application scenarios of spatiotemporal fusion technology (Guo et al., 2020). Specifically, methods like ESTARFM cannot be used for real-time processing because they require data acquired after the prediction date. When blending historical data, it may also be difficult to acquire an additional low- and high-resolution image pair temporally close to the prediction time, owing to cloud cover in many cases (Song & Huang, 2013; Q. Wang & Atkinson, 2018). Second, the ability to retrieve fast change and the robustness of these methods need further improvement, especially when blending heterogeneous sites. For example, Fit-FC, proposed to retrieve strong temporal changes, performs more effectively than STARFM and FSDAF in capturing fast phenological changes (Q. Wang & Atkinson, 2018), but shows lower accuracy than FSDAF in retaining image structures when blending heterogeneous sites (Liu et al., 2019a). The accuracy of learning-based methods decreases when spatial heterogeneity is high and the spectral scale differences between low- and high-resolution images are large. Many methods that employ the idea of unmixing, such as STDFA and FSDAF, assume that the classification results of the two phases are the same; although the unmixing step can then preserve surface structures well, classification errors can seriously affect the accuracy of the unmixing results and lower the robustness of the model. Consequently, existing spatiotemporal fusion algorithms need further improvement and optimization.
To address the above-mentioned limitations to some extent, this paper proposes an object-based spatiotemporal fusion method (OBSTFM) that captures temporal changes more effectively with minimal input data. Considering the actual distribution of surface features, a segmentation method is introduced to produce surface objects with good similarity and uniformity, and the fusion computation is optimized within each object. The proposed method consists of three parts: multi-resolution segmentation, linear injection and spatial filtering. Multi-resolution segmentation is first applied to generate segmented objects; the linear injection model is then introduced to fuse the input images within each object and produce a preliminary prediction, which is finally refined by spatial filtering within each object to derive the final prediction. The main contributions of this paper are as follows: 1) based on the multi-resolution segmentation result, the local linear injection model is implemented in each object, which ensures the fusion uniformity of each object without being influenced by the different variations of other objects; and 2) in the spatial filtering process, an optimized strategy for selecting similar pixels is provided to achieve a more robust and accurate prediction.

Methodology
The basic idea of OBSTFM is to downscale the low-resolution image to high-resolution objects through spatial analysis and linear injection theory. First, similar to the first step in OB-STVIUM (Lu et al., 2016), multi-resolution segmentation is applied to the input high-resolution image to acquire high-resolution objects. Second, a linear injection model is introduced to fuse the intersected high- and low-resolution objects to generate a preliminary prediction. However, pixel-by-pixel computation inevitably produces uncertainties caused by the complexity of the formulation; to enhance the stability and robustness of the whole fusion process, spatial filtering is implemented in each object to obtain the final prediction. The flowchart of the proposed OBSTFM is shown in Figure 1.

Multi-resolution segmentation
The features covered by remote sensing images generally show complex distribution characteristics. To acquire the proper shape and spectral features and segmentation scales of surface features, the popular multi-resolution segmentation algorithm (MRSA) of the eCognition software is applied. It is a bottom-up segmentation technique that merges adjacent pixels or objects to create homogeneous objects, so that the pixels in each object are spectrally similar and spatially adjacent (Benz et al., 2004). In the segmentation process, several parameters, including the smoothness weight, spectrum weight and scale parameter, need to be set appropriately, among which the scale parameter (SP) is the key parameter controlling the internal heterogeneity and average size of the objects (Drǎguţ et al., 2014). Only with appropriate parameter settings can objects consistent with the actual ground features be obtained. The spectrum weight is generally set larger than the shape weight to achieve better segmentation accuracy, because of the rich spectral information in high-resolution data. Setting the optimal segmentation parameters is more complicated; to obtain a suitable segmentation scale, the scale parameter is set to 150 through "trial-and-error" optimization, while the smoothness weight and spectrum weight are empirically set to their default values of 0.5 and 0.6, respectively.
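eCognition's MRSA is proprietary, but the bottom-up merging idea can be illustrated with a minimal sketch: every pixel starts as its own region, and adjacent regions are merged while the increase in area-weighted spectral heterogeneity stays below a scale threshold. The shape criterion, weighting parameters and exact merging order of MRSA are omitted here; this is an illustrative toy, not the actual algorithm.

```python
# Toy bottom-up region merging in the spirit of multi-resolution segmentation.
# Regions are merged greedily while the increase in area-weighted spectral
# heterogeneity of the fused region stays below a user-chosen scale threshold.

def _heterogeneity(values):
    """Area-weighted spectral heterogeneity: n * standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return n * var ** 0.5

def segment(image, scale):
    """image: 2-D list of reflectance values; returns a 2-D list of region labels."""
    rows, cols = len(image), len(image[0])
    # Each pixel starts as its own region.
    label = [[r * cols + c for c in range(cols)] for r in range(rows)]
    regions = {r * cols + c: [image[r][c]] for r in range(rows) for c in range(cols)}

    merged = True
    while merged:
        merged = False
        for r in range(rows):
            for c in range(cols):
                for dr, dc in ((0, 1), (1, 0)):  # 4-connected neighbours
                    rr, cc = r + dr, c + dc
                    if rr >= rows or cc >= cols:
                        continue
                    a, b = label[r][c], label[rr][cc]
                    if a == b:
                        continue
                    fused = regions[a] + regions[b]
                    increase = (_heterogeneity(fused)
                                - _heterogeneity(regions[a])
                                - _heterogeneity(regions[b]))
                    if increase < scale:  # merge only homogeneous neighbours
                        regions[a] = fused
                        del regions[b]
                        for i in range(rows):
                            for j in range(cols):
                                if label[i][j] == b:
                                    label[i][j] = a
                        merged = True
    return label
```

On an image whose left and right halves carry distinct uniform reflectances, a small scale threshold yields exactly two objects, mirroring how the SP parameter controls object size and internal heterogeneity.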

Linear injection model
To fuse the input high- and low-resolution images in each object, a simple linear interpolation model, commonly used for pansharpening, is considered in the second step. A general formulation of the pansharpening scheme (Vivone et al., 2015) is given as:

H_b^HR = M_b + g_b (P − P_low)    (1)

where the subscript b indicates the bth spectral band, M_b is the bth band of the interpolated multispectral image, and M_b^HR is the reconstructed high-resolution version of M_b. P represents the panchromatic image, P_low is obtained by low-pass filtering P, and g_b is the injection gain.
In spatiotemporal fusion, the auxiliary high-resolution data at the base date are used to enhance the resolution of the low-resolution data at the prediction date. It is reasonable to apply the relevant spatial detail from the auxiliary high-resolution data while preserving the necessary spectral information. Therefore, spatiotemporal fusion can be treated as a pansharpening problem, described as follows:

H_2(b) = L_2(b) + g_b [H_1(b) − L_1(b)]    (2)

where H_1(b) is the high-resolution image of band b at the base date, and L_1(b) and L_2(b) are the resampled low-resolution images at the base date and prediction date of band b at the scale of the high-resolution image. The injection gain g_b significantly affects the accuracy of the model; an algorithm that relies upon the optimization of the injection coefficients by a least squares method is applied (Aiazzi et al., 2007; Vivone et al., 2015) to obtain g_b:

g_b = cov(H_1(b), L_1(b)) / var(L_1(b))    (3)

This model was introduced for spatiotemporal fusion recently because of its simplicity and effectiveness. In the fusion method proposed by Sun et al., the linear injection model is applied to the entire image (referred to here as the global linear injection model); in those fusion results, the boundaries of some features are blurred to some extent and some spatial details are weakened. In this step, the linear injection model is instead employed to fuse the input high- and low-resolution images in each object (the local linear injection model). Before this step, the low-resolution images are downscaled to the same resolution as the high-resolution image by a bicubic interpolation operator. The fusion process can be formulated as follows:

H_2^LLIM(O_i, b) = L_2(O_i, b) + g_b(O_i) [H_1(O_i, b) − L_1(O_i, b)]    (4)

where H_2^LLIM(O_i, b) is the preliminary fused object O_i at t_2 using this linear injection model. Accordingly, g_b(O_i) is estimated by least squares within each object:

g_b(O_i) = cov(H_1(O_i, b), L_1(O_i, b)) / var(L_1(O_i, b))    (5)

Although fusing images within objects ensures better spatial fidelity, some abnormal values might occur in the fused result because of the noise contained in the input images.
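The local linear injection step can be sketched as follows, assuming the per-object prediction takes the form L2 + g_b·(H1 − L1) with g_b estimated by least squares. The covariance-ratio estimator and the flattened, single-band object representation are simplifying assumptions of this sketch.

```python
# Minimal sketch of the local linear injection model for one object and one band.
# h1: high-resolution values at the base date; l1, l2: resampled low-resolution
# values at the base and prediction dates, all flattened over the object's pixels.

def injection_gain(h1, l1):
    """Least-squares gain: cov(H1, L1) / var(L1) over the pixels of one object."""
    n = len(h1)
    mh = sum(h1) / n
    ml = sum(l1) / n
    cov = sum((h - mh) * (l - ml) for h, l in zip(h1, l1)) / n
    var = sum((l - ml) ** 2 for l in l1) / n
    return cov / var if var > 0 else 1.0

def linear_injection(h1, l1, l2):
    """Preliminary high-resolution prediction at the prediction date for one object."""
    g = injection_gain(h1, l1)
    return [lo2 + g * (hi1 - lo1) for hi1, lo1, lo2 in zip(h1, l1, l2)]
```

Applying the same functions to the whole image instead of one object gives the global variant; the only difference is the pixel set over which the gain is fitted.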
To mitigate these abnormal values, a global threshold thre is set to filter noisy pixels:

H_2^prelim(O_i, b) = H_2^LLIM(O_i, b), if H_2^LLIM(O_i, b) ≤ thre; H_2^GLIM(O_i, b), otherwise    (6)

where H_2^GLIM(O_i, b) denotes the value of object O_i calculated using the global linear injection model of Equation (2). Therefore, the values produced by Equation (4) that are less than or equal to thre are retained as the preliminary fused objects, whereas values greater than thre are substituted by the estimates computed from the global linear injection model.
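The outlier screen can be sketched as a simple element-wise substitution; `thre` is treated as a given scalar here, and the function name is illustrative:

```python
# Keep locally fused values that do not exceed the global threshold `thre`;
# substitute the global linear injection estimate otherwise (Equation (6)).

def screen(local_vals, global_vals, thre):
    return [loc if loc <= thre else glo
            for loc, glo in zip(local_vals, global_vals)]
```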

Spatial filtering
After acquiring the preliminary prediction from the previous steps, and considering the inevitable uncertainties caused by pixel-by-pixel computation, neighborhood information is used to reduce these uncertainties and further enhance the robustness in blending areas where land cover has changed. Some studies (Gao et al., 2017; Zhu et al., 2016) selected similar pixels only from the input high-resolution data at the base date, without considering the possible complex changes between the two dates. Chen et al. (2018), among others, optimized the selection strategy by using the intersection of the pixels selected in the base data and in the initially predicted result at the prediction date. For methods that select similar pixels based on a moving window, an appropriate window size must be chosen through trial-and-error optimization. The object-oriented method, in contrast, can select similar pixels directly from a homogeneous segmented object, avoiding the window-size parameter. To facilitate the selection of similar pixels and avoid the uncertainties introduced by the setting of the moving window size, the local information in each object is applied for spatial enhancement.
After implementing multi-resolution segmentation on the input high-resolution image, the segmented objects are generally homogeneous, with similar spatial and spectral characteristics. Considering that complex changes might occur in different objects across time phases, the intersection of the pixels selected in the object of the input base image H_1(O_i, b) and of the preliminary fused result H_2^LLIM(O_i, b) is defined as the final set of similar pixels, and two times the standard deviation of each object is set as the threshold for selecting similar pixels. The selection strategy is formulated as:

|H_1(P_k, b) − H_1(P_t, b)| ≤ 2σ_1(O_i, b)    (7)
|H_2^LLIM(P_k, b) − H_2^LLIM(P_t, b)| ≤ 2σ_2(O_i, b)    (8)

where H_1(P_t, b) and H_2^LLIM(P_t, b) represent the value of the tth target pixel P_t in the ith object O_i in the input base image and the preliminary estimation, respectively, and σ_1(O_i, b) and σ_2(O_i, b) are the standard deviations of object O_i in the two images. The schematic diagram of the optimized selection strategy of similar pixels is shown in Figure 2. We define the final selected similar pixels in each object of the two phases as H_1(P_k^i, b) and H_2^LLIM(P_k^i, b), respectively, where k denotes the kth similar pixel. The N most similar selected pixels are then used to spatially filter the preliminary fused result from the previous steps. This strategy avoids, to some extent, the error caused by the incorrect selection of similar pixels. The weight of each similar pixel W_k is computed based on the spatial dependence between the selected similar pixels and the target pixel. The spatial distance d_k is defined in Equation (9) and normalized to an appropriate range using Equation (10):

d_k = sqrt((x_k − x_t)^2 + (y_k − y_t)^2)    (9)
D_k = 1 + (d_k − d_min) / (d_max − d_min)    (10)

where (x_k, y_k) and (x_t, y_t) denote the locations of the kth similar pixel and the target pixel, and d_min and d_max indicate the minimum and maximum distances, respectively.
Then, the weight of the kth similar pixel is computed as:

W_k = (1/D_k) / Σ_{j=1}^{N} (1/D_j)    (11)

To obtain the change information between the time phases and produce a more robust prediction, the preliminary fusion result produced by the linear injection model is applied here. All selected similar pixels from the preliminary prediction H_2^LLIM(P_k^i, b) and from the high-resolution image at the base date H_1(P_k^i, b) are weighted to compute the change information. Thus, the final estimation H_2(P_t^i, b) for the tth target pixel in band b is calculated as:

H_2(P_t^i, b) = H_1(P_t^i, b) + Σ_{k=1}^{N} W_k [H_2^LLIM(P_k^i, b) − H_1(P_k^i, b)]    (12)
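The selection and filtering steps can be sketched together. The two-standard-deviation test, the intersection of the two selections, and the inverse-distance weighting follow the description above; normalizing the distances to [1, 2] is an assumption made here so that the inverse-distance weights remain well defined, and the dict-based object representation is illustrative.

```python
# Sketch of similar-pixel selection and spatial filtering for one object/band.
# h1_obj and h2llim_obj map pixel coordinates (x, y) to values in the base
# image and the preliminary (local linear injection) prediction, respectively.

def _std(values):
    n = len(values)
    m = sum(values) / n
    return (sum((v - m) ** 2 for v in values) / n) ** 0.5

def similar_pixels(h1_obj, h2llim_obj, target):
    """Intersection of the 2-sigma selections in both phases."""
    s1, s2 = _std(list(h1_obj.values())), _std(list(h2llim_obj.values()))
    t1, t2 = h1_obj[target], h2llim_obj[target]
    sel1 = {p for p, v in h1_obj.items() if abs(v - t1) <= 2 * s1}
    sel2 = {p for p, v in h2llim_obj.items() if abs(v - t2) <= 2 * s2}
    return sel1 & sel2  # pixels similar in both phases

def final_estimate(target, h1_obj, h2llim_obj):
    """Inject the distance-weighted temporal change of the similar pixels
    into the base-image value of the target pixel."""
    pixels = sorted(similar_pixels(h1_obj, h2llim_obj, target))
    xt, yt = target
    d = [((x - xt) ** 2 + (y - yt) ** 2) ** 0.5 for x, y in pixels]
    dmin, dmax = min(d), max(d)
    span = (dmax - dmin) or 1.0
    D = [1.0 + (dk - dmin) / span for dk in d]   # assumed normalization to [1, 2]
    inv = [1.0 / Dk for Dk in D]
    W = [v / sum(inv) for v in inv]              # weights sum to 1
    change = sum(w * (h2llim_obj[p] - h1_obj[p]) for w, p in zip(W, pixels))
    return h1_obj[target] + change
```

An outlier pixel in either phase fails its 2-sigma test and drops out of the intersection, so it contributes nothing to the filtered estimate.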

Experiments and results
The proposed OBSTFM was tested on real satellite images from two commonly used sites (Emelyanova et al., 2013; Zhu et al., 2016), shown in Figure 3 and Figure 4. In practical applications of spatiotemporal fusion, the performance of the fusion algorithm is significantly affected by registration errors, differences in sensor characteristics and other factors (Shen et al., 2013; Tang et al., 2020). To further explore the performance of the proposed algorithm in fusing real MODIS and Landsat data, another experiment with images experiencing fast phenology change was added. From the dataset of northern New South Wales provided by Emelyanova et al. (2013), two Landsat images acquired on 16 April 2004 and 5 July 2004 and their corresponding MODIS images are selected for this experiment. The MODIS images have been reprojected and resampled to the same resolution as the Landsat images, and the Landsat and MODIS images are geo-referenced by selecting corresponding control points. All images are resized to cover the same area of 800 × 800 pixels, and Figure 5 shows the false color composites of these images.
Before the fusion experiments, the multi-resolution segmentation algorithm (MRSA) of eCognition software is applied to segment the Landsat images of the first phase, Figure 6 shows the segmentation results of three sites.
In these three experiments, STARFM and FSDAF are selected as comparison methods to test the performance of OBSTFM. To quantitatively evaluate the accuracy of the three fusion methods, six indices used in Sun et al. are adopted: the root mean square error (RMSE), structural similarity (SSIM) (Z. Wang et al., 2004), correlation coefficient (r), universal image quality index (UIQI) (Z. Wang & Bovik, 2002), erreur relative globale adimensionnelle de synthèse (ERGAS) (Du et al., 2007) and spectral angle mapping (SAM) (Dennison et al., 2004). Predictions are more accurate when the values of RMSE, SAM and ERGAS are smaller and the values of UIQI, SSIM and r are higher; a smaller SAM indicates better spectral fidelity, and a higher SSIM means a more similar spatial structure between the predicted image and the true image.

Experiment 1

There was no distinct land cover type change in this heterogeneous site, but rapid phenological changes occurred between the two dates. Because the phenological changes of the farmlands involve no significant shape change, it can be considered that the segmentation results of the two phases have no significant difference; theoretically, OBSTFM is therefore well suited to the fusion of this site. The performance of OBSTFM is compared with STARFM and FSDAF. The original Landsat data and the fused results of the three algorithms are shown in Figure 7, and a partial area is enlarged in Figure 8 to compare the three methods more clearly; the areas where the proposed and comparison methods differ are marked with yellow circles. As can be seen from Figure 7, the fusion results of FSDAF and OBSTFM show better spectral and spatial similarity to the original Landsat data than that of STARFM.
From the more detailed information in Figure 8, the ground objects in the prediction from STARFM appear fragmented, lacking consistency of spectral and spatial texture. FSDAF performs better than STARFM in object homogeneity. However, OBSTFM shows the best performance, with the closest spectral similarity and spatial texture within objects to the original data.
To quantitatively evaluate the performance of the three fusion methods, the six indices RMSE, SSIM, r, UIQI, ERGAS and SAM are applied to assess different aspects of the results. The computed values are reported in Table 1, in which the optimal results are shown in bold. From Table 1 we can see that, among the three methods, FSDAF provides better prediction accuracy than STARFM for all six bands, while OBSTFM shows more powerful prediction ability than the other two algorithms, achieving the best performance in all six indices with the smallest RMSE, ERGAS and SAM and the highest SSIM, UIQI and r. The smallest RMSE and ERGAS represent the best prediction accuracy, and the smallest SAM indicates better spectral fidelity than the other two methods. Furthermore, the higher SSIM, UIQI and r of the results fused by OBSTFM demonstrate its superior ability in capturing spatial information and predicting pixels, and the quantitative assessment is consistent with the visual evaluation.
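For reference, the two simplest of the six indices, RMSE and the correlation coefficient r, can be computed per band as plain functions over flattened band values (a minimal sketch; the remaining indices follow their cited definitions):

```python
# Per-band accuracy indices: root mean square error and Pearson correlation
# between a predicted band and the reference (true) band, both flattened.

def rmse(pred, ref):
    n = len(pred)
    return (sum((p - a) ** 2 for p, a in zip(pred, ref)) / n) ** 0.5

def corr(pred, ref):
    n = len(pred)
    mp = sum(pred) / n
    mr = sum(ref) / n
    cov = sum((p - mp) * (a - mr) for p, a in zip(pred, ref))
    sp = sum((p - mp) ** 2 for p in pred) ** 0.5
    sr = sum((a - mr) ** 2 for a in ref) ** 0.5
    return cov / (sp * sr)
```

A constant offset between prediction and reference raises RMSE but leaves r at 1, which is why both indices (together with the structural and spectral measures) are reported.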

Experiment 2
In this site, most of the land cover changed from croplands to water because of a large flood. The purpose of Experiment 2 is to show the performance of the proposed algorithm in retrieving land cover changes with shape changes, and to test the robustness of the strategy of incorporating segmentation, the linear injection model and spatial filtering in retrieving large-scale abrupt land cover changes. Figure 9 presents the original Landsat image of 12 December 2004 and the corresponding predictions fused by the three methods. It is apparent that the image predicted by OBSTFM retains the most spatial detail and performs satisfactorily in retrieving land cover changes. To show more detail, the results for the sub-area (marked in yellow) are shown in Figure 10, in which the areas where the proposed and comparison methods differ are marked with yellow circles. The result predicted by OBSTFM appears more similar to the original Landsat image than those predicted by STARFM and FSDAF, which tend to blur some objects, producing less clear boundaries. In contrast, OBSTFM effectively predicts continuous spatial and spectral information in objects and captures land cover reflectance changes in pixels better than the other two methods. The quantitative indices calculated from the predictions and the original Landsat image, shown in Table 2, demonstrate that all three fusion methods have a certain ability to capture the change information between the time phases. For all six bands, OBSTFM provided the best prediction accuracy, with the smallest RMSE, ERGAS and SAM and the highest SSIM, UIQI and r, indicating that OBSTFM captures the spectral and spatial information between the base date and prediction date better than STARFM and FSDAF. FSDAF has lower accuracy than OBSTFM but performs much better than STARFM, with a smaller RMSE, ERGAS and SAM and higher SSIM and r for all six bands.

Experiment 3
The prediction results generated by the three spatiotemporal methods in Experiment 3 are shown in Figure 11. The sub-area marked in Figure 11 is shown in Figure 12 to provide more detail. Due to the strong temporal changes between the base date and the prediction date, the prediction fused by STARFM shows an obvious spectral difference in comparison to the real image. FSDAF reproduced a more accurate prediction than STARFM, with more similar spectral information. However, the phenology change means that the assumption that the temporal increments of each class are the same among all pixels is not strictly correct, which introduces significant errors in the unmixing step of FSDAF and degrades its fused results. The quantitative assessment results are listed in Table 3. For this site, the fused results produced by OBSTFM show a smaller SAM and higher r than those of STARFM and FSDAF, which means that OBSTFM preserves more spectral information and is more consistent with the reference image. As for RMSE, SSIM and UIQI, the proposed OBSTFM shows better performance than STARFM and FSDAF in most bands. Moreover, Figure 13 shows scatter plots of the predicted values for the NIR band against the actual values for the three methods, which also demonstrate that OBSTFM produced predicted values closer to the actual values than the other two methods and showed the best consistency with the actual image. Generally, compared to STARFM and FSDAF, the proposed OBSTFM shows advantages in preserving spatial information and producing predictions temporally close to the information at the prediction date.

Conclusion
This paper presents a new method named OBSTFM, comprising multi-resolution segmentation, a linear injection model and spatial filtering. Multi-resolution segmentation acquires segmented objects with homogeneous and similar information, the linear injection model fuses the input data in each object, and spatial filtering is then implemented in each object to capture the change information between time phases more accurately and produce a more robust prediction. The proposed algorithm was tested using two datasets with simulated MODIS images experiencing phenology changes and land cover type changes in heterogeneous landscapes, and one dataset with real MODIS images. Minimal input data are required to implement this method, and only the parameters of the multi-resolution segmentation need to be set, without other parameters that might influence the sensitivity of the subsequent steps. In comparison with STARFM and FSDAF, OBSTFM achieves better performance in predicting images with complex temporal changes, especially non-shape changes. Within this framework, it is worth mentioning that better alternative processing strategies could be developed to further improve the performance of the algorithm.