A robust and adaptive spatial-spectral fusion model for PlanetScope and Sentinel-2 imagery

ABSTRACT Satellite sensors usually compromise between their spatial and spectral resolutions due to the limitations of data volume and signal-to-noise ratio, such as the Very High spatial Resolution (VHR) PlanetScope (PS, four or five 3-m bands) and medium spatial resolution (med-resolution) Sentinel-2 (S2, ten 10-m or 20-m bands) constellations. Concomitant with the growing demand for satellite images with ample spatial details and spectral signatures, Spatial-Spectral image Fusion (SSF) is important for blending the spatial resolution of PS and the spectral resolution of S2 to produce synthetic 3-m image products in the ten bands for different applications. However, the existing studies conducted for fusing PS and S2 data present limited spatial and spectral fidelities to the original PS and S2 bands. Hence, this study presents a new SSF method named Robust and Adaptive Spatial-Spectral image Fusion Model (RASSFM) for that purpose. RASSFM improves the spatial and spectral fidelities of the results through: (1) combining spectral mapping and spectral correlation to obtain high-quality spatial information sources for the S2 bands, and (2) utilizing neighborhood information under both spatial and spectral constraints to improve the spectral fidelities of the fused bands. We conducted the SSF tests at the degraded and original spatial resolutions to comprehensively evaluate the proposed RASSFM method. Furthermore, we examined the performance of RASSFM in four study sites with highly mixed regular or desert urban, heterogeneous agricultural, and vegetation-dominated landscapes. Moreover, we compared our method with twenty representative SSF methods developed for similar purposes. The results demonstrate that RASSFM not only outperforms the other methods but also delivers robust performance in different landscapes and comparable accuracies in different bands, thereby advancing the fusion of VHR and med-resolution images from PS and S2, respectively.


Introduction
Along with the development of Earth Observation (EO) systems, a large number of satellites have been launched with multiple spatial, temporal, and spectral resolutions, including medium spatial resolution (med-resolution) satellites such as Landsat 8/9 and Sentinel-2, and Very High spatial Resolution (VHR, ≤5 m) satellites such as WorldView and RapidEye. In recent years, the PlanetScope (PS) satellite constellation has become a potential game changer in the EO field, as it can collect daily imagery of the entire Earth at about 3-m spatial resolution in four (Blue (B), Green (G), Red (R), and Near-InfraRed (NIR)) or five (B, G, R, NIR, and Red Edge (RE)) spectral bands (Planet 2020b). Such a super EO ability is achieved by observations from the 180+ CubeSats in the PS constellation (Planet 2020a), which operate in the International Space Station (ISS) orbit (with a variable equator crossing time) and Sun-Synchronous Orbits (SSO, with an equator crossing time between 9:30 and 11:30 am local solar time). To date, three generations of PS sensors are in operation, accompanying the evolution of sensor technologies: the first-generation PS2, the second-generation PS2.SD, and the third-generation PSB.SD, mounted on the Dove, Dove-R, and SuperDove flocks, respectively (Planet 2020b). Additionally, PS2 and PS2.SD both have four spectral bands (RGB + NIR), and the spectral ranges of the PS2.SD bands are narrower than those of PS2. Compared with PS2 and PS2.SD, the latest PSB.SD has an additional RE band (Planet 2020b). The spectral and spatial resolutions of the PS sensors are listed in Table 1. VHR imagery can provide unprecedented spatial details that benefit many remote sensing applications, such as benthic habitat and seagrass species mapping (Wicaksono and Lazuardi 2018), improved leaf area index estimation for agricultural landscapes (Kimm et al. 2020), and fine-scale bathymetry and water quality retrieval (Niroumand-Jadidi et al. 2020).
Due to the trade-off between the spatial resolution and the spectral resolution of satellites (Shen, Meng, and Zhang 2016), however, PS excels in collecting spatial information but falls short in resolving the spectral profiles of surface materials. Although PS can acquire daily VHR imagery with global coverage, the spectral bands of PS imagery are limited for remote sensing studies that require more feature bands, such as mangrove species mapping (Li, Wong, and Fung 2019a), forest pest detection (Stych et al. 2019), river discharge estimation (Feng et al. 2019), harmful algal bloom delineation (Zhao, Liu, and Wei 2020), and anomaly target detection (Xie et al. 2021a, 2021b). On the other hand, med-resolution satellites such as the Sentinel-2 (S2) constellation can acquire global imagery every five days with ample spectral information but at lower spatial resolutions. The spatial resolutions of the S2A/B MultiSpectral Instrument (MSI) sensor vary: four bands have a 10-m resolution (B, G, R, NIR), six bands have a 20-m resolution (RE1, RE2, RE3, Narrow NIR (NNIR), ShortWave InfraRed 1 (SWIR1), and SWIR2), and three bands have a 60-m resolution. Since this study focuses on land surfaces, the 60-m bands are excluded because they are generally used for atmospheric studies. The spectral and spatial resolutions of the adopted ten 10-m and 20-m S2 bands are listed in Table 1. The six bands that are missing in PS can improve many remote sensing applications due to their added value, such as land-use land-cover mapping (Forkuor et al. 2018; Zhao and Zhu 2022), plant species discrimination (Otunga et al. 2019), and cyanobacterial cell density estimation (Page, Kumar, and Mishra 2018). Remote sensing studies relying on either data source alone would lead to limited accuracy (Feng et al. 2019).
Therefore, combining the merits of both data sources (PS and S2) to provide synthetic data that have ample information content in both spatial and spectral dimensions to broaden their application scopes is of great significance (Zhao and Huang 2017). A feasible and cost-effective solution is adopting Spatial-Spectral image Fusion (SSF) to blend the spatial resolution of PS and the spectral resolution of S2 to generate synthetic VHR (i.e. 3-m) imagery with enhanced spectral resolutions, thereby providing rich spatial and spectral information simultaneously for remote sensing studies. Generally, SSF can be divided into two categories: the classical pansharpening (Zhang 1999) and the relatively new hypersharpening (Selva et al. 2015). Pansharpening aims to fuse the high spatial resolution of panchromatic imagery and the high spectral resolution of multispectral or hyperspectral imagery to generate multispectral or hyperspectral imagery with high spatial resolution (Xie et al. 2019). Representative pansharpening methods include the component-substitution-based (Carper, Lillesand, and Kiefer 1990), multiresolution-analysis-based (Liu 2000), variational-optimization-based (Ballester et al. 2006), and machine-learning-based (Masi et al. 2016) methods. Hypersharpening aims to fuse the high spatial resolution of multispectral imagery (e.g. WorldView) and the high spectral resolution of hyperspectral imagery (e.g. Hyperion, AVIRIS) to generate high-spatial-resolution hyperspectral imagery. To date, several approaches have been developed for hypersharpening or similar purposes, such as the component-substitution-based (Winter et al. 2007), unmixing-based (Zurita-Milla, Clevers, and Schaepman 2008), sparse-representation-based (Song et al. 2014; Wei et al. 2015), and integrated (Vivone and Chanussot 2020) methods.
However, SSF algorithms designed for specific satellite sensors may perform less satisfactorily when applied to another sensor with different spatial and spectral characteristics or scales (Huang et al. 2013). Therefore, increasing attention has been paid to the fusion between 3-m PS imagery and other satellite data in recent years due to the growing popularity of PS imagery. For instance, Gašparović et al. (2018) adopted pansharpening methods to improve the spatial resolution of the eight 10-m/20-m S2 bands (excluding the two SWIR bands) using the spatial details from the four PS bands. Specifically, the PS band with the highest correlation to an S2 band is directly used as the panchromatic image to sharpen the corresponding S2 band. Houborg and McCabe (2018) proposed a multi-scale cross-sensor radiometric calibration method to transform the spectral characteristics of PS to Landsat 8 to produce temporally consistent surface reflectance images in their equivalent bands (B, G, R, NIR); this method did not consider the spectral bands that overshoot the spectra of PS, i.e. the SWIR bands of Landsat 8. Furthermore, Li et al. (2019b) devised a hybrid fusion scheme that combines a spatial unmixing process and an existing image super-resolution method to downscale the 10-m/20-m S2 bands using the spatial frequencies from the four PS bands. The results demonstrate limited image sharpening effects (see Figures 5 and 6 in Li et al. (2019b)), as the fused results retain the spatial characteristics of the 10-m/20-m S2 bands. In operational use at continental or global scales, the combination of both spatial unmixing and super-resolution techniques may lead to massive computing burdens. Li et al. (2020b) evaluated the performance of two effective pansharpening approaches (high pass modulation and Model 3) in improving S2 10-m/20-m imagery to the PS 3-m resolution.
The straightforward algebraic calculation in the pansharpening methods works well for homogeneous landscapes, but it may have limitations for highly mixed urban areas with complex spectral signatures in VHR satellite imagery (Alparone et al. 2008). In summary, the spectral fidelities of the fused results in different landscapes have not been deeply investigated, particularly for the SWIR bands that overshoot the PS spectra. Moreover, the spatial fidelities of the results to original 3-m PS imagery still have room for improvement. Hence, further efforts on the SSF method development for fusing the 10-m/20-m S2 and 3-m PS imagery are necessary.
In light of the limitations described above, a new SSF method called Robust and Adaptive Spatial-Spectral image Fusion Model (RASSFM) is proposed and validated in this study, which fuses observations from PS sensors with VHR but low spectral resolution and the S2 MSI with med-resolution but high spectral resolution to produce synthetic image products with VHR and high spectral resolution. RASSFM differs in two aspects from previous hypersharpening studies. First, the pre-fusion step adopts a combination of spectral mapping and spectral correlation to generate ten simulated 3-m bands to act as the spatial information sources for the posterior fusion steps, rather than using only one of the two components (Vivone and Chanussot 2020). Second, spatial neighborhood information considering both spatial and spectral constraints is introduced into the fusion model to improve the spectral fidelities of the results. We evaluated RASSFM qualitatively and quantitatively via different fusion schemes in different landscapes and compared it with twenty representative fusion methods developed for similar purposes. In the remainder of this paper, we describe the study areas and datasets in Section 2, describe the methodology and evaluation schemes in Section 3, present the results in Section 4, and discuss and conclude the study in Sections 5 and 6, respectively.

Study areas
We examined the fusion methods in four study sites with different land covers and climate types to demonstrate their effectiveness (see Figure 1). The first site is in Columbus, Ohio, United States (centered at 40.0056°N, 83.0292°W). This site is an urban area with various land cover types that are highly fragmented, such as built-up areas (e.g. buildings/playgrounds/roads/parking lots), vegetation (e.g. woodlands/lawns), waters (e.g. rivers/lakes), wetlands, and bare lands. The rich spectral characteristics and the complex spatial patterns of the surface materials pose greater challenges for resolving the fine-resolution spectral profiles. The second site is in Las Vegas, Nevada, United States (centered at 36.2179°N, 115.1864°W), another challenging desert urban area. This site is mainly covered by buildings, airport features, bare lands, roads, and a small proportion of vegetation. It is included to examine the fusion performance in desert cities, which differ from regular cities like Site I. The third study site is dominated by agricultural land covers in the Coleambally Irrigation Area in southern New South Wales, Australia (centered at 34.7849°S, 145.9903°E). This site is an irrigated field of various summer crops, such as rice, maize, sorghum, and soybean (Emelyanova et al. 2013), presenting a heterogeneous landscape. In this study, the Coleambally site mainly contains irrigation areas, dry cropland, trees, and other land cover types. The spatial heterogeneity of this site leads to high spectral variances, which can be used to examine the robustness of the fusion methods. The fourth site in northeastern Ohio, United States (centered at 41.6002°N, 81.3709°W) is mainly covered by vegetation (e.g. forest and lawns) and small parcels of built-up areas (e.g. roads and houses). This site is included to enrich the land cover conditions of our experiments.
Moreover, vegetation is one of the most important variables in remote sensing applications (Tian et al. 2021), making this vegetation-dominated site particularly relevant.
Furthermore, each of the four study sites covers an area of 3.6 × 3.6 km², which corresponds to an image size of 1200 × 1200 pixels for the 3-m PS images, 360 × 360 pixels for the 10-m S2 bands, and 180 × 180 pixels for the 20-m S2 bands, respectively. The spatial coverage of the study sites can clearly demonstrate the fusion effects at a local scale, as achieving good fusion results locally is the basis for obtaining good fusion at regional or global scales (Zhao, Huang, and Song 2018).

PlanetScope imagery
The PS image of Site I was acquired by PSB.SD on 1 October 2021; the PS images of Sites II and IV were observed by PS2.SD on 8 August 2020 and 1 July 2019, respectively; the PS image covering Site III was collected by PS2 on 10 August 2020. Additionally, all the PS images were acquired by PS sensors in the SSO orbit. We adopted the surface reflectance product (Level 3B) (Planet 2017) from the Planet Explorer (https://www.planet.com/explorer/) as the high spatial but low spectral resolution data. The PS illustrations are displayed in true- and false-color (NIR-R-G as RGB composites) to present the spectral characteristics.

Sentinel-2 imagery
The S2 images of Sites I, III, and IV were acquired by S2A on 1 October 2021, 10 August 2020, and 1 July 2019, respectively; the S2 image of Site II was acquired by S2B on 8 August 2020. We used the S2 surface reflectance product (Level 2A) as the low spatial but high spectral resolution data, which were downloaded via Google Earth Engine (GEE; https://code.earthengine.google.com/). The S2 bands were up-sampled to 3-m using bicubic interpolation to match the grids of PS. Moreover, all clipped PS and S2 images with the same geographic coverage were projected onto the Universal Transverse Mercator (UTM) coordinate system under the WGS-84 datum. Additionally, the S2 images were geometrically matched with the PS images by applying optimal offsets (Gevaert and García-Haro 2015) within the same map coordinates. The S2 illustrations are displayed with different color models to demonstrate the spectral information explicitly.

Methodology and Evaluation Settings
For the convenience and clarity of describing the image fusion method, the symbols of the main variables used in this study are defined in Table 2. Note that although PS has 3-m observations in the four bands (RGB + NIR), we still downscale the corresponding S2 10-m bands to 3-m because the radiometric characteristics and cross-satellite consistencies of the PS CubeSats are not as good as those of rigorously calibrated scientific satellites such as Landsat 8 or Sentinel-2 (Houborg and McCabe 2018).

Spectral mapping and correlation between PlanetScope and Sentinel-2
We performed a combination of spectral mapping and spectral correlation between PS and S2 to match their spectra for the posterior band-to-band fusion process. That is, we generate a simulated 3-m image with ten bands corresponding to the ten S2 bands, with spectral characteristics as similar as possible to the S2 bands. In hypersharpening, spectral transformation and spectral correlation have been used for similar purposes (Vivone and Chanussot 2020). However, our tests showed that either approach alone generates 3-m bands with different levels of spectral similarity to the S2 bands in different study sites, particularly for the six 20-m bands. Hence, we combined the 3-m bands generated by the two approaches based on their spectral similarities to the S2 bands. Note that the spectral characteristics of the obtained 3-m simulated image would be similar to the original S2 bands as a whole, but they are not the same. However, the simulated image has an actual 3-m spatial resolution in each band, so it acts as the spatial information source rather than the spectral information source for the posterior fusion steps. The spectral information source of the final fusion result is still the original S2 bands.

Spectral mapping
The datasets used for the spectral mapping should have the same spatial coverage and small temporal gaps (Winter et al. 2007), which is easy to satisfy in this study due to the daily and global coverage of PS imagery. For simplicity and robustness, a linear spectral transformation between PS and S2 imagery is conducted to transform the spectral bands of PS imagery from four to ten to match the spectral bands of S2 imagery:

$$P_{(H,\Lambda)} = M P_{(H,\lambda)} + \varepsilon \quad (1)$$

where $\varepsilon$ is the error term and $M$ is the spectral mapping matrix, which can be estimated by a least-squares regression between the S2 bands and the PS bands degraded to the S2 resolution:

$$M = \arg\min_{M} \left\| S_{(H',\Lambda)} - M P_{(H',\lambda)} \right\|_{F}^{2} \quad (2)$$

The simulated $P_{(H,\Lambda)}$ has the same rank as $P_{(H,\lambda)}$ because it is generated via a global linear transformation between PS and S2 images (Winter et al. 2007). Therefore, $P_{(H,\Lambda)}$ has the same number of statistically separable spectral components as $P_{(H,\lambda)}$ (Winter et al. 2007).
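The least-squares spectral mapping can be sketched in a few lines of numpy. This is a minimal illustration under assumed array shapes (bands × pixels); the function names (`estimate_mapping`, `apply_mapping`) are ours and not from the authors' implementation.

```python
import numpy as np

def estimate_mapping(ps_coarse, s2_coarse):
    """Least-squares estimate of the 10x4 spectral mapping matrix M.

    ps_coarse: (4, N) PS reflectances degraded/resampled to the S2 grid.
    s2_coarse: (10, N) S2 reflectances on the same grid.
    """
    # Solve min_M ||s2 - M @ ps||_F^2, one row of M per S2 band.
    sol, *_ = np.linalg.lstsq(ps_coarse.T, s2_coarse.T, rcond=None)
    return sol.T                       # shape (10, 4)

def apply_mapping(M, ps_fine):
    """Simulate ten 3-m bands: each S2-like band is a linear mix of PS bands."""
    return M @ ps_fine                 # (10, P) from (4, P)
```

The mapping is estimated at the coarse scale, where both sensors observe the same surface, and then applied to the full-resolution PS bands.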

Spectral correlation
In similar studies, $P_{(H,\Lambda)}$ was directly used as the spatial information source for fusion (Li et al. 2020b; Winter et al. 2007). However, we found that the linear transformation above was less satisfying for the RE1, RE2, RE3, and NNIR bands in some study sites. Therefore, we added another option for the four bands based on the spectral correlation (Selva et al. 2015) between PS and S2 bands:

$$l^{*} = \arg\max_{l} \, \mathrm{corr}\!\left(P_{(H,l)}, S_{(H,k)}\right) \quad (3)$$

$$\tilde{P}_{(H,\Lambda),k} = P_{(H,l^{*})} \quad (4)$$

[Figure: The proposed RASSFM method for the spatial-spectral fusion of PlanetScope and Sentinel-2 imagery.]
where $l$ and $k$ are the band indices of PS and S2, respectively; $\mathrm{corr}(\cdot,\cdot)$ denotes the Pearson correlation coefficient; and $\tilde{P}_{(H,\Lambda),k}$ is the resampled PS band that has the largest correlation coefficient with the $k$th S2 band.
Then, we combined $P_{(H,\Lambda)}$ and $\tilde{P}_{(H,\Lambda)}$ based on their spectral similarities to the original S2 bands to obtain $P_{(H,\Lambda')}$ for the following fusion process:

$$P_{(H,\Lambda')} = \mathrm{BandCombination}\!\left(P_{(H,\Lambda)}, \tilde{P}_{(H,\Lambda)}\right) \quad (5)$$

where the BandCombination operator assigns, band by band, whichever of $P_{(H,\Lambda)}$ or $\tilde{P}_{(H,\Lambda)}$ has the larger correlation coefficient with the corresponding S2 band to its counterpart in $P_{(H,\Lambda')}$.
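The correlation-based band selection and the BandCombination step can be sketched as follows. This is an illustrative reading of the procedure, assuming (bands × pixels) arrays; the helper names (`pearson`, `best_ps_band`, `combine_bands`) are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two flattened bands."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_ps_band(ps, s2_band):
    """Pick the PS band with the largest correlation to a given S2 band."""
    cors = [pearson(ps[l], s2_band) for l in range(ps.shape[0])]
    return ps[int(np.argmax(cors))]

def combine_bands(mapped_sim, corr_sim, s2_ref):
    """BandCombination: per band, keep whichever candidate (spectral
    mapping vs. spectral correlation) correlates better with S2."""
    out = np.empty_like(mapped_sim)
    for k in range(s2_ref.shape[0]):
        if pearson(mapped_sim[k], s2_ref[k]) >= pearson(corr_sim[k], s2_ref[k]):
            out[k] = mapped_sim[k]
        else:
            out[k] = corr_sim[k]
    return out
```

In practice, correlations would be computed against the resampled S2 bands on a common grid.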

Spatial-spectral fusion model
Theoretically, there is a constant systematic bias between homogeneous pixels observed by multiple satellites on the same date due to the different observation conditions, e.g. satellite orbit parameters, bandwidth, and imaging time (Gao et al. 2006; Huang, Zhang, and Yu 2012). In this study, the simulated $P_{(H,\Lambda')}$ and the target $SP_{(H,\Lambda)}$ can be regarded as the observations of two virtual satellite sensors with a 3-m spatial resolution. Correspondingly, $P_{(H',\Lambda')}$ and $S_{(H',\Lambda)}$ are the spatially degraded counterparts of $P_{(H,\Lambda')}$ and $SP_{(H,\Lambda)}$ from 3-m to 10-m/20-m resolutions, modeling the spatial degradation of the S2 imaging process. Hence, based on the above assumption, there is a systematic bias between the two data sources across the two scales (3-m and 10-m/20-m):

$$S_{(H',\Lambda)} = P_{(H',\Lambda')} + \varepsilon \quad (6)$$

$$SP_{(H,\Lambda)} = P_{(H,\Lambda')} + \varepsilon \quad (7)$$

where $\varepsilon$ is the systematic bias. Therefore, we can derive the following spatial-spectral fusion model based on Eq. (6) and (7):

$$SP_{(H,\Lambda)} = P_{(H,\Lambda')} + \left(S_{(H',\Lambda)} - P_{(H',\Lambda')}\right) \quad (8)$$

This process is conducted in a band-by-band form for the ten bands of $S_{(H',\Lambda)}$. However, the above model that processes the image band as a whole may lead to significant biases for heterogeneous or mixed pixels (Gao et al. 2006). Hence, we introduce the neighborhood information of a target pixel to mitigate this deficiency via the weighted summation model below:

$$SP_{(H,\Lambda)}(c) = P_{(H,\Lambda')}(c) + \sum_{(i,j)} W_{(i,j)} \left(S_{(H',\Lambda)}(i,j) - P_{(H',\Lambda')}(i,j)\right) \quad (9)$$

where $c$ is the central pixel coordinate of a local moving window, and $i$ and $j$ represent the pixel coordinates within the moving window. $W_{(i,j)}$ is the weight of each pixel within the moving window. Note that the moving window size can be adjusted based on the spatial resolution gap between the input PS and S2 images. To ensure high spectral fidelity of the fusion results, we only adopt the neighboring pixels that are similar to the central pixel in the fusion model in Eq. (9), which involves two important steps, i.e. similar neighbor searching and neighbors' weight calculation.
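The bias-correction logic of the two fusion models can be sketched as below. The functions are illustrative under stated assumptions (all arrays resampled to a common 3-m grid; names are ours), not the authors' code.

```python
import numpy as np

def simple_fusion(p_sim_fine, p_sim_degraded_up, s2_up):
    """Eq. (8)-style global model: add the coarse-scale residual between
    S2 and the degraded simulated image back to the 3-m simulated image.
    All inputs are (bands, H, W) arrays already resampled to the 3-m grid."""
    return p_sim_fine + (s2_up - p_sim_degraded_up)

def weighted_fusion_pixel(p_center, residuals, weights):
    """Eq. (9)-style local model for one target pixel: the residual term
    is a weighted sum over the similar neighbors in the moving window."""
    return p_center + float(np.sum(weights * residuals))
```

The global model assumes one scale-invariant bias per band; the local model relaxes this for heterogeneous pixels by averaging residuals over similar neighbors.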

Similar neighbor searching and weight determination
For each pixel in $P_{(H,\Lambda')}$, which is expanded as a central image patch of the moving window, we use an image-patch-based non-local searching scheme (Zhao, Huang, and Song 2018) to obtain its similar neighbors within the moving window. The similarities between the neighboring patches (including the central patch itself) and the central patch are measured by the following normalized exponential function:

$$R_{n} = \frac{\exp\!\left(-\left\| B_{c} - B_{n} \right\|^{2} / h\right)}{\sum_{m} \exp\!\left(-\left\| B_{c} - B_{m} \right\|^{2} / h\right)} \quad (10)$$

where $B_{c}$ and $B_{n}$ are the central patch and its neighboring patches in $P_{(H,\Lambda')}$, respectively, and $R_{n}$ represents their similarity measurement.
Here, $h$ is the degree of decay. In this study, the image patch size is set as 3 × 3 (9 × 9 m²) to balance computing efficiency and searching accuracy. Additionally, we adopt the first N most similar neighbors based on the similarity measurements. Note that the number of similar neighbors N can be adjusted based on the spatial resolution gap between the PS and S2 images.
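The patch-based similarity search can be sketched as follows; a minimal illustration assuming a single-band window and our own function names, not the paper's implementation.

```python
import numpy as np

def patch_similarities(window, center, patch=3, h=0.1):
    """Normalized-exponential similarity (Eq. (10)-style) of every full
    patch inside `window` to the patch centered at `center`."""
    r = patch // 2
    ci, cj = center
    central = window[ci - r:ci + r + 1, cj - r:cj + r + 1]
    raw = {}
    for i in range(r, window.shape[0] - r):
        for j in range(r, window.shape[1] - r):
            p = window[i - r:i + r + 1, j - r:j + r + 1]
            d2 = float(((p - central) ** 2).sum())
            raw[(i, j)] = np.exp(-d2 / h)      # h is the decay degree
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}  # normalize to sum to one

def top_n_neighbors(sims, n):
    """Keep the N most similar neighbors (the central patch ranks first)."""
    return sorted(sims, key=sims.get, reverse=True)[:n]
```

By construction the central patch has zero distance to itself and therefore receives the largest similarity.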
After obtaining the similar neighbors, the weight of each neighbor is determined based on four factors: (1) similar neighbors with higher homogeneity (Q) should have larger weights; (2) similar neighbors with smaller cross-sensor spectral differences (F) should have larger weights; (3) similar neighbors with smaller spatial distances to the central patch (D) should have larger weights; and (4) similar neighbors with smaller spectral differences to the central patch (SD) should have larger weights. These four weight factors are combined into $X$; among them, the spatial distance factor is computed as

$$D_{(i,j)} = \frac{\sqrt{(i - i_{c})^{2} + (j - j_{c})^{2}}}{\omega} + \rho$$

where $\rho$ is a small constant that prevents D and SD from becoming zero at the central image patch, and $\omega$ is the radius of the moving window. Then, the final weights that consider the four weight factors are calculated by minimizing the locally linear reconstruction error of the following cost function:

$$W_{n} = \arg\min_{W} \left\| X_{c} - \sum_{n} W_{n} X_{n} \right\|^{2}, \quad \text{s.t.} \ \sum_{n} W_{n} = 1 \quad (16)$$

where $X_{c}$ and $X_{n}$ denote the combined weight factors for the central patch and its similar neighbors, respectively, and $W_{n}$ is the vector form of $W_{(i,j)}$ in Eq. (9) that contains the optimal weights. The diagrams of the similar neighbor searching and weight calculation steps are shown in Figure 3.
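The constrained minimization of the locally linear reconstruction error has the closed-form solution familiar from locally linear embedding. A sketch, with our own naming and a small regularizer added for numerical stability (an assumption, not stated in the paper):

```python
import numpy as np

def lle_weights(x_c, X_n, reg=1e-6):
    """Weights minimizing ||x_c - sum_n w_n x_n||^2 s.t. sum_n w_n = 1.

    x_c: (d,) combined weight-factor vector of the central patch.
    X_n: (N, d) factor vectors of the N similar neighbors."""
    n = len(X_n)
    Z = X_n - x_c                       # shift so the center is the origin
    C = Z @ Z.T + reg * np.eye(n)       # regularized local Gram matrix
    w = np.linalg.solve(C, np.ones(n))  # Lagrange-multiplier solution
    return w / w.sum()                  # enforce the sum-to-one constraint
```

A neighbor whose factor vector matches the central patch exactly receives nearly all of the weight, which is the intended behavior of the reconstruction.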
Based on the similar neighbors and their weights obtained from Eq. (10) and (16), $SP_{(H,\Lambda)}$ can be constructed by the weighting function in Eq. (9) in a pixel-wise manner, which assigns the weights of $W_{n}$ and zero to the similar and non-similar neighbors, respectively. Note that the central pixel value of each reconstructed patch is assigned to the target pixel (the center pixel of the moving window) to ensure image sharpness.
Furthermore, remote sensing images usually contain noise, outliers, or abnormal observations due to various interferences, such as cloud cover, atmospheric effects, and image calibration errors (Zhao, Huang, and Song 2018), which would negatively affect the similar neighbor searching and weight calculation steps. For instance, there may be extreme cases in which no similar neighbors can be found within a predefined neighborhood. In that case, the target pixel is assigned a weight of 1. Additionally, if the reconstructed values go beyond the normal range of surface reflectance (0-1), the pixel values generated by Eq. (8) are employed as alternatives. In short, image noise, mis-identified clouds, and uncertainties in atmospheric correction can affect the similar neighbor searching and weight calculation. These factors are tricky for remote sensing image fusion because more uncertainty sources are involved, so we used the simple fusion model in Eq. (8) to reduce the error sources in such cases.

Metrics for fusion at degraded spatial resolutions
We adopted a commonly used validation strategy for spatial-spectral fusion with reference data (Shao et al. 2019;Wald, Ranchin, and Mangolini 1997), which uses image aggregation to down-sample the PS and S2 images to lower spatial resolutions for fusion and takes the original S2 images as references. Since there are two spatial resolutions (10-m and 20-m) for the S2 bands, two branches of fusion were conducted accordingly. First, the four PS and S2 bands (RGB + NIR) were downsampled to 10-m and 30-m resolutions, respectively, based on the original spatial resolution gap (~3X). Second, the four PS bands and the six S2 bands (REs, NNIR, SWIRs) were downsampled to 20-m and 120-m resolutions, respectively, based on the original spatial resolution gap (~6X). The image fusion evaluation was conducted qualitatively and quantitatively by comparing fused results with their ground truth counterparts (Pohl, Moellmann, and Fries 2017).
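The spatial degradation used in this validation strategy is a block-average aggregation; a minimal sketch (our naming, single-band input assumed):

```python
import numpy as np

def degrade(band, factor):
    """Block-average a single band by an integer factor to emulate the
    aggregation used in the degraded-resolution validation scheme."""
    h, w = band.shape
    h2, w2 = h - h % factor, w - w % factor          # crop to a multiple
    blocks = band[:h2, :w2].reshape(h2 // factor, factor,
                                    w2 // factor, factor)
    return blocks.mean(axis=(1, 3))
```

For example, `degrade(ps_band, 3)` takes a 10-m proxy from a 3-m-like band (the ~3X gap), and `degrade(ps_band, 6)` emulates the ~6X gap used for the 20-m branch.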
The qualitative evaluation was based on the similarities between the fused 10-m/20-m and reference S2 10-m/20-m bands in terms of their clarity, structural similarity, and spectral fidelity. Moreover, the image sharpening effects of the fused 10-m/20-m images over their 30-m/120-m counterparts were also examined. The quantitative evaluation was based on statistical metrics between fused and reference images from spectral and structural similarity perspectives. The widely used Correlation Coefficient (CC), Root Mean Square Error (RMSE), Average Absolute Difference (AAD), Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS; Wald 2002), and Spectral Angle Mapper (SAM; Yuhas, Goetz, and Boardman 1992) between fused and reference images were used to evaluate their spectral similarities. The Structural SIMilarity (SSIM) index (Wang et al. 2004) and Universal Image Quality Index (UIQI; Wang and Bovik 2002) were adopted to assess their overall image structure and radiometric similarities. Additionally, the averaged AAD maps between the fused and reference bands were used to demonstrate the spatial distribution and magnitude of fusion errors. The scatter plots of fused bands against reference bands were also plotted to provide an intuitive comparison between fused and reference values in describing the approximate extent of their correlation (Zhao, Huang, and Song 2018).
The evaluation criteria for the fusion at degraded spatial resolutions are: (1) the closer CC, SSIM, and UIQI are to one, the more similar the predictions are to the references; (2) the closer RMSE, AAD, ERGAS, and SAM are to zero, the more similar the predictions are to the references; (3) the closer the scatter plots are to the 1:1 line, the more similar the predictions are to the references.
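Several of these metrics reduce to short numpy expressions; the following is a sketch of CC, RMSE, AAD, and SAM under assumed (bands × pixels) inputs, with our own function names (ERGAS, SSIM, and UIQI are omitted for brevity).

```python
import numpy as np

def cc(x, y):
    """Correlation Coefficient between a fused and a reference band."""
    x = x.ravel() - x.mean()
    y = y.ravel() - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def rmse(x, y):
    """Root Mean Square Error."""
    return float(np.sqrt(((x - y) ** 2).mean()))

def aad(x, y):
    """Average Absolute Difference."""
    return float(np.abs(x - y).mean())

def sam(fused, ref, eps=1e-12):
    """Mean Spectral Angle Mapper in radians; inputs are (bands, pixels)."""
    num = (fused * ref).sum(axis=0)
    den = np.linalg.norm(fused, axis=0) * np.linalg.norm(ref, axis=0) + eps
    return float(np.arccos(np.clip(num / den, -1.0, 1.0)).mean())
```

Identical fused and reference images give CC of one and RMSE, AAD, and SAM of (near) zero, matching the evaluation criteria above.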

Metrics for fusion at original spatial resolutions
In addition to the fusion at the degraded resolution level with direct references, the fusion at the full resolution level (3-m PS and 10-m/20-m S2) was performed as well because (1) the practical target is to generate 3-m images in the ten bands and (2) the fusion strategy in Section 3.4.1 may not be applicable to VHR imagery, especially for heterogeneous landscapes (Alparone et al. 2008). We evaluated the spectral and spatial fidelities of results to the original S2 and PS bands, respectively, based on the criteria devised for spatial-spectral fusion without high spatial-spectral resolution references (Alparone et al. 2008;Song et al. 2014).
First, the spectral CC and spatial CC were used to evaluate the spectral and spatial distortions, respectively (Zhou, Civco, and Silander 1998). Specifically, spectral CCs were calculated between downsampled fusion results and original S2 images in band-by-band forms. Spatial CCs were calculated between the high-frequency components (spatial details) of the fused and original PS bands (RGB + NIR). Note that the high-frequency components were extracted by a Laplacian high pass filter. Second, the spectral distortion index (D_λ), the spatial distortion index (D_s), and the joint spectral and spatial quality index, i.e. Quality with No Reference (QNR) (Alparone et al. 2008), were employed to evaluate the spectral or spatial distortions. Note that D_λ was calculated based on all ten bands before and after fusion, and D_s was calculated based on the four PS and fused bands. For more details about the calculation of the three indices, we refer readers to (Alparone et al. 2008).
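The spatial CC computation can be sketched as follows, assuming a 3 × 3 Laplacian kernel with zero-padded borders (the paper does not specify the kernel, so this is an illustrative choice, with our own function names).

```python
import numpy as np

def laplacian_highpass(band):
    """Extract spatial details with a 3x3 Laplacian high-pass filter
    (zero-padded borders)."""
    k = np.array([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])
    p = np.pad(band, 1)
    h, w = band.shape
    out = np.zeros((h, w))
    for di in range(3):                 # explicit convolution by shifting
        for dj in range(3):
            out += k[di, dj] * p[di:di + h, dj:dj + w]
    return out

def spatial_cc(fused_band, ps_band):
    """Correlation between the high-frequency components of two bands."""
    a = laplacian_highpass(fused_band).ravel()
    b = laplacian_highpass(ps_band).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A fused band that preserves the PS spatial details perfectly yields a spatial CC of one.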
The evaluation criteria for the fusion at original spatial resolutions are: (1) the closer the spectral CC, spatial CC, and QNR are to one, the better the fusion results; (2) the closer D_λ and D_s are to zero, the better the fusion results.

Comparison Methods and Parameterization
We compared the proposed RASSFM method with twenty representative SSF algorithms that were developed for similar purposes, including the Color Resolution Improvement Software Package (CRISP; Winter et al. 2007), UBDF, DPLM, HPM, and the sixteen HySharp approaches of Vivone and Chanussot (2020). CRISP, UBDF, and DPLM utilize component substitution, spectral unmixing, and compressed sensing techniques, respectively, for spatial-spectral fusion, and have been widely used as benchmarks for image fusion algorithm development (Huang et al. 2013; Sun et al. 2014; Zhu et al. 2016). HPM has been recommended for the fusion of PS and S2 imagery because it introduces small spatial and spectral distortions in the sharpened S2 bands (Li et al. 2020b). The sixteen HySharp approaches (Vivone and Chanussot 2020) are eight well-developed SSF algorithms, i.e. the Gram-Schmidt (GS) approach, GS adaptive (GSA) method, Gauss High-Pass Modulation (Gauss-HPM), Gauss High-Pass Filtering (Gauss-HPF), Gauss Context-Based Decision (Gauss-CBD), Bilinear filter based HPM (Bil-HPM), Bil-HPF, and Bil-CBD, applied under two high-spatial-resolution band generation schemes, i.e. band synthesizing and band selection (Selva et al. 2015). More details about the sixteen approaches can be found in (Vivone and Chanussot 2020). To save space, we show the best results among the sixteen HySharp approaches (Table S1).
The parameterizations were different for the two types of fusion: (1) fusion at the degraded spatial resolution level with direct references and (2) fusion at the original spatial resolution level with indirect references. For the fusion at the degraded spatial resolutions, the low- and high-pass Butterworth filters in CRISP were both parameterized with a cutoff frequency of 15 and an order of 3 via fine-tuning; the moving window size and the class number in UBDF were set as 25 and 30, respectively; the number of spectral dictionary atoms and the sparseness degree of representation coefficients in DPLM were fine-tuned to 10 and 0.8, respectively; the moving window size and the number of similar neighbors for RASSFM were set as 7 and 10, respectively. For the fusion at the original spatial resolutions, the low- and high-pass Butterworth filters in CRISP were both parameterized with a cutoff frequency of 35 and an order of 3 via fine-tuning; the moving window size and the class number in UBDF were set as 31 and 40, respectively; the number of spectral dictionary atoms and the sparseness degree of the representation coefficients in DPLM were fine-tuned to 12 and 0.8, respectively; the moving window size and the number of similar neighbors for RASSFM were set as 11 and 25, respectively.

Results of Site I
This dataset is used to examine the performance of the SSF methods on highly complex land covers with rich spectral characteristics. The prior low-resolution (30-m/120-m) S2 images are shown to provide a benchmark for demonstrating the downscaling effects of the SSF methods (Figures 4 and S1). All methods improved the spatial resolutions of the 30-m/120-m S2 images significantly, showing comparable sharpness to the reference 10-m/20-m S2 images. However, the spatial distortions of UBDF are substantial compared to the other methods: its results are prone to patch effects and lose local fine-resolution textures, particularly in the 20-m fusion scenario (see Figure 4&l).
As to the visual spectral fidelities of the fusion at the 10-m resolution level (Figure S1), the proposed RASSFM method performs better than the other algorithms, e.g. for the marked stadium in Figures S1e-j. Quantitatively, the AAD maps (Figures S1q-v) clearly show that RASSFM has the lowest fusion biases among all methods. Furthermore, the statistical metrics in Table S2 and the scatter plots in Figure S9 explicitly demonstrate that RASSFM has lower fusion errors and higher spectral and structural similarities to the reference 10-m image.
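The AAD maps used throughout this evaluation can be computed with a few lines of NumPy. The sketch below is our illustrative reading, assuming AAD denotes the per-pixel absolute difference between a fused image and its reference, averaged over the band axis; the toy arrays are hypothetical.

```python
import numpy as np

def aad_map(fused, reference):
    """Per-pixel Average Absolute Difference between a fused image and its
    reference, averaged over the band axis. Arrays are (bands, rows, cols)."""
    fused = np.asarray(fused, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return np.abs(fused - reference).mean(axis=0)

# Toy example: a 2-band, 2x2 "image" with a constant bias of 0.5 in one band.
f = np.array([[[1.5, 1.5], [1.5, 1.5]], [[2.0, 2.0], [2.0, 2.0]]])
r = np.array([[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]])
print(aad_map(f, r))  # every pixel: (0.5 + 0.0) / 2 = 0.25
```

Lower values in such a map indicate smaller fusion biases at that location.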
For visual evaluations of the fusion at the 20-m resolution level (see Figure 4), UBDF and DPLM have larger spectral disparities from the references than the other methods. For instance, the vegetation in the UBDF and DPLM results (Figure 4&m) tends to be brighter and bluer, respectively, than in the reference images. The DPLM results look more natural than those of UBDF due to its superiority in preserving the spatial information of the prior PS image. The spectral fidelity of RASSFM is better than that of CRISP in terms of image saturation relative to the reference images, especially for the RE bands (Figure 4&j). Moreover, RASSFM performs better than HPM for the bare land in the northwest of the study area (Figure 4&p). The HySharp results look darker than the reference images, e.g. over the vegetation and bare lands (Figure 4&o). Quantitatively, the 20-m average AAD maps explicitly demonstrate that RASSFM has lower fusion errors than the other methods. Furthermore, the statistical metrics in Table 3 and the scatter plots in Figure S10 indicate that RASSFM has the highest spectral and image structure similarities to, and the lowest fusion biases from, the reference 20-m image.

Results of Site II
Site II is adopted to examine the fusion performance of the involved methods in desert urban areas. For spatial information enhancement, the input 30-m/120-m S2 images were obviously sharpened to 10-m/20-m (Figures 5 and S2). Again, the spatial fidelity of UBDF is worse than that of the other methods. For spectral fidelity, the results indicate that DPLM performs less satisfactorily than the others, particularly for the SWIR bands. This is probably caused by the limited spectral constraints when resolving the representation coefficients of the PS spectral atoms for each PS pixel: because PS has only four visible and NIR bands, there are no constraints for the SWIR bands, which may introduce biases when the coefficients are applied to the corresponding S2 spectral atoms. CRISP, HPM, and HySharp have comparable visual performance for the 10-m fusion results, and RASSFM demonstrates the highest spectral similarity to the reference image (see Figures S2q-v). Moreover, the 20-m fusion results clearly show the better spectral fidelity of RASSFM relative to the other methods (Figure 5). Quantitatively, the evaluation metrics in Tables 4 and S3, the scatter plots in Figures S11-S12, and the AAD maps in Figures 5 and S2 explicitly demonstrate that RASSFM has the lowest fusion errors among all methods.
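The coefficient-ambiguity argument above can be made concrete with a toy linear-algebra example. This is a simplified analogue, not the actual DPLM algorithm: the dictionaries are random synthetic atoms, and an ordinary least-squares fit stands in for dictionary learning and sparse coding. The point it illustrates is that a fit constrained only by the four PS bands leaves the coefficients underdetermined, and the ambiguity propagates into the extrapolated SWIR bands.

```python
import numpy as np

# Synthetic paired spectral dictionaries (illustrative, not real atoms).
rng = np.random.default_rng(0)
n_atoms = 10
D_ps = rng.random((4, n_atoms))   # PS spectral atoms (4 visible/NIR bands)
D_s2 = rng.random((10, n_atoms))  # paired S2 atoms (10 bands, incl. SWIR)

ps_pixel = D_ps @ rng.random(n_atoms)  # an observed 4-band PS spectrum
coef, *_ = np.linalg.lstsq(D_ps, ps_pixel, rcond=None)  # fit on PS bands only
null = np.linalg.svd(D_ps)[2][4:]      # basis of the null space of D_ps
coef2 = coef + null[0]                 # different coefficients, same PS fit

print(np.allclose(D_ps @ coef2, ps_pixel))     # True: PS bands unchanged
print(np.allclose(D_s2 @ coef2, D_s2 @ coef))  # False: extrapolated bands shift
```

Because both coefficient vectors reproduce the four PS bands exactly yet yield different ten-band predictions, any regularizer that picks among them (such as a sparseness prior) directly determines, and can bias, the SWIR output.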

Results of Site III
This dataset is used to investigate the fusion effects of the SSF methods in heterogeneous agricultural landscapes. Qualitatively, the image sharpening effects of the SSF methods are obvious in both the 10-m and 20-m fusion scenarios (see Figures 6 and S3). For instance, the blurry boundaries and textures of the croplands in the prior 30-m/120-m S2 images become much clearer in the fused 10-m/20-m images. Nevertheless, the spatial fidelity of UBDF is lower than that of the other methods. In addition to the sharpness enhancement, the spectral fidelities of the 10-m/20-m results fused by RASSFM are slightly better than those of CRISP, HPM, and HySharp by visual evaluation, and these four methods are significantly better than UBDF and DPLM, e.g. in the SWIR bands (Figure 6&m). Quantitatively, the AAD maps explicitly demonstrate that the fusion biases of RASSFM are the smallest among all methods. Moreover, the statistical metrics in Tables 5 and S4 and the scatter plots in Figures S13-S14 indicate that RASSFM has higher spectral and spatial similarities to, and lower fusion errors with respect to, the reference bands than the compared methods. Note that since the spectral variances within each cropland are smaller than the variances in Sites I and II, the fusion performance of each method is generally better here (except for the SWIR bands).

Results of Site IV
This group of experiments further demonstrates the performance of our proposed and the compared SSF methods in vegetation-dominated landscapes, e.g. forests and lawns. For image sharpness, the spatial details of the input 30-m/120-m images were greatly improved by all methods (Figures 7 and S4), although differences remain among the methods. For the 20-m fusion scenario, for example, the CRISP- and UBDF-fused results have blurry (see Figure 7&k) and blocky (see Figure 7&l) effects, respectively. DPLM preserves the spatial details of the prior 10-m/20-m images better than CRISP and UBDF but loses some textures in the vegetated areas (see Figure 7&m). HPM, HySharp, and RASSFM have comparable spatial textures, but their spectral fidelities to the references are quite different. For visual spectral fidelity, the performance of all methods at the 10-m fusion level is comparable because this is a relatively homogeneous landscape, but the AAD maps explicitly demonstrate that RASSFM has the smallest fusion biases (Figure S4). Moreover, the fusion at the 20-m level shows the superiority of RASSFM explicitly because it has the closest spectral characteristics to the reference images, such as over the built-up and vegetation areas (Figure 7a-p). Quantitatively, the image fusion metrics in Tables 6 and S5 and the scatter plots in Figures S15-S16 show that RASSFM has the highest spectral and spatial fidelities and the smallest fusion errors with respect to the reference images.

Results of Site I
To complement the image fusion validation at the degraded scales in Section 4.1, fusion at the original scales, i.e. 3-m PS and 10-m/20-m S2, was conducted to evaluate the fused high spatial-spectral resolution images at the user end. Note that the input 3-m PS and 20-m S2 images are adopted as the spatial and spectral references, respectively, because the spatial and spectral contents of the fused bands come from those two data sources, respectively. In Figures 8 and S5, the spatial details of the PS images were injected into the S2 images well, and the spatial resolution improvement of the fused 3-m images is substantial compared with the original 10-m/20-m S2 images for all fusion methods; features such as the playgrounds and buildings are obvious in both the global and local zoomed-in views.
Furthermore, the spectral fidelity of RASSFM is better than that of the other methods (Figure 8); for example, the color of the outfield (artificial turf) of the baseball field (location A) is distorted by the other methods. Moreover, the quantitative evaluation metrics demonstrate distinct differences. In Table 7, the spectral CCs and Dλ with respect to the S2 bands show that the spectral fidelity of RASSFM is higher than that of the others. In addition, the spectral CCs reveal that the spectral fusion accuracies of the other methods decrease significantly for the SWIR bands (beyond the PS spectral range), whereas RASSFM maintains comparable accuracies. Furthermore, the spatial CCs and Ds with respect to the four PS bands indicate that the spatial information fidelity of RASSFM is better than that of the compared methods. Finally, the joint spatial-spectral quality index, i.e. QNR, shows that RASSFM achieves the best fusion effects regarding both spatial and spectral fidelity.
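For readers less familiar with the no-reference Dλ/Ds/QNR protocol (Alparone et al.'s QNR framework, built on Wang and Bovik's Universal Image Quality Index), the sketch below shows the general shape of the spectral term and the QNR combination, assuming the common UIQI-based definitions. This is our illustrative reading, not code from the paper; Ds is defined analogously against the high-resolution (here PS-derived) band and is omitted for brevity.

```python
import numpy as np

def uiqi(a, b):
    """Universal Image Quality Index Q between two equally sized bands,
    computed globally with population statistics."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    ca, cb = a - a.mean(), b - b.mean()
    return (4.0 * (ca * cb).mean() * a.mean() * b.mean()) / (
        ((ca ** 2).mean() + (cb ** 2).mean()) * (a.mean() ** 2 + b.mean() ** 2))

def d_lambda(fused, lowres):
    """Spectral distortion: average change in inter-band Q from the
    low-resolution input (bands, h, w) to the fused product (bands, H, W)."""
    n = len(fused)
    diffs = [abs(uiqi(fused[i], fused[j]) - uiqi(lowres[i], lowres[j]))
             for i in range(n) for j in range(i + 1, n)]
    return sum(diffs) / len(diffs)

def qnr(d_lam, d_s, alpha=1.0, beta=1.0):
    """Joint quality from the spectral (D_lambda) and spatial (D_s) indices;
    1 means no measured distortion in either aspect."""
    return (1.0 - d_lam) ** alpha * (1.0 - d_s) ** beta
```

A fused product whose inter-band relationships match the input S2 bands yields Dλ near 0, and QNR rewards keeping both distortion terms small simultaneously, which is why it is read here as a "balance" indicator.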

Results of Site II
As shown in Figures 9 and S6, the urban objects are sharpened significantly from 10-m/20-m to 3-m resolution, but CRISP and RASSFM have higher spatial fidelities than UBDF, DPLM, HPM, and HySharp (see the zoomed-in comparisons in Figure 9). In terms of visual spectral fidelity, the 10-m fusion results in the four bands are generally satisfying (Figure S6), but the 20-m fusion results in the six S2 bands show larger differences, indicating the better performance of RASSFM in preserving the spectral information of S2, e.g. for the cylindrical buildings (Figure 9a-g). Quantitatively, the spectral (spectral CC and Dλ) and spatial (spatial CC and Ds) fidelity metrics show that the RASSFM-based results have higher spectral and spatial fusion accuracies than the other methods (Table 8). The QNRs also indicate the balanced performance of RASSFM in preserving the spectral and spatial information from the S2 and PS images, respectively.

Results of Site III
For the heterogeneous agricultural landscapes, all the SSF methods substantially improve the spatial resolutions of the original 10-m/20-m S2 bands, as seen in the boundaries of the cropland parcels and tree crowns (see Figures 10 and S7). Visually, the spectral fidelities of the results obtained from the 10-m S2 bands are better than those obtained from the 20-m S2 bands because the resolution gap to the 3-m PS bands is smaller for the 10-m bands. The results fused by CRISP, UBDF, DPLM, and HPM show lower spectral similarity to the original S2 bands than those produced by HySharp and RASSFM, which is more evident for fusion at the 20-m level (Figure 10a-g). Moreover, the spatial fidelity of RASSFM is better than that of HySharp, particularly for the boundaries of the croplands (see Figure 10&g). Quantitatively, the spectral (spectral CC and Dλ), spatial (spatial CC and Ds), and joint spatial-spectral (QNR) evaluation metrics in Table 9 explicitly demonstrate that the results fused by RASSFM achieve the most balanced performance in preserving the spectral and spatial information from the S2 and PS images, respectively.

Results of Site IV
This dataset demonstrates the performance of the involved SSF methods in homogeneous landscapes dominated by vegetation. The spatial resolution improvements are obvious for all methods at both the 10-m and 20-m fusion levels (see Figures 11 and S8). Additionally, the spatial evaluation metrics in Table 10, i.e. the spatial CCs and Ds, indicate that the bands fused by RASSFM have better spatial consistency with the input 3-m PS bands. For spectral fidelity, the visual differences at the 10-m fusion level are not as obvious as in the previous three sites due to the homogeneity of this site (see Figures S8a-g), but the spectral metrics in Table 10, i.e. the spectral CCs and Dλ, show the differences explicitly. Moreover, the fused results at the 20-m level show larger visual differences (see Figure 11a-g), and the RASSFM-fused results have higher similarities to the original S2 images, as also indicated by the spectral CCs and Dλ in Table 10. For the joint spatial and spectral fidelities, the QNRs in Table 10 reveal that RASSFM balances the two aspects better than the other methods.

Discussion
The utilization of spatial neighborhood information to improve the accuracy of image fusion has proven effective in Spatial-Temporal image Fusion (STF; Gao et al. 2021; Li et al. 2020a; Liu et al. 2019). Obtaining similar neighbors and determining their weights accurately are the keys. Compared with the traditional neighborhood utilization scheme in STF (Gao et al. 2006), the non-local constrained linear regression strategy adopted in this study has several advantages (Zhao, Huang, and Song 2018). First, the image-patch-based searching scheme considers both the spatial and spectral similarities between the target image patch and its candidate neighbors to guarantee the reliability of the searched candidates, making it more robust than traditional pixel-based searching schemes, which are prone to noise and outliers in satellite images (Gao et al. 2006). Second, the locally constrained linear regression maintains the local spatial structure of the prior VHR PS imagery, and its non-negative weights are less sensitive to image noise (Saul and Roweis 2003). Hence, the locally constrained linear regression is more rigorous than arbitrary weighting schemes such as inverse distance weighting (Chang 2006).
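The weight-determination step can be illustrated with the classic locally linear reconstruction solve of Saul and Roweis (2003): weights that best reconstruct a target patch from its similar neighbors, subject to a sum-to-one constraint. This is a simplified sketch, not the exact RASSFM formulation; in particular, the non-negativity constraint is approximated here by clipping and renormalizing, and the function and variable names are our own.

```python
import numpy as np

def neighbor_weights(target_patch, neighbor_patches, reg=1e-3):
    """Sum-to-one reconstruction weights for a target patch given its similar
    neighbors, in the spirit of locally constrained linear regression
    (Saul and Roweis 2003). `target_patch` is a flattened patch of length d;
    `neighbor_patches` is (k, d)."""
    G = neighbor_patches - target_patch          # neighbors relative to target
    C = G @ G.T                                  # local Gram matrix (k, k)
    C = C + reg * np.trace(C) / len(C) * np.eye(len(C))  # stabilize the solve
    w = np.linalg.solve(C, np.ones(len(C)))      # solve C w = 1
    w = w / w.sum()                              # sum-to-one constraint
    w = np.clip(w, 0.0, None)                    # simplified non-negativity
    return w / w.sum()

# Toy usage: five candidate neighbors of a flattened 3x3 patch (length 9).
rng = np.random.default_rng(1)
target = rng.random(9)
neighbors = rng.random((5, 9))
w = neighbor_weights(target, neighbors)
print(w.sum())  # 1.0 (up to floating-point error)
```

The regularization term keeps the Gram matrix well-conditioned when neighbors are nearly collinear, which is one source of the noise robustness the text attributes to this weighting scheme over inverse distance weighting.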
Generally, there is a tradeoff between the image sharpness and the spectral fidelity of the fusion results. The moving window size and the similar-neighbor number of RASSFM control this tradeoff, as they affect the similar-neighbor searching and the weight determination, respectively. The smaller the moving window, the blurrier but the more spectrally faithful the results, and vice versa. Likewise, the fewer the similar neighbors, the blurrier but the more spectrally faithful the results, and vice versa. The parameters used in this study were determined through extensive trial-and-error experiments to achieve a good balance between image sharpness and spectral fidelity.
Furthermore, all fusion algorithms perform better in the four 10-m S2 bands than in the six 20-m S2 bands. The different spectral and spatial disparities between PS and S2 account for this: (1) the spectral characteristics of the four PS bands are more similar to the four 10-m S2 bands than to the six 20-m S2 bands, and (2) the spatial resolution of the four 10-m S2 bands is higher than that of the six 20-m S2 bands, so the resolution gap to the 3-m PS bands is smaller. Additionally, all methods have larger fusion errors in the SWIR bands than in the RE and NNIR bands in all study areas. The reason is that the SWIR bands lie beyond the spectral range of the PS sensor and thus have larger spectral differences from the four PS bands than the RE and NNIR bands, making the extrapolation of their spectral information more challenging (Winter et al. 2007). Moreover, the spectral fidelities of the results in Sites III and IV are better than those in Sites I and II (except for the SWIR bands) because the land covers of Sites III and IV are less heterogeneous than those of Sites I and II. Therefore, small spatial resolution gaps, large spectral correlations between PS and S2, and high land cover homogeneity all benefit the fusion accuracy.
As to efficiency, the computing time of each method in each study area was measured in MATLAB® R2021b on a ThinkPad X1 Carbon laptop with an Intel® Core i7-1185G7 CPU @ 3.00 GHz and 32 GB of RAM (Table S6). UBDF is the most time-consuming due to the unmixing process applied to every S2 pixel in each band. RASSFM and DPLM are the second and third most time-consuming, owing to the similar-neighbor searching and weight calculation in the former, and the spectral dictionary learning and sparse coding in the latter. CRISP needs more time than HySharp and HPM due to its Fourier and inverse Fourier transformations. HySharp and HPM have similar efficiency due to their comparable computational burdens.
Remote sensing image fusion is a broad topic that mainly includes spatial-temporal fusion (Gao et al. 2006), spatial-spectral fusion (Selva et al. 2015), spatial-angular fusion (Huang, Zhang, and Yu 2012), spectral resolution enhancement (Sun et al. 2014), and spatial resolution enhancement. SSF in turn covers traditional pansharpening and the relatively new hypersharpening. We define this study as a type of SSF because we aim to fuse VHR images with limited spectral bands (PS) and med-resolution images with more bands (S2), and there are no temporal changes between the two data sources. This study goes beyond the scope of pansharpening because no high-spatial-resolution panchromatic band is involved, and it differs slightly from hypersharpening due to the absence of hyperspectral imagery. On the other hand, although the utilization of spatial neighborhood information in RASSFM was motivated by previous STF methods (Gao et al. 2006; Zhao, Huang, and Song 2018) to improve the accuracy and robustness of SSF, the fusion purposes are quite different. STF aims to blend coarse-resolution (e.g. 250-m) satellite imagery with a frequent revisit cycle (e.g. daily) and fine-resolution (e.g. 30-m) satellite imagery with a sparse revisit cycle (e.g. 16 days) to produce fine-resolution images at the frequent revisit cycle. The fine-but-sparse and coarse-but-frequent inputs of STF have the same band configurations. The trickiest part of STF is downscaling the coarse-resolution change information to the fine resolution, which essentially does not exist in SSF because its data sources are acquired at the same or very close times; SSF deals with spectral downscaling rather than change downscaling. Note that although UBDF was originally proposed for SSF purposes (Zurita-Milla, Clevers, and Schaepman 2008), it was afterward revised for STF because the unmixing theory is applicable to both fusion scenarios (Li et al. 2020a; Zhu et al. 2016).

Conclusions
This study presents a new SSF method, i.e. RASSFM, for fusing the spatial details of PS and the spectral signatures of S2 to produce synthetic VHR (3-m) images in the ten bands (B, G, R, RE1&2&3, NIR, NNIR, SWIR1&2). Specifically, we proposed combining spectral mapping and spectral correlation to generate the VHR bands used to sharpen their S2 counterparts. Moreover, the spatial neighborhood information of a target pixel, under both spatial and spectral constraints, was introduced into the spatial-spectral fusion to improve the spectral fidelities of the results. We evaluated RASSFM qualitatively and quantitatively in different landscapes based on fusion at the degraded and original spatial resolutions. The results indicate that RASSFM performs robustly and adaptively across different landscapes and outperforms the other methods in preserving the spatial details of PS and the spectral information of S2. This study blends the advantages of existing VHR and med-resolution satellite observations to resolve scientific questions that cannot be addressed by either data source alone. Additionally, RASSFM enriches the SSF toolbox for blending PS and S2 imagery. Although RASSFM was developed for PS and S2 imagery, it can be used to fuse images from other satellites with similar spatial characteristics and spectral responses to produce synthetic images with high spatial-spectral resolutions.
The temporal resolution of the RASSFM outputs is the same as that of S2 (five days) because PS and S2 collect data at daily and five-day intervals, respectively. Moreover, unlike some VHR satellites that observe limited areas upon users' requests, both PS and S2 regularly observe the entire Earth. Therefore, fusing the advantages of the two sensors can generate global 3-m image products of the ten S2 bands every five days. Along with the development of the PS constellation and the accumulation of the PS and S2 data archives, the utility of RASSFM will grow as more remote sensing applications require ample spatial and spectral information, such as precision agriculture, forest structure monitoring, and artificial surface change tracking.
Although the temporal resolution of the fusion results is quite high, not all daily PS images are fully utilized. In fact, for PS observations without corresponding S2 imagery, the six 3-m bands (i.e. RE1&2&3, NNIR, SWIR1&2) cannot be generated by RASSFM because the temporal change information in those bands is unavailable. Our future work will pursue a spectral enhancement algorithm that can augment the spectral resolution of PS images without S2 counterparts.