Regression-based surface water fraction mapping using a synthetic spectral library for monitoring small water bodies

ABSTRACT Small water bodies (SWBs), such as ponds and on-farm reservoirs, are a key part of the hydrological system and play important roles in diverse domains from agriculture to conservation. The monitoring of SWBs has been greatly facilitated by medium-spatial-resolution satellite images, but the monitoring accuracy is considerably affected by the mixed-pixel problem. Although various spectral unmixing methods have been applied to map sub-pixel surface water fractions for large water bodies, such as lakes and reservoirs, it is challenging to map SWBs that are small in size relative to the image pixel and have dissimilar spectral properties. In this study, a novel regression-based surface water fraction mapping method (RSWFM) using a random forest and a synthetic spectral library is proposed for mapping 10 m spatial resolution surface water fractions from Sentinel-2 imagery. The RSWFM takes as input a few endmembers of water, vegetation, impervious surfaces, and soil to simulate a spectral library, and considers spectral variations in endmembers for different SWBs. Additionally, RSWFM applies noise-based data augmentation on pure endmembers to overcome the limitation often arising from the use of a small set of pure spectra in training the regression model. RSWFM was assessed in ten study sites and compared with the fully constrained least squares (FCLS) linear spectral mixture analysis, multiple endmember spectral mixture analysis (MESMA), and the nonlinear random forest (RF) regression without data augmentation. The results showed that RSWFM decreases the water fraction mapping errors by ~30%, ~15%, and ~11% in root mean square error compared with the linear FCLS unmixing, the MESMA unmixing, and the nonlinear RF regression without data augmentation, respectively. RSWFM has an accuracy of approximately 0.85 in R² in estimating the area of SWBs smaller than 1 ha.


Introduction
Surface water is an indispensable natural resource on Earth, and alterations in the levels of surface water affect aquatic and terrestrial ecosystems at local to global scales (Vörösmarty et al. 2010; Pekel et al. 2016; X. Wang et al. 2020). Small water bodies (SWBs), usually with an area of less than 10 ha, including ponds, on-farm reservoirs, fish farms, and paddy fields, play key roles in regional biodiversity conservation (Gibbs 1993), agricultural irrigation (Vanthof and Kelly 2019), and the global carbon cycle (Polishchuk et al. 2018). SWBs are widespread worldwide. More than 2.6 million on-farm reservoirs have been reported in the USA (Perin et al. 2022), and more than 5.17 million SWBs are found in China, including ~3.08 million SWBs located in the Yangtze River basin (Lv et al. 2022). Although SWBs account for only 8.6% of lakes and ponds globally, they contribute more than 15% and 42% of greenhouse gas emissions of CO2 and CH4, respectively (Holgerson and Raymond 2016). Furthermore, based on estimates of their sizes and distributions (Raymond et al. 2013; Holgerson and Raymond 2016), SWBs are crucial components of global change (Downing 2010). However, current global lake datasets are only able to detect surface water bodies larger than 10 ha (Lehner and Döll 2004; Perin et al. 2022) or 3 ha (Pi et al. 2022), and information about the spatial extent of the large number of SWBs is still unavailable.
Monitoring of SWBs has been greatly facilitated by the development of satellite remote sensing. Very high resolution (VHR) images, such as those from IKONOS, QuickBird, and PlanetScope, can map surface water bodies at a spatial resolution finer than 10 m. For instance, Perin et al. (2021) mapped SWBs at 3 m spatial resolution using PlanetScope images and the Otsu threshold method (Otsu 1979). Huang et al. (2015) mapped urban water at a 2 m spatial resolution from GeoEye and WorldView images using a combination of pixel- and object-based machine learning methods. B. Wang et al. (2022) mapped small and densely distributed surface water bodies at a 1 m spatial resolution from Gaofen-2 images using a convolutional neural network. However, most VHR images come from commercial sensing systems, are often costly, and provide limited geographic coverage and sometimes a coarse temporal resolution. By contrast, medium-spatial-medium-temporal-resolution images such as Landsat and medium-spatial-high-temporal-resolution images such as Sentinel-2 are free, cover large areas, and have been widely used in surface water mapping at global and regional scales. The Landsat image archive provides imagery, typically at a 30 m spatial resolution, and has been used in surface water mapping (Pekel et al. 2016; X. Wang et al. 2018; Mahdianpari et al. 2020; Pickens et al. 2020; X. Li et al. 2021). Launched in 2016, the Sentinel-2 satellite provides multi-spectral images at a spatial resolution of 10 m and a temporal repetition rate of approximately 4-5 days at the equator. The relatively fine spatial resolution of Sentinel-2 images relative to those from Landsat allows enhanced surface water mapping (Freitas et al. 2019; Ludwig et al. 2019; Jamali et al. 2021; Perin et al. 2022). Although the mapping of SWBs has been facilitated by pixel-based image classification, which labels each pixel as either water or non-water, it is difficult to accurately map SWBs smaller than ~0.04-0.36 ha from Sentinel-2 and Landsat images (Freitas et al. 2019).
The mapping of SWBs from medium-spatial-resolution remote sensing images is challenging due to the mixed-pixel problem, which means that both water and land classes contribute to the observed spectral response of the image pixel. The mixed-pixel problem is more common in mapping SWBs than in mapping large lakes, as all or most of the SWB area may be located within the mixed pixels (Halabisky et al. 2016; Sall et al. 2021; Lv et al. 2022).
To reduce the impact of the mixed-pixel problem, a large number of spectral unmixing methods, which decompose mixed pixels into a set of endmember spectra to obtain the proportions of each endmember in the mixed pixel, have been proposed to map sub-pixel SWBs. The most popular spectral unmixing methods are based on a linear mixture model, such as fully constrained least squares (FCLS) linear spectral mixture analysis (Heinz 2001; Feng et al. 2015; Zhang, Chen, and Lu 2015; Jarchow et al. 2020; Ling et al. 2020; C. Liu et al. 2020; Sall et al. 2021; Yang et al. 2022) and the multiple endmember spectral mixture analysis (MESMA), which allows the number and types of endmembers to vary from pixel to pixel (Kim et al. 2018; Yang et al. 2022). FCLS and MESMA are not appropriate in situations where nonlinear effects such as multiple scattering occur (Ray and Murray 1996). In contrast to FCLS and MESMA, which have strict physical meaning, regression-based unmixing uses machine learning methods, such as random forest (RF) and support vector regression (SVR), to construct the relationship between the multi-spectral values and the corresponding surface water fraction, and has been proposed as a means to generate sub-pixel surface water fraction maps (L. Li et al. 2018, 2019; Liang and Liu 2021). In regression-based unmixing, the relationship is determined from a set of pre-defined training data. For instance, Landsat images have been used to produce binary water maps, which may be combined with coarse spatial resolution MODIS images acquired at the same time to train the "MODIS reflectance image - surface water fraction" regression model (L. Li et al. 2018; Liang and Liu 2021). However, it is difficult to use the same method directly to map medium-spatial-resolution surface water fractions owing to a lack of fine-spatial-resolution surface water map databases from which water fraction images could be produced. With no ancillary data, the self-trained regression first segments the medium-spatial-resolution image into a binary water map and then degrades the medium-spatial-resolution multi-spectral image and binary water map to a coarse spatial resolution to train the regression model (Rover, Wylie, and Ji 2010; DeVries et al. 2017; B. Wang et al. 2022). The self-trained model is unsupervised and fails to incorporate fully the endmember information of SWBs during the unmixing.
To fully use prior endmember information, another regression-based unmixing method, regression-based unmixing using a synthetic spectral library (Okujeni et al. 2013; Senf et al. 2020), has great potential for surface water fraction mapping. This method takes as input a few pure image endmember spectra (Mitraka, Del Frate, and Carbone 2016) or prior endmembers collected from a finer-spatial-resolution hyperspectral image (Okujeni et al. 2013), and simulates a series of class fractions and the corresponding synthetic spectra based on linear and/or nonlinear mixture models as training data. Based on the synthetic training data, machine learning methods, such as SVR and RF, are used to construct the regression model used for prediction. The approach has been used for applications such as the mapping of sub-pixel land cover class fractions in urban (Okujeni et al. 2013; Mitraka, Del Frate, and Carbone 2016; Okujeni et al. 2016; Priem et al. 2019) and vegetated areas (Suess et al. 2018; Cooper et al. 2020; Senf et al. 2020). These studies identify the area rather than the locations of specific land covers (Okujeni et al. 2013; Okujeni, van der Linden, and Hostert 2015; Okujeni et al. 2018; Schug et al. 2018; Cooper et al. 2020; Schug et al. 2020; Senf et al. 2020). The accuracies were assessed based on hundreds of grids or polygons, each composed of a cluster of very high spatial resolution pixels that were manually interpreted and upscaled for validation. The results showed that the unmixing method using the synthetic spectral library not only had lower error than regression-based unmixing using ancillary data from land cover maps obtained by classifying finer-spatial-resolution remote sensing data (Priem et al. 2019) but also increased the accuracy compared with popular unmixing models such as MESMA (Okujeni et al. 2013; Mitraka, Del Frate, and Carbone 2016; Okujeni et al. 2016). However, to the best of our knowledge, regression-based unmixing using a synthetic spectral library has not been applied to mapping surface water fractions because a set of challenges is encountered with current methods.
First, traditional regression-based unmixing using a synthetic spectral library usually focuses on the mapping of vegetation-impervious surface-soil (VIS) fractions, and does not consider the case of a water-land mixture in the image pixels. The traditional method usually treats water separately from other materials of interest, as water is generally darker in the image than other land covers. In particular, the water pixels were masked using a predefined threshold applied to a water index, and the surface water fractions were 100% for the masked pixels and 0% for the other pixels (Powell et al. 2007; Schug et al. 2018; Cooper et al. 2020; Schug et al. 2020). This process reduces the impact of water on the unmixing of VIS, but it is unsuitable for quantifying the sub-pixel surface water fraction. A few studies have included the water endmember in the mixture model when generating the synthetic spectral library, but they assumed that the water spectra were relatively homogeneous and used only one or two water endmembers (Okujeni et al. 2013; Senf et al. 2020). A single water endmember cannot typically represent the various spectral properties of different SWBs. SWBs may be very sensitive to the surrounding environment and exhibit spectral variability due to differences in properties such as depth, water quality, chlorophyll concentration, and turbidity (Peterson, Sagan, and Sloan 2020; H. Liu et al. 2021; Wang et al. 2022). Moreover, SWBs may sometimes have low inter-class spectral separability and share some spectral properties with non-water classes. For instance, the spectral response of SWBs with high chlorophyll concentrations can resemble that of dense vegetation, and the spectral response of SWBs with high turbidity and shallow water depth may resemble those of some soils (Matsushita et al. 2012). Therefore, it is necessary to consider the endmember spectra from SWBs with different spectral properties in the synthetic spectral mixture model to obtain an accurate prediction of surface water fractions.
Second, the traditional regression-based unmixing using a synthetic spectral library generates a series of mixed synthetic spectra but only very few pure spectra. The traditional method primarily focuses on the unmixing of VIS, where pure pixels are relatively few in the image to be analyzed. The method may use tens of thousands of mixed spectra but only dozens of pure endmembers as training samples, which has several limitations for sub-pixel surface water mapping. To begin with, limited pure water spectra cannot represent the various spectral classes of SWBs, and limited pure land spectra cannot represent the various spectral properties of different land covers, such as vegetation, impervious surface, and soil. Moreover, using a small dataset of pure pixels for training usually results in unsatisfactory predictions by machine learning models (Gao et al. 2013; Ling et al. 2019; Worden et al. 2021). Lastly, the very small proportion of pure spectra in the training dataset (~1%) may not represent the real-world proportions of pure water and pure land pixels, in which the mixed water-land pixels, located close to the waterlines, are in the minority. Although increasing the number of pure water and pure land spectra collected for training may result in a more representative pure endmember dataset, this process is both complicated and time-consuming in real scenarios.
In this study, a novel regression-based surface water fraction mapping method (RSWFM) is proposed to address the challenges of traditional regression-based unmixing using a synthetic spectral library for mapping SWBs from Sentinel-2 imagery. Unlike the traditional regression-based methods that mask the water pixels, RSWFM introduces water endmembers in the spectral mixture model and generates a series of water-land mixed spectra to train the regression model while considering the intra-class spectral variability in water endmembers and land endmembers. Additionally, to enlarge the number of pure spectra and enhance their representativeness for training, RSWFM applies noise-based data augmentation (Gao et al. 2013; Ling et al. 2019), adding Gaussian noise to the few pure endmembers. RSWFM adopts RF regression to learn the relationship between the multi-spectral synthetic spectra and the corresponding surface water fractions for prediction. The aim of RSWFM is to map sub-pixel surface water fractions for small water bodies, most of which have areas smaller than 1 ha. Here, the potential of the RSWFM method was assessed in ten study sites in China, the USA, Canada, and France and compared, both visually and quantitatively, with two state-of-the-art linear unmixing algorithms and the nonlinear RF regression without data augmentation.

Study area
Ten study sites, each with an area of 100 km², were selected in this study (Figure 1). Sites 1-6 are located in the Yangtze River basin, China. Site 7 is located in North Dakota, USA. Site 8 is located in Saskatchewan, Canada. Sites 9-10 are located in Loir-et-Cher and Ain, France. Each site contains a large number of SWBs used for irrigation, aquaculture, and rice cultivation.

Sentinel-2 imagery
Ten Sentinel-2 multi-spectral images in the ten study sites were downloaded from the Copernicus European Space Agency hub. The Level 1C Sentinel-2 top of atmosphere (TOA) reflectance images were atmospherically corrected to surface reflectance images using the Sen2Cor tool of the SNAP software (Main-Knorn et al. 2017). In each site, a subset of the Sentinel-2 image with an area of 100 km² was adopted for surface water fraction mapping (Figure 1). The ten Sentinel-2 subset images are free of opaque clouds and thin cirrus clouds based on the Sentinel-2 quality assessment band and the scene classification operator in the Sen2Cor tool. Sentinel-2 images cover the spectral range between 433 and 2280 nm, with 13 spectral bands at 10-60 m resolution. Ten bands with spatial resolutions of 10 m and 20 m were used in this study, including the blue, green, red, three vegetation red-edge bands, near-infrared (NIR) band, narrow near-infrared band, and two short-wave infrared (SWIR) bands. The Sentinel-2 images were projected onto the WGS-1984 Universal Transverse Mercator (UTM) projection.

Google Earth image for validation
Ten high-spatial-resolution (1 m) cloud-free Google Earth images were used as the ground truth data (Figure 2 (a1-a10)). The Google Earth images were acquired temporally close to the corresponding Sentinel-2 images to reduce the impact of land cover change when assessing the surface water fraction mapping outputs from Sentinel-2 (Table 1). Each Google Earth image was projected onto the same UTM projection as the corresponding Sentinel-2 image. The Google Earth images were georegistered with the Sentinel-2 images to reduce the impact of registration errors (Hoge et al. 2003). Surface water bodies on all ten sites were digitized manually through visual interpretation of the Google Earth images to produce the 1 m water-land binary maps in Figure 2 (b1-b10) (Halabisky et al. 2016; Sall et al. 2021; Perin et al. 2022). The use of finer-spatial-resolution data with advanced interpretation models, including visual interpretation through expert knowledge, has proven effective for producing reference surface water maps (Olofsson et al. 2014; Pekel et al. 2016; Pickens et al. 2020). Then, the 1 m binary surface water maps were spatially degraded to 10 m resolution reference surface water fraction images to validate the accuracy of the predicted surface water fraction images. The reference surface water fraction in each Sentinel-2 pixel was calculated by dividing the total number of 1 m resolution water pixels within the pixel by 100 (Nill et al. 2022; B. Wang et al. 2022).
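The degradation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `reference_water_fraction`, the nested-list representation of the 1 m binary map, and the toy data are all hypothetical, and the map is assumed to be perfectly aligned with the 10 m pixel grid.

```python
def reference_water_fraction(binary_map, block=10):
    """Aggregate a 1 m binary map (1 = water, 0 = land) into block x block cells.

    Each output value is the number of water pixels in the cell divided by the
    number of 1 m pixels per cell (100 for a 10 m Sentinel-2 pixel).
    """
    rows, cols = len(binary_map), len(binary_map[0])
    if rows % block or cols % block:
        raise ValueError("map dimensions must be multiples of the block size")
    fractions = []
    for r0 in range(0, rows, block):
        out_row = []
        for c0 in range(0, cols, block):
            water = sum(
                binary_map[r][c]
                for r in range(r0, r0 + block)
                for c in range(c0, c0 + block)
            )
            out_row.append(water / (block * block))
        fractions.append(out_row)
    return fractions


# Hypothetical 10 m x 20 m scene: the left cell holds 30 water pixels
# (three full rows), the right cell is entirely water.
toy = [[1 if (c >= 10 or r < 3) else 0 for c in range(20)] for r in range(10)]
print(reference_water_fraction(toy))  # [[0.3, 1.0]]
```

With real rasters the same aggregation would normally be done with an array library, but the arithmetic is identical.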

SWBs in the ten sites
The statistics on the area of surface water bodies in each site are shown in Figure 3. SWBs smaller than 1 ha are numerous at all ten sites. In particular, the number of SWBs is 3320, 2335, 2790, 3664, 1303, 938, 1515, 1109, 332, and 335 at the ten sites, respectively. The total area of SWBs smaller than 1 ha is ~377 ha (98.40% of total surface water area) at site 1, ~409 ha (98.54% of total surface water area) at site 2, ~480 ha (97.46% of total surface water area) at site 3, ~649 ha (96.92% of total surface water area) at site 4, ~268 ha (95.17% of total surface water area) at site 5, ~137 ha (86.78% of total surface water area) at site 6, ~190 ha (91.16% of total surface water area) at site 7, ~168 ha (89.45% of total surface water area) at site 8, ~69 ha (64.16% of total surface water area) at site 9, and ~39 ha (53.73% of total surface water area) at site 10. The depths of SWBs in the ten sites range from approximately 1 m to 5 m.

Method
The proposed RSWFM generated 10 m spatial resolution surface water fraction maps from the Sentinel-2 images. The six 20 m Sentinel-2 bands were first downscaled to 10 m via pan-sharpening. Then, using a combination of linear and nonlinear spectral mixture models and noise-based data augmentation, synthetic spectral libraries for mixed water-land, pure water, and pure land pixels were generated based on the Sentinel-2 image endmembers and synthetic surface water fractions. With the training dataset, RF was used to construct the regression relationship between the synthetic spectra and the corresponding surface water fractions and then applied to the Sentinel-2 image to generate the surface water fraction map. A flowchart of the RSWFM is shown in Figure 4, and more details of the method are given below.

Sentinel-2 image pre-processing
The

Endmember spectra collection
At each study site, several endmember spectra were collected directly from the Sentinel-2 image. The endmember spectra were sampled from homogeneous regions in the Sentinel-2 image based on the corresponding Google Earth image (Mitraka, Del Frate, and Carbone 2016). The size of the homogeneous region used to define endmember spectra was at least 30 × 30 m. A two-level hierarchical classification scheme was used in this study (Mitraka, Del Frate, and Carbone 2016; Cooper et al. 2020; Schug et al. 2020). The first level contains four main classes: water, vegetation, impervious surface, and soil.
The second level divided the first level into more detailed sub-classes so that various land cover classes can be involved in compositing the spectral library. For instance, the first-level class of vegetation includes the sub-classes of trees, crops, and shrubs in site 2, and the first-level class of impervious surface includes the sub-classes of building roofs and roads in site 5. The number of endmember spectra for each first-level class is listed in Table 2. Because the Sentinel-2 images used for analysis were acquired in different seasons, the endmember spectra for each site were used to construct the synthetic spectral library for only that particular site. Ten synthetic spectral libraries were thus constructed.

Generation of synthetic mixed spectra
RSWFM generated a series of synthetic water-vegetation-impervious surface-soil (WVIS) fractions and the corresponding synthetic mixed spectra. First, synthetic WVIS fractions were generated. The RSWFM adopts a binary mixture model that considers a pixel composed of no more than two classes (Franke et al. 2009; Okujeni et al. 2013). Two different endmembers were first selected, and the class fractions of the two selected endmembers were assigned proportionally, as shown in Figure 4 (Okujeni et al. 2013). The mixing ratios for the two endmembers were set from 10% to 90% with an interval of 10%, as a compromise among running time, data complexity, and redundancy. The sum of the two endmember fractions was 100%.
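The fraction-generation step can be illustrated with a short Python sketch. It assumes that every pair of the four first-level classes is enumerated; the names `CLASSES` and `binary_fractions` are hypothetical, and fractions are kept as integer percentages to avoid floating-point rounding.

```python
from itertools import combinations

# First-level classes named in the text; enumerating all class pairs is an assumption.
CLASSES = ["water", "vegetation", "impervious", "soil"]


def binary_fractions(classes=CLASSES, step=10):
    """Yield (class_a, class_b, percent_a, percent_b) for a binary mixture model.

    Mixing ratios run from `step` to 100 - `step` percent, and the two
    fractions always sum to 100%.
    """
    for class_a, class_b in combinations(classes, 2):
        for percent_a in range(step, 100, step):
            yield class_a, class_b, percent_a, 100 - percent_a


mixtures = list(binary_fractions())
print(len(mixtures))   # 6 class pairs x 9 mixing ratios = 54
print(mixtures[0])     # ('water', 'vegetation', 10, 90)
```

In the full method each such fraction pair is combined with every pair of endmember spectra from the two classes, so the number of synthetic mixed spectra grows with the number of endmembers collected per class.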
Synthetic mixed spectra were generated according to endmember spectra and synthetic class fractions. With each synthetic endmember fraction in the binary mixture model, two spectra, one from each of the corresponding endmembers, were mixed to generate synthetic mixed spectra. We iteratively selected all combinations of spectra from the two endmembers in the binary mixture model to fully represent all possible spectral mixing scenarios. In each binary mixture model, both linear and nonlinear mixture models are considered for generating the synthetic mixed spectra (Okujeni et al. 2013; Okujeni, van der Linden, and Hostert 2015; Mitraka, Del Frate, and Carbone 2016). The linear and nonlinear spectral synthesis models are described in Equations (1) and (2), respectively:
$$\alpha_i = \sum_{j=1}^{N} a_j^{(i)} \rho_j \tag{1}$$
$$\beta_i = \sum_{j=1}^{N} a_j^{(i)} \rho_j + \sum_{j=1}^{N} \sum_{l=j}^{N} b_{j,l}^{(i)} \, \rho_j \odot \rho_l \tag{2}$$
where $\alpha_i$ is the $i$th synthetic mixed spectrum using the linear spectral synthesis model, $\beta_i$ is the $i$th synthetic mixed spectrum using the nonlinear spectral synthesis model, $a_j^{(i)}$ is the mixing ratio of endmember $j$ in mixed spectrum $i$, $b_{j,l}^{(i)}$ is a non-negative coefficient representing the nonlinear contribution in mixed spectrum $i$, randomly drawn from an exponential distribution with a mean value of 0.05 (Meganem et al. 2013), $\rho_j$ and $\rho_l$ are the spectra of endmembers $j$ and $l$, respectively, with their product $\rho_j \odot \rho_l$ taken band by band, and $N$ is the number of endmembers in the mixture model.
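A minimal Python sketch of the two synthesis models follows. It is an illustration under stated assumptions rather than the authors' code: the function names and the four-band reflectance values are hypothetical, and the nonlinear term sums band-wise products over all endmember pairs with l ≥ j, following the linear-quadratic form of the cited Meganem et al. (2013) model, which is an assumption about the exact index range.

```python
import random


def linear_mix(spectra, ratios):
    """Linear synthesis: band-wise sum of endmember spectra weighted by mixing ratios."""
    n_bands = len(spectra[0])
    return [sum(a * rho[b] for a, rho in zip(ratios, spectra)) for b in range(n_bands)]


def nonlinear_mix(spectra, ratios, rng, mean_b=0.05):
    """Nonlinear synthesis: linear mixture plus weighted band-wise products.

    Each coefficient b_jl is drawn from an exponential distribution with mean
    `mean_b`, so it is always non-negative. The pair range l >= j is an assumption.
    """
    mixed = linear_mix(spectra, ratios)
    n = len(spectra)
    for j in range(n):
        for l in range(j, n):
            b_jl = rng.expovariate(1.0 / mean_b)  # rate = 1 / mean
            for band in range(len(mixed)):
                mixed[band] += b_jl * spectra[j][band] * spectra[l][band]
    return mixed


# Hypothetical four-band reflectances for one water and one soil endmember.
water = [0.03, 0.04, 0.03, 0.02]
soil = [0.10, 0.15, 0.20, 0.30]
rng = random.Random(0)

alpha = linear_mix([water, soil], [0.3, 0.7])        # 30% water, 70% soil
beta = nonlinear_mix([water, soil], [0.3, 0.7], rng)
print(alpha)
print(beta)
```

Because the coefficients and reflectances are non-negative, the nonlinear spectrum is never darker than the linear one in any band, which is the expected effect of adding multiple-scattering contributions.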

Generation of synthetic pure spectra
To improve the representativeness of pure endmembers in training the regression-based unmixing model, noise-based data augmentation was performed to increase the number of pure spectra. Synthetic Gaussian noise was added to the spectra of pure water and pure land endmembers. For each endmember spectrum in Table 2, a total of $K$ spectral vectors were generated by Gaussian noise-based augmentation. Specifically, $\rho^{\mathrm{water}}_{b,i,k}$, the $k$th ($k = 1, 2, \ldots, K$) synthetic spectrum for the $i$th water endmember in the $b$th spectral band, was calculated following Equation (3), where $\rho^{\mathrm{water}}_{b,i}$ is the spectrum of the $b$th band in the $i$th water endmember; $\varepsilon^{\mathrm{water}}_{b,i,k}$ is the $k$th synthetic additive Gaussian noise with zero mean and unit variance in the $b$th band for the $i$th water endmember; and $\sigma^{\mathrm{water}}_{b}$ is the standard deviation of the $b$th band across all water endmembers, which is multiplied by the synthetic Gaussian noise. The magnitude of the Gaussian noise is therefore proportional to the standard deviation of the spectral value in the $b$th band. $C$ is a constant coefficient that controls the magnitude of the Gaussian noise.
Similarly, for land endmembers, the $k$th ($k = 1, 2, \ldots, K$) synthetic spectrum for the $i$th land endmember in the $b$th spectral band, that is, $\rho^{\mathrm{land}}_{b,i,k}$, was calculated following Equation (4), where $\rho^{\mathrm{land}}_{b,i}$ is the spectrum of the $b$th band in the $i$th land endmember; $\varepsilon^{\mathrm{land}}_{b,i,k}$ is the $k$th synthetic additive Gaussian noise with zero mean and unit variance in the $b$th band for the $i$th land endmember; and $\sigma^{\mathrm{land}}_{b}$ is the standard deviation of the $b$th band across all land endmembers, which is multiplied by the synthetic Gaussian noise.
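The augmentation described in this section can be sketched as follows. This is a hedged illustration: the function name `augment_pure_spectra` and the toy spectra are hypothetical, the population standard deviation is used as an assumption, and the way the coefficient C enters is not fixed here. The sketch instead takes a generic `noise_scale` multiplier on the product of the band standard deviation and the unit-variance noise, which would be derived from C.

```python
import random
import statistics


def augment_pure_spectra(endmembers, k, noise_scale, rng):
    """Generate k noisy copies of every pure endmember spectrum.

    endmembers  : list of spectra (one list of band values per endmember) of one group
                  (all water endmembers, or all land endmembers)
    k           : number of synthetic spectra per endmember (K in the text)
    noise_scale : multiplier applied to sigma_b * epsilon (assumed to be derived from C)
    """
    n_bands = len(endmembers[0])
    # Per-band standard deviation across all endmembers of the group.
    sigma = [statistics.pstdev([e[b] for e in endmembers]) for b in range(n_bands)]
    synthetic = []
    for spectrum in endmembers:
        for _ in range(k):
            synthetic.append(
                [spectrum[b] + noise_scale * sigma[b] * rng.gauss(0.0, 1.0)
                 for b in range(n_bands)]
            )
    return synthetic


# Hypothetical three-band spectra for three water endmembers.
water_endmembers = [
    [0.03, 0.04, 0.02],
    [0.05, 0.07, 0.03],
    [0.04, 0.05, 0.02],
]
rng = random.Random(42)
augmented = augment_pure_spectra(water_endmembers, k=5, noise_scale=0.2, rng=rng)
print(len(augmented))  # 3 endmembers x 5 copies = 15
```

Setting `k=0` returns an empty list, which corresponds to the case without noise-based data augmentation.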

Spectra unmixing based on RF regression
From the aforementioned synthetic spectral library, a series of synthetic spectral values and their corresponding class fractions were generated. In the regression model, all the class fractions from vegetation, impervious surface, and soil were merged as land class fractions. The synthetic spectral values and their corresponding surface water fractions for the mixed and pure spectra were input into an RF regression to train the surface water fraction prediction model. Specifically, the synthetic spectra were input as independent variables, and the corresponding synthetic surface water fraction was input as the response variable. The number of synthetic mixed and pure spectra depends on the number of image endmembers, the mixing ratio interval for water and land, and the parameter K in Equations (3) and (4). Detailed information about the number of spectra used for training the RF regression model is shown in Table 3. RF regression (Breiman 2001) is an ensemble-learning nonlinear regression algorithm based on classification and regression trees (CART). In contrast to CART, RF combines a set of individual decision trees to improve the prediction performance. To avoid overfitting as the number of decision trees and training data increases, each tree is constructed from a random bootstrap sample of the training data, with binary partitioning at each node of the tree. The final prediction is acquired by averaging the results of all trees. Once the RF regression model was built, it was used to predict the surface water fraction map by inputting the Sentinel-2 multi-spectral image.
The RF regression model contains two main hyperparameters: the number of decision trees (ntree) and the size of the random subset of variables considered at each node (mtry). The Bayesian optimizer, an iterative response surface-based global optimization algorithm, was adopted to automatically select the optimal RF hyperparameters ntree and mtry (Pelikan, Goldberg, and Cantú-Paz 1999; Wu et al. 2019). In particular, the Bayesian optimization uses Gaussian process regression to autonomously learn the next hyperparameter value set from all the information available from previous evaluations during the tuning process (Snoek, Larochelle, and Adams 2012). According to previous studies, the range of ntree was set to 1 to 600, the range of mtry was set to 1 to 10, the upper bound being the total number of variables (i.e. the number of input Sentinel-2 bands), and the number of optimization iterations was set to 50 in the Bayesian optimizer (Feng et al. 2015; DeVries et al. 2017; L. Li et al. 2018; Han et al. 2020).

Comparison with different spectral unmixing methods
The proposed RSWFM was compared with two state-of-the-art unmixing algorithms: FCLS and MESMA. At each study site, the same set of endmembers (Table 2) was used with each unmixing method. In FCLS, the unmixing result is ill-posed if the number of endmembers is larger than that of spectral bands (Small 2001). To generate reliable results and reduce computational complexity, FCLS used the averaged endmember spectrum of each of the four classes in mapping the surface water fraction. In contrast, MESMA and RSWFM input all the endmember spectra in unmixing.
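To make the FCLS baseline concrete, the sketch below averages the spectra of a class into one endmember and solves the fully constrained problem in the simplest possible setting. It is a deliberate simplification and an assumption: only two endmembers (water and a single averaged land spectrum) are used, for which the non-negativity and sum-to-one constraints reduce to a clipped one-dimensional least-squares solution. The four-endmember case used in the comparison needs an iterative constrained solver. All names and reflectance values are hypothetical.

```python
def class_mean(spectra):
    """Average several endmember spectra of one class into a single spectrum."""
    n = len(spectra)
    return [sum(s[b] for s in spectra) / n for b in range(len(spectra[0]))]


def fcls_two_endmembers(pixel, water, land):
    """Water fraction f minimizing ||pixel - (f*water + (1-f)*land)||^2 with 0 <= f <= 1.

    With two endmembers the sum-to-one constraint leaves one free variable, so
    the constrained optimum is the unconstrained one clipped to [0, 1].
    """
    diff = [w - l for w, l in zip(water, land)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("water and land endmembers are identical")
    f = sum((p - l) * d for p, l, d in zip(pixel, land, diff)) / denom
    return min(1.0, max(0.0, f))


# Hypothetical spectra: two water endmembers and two land endmembers, three bands each.
water_mean = class_mean([[0.02, 0.03, 0.01], [0.04, 0.05, 0.03]])
land_mean = class_mean([[0.10, 0.20, 0.30], [0.14, 0.24, 0.34]])

# A pixel mixed from 40% water and 60% land is recovered as about 0.4.
pixel = [0.4 * w + 0.6 * l for w, l in zip(water_mean, land_mean)]
print(round(fcls_two_endmembers(pixel, water_mean, land_mean), 3))  # 0.4
```

A pixel darker than the water endmember in every band would give an unconstrained estimate above 1, which the clipping maps back to a fraction of 1.0.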
In RSWFM, the impacts of the parameters K and C in Equations (3) and (4) were assessed. Parameter K, which determines the number of synthetic pure spectra in the training data, was set to 0, 50, 100, 200, 500, 800, 1000, and 1200. Parameter C, which controls the magnitude of Gaussian noise, was set to 0.2, 0.5, 1, 5, 10, 20, and 50. Many trials showed that very large values for K (K > 1200) and C (C > 50) increase the running time without necessarily increasing the mapping accuracy, so they were not assessed. When K = 0, RSWFM was the same as traditional regression-based unmixing without noise-based data augmentation.
Table 3. The number of spectra samples for training regression models in different sites. The number of synthetic mixed water-land spectra is mainly dependent on the mixing ratio interval, which is set as 10% in this study. The number of synthetic pure spectra is dependent on the parameter K, and this table shows the number of synthetic pure spectra with K = 500, which is adopted in model comparison in the experimental results.

Accuracy assessment
The accuracy of the predictions from the different methods was assessed by comparison with the reference surface water fraction maps produced from the Google Earth images (Figure 2 (b1-b10)).
The per-pixel accuracies of the different methods were assessed using the root mean square error (RMSE) and the mean absolute error (MAE) as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{m=1}^{M} \left(P_m - R_m\right)^2}$$
$$\mathrm{MAE} = \frac{1}{M} \sum_{m=1}^{M} \left|P_m - R_m\right|$$
where $M$ is the number of 10 m Sentinel-2 image pixels, $P_m$ is the predicted surface water fraction of the $m$th pixel, and $R_m$ is the reference surface water fraction for the $m$th pixel. The percentage area error for each water body was assessed using the mean absolute percentage error (MAPE) (Chicco, Warrens, and Jurman 2021):
$$\mathrm{MAPE} = \frac{100\%}{M} \sum_{m=1}^{M} \left|\frac{P_m - R_m}{R_m}\right|$$
where $M$ is the number of SWBs assessed, $P_m$ is the predicted area of the $m$th SWB, and $R_m$ is the reference area of the $m$th SWB. An MAPE value of 0 indicates that there is no error between the predicted and reference water areas, while an MAPE value greater than 100% indicates that the predicted values are highly unreliable.
The correlation of predicted and reference SWB areas was assessed using the coefficient of determination (R²) of the fitted line in the linear regression (Wright 1921; Chicco, Warrens, and Jurman 2021):
$$R^2 = 1 - \frac{\sum_{m=1}^{M} \left(P_m - R_m\right)^2}{\sum_{m=1}^{M} \left(R_m - \bar{R}\right)^2}$$
where $M$ is the number of SWBs assessed, $P_m$ is the predicted area of the $m$th SWB, $R_m$ is the reference area of the $m$th SWB, and $\bar{R}$ is the mean reference area of all SWBs. The upper bound of R² is 1, and the fit is perfect when R² equals 1.
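The four accuracy metrics of this section can be computed with a few lines of Python. The sketch below is illustrative only: the function names and the toy area values are hypothetical, and R² is written in the residual form described by Chicco, Warrens, and Jurman (2021), using the mean of the reference values.

```python
import math


def rmse(pred, ref):
    """Root mean square error between predicted and reference values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mae(pred, ref):
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def mape(pred, ref):
    """Mean absolute percentage error, in percent (reference values must be non-zero)."""
    return 100.0 * sum(abs(p - r) / abs(r) for p, r in zip(pred, ref)) / len(ref)


def r_squared(pred, ref):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_ref = sum(ref) / len(ref)
    ss_res = sum((p - r) ** 2 for p, r in zip(pred, ref))
    ss_tot = sum((r - mean_ref) ** 2 for r in ref)
    return 1.0 - ss_res / ss_tot


# Hypothetical predicted and reference SWB areas (ha).
predicted = [1.0, 2.0, 3.0]
reference = [1.0, 2.0, 5.0]
print(rmse(predicted, reference), mae(predicted, reference))
print(mape(predicted, reference), r_squared(predicted, reference))
```

RMSE and MAE are applied to per-pixel fractions, whereas MAPE and R² are applied to per-water-body areas, so the inputs differ even though the functions share the same signature.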

Results
The predicted surface water fraction maps obtained from the different methods were compared. This section presents the results of RSWFM without noise-based data augmentation (with K = 0, that is, no synthetic pure spectra were generated, and C has no impact according to Equations (3) and (4)) and RSWFM with noise-based data augmentation with parameters K = 100, C = 5 and K = 500, C = 5. The impacts of different K values (0, 50, 100, 200, 500, 800, 1000, and 1200) and C values (0.2, 0.5, 1, 5, 10, 20, and 50) are discussed in the discussion section.

Comparison of model performances in ten sites
The surface water fraction maps generated by FCLS, MESMA, and RSWFM at the ten sites are shown in Figure 5. FCLS unmixing overestimated the surface water fraction, typically at sites 3-7 and site 9. RSWFM without noise-based data augmentation (K = 0) overestimated the surface water fraction at sites 3-9. The surface water fraction maps generated by RSWFM were more similar to the reference map than those generated by FCLS, MESMA, and RSWFM without noise-based data augmentation.
The quantitative accuracy assessment metrics of RMSEs and MAEs for the surface water fraction maps from each method are listed in Table 4. The proposed RSWFM generated the lowest RMSEs, which were lower than 0.16 for all ten sites. In general, RSWFM decreased the RMSEs by 0.01-0.11 (~30% on average) compared with FCLS and by 0.01-0.07 (~15% on average) compared with MESMA. Similarly, the RSWFM generated MAEs that were lower than 0.09 and decreased the MAEs by 0-0.11 (~46% on average) compared with FCLS. Moreover, RSWFM generated the lowest MAEs at all sites, except for sites 2, 4, and 5.
RSWFM (K = 100 and K = 500) generated lower RMSE and MAE values than the traditional regression-based unmixing without noise-based data augmentation (RSWFM with K = 0) at all ten sites, showing that increasing the number of synthetic pure spectra could reduce the surface water fraction mapping error. In particular, RSWFM (K = 100 and K = 500) decreased RMSE by 0.003-0.031 (~11% on average) and decreased MAE by 0.01-0.04 (~33% on average) compared with RSWFM (K = 0).

Comparison of SWBs of different sizes, shapes, and spectral properties
The surface water fractions predicted by all the methods for several SWBs of different sizes, shapes, and spectral properties are compared in this section. Figure 6 shows the predicted surface water fractions for SWBs of different sizes obtained from the different methods. FCLS overestimated surface water fractions for many pure land pixels, such as those highlighted by black ellipses in Figure 6 (c1) and (c3). MESMA underestimated the surface water fractions for many pure water pixels, such as those highlighted by black ellipses in Figure 6 (d2) and (d5). The traditional regression-based unmixing without noise-based data augmentation (RSWFM with K = 0) overestimated the surface water fractions for several pure land pixels, such as the black ellipse in Figure 6 (e2), and underestimated the surface water fractions for several pure water pixels, such as the black ellipse in Figure 6 (e6). In contrast, RSWFM with noise-based data augmentation (K = 100 and K = 500) generally predicted surface water fractions better in the pure land and pure water pixels. All the methods roughly predicted the shape of the SWBs in Figure 6 when they were larger than approximately 0.3 ha, because large SWBs contain many pure water pixels. All the methods failed to accurately predict the shape of the smallest SWB, which was 0.088 ha (first row in Figure 6), because a large proportion of the SWB area was located in mixed water-land boundary pixels. This finding reveals that even though Sentinel-2 images have a relatively fine 10 m resolution, mapping small SWBs (especially those <0.1 ha, about 10 Sentinel-2 pixels) remains challenging.
Figure 7 shows the predicted surface water fractions for SWBs of different shapes obtained using different methods. For artificial fishponds and on-farm reservoirs that have rectangular and circular shapes, such as the SWB in Figure 7 (a7), all the methods can roughly predict the shape of the SWBs. For natural ponds with irregular shapes, none of the methods could precisely map the shape of the SWB, as shown in Figure 7 (c4-g4). For the linear SWB in Figure 7 (a1), all the methods generally mapped the shape of the SWB but failed to accurately map the regions where the river meanders, as highlighted by black ellipses in Figure 7 (c1-g1). Comparing the different methods gave results similar to those for SWBs of different sizes. In particular, FCLS overestimated the surface water fraction in the vegetation regions, as shown in Figure 7 (c2) and (c3). MESMA underestimated the surface water fraction within the pond, as indicated by the black ellipses in Figure 7 (d4) and (d6). RSWFM (K = 100 and K = 500) reduced the overestimation in the vegetation regions compared with RSWFM (K = 0), as shown in Figure 7 (e5) and (e7), and better mapped the shape of the SWBs, showing the effectiveness of integrating noise-based data augmentation.
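Figure 8 shows the predicted surface water fractions for SWBs with different spectral properties using different methods. The SWBs appear black, dark blue, light blue, and dark green in the Sentinel-2 false-color composite images in Figure 8 (b1-b7). In general, because FCLS averaged the spectra of different water endmembers, it overestimated the surface water fractions in the regions covered by dense vegetation, as highlighted by the black ellipse in Figure 8 (c3). Moreover, FCLS overestimated the surface water fraction in the shadow area, as highlighted by the black ellipse in Figure 8 (c4), because the dark shadows and water have similar spectral properties. In contrast, MESMA and RSWFM, which consider the intra-class spectral variability in water endmembers, better distinguished water from land and reduced the overestimation of the surface water fraction in vegetation areas. MESMA overestimated the surface water in the bare region, as highlighted by the black ellipse in Figure 8 (d6), and underestimated the surface water near the water-land boundary region, as highlighted by the black ellipse in Figure 8 (d7). In contrast, the RSWFM maps were more similar to the real surface water of the SWBs in Figure 8 (a1-a7). Although RSWFM outperformed the comparators in mapping most SWBs, its predictions had flaws for some SWBs. For instance, RSWFM overestimated surface water fractions in the bare land regions highlighted by the black ellipses in Figure 8 (f5) and (g5), whereas MESMA better mapped the surface water fractions in this region.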

Impact of RSWFM parameters
RSWFM performance depends on its parameters. In RSWFM, the parameter K controls the number of augmented pure spectra (the number of synthetic pure spectra is K times the number of endmember spectra), and the parameter C in Equations (3-4) controls the magnitude of the Gaussian noise added to the pure endmember spectra. Different K values (K = 0, 50, 100, 200, 500, 800, 1000, and 1200) and C values (C = 0.2, 0.5, 1, 5, 10, 20, and 50) were assessed. The corresponding RMSE values for the surface water fraction maps are shown in Figure 9. When 0.5 < C < 50, RSWFM with K > 0 generated smaller RMSE values than the traditional regression-based unmixing without noise-based data augmentation (RSWFM with K = 0) at all ten sites, showing that using noise-based data augmentation could improve the accuracy of RSWFM. In general, the lowest RMSE values were found for RSWFM with K ranging from 200 to 1000 and C ranging from 1 to 5 for all ten sites, and the difference in RMSE was less than approximately 0.015 within this range. RSWFM with a small value of K (K ≤ 100) generated a larger RMSE than RSWFM with a larger value (K ≥ 200), indicating that RSWFM requires a sufficient number of augmented pure endmember spectra to ensure the accuracy of the RF regression. An extremely large value of K does not necessarily decrease the RMSE (such as RSWFM with C = 5 at site 1) but does increase the complexity and running time of the RF regression model. For instance, RSWFM with K = 1000 decreased RMSE by only 0.001 compared with RSWFM with K = 500, while the running time doubled.
For parameter C, which controls the magnitude of the Gaussian noise in the synthetic pure endmember spectra, neither a very large value (C = 50) nor a very small value (C = 0.2) generates a low RMSE. This is because a very large value of C corresponds to a small noise magnitude, so the synthetic pure endmember spectra do not represent the variability of the pure endmember spectra, whereas a very small value of C corresponds to a very large noise magnitude, which may overestimate the variance of the synthetic pure endmember spectra.
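The roles of K and C can be illustrated with a short sketch. Equations (3-4) are not reproduced in this section, so the noise model below is an assumption made only for illustration: each pure spectrum is replicated K times and perturbed with zero-mean Gaussian noise whose standard deviation is the band reflectance divided by C, so that a larger C yields smaller noise, as described above. The function name `augment_endmembers` is hypothetical.

```python
import numpy as np

def augment_endmembers(spectra, K, C, seed=0):
    """Return the original pure spectra plus K noisy copies of each.

    spectra: array of shape (n_endmembers, n_bands).
    Assumed noise model (not Equations (3-4) of the paper):
    sigma = |reflectance| / C, so a larger C means smaller noise.
    """
    spectra = np.asarray(spectra, dtype=float)
    if K == 0:
        # No synthetic spectra are generated, so C has no effect.
        return spectra.copy()
    rng = np.random.default_rng(seed)
    repeated = np.repeat(spectra, K, axis=0)      # (n_endmembers * K, n_bands)
    sigma = np.abs(repeated) / C                  # per-band noise scale
    synthetic = repeated + rng.normal(0.0, 1.0, size=repeated.shape) * sigma
    return np.vstack([spectra, synthetic])
```

With three endmember spectra and K = 100, this sketch returns 3 original and 300 synthetic spectra, matching the statement that the number of synthetic pure spectra is K times the number of endmember spectra.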
Based on the ten sites around the world, the optimal values of K and C are in the ranges of 200 to 1000 and 1 to 5, respectively. In this study, K = 500 and C = 5 usually generated the results with the smallest RMSE. For a new study site, it is suggested that the values of K and C be selected through a grid search over many trials.
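A grid search over K and C could be organized as in the following sketch. The candidate values are those assessed in this study; `score_fn` is a hypothetical user-supplied function that trains the model for a given (K, C) pair and returns a validation RMSE, and is not part of any published RSWFM code.

```python
from itertools import product

# Candidate values assessed in this study.
K_VALUES = (0, 50, 100, 200, 500, 800, 1000, 1200)
C_VALUES = (0.2, 0.5, 1, 5, 10, 20, 50)

def grid_search(score_fn, k_values=K_VALUES, c_values=C_VALUES):
    """Return (best_K, best_C, best_error) minimizing score_fn(K, C).

    score_fn is assumed to train the regression model with the given
    parameters and return an error such as RMSE on validation data.
    """
    best = None
    for k, c in product(k_values, c_values):
        err = score_fn(k, c)
        # Strict comparison keeps the first pair found in case of ties.
        if best is None or err < best[2]:
            best = (k, c, err)
    return best

# Toy error surface with its minimum at K = 500, C = 5 (illustration only).
toy_score = lambda k, c: abs(k - 500) / 1000 + abs(c - 5) / 50
best_k, best_c, best_err = grid_search(toy_score)
```

Because the running time grows with K, a practical search might stop increasing K once the RMSE gain becomes negligible.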

Per-SWB water area estimation
This section explores the potential of RSWFM for estimating the surface water area of each SWB. RSWFM with K = 500 and C = 5 is assessed. Water buffers were created for each SWB by expanding the water outline outward by 20 m (Halabisky et al. 2016; Sall et al. 2021). The surface water area of an SWB was calculated by summing the surface water fractions of the pixels within the 20 m buffer of the SWB in the RSWFM surface water fraction map. Figure 10 shows scatter plots between the reference and predicted surface water areas for SWBs smaller than 1 ha whose buffers did not intersect with those of other SWBs. Linear regression was used to fit the reference and predicted SWB water areas, and the R² of the fitted line was used to assess the degree of match between them. RSWFM generated an R² larger than 0.85 at all sites, showing good agreement between the RSWFM predictions and the reference. The highest agreement, with R² larger than 0.95, was found at sites 1, 2, and 6. R² values smaller than 0.90 were found at site 8 (R² = 0.8722) and site 10 (R² = 0.8591), where dense vegetation and phytoplankton have spectral features similar to those of the SWBs.
Figure 11 shows the MAPEs of the predicted SWB water area for SWBs of different sizes at the ten sites; a lower MAPE indicates a better match between the predicted and reference water areas for a target SWB. Unlike previous studies that mapped SWBs smaller than 5-30 ha using pixel-based classification (Bie et al. 2020; Perin et al. 2021), this study explored the potential of the sub-pixel RSWFM method for mapping SWBs smaller than 1 ha. In general, the water area estimation accuracy of the proposed RSWFM increased with SWB area, except for the area range of 0.5-1 ha at site 4, the area range of 0.3-0.4 ha at site 5, and the area range of 0.4-0.5 ha at site 10. This finding is consistent with previous studies, which found that the accuracy in mapping SWBs decreases as SWB area decreases (Perin et al. 2022). The MAPEs for RSWFM were larger than ~50% when the SWB was smaller than 0.1 ha at all ten sites, highlighting the need for further work on mapping these very small SWBs from 10 m Sentinel-2 imagery.
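The per-SWB area calculation can be sketched on a raster grid as follows. This is an approximation under stated assumptions: the 20 m buffer is emulated by growing a boolean SWB mask by two 10 m pixels with 4-neighbour steps, whereas the study expanded the water outline itself, and the function names are hypothetical. The summed fractions are converted to hectares using the 100 m² footprint of a 10 m pixel.

```python
import numpy as np

def dilate(mask, n_pixels):
    """Grow a boolean mask by n_pixels using 4-neighbour steps (NumPy only)."""
    out = np.asarray(mask, dtype=bool).copy()
    for _ in range(n_pixels):
        grown = out.copy()
        grown[1:, :] |= out[:-1, :]   # grow downward
        grown[:-1, :] |= out[1:, :]   # grow upward
        grown[:, 1:] |= out[:, :-1]   # grow rightward
        grown[:, :-1] |= out[:, 1:]   # grow leftward
        out = grown
    return out

def swb_area_ha(fraction_map, swb_mask, buffer_m=20, pixel_m=10):
    """Sum water fractions inside the buffered SWB and convert to hectares."""
    fraction_map = np.asarray(fraction_map, dtype=float)
    buffer_px = int(round(buffer_m / pixel_m))
    zone = dilate(swb_mask, buffer_px)
    pixel_area_ha = pixel_m * pixel_m / 10_000.0   # 10 m pixel = 0.01 ha
    return float(fraction_map[zone].sum() * pixel_area_ha)
```

For example, an SWB with nine pure water pixels and three boundary pixels with a fraction of 0.5 would yield 10.5 pixel equivalents, that is, 0.105 ha.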

Limitations and future research
Figure 10. Scatter plots of the predicted SWB areas estimated by the proposed RSWFM and reference SWB areas in ten sites. The 1:1 line is shown as the black dotted line. N represents the number of SWBs used for assessment in each site. The parameters K and C used in RSWFM are 500 and 5, respectively.
The regions where the proposed RSWFM overestimated and underestimated surface water fractions were analyzed. Since the aim of this study is to map water fractions instead of binary water maps, metrics such as false positives or true negatives were not adopted in the analysis (Ovakoglou et al. 2021; Pantazi et al. 2022). In this study, the water fraction error images in Figure 12 were generated by subtracting the reference surface water fractions from the RSWFM predictions. In the error maps, a positive value indicates overestimation, and a negative value indicates underestimation of the surface water fraction. Overestimations of the surface water fraction were mostly found in dense vegetation regions, such as those shown in Figure 12 (e2) and (e3), and in dark shadow regions, such as those shown in Figure 12 (e1) and (e3), because dense vegetation and dark shadow have spectral features similar to those of water. Underestimations of the surface water fraction were mostly found at the water-land boundaries of the SWBs.
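The error-map construction described in this section amounts to a signed per-pixel difference, as in the following sketch. The function names and the 0.05 tolerance used to summarize the map are illustrative assumptions and were not used in this study.

```python
import numpy as np

def fraction_error_map(predicted, reference):
    """Signed error: positive = overestimation, negative = underestimation."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference maps must have the same shape")
    return predicted - reference

def summarize_error(error, tol=0.05):
    """Share of pixels over- or underestimated by more than tol, plus mean bias."""
    error = np.asarray(error, dtype=float)
    return {
        "overestimated": float(np.mean(error > tol)),
        "underestimated": float(np.mean(error < -tol)),
        "mean_bias": float(np.mean(error)),
    }
```

A summary of this kind could complement the visual inspection of the error maps by indicating whether a site is dominated by overestimation or underestimation.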
Although the proposed RSWFM decreased the RMSE compared with the classical FCLS and MESMA, limitations still exist. The proposed method is a supervised method that requires prior endmember spectra, as do other supervised unmixing methods such as FCLS and MESMA. In this study, the endmembers for each site were selected from the corresponding Sentinel-2 image. SWBs generally vary in spectra across regions of the world, and many SWBs also vary in spectra across seasons. For instance, on-farm reservoirs store water in the wet season, are then used to irrigate crops, and become dry. It is thus necessary to collect representative endmember spectra for the study site to be analyzed and to avoid selecting water endmembers from dry ponds. In this study, image endmembers selected directly from the Sentinel-2 image with the help of very high resolution (VHR) Google Earth images were adopted. Image endmembers have the advantage of reducing the impact of imaging observation conditions, solar altitude, and vegetation phenology on unmixing studies (Halabisky et al. 2016; Okujeni et al. 2016; L. Li et al. 2019; Sall et al. 2021). As with supervised spectral unmixing models based on linear mixture models (Heinz 2001; Franke et al. 2009) and machine learning models (Okujeni et al. 2013, 2018), we highlight the use of representative image endmembers when using the proposed RSWFM to unmix Sentinel-2 images.
Another direction for future work is to combine publicly available online spectral libraries to construct a universal machine learning model that enhances the generalization of the proposed RSWFM. In addition, this study assessed RSWFM at ten sites with SWBs comprising fishponds, natural ponds, and small on-farm reservoirs in selected regions around the world. Although this approach covered a range of SWBs, a greater diversity of SWBs could be evaluated by working on a larger, even global, area. Furthermore, the proposed method was applied to 10 m Sentinel-2 images in this study, and it remained challenging to accurately map SWBs smaller than 0.1 ha. With the development of VHR images such as PlanetScope, it would be possible to explore the potential of the RSWFM method for mapping SWBs from VHR imagery.

Conclusion
This study proposes a novel regression-based surface water fraction mapping method based on a synthetic spectral library for SWBs from Sentinel-2 imagery, which improves on the traditional spectral unmixing algorithms in mapping sub-pixel surface water fractions of SWBs. In particular, the proposed RSWFM is based on state-of-the-art regression-based unmixing using a synthetic spectral library and improves several aspects of the classical methods. RSWFM considers the water endmember in the unmixing model, whereas most regression-based unmixing masks out water pixels. RSWFM increases the number of pure endmembers by adding synthetic Gaussian noise to the spectra of pure endmembers, which is effective in dealing with the limitations of a small training dataset in the machine learning method. RSWFM considers both linear and nonlinear mixture models, which can better deal with multiple scattering effects than the linear FCLS and MESMA models. RSWFM considers different spectral properties in water endmembers and better predicts surface water fractions than FCLS, which simply averages the water endmember spectra in the unmixing.
RSWFM was assessed at ten sites with hundreds or thousands of SWBs smaller than 1 ha. The experimental results showed that the proposed RSWFM generated high accuracy (RMSE < 0.16, MAE < 0.09) in the surface water fraction map. Additionally, the proposed method generated predicted SWB areas with an R² value of the fitted linear regression greater than 0.85. Considering its good applicability to SWBs, the proposed RSWFM is particularly valuable for surface water fraction mapping of SWBs across large areas at a medium spatial resolution.

Figure 1. Locations of the ten study sites. Each study site has an area of 100 km². The false color Sentinel-2 images are composited with NIR-red-green as RGB.

Figure 2. The validation images of the ten sites. (a1-a10) Google Earth RGB images used for validation, (b1-b10) surface water bodies digitized manually from Google Earth images. The blue color in (b1-b10) indicates surface water bodies.
six Sentinel-2 20 m bands, including the three vegetation red-edge bands, narrow NIR band, and two SWIR bands, were downscaled to 10 m based on the area-to-point regression kriging (ATPRK), which uses linear regression modeling and residual downscaling to sharpen the coarse spatial resolution imagery (Q. Wang, Shi, and Atkinson 2016). ATPRK could provide sufficient spatial geometric information in downscaling the 20 m Sentinel-2 imagery in comparison with other pan-sharpening approaches (Q. Wang et al. 2016; Q. Li et al. 2022). The key to ATPRK is selecting the appropriate 10 m Sentinel-2 pan-like band for downscaling each 20 m Sentinel-2 band. In ATPRK, the four 10 m bands were upscaled to 20 m. Then, the Pearson correlation coefficients between the 20 m band and each upscaled 10 m band were calculated, and the pan-like band used for pan-sharpening each 20 m Sentinel-2 band was chosen as the 10 m band with the highest Pearson correlation coefficient with that 20 m band (Hoge et al. 2003).

Figure 3. The number of water bodies in each study site.

Figure 4. Flowchart of the proposed RSWFM. A mixing ratio interval of 10% is used in this study.
different water endmembers, it overestimated the surface water fractions in the regions covered by dense vegetation, as highlighted by the black ellipse in Figure 8 (c3). Moreover, FCLS overestimated the surface water fraction in the shadow area, as highlighted by the black ellipse in Figure 8 (c4), because the dark shadows and water have similar spectral properties. In contrast, MESMA and RSWFM, which consider the intra-class spectral variability in water endmembers, better distinguished water from land and reduced the overestimation of the surface water fraction in vegetation areas. MESMA overestimated

Figure 9. RMSE values of surface water fraction from RSWFM using different values for the parameters K and C. Lighter color indicates smaller RMSE values.

Figure 11. Comparison of mean absolute percentage errors (MAPEs) for the estimated area of SWBs grouped into different SWB area ranges in ten sites. The SWBs used to compute the MAPE for each area range are those at the given site whose area falls within that range. The MAPE values generally increase as the SWB area decreases.

Figure 12. Zoomed-in regions for SWB examples in the surface water fraction error map. (a1-a3) Google Earth images, (b1-b3) Sentinel-2 images, (c1-c3) reference surface water fractions, (d1-d3) the proposed RSWFM with parameters K = 500 and C = 5, and (e1-e3) surface water fraction error maps, which were generated by subtracting the reference surface water fractions from the RSWFM predictions.

Table 1. The acquisition dates of the Sentinel-2 and Google Earth images.

Table 2. The number of water, vegetation, impervious surface, and soil endmembers in the ten sites.

Table 4. Accuracy assessment results. The lowest values (indicating the most accurate) in each row are highlighted in bold.