Iterative spatial leave-one-out cross-validation and gap-filling based data augmentation for supervised learning applications in marine remote sensing

ABSTRACT In marine remote sensing, supervised learning can link variables measured in-situ near the ocean surface to variables that can be measured from space. However, the in-situ data used for training and validating such empirical satellite algorithms are often spatially auto-correlated and clustered, giving rise to various statistical challenges such as overfitting to spatial structures. Furthermore, co-located in-situ and satellite measurements are rare in the oceans because of the cost of data collection from research vessels and frequent cloud cover. We propose two methods to mitigate these challenges. The first method builds on spatial leave-one-out cross-validation (SLOOCV), an approach designed to provide sound error estimates when data are spatially auto-correlated by enforcing a minimum separation distance between training and test observations. However, estimating this distance may be impossible with sparse and spatially clustered data. We hence propose to iterate and integrate error estimates over a range of separation distances (iSLOOCV). To address the often-small size of labeled data sets based on marine in-situ data, we tested if increasing the number of observations for algorithm training by means of cloud-filling algorithms for marine satellite data improved predictions. The potential of these two methods is demonstrated by developing empirical algorithms for mapping the proportions of seven diagnostic pigments (DPs) that serve as proxies for phytoplankton community composition in the northern Gulf of Mexico. We estimated the prediction accuracy of 13 algorithms with iSLOOCV, using various sets of satellite data products as input, and found adequate algorithms for 4 of the 7 DPs. Random forests combining ocean color and environmental variables as input had the lowest prediction errors overall. Correlations between predictions and observations estimated by iSLOOCV ranged from 0.69 to 0.85 and mean absolute errors from 0.02 to 0.13. Daily maps and longer-term composites of these DPs were broadly consistent with previously published results. Overall, errors increased when extrapolating over larger distances, highlighting how iSLOOCV can illuminate changes in algorithm performance based on sub-regional data coverage. Generating larger training sets by prior gap-filling substantially improved all error measures for 3 of the 7 DPs, with mixed results for the others. Therefore, data augmentation by gap-filling of satellite data should not be used as a default approach but can be a useful tool when supervised learning applications are suspected to be limited by the size of the training set.


Introduction
Satellite remote sensing allows for mapping of physical and biological phenomena at high temporal resolution and broad spatial scales (Kerr and Ostrovsky 2003). In marine applications, supervised learning -from linear regression to deep neural networks -often serves to map variables measured in-situ based on variables that can be measured from space. Approaches used for this purpose include linear regression, generalized additive models, random forests, and neural networks (e.g. Chen et al. 2019;Doerffer and Schiller 2007;Hieronymi, Müller, and Doerffer 2017;Hu et al. 2018;Keiner and Brown 1999;Liu et al. 2021;O'Reilly et al. 1998;Stock 2015;Xi et al. 2020). However, the in-situ data used for training and validating such empirical satellite algorithms are rarely randomly distributed in space and time, creating statistical pitfalls for supervised learning (Stock 2022). For example, marine labeled data are often spatially autocorrelated and clustered, for example, along ship tracks and near phenomena of interest like river plumes. Consequently, the data may not be independent, a core assumption of standard approaches for the training and validation of supervised learning algorithms. Ignoring dependence structures when validating statistical models can lead to an underestimation of prediction errors and to the selection of too flexible models (Roberts et al. 2017). These problems are exacerbated in marine research because in-situ data sets available for the oceans tend to be small, because data collection from research vessels is time-consuming and expensive, and because cloud cover often prevents matching in-situ measurements with co-located satellite observations. Beyond the statistical limitations arising from small data sets, frequent cloud cover may leave whole sub-regions of a study area and rare conditions underrepresented in the data, posing a major barrier to the training and validation of supervisedlearning based models (Stock 2022).
The objective of this study is to address such challenges of small, spatially clustered and autocorrelated labeled data sets. For this purpose, we propose two new computational approaches focused on error estimation, model selection (the choice between different statistical models to minimize prediction errors), and data augmentation (increasing the amount of available data) by gap-filling. The first approach builds on spatial leave-one-out cross-validation (SLOOCV), a method for error estimation and model selection when data are spatially auto-correlated (Le Rest et al. 2014; Le Rest, Pinaud, and Bretagnolle 2013). However, the standard SLOOCV algorithm relies on the calculation of residual variograms, which can be misleading when data are spatially clustered (Bel et al. 2009) or when flexible machine learning models are used (Roberts et al. 2017). We therefore adjusted the standard SLOOCV algorithm to avoid the calculation of residual variograms by iterating over a range of distances (iterative, or iSLOOCV).
The second approach mitigates the typically small size and limited spatial coverage of marine labeled data sets. Because satellite data often have large gaps caused by clouds, the size of the available data set can be increased by means of gap-filling algorithms (e.g. Alvera-Azcárate et al. 2007; Barth et al. 2020;Hilborn and Costa 2018;Liu et al. 2019;Saulquin, Gohin, and Fanton D' Andon 2018;). On the one hand, prior gap-filling could improve predictive models by creating many additional matchups for training. On the other hand, reconstructing pixel values where no satellite observations exist can introduce additional errors compared to direct satellite measurements. For example, phytoplankton communities inside and outside of mesoscale eddies can differ (Soja-Woźniak et al. 2020), and such differences can be obscured beyond reconstruction by high cloud cover lasting several days. It is not clear a priori if the advantages of a larger data set for model training would outweigh additional errors introduced by gap-filling. We hence tested if including additional in-situ observations matched with reconstructed satellite data in the training of empirical algorithms can improve their prediction accuracy.
This study demonstrates the potential of these new methods by mapping phytoplankton diagnostic pigments (DPs) that serve as biomarkers for different phytoplankton types in the northern Gulf of Mexico (NGOM). Chlorophyll a concentration, a proxy for phytoplankton biomass, is a widely available standard satellite data product (McClain 2009). However, phytoplankton primary production and carbon fixation -and hence, their biogeochemical and ecological functions -depend also on community composition (Chakraborty, Lohrenz, and Gundersen 2017;Quere et al. 2005). Researchers have already proposed many algorithms for satellite mapping of different aspects of phytoplankton community composition (IOCCG 2014;Mouw et al. 2017), including several algorithms for mapping DPs (e.g. Bracher et al. 2015b;Moisan et al. 2017;Pan et al. 2010). Despite these advances, the development and accuracy assessment of satellite algorithms for mapping DPs remain challenging , especially in coastal regionswhere monitoring would be especially important, because human uses and pressures are concentrated at the coasts (Stock et al. 2018b). Phytoplankton community composition is correlated with environmental variables such as light availability and SST (Mouw, Ciochetto, and Yoder 2019). We thus combined SLOOCV with an ecological satellite-mapping approach (Raitsos et al. 2008), i.e. we mapped the DPs based on satellitebased spectral and environmental variables. We trained and validated various statistical models, including both widely used approaches such as artificial neural networks and models that are theoretically suitable but have been less frequently used in ocean color remote sensing, like boosted regression trees. Given a lack of DP algorithms optimized for the NGOM, we used the bestperforming algorithms identified here to generate daily maps, 8-day, monthly and annual composites, as well as seasonal climatologies. These data are available for download in GEOTIF format (see Section "Data availability").

Study area
The NGOM is a region facing substantial environmental changes and risks. Over the 21 st century, the NGOM's physical climate will warm considerably (Biasutti et al. 2012). Offshore oil extraction poses risks to the Gulf's marine biota and ecosystem services (Beyer et al. 2016;Ozhan, Parsons, and Bargu 2014). Riverine inputs of nutrients and stratification lead to seasonal hypoxic conditions in a large "dead zone," with substantial reduction of opportunities for demersal fishing (Rabalais, Turner, and Wiseman 2002). From a remote sensing perspective, the NGOM covers a wide range of biogeochemical and bio-optical conditions, from eutrophic coastal waters to oligotrophic offshore waters (Martínez-López and Zavala-Hidalgo 2009;Müller-Karger et al. 1991;Xue et al. 2013. Because of these characteristics, and the resulting high spatiotemporal variability of phytoplankton dynamics and optical water properties, standard ocean color algorithms for the global oceans can have considerable absolute errors when applied in the NGOM (e.g.; Nababan et al. 2011).

In-situ data
We combined HPLC (high-performance liquid chromatography) data for 2003-2018 from two sources: Kramer and Siegel (2019) and SeaBASS (Werdell et al. 2003;Werdell and Bailey 2002). We extracted concentrations of seven diagnostic pigments (DPs) as response variables: 19′-butanoyloxyfucoxanthin (But.fuco), 19′-hexanoyloxyfucoxanthin (Hex.fuco), alloxanthin (Allo), fucoxanthin (Fuco), peridinin (Perid), zeaxanthin (Zea) and total chlorophyll b (Chl.b). These seven DPs are widely used to characterize phytoplankton community composition (Mouw et al. 2017;Uitz et al. 2006;Vidussi et al. 2001). We removed observations made within 10 km of land according to GSHHS full-resolution shorelines (Wessel and Smith 1996) to mitigate potential effects of stray light and extreme nearshore conditions such as ephemeral turbidity due to resuspension of sediment. We also removed observations made at depths greater than 10 m. If there were multiple observations within the first 10 m of the water column for a location and time, we only retained the observation closest to the surface.
While many empirical satellite algorithms for mapping DPs predict absolute concentrations (e.g. Bracher et al. 2015b), we were primarily interested in predicting phytoplankton community composition. Absolute DP concentrations, however, reflect both phytoplankton biomass and community composition. Many studies interested primarily in community composition predict phytoplankton size classes or functional types based on weighted relative concentrations (i.e. percentage made up by each of the DPs; Hirata et al. 2011;Mouw et al. 2017). However, this conversion benefits from locally derived weights and involves major uncertainties (Chase et al. 2020). No local weights were available for the Gulf of Mexico, and we could not find published evidence that global weights (e.g. Uitz et al. 2006) would be adequate in this region. Thus, relative concentrations of the DPs were used as response variables serving as proxies for community composition. The relative concentrations were calculated by dividing each pigment's absolute concentration by the sum of all seven pigments' concentrations, S DP (Vidussi et al. 2001): where c X is the HPLC-measured concentration of the pigment indicated by the subscript and r X is the corresponding relative concentration that we predicted as indicators of community composition. The linear correlation between in-situ Chl a and S DP was 0.97, indicating a high consistency of the various pigment measurements. Histograms of in-situ relative concentrations for locations with matching satellite data are shown in Fig. S1.

Satellite data
Predictors were daily satellite data products for 2003-2018 from various sources (  Maritorena et al. 2010). GlobColour merges ocean color data from several sensors (SeaWiFS, MERIS, MODIS-Aqua, VIIRS, and OLCI-A), allowing for a larger number of matchups for algorithm training and testing. To ensure the best spatial coverage, we only included GlobColour data products that merged data from all available sensors for the given time. The spatial resolution was 4 km and all data from other sources were resampled to the same grid as the GlobColour data. We acknowledge that other multi-sensor data sets, such as OC-CCI (Ocean Color Climate Change Initiative) data, have also been successfully used for mapping phytoplankton pigments (Gittings et al. 2019;Sun et al. 2019). Based on the GlobColour remote sensing reflectances (RRS), we calculated band ratios as additional predictors; for example, these variables allow the distinction of optical water types ). Finally, given its well-established statistical relationship to Chl a concentration, we included the maximum blue-to-green band ratio R as a predictor: Finally, we included day-of-year as a predictor to account for any seasonal patterns in the data. While it has been recommended to use a 3 h temporal window and a spatial window consisting of few pixels at the sensor's native resolution for matching satellite data with in-situ data (Bailey and Werdell 2006), the trade-off between a tight spatiotemporal match and the number of match-ups has led past research mapping different aspects of phytoplankton community composition from space to relax these criteria (e.g. Bracher et al. 2015b;Raitsos et al. 2008). We followed these examples and matched in-situ with satellite observations using a same-calendar-day temporal window. Spatially, we used a 4-pixel window at 4 km resolution, and bilinearly interpolated the extracted value because there are strong land-sea gradients in coastal parts of our study area . With these criteria, we obtained 130 matchups of the satellite data with insitu measurements of the DPs (Figure 1). The matchups covered diverse biooptical and environmental conditions (Fig. S2). Some algorithms used all predictors, and some used only a subset (e.g. only RRS). To address co-linear predictors (Fig. S3), variable selection was integrated in the algorithms using all predictors, e.g. by regularization, by using principal components calculated from the original predictors (following Bracher et al. 2015b;Xi et al. 2020), or by bagging and boosting based algorithms that are insensitive to high dimensionality and colinear predictors (random forests and boosted regression trees; Belgiu and Drăgu 2016;Dormann et al. 2013).

Supervised learning algorithms
For each of the 7 DPs, we compared how accurately 13 empirical algorithms predicted relative concentrations for previously unseen data that were spatially separated from all training data (Table 2). Pan et al. (2013Pan et al. ( ), (2010 proposed that the broadscale spatial distribution of DPs can be approximated as a cubic polynomial function of remote sensing reflectance band ratios (algorithm PAN). Following this example, we fit (least squares) a function of the form where r is a band ratio and Y DP is the relative concentration of each pigment. Using the closest bands available in the GlobColour data, we tested both For each pigment, we chose the ratio that led to the best least-squares fit of the polynomial. For predicting zeaxanthin Pan et al. (2013), (2010) modified r based on Figure 1. Direct matchups of in-situ pigment and satellite data (n = 130), and reconstructed match-ups (additional in-situ observations with satellite data reconstructed by a gap-filling algorithm; n = 219; see Section 2.7). Table 2. Overview of empirical algorithms tested in this study. "All" predictors means that all variables listed in Table 1 were provided as input and collinearity was addressed by dimensionality reduction methods (e.g. regularization or using principal components as predictors instead of the original variables). SST. However, on our data, incorporating SST -while optimizing model fit -led to outlier predictions that resulted in large mean error estimates, and we hence did not include SST in our model. Furthermore, it is important to note that in contrast to the original algorithm, we predicted relative concentrations of the pigments, not absolute concentrations. Consequently, the algorithm tested here follows the idea but is not an exact reproduction of Pan et al.'s methods. Many sophisticated neural-network based ocean color algorithms have been developed in recent years for different purposes (e.g. Hieronymi, Müller, and Doerffer 2017;Pahlevan et al. 2020;Ruescas et al. 2018), but given the relatively small size of our data set, we used simpler model structures. Following Keiner and Brown (1999) and González Vilas, Spyrakos, and Torres Palenzuela (2011), we trained ANNs with 5 (ANN5), 10 (ANN10) and 20 (ANN20) nodes in a single hidden layer, as well as 4 nodes each in 2 hidden layers (ANN4 + 4). All ANNs were trained by means of stochastic gradient descent using the R package ANN2 (Lammers 2020), with L2 regularization and hyperbolic-tangent activation functions. We tested different multipliers for the L2 penalty. Because ANNs are sensitive to initial, randomly chosen parameters, we repeated training each ANN five times for each penalty multiplier, while withholding 20% of the data for error estimation. We chose the ANN that achieved the smallest mean squared error on the withheld data.
Random forests (algorithm RF; Breiman 2001) are an increasingly popular model choice in remote sensing applications of supervised learning (Belgiu and Drăgu 2016). We generated random forests with 300 trees and various proportions of predictors considered at each split with the R package randomForest (Liaw and Wiener 2002). The proportion of predictors considered was selected based on the out-of-bag error. We trained boosted regression trees with the R package gbm (Greenwell, Boehmke, and Cunningham 2019) and a learning rate of 0.001, a bag fraction of 75%, and interaction depths from 1 to 3 (models BRT1, BRT2, BRT3). For each tested interaction depth, we chose the optimal number of trees using the algorithm of Elith, Leathwick, and Hastie (2008).
Finally, given the relatively large number of potential predictors and correlations between some predictors (e.g. RRS in neighboring bands), we tested three models that used principal component (PC) scores of the predictors as input. We first used a linear multiple regression for this purpose (PCRLIN), following the methods described by Bracher et al. (2015a) and Xi et al. (2020): PCs of remote sensing reflectances were calculated from the matchups, and the number of PCs chosen to include in the models based on the Akaike Information Criterion. We also tested principal components of all original predictors instead of only RRS (PCRALL), and random forests instead of a linear model (PCRRF).

Iterative Spatial Leave-One-Out Cross-Validation (iSLOOCV)
Cross-validation (CV) estimates a statistical model's prediction error by repeatedly splitting the data into folds. Each fold serves as a test set for an algorithm trained on all other folds. The resulting error estimates are then averaged. This approach reduces the reliance of the error estimate on the sample drawn for testing, yielding more reliable estimates (Lyons et al. 2018). In marine remote sensing, a split into training and test sets is most often made randomly (Bracher et al. 2015a;Hirata et al. 2011;Raitsos et al. 2008;Xi et al. 2020), which assumes that observations are independent (Arlot and Celisse 2010). However, marine labeled data (including the data used in this study) are often spatially autocorrelated and clustered in space and time ( Figure 2). Therefore, observations that are randomly selected for validation may not be independent of the observations in the training set, which in turn can lead to an underestimation of prediction errors (Pohjankukka et al. 2017;Stock 2022). This problem can be overcome by ensuring that training and test sets are sufficiently separated in geographic space, in time, or in predictor space, depending on the data and application. There are two main cross-validation strategies for such situations. Spatial block CV splits the data into folds based on geographical blocks (Roberts et al. 2017;Stock et al. 2018a;Valavi et al. 2019). However, the choice of block size and shape can be challenging. In contrast, spatial leave-one-out CV (SLOOCV; Le Rest et al. 2014;Le Rest, Pinaud, and Bretagnolle 2013) modifies leaveone-out cross-validation, where each observation is held-out as a test set once, while the training set in each step consists of all other observations. SLOOCV adjusts this strategy to avoid the effects of spatial autocorrelation by excluding all observations located within a circle around each test observation from model training (Figure 2). Error estimates are hence based on models that have been trained using only data that are at least the circle's radius, r, away from the test observation. Le Rest et al. (2014) suggest using the distance at which the residuals of a model fitted to the full data set become independent. They tested this recommendation on a relatively large data set that was randomly distributed in space. However, residuals can be misleading if the model is over-fitting to the spatial structure of the data (Roberts et al. 2017), and our smaller data set with highly uneven spatial coverage did not allow the reliable estimation of variograms and the de-correlation range.
We hence used an iterative version of SLOOCV (in the following, iSLOOCV) for the validation of empirical satellite algorithms for the oceans. Instead of choosing a fixed distance r, we iterated over values from 0.1 km to 200 km. At r = 0.1 km, the test observation itself as well and close-by matchups like measurements made in the same location at different days are excluded from algorithm training. At r = 200 km, a large sub-region around the test observation is excluded from training. The following pseudo-code summarizes the iSLOOCV algorithm for a given statistical model: (

Error measures
We calculated the following error measures (step 1.b in the pseudo-code above), for simplicity omitting subscripts for the radius r: • All error estimates for DP algorithms reported in this study were calculated by means of iSLOOCV.

Data augmentation
To increase the number of matchups, we filled data gaps separately for all predictors using three algorithms: Linear temporal interpolation (LTI), datainterpolating empirical orthogonal functions (DINEOF), and spatiotemporal Kriging (STKR; Fig. S4).  found these algorithms to produce solid reconstructions of pixels obscured by clouds in 3-day composites of Chl a for the Gulf of Mexico; Figure 2. Spatial leave-one-out cross-validation. The iterative version proposed in this study explores how error estimates change as a function of the circle's radius.
furthermore, these algorithms interpolate in time, which is important given the higher temporal resolution of the data in this study. The LTI algorithm simply interpolated between the closest prior and following observation for each pixel. DINEOF (Alvera-Azcárate et al. 2005; Beckers and Rixen 2003) was run using the software provided by GHER (2016). To achieve acceptable computation times, we split the 16 years of satellite data into four non-overlapping, 4-year subsets. STKR was implemented using the R packages spacetime (Pebesma 2012) and gstat (Gräler, Pebesma, and Heuvelink 2016;Pebesma 2004). We constructed empirical variograms for a sample of over 50,000 pixels from satellite data distributed evenly over the year 2017 (Fig. S4), fitted theoretical variogram models (manually fine-tuning the fitting process), and interpolated pixels with missing data using the 300 closest data points. The GlobColour product contained two Chl a products (see Table 1): CHL (based on empirical band ratio algorithms) and CHL_GSM (based on the semi-analytical Garver-Siegel -Maritorena algorithm; Maritorena, Siegel, and Peterson 2002). Following a comparison of the reconstructed CHL and CHL_GSM values to in-situ observations of Chl a (Table S1), we chose spatiotemporal Kriging to fill data gaps in our satellite data, yielding 219 additional, reconstructed matchups ( Figure 1). Reconstructed matchups were optionally used for algorithm training, but not for validation.

Selected models and mapping
Among all tested empirical algorithms, we selected one algorithm for each response based on the iSLOOCV results. We considered both the average errors over threshold distances r from 0.1 km to 200 km, and plots of distance-specific errors e(r) versus the radius r. However, to ensure an acceptable accuracy of predictions, we only selected final models for which the linear correlation between predicted and observed values in the iSLOOCV was ≥0.6, and for which there was negligible bias (ME and MDE close to zero). As the required accuracy for data products depends on the specific application (Agumya and Hunter 2002), these criteria are intended as a minimum requirement because data products with larger errors are unlikely to be useful for further applications. Among algorithms fulfilling these criteria, we qualitatively considered all error measures, as well as how the estimated errors changed with increasing SLOOCV radius. Once we selected one algorithm for each response variable for which the criteria above could be met, we trained the algorithm on the full data set. We then used it to create daily maps for the period 2003-2018. Finally, we averaged the daily maps into 8-day, monthly, and annual composites, and monthly and seasonal climatologies.

Error estimation
For most algorithms and response variables, prediction errors increased and the correlation between predicted and observed values decreased with higher threshold distance r in the iSLOOCV (Figures 3, S6). Furthermore, as r increases, in-situ locations are oneby-one removed from the training set in order of their distance to the test observation, resulting in minor changes of the training set. Hence, fluctuating lines in Figure 3 (as exhibited by ANN10) indicate that the trained models were sensitive to particularities of the sample or random aspects of model training.
We identified at least one and often several algorithms fulfilling the quality criteria (see Section 2.8) for four of the seven DPs: But.fuco, Hex.fuco, Fuco and Zea. For these four DPs, the best achieved linear correlations ranged from 0.85 (Fuco) to 0.73 (Zea). MDPDs ranged from 26% (Fuco) to 82% (But.fuco). In most cases, the best error statistics were achieved by random forests or boosted regression trees; the polynomial algorithm by Pan et al. (2013Pan et al. ( ), (2010 worked best for Hex.fuco according to all error measures. For the other three DPs, none of the algorithms achieved an adequate correlation between predicted and observed values. We hence did not select final models and present no maps for these three DPs. A complete list of all tested algorithms' crossvalidated errors is provided in the Supplementary Materials (Tables S2-S8).
Among the algorithms using principal components as predictors, using a random forest (PCRRF) instead of multiple linear regression improved predictions. The first principal component (PC1, 43% of variance) represented variables related to water transparency (Chl, ZEU, Kd490, etc.), PC2 (22%) was related to RRS (dominated by 443, 490). PC3 (12%) and PC4 (7%) were related to environmental variables -PC3 dominated by wind and PC4 by SLA. PC1 was the most important predictor in PCRRF algorithms across pigments.

Data augmentation by gap-filling
Increasing the number of observations for algorithm training by including data reconstructed by a gap-filling algorithm had mixed positive and negative effects, depending on the response variable and the algorithm (Table 4, Tables S2-S8). Including reconstructed observations in algorithm training improved all error measures for Zea, Allo and Chl.b. It also improved the correlation for Perid substantially, while resulting in small increases of other error measures for this DP. However, of the four pigments for which training with reconstructed matchups improved predictions, only algorithms for Zea met the basic quality criteria for justifying further applications and analyses (see Section 2.8). Gap-filling was therefore used in the training of the final algorithms for Zea, but not for But.fuco, Hex.fuco and Fuco.

Model selection and predictions
Several algorithms with similar error statistics existed for each DP, and no algorithm worked best for all DPs. We chose to use random forests for the four DPs  Tables 3-6 and S2-S8. Table 3. Best achieved error statistics averaged over 0.1 km -200 km distance thresholds in the spatial leave-one-out crossvalidation by the tested empirical algorithms, and the model which achieved the best value for each response variable and error measure. A "+" behind the model abbreviation indicates that the best error was achieved when using gap-filled data for training, in addition to direct matchups.   Table 3.
where the basic quality criteria (see Section 2.8) were met, because random forests worked well across all DPs and error measures. These models either achieved the lowest cross-validated errors or came close to the best ones for But.fuco, Fuco, and Zea (Tables 5, 6, S2-S8). Figure 4 shows example 8-day composites generated with these algorithms. For Hex.fuco, the polynomial regression following Pan et al. (2010), (2013) had better error statistics overall, yet to maintain consistency between algorithms, random forests were used as final models to create daily maps of all four DPs. However, we provide Hex.fuco maps generated with the PAN algorithm for download in addition to the maps generated with random forests (see "Data availability").

Spatial and summer-winter patterns of diagnostic pigments
In summer, Fuco made up the largest proportion in nearshore waters up to 20 km from the coastline, with some geographic variation (Figures 5, 6). Zea also made a notable contribution, constituting over 30% of the DPs in nearshore waters, and was the dominant pigment further offshore. Hex.fuco accounted for only a small fraction of the total pigment concentration in nearshore waters, but for about 25% offshore. But. fuco made up only few percent of the diagnostic pigment throughout the study area, increasing Table 5. iSLOOCV errors of final models (all random forests) used to generate DP maps for further analyses. Abbreviations as in Table 3, and ME: mean error; MDE: median error.  Table 6. Percent difference between random forests' error statistics and best statistics achieved by any model. For example, if we had chosen the final But.fuco algorithm based on RMSE alone (i.e. ignoring all other error statistics), the correlation would have been 6% higher, and the MAE 5% lower (but other error statistics would have been worse, as the random forest was the best model according to these). Abbreviations as in Table 3.   slightly when moving offshore. In winter, Fuco was the dominant pigment within the first 80 km from the coastline, again with some geographic variation. Hex. fuco and Zea were the dominant pigments further offshore, with (when averaged throughout the study area) similar proportions that increased offshore. The proportion of But.fuco was on average higher than in summer, but still low. Spring and autumn climatologies showed primarily transitions between winter and summer conditions (Figs. S7, S8). It is important to keep in mind that the numbers presented are the proportion of the sum of seven DPs, whereas we present results for only four, because the available data did not support sufficiently accurate algorithms for the other three. Time series of daily predictions for selected locations reflected these seasonal patterns (Figure 7). The daily predictions had high short-term variability, likely resulting from the sometimes large but unbiased prediction errors (such that subsequent observations may have overestimation followed by underestimation or vice versa). In particular, the proportion of Fuco fluctuated strongly at the coastal stations over short time periods. These fluctuations were mostly countered by opposing fluctuations in Zea and Hex. fuco (Fig. S9). The time series also exhibited rare and short-lived peaks of fucoxanthin, with high concentrations typical of coastal locations and occurring across seasons. These peaks occurred at times in which high-chlorophyll waters reached far offshore (Fig. S10). While coastal water advection is more common, the predicted Fuco peaks coincided with the highest satellite measured Chl a concentrations in the study period for the two offshore locations (Fig.  S11). While less visible in the composite-based Figure 7, these peaks were in daily data accompanied in reductions of other DPs' proportions (Fig. S12). For example, from July 4 th to 7 th 2009, Fuco at GC600 increased by 0.3, while Zea decreased by 0.1, Hex. fuco decreased by 0.2, and But.fuco stayed almost the same (all numbers rounded).

Cross-validation for satellite mapping
The challenges of predictive modeling when data are not independent have long been discussed by statisticians, e.g. in the context of time series analysis (Arlot and Celisse 2010;Opsomer, Wang, and Yang 2001). Recently, the effects of non-independent data have received renewed interest in spatial statistics, especially because more complex machine learning methods are prone to over-fitting to dependence structures (Gregr et al. 2019;Roberts et al. 2017;Stock et al. 2018b). Such overfitting cannot be detected by validation methods relying on randomly held-out observations, and could be especially problematic for ocean remote sensing, because marine in-situ data are often clustered along cruise tracks or phenomena of interest. For example,  found that error estimates from spatial block CV were considerably larger than error estimates from 5-fold CV and supported different conclusions about which algorithms were accurate enough for further applications. Yet the choice of statistical designs for algorithm validation is rarely justified in the biological ocean remote sensing literature. Recent discussions of and progress in algorithm validation have focused on the collection of highquality in-situ data following shared protocols, the mismatch of spatial scale between in-situ samples and satellite observations, and new sources of in-situ data such as Argo floats (Bracher et al. 2014;Brewin et al. 2016;Dierssen et al. 2020;Groom et al. 2019;IOCCG 2014;Ruddick et al. 2019;Wojtasiewicz et al. 2018). These are crucial aspects of satellite algorithm validation, yet the statistical consequences of the spatial distribution of labeled data for supervised learning require similar attention (Stock 2022).
This study demonstrated the use of spatial leaveone-out cross-validation (SLOOCV) for validating and selecting empirical satellite algorithms with data that were spatially clustered. Like spatial block crossvalidation, SLOOCV enforces a spatial separation of the data used for training and testing models, and avoids the often challenging definition of spatial blocks (Roberts et al. 2017;. However, the original SLOOCV method uses a single radius, the range of auto-correlation in the residuals, within which training data are omitted around each test observation (Le Rest et al. 2014;Le Rest, Pinaud, and Bretagnolle 2013). This range can be impossible to estimate on sparse and clustered data sets that are common in marine research based on insitu measurements. We therefore conducted SLOOCV iterating over a range of distances as opposed to a single distance. We proposed an iterative version (iSLOOCV) to explore how the estimated error changed as a function of distance and used the average error over the whole range to compare the algorithms. This approach bypasses the need to choose a single radius and allowed for the selection of algorithms that worked well both in locations wellcovered by training data and when making predictions for locations farther away. Overall, prediction errors increased with larger radius, but only moderately, suggesting that the algorithms were not overfitting to spatial structures. A possible explanation for the moderate increase of errors at larger radii is that spatially clustered in-situ measurements were sometimes made in different years or seasons. In dynamic marine systems, data collected in the same geographic location but at different times can nevertheless represent a large variety of environmental and bio-optical conditions, and thus be less correlated than suggested by their spatial proximity. Indeed, separating the data used for model training and validation in time or in predictor space can be preferable to spatial separation, depending on the characteristics of the study system and data and on the model's intended application (Roberts et al. 2017). Adaptation of iSLOOCV to separating training and test data based on spatiotemporal distances or in predictor space is a promising direction for future research.

Data augmentation by gap-filling
The availability of sufficiently large labeled data sets for supervised learning applications in remote sensing is a widespread problem and can be addressed from various angles such as semi-supervised learning (Liu et al. 2017). Here, we proposed a direct, simple approach to data augmentation that matched in-situ observations with reconstructed, gap-free satellite data. This approach increased the number of available matchups from 130 to 349. At the same time, reconstructed pixels can have larger errors than direct satellite retrievals or errors with a different distribution. This additional noise could counteract possible improvements of prediction accuracy gained from a larger training set. In this study, including matchups of in-situ measurements with reconstructed satellite imagery in algorithm training had mixed effects on prediction errors. It led to a considerable reduction of prediction errors according to all measures for three of the seven DPs, including one of the four DPs for which pre-set accuracy criteria for further analyses were met. Thus, while we cannot recommend increasing the number of matchups by means of gap-filling as a default procedure, the results show that in situations where researchers are concerned about the size of the labeled data set for algorithm training, increasing the number of matchups by means of gap-filling algorithms can be helpful.

Quality of selected algorithms and generated data products
While the focus of this study was on methodological aspects, we provide generated maps for download and report detailed error statistics to allow potential users of the algorithms and data to judge their fitness for potential applications in marine science. For example, our results show that fine-scale predictions (i.e. daily data for individual pixels) could have large errors, resulting in high short-term variability. However, all algorithms had negligible bias and regional-scale spatial distributions as well as seasonal patterns were adequately predicted. Hence, the generated data products should be primarily used for broad-scale and longer-term analyses, and potential users should carefully consider their fitness for the intended application. The final algorithms' errors were similar to those of other published ocean color algorithms for the NGOM. For example,  report relative errors of 40%-60%, and R 2 between 0.52 and 0.65 (i.e. linear correlations between 0.72 and 0.81) for a Chl a algorithm for the Louisiana shelf -an easier prediction task because of the smaller study area and well-established correlations between chlorophyll concentrations and ocean color variables. The four DP algorithms clearing the quality criteria from Section 2.8 achieved median absolute percentage differences between 26% and 82% and linear correlations between 0.72 and 0.85. The predictions exhibited negligible bias; therefore, averaging over multiple daily images or areas encompassing several pixels could further increase accuracy (as random errors cancel each other out). Several algorithms achieved similar error statistics for each of these DPs, suggesting that the achieved accuracy is close to what is possible with the currently available data. Rare and short-lived high offshore fucoxanthin concentrations predicted by the algorithms were associated with unusually high chlorophyll a concentration in these locations, reflecting oceanographic conditions normally associated with phytoplankton typical of coastal waters; however, lacking in-situ data for these situations, we cannot conclude that the spikes reflect real changes in the phytoplankton community. Overall, at broad spatial scales, our results were qualitatively consistent with the findings of previous field campaigns focusing on phytoplankton communities in the NGOM (Chakraborty, Lohrenz, and Gundersen 2017;Chakraborty and Lohrenz 2015;Lambert, Bianchia, and Santschi 1998;Qian et al. 2003). A detailed qualitative comparison is provided in Appendix A1. These results demonstrate the potential of iterative SLOOCV to select adequate supervised learning-based algorithms for satellite mapping applications with relatively small, spatially autocorrelated and unevenly distributed data sets.
Despite these broad-scale similarities between the predicted spatial distributions of diagnostic pigments and independent field campaigns investigating phytoplankton community composition, it is important to keep in mind that pigments are imperfect proxies for phytoplankton types. Yet overall, HPLC is among the most common and quality-controlled methods available. Four broad taxonomic groups of phytoplankton can be reliably distinguished based on their pigment signatures as described by HPLC data, and several of the individual pigments mapped here can serve as useful proxies for these groups: for example, Fuco for diatoms, Hex.fuco for haptophytes, and Zea for cyanobacteria (Kramer and Siegel 2019). Locally, more phytoplankton groups can be distinguished based on HPLC data (Kramer, Siegel, and Graff 2020;Kramer and Siegel 2019). However, distinguishing more detailed groups requires data on more pigments than those for which we found adequate algorithms. Together, these limitations of pigment-based phytoplankton community characterization and satellite retrievals of relevant pigments suggest that only broad phytoplankton classes can be distinguished from space by linking HPLC data with multi-spectral reflectances and environmental variables.

Summary and conclusions
(1) Spatial leave-one-out cross-validation that iterates over a range of distances separating training and test observations allows the validation and selection of algorithms based on small, spatially clustered data sets, without the need to choose a fixed separation distance a priori. It also provides insights into how errors change as the distance from locations with data increases, and into over-fitting as the training set shrinks with increasing separation distance.
(2) Gap-filling methods can be used to increase the number of matchups between satellite and insitu measurements. The benefits of more matchups for training supervised learning algorithms sometimes, but not always, outweigh additional errors introduced. Data augmentation by gapfilling is hence worth testing in applications where a small matchup data set is suspected to be the limiting factor for supervised learning.
(3) Regionally optimized supervised learning algorithms for remote sensing of diagnostic phytoplankton pigments achieved adequate accuracy for four out of seven diagnostic pigments, suggesting that some, but not all relevant, broad phytoplankton classes can be distinguished from space based on multi-spectral satellite and environmental data in the northern Gulf of Mexico.
The original data used in this study are publicly available for download from the sources provided in Table 1.

Disclosure statement
No potential conflict of interest was reported by the author(s).