Uniform upscaling techniques for eddy covariance FLUXes (UFLUX)

ABSTRACT Data-driven techniques that scale up eddy covariance (EC) fluxes from tower footprints with satellite observations and machine learning algorithms significantly advance our understanding of global carbon, water, and energy cycles. However, few upscaling approaches take a consistent approach to upscaling both carbon and energy fluxes. A lack of uniformity in the upscaling approach could lead to inconsistencies in global interannual variability of fluxes and between types of carbon and energy fluxes. Hence, this study aims to identify obstacles in flux upscaling and propose a uniform upscaling framework UFLUX for gross primary productivity (GPP), ecosystem respiration (Reco), net ecosystem exchange (NEE), sensible heat (H), and latent energy (LE). The key findings are as follows: 1) The upscaling performance exhibits a limited improvement from the use of more advanced machine learning approaches (e.g. <0.3 in R2 improvements while using deep neural networks). 2) The spatial density of EC towers is the primary factor determining the effectiveness of upscaling, explaining >50% of the upscaling uncertainty. 3) The UFLUX framework considered the interconnection between fluxes and achieved a competitive validation precision (daily R2 = 0.7 on average of five flux types) when compared with products that upscaled a subset of the fluxes. UFLUX effectively preserved the ecosystem light-use efficiency (0.83 of linear regression slope and the same after), Bowen ratio (0.8), and particularly, the water-use efficiency (0.81), when compared to the only other product (i.e. FLUXCOM) to upscale both carbon and water.


Introduction
The EC networks have contributed greatly to our understanding of global carbon, water, and energy cycles (Baldocchi 2020).EC provides direct and continuous measurements of ecosystem-atmosphere mass and energy fluxes and has been widely used for studying ecosystem responses to environmental forcings (Reichstein et al. 2014), for validating Earth-observing satellite-derived products (Baldocchi et al. 2001), and the development, calibration and evaluation of land surface models (LSMs) (Fisher and Koven 2020).Satellite and LSMs products are not devoid of errors (Slevin 2016;Wang et al. 2017) and these errors have presented difficulties in generating reliable climate projections (H.Wang et al. 2017).Hence, accurate global flux estimates upscaled from EC networks are necessary to independently validate and bolster our climate mitigation endeavours (Jung et al. 2017).
The EC upscaling studies have achieved important breakthroughs in global ecosystem carbon uptake estimation (Beer et al. 2010), but several challenges remain.The motivation to develop a consistent and uniform EC flux upscaling was underlined by the significant disparities observed in global estimates of carbon fluxes, particularly in their GPP interannual variability (Dong et al. 2022).The divergent trends displayed by these GPP products pose significant challenges in understanding global carbon budgets and addressing the climate change crisis (Dong et al. 2022).Given the similarity of driving data, it is likely that the differences are due to the technical implementation of upscaling algorithms.
The precise influence of these technical factors on the substantial variance observed in the interannual variability of global fluxes (Dong et al. 2022) remains undetermined.Employing basic averages of outcomes obtained through the utilization of diverse driver datasets, algorithms, and so on may result in a flux time series at the global scale that exhibits little sensitivity to interannual fluctuations.Moreover, it is important to recognize that the carbon cycling processes are intricately intertwined with those of water and energy (e.g.light-use and water-use efficiencies), given that photosynthesis is significantly influenced by atmospheric and soil moisture deficits (Fu et al. 2022).Upscaling approaches that focus on single fluxes are unlikely to preserve known trends between fluxes.
Most EC upscaling studies focused on GPP (Joiner and Yoshida 2020;Ueyama et al. 2013).Whilst EC typically measures H, LE and NEE, it can be further partitioned into GPP and Reco (Aubinet, Vesala, and Papale 2012).Using a uniform upscaling routine -e.g. the same machine-learning algorithms, satellite vegetation proxies, and environmental drivers -improves the comparability between upscaled fluxes, and this is particularly important for understanding key climate-ecosystem interactions, e.g.ecosystem lightuse efficiency and plant water stress (Ai et al. 2018;Jung et al. 2019).
Comparison between existing upscaling routines is difficult when each uses different machine learning algorithms, satellite proxies and environmental driver datasets (Joiner and Yoshida 2020;Jung et al. 2020;Zeng et al. 2020).The discrepancies between upscaled fluxes in the literature might pertain to the inconsistent machine learning algorithms and drivers (Dong et al. 2022).For example, satellite-derived solar-induced fluorescence (SIF) was extensively reported to be superior to vegetation indices in upscaling as it correlates with photosynthesis closely (Guanter et al. 2021;Liu et al. 2020;Sun et al. 2018).Recently, the near-infrared reflectance (NIRv) also showed advantages in upscaling (Badgley et al. 2019).Similarly, widely used machine-learning algorithms range from support vector machines (SVR) (Ichii et al. 2017;Ueyama et al. 2013;Yang et al. 2007), neural networks (NN) (Joiner and Yoshida 2020;Papale et al. 2015), to random forests (RF) (Tramontana et al. 2015;Zeng et al. 2020), but their effectiveness in upscaling remains undetermined.A key consideration is the extent to which the choice of predictors and machine learning algorithm will result in different spatiotemporal trends.For example, Zeng et al. (2020) reported an upward global GPP trend from 2000 to 2020, but global GPP time series in Jung et al. (2020) and Joiner and Yoshida (2020) were predominantly stationary.
It is also noteworthy that the placement of EC global towers lacks an overarching sampling design (Sulkava et al. 2011), and as a result, the spatial density of EC towers varies significantly by geographic region (Hill, Chocholek, and Clement 2017).Nearly 85% of EC towers are located within northern temperate ecosystems, with <10% in tropical ecosystems, this skewed distribution of tower locations presents a challenge to global upscaling efforts that have not been fully quantified (Schimel et al. 2015).It remains unclear to what extent the number and distribution of EC towers affect spatial and temporal trends in the upscaling products.
To account for the known links between water and energy fluxes, we undertook an investigation into the viability of upscaling carbon, water, and energy fluxes in a uniform and inter-comparable manner.We seek to address two unanswered questions: 1) Can EC fluxes be successfully upscaled in a consistent manner?2) What factors affect the effectiveness of flux upscaling?To address these questions, we introduce the UFLUX framework (https://sites.google.com/view/uflux),which leverages the same machine learning algorithm, satellite vegetation proxy, and environmental drivers to scale up GPP, Reco, NEE, H, and LE fluxes.The accuracy of UFLUX estimates was assessed on a global scale using EC towers from the FLUXNET2015 database (Pastorello et al. 2020).Additionally, we investigate the impact of machine learning algorithms, satellite vegetation proxy, environmental drivers, and the spatial density of EC towers on the performance of UFLUX.We also examined the performance of UFLUX in preserving the light-use efficiency, water-use efficiency, and Bowen ratio against other upscaling products.

UFLUX overview
The flux upscaling routine (Figure 1) makes use of a non-linear relation (f ) between fluxes (F) and predictors which is established based on the machine learning algorithm (Joiner and Yoshida 2020;Jung et al. 2020): Typically, the predictors in upscaling studies are satellite remote sensing (RS) vegetation proxy products and environmental driver data Env ð Þ (Joiner and Yoshida 2021;Jung et al. 2020;Zeng et al. 2020).In addition, auxiliary parameters (Aux:) like geolocation and vegetation classification are commonly also taken into consideration (Joiner and Yoshida 2020;Jung et al. 2020;Zeng et al. 2020).The carbon and water fluxes were interconnected by the water-use efficiency (Hatfield and Dold 2019) -i.e.applying the interquartile ranges of light-use and water-use efficiencies derived from EC towers across ecosystem types into the machine learning models as constraints.
Machine learning algorithms are commonly used for flux upscaling (Reichstein et al. 2019) because they can well fit non-linear relations, e.g.light-use efficiency (Monteith 1972) (supplementary Section A.1.GPP upscaling theory), respiration-temperature responses (Lloyd and Taylor 1994), and the evapotranspiration rates (Penman 1948) (supplementary Section A.2. Reco, H, and LE upscaling theories).The machine learning algorithms were constrained by light-use and water-use efficiencies, resulting in a framework that is not purely data-driven but rather aligns more closely with ecological sensibility.The relationships fitted during the 'training part' are extrapolated to the globe in the 'production part' on daily 0.25° resolutions (Figure 1).In the final 'validating part', estimates are corroborated using the leave-one-out cross-validation (Marchetti 2021) approach (Figure 1).

EC flux data
Daily EC data are selected from the 206 open-access FLUXNET2015 towers (Pastorello et al. 2020).We use NEE (NEE_VUT_REF), GPP (GPP_NT_VUT_REF), Reco (RECO_NT_VUT_REF) (Table S1) with data filtering criteria in line with Tramontana et al. (2016) and Joiner and Yoshida (2020).We use the daily values when 1) less than 33% of half-hourly data are gap-filled, 2) the NEE uncertainty is smaller than 3 g C m −2 d −1 , and 3) the difference between daytime-and night-time partitioned GPP is smaller than 3 g C m −2 d −1 .We also use H (H_F_MDS) and LE (LE_F_MDS) from the FLUXNET2015 database with data filter criteria referring to Jung et al. (2019) -we use the daily values with less than 33% of half-hourly data are gap-filled.
In accordance with the literature, the environmental drivers (supplementary Section B. Full data description) are obtained from a climate reanalysis database (Joiner and Yoshida 2020;Jung et al. 2020).In total eight environmental drivers are used from the Climate Forecast System Version 2 (CFSV2) of National Centers for Environmental Prediction (NECP) (Saha et al. 2014): temperature (2 m above ground, TA), downward shortwave radiation (at surface, SWIN), specific humidity (2 m above ground) for calculating vapour pressure deficit (VPD), soil moisture content (5 cm integral below surface, SWC), U-/V-wind components (10 m above ground) converted to wind speed (WS) and wind direction (WD), precipitation rate (at surface, P) and pressure (PA).All environmental drivers were resampled from 6-hourly to daily sum (P) or daily mean (TA, SWIN, VPD, SWC, U/V, WS, WD, and PA).
Considering the time coverage of different datasets -i.e.GOME-2 SIF data started in 2007, while the FLUXNET2015 database ended in 2014 -the time for training is from 2007 to 2014.To assess the predictive capacity of UFLUX in forecasting future fluxes, the global flux estimation was generated for a time frame extending from 2007 to 2018, surpassing 2014.In the context of time lag effects, we incorporated time lag considerations by integrating time stamps into the machine learning models.

Model training and global estimation
In the training part, a complete model training (Figure 2) refers to a case described in Table 1.In line with the literature, the upscaling is implemented separately within each plant functional type from the International Geosphere Biosphere Programme (IGBP) (Joiner and Yoshida 2020;Jung et al. 2020;Zeng et al. 2020) using the leave-one-out cross validation (LOOCV) (Marchetti 2021).Specifically, the model is trained and tested per ecosystem type, e.g.only data with the same plant function type are used for training and testing the model.For a PFT group (PFT i ) containing m sites, for example, one site (site j ) is treated as the test set, and the rest of m-1 sites construct the training set.This process is repeated for each tower, so the algorithm is trained and tested 206 times in total (supplementary Section A.3.Data splitting rationale).In this way, we train the model in 17 cases (Table 1), and each case includes the 206 complete model training runs.Models trained in cases 0 to 4 are used for upscaling carbon, water, and energy fluxes.Please note that the upscaling setup (e.g. from algorithm) of cases 0 to 4 is the default uniform setup of UFLUX.
In the production part, the trained upscaling models are applied to calculate fluxes across the globe from 2007 to 2018 on daily 0.25° resolutions.Unlike other flux types, Table 1.All the testing cases in the study to comprehensively assess the influences from technical aspects on the upscaling performance.'High uncertainty' in case 8 and 9 indicates EC towers where the testing upscaling performance is poor (R 2 < 0.3) in case 0. The different training setup between the base case and other cases is highlighted in orange.NEE has two distinct pathways for upscaling.The first approach involves separately upscaling GPP and Reco on a global scale and then calculating NEE (denoted as 'NEE_ind').On the other hand, the second approach directly upscales NEE to the globe using EC measurements (denoted as 'NEE_dir').In this study, both pathways are considered for a thorough assessment of the UFLUX framework (supplementary Section A. Upscaling theories).

Assessment measures
We examine the UFLUX estimates by comparing them to EC fluxes based on the coefficient of determination (R 2 ), linear regression slope, root mean squared error (RMSE), and mean bias error (MBE), and the uncertainty was measured using the interquartile range (IQR) of the MBE, following the methodology in our previous studies (Zhu et al. 2022;Zhu, McCalmont, et al. 2023).In addition to the technical testing, we also analyse the contribution of each predictor to the upscaling performance via the commonly used permutation feature importance (Altmann et al. 2010).It repetitively shuffles the column of each feature to corrupt the data and computes a reference score on the model fitted with the corrupted version of data.The score sum of all features equals one, and the importance of each feature can be evaluated thereby.Fluxes derived from UFLUX were compared with other EC upscaling products particularly to examine their ability to preserve certain key ecological parameters, e.g. the light-use efficiency, water-use efficiency, and Bowen ratio (Chapin, Matson, and Vitousek 2011).

Impacts of machine learning algorithms and predictors
To investigate the impacts of machine learning algorithms on the upscaling performance, we examine six machine learning algorithms (case 0 and case 5-9, Table 1) focusing on GPP: 1) Support vector regression (SVR) (Awad and Khanna 2015).2) Random forest regression (RFR) (Breiman 2001).3) Multi-layer perceptron neural networks (MLP) (Rumelhart, Hinton, and Williams 1986).4) EXtreme Gradient Boosting (Xgboost, XGB) (Chen and Guestrin 2016).5) Long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997).6) Stack of RFR, MLP, and XGB with a final regressor (Wolpert 1992).Hyperparameters of the machine learning algorithms were tuned automatically by exhaustively considering all parameter combinations via the grid search technique (Pedregosa et al. 2011).LSTM is a mainstream deep learning technique by effectively processing very long time series (Hochreiter and Schmidhuber 1997).Both LSTM and the stacked algorithm have the potential for improvements (Wolpert 1992), thereby they are only tested in areas where the upscaling accuracy tends to be less satisfactory, e.g. the tropics, and where the dominant plant functional type is evergreen broadleaf forest (Jung et al. 2020).The aim is to determine if the relatively poor upscaling performance in these areas can be addressed with an advanced and complicated algorithm architecture.
We also inspect how will the upscaling performance be affected by the combination of predictors (which are commonly referred to as features in machine learning studies) in case 0 versus case 10-14 (Table 1).By inter-comparing case 0 against case 15, we expect to investigate the possible difference using solar-induced fluorescence (SIF) and nearinfrared reflectance (NIRv).

Impacts of EC spatial sampling
The spatial sampling of EC towers (i.e. the spatial density of the measurements) could introduce uncertainty in flux upscaling.We analyse the impacts of EC spatial sampling via a 'similarity' test -i.e. the 'similarity' between training and test sets in case 16 (Table 1).The fluxes, particularly carbon fluxes, are driven by or closely associated with environmental variables, specifically climate conditions and the vegetative greenness measured by satellite proxies.The 'similarity' was calculated by the coefficient of determination of these environmental variables between the target geolocations and those sampled with EC towers.In case 16, for a test EC tower, there are n training towers with the same plant functional type.These n towers construct a training sample space.In the i th of n tests, one randomly selects i (i 2 0; n ½ �) towers from the sample space to create a subspace for training and testing the machine learning model.The upscaling performance and the highest 'similarity' are recorded for the following analysis.In addition, we calculate the 'similarity' between geolocations of satellite pixels and corresponding EC towers, enabling us to apply this analysis globally.

UFLUX validations
In this section, we will present the tower-level UFLUX upscaling performance for GPP, Reco, H, and LE fluxes as well as their global distribution.Furthermore, directly and indirectly upscaled NEE will be compared.Seasonal variability of all the five fluxes is also demonstrated.The R 2 and MBE distribution of all ecosystem classes concentrated on the upper left region of Figure 4, i.e. higher R 2 but smaller MBE absolute values.For PFTs with low R 2 , the MBE tends to be small too -e.g.GPP of evergreen broadleaf forests (EBF).The median R 2 and MBE were 0.72 and 0.02 g C m −2 d −1 for GPP, were 0.61 and 0.14 g C m −2 d −1 for Reco, were 0.68 and −1.49W m −2 for H and were 0.76 and 0.2 W m −2 for LE (Table 2).The R 2 value on average of the four fluxes was 0.7.Closed shrublands had the lowest performance among plant functional types for GPP, H, and LE, whilst evergreen broadleaf forests had the lowest performance for Reco (Figure 4).The upscaling performance seemed uncorrelated with flux strength, for example, closed shrublands (CSH) had medium GPP strength compared with other plant functional types.
The Northern Hemisphere, where 82% of the EC towers are located (Schimel et al. 2015), had the highest R 2 and the smallest MBE absolute valuesin particular, Europe and North America where EC towers were densely located.Africa and South America have fewer EC towers, lower R 2 and larger MBE absolute values than the Northern Hemisphere (Figure 4).
Overall, the UFLUX estimates displayed patterns that aligned with EC measurements, albeit with less intensity (Figure 5).This difference in flux intensity was particularly observed in evergreen broad forest (EBF) for Reco and closed shrublands (CSH) for H.For example, large Reco fluxes between 2008 and 2011 in EBF were barely seen in UFLUX estimates (Figure 5).
The scatters of EC NEE against UFLUX NEE are distributed close to the one-to-one line across ecosystem types (Figure 6).The direct and indirect estimates of NEE showed a high degree of similarity in their distribution, with an R 2 value close to 0.8.On the global scale, the annual difference between EC NEE and UFLUX NEE estimates was smaller 3 Mg C ha −1 .Please refer to Figure S1 and S2 for the interannual variability of UFLUX carbon fluxes including NEE and UFLUX NEE multi-year cumulative variations, respectively.

Seasonal variability of global estimates
For carbon fluxes, according to Figure 7, GPP and Reco contain expected seasonal and regional patterns.Higher annual GPP and Reco were seen in South America, Central Africa, and Southeast Asia and no obvious seasonal variations were observed in these regions.Northeast America and Eurasia were found with much stronger seasonality: higher values in the summer, while lower values in the winter.Regarding NEE, India, West Asia and Northeast America were the main carbon sources (i.e.positive NEE), while Central Africa, West America, Europe, and Southeast Asia were the main carbon sinks.South America and Central Africa were the main carbon sinks annually.Whilst North America and Eurasia turned from carbon sinks to sources in September and vice versa in May.
In terms of energy and water fluxes, Northwest America, South Africa, Middle East and Australia were seen with higher H. Whilst Northeast America, South America, Central Africa, and Southeast Asia were seen with higher LE.Like carbon fluxes, both H and LE were observed with expected patterns -i.e. higher values in the summer and were observed with lower values in the winter (Figure 7).Please see Figure S3 for the annual maps of UFLUX GPP, Reco, NEE, H, LE from 2001 to 2021.

Impacts of algorithms and predictors on the flux upscaling
Considering the effects of machine-learning algorithms, R 2 of XGB and RFR were consistently higher than SVR, and MLP across ecosystems and the interquartile range (IQR) of both R 2 and MBE were smaller for XGB and RFR than for SVR and MLP (Figure 8 and Table 3).The model running time of validating at FLUXNET2015 towers for XGB, RFR, SVR, and MLP were 15, 36,415 and 17 min.The performance of using XGB (median R 2 : 0.72 and MBE IQR: 1.27 g C m −2 d −1 ) was close to using RFR (median R 2 : 0.63 and MBE IQR: 1.4 g C m −2 d −1 ) for flux upscaling at a daily scale (Figure 8 and Table 3).In contrast, SVR (R 2 : 0.3 and MBE IQR: 2.72 g C m −2 d −1 ) and MLP (R 2 : 0.5 and MBE IQR: 4.2 g C m −2 d −1 ) showed relatively poor performance.
In addition, in regions (e.g.tropics) where the UFLUX accuracy was low (median R 2 : 0.1), the R 2 of using LSTM (median R 2 : 0.13) and stacked machine learning (median R 2 : 0.09) were also very low (Table 4).The feature importance -i.e. the score measures how useful the input features are at predicting fluxes -showed that SIF was the most important feature for GPP and LE upscaling (Figure 9(a)).Please see Figure S4 for feature importance results across plant functional types.The importance of SIF (59%) far exceeded EVI (10%), meteorology, and other auxiliary features.For Reco, VPD (19%), EVI (19%), and SIF (15%) contributed more than 50% of importance.It is noteworthy, that the contribution of air temperature (12%) was relatively less important for Reco upscaling.Solar radiation contributed 42% of importance in H upscaling, and the contribution of SIF and EVI was separately less than 6%.For LE upscaling, SIF was the most important features, with an importance of 45%.
Regarding the effects of feature combinations, R 2 and MBE in upscaling of using six different feature combinations showed seemingly contradictory findings compared with the feature importance analysis.Small changes were seen in R 2 and MBE for most ecosystem classes when SIF or SIF & EVI were dropped from the features (Figure 9   Interestingly, scenarios using only SIF and EVI as features achieved R 2 and MBE very close to scenario using all features in croplands and grasslands.In open shrublands and woody savannahs, R 2 of scenarios using SIF and EVI was much higher than scenario using all features.Scenarios using meteorology only achieved R 2 and MBE similar to scenario using all features in evergreen broadleaf forests, evergreen needleleaf forests, and wetlands.In deciduous broadleaf forests and mix forests, scenario using auxiliary features only achieved R 2 and MBE close to scenario using all features (Figure 9(b)).
The upscaling performance of using solar-induced fluorescence (SIF) and near-infrared reflectance (NIRv) was every close.In general, the R 2 of using SIF was marginally higher than that of using NIRv -e.g.-the median R 2 for GPP upscaling of using SIF (0.72) was 4% higher than using NIRv (0.69) (Table 5).

Impacts of EC spatial sampling on the flux upscaling
Overall, R 2 increased with more training towers while MBE decreased (Figure 10(a)).
Regarding the relationship between similarity analysis and UFLUX upscaling performance, the highest level of similarity, i.e. the highest R 2 in satellite vegetation proxy time series between the test EC tower and a training tower, showed a strong positive linear correlation (slope = 0.92, intercept = 0.07, correlation coefficient = 0.82, and p-value = 0) with the R 2 of flux upscaling.The standard deviation of the similarity also showed a positive linear correlation with the up-scaling R 2 (slope = 2.90, intercept = 0.17, correlation coefficient = 0.67 and p-value = 0).
Regarding the training tower number and similarity level, for test towers with highand/or medium-level similarity training towers, the R 2 IQR became narrower, while the number of training sites increased.The upscaling R 2 is kept around 0.85 or higher when the model was trained with at least one high similarity or five medium similarity sites.In contrast, even when the model was trained with more than 20 low-level similarity towers, the upscaling R 2 was still lower than 0.6.In the current FLUXNET2015 database, European and North American sites were seen with almost all the high and medium similarity sites, while sites in other regions were only seen with low similarity sites (Figure 10(b)).This distribution pattern was similar to the distribution of upscaling model performancehigher R 2 also clustered in Europe and North America.Please see Figure S5 for more information about the relationship between upscaling performance and the 'similarity' metric across biomes and EC towers.

Assessment on reproducing key ecological parameters
The UFLUX was compared alongside existing openly accessible EC upscaling products -i.e.FLUXCOM (Jung et al. 2020), FluxSat (Joiner and Yoshida 2021), and (Zeng et al. 2020) -in preserving the light-use efficiency.Considering the impacts of various spatial resolution across products, all the products were resampled into the same spatial resolution (0.5 • ) and then interpolated into the geolocations of EC towers.All the four products showed an R 2 of around 0.8 when compared with EC light-use efficiency.In terms of the linear regression slope, the FluxSat was 12% higher than UFLUX and Zeng of which the slope was around 0.83.In the meantime, the FLUXCOM was 24% smaller in slope (Figure 11).The efficacy of UFLUX was assessed in terms of preserving water-use efficiency in comparison to FLUXCOM (Jung et al. 2019), as FluxSat (Joiner and Yoshida 2021) solely provides GPP and (Zeng et al. 2020) exclusively supplies carbon-related fluxes (GPP, Reco, and NEE).UFLUX and FLUXCOM achieved equivalent R 2 of 0.75 when compared with EC water-use efficiency (Figure 12).The UFLUX had a higher linear regression slope of 0.83, and the slope for FLUXCOM was 0.58.A relatively small slope was particularly observed where the water-use efficiency was larger than 20 g C (Kg H 2 O) −1 .
To further evaluate the ability of upscaled fluxes in representing the type of heat transfer, the UFLUX was compared with FLUXCOM in reproducing the Bowen ratio (Jung et al. 2019).In general, the results of UFLUX and FLUXCOM were comparable.The R 2 for UFLUX exceeded that of FLUXCOM by a marginal 0.05, whereas the linear regression slope for FLUXCOM surpassed that of UFLUX by 0.04 (Figure 13).

The advantages and drawbacks of ULFUX
The UFLUX was suggested to hold great promise for providing uniform upscaling routines.Some previous studies that have examined the upscaling of carbon and water fluxes, often separately (Jung et al. 2019(Jung et al. , 2020)).However, this study represents a pioneering effort to upscale them together by interconnecting fluxes with key ecological parameters light-use and water-use fluxes.This holds significance, for example, the UFLUX preserved the ecosystem water-use efficiency and thereby can provide guidance on nature-based solutions (Zhu, Olde, et al. 2023).The UFLUX framework offers a means to  disentangle whether the uncertainties stem from upscaling methodologies or the inherent characteristics of EC systems.
The UFLUX can capture approximately 70% of the variability of carbon, water, and energy fluxes (i.e.averaged R 2 = 0.7, Table 2), while the uncertainty in terms of interquartile range (IQR) was relatively small (1.5 g C m −2 d −1 for GPP and Reco, 15 W m −2 for H and LE).The term 'variability' is derived from the study by Joiner and Yoshida (2020) and it pertains to the mean R 2 calculated across GPP, Reco, H, and LE fluxes -with NEE inferred as the difference between Reco and GPP.The UFLUX framework also exhibited great potential in predicting future fluxes (R 2 = 0.73, Table S6).Given its strong capability in capturing the non-linear relationship between fluxes and the environment, the UFLUX was expected to be beneficial for quantifying ecosystem light-and water-use efficiency, global carbon and water budgets, and further for climate goals (Smith et al. 2021).Directly and indirectly upscaled NEEs were both consistent with EC measurements, and the difference in the sum of nearly 10 years was smaller than 10 Mg C ha −1 across ecosystems (Figure 6(a)).The spatial distribution of the upscaled NEE was in a great agreement with the literature (Jiang et al. 2022;Keenan and Williams 2018;Zeng et al. 2020).
It is particularly crucial to exercise caution when examining the interannual variability of the net ecosystem carbon exchange derived from any global estimates.To provide an illustration, the NEE trend described in Zeng et al. (2020) was not apparent in our observations of UFLUX NEE (Figure S1).Please also see the comparison of UFLUX against EC and other upscaling products in Figure S6.This discrepancy does not imply that UFLUX was superior or vice versa, but rather highlights the inherent constraints of data-driven approaches.The method UFLUX used to interconnect carbon and water fluxes exhibited improvements, particularly in tropical latent energy (Figure S6e) and water-use efficiency (Figure 12).However, these improvements are far from being enough, the UFLUX still underestimated the towerlevel light-use and water-use efficiencies, as well as the Bowen ratio (Figures 11,12,and Figure 13).The UFLUX still struggled to capture the interannual variability of fluxes in tropics (Figure S6).The ability to accurately capture the flux interannual variability depends on the efficacy of machine learning algorithms, the proficiency of satellite vegetation proxies in monitoring ecosystem (carbon) dynamics, and the representativeness of EC towers in sampling global terrestrial ecosystems (Aubinet, Vesala, and Papale 2012).For example, major carbon source sinks (e.g.South America and India in Figure 6(b)) have very limited number of EC towers (Pastorello et al. 2020;Schimel et al. 2015) and this added extra uncertainty in flux estimation.In this case, it would be very important to know the limiting factors and potential improvements for the upscaling performance, while standardizing the upscaling routine in a consistent and comparable manner.
The quantification of uncertainty presents a challenge when dealing with black-box models, such as machine-learning UFLUX, particularly in light of the heterogeneous spatial scales inherent in the flux footprints derived from tower-based observations, which serve as the basis for training these machine learning models.A potential strategy for addressing this challenge in non-MODIS-based data products may involve the exploration of geographically weighted regression (GWR).Our future research will be focused on the upscaling or estimation of spatiotemporal variations in uncertainty within the UFLUX framework (Joiner and Yoshida 2020).

The limited impacts from algorithms and environmental drivers
The Xgboost algorithm is highly recommended for flux upscaling due to its superior performance compared to other machine learning algorithms commonly used for flux upscaling (Ichii et al. 2017;Joiner and Yoshida 2020).It achieved the highest R 2 values and the lowest level of uncertainty.Furthermore, it had a shorter running time and does not require extensive computational resources like graphics processing units (GPUs), which are typically used in deep learning techniques (Reichstein et al. 2019).This means that promoting the use of Xgboost would not incur significant financial costs, making it a more affordable option for countries with limited resources for flux quantification (Hill, Chocholek, and Clement 2017).Moreover, the reduced reliance on computational resources also minimizes the potential for additional carbon emissions, which have been observed in many deep learning applications (Strubell, Ganesh, and McCallum 2020).It is worth noting that the introduction of deep learning and stacked machine learning techniques did not yield substantial improvements in flux upscaling in regions like the tropics (Figure S7) where the upscaling performance was unsatisfactory and uncertainties in global carbon cycles were significant (Jung et al. 2020).This suggests that prioritizing technical advancements may not be the most crucial factor in accurately quantifying global fluxes.
The debate surrounding the effects of satellite vegetation proxies on upscaling performance continues.In agreement with the literature, it has been found that SIF provided the most significant information for flux upscaling (Guanter et al. 2012;Zhang et al. 2016).However, removing SIF or other satellite vegetation proxies has minimal impact on the upscaling performance (Figure 9(b)).Satellite vegetation proxies indicate the intensity of photosynthesis and/or the greenness of ecosystems, which are closely related to fluxes, particularly GPP intensity (Verrelst et al. 2015).Fluxes are influenced to a large extent by environmental factors such as solar radiation, air temperature, and vapour pressure deficit (Zhu et al. 2022).In addition, we explored the inclusion of carbon fluxes as a feature for model training to predict energy fluxes, and vice versa.However, our experimentation did not yield any substantial enhancements, as evidenced by a small change in R 2 , which remained below 0.1.Considering the remarkable capabilities of machine learning algorithms in capturing non-linear relationships, it is possible that the role of satellite products as flux proxies could be replaced by combinations of environmental drivers.However, further investigation is necessary to determine the extent to which satellite vegetation proxies can be replaced by the combinations of environmental drivers.This is due to the intricate nature of the relationship between fluxes and environmental drivers.In many ecosystems, fluxes are predominantly influenced by solar radiation, air temperature, and vapour pressure, while in other ecosystems, factors such as water table depth exhibit a strong correlation with flux intensity (Zhu, McCalmont, et al. 2023).In such cases, the utilization of satellite vegetation proxies can offer substantial value.

The importance of sufficient EC sampling
The findings of this study indicate that the spatial sampling of EC towers plays a crucial role in the performance of flux upscaling.A model trained with data from a diverse range of towers that sampled various ecosystems tended to exhibit better upscaling performance (Figure 3 and Table S7).For instance, the upscaling performance was particularly strong for deciduous broadleaf forests and evergreen needleleaf forests, which had 21 and 31 towers, respectively, with long-term data records out of a total of 206 FLUXNET2015 towers (Pastorello et al. 2020).On the other hand, the upscaling performance was poorer for evergreen broadleaf forests (13 towers) and shrublands (9 towers), with data records covering a limited period of time (Pastorello et al. 2020).For example, five out of six South American towers had time series lengths shorter than four years, and four of those towers had time series shorter than two years (Figure S8) (Pastorello et al. 2020).It is worth noting that the diversity in the representation of EC towers is also significant.Despite 70 out of 206 FLUXNET towers being located in Europe, nearly 50% of the towers were deployed in forests, while grasslands and croplands, which constitute 39% of Europe's land cover, had less than 9% and 16% of towers, respectively (https://www.eea.europa.eu/themes/agriculture/intro).The extant towers were prone to be present in regions with relatively higher vegetation coverage (Figure S8).Additionally, the specific characteristics of ecosystems pose challenges in flux upscaling.For instance, the GPP time series for evergreen broadleaf forests tend to be relatively stable (Jung et al. 2020), making it more difficult to accurately estimate fluxes compared to ecosystems with strong seasonal variations.Nevertheless, ensuring an evenly distributed network of EC towers would greatly benefit flux upscaling efforts (Sulkava et al. 2011).
When discussing the impacts of EC sampling on the flux upscaling performance, it is noteworthy that the choice of a method for partitioning data into training and test sets holds significant importance.Here, we employed the leave-one-out cross-validation (LOOCV) technique, akin to the approach employed in Joiner and Yoshida (2020).In this way, each tower played a role in the training and validation process.However, it is crucial to note that during individual validation steps (e.g. one of the 206 validations conducted for GPP), the specific tower under consideration remained entirely 'untouched'.The rationale behind selecting this approach lies in the fact that, if we were to adopt the conventional data shuffle-split validation method (Pedregosa et al. 2011), data originating from the same tower could be utilized both for training and validation purposes.This contradicts our objective of spatially upscaling EC fluxes.The leave-one-out crossvalidation method ensures a comprehensive validation process that guards against potential biases arising from selecting only subsets of towers for either training or validation purposes.This approach thus enhances the integrity and robustness of our validation procedures.
In the context of temporal analysis, we employed this methodology to assess the efficacy of upscaling techniques in predicting future fluxes.This entailed training the model using historical data to make future predictions.It is noteworthy that we refrained from employing a random selection approach for approximately two-thirds of the data, as such a method might inadvertently involve training models with data from 'yesterday' and 'tomorrow' to predict values for 'today'.Under such conditions, the performance of linear regression would remain relatively stable.Therefore, we conducted an evaluation of machine learning algorithms in their capacity to forecast extended time series.
In contrast, we employed the conventional random data selection (Ichii et al. 2017;Pedregosa et al. 2011;Tramontana et al. 2016) for both spatial and temporal upscaling.For instance, the coefficient of determination (R 2 ) for both aspects approached 0.75 for GPP, showing a slight improvement compared to the data splitting approach we currently employ, but it is suggested to choose a data selection/split method that suits the aim of study.

Summary and recommendations
This study standardizes the routine of upscaling EC fluxes, while also determining the key factors that affect the performance of upscaling.The UFLUX upscaling framework recommends the use of the Xgboost algorithm due to its relatively high accuracy (R 2 = 0.7 on a daily basis) and small uncertainties (1.5 g C m −2 d −1 for GPP and Reco, 15 W m −2 for H and LE).Additionally, the UFLUX framework exhibited agreement in estimating NEE through direct and indirect upscaling pathways.Despite the advancements in machine learning algorithms, they did not significantly improve the upscaling performance.Among the predictors, SIF contributed the most information (ca.50% of the contribution) to the upscaling model.Combining vegetation indices and environmental drivers can achieve similar results.The spatial sampling of EC towers plays a crucial role in flux upscaling, requiring at least three towers to achieve an upscaling R 2 of 0.75.The satellite vegetation proxies at these three towers should closely match the proxies at the target location, with an R 2 ranging between 0.5 and 0.75.If one tower can explain more than 75% of the variability in satellite vegetation proxies, it is sufficient to train the model and estimate the flux with this tower for the target location with an R 2 of 0.8.However, if less than 50% of the variability can be explained, no matter how many towers are used, it will not be enough.This study emphasizes the importance of a spatiotemporally even distribution of EC towers for reliable estimation of global fluxes.

Figure 1 .
Figure 1.Study overview.The UFLUX framework includes three major parts.Part 1: tower-level upscaling model training.This part also considers factors affecting the upscaling performance by feeding with different parameter combinations.Part 2: producing upscaled GPP, Reco, NEE, H, and LE fluxes using the trained models.Upscaled estimates of NEE are produced from upscaled Reco and GPP (i.e.NEE = Reco -GPP), however we also validate this against directly upscaled NEE.Part 3: validating the upscaled models and products at both tower and global scales.Global NEE variability is also assessed using the upscaled products.

Figure 2 .
Figure 2. Schematic diagram of one complete upscaling model training.The red dashed arrows indicate the data flow into the machine-learning (ML) upscaling model.The orange blocks or oval are input data to models, green blocks are model output data, and blue blocks refer to components the models.The block on the bottom-left reveals the machine-learning model workflow.The block on the right shows the model scenarios: 1) spatial estimation (estimating the whole GPP time series at one EC tower by training the model using data at other towers with the same plant functional type); 2) temporal estimation (estimating future GPP from data in the past).

Figure 3 .
Figure 3. Upscaling performance for GPP, Reco, NEE, H, and LE at EC level.The x axis is normalised mean the bias error (MBE), the y axis is the R 2 , and the colourmap indicates the normalised root mean squared error (RMSE).The normalised MBE and RMSE are corresponding values divided by the mean fluxes (e.g.normalised GPP MBE = GPP MBE/GPP), considering the different flux magnitude across Plant Functional Types (PFTs).The dots represent the averaged MBE and R 2 of PFTs: Mixed Forests (MF), Evergreen Needleleaf Forests (ENF), Grasslands (GRA), Woody Savannas (WSA), Savannas (SAV), Evergreen Broadleaf Forests (EBF), permanent Wetlands (WET), Deciduous Broadleaf Forests (DBF), Open Shrublands (OSH), Croplands (CRO), and Closed Shrublands (CSH).The dot colours of the four subplots separately represent PFT-averaged levels of GPP, Reco, H, and LE fluxes derived from EC.The contours represent how well the upscaling performance is considering both the R 2 and MBE.The dot curves are for better visual effects in distinguishing the upscaling performance between plant function types.

Figure 4 .
Figure 4. UFLUX upscaling performance in terms of R 2 and MBE at the cross-validation eddy covariance towers.The MBE unit for GPP and Reco is g C m −2 d −1 and for H and LE is W m −2 .

Figure 5 .
Figure 5. Fingerprints for measured and estimated fluxes averaged by PFTs for GPP, Reco, H, and LE, respectively.Their x-axes represent the time dimension, their y-axes represent PFTs, and the colour indicates the flux strength.For each flux of GPP, Reco, H, and LE, the upper fingerprint plot shows the measured fluxes, and the bottom fingerprint shows the estimated fluxes.

Figure 10 .
Figure 10.Impacts on the upscaling performance from the EC towers per se.(a) Relationship between the training site number and the upscaling model performance in terms of R 2 and MBE.Colours represent the 'similarity' classes: low (less than 50% of the variance in the test site GPP can be explained by the training sites), medium (50% − 75% of the variance in the test site GPP can be explained by the training sites), and large (more than 75% of the variance in the test site GPP can be explained by the training sites).Note for this sub-figure, the upscaling mode for each site, that has k training sites, iteratively ran k! times.(b) The number of medium similarity towers for locations across the globe.White & khaki regions are in urgent need of more EC towers as the number of medium similarity towers smaller than three.

Figure 11 .
Figure 11.Comparison of light-use efficiency (LUE) derived from eddy covariance (EC) towers and flux upscaling products, interpolated at the corresponding tower locations.Dot density in red regions signifies high values, contrasting with blue regions indicating lower values.Dashed lines represent the one-to-one relationship.

Figure 13 .
Figure 13.Comparison of Bowen ratio (H/LE) derived from eddy covariance (EC) towers and flux upscaling products, interpolated at the corresponding tower locations.Dot density in red regions signifies high values, contrasting with blue regions indicating lower values.Dashed lines represent the one-to-one relationship.

Figure 12 .
Figure 12.Comparison of water-use efficiency (WUE) derived from eddy covariance (EC) towers and flux upscaling products, interpolated at the corresponding tower locations.Dot density in red regions signifies high values, contrasting with blue regions indicating lower values.Dashed lines represent the one-to-one relationship.The water-use efficiency unit is g C (kg H 2 O) −1 .

Table 2 .
Statistical metrics of the UFLUX upscaling performance at FLUXNET2015 towers using SIF as the satellite vegetation proxy, the unit for RMSE and MBE are g C m −2 d −1 for GPP and Reco and W m −2 for H and LE.See tableS2for the results of using MODIS NIRv as the satellite vegetation proxy.

Table 3 .
The median values of statistical metrics of the upscaling performance using random forest regression, support vector regression and multiple Layer perceptron at FLUXNET2015 towers; the unit for RMSE and MBE is g C m −2 d −1 .See tableS3for the full table.

Table 5 .
The median values of statistical metrics of the upscaling performance using NIRv at FLUXNET2015 towers; the unit for RMSE and MBE are g C m −2 d −1 for GPP and Reco and W m −2 for H and LE.See tableS5for the full table.