A stacking ensemble algorithm for improving the biases of forest aboveground biomass estimations from multiple remotely sensed datasets

ABSTRACT Accurately quantifying the aboveground biomass (AGB) of forests is crucial for understanding global change-related issues such as the carbon cycle and climate change. Many studies have estimated AGB from multiple remotely sensed datasets using various algorithms, but substantial uncertainties remain in AGB predictions. In this study, we explore whether diverse algorithms stacked together can improve the accuracy of AGB estimates. To build the stacking framework, five base learners were first selected, based on diversity and accuracy metrics, from a series of algorithms including multivariate adaptive regression splines (MARS), support vector regression (SVR), the multilayer perceptron (MLP), random forests (RF), extremely randomized trees (ERT), stochastic gradient boosting (SGB), the gradient-boosted regression tree (GBRT) algorithm, and categorical boosting (CatBoost). Ridge regression and RF were utilized as meta learners to combine the outputs of the base learners. In addition, six important features were selected, according to the feature importance values provided by the CatBoost, ERT, GBRT, SGB, MARS, and RF algorithms, as inputs of the meta learner in the stacking process. We then used stacking models with 3–5 selected base learners and ridge or RF as the meta learner to estimate AGB. The reference AGB data, compiled from plot-level forest AGB and high-resolution AGB data derived from field and lidar data, together with the corresponding predictor variables, extracted from satellite-derived leaf area index, net primary production, forest canopy height, and tree cover data, Global Multiresolution Terrain Elevation Data 2010, and climate data, were randomly split into 80% for training the model and 20% for model evaluation. The evaluation results showed that stacking generally outperformed the optimal base learner and provided improved AGB estimations, mainly by decreasing the bias.
All stacking models had relative improvement (RI) values in bias of at least 22.12%, even reaching more than 90% under some scenarios, except for deciduous broadleaf forests, for which an optimal algorithm already provided estimates with low bias. In contrast, the improvements of stacking in R2 and RMSE were not significant. The stacking of MARS, MLP, and SVR provided improved results compared with the optimal base learner, and the average RI in R2 was 3.54% when we used all data without separating forest types. Finally, the optimal stacking model was used to generate global forest AGB maps.


Introduction
Accurately quantifying the aboveground biomass (AGB) of forests is crucial for understanding global change-related issues, such as the carbon cycle and climate change (Houghton, Hall, and Goetz 2009). In recent decades, many studies have estimated forest AGB from optical images, synthetic aperture radar (SAR), and light detection and ranging (LiDAR) data at local or regional scales (Lu et al. 2016;Zolkos, Goetz, and Dubayah 2013;Wulder et al. 2012;Neumann et al. 2012). However, large uncertainties remain in existing forest AGB maps (Saatchi et al. 2011;Baccini et al. 2012;Mitchard et al. 2013).
To improve the accuracy of AGB predictions, recent efforts have concentrated on building ground-based forest observation systems (Chave et al. 2019), providing new remotely sensed data sources, such as the Global Ecosystem Dynamics Investigation LiDAR and the European Space Agency P-band radar (Carreiras et al. 2017; Qi et al. 2019), and integrating multisource remotely sensed data (Luo et al. 2019; Kattenborn et al. 2015). In addition, data-driven machine learning algorithms have been developed alongside the growing number of remote sensing observations, and their performance in estimating AGB has been explored (Gleason and Im 2012; López-Serrano et al. 2016; de Almeida et al. 2019). However, each algorithm has its own scope of application, and no algorithm performs well in all situations (Wang, Wu, and Mo 2013). The pursuit of robust algorithms appropriate for forest AGB estimation or mapping, particularly at large scales, is therefore ongoing.
Rather than selecting the single best model to estimate land surface parameters from remotely sensed data, ensemble algorithms combine the advantages of multiple learners, leading to improved prediction accuracy (Mendes-Moreira et al. 2012). Currently, the ensemble algorithms used for estimating forest AGB are mainly homogeneous ensembles that aggregate results from the same algorithm, such as tree-based bagging, represented by random forests (RF), and boosting, represented by stochastic gradient boosting (SGB), the gradient-boosted regression tree (GBRT) algorithm, categorical boosting (CatBoost), and the extreme gradient boosting (XGBoost) regression algorithm (Breiman 2001; Belgiu and Drăguţ 2016; Friedman 2001, 2002; Huang et al. 2019). Bagging generates bootstrap samples from the original datasets to train weak decision tree models and then averages the outputs to obtain final predictions, which reduces the prediction variance (Yang et al. 2010), whereas boosting converts weak learners into strong learners by increasing the weights of samples with higher prediction errors in the following iteration, thereby gradually improving the prediction accuracy by decreasing the bias (Bühlmann and Hothorn 2007). It has been widely demonstrated that a series of bagging and boosting algorithms outperform individual learning algorithms in estimating forest AGB (Li et al. 2020). However, heterogeneous ensemble algorithms, which can combine diverse learners, including both individual algorithms and the homogeneous ensemble algorithms mentioned above, and can thus provide better prediction results (Healey et al. 2018; Naimi and Balzer 2018), are still in their infancy in the field of forest AGB estimation.
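The variance-reducing and bias-reducing behaviors described above can be illustrated with a minimal sketch; the synthetic data, learner settings, and scikit-learn implementations here are illustrative stand-ins, not the configurations used in this study.

```python
# Sketch: bagging (RF) averages trees fit on bootstrap samples to reduce
# variance, while boosting (GBRT) fits trees sequentially on the errors of
# earlier trees to reduce bias. Data and hyperparameters are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

bagging = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
boosting = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

for name, model in [("bagging (RF)", bagging), ("boosting (GBRT)", boosting)]:
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(f"{name}: RMSE = {rmse:.1f}")
```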
In this study, we aim to explore the potential of the heterogeneous stacking ensemble algorithms for improving the accuracy of AGB estimation from multiple remote sensing datasets and in particular, address whether and to what extent the stacking algorithm can improve AGB predictions relative to homogeneous ensemble algorithms or optimal individual algorithms.
The stacking algorithm, also known as stacked generalization or super learning, was first proposed by Wolpert (1992) and formalized by Breiman (1996). Generally, stacking has a two-layer structure in which a meta model (level-1 model) in the second layer combines the outputs of the base learners (level-0 models) in the first layer. To date, stacking has been applied to map forest changes (Healey et al. 2018), estimate daily average PM2.5 concentrations (Zhai and Chen 2018), forecast short-term electricity consumption (Divina et al. 2018), and improve the spatial interpolation accuracy of daily maximum air temperature (Cho et al. 2020), owing to its superior generalization ability compared with single algorithms. Previous studies have suggested that the success of a stacking model depends on the accuracy and the diversity of its base learners (Nath and Sahu 2019; Naimi and Balzer 2018). Using stacking to combine multiple diverse base learners that can effectively compensate for each other's inadequacies is assumed to improve predictions relative to the base learners themselves (Tyralis et al. 2019). Therefore, the choice of suitable base learners is a critical issue in stacking. Most studies have evaluated candidate models based solely on accuracy, whereas diversity has not been quantified properly (Wang, Lu, and Feng 2020). In this study, we selected base learners based on both accuracy and diversity. Moreover, we investigated how the performance of stacking in estimating AGB was affected by the selected base learners and their combinations with meta learners, which was the second objective of this study. The optimal stacking model was finally used to generate a forest AGB map at the global scale.
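The two-layer structure can be sketched with scikit-learn's built-in stacking; the base learners shown (RF, SVR, MLP) are illustrative substitutes, since MARS and CatBoost are not part of scikit-learn, and the data are synthetic.

```python
# Minimal two-layer stacking sketch: level-0 base learners produce
# out-of-fold predictions that a level-1 (meta) ridge model combines.
# Learner choices and hyperparameters are illustrative, not this study's.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=1)

base_learners = [
    ("rf", RandomForestRegressor(n_estimators=100, random_state=1)),
    ("svr", SVR(C=10.0)),
    ("mlp", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=1)),
]
# cv=5 reproduces the five-fold generation of level-1 training inputs
stack = StackingRegressor(estimators=base_learners, final_estimator=Ridge(), cv=5)
stack.fit(X, y)
```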

Field AGB
The forest AGB data were inherited from one of our previous studies. They were compiled from plot-level forest AGB and high-resolution AGB data derived from field and lidar data. We collected plot-level AGB measured during the period 2000-2010 from published literature and online databases. The plots were mainly located in mature or primary forests with minimal human disturbance. To ensure that these plot measurements were representative of forest conditions and to reduce potential geolocation errors, plots smaller than 0.05 ha were filtered out (Keeling and Phillips 2007; Bouvet et al. 2018). The remaining plot-level AGB data were aggregated to a 0.01° spatial resolution. Moreover, mismatches in spatial scale between field plots and pixels of remotely sensed data may lead to uncertainties in forest AGB estimation, particularly where forest AGB shows strong local spatial variation (Réjou-Méchain et al. 2014). Therefore, we assessed the homogeneity and representativeness of the reference AGB data using the coefficient of variation (CV) of tree cover (Hansen et al. 2013) within each 0.01° cell and removed reference AGB data with a CV larger than 1.0. Six high-resolution AGB datasets, derived from field AGB and lidar data at spatial resolutions finer than 100 m, were also used as reference AGB data. We reprojected these AGB maps to the geographical coordinate system and then aggregated them to the 0.01° scale. More details can be found in one of our previous papers.

Input data collection
Remotely sensed data for AGB prediction mainly comprised the Leaf Area Index (LAI) product from the Global LAnd Surface Satellites (GLASS) product suite (Liang et al. 2021), forest canopy height retrieved from Geoscience Laser Altimeter System (GLAS) data (Simard et al. 2011), the MODIS Net Primary Production (NPP) product (Running and Zhao 2019), tree cover data (Hansen et al. 2013), and Global Multiresolution Terrain Elevation Data 2010 (GMTED2010) (Danielson and Gesch 2011). Climate data from WorldClim2 (Fick and Hijmans 2017), and changes in temperature and precipitation based on the Climatic Research Unit (CRU) gridded dataset (Harris et al. 2014), were also included for AGB estimation.

GLASS LAI data
The LAI product selected was the GLASS LAI product at 8-day and 1-km resolution. It was derived from reprocessed MODIS reflectance time-series data using a general regression neural network trained with the combined time-series LAI from the MODIS and CYCLOPES LAI products, and is provided in a sinusoidal projection (Liang et al. 2013; Xiao et al. 2014). Previous studies suggested that the GLASS LAI product is more accurate and temporally continuous than other LAI products, such as MODIS LAI and Geoland2 LAI (Li et al. 2018; Xiao et al. 2014, 2016). We reprojected the 8-day GLASS LAI data from 2001 to 2010 to the WGS 84 geographical coordinate system and averaged them to the monthly scale. The maximum LAI for the year 2005 and the interannual variation in LAI from 2001 to 2010, characterized by the CV (LAI-CV), were used as predictors of forest AGB.
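The LAI-CV predictor described above is a per-pixel coefficient of variation across years; a minimal NumPy sketch follows, with simulated annual-maximum LAI values and an illustrative grid shape standing in for the real GLASS data.

```python
# Hedged sketch of the interannual-variation predictor: the coefficient of
# variation (std / mean) of annual maximum LAI over 2001-2010, computed
# per pixel on a (years, rows, cols) array. The data here are simulated.
import numpy as np

rng = np.random.default_rng(0)
annual_max_lai = rng.uniform(1.0, 6.0, size=(10, 180, 360))  # 10 years of annual maxima

lai_mean = annual_max_lai.mean(axis=0)
lai_cv = annual_max_lai.std(axis=0) / lai_mean   # LAI-CV predictor, one value per pixel
max_lai_2005 = annual_max_lai[4]                 # year 2005 (index 4 of 2001-2010)
```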

Global canopy height map
The global canopy height (CH) map derived from Geoscience Laser Altimeter System (GLAS) data by Simard et al. (2011) was used. Only GLAS waveforms located on slopes below 5 degrees, with a bias correction lower than 25% of the measured RH100, and within forested areas according to the GlobCover map were retained to produce the global CH map. We resampled the 1-km CH map to 0.01° using the nearest-neighbor method.

MODIS NPP data
The annual MOD17A3HGF (version 6) data at 500-m resolution were obtained from the Land Processes Distributed Active Archive Center (https://lpdaac.usgs.gov/products/mod17a3hgfv006) (Running and Zhao 2019). Consistent with the preprocessing of the LAI data, we reprojected the data from 2001 to 2010 to the WGS84 geographic coordinate system and aggregated them to 0.01°. The annual NPP for 2005 and the CV of NPP from 2001 to 2010 (NPP-CV) served as predictors of forest AGB.

Global tree cover product
The global tree cover map used in this study was provided by Hansen et al. (2013) at a 30-m spatial resolution. For consistency with the other datasets, the 30-m data were aggregated to a 0.01° resolution. The mean and standard deviation of tree cover within each 0.01° cell (TC-Mean, TC-Std) were calculated for the prediction of forest AGB. Additionally, the aggregated map served as the base map of global tree cover: forests and shrublands with tree cover > 10% were considered forest pixels, while other pixels were masked (Schmitt et al. 2009).
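The aggregation and masking steps above can be sketched with block statistics; the array sizes and block factor are illustrative, and real processing would use georeferenced rasters rather than plain arrays.

```python
# Sketch of the tree-cover aggregation: block-average fine-resolution cells
# into coarse cells, keep the per-cell mean and std as predictors (TC-Mean,
# TC-Std), and mask cells with mean tree cover <= 10%. Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
tc_30m = rng.uniform(0, 100, size=(120, 120))   # % tree cover at fine resolution
block = 40                                      # fine cells per coarse cell (illustrative)

h, w = tc_30m.shape[0] // block, tc_30m.shape[1] // block
blocks = tc_30m.reshape(h, block, w, block)
tc_mean = blocks.mean(axis=(1, 3))              # TC-Mean predictor
tc_std = blocks.std(axis=(1, 3))                # TC-Std predictor
forest_mask = tc_mean > 10.0                    # forest pixels per the >10% rule
```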

Topographical and climatic data
The Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010) suite contains raster elevation products at 30, 15, and 7.5 arc-second spatial resolutions. The DEM and slope information were derived from GMTED2010 data at a 30 arc-second spatial resolution (Danielson and Gesch 2011).
Climate variables used for AGB estimation included the annual mean temperature (Temp) and precipitation (Prec) from the WorldClim2 dataset (Fick and Hijmans 2017), as well as changes in annual temperature (TempChg) and in annual precipitation, both derived from the CRU gridded dataset (Harris et al. 2014).

Stacking ensemble learning algorithm
The stacking ensemble learning framework for estimating forest AGB is shown in Figure 1. In the two-layer stacking structure, the first layer included n base learners, and the second layer used a linear or nonlinear algorithm, called a meta learner, to combine the predictions of the base learners. All data were randomly split into training data (80%) and test data (20%). The training data were further divided into five folds. In each of the five iterations, four folds were used for training the base learners, whereas the remaining fold was held out for AGB prediction. The five-fold cross-validated predictions, called meta-features, served as input variables of the meta learner. When original features were not included in the stacking, the number of variables for training the meta learner equaled the number of base learners. The base learners were then refitted using all training data, and the refitted models were applied to the test data to generate the test-set meta-features, i.e. the inputs of the meta learner. Final AGB predictions were obtained by the meta learner, and their accuracies were evaluated on the test data. To reduce the impact of the random splitting on the evaluation results, the above procedure was repeated 50 times.
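The procedure above, written out explicitly (80/20 split, five-fold out-of-fold meta-features, refit on all training data, meta learner on the meta-features), can be sketched as follows; the two base learners and the synthetic data are illustrative stand-ins for the study's learners and AGB dataset.

```python
# Sketch of the described training/evaluation loop: out-of-fold predictions
# on the training set become meta-features; base learners refit on all
# training data generate the test-set meta-features for the ridge meta learner.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=10, noise=8.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

base_learners = [RandomForestRegressor(n_estimators=100, random_state=2), SVR(C=10.0)]

# Level-0: five-fold cross-validated (out-of-fold) predictions = meta-features
meta_train = np.column_stack(
    [cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_learners]
)
# Refit each base learner on all training data to predict the test set
meta_test = np.column_stack(
    [m.fit(X_tr, y_tr).predict(X_te) for m in base_learners]
)

# Level-1: the meta learner combines the meta-features
meta = Ridge().fit(meta_train, y_tr)
y_pred = meta.predict(meta_test)
```

In the full framework this whole block would sit inside a loop over 50 random splits, with the accuracy metrics averaged across repetitions.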
In this study, in addition to evaluating stacking using all data, we examined the performance of stacking models in estimating forest AGB for four forest types that had more than 1000 AGB samples: evergreen broadleaf forest (EBF), deciduous broadleaf forest (DBF), woody savanna (WSA), and savanna (SAV).
Since combinations of base learners with high accuracy and high diversity can maximize the generalization accuracy (Zhou 2009; Bin et al. 2020; Fan, Xiao, and Wang 2014), we selected CatBoost as the first base learner because of its demonstrated better performance than the other candidate base learners. The diversity among base learners was measured using Spearman's rank correlation coefficient: when the prediction errors of two learners were weakly correlated or uncorrelated, the learners were considered to have different skills and thus higher diversity (Ma and Dai 2016). Afterward, we calculated the average Spearman's rank correlations among the eight candidate base learners over 50 runs based on their prediction results, and selected the base learner with the lowest correlation with CatBoost as the second base learner. Similarly, the third base learner had the lowest mean correlation with the first two. By repeating this process, the remaining base learners were gradually added to the stacking model.
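The greedy, diversity-driven selection described above can be sketched as follows; the learner names, the simulated error vectors, and the hand-rolled Spearman helper are all illustrative assumptions, standing in for the real per-run prediction errors of the eight candidates.

```python
# Hedged sketch of diversity-based base-learner selection: start from the
# most accurate learner, then repeatedly add the candidate whose prediction
# errors have the lowest mean |Spearman rho| with the already-selected set.
import numpy as np

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the ranks (no ties here)
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(3)
shared = rng.normal(size=200)  # shared error component makes learners correlated
errors = {name: shared * w + rng.normal(size=200)
          for name, w in [("CatBoost", 0.2), ("RF", 0.9), ("SVR", 0.5), ("MLP", 0.4)]}

selected = ["CatBoost"]                     # most accurate learner chosen first
candidates = [n for n in errors if n not in selected]
while candidates:
    mean_rho = {c: np.mean([abs(spearman(errors[c], errors[s])) for s in selected])
                for c in candidates}
    best = min(mean_rho, key=mean_rho.get)  # lowest correlation = most diverse
    selected.append(best)
    candidates.remove(best)
print(selected)
```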
In addition, some studies have suggested that a few base learners, rather than all available learners, should be stacked together, and that 3 or 4 base learners might be optimal (Zhou, Wu, and Tang 2002; Breiman 1996; Cho et al. 2020). Therefore, we stacked 3-5 base learners and examined their performance for AGB estimation. The order in which base learners were selected under each forest type scenario is shown in Table 1. The stacking model with 3 base learners combined the first 3 learners in this order, and similarly, stacking with 5 base learners used the first 5. Under all these scenarios, at least one homogeneous ensemble algorithm was selected as a base learner. To fully explore the stacking performance, we also used the three individual learners MARS, MLP, and SVR as base learners, without any homogeneous ensemble algorithm, and assessed the performance of the associated stacking models in estimating forest AGB.

Meta learners
Simple linear models such as ridge and lasso regression are often used as the meta model and can provide a smooth interpretation of the predictions of the base models (Cho et al. 2020). In this study, we included ridge and lasso as candidate meta learners in stacking. Additionally, original features that might improve the prediction results were incorporated as well (Pernía-Espinoza et al. 2018). RF and MLP can capture the nonlinear relationships between original features and forest biomass and were thus also employed as candidate meta learners (Healey et al. 2018). We ranked the original features used to train the base learners by the feature importance values provided by the CatBoost, ERT, GBRT, SGB, MARS, and RF algorithms and used the six most important features as additional inputs of the meta learner in the stacking models. The selected original features under different scenarios are shown in Table 1.
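Feeding the top-ranked original features to the meta learner alongside the meta-features can be sketched as below; the base learners, the six feature indices, and the synthetic data are illustrative assumptions rather than the ranked features of this study.

```python
# Sketch of stacking with original features: selected original predictors
# are concatenated with the out-of-fold meta-features before fitting the RF
# meta learner (the "RFOri" configuration). Feature choice is illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=12, noise=5.0, random_state=4)
base_learners = [GradientBoostingRegressor(random_state=4), SVR(C=10.0)]

meta_features = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) for m in base_learners]
)
important_idx = [0, 1, 2, 3, 4, 5]          # stand-in for the six top-ranked features
meta_inputs = np.hstack([meta_features, X[:, important_idx]])

meta_rf = RandomForestRegressor(n_estimators=200, random_state=4).fit(meta_inputs, y)
```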
Moreover, we initially tested the performance of the ridge, lasso, MLP, and RF algorithms as meta learners and found that ridge and RF provided better prediction results than lasso and MLP (Nath and Sahu 2019). Therefore, ridge and RF were chosen as the meta learners of stacking, and their performance in estimating AGB when combined with different base learners was explored (Table 1).

Forest AGB mapping
To further explore the performance of stacking models in AGB estimation, we generated global forest AGB maps based on the optimal stacking model (Stacking AGB) and the optimal base learner (CatBoost AGB), respectively, and then compared their spatial patterns, together with those of an existing fused global AGB map (Fusion AGB).

Accuracy assessment
Common metrics, including the R2 value, root mean square error (RMSE), and bias, were used to evaluate the accuracy of AGB predictions. They were calculated as:

R2 = 1 - Σ_i (y_i - ŷ_i)² / Σ_i (y_i - ȳ)²

RMSE = sqrt( (1/N) Σ_i (ŷ_i - y_i)² )

bias = (1/N) Σ_i (ŷ_i - y_i)

where y_i represents the reference AGB, ȳ denotes the mean value of the reference AGB, ŷ_i is the AGB predicted by the models, and N is the number of samples. A model with a higher R2 value and lower RMSE and bias is preferred for AGB estimation. In addition, the relative improvement (RI) in stacking performance for AGB estimation compared with the optimal base learner was quantified (Sun and Li 2020):
I_R2 = (R2_s - R2_b) / R2_b × 100%

I_RMSE = (RMSE_b - RMSE_s) / RMSE_b × 100%

I_bias = (|bias_b| - |bias_s|) / |bias_b| × 100%

where I_R2, I_RMSE, and I_bias represent the RI in R2, RMSE, and bias, respectively. The subscript s represents the stacking model, and b indicates the optimal base learner.
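The metrics and relative improvements above can be computed as plain functions; the RI sign conventions here follow the stated preference for a higher R2 and a lower RMSE and |bias|, and the example numbers are purely illustrative.

```python
# Evaluation metrics (R2, RMSE, bias) and relative improvement (RI) as
# NumPy functions; positive RI means the stacking model improved on the base.
import numpy as np

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))

def bias(y, y_hat):
    return np.mean(y_hat - y)

def relative_improvement(stack_val, base_val, higher_is_better):
    if higher_is_better:                       # e.g. R2
        return (stack_val - base_val) / abs(base_val) * 100
    return (abs(base_val) - abs(stack_val)) / abs(base_val) * 100  # RMSE, bias

# Illustrative reference and predicted AGB values (Mg/ha)
y = np.array([100.0, 150.0, 200.0])
y_hat = np.array([110.0, 140.0, 195.0])
print(r2(y, y_hat), rmse(y, y_hat), bias(y, y_hat))
```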

Performance of stacking models for forest AGB estimation
The average R2, RMSE, and bias values of the five base learners and the different stacking models over 50 runs using all data are shown in Table 2. The CatBoost model outperformed the other base learners for AGB prediction, achieving an R2 value of 0.70, an RMSE of 47.19 Mg/ha, and a bias of 0.12 Mg/ha, which served as the benchmark for evaluating the relative performance of stacking models that contained CatBoost. Similarly, MLP had an overall better performance than MARS and SVR and was therefore used to evaluate the RI of stacking models whose first layers comprised MARS, MLP, and SVR. Consistent with some previous studies, the results indicated that incorporating original features in stacking slightly improved the estimates (Pernía-Espinoza et al. 2018), mainly by decreasing the RMSE. Moreover, stacking using the RF model and original features (RFOri) provided more accurate estimation than stacking with ridge and original features (Table 2). In contrast, stacking using ridge as the meta learner provided more accurate results than stacking using the RF model when original features were not included as inputs of the meta learner. Therefore, the RI shown in Table 2 for ridge-related stacking models was computed relative to the ridge meta learner without original features, and the RI for RF-related models was computed relative to RFOri. RFOri produced the most accurate estimates of forest AGB for all combinations of base learners, whereas the performance of the ridge model, the ridge model with original features, and the RF model without original features was slightly worse in terms of R2 and RMSE (Table 2 and Figure 2). Furthermore, stacking using RFOri obtained more accurate estimation of forest AGB than the optimal base learner, suggesting that stacking can improve AGB estimation from multiple remote sensing datasets.
For the base learner combination, MLP, MARS, and SVR, stacking using RFOri improved the estimation by 8.96%, 7.46% and 92.70% in terms of R 2 , RMSE, and bias, respectively. However, when four or five base learners were used, we found that stacking using RFOri produced larger bias than the other stacking models (Table 2).
With meta-features alone, stacking models using the ridge model as the meta learner provided better results than those based on RF, as well as the optimal base learner, despite being less accurate than stacking models using RFOri (Figure 2). The maximum RI was 3.54% for R 2 and 2.87% for RMSE, achieved by stacking using MARS, MLP, and SVR as base learners. For bias, the average RI was more than 22.12%, which suggested that stacking could improve the estimates, particularly by reducing the bias. Using CatBoost, MARS, and MLP as base learners, stacking obtained similar results to the CatBoost model in terms of R 2 and RMSE but decreased the bias by 49.8% (Table 2).

Figure 2. Comparison of the performance of stacking using ridge (Ridge) and stacking using RF with original features (RFOri) against that of the optimal base learner.
Whether ridge or RFOri was used as the meta model, the stacking of MARS, MLP, and SVR significantly improved the results relative to the best base learner (Figure 2). This is consistent with the viewpoint that base learners should be mediocre learners, with an average performance of approximately 0.5-0.6, so that the ensemble can outperform the best base learner (Lasisi and Attoh-Okine 2019). For the stacking models containing CatBoost, which was a strong base learner, the results were only slightly better than those of the optimal base learner; however, this does not imply that stacking with strong base learners should be avoided, since it could still significantly reduce the bias of the estimates (Table 2).

Performance of stacking for different forest types
For all forest types, including EBF, DBF, WSA, and SAV, CatBoost remained the optimal base learner, while SVR generally performed better than MARS and MLP for AGB prediction. Figure 3 shows the average R2, RMSE, and bias achieved over 50 runs by stacking using ridge as the meta learner, and Figure 4 shows the boxplots of R2, RMSE, and bias for the 50 runs. The results suggested that stacking MARS, MLP, and SVR improved the AGB estimation, with increases in R2 and decreases in RMSE and bias. The average RI in R2 was 4.68% for EBF, 4.56% for DBF, 4.07% for WSA, and 4.68% for SAV (Figures 3 and 4). In contrast, the stacking models in which CatBoost was one of the base learners obtained results similar to those of the optimal base learner CatBoost, indicating that stacking did not significantly improve the results in terms of R2 and RMSE. However, all stacking models provided less biased AGB predictions than CatBoost and SVR, which confirmed that stacking improved the estimation by reducing the bias.
For EBF, all stacking models improved the performance in terms of R2, RMSE, and bias compared with the optimal base learner, and the improvement varied slightly with the number of base learners used. When 3 base learners were included in stacking, the RIs in R2, RMSE, and bias were 0.22%, 0.13%, and 89.93%, respectively. Stacking using 5 base learners had RIs in R2, RMSE, and bias of 1.00%, 0.59%, and 87.36%, respectively. The combination of four base learners, comprising CatBoost, MARS, MLP, and GBRT, had the best performance in terms of R2 and RMSE (Figure 4). For DBF and WSA, CatBoost performed better than it did for EBF, with a relatively larger R2, a lower RMSE, and a particularly lower bias. Under this condition, the stacking of CatBoost with other base learners did not lead to improved results, except that stacking provided less biased results for WSA (Figure 3 and Figure 4). The estimated results tended to worsen as the number of base learners used in stacking increased (Figure 4). These results indicated that when the bias achieved by a single algorithm was already low, stacking several learning algorithms might not further improve the accuracy of AGB estimation. For SAV, all base learners were weak learners with an R2 of less than 0.48, and stacking did improve the results. With more base learners, the improvement of the stacking model in estimating forest AGB was greater. The bias decreased from 0.73 Mg/ha achieved by CatBoost to 0.06 Mg/ha achieved by stacking with five base learners, an RI of 91.89%.
Despite the large differences in the performance of stacking models for different forest types, the combination of MARS, MLP, and SVR using stacking greatly improved the results compared with the optimal base learner SVR.

Global forest AGB maps generated using stacking
Based on the optimal stacking model (CatBoost + MARS + MLP + GBRT (Ridge) in Table 2) and the optimal base learner CatBoost, global AGB maps were generated for the 2000s from multiple remotely sensed data (Figure 5). The Stacking AGB map showed that tropical and subtropical forests stored the most carbon in AGB per hectare, whereas the carbon stocks were lower in boreal and temperate forests. Compared with CatBoost AGB, Stacking AGB was higher in most regions with high biomass values and generally lower in regions with low biomass values, which suggested that stacking provided more reasonable AGB maps than the optimal base learner. However, the difference in the spatial distributions of Stacking AGB and Fusion AGB was evident; Fusion AGB showed higher AGB in Oceania and Africa. Figure 6 shows the estimated Stacking AGB, CatBoost AGB, and Fusion AGB values for different forest types. The three AGB maps, particularly Stacking AGB and CatBoost AGB, had similar statistical distributions for all forest types. Estimated AGB values in EBF were larger than those in ENF, DBF, MF, WSA, and SAV. The AGB differences between Stacking AGB and Fusion AGB were larger than those obtained by subtracting CatBoost AGB from Stacking AGB (Figure 6).

Discussion
In recent decades, many studies have exploited complementary information from multiple remotely sensed datasets to improve AGB estimation. However, the effect of exploiting diverse algorithms on the accuracy of forest AGB estimation remains underexplored. In this study, we integrated several machine learning algorithms using stacking to estimate AGB from multiple satellite-derived data products. The results showed that stacking models could generally improve the accuracy of AGB predictions by greatly reducing the bias of the estimates and slightly improving the R2 and RMSE values. Therefore, if the main objective is to reduce the bias of estimates, such as in the retrieval of land surface parameters from satellite-derived data, stacking provides an effective way to achieve that goal. However, when the bias achieved by an optimal algorithm is already low (e.g. for DBF), the prediction accuracy cannot be further improved by stacking. In this situation, the base model rather than the stacking model should be used, given its lower complexity (e.g. it is simpler to train and interpret). Homogeneous ensemble methods, such as RF and gradient-boosted tree-based algorithms, which have high prediction performance, may also be considered (Güneralp, Filippi, and Randall 2014; Mutanga, Adam, and Cho 2012; Zhao et al. 2019).
The base learner combination of MARS, MLP, and SVR obtained a much larger improvement than stacking using CatBoost as one of the base learners, suggesting the necessity of considering ensemble algorithms for improving prediction accuracy. However, it should be noted that the stacking of MARS, MLP, and SVR still provided less accurate results than the homogeneous CatBoost model. This was partly because stacking is a way to fuse or add information to the estimation rather than an intrinsically new learning algorithm. More advanced algorithms, such as deep learning, remain worth investigating in future studies (Reichstein et al. 2019).
Previous studies have suggested that the appropriate selection of base learners is important for stacking models. In this study, we used the correlation of prediction errors to quantify diversity and thereby select base learners. Other metrics, such as covariance, dissimilarity measures, the chi-square measure, and mutual information, could be used to measure the diversity of base learners in future studies (Dutta 2009). Some studies have selected base learners based on differences in their algorithmic principles (Wang, Lu, and Feng 2020). However, this should not greatly affect the results, since several combination strategies used in this study were consistent with such a principle-based selection (e.g. MARS + MLP + SVR, CatBoost + MARS + MLP, and CatBoost + MARS + SVR in Table 2). To fully explore the feasibility of stacking for global forest AGB mapping, we generated AGB maps using the optimal stacking model at the global scale. The comparison showed that Stacking AGB was generally close to CatBoost AGB, with an AGB difference of less than 20 Mg/ha in most forest regions. Owing to the urgent need to improve forest AGB estimation at regional and global scales, many studies have integrated multisource remotely sensed data using various machine learning algorithms. Two typical examples are the forest AGB maps covering tropical regions produced by Saatchi et al. (2011) based on the MaxEnt algorithm and by Baccini et al. (2012) based on the RF algorithm, which were also used by Liu et al. (2015) to generate global forest AGB maps by establishing relationships between AGB and vegetation optical depth, and by Carvalhais et al. (2014) to calculate turnover times of carbon in terrestrial ecosystems. The existing AGB maps were almost all produced by a single algorithm rather than by an ensemble of several algorithms.
As suggested by Figure 5, the AGB estimated using CatBoost alone and that estimated using an ensemble of several algorithms under the stacking framework differed in both magnitude and spatial distribution, which indicates the uncertainties associated with AGB modeling algorithms and the necessity of examining these uncertainties in future studies of AGB estimation and mapping. In our previous studies, systematic comparisons of existing regional and global forest AGB maps covering different continents (Zhang, Liang, and Yang 2019) and a detailed comparison of several global AGB maps (Zhang and Liang 2020) were performed. The results revealed large discrepancies in current AGB maps, which could stem from the field biomass data, the choice and quality of the remotely sensed data, and the uncertainties of the AGB modeling algorithms highlighted here.

Conclusion
In recent decades, many studies have estimated forest AGB by fusing multiple remotely sensed datasets using various algorithms. However, integrating several algorithms to improve AGB estimation has rarely been investigated. In this study, we examined the performance of the stacking ensemble algorithm, which can combine diverse learners, in estimating forest AGB from multiple remotely sensed datasets. Based on accuracy and on diversity, measured by Spearman's rank correlation coefficient of the prediction errors of the base learners, we selected five base learners and used combinations of 3-5 base learners with ridge and RF meta learners to estimate AGB. The evaluation results showed that the stacking model generally outperformed the optimal base model and improved the prediction results, particularly by decreasing the bias, and that stacking using the RF model as the meta learner with important original features as additional inputs provided the most accurate estimation. However, stacking could yield more biased estimates as the number of base learners in the stacking structure increased. In terms of R2 and RMSE, stacking improved the prediction accuracy only slightly. The stacking of MARS, MLP, and SVR provided greatly improved results compared with the optimal base learner: the average RI in R2 was 3.54% when we used all data without separating forest types, and 4.68% for EBF, 4.56% for DBF, 4.07% for WSA, and 4.68% for SAV. All stacking models had an RI in bias of at least 22.12%, even reaching more than 90% under some scenarios, except for DBF, where an optimal algorithm could provide estimates with low bias. These results demonstrate the capability of the stacking ensemble algorithm to reduce the bias in AGB estimation. In future studies, stacking could be utilized to retrieve AGB or other biophysical parameters from remotely sensed data when decreasing the bias is the main objective.

Data and codes availability statement
The data that support the findings of this study are available at https://doi.org/10.5281/zenodo.5464675.

Disclosure statement
No potential conflict of interest was reported by the author(s).