Improving the simulation of terrestrial water storage anomalies over China using a Bayesian model averaging ensemble approach

ABSTRACT The ability to estimate terrestrial water storage (TWS) is essential for monitoring hydrological extremes (e.g., droughts and floods) and predicting future changes in the hydrological cycle. However, inadequacies in model physics and parameters, as well as uncertainties in meteorological forcing data, commonly limit the ability of land surface models (LSMs) to accurately simulate TWS. In this study, the authors show how simulations of TWS anomalies (TWSAs) from multiple meteorological forcings and multiple LSMs can be combined in a Bayesian model averaging (BMA) ensemble approach to improve monitoring and predictions. Simulations using three forcing datasets and two LSMs were conducted over mainland China for the period 1979–2008. All the simulations showed good temporal correlations with satellite observations from the Gravity Recovery and Climate Experiment during 2004–08. The correlation coefficient ranged between 0.5 and 0.8 in the humid regions (e.g., the Yangtze river basin, Huaihe basin, and Zhujiang basin), but was much lower in the arid regions (e.g., the Heihe basin and Tarim river basin). The BMA ensemble approach performed better than all individual member simulations. It captured the spatial distribution and temporal variations of TWSAs over mainland China and the eight major river basins very well; plus, it showed the highest R value (> 0.5) over most basins and the lowest root-mean-square error value (< 40 mm) in all basins of China. The good performance of the BMA ensemble approach shows that it is a promising way to reproduce long-term, high-resolution spatial and temporal TWSA data. Graphical Abstract


Introduction
Terrestrial water storage (TWS), which includes soil moisture (SM), groundwater storage (GWS), surface water (SW), vegetation water (VW), snow and ice, plays a fundamental role in the global hydrological cycle (Philippon and Fontaine 2002;Grippa et al. 2011). TWS anomalies (TWSAs) can affect the hydrological cycle through SM feedbacks to the atmosphere. TWS acts as a 'memory' component of the terrestrial hydrological cycle within the climate system through land-atmosphere interactions due to its unique dynamics (Koster and Suarez 2001). There is a strong correspondence between TWSAs and extreme hydrological events (e.g., droughts and floods). It is thus very important to obtain long-term, accurate, high-resolution spatial and temporal TWSA information to monitor and predict extreme hydrological events, as well as to evaluate the potential impact of these variations on climate change (Dirmeyer, Schlosser, and Brubaker 2009;Chen et al. 2010;Koster et al. 2010).
Three methods are traditionally used to obtain TWSA estimates: satellite observations of gravity anomalies, such as the Gravity Recovery and Climate Experiment (GRACE) satellite data (Syed et al. 2008;Chen et al. 2010); the water balance method (Zeng et al. 2007); and land surface models (LSMs) (Grippa et al. 2011;Sun, Xie, and Tian 2015). Each method presents advantages and disadvantages. At present, mechanistic LSMs are widely used to estimate land hydrological variables such as SM, DW, and evapotranspiration at global and continental scales (Wang and Zeng 2011;Wang and Dickinson 2012); however, LSM simulations of land hydrological variables are still very inaccurate. The main sources of error in offline LSM simulations of land hydrological variables are the uncertainties in both meteorological forcing data and land surface parameterizations (Dirmeyer, Dolman, and Sato 1999;Wang and Zeng 2011;Wang, Lettenmaier, and Sheffield 2011). Ensemble simulations combining multiple LSMs, or multiple forcings, have been found to perform better than a single LSM or forcing dataset in most cases (Guo et al. 2007;Liu et al. 2016). However, whilst ensemble simulations combining multiple LSMs or multiple forcings using the simple arithmetical average ensemble method have been shown to outperform most individual member simulations over most subregions, they were still found to be inferior to the best individual ensemble member in most cases (Guo et al. 2007;Liu and Xie 2013).
Bayesian model averaging (BMA) was proposed by Raftery et al. (2005) as a statistical post-processing procedure that infers consensus predictions by weighting individual member predictions based on their probabilistic likelihood measures, with the better performing predictions receiving higher weights in the ensemble predictions (Duan et al. 2007;Li, Zeng, and Li 2009). Many studies have demonstrated that this approach is superior to the simple arithmetical averaging method, and provides a quantitative description of total predictive uncertainty through the probability density function (PDF), especially in numerical weather prediction (Raftery et al. 2005;Duan et al. 2007;Liu and Xie 2013;Li and Lin 2015;Jia and Xie 2016). However, few studies have focused on applying the BMA ensemble approach to multiple LSMs and multiple forcing ensemble simulations. An approach considering the uncertainties from the forcings and LSM simultaneously has the potential to improve TWSA simulations. In the present work, four simulations using three meteorological forcings to drive two LSMs over mainland China were merged using the BMA ensemble approach, and a comparison was conducted between the four individual simulations, the BMA ensemble simulations, and the GRACE satellite observed TWSA data.

Meteorological forcings
In this study, we used three meteorological forcing datasets developed by different institutions to drive LSMs over mainland China: (1) Qian (Qian et al. 2006;Tian et al. 2010); (2) Princeton (Sheffield, Goteti, and Wood 2006); (3) Institute of Tibetan Plateau Research, Chinese Academy of Sciences (ITPCAS) (He 2010). The three meteorological forcing datasets are the same as employed in Liu et al. (2016). Table 1 in Liu et al. (2016) summarizes the primary features and the differences between these forcing datasets.

LSMs
The two LSMs used in this study were version 3. 5 of the Community Land Model (CLM3.5) (Oleson et al. 2007), and version 4.5 (CLM4.5) (Oleson et al. 2013), released by the National Center for Atmospheric Research. In these two LSMs, TWS is the sum of GWS, SM, snow water equivalent (SWE), VW, and SW. These hydrological variables are the main variables in land hydrological processes. The TWSA simulated by the two LSMs assumes that the contribution of VW and SW are Table 1. Comparison of the major features of LSM structures and hydrological schemes.

Model
Major hydrology scheme Reference CLM3.5 (1) The total soil column of 0-3.43 m is divided into 10 layers with variable thickness ranging from 0.0175 to 1.1370 m (2) Scaling of canopy interception (3) A combination of the TOPMODEL and BATS runoff scheme (4) A snow submodel with up to five layers depending on the total snow depth (5) A simple groundwater model for determining water table depth Oleson et al. (2007); Niu et al. (2005Niu et al. ( , 2007; Lawrence and Chase (2007) CLM4.5 (1) The total soil column 0-42.1032 m is divided into 15 layers with variable thickness ranging from 0.0175 to 13.8512 m (2) New treatments of soil column-groundwater interactions and soil evaporation (3) Revised hydraulic properties of frozen soils, increasing the consistency between soil water state and water table position and allowing for a perched water table above icy permafrost ground (4) Revised snow model (5) Introduction of a surface water store (6) Refinement of global PFT, wetland, and lake distributions, and more realistic optical properties for grasslands and croplands (7) Addition of an urban canyon model Lawrence et al. (2012); Swenson and Lawrence (2012); Oleson et al. (2013) negligible. In this study, simulated TWS values in each individual member model were converted to TWSA by subtracting the 1979-2008 modeled mean TWS. Table 1 summarizes the primary features and the differences between the two LSMs, as well as the sources of the vegetation and soil parameters.

GRACE data
Due to the lack of direct TWSA observations, we used GRACE data to compare and evaluate our five TWSA simulations (i.e., four individual member simulations and their BMA ensemble simulation). Since the GRACE satellite launched in 2002, many studies have used GRACE data for various hydrological applications (e.g., Chen et al. 2010;Grippa et al. 2011). The GRACE satellite mission provides observations of land gravity changes, from which TWS changes can be derived. Different methods can be used to retrieve the TWS from satellite gravity observations, and these different methods result in different GRACE TWSA datasets (Rowlands et al. 2005;Chambers 2006;Bruinsma et al. 2010). In our study, the GRACE-derived TWSA data were provided by the Tellus product processed by NASA's Jet Propulsion Laboratory, which were smoothed with a 300 km radius and converted to a grid format at a monthly temporal and 1°× 1°spatial resolution (Wahr et al. 2004;Swenson and Wahr 2006;Sun, Xie, and Tian 2015). The five-year complete dataset for 2004-08 was used for this study.

Experimental setup
The four individual ensemble member TWSA simulations were derived from two different LSMs (described in Table 1) driven by three different forcing datasets, which we coded as: (1) Qian_CLM3.5; (2) Prin_CLM3.5; (3) ITP_CLM3.5; and (4) ITP_ CLM4.5. These simulations were all run at a resolution of 1°× 1°. The four sets of simulated TWSA data were then merged using the BMA ensemble method, referred to simply as 'BMA'. Table 2 gives more specific information on the experimental design.

Ensemble approach
In this study, an advanced BMA ensemble approach was used. The BMA approach employed in this study is the same as employed in Liu and Xie (2013). The description of the method parallels that of Liu and Xie (2013), as follows in the next two paragraphs.
The BMA approach is a weighted average of PDFs centered on the bias-corrected predictions from a set of individual ensemble members: where y is the predictive variable; f k ; k ¼ 1; 2; Á Á Á ; K is the kth ensemble member forecasts; K is the number of ensemble members being combined; y T are the training data; and p k ðyjðf k ; y T ÞÞ is the conditional PDF of y based on ensemble member f k , given that f k is the best forecast in the ensemble. The weighting w k is the posterior probability of the forecast, such that it is non-negative and P K k¼1 w k ¼ 1, and represents the contribution of ensemble member k to the predictive skill of the ensemble. In the BMA we developed and used, we assumed that the conditional PDF p k ðyjðf k ; y T ÞÞ from each ensemble member at the specific time and location was approximated as a gamma distribution: where Γðα k Þ is the gamma function. The shape parameter α k and scale parameter β k of the gamma distribution are given by: where μ k and σ 2 k are the mean and variance of the gamma distribution, and b 0k ; b 1k ; c 0 ; c 1 are the parameters to be estimated. The mean of the BMA posterior PDF is the deterministic forecast. In this study, we considered only the deterministic forecast of the BMA method for the TWSA. The BMA weights ω k and parameters b 0k ; b 1k ; c 0 ; c 1 in this case were estimated from a training dataset using the linear regression method and the maximum likelihood technique, for which the values were obtained iteratively using the modified Markov chain Monte Carlo algorithm. In the present study, 2004-06 was chosen as the training period, and 2007-08 as the evaluation period.

Spatial distribution of TWSAs
To examine the performance of the BMA ensemble simulation of TWSAs, we first compared the four simulations and their BMA ensemble simulation of the mean TWSA with GRACE data from 2004-08. Figure 1 shows the spatial distribution of the five-year (2004-08) averaged TWSA in China derived from Qian_CLM3.5, Prin_CLM3.5, ITP_CLM3.5, ITP_CLM4.5, BMA, and GRACE data. The GRACE data exhibited strong regional variations and there were several obvious negative anomaly centers, such as the Beijing-Tianjin-Hebei region, the northern region of Northeast China, and the northwestern region of Xinjiang Province; and positive anomaly centers, such as the southwestern region of China and the mid-northern Tibetan Plateau (Figure 1(f)). Among all individual member simulations, only ITP_CLM3.5 captured these negative and positive anomaly centers, although the magnitude differed from the GRACE data. Other individual member simulations did not follow the pattern observed in the GRACE data (Figure 1(a-d)). The BMA ensemble simulation captured the TWSA spatial distribution pattern in China very well, and with similar magnitude, compared with the GRACE data (Figure 1(e)).

TWSA temporal variation
To evaluate the performance of the BMA ensemble simulated TWSA quantitatively, we compared the individual model outputs, BMA, and GRACE data time series averaged over the eight major basins of mainland China for the period 2004-08. Figure 2 shows the comparisons between ITP_CLM3.5, BMA, and GRACE data. Due to the large difference in magnitude between Qian_CLM3.5, Prin_CLM3.5, ITP_CLM4.5 and the GRACE data, Qian_CLM3.5, Prin_CLM3.5 and ITP_CLM4.5 are not shown in Figure 2. ITP_CLM3.5 and BMA captured the seasonal cycle and temporal evolution of the GRACE data very well over most basins, but ITP_CLM3.5 showed a higher bias than the BMA ensemble simulation. Qian_CLM3.5, Prin_CLM3.5, and ITP_CLM4.5 also captured the seasonal cycle of the GRACE data well over most basins, but showed a higher bias compared to ITP_CLM3.5. It should be noted that LSM simulations usually reproduce anomalies and seasonal variations but fail to simulate the mean SM (Qian et al. 2006;Liu and Xie 2013). By contrast, the BMA ensemble modeling did simulate the mean TWSA value.

Statistical comparisons between simulations and GRACE data
To further evaluate the performance of the BMA ensemble approach in simulating TWSAs, we compared the correlation coefficient (R), root-mean-square error (RMSE), and normalized standard deviation (SDV) of the five simulations. R and SDV were plotted on twodimensional Taylor diagrams (Taylor 2001). The statistical post-processing component of the BMA approach requires training data to calibrate the BMA model parameters. We chose 2004-06 as the training period and 2007-08 as the evaluation period. Figure 3 shows the statistical scores (R and RMSE) for Prin_CLM3.5, ITP_CLM3.5, ITP_CLM4.5, and BMA over the validation period (2007-08) in the eight major basins of China (due to the differences in magnitude, Qian_CLM3.5 is not shown in Figure 2). In the humid areas (e.g., the  Yangtze river basin, Huaihe basin, and Zhujiang basin), all the simulations showed good temporal correlation (R), ranging from 0.5 to 0.8. However, the models performed worse in the arid areas (e.g., the Heihe basin and Tarim river basin) than in the humid areas. This may be explained by the impact of human activities, such as the extraction of groundwater for domestic, industrial, and agricultural water consumption. In the individual member experiments, no single model performed optimally in all the basins. The BMA ensemble simulation showed the highest R value (> 0.5) over most basins of China and the smallest RMSE value (< 40 mm) over all the basins of China, and performed better than all individual member simulations. Figure 4 shows a Taylor diagram comparing the five experiments with the observed GRACE data over the eight major basins of China. ITP_CLM3.5 generally performed best among the four individual member simulations, and the BMA ensemble simulation performed best among all five experiments in most cases. The BMA ensemble approach improved the quality of the simulated TWSA significantly, not only by improving the simulation of the spatial ( Figure 1) and temporal (Figure 2) variations, but also by reducing the mean bias (Figure 3).

Discussion
Uncertainties in terrestrial water cycle simulations by offline LSMs result from uncertainties in LSM parameterizations, meteorological forcings, and land surface parameter information. We need to reduce these uncertainties in order to improve terrestrial water cycle modeling. The BMA ensemble approach shows promise as a way of reproducing long-term, highresolution, spatial and temporal TWSA data, and could be useful in supplementing the current lack of long-term TWSA data.
Some limitations of our study should be noted. Firstly, the BMA method depends on the accuracy of observations to provide an adequate training dataset for calibrating the model parameters. We assumed that the GRACE data were true values. In fact, different retrieval methods applied to the land gravity changes detected by GRACE result in different GRACE TWSA data. These uncertainties were not quantified in this study. Secondly, the TWSA is the sum of changes in GWS, SM, SWE, VW, and SW. The LSMs in this study neglected the contribution of VW and SW in TWSA estimates. Thirdly, the approach considers uncertainties in LSM parameterization and meteorological forcing, but does not consider uncertainties in land surface parameter information.
In this study, we considered only the deterministic forecast of the BMA method for the TWSA. The BMA approach can produce calibrated and sharply predictive PDFs from the ensembles of dynamic models and provide probability forecasting, as pointed out by Liu and Xie (2014) and Li and Lin (2015). In future work, we intend to extend the BMA TWSA simulation to probability prediction, and investigate the warning schemes of extreme events such as floods and droughts, based on the BMA TWSA prediction.

Summary and conclusion
In this study, we conducted simulations using two LSMs and three sets of meteorological forcing data over mainland China during the period 1979-2008. The four TWSA outputs from the simulations were then merged using a BMA ensemble approach. We compared these five simulated TWSA datasets with GRACE TWSA satellite data over mainland China and the eight major river basins of China during 2004-08. We evaluated the extent to which the quality of TWSA modeling was improved by using the BMA ensemble approach. Our main conclusions are as follows: Of the four individual member simulations of the TWSA using the three individual forcings and two LSMs, ITP_CLM3.5 best captured the spatial pattern and temporal variation of the TWSA over most basins of mainland China, and performed better than the other three individual member simulations, but it still generated mean bias when compared with the GRACE data. In general, ITP_CLM3.5 performed best and Qian_CLM3.5 performed worst in most basins. This result is associated with the quality of the meteorological forcing and land surface parameterizations.
The BMA ensemble approach combining the four individual member simulations significantly improved the ability to accurately simulate the TWSA. Not only does the approach reproduce the spatial and temporal variation of the TWSA, but it can also model the mean value of the TWSA. This approach is a promising way to accurately model the mean value and variations of the TWSA in China.