Evaluating observed versus predicted forest biomass: R-squared, index of agreement or maximal information coefficient?

ABSTRACT The accurate prediction of forest above-ground biomass is nowadays key to implementing climate change mitigation policies, such as reducing emissions from deforestation and forest degradation. In this context, the coefficient of determination ($R^2$) is widely used as a means of evaluating the proportion of variance in the dependent variable explained by a model. However, the validity of $R^2$ for comparing observed versus predicted values has been challenged in the presence of bias, for instance in remote sensing predictions of forest biomass. We tested suitable alternatives, e.g. the index of agreement ($d$) and the maximal information coefficient ($MIC$). Our results show that $d$ renders systematically higher values than $R^2$, and may easily lead to regarding as reliable models that include an unrealistic number of predictors. Results seemed better for $MIC$, although $MIC$ favoured local clustering of predictions, whether or not they corresponded to the observations. Moreover, $R^2$ was more sensitive to the use of cross-validation than $d$ or $MIC$, and more robust against overfitted models. Therefore, we discourage the use of statistical measures alternative to $R^2$ for evaluating model predictions versus observed values, at least in the context of assessing the reliability of modelled biomass predictions using remote sensing. For those who consider $d$ to be conceptually superior to $R^2$, we suggest using its square ($d^2$), in order to be more analogous to $R^2$ and hence facilitate comparison across studies.

The correct evaluation of AGB models is key to REDD, as errors may propagate when AGB estimations are upscaled (Chave et al., 2004; Molto, Rossi, & Blanc, 2013). However, there is at the moment a critical lack of consensus on good practices for predicting forest AGB, both in the determination of allometric biomass equations (Sileshi, 2014; Temesgen, Affleck, Poudel, Gray, & Sessions, 2015) and in the evaluation of AGB predictions using remote sensing methods (Loew et al., 2017; Valbuena et al., 2017a). The large variety of modelling methods and approaches for evaluating their accuracy complicates meta-analyses comparing various approaches (Loew et al., 2017; Zolkos et al., 2013). In a recent article (Valbuena et al., 2017a), we focused on the evaluation of final AGB predictions from remote sensing with measures of accuracy and precision. We recommended the use of Piñeiro, Perelman, Guerschman, and Paruelo's (2008) hypothesis test, considered the usability of Theil's (1958) partial inequality coefficients, and suggested means for preventing overfitting to the sample (Valbuena et al., 2017a). In this paper, we focus on the evaluation of agreement between the AGB observed and the AGB predicted by the model.
The agreement between the observed and predicted values is an important indicator in model evaluation, and the best approach for its assessment is a subject of current discussion (Duveiller, Fasbender, & Meroni, 2016; Loew et al., 2017; Simon & Tibshirani, 2011; Willmott, Robeson, & Matsuura, 2012). The coefficient of determination ($R^2$) is widely used as the most suitable statistic for describing the agreement of model predictions with those observed empirically. $R^2$ expresses the proportion of variance in the dependent variable explained by a model. However, the use of $R^2$ in performance evaluation for predictive methods has been criticized for not being necessarily related to the accuracy of predictions (e.g. Fox, 1981; Paruelo, Jobbágy, Sala, Lauenroth, & Burke, 1998; Willmott, 1982). Alternatives have been proposed, such as the index of agreement ($d$), which can be interpreted as the relative prediction error (Willmott, 1981; Willmott & Wicks, 1980). Since then, it has been popular in a variety of fields such as hydrology (Legates & McCabe, 1999; López-Moreno, Latron, & Lehmann, 2010), agriculture (Nendel et al., 2013), neurology (Ganpule et al., 2017), meteorology (Aschonitis et al., 2017; Morsy, El-Sayed, & Ouda, 2016), forest ecology (Ibrom et al., 2007; Ward, Bell, Clark, & Ram, 2013), and climate change (Bring & Destouni, 2014; Gaitan, Hsieh, & Cannon, 2014; Oyler, Ballantyne, Jencso, Sweet, & Running, 2015). The use of Willmott's $d$ in the remote sensing literature has, however, been marginal (Almeida et al., 2016; García et al., 2010; Grzegozewski, Johann, Uribe-Opazo, Mercante, & Coutinho, 2016; Wachholz de Souza, Mercante, Johann, Camargo Lamparelli, & Uribe-Opazo, 2015; Yebra & Chuvieco, 2009). Since many of these authors recommended the use of $d$ over $R^2$, this research was devoted to studying the convenience of using $d$ for reporting the agreement between remote sensing AGB predictions and their observed values. In addition to $d$, we postulated that using the recently developed maximal information coefficient ($MIC$) (Reshef et al., 2011) as a means for evaluating observed versus predicted values may be advantageous. $MIC$ was conceived as a non-parametric alternative to correlation, in the context of evaluating relationships between explanatory and response variables (Speed, 2011). Despite being very recent, it has already been found useful in a wide variety of contexts, such as microbiology (Thomas, Bordron, Eveillard, & Michel, 2017), medicine (Vallières et al., 2017), big data analysis (Chen & Yang, 2016), and also remote sensing (Görgens, Valbuena, & Rodríguez, 2017). Although it has been suggested that the robustness of this statistic against noise should be studied in terms of measuring the predictive accuracy of models (Loew et al., 2017; Murrell, Murrell, & Murrell, 2014), to our knowledge no previous study has considered the suitability of $MIC$ in the context of evaluating the agreement between modelled predictions of AGB and their corresponding observed values.
In this article, we consider the adequacy of using alternatives to $R^2$ for evaluating the agreement between observed and predicted values. The alternatives considered were $d$ and $MIC$, and the study was conceived as a further aspect of prediction accuracy to be revised in the context of AGB predictions from remote sensing, additional to those outlined in Valbuena et al. (2017a). We therefore followed the same experimental design, to allow direct comparison against additional measures of accuracy: results are presented for unviable modelling approaches alongside correct ones, in order to highlight whether the alternative measures of agreement reliably distinguish between them.

Modelling alternatives compared
The values of $R^2$, $d$ and $MIC$ calculated in this study were obtained for the same modelling alternatives reported in Valbuena et al. (2017a), in order to allow direct comparisons according to the conclusions reached in that study. These consisted of predictions of plot-level AGB using parametric models and a non-parametric method: best-subset, step-wise and most similar neighbours, all of them including 'restricted' versions according to Piñeiro et al. (2008) and Valbuena et al. (2017a). All computations were carried out using the R statistical environment (version 3.3.1; R Development Core Team, 2016). The parametric models were power models adjusted in their linear form, i.e. using a natural logarithm transform of the response variable (e.g. Asner & Mascaro, 2014; Hudak et al., 2006; Naesset, 2002), which were bias-corrected when transformed back to the original scale (Baskerville, 1972; Sprugel, 1983). A first alternative, henceforth denominated "best-subset" (Hudak et al., 2006; Miller, 1984), selected model predictors via exhaustive evaluation of all possible combinations using the package "leaps" of R (Lumley & Miller, 2009), with minimization of Mallows' Cp (Mallows, 1973) as the predictor selection criterion. Next, there was a modification of this alternative, henceforth denominated "best-subset restricted overfitting" (Valbuena et al., 2017a). It consisted of further constraining the "best-subset" result by restricting the inflation of the sum of squares in cross-validation to 10%, to avoid models overfitted to the sample (Ehrenberg, 1982; Lipovetsky, 2013; Weisberg, 1985), plus a positive result in Piñeiro et al.'s (2008) hypothesis test applied to the observed versus predicted fit. The next alternative employed the same power model, but variables were selected according to a step-wise procedure (Naesset, 2002; Weisberg, 1985) and the corrected Akaike Information Criterion (AIC) (Burnham & Anderson, 2002; Sugiura, 1978), as implemented in the function "stepAIC" of R (Venables & Ripley, 2002). We denominate this method "step-wise", and we also present a "step-wise restricted overfitting" modification using the same hypothesis test and the limiting criterion of 10% inflation in the sums of squares (Valbuena et al., 2017a).
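As an illustration of the two parametric selection routes, the following sketch uses the R packages cited above. The data frame plots and its predictor names are hypothetical, and the corrected AIC used in the study is simplified here to the default AIC of stepAIC(); this is our own minimal sketch, not the study's code.

library(leaps)   # exhaustive best-subset search (Lumley & Miller, 2009)
library(MASS)    # stepAIC() (Venables & Ripley, 2002)

# Power model fitted in its linear form: ln(AGB) on the candidate metrics
plots$lnAGB <- log(plots$AGB)

# "Best-subset": evaluate all predictor combinations, minimize Mallows' Cp
subsets <- regsubsets(lnAGB ~ . - AGB, data = plots, nvmax = 10,
                      method = "exhaustive")
best <- which.min(summary(subsets)$cp)   # model size minimizing Cp
coef(subsets, best)                      # selected predictors

# "Step-wise": AIC-guided selection starting from the full model
full <- lm(lnAGB ~ . - AGB, data = plots)
sw   <- stepAIC(full, direction = "both", trace = FALSE)

# Back-transformation with bias correction (Baskerville, 1972)
agb_hat <- exp(fitted(sw) + summary(sw)$sigma^2 / 2)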
The other alternative employed was the non-parametric prediction. This was based on the most similar neighbour (MSN) method, one type of machine learning approach within the so-called group of nearest neighbour methods (Franco-Lopez et al., 2001; McInerney, Suárez, Valbuena, & Nieuwenhuis, 2010; McRoberts, Nelson, & Wendt, 2002). MSN predictions were carried out using the "yaImpute" package of R (Crookston & Finley, 2007). We set the algorithm to use inverse distance weighting averaging of three neighbours, based on previous experience (Almeida et al., 2016; Eskelson et al., 2009; Valbuena, Vauhkonen, Packalén, Pitkanen, & Maltamo, 2014). In the case of MSN, we employed canonical regression analysis (Cohen, Maiersperger, Gower, & Turner, 2003; Manzanera et al., 2016) to recursively select the predictors. The final selection criterion was the minimization of the root mean squared error, with the same above-mentioned restrictions: Piñeiro et al.'s (2008) hypothesis test and the avoidance of overfitting (Valbuena et al., 2017a). Additionally, we calculated results for a wide range of numbers of predictors, $p = 1, \dots, 30$, with the intention of highlighting whether the chosen statistics of agreement between observed and predicted would show any differences for unrealistically low $n/p$ ratios.
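A minimal sketch of the MSN prediction step follows, assuming the same hypothetical plots data frame, with X holding predictors already retained by the canonical regression analysis and Y the plot-level AGB response.

library(yaImpute)  # most similar neighbour imputation (Crookston & Finley, 2007)

X <- plots[, c("m1", "m2", "m3")]   # hypothetical selected predictors
Y <- plots[, "AGB", drop = FALSE]   # response: plot-level AGB

# MSN: neighbours are found in the canonical correlation space
msn <- yai(x = X, y = Y, method = "msn", k = 3)

# Inverse distance weighting of the three most similar neighbours
pred <- impute(msn, k = 3, method = "dstWeighted")
head(pred$AGB)                      # plot-level AGB predictions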

Evaluating the agreement between observed and predicted AGB
For all the modelling alternatives considered, we computed predictions using two options: (1) the model fit employing the entire dataset of $i = 1, \dots, n$ forest plots used to train the model (i.e. model residuals without external validation), and (2) a leave-one-out cross-validation removing each case $i$ from the training data as a prior step to the whole modelling process. In the results, these two options are denoted using superscripts (or subscripts) $fit$ and $cv$, respectively. Thus, the training dataset was the vector of measured plot-level AGB values $O = \{obs_1, \dots, obs_n\}$, from which we obtained the vector of modelled predictions $P = \{pre_1, \dots, pre_n\}$. These predictions were either those fitted in model training ($pre_i^{fit}$) or those calculated using cross-validation ($pre_i^{cv}$), since both were employed to compute the range of statistical measures considered for describing the agreement between observed and predicted. Therefore, $pre_i$ denotes either $pre_i^{fit}$ or $pre_i^{cv}$ in the following equations detailing the measures of agreement considered. (1) The coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (obs_i - pre_i)^2}{\sum_{i=1}^{n} (obs_i - \overline{obs})^2} \qquad (1)$$

(2) Willmott's index of agreement, which was originally suggested to be (Willmott & Wicks, 1980):

$$d = 1 - \frac{\sum_{i=1}^{n} (pre_i - obs_i)^2}{\sum_{i=1}^{n} \left( |pre_i - \overline{obs}| + |obs_i - \overline{obs}| \right)^2} \qquad (2)$$

It was later suggested, however, that squaring the residuals may be an inconvenient approach (Willmott et al., 1985), and therefore the following refinement was proposed:

$$d_1 = 1 - \frac{\sum_{i=1}^{n} |pre_i - obs_i|}{\sum_{i=1}^{n} \left( |pre_i - \overline{obs}| + |obs_i - \overline{obs}| \right)} \qquad (3)$$

And another further modification was suggested more recently (Willmott et al., 2012):

$$d_r = 1 - \frac{\sum_{i=1}^{n} |pre_i - obs_i|}{2 \sum_{i=1}^{n} |obs_i - \overline{obs}|} \qquad (4)$$

The full version of the refined index of agreement $d_r$ inverts this fraction and subtracts one from it whenever $\sum_{i=1}^{n} |pre_i - obs_i| > 2 \sum_{i=1}^{n} |obs_i - \overline{obs}|$ (Willmott et al., 2012: Equation (5)), in order to accommodate poorly performing models. This was not necessary in our case, since the agreement between observed and predicted was sufficiently high for all the modelling alternatives considered.
(3) The maximal information coefficient, which measures the entropy of the relationship (Reshef et al., 2011; Speed, 2011). It is based on the naïve mutual information $MI(P, O)$ (Linfoot, 1957). The computation of $MIC$ involves binning $P$ and $O$ into a grid, calculating the relative abundances $p(pre, obs)$ at each cell of the resulting grid, and considering the overall uncertainty in their relationship using Shannon's (1948) entropy:

$$MI(P, O) = \sum_{P, O} p(pre, obs) \log_2 \frac{p(pre, obs)}{p(pre)\, p(obs)} \qquad (5)$$

$MIC$ is the maximum of $MI(P, O)$ over all grids whose dimensions $n_P \times n_O$ remain below a sample-size-dependent limit $B(n)$, normalized so that it ranges between 0 and 1:

$$MIC(P, O) = \max_{n_P \times n_O < B(n)} \frac{MI(P, O)}{\log_2 \min(n_P, n_O)} \qquad (6)$$

We employed the R implementation for $MIC$ computation available in the package "minerva" (Filosi, Visintainer, & Albanese, 2014). While the calculation of $d$ resembles that of a sample Pearson's correlation coefficient ($r$), $MIC$ is customarily interpreted as a non-parametric version of a correlation coefficient (Görgens et al., 2017). Also, Simon and Tibshirani (2011) recommended the use of Brownian distance correlation (Székely & Rizzo, 2009) over $MIC$. For these reasons, we also calculated the following statistics from our dataset, for additional discussion on the most convenient approach for assessing the agreement between observed versus predicted in model evaluation: (4) the coefficient of correlation between observed and predicted values, calculated as Pearson's product-moment correlation:

$$r = \frac{\sum_{i=1}^{n} (obs_i - \overline{obs})(pre_i - \overline{pre})}{\sqrt{\sum_{i=1}^{n} (obs_i - \overline{obs})^2 \sum_{i=1}^{n} (pre_i - \overline{pre})^2}} \qquad (7)$$

And (5) the distance correlation ($dCor$), an approach to evaluating the relationship between two variables based on their energy (Székely & Rizzo, 2017), i.e., in this case, the distances between $(P, O)$ data pairs in the metric space. We employed the distance correlation statistic implemented in the package "energy" of R (Rizzo & Szekely, 2017).
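For reference, all of these statistics can be computed in R from the vectors of observations and predictions. The sketch below is our own minimal implementation of Equations (1)–(4), alongside the packaged implementations named above for $MIC$ and $dCor$; the function and argument names are ours, not the study's code.

library(minerva)  # maximal information coefficient (Filosi et al., 2014)
library(energy)   # distance correlation (Rizzo & Szekely, 2017)

agreement <- function(obs, pre) {
  o <- mean(obs)
  c(
    R2   = 1 - sum((obs - pre)^2) / sum((obs - o)^2),        # Equation (1)
    d    = 1 - sum((pre - obs)^2) /
               sum((abs(pre - o) + abs(obs - o))^2),         # Equation (2)
    d1   = 1 - sum(abs(pre - obs)) /
               sum(abs(pre - o) + abs(obs - o)),             # Equation (3)
    # Equation (4): assumes sum|pre - obs| <= 2 * sum|obs - mean(obs)|,
    # as was the case for all modelling alternatives in this study
    dr   = 1 - sum(abs(pre - obs)) / (2 * sum(abs(obs - o))),
    r    = cor(obs, pre),                                    # Equation (7)
    MIC  = minerva::mine(x = pre, y = obs)$MIC,              # Equation (6)
    dCor = energy::dcor(pre, obs)
  )
}

Multiplying these values by 100 yields the percentage units reported in Table 1 and Figures 1–3.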
All the alternatives for evaluating the degree of agreement between observed and predicted were calculated both from the model fit predictions ($P^{fit}$) and the cross-validated versions ($P^{cv}$): $R^2_{fit}$ and $R^2_{cv}$ (Equation (1)), $d_{fit}$ and $d_{cv}$ (Equation (2)), $MIC_{fit}$ and $MIC_{cv}$ (Equation (6)), and the additional statistics of correlation $r_{fit}$, $r_{cv}$, $dCor_{fit}$ and $dCor_{cv}$. The relative merits of each of the proposed statistical measures of agreement between predicted and observed were evaluated by analysing the results that the different alternative prediction methods provided for the same dataset. We also compared results obtained while increasing the number of predictors in MSN. Taking into account that many of the modelling alternatives had already been ruled out as unreliable by additional statistical criteria (Valbuena et al., 2017a), our objective was to observe whether any of the measures of agreement would reveal similar conclusions or otherwise conceal the unreliability of predictions. Spearman's rank correlation coefficient (ρ) was employed to assess redundancy among these measures, since it tests whether two measures would rank the alternatives in a similar manner; the skeleton below sketches this evaluation loop.
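In the following schematic of the leave-one-out procedure and the redundancy check, fit_and_predict() and stats_by_alternative are hypothetical stand-ins for the modelling alternatives described above and for the resulting table of coefficients; neither is a function or object from the original study.

# Leave-one-out cross-validation: the whole modelling process (predictor
# selection included) is repeated without plot i, which is then predicted
pre_cv <- sapply(seq_len(nrow(plots)), function(i) {
  fit_and_predict(train = plots[-i, ], target = plots[i, ])
})
cv_stats <- agreement(obs = plots$AGB, pre = pre_cv)

# Redundancy among measures: Spearman's rho tests whether two statistics
# rank the modelling alternatives in the same order
cor(stats_by_alternative[, "d_cv"], stats_by_alternative[, "R2_cv"],
    method = "spearman")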

Results
All the statistical measures of agreement between observed and predicted that we considered rated most modelling alternatives as reliable. Table 1 compares the results from the different prediction method alternatives. The values of all measures have been expressed in percentage units, so that they can be interpreted as the proportion of variance in the observed values that is explained by the model predictions, as $R^2$ is interpreted. It is worth noting the high values obtained by the unrestricted version of the step-wise selection, which unrealistically resulted in an over-parameterised model with $p = 23$ predictors in spite of the use of AIC. Figure 1 shows bar plots comparing the results for the parametric power models, according to whether they included or lacked the overfitting restrictions to variable selection. The versions which restricted the overfitting were more realistic. Willmott's $d$ showed less contrast among methods than $R^2$ or $MIC$. It also showed smaller differences between values obtained using the whole dataset and cross-validated results.
[Table 1 legend: $R^2_{fit}$: residual coefficient of determination (Equation (1)). $R^2_{cv}$: cross-validated coefficient of determination (Equation (1)). $d_{fit}$: residual Willmott's index of agreement (Equation (2)). $d_{cv}$: cross-validated Willmott's index of agreement (Equation (2)). $MIC_{fit}$: residual maximal information coefficient (Equation (6)). $MIC_{cv}$: cross-validated maximal information coefficient (Equation (6)). Coefficients have been multiplied by 100 to yield percentage units. *The actual predictors selected with each method are detailed in Appendix A.]

The effects of increasing $p$ in MSN predictions are shown in Figure 2. The cross-validated coefficients of determination ranged $R^2_{cv} = 63.7$–$87.1\%$ for $p = 5$–$28$, dropping, however, as low as $R^2_{cv} = 33.1\%$ for $p = 30$ or $R^2_{cv} = 38.7\%$ for $p = 2$. Results in Figure 2 suggest that for small $p$ there may be little divergence between coefficients of determination obtained from the model fit ($R^2_{fit}$) and from cross-validated values ($R^2_{cv}$). Willmott's index of agreement $d$ showed patterns similar to $R^2_{cv}$ (Figure 2), and they were highly correlated ($\rho(d_{cv}, R^2_{cv}) = 0.99$). This was not the case for $MIC_{cv}$ ($\rho(MIC_{cv}, R^2_{cv}) = 0.65$), which was therefore less redundant than $d$. The difference between non-cross-validated and cross-validated results was, in general, more pronounced for $R^2$ or $MIC$ than for $d$, for the parametric models (Figure 1) as well as for the various MSN predictions (Figure 2).
In light of the results obtained in Figure 2, we deduced that $d_{cv}$ was systematically higher than $R^2_{cv}$, ranging $d = 59.2$–$93.8\%$ across all $p = 1$–$30$. Hence, the proportion of explained variance may simply seem larger when using $d$ instead of $R^2$. For this reason, we further investigated the convenience of using the alternative formulations of Willmott's index of agreement, $d_1$ and $d_r$, which are all compared in Figure 3. Results were again either just lower or higher, but still redundant with the information already provided by $R^2$, since $\rho(d_1^{cv}, R^2_{cv}) = 0.99$ and $\rho(d_r^{cv}, R^2_{cv}) = 0.98$. Moreover, the latest refined version of Willmott's index, $d_r^{cv}$, showed less contrast among alternatives than $R^2_{cv}$, i.e. smoother fluctuations along increasing $p$, which rendered it less useful for discriminating the reliability of predictions. Also, when fitting the whole training dataset with very high $p$, these modified versions, $d_1^{fit}$ and $d_r^{fit}$, obtained very low values, which seems unfair in light of the results obtained by $R^2_{fit}$, $d_{fit}$ and $MIC_{fit}$ for $p > 22$ in Table 1.

Discussion
In studies carrying out remote sensing AGB predictions, the most popular statistic employed to evaluate the agreement between observed and predicted is undoubtedly $R^2$ (e.g. Bright et al., 2012; Chen & Zhu, 2013; d'Oliveira et al., 2012; Erdody & Moskal, 2010; Hudak et al., 2006; Latifi et al., 2015; McInerney et al., 2010; Naesset, 2002; Straub, Tian, Seitz, & Reinartz, 2013; Valbuena et al., 2014; Wing et al., 2012; Zhao, Popescu, & Nelson, 2009). It is, however, well known that in the presence of bias $R^2$ may be high despite a poor correspondence between observed and predicted (see, e.g., Valbuena et al., 2014; White, Coops, & Scott, 2000). For this reason, it has been argued that $R^2$ may not necessarily be related to the accuracy of predictions (Paruelo et al., 1998), and hence Willmott's (1981) $d$ has been suggested as a more appropriate alternative for evaluating remote sensing predictions (Almeida et al., 2016; García et al., 2010; Yebra & Chuvieco, 2009), in line with discussions held in the broader realm of statistical modelling (Fox, 1981; Willmott, 1982).

[Figure 3 caption: Willmott's indices of agreement (Willmott & Wicks, 1980; Willmott et al., 1985, 2012) against the coefficient of determination $R^2$, as the most similar neighbour (MSN) method increased the number of predictors ($p$). Dashed lines show values obtained using the whole training dataset, whereas solid lines were yielded by cross-validated predictions.]

In our results, however, we observed an underperformance of $d$ as a statistic truly expressing the degree of agreement between observed and predicted. Even having been obtained after cross-validation, the values of $d$ summarized in Table 1 and Figure 1 were quite large in all cases, regardless of the results of other measures of model performance (see Valbuena et al., 2017a). On the other hand, $R^2_{cv}$ obtained sounder results, which were more likely to discriminate reliable from unreliable alternatives, whereas Figure 2 shows that $d$ may simply yield values systematically larger than $R^2_{cv}$. The modifications to $d$ did not necessarily overcome this problem, since Willmott et al.'s (1985) $d_1$ was instead systematically lower than $R^2$, and the more recent Willmott et al.'s (2012) $d_r$ showed similar but smoother patterns of variation (Figure 3). As a conclusion, we suggest that $R^2$ may be a better alternative than $d$, in light of our results. Although Willmott (1981) correctly affirmed that $R^2$ may show large values if predictions are evenly distributed through the range and yet biased, we suggest that $R^2_{cv}$ may simply be used along with a hypothesis test ensuring the absence of bias, such as that proposed by Piñeiro et al. (2008). Willmott (1981, 1982) put forward a number of arguments in favour of using $d$, and many of them point out that $d$ may be conceptually superior to $R^2$ for describing the agreement in the observed versus predicted relationship for model evaluation (García et al., 2010; Yebra & Chuvieco, 2009). One interesting remark arises when comparing the calculation of Willmott's (1981) index of agreement $d$ with that of Pearson's sample correlation coefficient $r$. It is noteworthy that, although $R^2$ expresses the proportion of variance in the dependent variable explained by a model while $r$ expresses the dependence between two variables, $R^2$ and $r$ are closely related to one another. In particular, $R^2$ is equivalent to the square of Pearson's correlation ($r^2$) when a simple linear regression with intercept is fitted.
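This equivalence is easy to verify numerically; the following toy check in R is ours, for illustration only, and is not part of the original analysis.

# R^2 equals r^2 for a simple linear regression with intercept (toy data)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)
all.equal(summary(fit)$r.squared, cor(x, y)^2)  # TRUE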
Although this equivalence is not valid for MSN, we explored the possibility that $R^2$ and $r^2$ may be fairly close by analogy. Figure 4 shows that this was indeed the case, as it was also for the square of Willmott's index of agreement ($d^2$). Thus, we argue that studies with a particular interest in following Willmott's (1981, 1982) suggestions for evaluating model predictions (Almeida et al., 2016; García et al., 2010; Grzegozewski et al., 2016; Wachholz de Souza et al., 2015; Yebra & Chuvieco, 2009) should report $d^2$ instead, in order to make it more comparable to the more widespread studies using $R^2$ (e.g. Naesset, 2002, 2004; Hudak et al., 2006; Zhao et al., 2009; Erdody & Moskal, 2010; McInerney et al., 2010; Bright et al., 2012; d'Oliveira et al., 2012; Wing et al., 2012; Chen & Zhu, 2013; Straub et al., 2013; Zolkos et al., 2013; Asner & Mascaro, 2014; Valbuena et al., 2014; Latifi et al., 2015).
The case for using the maximal information coefficient (Reshef et al., 2011) could also be supported by analogy to the relationship between $R^2$ and $r^2$, since $MIC$ is usually regarded as a non-parametric version of correlation (Chen & Yang, 2016; Görgens et al., 2017; Thomas et al., 2017; Vallières et al., 2017). Results for $MIC$ in Table 1 and Figure 2 can best be interpreted by observing the scatterplots of observed versus cross-validated predictions published in Valbuena et al. (2017a: Figures 2-3). The remarkably higher values of $MIC$ with respect to $R^2$ for, e.g., $p = 8$ or $p = 10$ (Figure 2) can be explained by the fact that these scatterplots show clustered cases in the observed versus predicted metric space (see Valbuena et al., 2017a: Figure 2). In that sense, we deduce that $MIC$ simply favours local clustering of predictions, whether or not they correspond to the observations. We therefore discourage the use of $MIC$ as a substitute for $R^2$ in model evaluation. Since Simon and Tibshirani (2011) recommended the use of distance correlation $dCor$ (Székely & Rizzo, 2017) as an alternative to $MIC$, we also included it in Figure 4. The analogy of squaring $d$ does not work for $MIC$, which simply showed unrealistically low values when squared. The square of the distance correlation ($dCor^2$), however, also obtained values fairly similar to those of $R^2$ (Figure 4).
Another unexpected result obtained in Figure 2 was the fact that differences between $R^2_{fit}$ and $R^2_{cv}$ only became apparent for large $p$, even for as small an $n$ as employed in our study case. This result suggests that, in spite of the popularity of cross-validation in the assessment of forest AGB using remote sensing (e.g. Franco-Lopez et al., 2001; García et al., 2010; Hudak et al., 2006; Hudak, Crookston, Evans, Hall, & Falkowski, 2008; Latifi et al., 2015; McInerney et al., 2010; McRoberts et al., 2002; Naesset, 2002; Packalén & Maltamo, 2008; Valbuena et al., 2013a; Wing et al., 2012), results reported from leave-one-out cross-validation could easily converge to those yielded by the model fit (i.e. using the same dataset for training and validation). This may put into question the need for cross-validation at all, since it makes little difference for those models with a more realistic number of predictors ($p = 3$–$5$) (Figure 2). Although cross-validation may add robustness to an analysis with small sample sizes and few highly influential cases, our results show little difference compared to using the entire training dataset (Valbuena et al., 2017a). Future research should be devoted to further investigating the reliability of cross-validation, accounting for the trade-offs between model generality and statistical power (Cohen et al., 2003) faced when deciding between cross-validation and setting aside part of the available field information for independent model validation. For practical forest inventory, it is useful to mention that models restricted according to Valbuena et al. (2017a), despite using fewer predictor variables, obtained similar performance in light of all the accuracy assessment figures analysed.

Conclusions
These are the conclusions that can be drawn from the results obtained in this study and their subsequent discussion. First, we observed an underperformance of Willmott's (1981) $d$ in comparison to $R^2$, since the latter was more sensitive to actual model performance. Second, the refinements later introduced to that same index (Willmott et al., 1985, 2012) did not solve the observed shortcomings either. Third, we put forward a recommendation to use a squared version of Willmott's index ($d^2$) for those who prefer it over $R^2$, to allow better comparison with studies using $R^2$. Fourth, the maximal information coefficient $MIC$ is not at all suitable for evaluating the relationship between observed and predicted values in model evaluation. And finally, for a small number of predictors, the difference between using cross-validation or not using it at all can be negligible, which shows how useful comparing the two can be for evaluating overfitting. More research is needed to explore the usefulness of different cross-validation alternatives in the specific topic of remote sensing predictions of forest AGB.

Acknowledgments
This work was partially supported by the Spanish Directorate General for Scientific and Technical Research under Grant CGL2013-46387-C2-2-R. We also thank the Valsain Forest Centre, of the National Park Body (Spain), and Prof. David A. Coomes (University of Cambridge) for their valuable help. Dr Valbuena's work is supported by an EU Horizon 2020 Marie Sklodowska-Curie Action entitled "Classification of forest structural types with LIDAR remote sensing applied to study tree size-density scaling theories" (LORENZLIDAR-658180). Danilo Almeida acknowledges support from São Paulo Research Foundation (FAPESP) (grant 2018/21338-3).

Disclosure statement
No potential conflict of interest was reported by the authors.