A brief history of information and disinformation in hydrological data and the impact on the evaluation of hydrological models

ABSTRACT This paper considers what we know about the potential for disinformation in hydrological data when used for the evaluation of hydrological models. This will generally arise from epistemic uncertainties associated with hydrological observations, particularly from nonstationary or extrapolated rating curves for discharges, and poor rainfall and snowmelt information when interpolated over basin areas. Approaches based on information theory are not well suited to consideration of such epistemic uncertainties in model evaluation and an alternative approach based on setting limits of acceptability independent of any model runs is suggested. This allows for both the rejection of all models tried, and for acceptability of models across different model structures and parameter sets. The paper concludes with some suggestions for future research on defining disinformative data for both point and spatial observables, studying model failures, and defining new observations with a view to having the greatest impact on reducing model uncertainties.


Expectations about information content of hydrological data
Most hydrological applications require some estimates of variables at the basin scale. In general, this requires making some observational data available at that scale. Despite the decade of the International Association of Hydrological Sciences Prediction in Ungauged Basins (PUB) initiative, and the resultant publications (see Blöschl et al. 2013, Hrachowitz et al. 2013), estimates of hydrological variables in ungauged basins remain really rather uncertain. But we also know that the observational data of hydrological relevance are uncertain. Many observations are made at the "point" scale, using for example raingauges; the data are then uncertain both at a point and in the extrapolation to basin areas. Discharges are generally uncertain at the point of measurement because of rating curve uncertainties and nonstationarities. Where spatial data are available - for example, from remote sensing methods - there are generally uncertainties in converting the digital numbers recorded by the sensor to the hydrological variables of interest. All of these uncertainties can have a random variability component (the aleatory uncertainties) and a lack of knowledge component (the epistemic uncertainties).
If we consider the basin water balance equation over some time increment Δt:

ΔS = P − Q − Et    (1)

where Q is discharge observed at the basin outlet, P is precipitation input, Et is evapotranspiration output and ΔS is the change in basin storage over Δt, some examples of important epistemic uncertainties for the most important terms of the basin water balance are the following:

• For discharge observations, Q: rating curve variability, extrapolation to higher and lower flows, and possible non-stationarities within and between events;
• For precipitation observations, P: extrapolation from point gauges to basin scale, interpretation of spatial radar data, assessment of snow water equivalents over basin areas;
• For evapotranspiration, Et: interpretation of point flux measurements for different wind speeds and directions (especially for mixed land use); extrapolations from point to basin scale (especially for mixed land use); estimates of variables in the energy budget closures for local model and remote sensing based estimates;
• For changes in storage, ΔS: limitations of techniques for estimating unsaturated and saturated zone storage.
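As a minimal numerical illustration of the water balance bookkeeping described above (my own sketch with hypothetical values, not taken from any study), the implied storage change can be computed for each period and inspected for physically implausible residuals that may signal epistemic data errors:

```python
# Sketch (hypothetical numbers): close the basin water balance
# dS = P - Q - Et over each time increment. A strongly negative implied
# storage change in a dry period with little available storage may
# indicate problems with the input or discharge observations.

def storage_change(P, Q, Et):
    """Return the implied storage change dS = P - Q - Et per increment (mm)."""
    return [p - q - e for p, q, e in zip(P, Q, Et)]

# Hypothetical basin totals in mm over three successive periods
P  = [50.0, 10.0, 0.0]   # precipitation input
Q  = [20.0, 12.0, 5.0]   # discharge output
Et = [5.0, 3.0, 2.0]     # evapotranspiration output

dS = storage_change(P, Q, Et)
print(dS)  # implied storage changes for each period
```

Neglected terms (deep recharge, cross-boundary groundwater flux, occult inputs, underflow) would of course appear folded into dS in such a check.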
There are also epistemic uncertainties associated with neglected terms of the water balance such as recharge to deep aquifers and groundwater fluxes across the basin boundary, unmeasured occult precipitation inputs, and underflow at discharge measurement sites. Aleatory and epistemic uncertainties in hydrological variables are discussed further by McMillan et al. (2010, 2011, 2012, 2018), Westerberg et al. (2011), Coxon et al. (2015), Hollaway et al. (2018), Kiang et al. (2018), and Beven and Lane (2022). All of these uncertainties have been shown to have an impact on the application of hydrological models (see e.g. Beven and Smith 2015, Beven et al. 2022a, 2022b). There is an expectation that they will do so in complex ways: that there may be variable impacts from event to event, both in terms of timing and volumes; that there may be different impacts on rising limbs than on falling limbs; and that significant observational uncertainties in one event might have an impact on how well a model might perform in subsequent events because of the consequent impact on antecedent conditions. These are questions that have interested me since I first started trying to assess the uncertainty in hydrological model predictions using Monte Carlo simulations in about 1980 (see Beven 2001a, 2016, 2021). The fundamental question is: just how can the information content of data in constraining the prediction uncertainties be assessed when there is such an expectation of epistemic uncertainties, rather than just aleatory variability? In an early study (the first Generalised Likelihood Uncertainty Estimation (GLUE) paper of Beven and Binley 1992) we used two measures to look at the information added in constraining parameter distributions as new events were used in model calibration: the Shannon entropy measure and the U-uncertainty measure based on fuzzy set theory. Both suggested that even a small number
of events were effective in constraining the predictive uncertainty, but that there was residual uncertainty that was difficult to reduce further. At the time we inferred that this was because of model structural uncertainty (which certainly played a role), but I would now give the epistemic data uncertainties more weight in explaining how well we can expect models to perform.

Information measures and the theory of information
Any consideration of information content raises the issue of how to quantify information. There are different ways of measuring information in the literature. These are mostly based on treating variables as aleatory quantities so that they can be described by probabilities. In fact, the argument has often been made that the probability calculus is the only correct axiomatic framework for dealing with variability and information (e.g. Nearing et al. 2016). In one of the very first mentions of information content in the hydrological literature, Matalas and Langbein (1962) defined information as the inverse of the ratio of the variance of the mean of an observed or modelled variable to the variance of the mean of an equivalent uncorrelated random variable:
I = (σ²/N) / σm²

where σ² is the population variance of a random variable of series length N, so that σ²/N is the variance of the mean of an equivalent uncorrelated random series, and σm² is the variance of the mean for the variable of interest. If σm² is greater than the variance of the mean of the equivalent random series, then I will be less than 1. Based on this definition, Nick Matalas (1930-2019) and Walter Langbein (1907-1982) gave expressions for the information content of series that are autocorrelated as simple Markov series, or for multiple regional cross-correlated series subject to similar forcings. They demonstrated that in general such correlations reduce the information content in a series because there is some repetition of information. They also showed, however, that when cross-correlations are very high then short periods of record can be highly informative about correlated variables. They used the example of tree-ring data to estimate annual precipitation.
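The effect Matalas and Langbein describe for a simple Markov series can be sketched numerically. The following is my own illustration (not code from their paper), using the standard expression for the variance of the mean of an AR(1) series with lag-one correlation rho; the function name and example values are arbitrary:

```python
# Information content of an AR(1) (simple Markov) series relative to an
# uncorrelated series of the same length N. For an AR(1) series,
# var_mean = (sigma^2 / N) * f(N, rho), so I = 1 / f(N, rho).

def information_content(N, rho):
    """I = (sigma^2/N) / var_mean for an AR(1) series; equals 1 when rho = 0."""
    f = 1.0 + 2.0 * sum((1.0 - k / N) * rho**k for k in range(1, N))
    return 1.0 / f

print(information_content(50, 0.0))  # no autocorrelation: I = 1
print(information_content(50, 0.5))  # I < 1: correlation repeats information
print(information_content(50, 0.9))  # stronger correlation, still less information
```

The decline of I with increasing rho reproduces their qualitative result that autocorrelation reduces the information content of a record.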
Later discussions of information content in hydrology have tended to go back to the definition of information content defined by Claude Shannon (1916-2001) in 1948. Shannon (1948) was the first information theorist (and wartime cryptanalyst), who worked on how best to communicate messages securely and the errors associated with the transmission of signals. He defined a measure of information for a discrete random variable X as

H(X) = −Σi p(xi) log p(xi)    (2)

The units of H(X) depend on the base of the logarithm. For base 2, the units are bits or shannons; for natural logarithms, the units are nats; for base 10, the units are dits or hartleys. This information measure is also often called the Shannon entropy, since it is mathematically equivalent to the Gibbs formula for entropy in a system of discrete states. Effectively, it expresses the element of surprise to be expected in a sample from a distribution of states. The more uniform the distribution, the greater the expected surprise in any sample and the greater the entropy. The definition can be extended to the joint distribution of multiple variables, including the conditionality of one variable on one or more others. For two variables, the measure of information reflects the joint distribution, when the mutual information can be defined as the information about X contained in Y:

I(X; Y) = Σx Σy p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (3)

One of the first explicit references to the use of entropy in hydrological modelling was by Jaime Amorocho (1920-1983) and his student Basilio Espildora (Amorocho and Espildora 1973). In that paper, they gave some background to entropy measures and explored the use of information theory in the estimation of daily discharges using a model. They defined the marginal distribution of discharges X for each day of the year from multiple-year time series of discharges using discrete intervals and Equation (2), and also how a model prediction Y might reduce the uncertainty in the estimates of discharge relative to the year-to-year variability on that day. They refer to the mutual
information (Equation 3) as the transinformation provided by Y in reducing the uncertainty in predictions of X each day, pointing out that this can be directly compared for different models (and across days in that application).
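For discretized series, the entropy and transinformation calculations of Equations (2) and (3) are straightforward to compute. The following minimal sketch is my own illustration with hypothetical discharge classes (the function names are not from the papers cited):

```python
# Shannon entropy of a discretized variable X and the mutual information
# ("transinformation") between X and a model prediction Y, in bits.
from math import log2
from collections import Counter

def entropy(xs):
    """H(X) = -sum p(x) log2 p(x)  (Equation 2), from sample frequencies."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to Equation (3)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Hypothetical discharge classes and a model that mostly tracks them
x = ["low", "low", "mid", "high", "mid", "low", "high", "mid"]
y = ["low", "low", "mid", "high", "mid", "low", "mid", "mid"]
print(entropy(x))               # uncertainty in the observed classes
print(mutual_information(x, y)) # uncertainty in X removed by knowing Y
```

As Amorocho and Espildora noted, the transinformation can be compared directly across different models predicting the same X.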
There are equivalent expressions for marginal and mutual information that derive from fuzzy set theory (e.g. Al-Sharhan et al. 2001, Shang and Jiang 1997, Parkash et al. 2008), though I have not seen these used much in the hydrological literature.

Information content and model calibration
There is a nice conceptual link between the use of entropy measures and model evaluation, including the use of likelihood measures and Bayes' theorem. A recent introduction is provided by Weijs and Ruddell (2020), who outline how a model can be considered as a means of compressing a description of the data. The search for an acceptable model is then equivalent to a modern form of Occam's razor in trying to find the simplest model, or maximum degree of compression, that explains the available observations. That description will have the maximum entropy, since - going back to Shannon and his studies of signal transmission - maximum entropy represents the shortest string of bits needed to encode a message.
The use of entropy in model evaluation has a long history in hydrological applications, after a certain hiatus following the early Amorocho and Espildora (1973) study. Singh (1997, 1998) provides a review, particularly of the use of Jaynes' principle of maximum entropy for model calibration (see also Singh 2015). Note that there is a separate strand of research concerned with the use of entropy as a fundamental principle of self-organization and as a means of improving process representations in models (e.g. Rinaldo et al. 1998, Porporato et al. 2001, Kleidon and Schymanski 2008, Wang and Bras 2011, Maheu et al. 2019).
There was another revival of interest in the use of information theory for model evaluation starting some 15 years ago with the work of Steven Weijs (e.g. Weijs et al. 2010, 2013) and then later in the work of Grey Nearing and Hoshin Gupta and their collaborators (e.g. Nearing et al. 2013, Gupta and Nearing 2014, Nearing and Gupta 2015, Ruddell et al. 2019). The approach has been widely used recently in the application of machine learning and deep learning methods in hydrology (e.g. Kratzert et al. 2019a, 2019b, Nearing et al. 2021, Frame et al. 2023). Indeed, information theory has been advocated as a philosophy for hydrological uncertainty (Nearing et al. 2016), as a way of distinguishing between different process representations in hydrological models (e.g. Bennett et al. 2019), and as providing a new paradigm for Earth science (Majda and Gershgorin 2011, Goodwell et al. 2020, Nearing et al. 2020, Weijs and Ruddell 2020).
It is worth noting at this point that there are other approaches available for evaluating the value of different types and periods of data in model calibration and evaluation. Dynamic identifiability analysis (DYNIA) can be used in this way (e.g. Wagener et al. 2003, Avanzi et al. 2020) and, more generally, Bayesian methods of data assimilation can progressively assess the reduction in predictive uncertainty or posterior parameter distributions as more data are added. The results of both will depend on the type of likelihood measure that is employed.

Aleatory uncertainties, epistemic uncertainties and measures of information
Given the extensive recent claims for information theory as the basis for model hypothesis and causality testing, the firm basis of information theory in the axioms of probability and Kolmogorov's complexity theory, and the apparent utility of the approach claimed in a variety of applications, it might seem perverse to argue for an alternative view of the modelling problem. However, let us for the moment be foolish and see where it leads. In doing so we will take a purely pragmatic, instrumentalist view of the modelling process in practice. While I believe that applications of information theory are clearly appropriate to cases where it can be assumed that the errors in representing a signal are aleatory in nature, I would suggest that there are limitations in cases, such as those in hydrological modelling, where there are important epistemic uncertainties.
In effect, the information theory approach, in summarizing variables in terms of probability distributions of occurrences, does not allow for the element of disinformation that can arise from the epistemic nature of uncertainties in the modelling process. Disinformation in this sense might be seeing a hydrograph response when no rainfall is recorded, or finding event runoff coefficients significantly greater than one in flashy basins with small baseflows. Beven et al. (2011), Beven and Smith (2015), Beven (2019) and Beven et al. (2022a, 2022b) have shown that there are flood events where runoff coefficients are greater than 1 in such basins; i.e. the observations suggest more output than input. Since the runoff coefficients depend only on the observed rainfall inputs and discharge estimates, this variability might come from epistemic uncertainties in both (but also partly from the method of separating events, which is only practical in flashier basins). Within the context of information theory such events necessarily influence the entropy for the variables of interest and, for example, the mutual information of discharge estimates conditional on the knowledge of rainfalls.
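The screening described here amounts to simple arithmetic once events have been separated. As a hedged sketch (hypothetical event volumes, with event separation assumed already done; this is an illustration, not the published procedure), potentially disinformative events can be flagged wherever the event runoff coefficient exceeds one:

```python
# Flag potentially disinformative events where the runoff coefficient
# Q/P exceeds 1, i.e. more output than input, which is inconsistent with
# mass balance in a flashy basin with small baseflows.

def runoff_coefficient(event_rain_mm, event_flow_mm):
    """Event runoff coefficient: event discharge volume / rainfall volume."""
    return event_flow_mm / event_rain_mm

# Hypothetical (rainfall mm, discharge mm) totals for separated events
events = [(60.0, 25.0), (18.0, 21.0), (40.0, 15.0)]
flags = [runoff_coefficient(p, q) > 1.0 for p, q in events]
print(flags)  # the second event (21 mm of flow from 18 mm of rain) is suspect
```

In practice a threshold somewhat below 1 might also be suspect in basins where much of the rainfall is known to go to evapotranspiration and storage.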
Recognizing such cases as inconsistencies arising from data uncertainties can, of course, be informative, but Beven and Smith (2015) noted that because such epistemic uncertainties are inconsistent with the water balance constraints that are incorporated into most hydrological models, such events would introduce disinformation into the model calibration process. It is somewhat ironic, therefore, that there have also been recent papers that have tried to introduce some hydrological principles, such as the water balance, into data-based machine learning models, albeit that the aim of such studies has been to see whether such principles might be useful (see e.g. Reichstein et al. 2019, Frame et al. 2023). In doing so, it has been shown that the differences between mass balance constrained and unconstrained machine learning models are not consistent, that both still generally perform better than conceptual hydrological models, and that there are basins where both machine learning and conceptual hydrological models perform rather badly (e.g. Beven 2023, Frame et al. 2023). While this does suggest that conceptual hydrological models may not be extracting the maximum information from the observational data, perhaps because of the inclusion of disinformative events in the calibration process, it also suggests that there are inconsistencies and epistemic uncertainties in the observations that are not going to be easy to model by any method. Note that information theory has been proposed as a way of differentiating between aleatory and epistemic uncertainties by Gong et al. (2013), but these authors refer to epistemic uncertainties only in respect of potential improvements to the model used, not as arising from the observational data (which presumably for them are therefore part of the aleatory uncertainty, which they define as uncertainties that cannot be further reduced).
I remain wary of any methods that claim to disaggregate model structural uncertainties without any consideration of epistemic uncertainties in the observations themselves: while we cannot reduce the uncertainty associated with historical observables, the epistemic part of that variability might be fundamental in constraining the predictive power of a model. Indeed, any calibration dataset may not fully represent the range of epistemic variability that might be expected in prediction. Thus there may not be enough samples of disinformative events to fully characterize the joint distribution that underlies the calculation of mutual information from model residuals. And because that sample depends on uncertainties in the data, which model exhibits the maximum mutual information in evaluation might itself depend on those data uncertainties (in a way similar to the overstretching of a likelihood surface that arises when the assumptions of a formal likelihood function are not met in practice; see Beven and Smith 2015, Beven 2016, 2019). But, as noted earlier, these are not really surprises - there really should be an expectation that these types of uncertainties will occur and that their characteristics will be non-stationary over time.
This might be one reason for the suggestion that the model likely to be most robust in prediction might be that calibrated on the whole of a historical record, rather than using a split record method (Shen et al. 2022b). In particular, epistemic errors make timing issues much more important: in the choice of particular periods of data for calibration (which can have significant effects on calibrated parameters, as demonstrated in recent papers by Brigode et al. 2013, Arsenault et al. 2018, Guo et al. 2020, and Shen et al. 2022a, amongst others); in the impacts of a disinformative event on the prediction of later events (e.g. Beven and Smith 2015); and in the event-to-event variation in data timing errors and their effects on residual correlations and model performance statistics. It will be seen that Equations (2) and (3) depend only on the distributions of variables. They do not take any explicit account of the position in time of the values represented by the distribution. For aleatory variations this is not a big deal; for epistemic uncertainties it might be really important (e.g. in the differentiation of informative and disinformative events, and the estimation of prediction uncertainties, in Beven and Smith 2015).

Informative and disinformative events and testing models as hypotheses
The recognition of disinformative events in hydrological observations raises some issues about testing models as hypotheses. We note that the term disinformative is used here with respect to the identification of an acceptable model; the very recognition that there are events with physically inconsistent runoff coefficients already provides valuable information about epistemic uncertainties in the modelling process that should be taken into account in any application.
There have, of course, been attempts to allow for variation in input uncertainties in model calibration. The Bayesian total error analysis (BATEA) methodology (e.g. Thyer et al. 2009, Renard et al. 2011) did so based on a hierarchical Bayesian model identification approach. Within this approach, each event was treated as if it had a separate random rainfall multiplier chosen from an underlying Gaussian distribution. One of the problems with the methodology was the assumption that all the uncertainties are aleatory in nature, leading to a highly stretched likelihood function that provides overconfidence in the parameter estimates. It also allows for interaction between the event rainfall multipliers, the model structure and the resulting model performance on an event-by-event basis (i.e. underestimation of discharges can be compensated by a rainfall multiplier greater than 1, and overestimation of discharges by a multiplier less than 1). This leads to greatly improved model calibration, but with such a wide range of multipliers that they had to be truncated in prediction, and to wide total error bounds. This may not in itself be unrealistic; but the assumptions of the approach are, to my mind, more problematic.
An alternative approach makes use of the rejection of models that are inconsistent with what we know about the epistemic uncertainties in the data. This approach allows for equifinality of model structures and parameter sets, a concept introduced in the context of general systems theory by Ludwig von Bertalanffy (1901-1972) in his 1968 text (von Bertalanffy 1968), and introduced into hydrological modelling by Beven (1975; see also Beven 2006). This rejectionist approach requires, however, a means of defining the limits of consistency with what we know about epistemic uncertainties in the data. This will necessarily be to some extent subjective, since the epistemic uncertainties are the result of incomplete knowledge, including knowledge about just how poor the data or their extrapolations might be.
One of the issues here is the nature of events that appear unusual or have an element of surprise. These are often the extreme events, particularly the flood events of interest. It has been suggested that model calibration should focus on unusual events, particularly in applications that involve flood prediction and forecasting (e.g. Laio et al. 2010, Singh and Bardossy 2012). The question is whether such unusual events are simply less frequent, in which case they might be the most informative in the historical record for model calibration, or whether they are unusual because of epistemic uncertainties in the observations, in which case they might be disinformative for model calibration (Beven and Lane 2022). The problem therefore is to identify which might be which! One approach to this problem, proposed by Beven and Smith (2015), was to try to identify informative and disinformative events based on the event runoff coefficients before doing any model evaluations (in that case a likelihood evaluation was used within the GLUE methodology, based on clusters of similar informative and disinformative events, with some downweighting if events followed a defined disinformative event). This then allowed two sets of uncertainty bounds to be estimated: one for the events considered informative, and the other for the events considered disinformative. Such a process then reflects the epistemic uncertainties associated with the disinformative events in defining an ensemble of acceptable models. In prediction, however, it is not known whether the next event will be informative or disinformative (since the runoff coefficient cannot be determined without prior knowledge of the observed discharges). Thus, in prediction, both sets of uncertainty bounds can be plotted for the simulated variables.
The concept of using historical event runoff coefficients in model evaluation was later extended to a limits-of-acceptability approach to encompass the range of potential runoff coefficients that might be associated with different events, without the need to define disinformative events explicitly. Beven (2019) showed how "similar" events could be defined by a distribution of nearest neighbours assessed in terms of antecedent flows and runoff volumes. A weighted distribution of runoff coefficients across the nearest neighbours could then be used to define limits of acceptability for model rejection, allowing for the actual runoff coefficient for a predicted event. Beven et al. (2022a) applied this way of allowing for epistemic uncertainties to predicting discharges in the River Kent basin in Cumbria, UK. They showed that none of the 100 000 models tried could satisfy the limits defined in this way at all time steps, or, by analogy with statistical hypothesis testing, at 95% of the time steps. They pointed out that the latter condition - that is, one based on aleatory assumptions - can also be problematic in the face of epistemic uncertainties, since it is quite possible that the 5% of failures that might be allowed could be exactly the time steps of most interest, in their case in the prediction of peak flows. They also noted that the runoff coefficients allow only for the potential for volume errors and only partly for potential timing errors, which can also result from epistemic uncertainties associated with the input time series. They went on to show that concentrating on limits of acceptability for the peak flows did result in a number of acceptable models, which also gave good peak flow predictions for other periods containing significant floods. These predictions also provided useful inputs for a flood inundation model for the largest flood of record in 2015 (Beven et al. 2022b).
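The nearest-neighbour idea can be sketched as follows. This is my own hedged illustration of the general scheme, not the published method: the Euclidean distance measure, the use of min/max rather than a weighted distribution, and all numbers are assumptions for the sake of the example.

```python
# For a new event, take the runoff coefficients of its k nearest-neighbour
# historical events (here, nearest in antecedent flow and rainfall volume)
# and use their spread to set limits of acceptability for simulated
# event runoff coefficients.

def acceptability_limits(event, history, k=3):
    """Return (lower, upper) runoff-coefficient limits from k nearest events.

    event:   (antecedent_flow, rain_volume)
    history: list of (antecedent_flow, rain_volume, runoff_coefficient)
    """
    def dist(h):
        return ((h[0] - event[0]) ** 2 + (h[1] - event[1]) ** 2) ** 0.5
    nearest = sorted(history, key=dist)[:k]
    coeffs = [h[2] for h in nearest]
    return min(coeffs), max(coeffs)

# Hypothetical historical events
history = [(1.0, 20.0, 0.30), (1.2, 25.0, 0.35), (0.4, 50.0, 0.55),
           (2.0, 18.0, 0.28), (0.5, 45.0, 0.60)]
lo, hi = acceptability_limits((1.1, 22.0), history)
print(lo, hi)  # a simulated coefficient outside [lo, hi] fails the limits
```

In a real application the antecedent flow and volume variables would need scaling before computing distances, and a weighted distribution over neighbours (as in Beven 2019) would replace the simple min/max used here.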
In effect, deciding whether a model is acceptable or fit for a particular purpose when dealing with epistemic uncertainties is a matter of making difficult decisions about whether particular parts of the dataset will be informative or disinformative for model testing. This is going to require decisions that are to some degree subjective, based on assessing our lack of knowledge about the variables and processes that might be used in model testing. Beven and Lane (2022) suggest that this can be viewed as deciding on what would constitute success in a Turing-like test of model performance. They set out eight principles for creating such a test: (1) to explore the definition of fitness with relevant stakeholders; (2) to accept that models cannot be expected to perform better than allowed by the observed data used for simulations and evaluations; (3) to ensure that models do not contradict secure evidence about the nature of the basin response; (4) to ensure that evaluations aim to get the right results for the right reasons; (5) to allow for the possibility that all models might be rejected as not fit for purpose; (6) to allow that the results of such tests will always be conditional; (7) to allow that the evaluators might themselves need evaluating; and, finally, (8) to ensure that there is a proper audit trail so that assumptions and decisions in evaluation processes can be reviewed and revisited by others.
Of these, the last is the most important, because if there is a clear audit trail that records the assumptions made in an analysis, including any subjective decisions, then those assumptions can later be revisited and reassessed by others, including the stakeholders who might make use of the model outputs in decision making.
There is a distinct contrast here with the assessment of the information content contained in series of random variables, as in the Shannon entropy measure and Jaynes' principle of maximum entropy. These are essentially modernistic concepts, created under the assumption that ways can be found to extract the maximum information from the data in finding the correct way to make predictions with a model. That might be a concept that is fit for purpose for the transmission of signals in communications links, the original subject area of Claude Shannon's research. It does not, however, hold up for areas such as the application of hydrological models, with the intrinsic epistemic uncertainties associated with the data and the modelling process. There, we need to move towards more subjective types of evaluation that try to deal more explicitly with our lack of knowledge. The approach suggested by Beven and Lane (2022) is one tentative step in that direction.

Spatial information and models of everywhere
The question of information content becomes more focused if we consider what is required to constrain the prediction uncertainties for particular places or basins, since in doing so we need to consider the particularities of those places and the hydrological and other data available for those basins. There is a certain uniqueness of place that is intrinsic to particular basins and that will interact with the equifinality of model structures and parameter sets in producing the uncertainties associated with model predictions for that basin (Beven 2000, 2006, Beven and Freer 2001, 2020). Traditionally this has only been a matter of local model calibrations for sites where discharge observations have been available (or the epistemic problem of predicting the ungauged basin where such data were not available), but increasingly local basins are being represented as spatially distributed models of everywhere or, in current parlance, digital twins (Beven 2007, Blair et al. 2019, Blair 2021). This introduces new possibilities for the evaluation of predictions once visualizations of those spatial data are provided to local stakeholders with local information (Beven and Alcock 2012, Beven et al. 2015). New types of information based on local knowledge or data collection will mean new types of evaluation and new ways of testing whether a model is getting the right result for the right reasons. This might include qualitative information, including perceptual information about processes (e.g. Seibert and McDonnell 2002, Beven and Chappell 2021, Wagener et al. 2021), or specific data collection programmes including crowd-sourced data (e.g. Seibert and Beven 2009, Rojas-Serna et al. 2016, van Meerveld et al. 2017).
And yet, we know that the predictions of a model will depend on the limitations of the forcing data that are provided to it, as well as the limitations of its structure and uncertainties in its parameterization. It cannot therefore be expected to be right everywhere in making spatial predictions within a basin. There will, therefore, still be a need to allow for some limits of acceptability in deciding whether a model is fit for purpose in a particular application or not. Ideally, the information in different sources of evaluation data should be assessed on how far they influence such decisions. This raises the question as to whether the most informative data are those that result in rejecting all the models tried. This might not be the best outcome in terms of immediately providing predictions to users, but it might be the best outcome in terms of improving the model representation of a basin. In this respect it is important that the limits of acceptability be set before making any model runs, rather than adapting the limits to avoid model rejections (but see Beven et al. 2022b regarding allowing acceptance for a particular purpose).

Looking to the future: information and knowledge
The discussion here implies that the issues of information and disinformation in observational data are not simple, and that, despite the theoretical foundations of information theory as an approach to model evaluation, it may have limitations in applications where epistemic uncertainties are important. We still have a lot to learn about the nature of such epistemic uncertainties in the hydrological modelling process, but a number of topics for future research can be suggested.
In any model evaluation study we need better methods of distinguishing informative and disinformative periods of observations, noting that where there are epistemic uncertainties in historical data they cannot be easily reduced. Evaluation of event runoff coefficients, as in Beven and Smith (2015) and Beven et al. (2022a, 2022b), is one approach that can be applied in basins with small baseflow contributions (with some additional allowance for timing errors), but new approaches are needed for baseflow-dominated basins that can be applied independently of any model runs. One approach might be a time step-by-time step assessment of a range of potential observed responses, using for example classification or machine learning methods to define some limits of acceptability (as has been recently tried by de Oliveira and Vrugt 2022, and Gupta and Govindaraju 2023). This assessment is particularly important for unusual events, which can often be identified using classification methods (e.g. Iorgulescu and Beven 2004). As noted above, unusual events might be the most informative for model calibration in detecting whether a model is generating the right responses for the right reasons, or they might be disinformative because of epistemic uncertainties in the observations.
In the context of the current popularity of machine learning methods it might be valuable to revisit the past debates in environmental modelling about information and knowledge, or predictive power and explanatory depth. At that time, with more and more processes and parameters being implemented in ever more complex models with a view to increasing explanatory depth, there was concern that fitting such models using limited data might not lead to improved process representations and predictive power (see e.g. Beven 2001b). Now the concerns are somewhat different. There have been a number of studies using various large sets of hydrological data (such as the (daily) CAMELS (Catchment Attributes and Meteorology for Large-sample Studies) and CAMELS-UK datasets) that have suggested that machine learning models have greater predictive power than conceptual hydrological models. Indeed, it seems that training on the whole dataset can often result in better predictive power than locally calibrated models, and in good results for a subset of basins not included in the training data and treated as if ungauged. The machine learning models therefore appear to be extracting more information from these large datasets than can be achieved using current conceptual models.
This is intriguing on a number of levels. It suggests that the various ways in which storage elements have been implemented in conceptual models (and we might include the models based on "physics" here) are subject to more epistemic structural error than the pattern recognition algorithms of machine learning. They are, then, lacking in knowledge or explanatory depth despite their origins as approximations to our perceptual models of hydrological processes (see Beven 2012, Beven and Chappell 2021, Wagener et al. 2021). This suggests that they need to be improved. However, there is also the issue that machine learning methods applied to these large samples of basins do not produce satisfactory results on all those basins (e.g. Kratzert et al. 2019a, 2019b, Beven 2023, Frame et al. 2023). The same is true of conceptual models, even when locally calibrated (e.g. Perrin et al. 2008, Lane et al. 2019). So, what is it about those basins that limits the potential for prediction? Is it the epistemic uncertainties in the observations for those basins (such as the poor representation of rainfall patterns and intensities in semi-arid or mountain basins, e.g. Faurès et al. 1995, Page et al. 2022), or some effects of timing errors, or something else? Thus, while accepting that machine learning methods have already demonstrated improvements in predictive power (albeit that this will not necessarily hold outside of the support of the training data), we clearly need to make use of these types of results to improve our knowledge. But machine learning models remain difficult to interpret in terms of explanatory depth. Certainly, attempts are being made to facilitate inference from machine learning models, but they remain high-dimensional parameter models that might be subject to overfitting, particularly when exposed to extrapolation outside the range of the data. One promising pathway to learning more about epistemic uncertainties might be to study such model failures in more detail (see Beven 2018, Beven and Lane 2022).
There is, then, still the challenge of assessing where the real information lies in constraining prediction uncertainties using different types of evaluation, and where information from different sources might be redundant. We should include here both point and spatial information provided by remote sensing. We know, for example, that point information can be subject to epistemic uncertainties when compared to model-predicted variables (also called commensurability uncertainty, see e.g. Freer et al. 2004). We know that remote sensing data might also be uncertain (in general they have already been through some processing based on some model constructs) and, in some cases, also subject to commensurability issues. Some sources of remote sensing data might also be more informative for model calibration than others (see e.g. Nijzink et al. 2018). We tend to assume that the pattern information in spatial images should be useful in model evaluation, but there has been little study of the potential for such data to be affected by epistemic uncertainty (e.g. Beven et al. 2015). Thus the value of different types of data has generally been assessed in a post-hoc way, i.e. on the basis of measures of posterior uncertainties for variables of interest after different types of evaluation. But this only tells us indirectly about what types of data we should concentrate resources on measuring, or what new types of data might be even more valuable to collect. It might be even more valuable to propose critical experiments that would inform model evaluation in ways that generate knowledge (and not only predictive power), even if the experimental techniques required do not yet exist (see Beven et al. 2020, Beven and Chappell 2021). This is surely a pressing topic for future research.