Facets of uncertainty: epistemic uncertainty, non-stationarity, likelihood, hypothesis testing, and communication

ABSTRACT This paper presents a discussion of some of the issues associated with the multiple sources of uncertainty and non-stationarity in the analysis and modelling of hydrological systems. Different forms of aleatory, epistemic, semantic, and ontological uncertainty are defined. The potential for epistemic uncertainties to induce disinformation in calibration data, arbitrary non-stationarities in model error characteristics, and surprises in predicting the future is discussed in the context of other forms of non-stationarity. It is suggested that a condition tree be used to be explicit about the assumptions that underlie any assessment of uncertainty. This also provides an audit trail for providing evidence to decision makers.


Introduction
I first started carrying out Monte Carlo experiments with hydrological models in 1980, while working at the University of Virginia. This was not a new approach at that time, but the computing facilities available (a CDC6600 "mainframe" computer at UVa) made it feasible for the types of hydrological model being used then. Adopting a Monte Carlo approach was a response to a personal "gut feeling" that traditional statistical approaches (at that time an analysis of uncertainty around the maximum likelihood model) were not sufficient to deal with the complex sources of uncertainty in the hydrological modelling process. Over time, we have learned much more about how to discuss facets of uncertainty in terms of aleatory, epistemic, ontological, linguistic, and other types of uncertainty (for one set of definitions see Table 1). Our perceptual model of uncertainty is now much more sophisticated but I will argue that this has not resulted in analogous progress in uncertainty quantification and, more particularly, uncertainty reduction. As one referee on this paper suggested, it can be argued that the classification of uncertainties is not really necessary: there are only epistemic uncertainties (arising from lack of knowledge) because we simply do not know enough about hydrological systems and their inputs and outputs. It is then a matter of choice as to how to treat those uncertainties, including formal probabilistic and statistical frameworks.
What is clear is that such epistemic uncertainties will limit the inferences that can be made about hydrological systems. In particular, we are often dependent on the uncertainties associated with past observations (see, for example, Fig. 1) and have not really done a great deal about reducing hydrological data uncertainties into the past. Some observational uncertainties can certainly be treated as random variability or aleatory, but observations can also be subject to arbitrary uncertainties. Here, I use the word arbitrary to distinguish epistemic uncertainties that do not have simple structure or stationary statistical characteristics on the time scales used for model calibration and evaluation. This time scale qualification is important in this context since the only information we will have about the impact of different sources of uncertainty on model outputs will be contained in the sequences of model residuals within some limited period of time. It is easy to show that stochastic models based on purely aleatory variability can exhibit apparent short-period irregularity or non-stationarity (see, for example, Koutsoyiannis 2010, Montanari and Koutsoyiannis 2012). However, there is then the question of how to identify the characteristics of long-period variability from shorter periods of model residuals that might contain the type of arbitrary characteristics defined above. It has been shown that some arbitrary uncertainties of this type might be disinformative to the model calibration process (e.g. Beven and Westerberg 2011, Kauffeldt et al. 2013, Beven and Smith 2014; see Fig. 1), even if they might be informative in other senses (such as in identifying inconsistencies in hydrological observations, Beven and Smith 2014). A disinformative event in this context is one for which the observational data are inconsistent with the fundamental principles (or capacities in the sense of Cartwright 1999) that might be applied to hydrological systems and models. Most hydrological simulation models (as opposed to forecasting models, see Beven and Young 2013) impose a principle of mass balance. We expect catchment systems to also satisfy mass balance (and energy balance and momentum balance, see Reggiani et al. 1999). The observational data, however, might not. Figure 1 is a good example of this, with far more output as discharge from the catchment than the recorded inputs for that event. While there are some circumstances, such as a rain-on-snow event, where this could be a realistic scenario, clearly no model that is constrained by mass balance would be able to reproduce such an event, suggesting that the residuals would induce bias in any model inference. It also suggests that we should take a much closer look at the data to be used in model calibration and evaluation before running a model (including the neglect of potential snowmelt inputs).
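As a minimal illustration of this kind of pre-calibration screening, the sketch below computes event runoff coefficients from hypothetical event rainfall and discharge totals and flags events whose water balance looks physically inconsistent. The threshold values, variable names, and data are illustrative assumptions of mine, not part of the original study.

```python
import numpy as np

def runoff_coefficients(event_rain_mm, event_flow_mm):
    """Ratio of event discharge to event rainfall (both as depths over the catchment)."""
    rain = np.asarray(event_rain_mm, dtype=float)
    flow = np.asarray(event_flow_mm, dtype=float)
    return flow / rain

def flag_disinformative(event_rain_mm, event_flow_mm, upper=1.0, lower=0.0):
    """Flag events whose runoff coefficient is physically implausible.

    A coefficient above `upper` (e.g. 1.0, or somewhat higher where snowmelt can
    contribute) implies more water leaving than entering, which a model constrained
    to maintain mass balance cannot reproduce.
    """
    rc = runoff_coefficients(event_rain_mm, event_flow_mm)
    return rc, (rc > upper) | (rc <= lower)

# Hypothetical events: the third has a runoff coefficient of about 1.4, as in
# Fig. 1, and would be excluded from (or down-weighted in) model calibration.
rain = [45.0, 60.0, 25.0, 80.0]   # event rainfall totals (mm)
flow = [18.0, 30.0, 35.0, 40.0]   # event discharge totals (mm)
rc, disinformative = flag_disinformative(rain, flow)
print(rc, disinformative)
```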
The implication of allowing that some model residuals might be affected by this type of arbitrary epistemic uncertainty is that commonly used probabilistic or statistical approaches to uncertainty estimation do not take enough account of the epistemic nature of uncertainty in the modelling process. It is not just a matter of finding an appropriate statistical distribution or, alternatively, some non-parametric probabilistic structure for the model residuals (e.g. Schoups and Vrugt 2010, Sikorska et al. 2014), especially when the sample of possible arbitrary uncertainties (or surprises) might be small. It will be suggested in what follows that we need to be more proactive about methods for uncertainty identification and reduction. This might help to resolve some of the differences between current approaches.
Figure 1. Example of an event where the runoff coefficient based on the measured rainfalls and stream discharges is about 1.4. This clearly violates mass balance and will therefore be disinformative in calibrating a model that is constrained to maintain mass balance to represent that catchment area.

Defining types of uncertainty (and why the differences are important)
Past analysis in a variety of modelling domains in the environmental sciences has distinguished several types of uncertainties and errors, including aleatory uncertainty, epistemic uncertainty, semantic or linguistic uncertainty, and ontological uncertainty (e.g. Beven and Binley 1992, McBratney 1992, Regan et al. 2002, Ascough et al. 2008, Beven 2009, Raadgever et al. 2011, Beven and Young 2013). Table 1 lists one such classification relevant to the application of hydrological models. In particular, the definition of aleatory uncertainty is constrained to the case of stationary statistical variation (noting that this might involve a structural statistical model but with stationary parameters), for which the full power of statistical theory and inference is appropriate. Epistemic uncertainties, on the other hand, have been broken down into those associated with model forcing data and observations of system response, and those associated with the representation of the system dynamics. As in Fig. 1, the observational data might sometimes be hydrologically inconsistent, and might lead to disinformation being fed into the model inference process (Beven and Smith 2014). Any of these might be sources of the rather arbitrary nature of errors in the forcing data and resulting model residual variability noted above.
Many aspects of the modelling process involve multiple sources of uncertainty, and without making very strong assumptions about the nature of these different sources it is not possible to separate the effects of the different uncertainties (Beven 2005). Attempts to separate the error associated with rainfall inputs to a catchment, for example, result in some large changes to event inputs and a strong interaction with model structural error (e.g. Vrugt et al. 2008, Kuczera et al. 2010, Renard et al. 2010). The very fact that there are epistemic uncertainties arising from lack of knowledge about how to represent the response, about the forcing data, and about the observed responses, reinforces this problem. If we knew what type of assumptions to make then the errors would no longer be epistemic in nature.
Defining a method of uncertainty estimation (and why there is so much controversy about how to do so)
Uncertainty estimation has been the subject of considerable debate in the hydrological literature. There are those who consider that formal statistics is the only way to have an objective estimate of uncertainty in terms of probabilities (e.g. Todini 2006, Stedinger et al. 2008) or that the only way to deal with the unpredictable is as probabilistic variation (Montanari 2007, Montanari and Koutsoyiannis 2012). There are those who have argued that treating all uncertainties as aleatory random variables will lead to overconfidence in model identification, so that more informal likelihood measures or limits of acceptability might be justified (e.g. within the GLUE framework of Beven and Binley 1992, Freer et al. 2004, Beven 2006a, Smith et al. 2008, Liu et al. 2009; and within approximate Bayesian computation by Nott et al. 2012). There are those who recognize the complex structure of hydrological model errors but who use transformations of different types to fit within a formal statistical framework (e.g. Montanari and Brath 2004). Some of these opinions have been explored in a number of commentaries and opinion pieces (Beven 2006a, 2006b, Hamilton 2007, Montanari 2007, Hall et al. 2007, Todini and Mantovan 2007, Sivakumar 2008), as well as in more technical papers. There is, of course, no right answer, precisely because there are multiple sources of epistemic uncertainty, including model structural uncertainty, that are impossible to separate. There are also different frameworks for assessing uncertainties and different ways of formulating likelihoods. If we had knowledge of the true nature of the sources of uncertainty then they would not be epistemic and we might then be more confident about using formal statistical theory to deal with all the sources of unpredictability. Some epistemic uncertainties should be reducible by further experimentation or observation, so that there is an expectation that we might move towards more aleatory residual error in the future. In hydrology, however, this still seems a long way off, particularly with respect to the hydrological properties of the subsurface. And if, of course, there is no right answer, then this leaves plenty of scope for different philosophical and technical approaches to uncertainty estimation; or, put another way, how to define an uncertainty estimation methodology involves ontological uncertainties (Table 1). In this situation there is a lot of uncertainty about uncertainty estimation, and this is likely to be the case for the foreseeable future. This has the consequence that communication of the meaning of different estimates of uncertainty can be difficult. This should not, however, be an excuse for not being quite clear about the assumptions that are made in producing a particular uncertainty estimate (Faulkner et al. 2007, Beven and Alcock 2012; see later).

Defining non-stationarity (in catchments and model residuals)
Many people think that the only important distinction in the modelling process is between variables that are predictable and uncertainties that are not. Model residuals might have components of both: some identifiable predictable structure as well as some unpredictable variability. The structure indicates some aspect of the system dynamics (or boundary condition and evaluation data) that is not being captured by the model. It is often represented as a deterministic function: in the very simplest case, a stationary mean bias; in more complex cases the function might indicate some structured variability in time or space, such as a trend or seasonal component. The unpredictable component, on the other hand, is usually treated as if the variability is purely aleatory on the basis that if something is not predictable then it should be considered within a probabilistic framework (e.g. Montanari 2007), albeit that, as already noted, the nature of that variability might have some long time scale properties (Koutsoyiannis 2010, Montanari and Koutsoyiannis 2012). This is important because it has implications for evaluating models as hypotheses in the face of epistemic errors (or long time scale aleatory errors). Hypothesis testing has traditionally been the realm of statistical inference and probability, including the recent application of Bayesian statistical theory to hydrological modelling (e.g. Clark et al. 2011). Purely empirically, probability and statistics can, of course, describe anything from observations to model residuals regardless of the actual sources of uncertainty, as an expression of our reasonable expectations (Cox 1946). However, for any particular set of data, the resulting probabilities are conditional on the sample being considered. This is one reason why we try to abstract the empirical to a functional distributional form or the type of empirical non-parametric distributions used by Sikorska et al. (2014) or Beven and Smith (2014).
For simple cases where the empirical sample is random and stationary in its characteristics (after taking account of any well-defined structure), there is a body of theory to suggest what we should expect in terms of variability in statistical characteristics as a function of sample size. There is also then a formal relationship between the statistical characteristics and a likelihood function that can be used in model evaluation. The simplest case is when the statistics of the sample have zero mean bias and constant variance, are independent, and can be summarized as a Gaussian distribution. More complex likelihood functions could take account of bias, heteroscedasticity, autocorrelation, and other assumptions about the distribution. Even these more complex cases, however, are what I have called ideal cases in the past (e.g. Beven 2002, 2006a). Fundamentally, they assume all variability in model residuals is aleatory in nature.
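To make this simplest ideal case concrete, the corresponding formal log-likelihood can be written down directly (the notation here is mine rather than taken from the paper): for n independent, zero-mean Gaussian residuals with constant variance sigma squared,

```latex
\ln L(\theta \mid \varepsilon_1,\ldots,\varepsilon_n)
  = -\frac{n}{2}\ln\!\left(2\pi\sigma^{2}\right)
    - \frac{1}{2\sigma^{2}}\sum_{t=1}^{n}\varepsilon_{t}^{2},
\qquad
\varepsilon_{t} = Q_{t}^{\mathrm{obs}} - Q_{t}^{\mathrm{sim}}(\theta).
```

Allowing for bias, heteroscedasticity or autocorrelation changes the form of this function, but not the underlying premise that all residual variability is aleatory.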
But real problems are not ideal in this sense; as illustrated above, they are subject to arbitrary epistemic errors. It is then debatable whether it is appropriate to treat the errors as if they are aleatory. The reason is that the effective information content of any observations (or model residuals) will be reduced by epistemic uncertainties relative to the ideal case. Why is this? It is because the stationary parameter assumption of the aleatory component gives the possibility of future surprise a very low likelihood. Yet evaluating the performance of hydrological models in real applications often reveals surprises that are clearly not aleatory in this way, including occasional gross under- or over-predictions. This makes it difficult to define a formal statistical model of the residual structure and consequently, if the methods of estimating likelihoods in formal statistics are not valid, makes hypothesis testing of models more difficult (e.g. Beven 2010).
Consider the situation where the estimates of rainfall over a catchment might be of variable quality during a series of events in a model calibration period. The error in the estimates is not aleatory or distributional in nature because the distribution of events is not expected to be stationary (except possibly over very long periods of time, but that is not really of interest for the period of calibration data that might be available). This is the context in which we can describe the variability as rather arbitrary; i.e. we do not really know whether the rainfall uncertainties conform to any statistical distribution or if the errors in a calibration period are a good guide to the errors in the prediction period that we are actually interested in. The same could be true, of course, for aleatory errors with long-term properties (see examples in Koutsoyiannis 2010, Montanari and Koutsoyiannis 2012, Koutsoyiannis and Montanari 2015). The underlying stochastic process might then be stationary but it might be difficult to identify the properties of that process from a short-term sample with apparently non-stationary statistics. These are then both forms of epistemic uncertainty: in both cases we lack knowledge about the arbitrary nature of events or the stochastic process. We could, in principle, constrain that uncertainty by better observational methods, or longer data series, though that is not very useful when we only have access to calibration data collected in the past, even if we might hope to have improved data into the future.

An interesting example in this respect is the post-audit analyses of a number of groundwater modelling studies presented in Konikow and Bredehoeft (1992) and Anderson and Woessner (1992). Model predictions of future aquifer behaviour were compared with what actually happened as the future evolved. In most studies the models failed to predict the future that actually happened. In some cases this was because, with hindsight, the original model turned out to be rather poor; in other cases it was because the future boundary conditions for the simulations had not been well predicted. In hindcasting with the correct boundary conditions the predictions were much better. Hindcasting is not all that useful, however: where modelling is used to inform decision making (as in these groundwater cases) it is predictions of the future that are required. In these studies, therefore, error characteristics were not stationary and the future turned out to hold epistemic surprises (either that the calibrated model was poor, or that the changes in boundary conditions were not those expected).
These examples involve a number of forms of non-stationarity, summarized in Table 2. In Class 1 we place the classical definition of non-stationarity discussed by Koutsoyiannis and Montanari (2015) in the context of stochastic process theory. They, in fact, consider that this is the only legitimate use of the word non-stationarity, in being consistent with its technical definition. In doing so, they are assuming that once any deterministic structure has been taken into account, all forms of epistemic error can be represented by a stationary stochastic model. The parameters of that model will, under the ergodic hypothesis, converge to the true values of the stochastic process as more and more observations are collected. That might, in the case of a complex stochastic process (or even some simple fractal processes), take a very large sample, but that does not negate the principle. Indeed, for a deterministic dynamical system, a stochastic representation will have stationary properties only if it is ergodic. If non-stationarity is assumed, then the system will not have ergodic properties and, Koutsoyiannis and Montanari (2015) suggest, inference will be impossible. This view means that either we are back to treating all epistemic uncertainty as aleatory and stationary, once any deterministic structure has been removed, or we are simply left with unpredictability as a result of lack of knowledge. This view has the backing of formal stochastic theory, but I think there are two issues with it. The first is the difference between what might hold in the ergodic case and the limited sample of behaviours we have in calibrating models in practical applications. The example of a stationary stochastic process giving rise to apparently non-stationary behaviour and statistics, used by Koutsoyiannis and Montanari (2015), illustrates this nicely. If we have access only to a limited part of the full record, we might see periods of different statistical characteristics, or periods that include jumps. Real hydrological data might certainly be of this form, but identification of the true stochastic process would not be possible without very long series (this is true for any fractal type of behaviour). The fact that we know that the changing statistics are produced by a stationary process in such a hypothetical example does not negate the fact that the statistics are changing, and we should be wary of using an oversimplified error model (see discussion of Fig. 2 below).
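A minimal numerical illustration of this first point (my own construction, not taken from the papers cited above): a strongly autocorrelated but strictly stationary AR(1) process, used here as a simple stand-in for longer-memory behaviour, shows window-to-window means and variances that can easily be mistaken for non-stationarity over record lengths typical of calibration periods.

```python
import numpy as np

rng = np.random.default_rng(42)

def ar1(n, phi=0.98, sigma=1.0):
    """Simulate a stationary AR(1) process x_t = phi * x_{t-1} + e_t."""
    x = np.zeros(n)
    e = rng.normal(0.0, sigma, n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

x = ar1(20_000)              # a "long" record from a stationary process
windows = x.reshape(20, -1)  # twenty consecutive "calibration periods"

# The window means and standard deviations differ noticeably even though the
# generating process has fixed parameters; a short sample taken alone could be
# read as evidence of changing statistics.
print(np.round(windows.mean(axis=1), 2))
print(np.round(windows.std(axis=1), 2))
```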
Secondly, the dynamics of a nonlinear catchment model will introduce changes in the statistical properties of residuals, both in the way it processes errors in the inputs and as a result of model structural error that cannot be compensated by a simple deterministic non-stationarity. From a purely hydrological point of view we expect that model residuals should have rather different characteristics on the rising limb, around the peak, and on the falling limb in terms of bias, changing variance, and changing autocorrelation. The problem will be greater for the type of arbitrary event-to-event epistemic input (or model structure) error discussed above. The error in one event will also have an effect on setting up the antecedent conditions for the following event and, in some catchments, for some time into the future. The statistics of the error will be changing. Again, therefore, we should be wary of using an oversimplified error model. It is possible that there may be some complex stochastic model that would describe all the potential changes in error statistics, but it is doubtful that it would be identifiable given the small sample of potential errors in a calibration period. It is notable that, even given a long period of calibration data, Sikorska et al. (2014) did not attempt to identify an underlying stochastic model of the residuals, but instead used a non-parametric probabilistic approach (in the reasonable expectation tradition of Coxian probability, Cox 1946) to represent the changing variability of the modelling uncertainties under different circumstances (see also Beven and Smith 2014). There is a difficulty with any non-parametric method, however, in how to deal with potential uncertainties in the future that are outside the range of those seen in the past.

Why is it important to make these distinctions? It is because it has an impact on what we should expect in testing a model as a hypothesis of how a catchment functions, and in particular whether it should be considered to be fit for purpose. For example, catchments change over time (Non-stationarity Class 2) but models are often fitted with parameters that are assumed constant in time (and often space). Why is this considered acceptable practice? Perhaps because there is an implicit expectation that this type of non-stationarity will be dominated by uncertainty in the boundary conditions used to drive a model (including the potential for Non-stationarity Class 3). There may, of course, be some clues as to whether these non-stationarities are important if there is some identifiable structure in the model residuals that could be included as a deterministic component in Non-stationarity Class 1. But we might only see the net effect of all these non-stationarities in the changing properties of the unpredictable errors (Non-stationarity Class 4), and these are rarely investigated. In practical applications, statistical model inference is normally carried out as if all sources of error were aleatory with simple stationary properties. This assumption allows the full power of statistical inference to be applied to model calibration but would seem to be an unrealistic assumption for hydrological and other environmental models.

Defining likelihood (and the implications for information content and hypothesis testing)
The advantage of taking a formal statistical approach to model calibration is that there is a formal link between the structure of a set of model residuals and the appropriate likelihood function. If, and only if, the assumptions about the structure of the errors are valid, then there is an additional advantage that there is a theoretical estimate of the probability of predicting a new observation. These advantages are undermined by the non-stationarities that arise from epistemic error, which generally mean that the inference process has less information content (or more disinformation) than would be the case if all errors were simply aleatory with stationary parameters. So treating all sources of error as if aleatory will result in over-conditioning (and less protection against surprise in prediction). There is evidence for this in the very tight posterior parameter distributions that often arise in Bayesian calibrations of rainfall-runoff models. The likelihood surface is made very peaky, such that models with very similar error variance can have tens or even hundreds of orders of magnitude difference in likelihood (Fig. 2). That really does not seem realistic to me, and did not when I first started evaluating likelihoods of multiple runs in the 1980s. The origins of the GLUE methodology lie there.
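To see how easily such enormous likelihood ratios arise, consider the sketch below (my own illustration, not the calculation behind Fig. 2): under a simple Gaussian independent-errors likelihood, two models whose residual standard deviations differ by only a few percent diverge by tens of orders of magnitude in posterior odds once a few years of daily residuals have accumulated.

```python
import numpy as np

def gaussian_log_likelihood(residuals, sigma):
    """Formal log-likelihood assuming independent, zero-mean Gaussian residuals."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(r**2) / (2 * sigma**2)

rng = np.random.default_rng(1)
n = 6 * 365  # about six years of daily residuals

# Two models with very similar performance: residual standard deviations of
# 1.00 and 1.05 (an RMSE difference most users would regard as negligible).
res_a = rng.normal(0.0, 1.00, n)
res_b = rng.normal(0.0, 1.05, n)

log_l_a = gaussian_log_likelihood(res_a, res_a.std())
log_l_b = gaussian_log_likelihood(res_b, res_b.std())

# The base-10 logarithm of the likelihood ratio is typically several tens,
# i.e. posterior odds of order 10^30 to 10^50 between near-indistinguishable models.
print((log_l_a - log_l_b) / np.log(10))
```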
Figure 2. (Top) Errors for four model parameter sets within the same model structure (a simple single tank conceptual rainfall-runoff model, see Beven and Smith 2014). (Bottom) Likelihood ratios or posterior odds for three of the models, relative to the first (+ symbol in upper plot), evaluated using a formal likelihood and updated after the addition of further years of model residuals. The formal likelihood used allows for a mean bias, constant variance, and first-order autocorrelation and assumes a Gaussian distribution of model residuals. While similar in RMSE (and visual performance), the different models have likelihood ratios that evolve to be 10^40 different as 6 years of data are added, followed by a rapid reduction in likelihood ratio over the next 3 years.

So one way ahead here might be to find more realistic likelihood functions that reflect the reduced information content for these non-ideal cases and are robust to epistemic error. The question then is how to properly reflect the real information in a set of data when the variations are clearly not aleatory and when the summary statistics might be significantly period dependent. Again, whether the long-term properties are stationary or not is not really relevant; we want to protect against surprise in prediction (as far as is possible for an epistemic problem). In the rainfall-runoff modelling case it has been suggested that the use of summary statistics for model evaluation, such as the flow duration curve, might be more robust to error in this sense (e.g. Westerberg et al. 2011b, Vrugt and Sadegh 2013). Beven and Smith (2014) show how, for the relatively flashy South Tyne catchment in northern England (322 km²), it is possible to differentiate obviously disinformative events from informative events in model calibration within the GLUE methodology. They take an event-based approach to model evaluation that tries to reflect the relative information content expected for informative and disinformative events. They suggest that factors that will increase the relative information content of an event include: the relative accuracy of estimation of the inputs driving the model; the relative accuracy of observations with which model outputs will be compared (including commensurability issues); and the unusualness of an event (extremes, rarity of initial conditions, ...). Factors that will decrease the relative information content of an event include: repetition (multiple examples of similar conditions); inconsistency of the input and output data; the relative uncertainty of observations (e.g. highly uncertain overbank flood discharges would reduce the information content of an extreme event, and discharges for catchments with ill-defined rating curves might be less informative than in catchments with well-defined curves); and a preceding disinformative/less informative event within the dynamic response time scale of the catchment.
The approach depends on classifying events prior to running the model into different classes based on rainfall volume and antecedent conditions. Outlier events can be identified and examined to see if they are disinformative in terms of their runoff coefficients or other characteristics. Limits of acceptability are established for model performance in each class of informative events and a likelihood measure is based on average model performance in each class. The information content for informative events following disinformative events is weighted less highly.
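A schematic of this kind of limits-of-acceptability evaluation is sketched below; the event classes, limits, scoring, and weighting scheme here are simplified stand-ins of my own devising, not a reproduction of the procedure in Beven and Smith (2014).

```python
import numpy as np

def evaluate_model(pred_by_event, obs_by_event, lower, upper, weights):
    """GLUE-style limits-of-acceptability score for one model (parameter set).

    pred_by_event, obs_by_event : dict of event_id -> array of flows
    lower, upper                : dict of event_id -> acceptability bounds on a
                                  summary measure (here, the event peak discharge)
    weights                     : dict of event_id -> relative information content
                                  (reduced, e.g., after a disinformative event)
    Returns a likelihood weight, or 0.0 if any informative event fails its limits.
    """
    scores = []
    for event, obs in obs_by_event.items():
        peak_pred = np.max(pred_by_event[event])
        if not (lower[event] <= peak_pred <= upper[event]):
            return 0.0  # model rejected: outside the limits of acceptability
        # score within the limits: 1 at the observed peak, falling to 0 at the limits
        peak_obs = np.max(obs)
        half_width = max(peak_obs - lower[event], upper[event] - peak_obs)
        scores.append(weights[event] * (1.0 - abs(peak_pred - peak_obs) / half_width))
    return float(np.mean(scores))

# Hypothetical usage for two informative events (units arbitrary)
obs  = {"e1": np.array([1.0, 4.2, 2.0]), "e2": np.array([0.5, 2.1, 1.1])}
pred = {"e1": np.array([1.1, 3.9, 2.2]), "e2": np.array([0.4, 2.4, 1.0])}
lo   = {"e1": 3.5, "e2": 1.6}
hi   = {"e1": 5.0, "e2": 2.6}
w    = {"e1": 1.0, "e2": 0.5}   # e2 follows a disinformative event, so down-weighted
print(evaluate_model(pred, obs, lo, hi, w))
```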
Models that do not meet the limits of acceptability are rejected (given zero likelihood) in the GLUE methodology and do not therefore contribute to the set of models to be used in prediction. This is one way of testing models as hypotheses. Epistemic error also plays a role here in that we would not want to make false negative (Type II) errors in rejecting a model that might be useful in prediction because it has been forced with poor input data. This is more serious than a false positive error in that, if a poor model is not initially rejected, we can hope that future evaluations would reveal its limitations. Statistical inference deals with this problem by never giving a zero likelihood, only very, very small likelihoods to models that do not perform well (as seen in the orders of magnitude change in Fig. 2). This also means, however, that no model is ever rejected and hypothesis testing has to depend on some other subjective criterion, such as some informal limits on the Bayes ratios for competing models. One implication of this is that if no model is rejected there is no guarantee that the best model found is fit for purpose. This must also be assessed separately.
For the South Tyne catchment it turns out that using a standard dataset, as collected by the Environment Agency, there were a large number of disinformative events as distinguished by unrealistically high or low runoff coefficients. Excluding these events from the model calibration results in different posterior distributions of the model parameters (see Fig. 3). It also allows the characteristics of informative and disinformative events to be considered separately.
When it comes to prediction, however, we do not know a priori whether the next event will be informative or disinformative. This can only be evaluated post hoc, once the future has evolved (in model testing, of course, the "future" considered is some "validation" dataset). This may involve non-stationarities of error characteristics that have not been seen in the calibration period. Beven and Smith (2014) allowed for this by evaluating the error characteristics for informative and disinformative events separately and treating each new event as if it might be either informative or disinformative (Fig. 4). This approach was shown to help in spanning the observations for events later shown to be disinformative, but clearly cannot deal with every surprise that might occur in prediction, particularly when the system itself is non-stationary.
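One simple way of operationalising that idea, sketched here under my own assumptions rather than as the published procedure, is to form prediction bounds for each new event from the envelope of quantiles estimated separately from informative-event and disinformative-event residuals.

```python
import numpy as np

def envelope_bounds(simulated, residuals_informative, residuals_disinformative,
                    lower_q=0.025, upper_q=0.975):
    """95% prediction bounds for a new event, treating it as possibly either type.

    simulated   : array of simulated discharges for the new event
    residuals_* : multiplicative residuals (obs/sim) sampled from past events
                  previously classified as informative or disinformative
    """
    sim = np.asarray(simulated, dtype=float)
    bounds = []
    for res in (residuals_informative, residuals_disinformative):
        lo, hi = np.quantile(res, [lower_q, upper_q])
        bounds.append((sim * lo, sim * hi))
    # take the outer envelope of the two error characterisations
    lower = np.minimum(bounds[0][0], bounds[1][0])
    upper = np.maximum(bounds[0][1], bounds[1][1])
    return lower, upper

# Hypothetical residual samples and a simulated hydrograph
rng = np.random.default_rng(0)
inf_res = rng.normal(1.0, 0.15, 200)    # informative events: modest scatter
dis_res = rng.normal(1.3, 0.40, 40)     # disinformative events: biased, wide scatter
lower, upper = envelope_bounds([2.0, 5.5, 3.1], inf_res, dis_res)
print(np.round(lower, 2), np.round(upper, 2))
```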
Defining model rejection in hypothesis testing (and why uncertainty estimation is not the end point of a study)
In the case of the modelling study of the South Tyne catchment, some models were found that satisfied the limits of acceptability. This is not always the case; in other studies no models have satisfied all the criteria of acceptability imposed (see, for example, the attempts at "blind validation" of the SHE model by Parkin et al. 1996 and Bathurst et al. 2004, and the studies of Brazier et al. 2000, Page et al. 2007, Pappenberger et al. 2007, Choi and Beven 2007, Dean et al. 2009, and Mitchell et al. 2011 within the GLUE framework using a variety of different models).
In terms of the science this is, of course, a good thing, in that if all the models are rejected then improvements must be made either to the data or to the model structures and parameter sets within those structures being used. That is how real progress is made. But the possibility of epistemic errors in the data used to force a model might make it difficult to assess how tightly constrained any limits of acceptability should be. We know that all models are approximations, and so such limits should be set to reflect the expectation of how well a model should be able to perform. This is a balance. We should not expect a model to predict to a greater accuracy than the assessed errors in the input and evaluation data. If it does, we might suspect that it has been over-fitted to accommodate some of the particular realization of error in the calibration data.
But we also do not want to make that Type II false negative error of rejecting a model that would be useful in prediction, just because of epistemic errors and disinformation in the forcing or evaluation data.
This suggests that, if we do reject all the models tried as not fit for purpose, we should look first at the data where the model is failing and assess the potential for error in that data, especially if the failures are consistent across a large number of models. In rainfall-runoff modelling this is rarely done, but hydrological modellers are beginning to become more aware of the issues (e.g. Krueger et al. 2009, McMillan et al. 2010, Westerberg et al. 2011a, Kauffeldt et al. 2013). We also have to be careful that we have searched the model space adequately to ensure that no models have been missed. This can be difficult with high numbers of parameters, when the regions of acceptable models in the model space might be quite local. Iorgulescu et al. (2005), for example, made 2 billion runs of a model in a 17-parameter space, of which 216 were found to satisfy the (rather constrained) limits of acceptability. Blazkova and Beven (2009) made 600 000 runs of a continuous simulation flood frequency model and found that only 37 satisfied all the limits of acceptability. They also demonstrated that whether this was the case depended on the stochastic realization of the inputs used. Improved efficiency of sampling within this type of rejectionist strategy might then be valuable (e.g. the DREAM ABC code of Sadegh and Vrugt 2014).
But where all the models tried consistently fail, and we do not have any reason for suggesting that the failure is due to disinformative data, then it suggests that a better model is needed. This might lead to new hypotheses about how the system is functioning, or new ways of representing some processes (see also Gupta and Nearing 2014). Model rejection is not a failure; it is an opportunity to improve either the model or the data or both. Finding a better model will not provide total protection against future epistemic surprises but would, we hope, be a step in the right direction. How big a step is possible, however, will also depend on reducing uncertainty in the forcing and evaluation data.

Communicating uncertainty to users of model predictions
There are two main reasons for incorporating uncertainty estimation into a study. One is for scientific purposes, to improve understanding of the problem and carry out hypothesis testing more rigorously. The second is because taking account of the uncertainty in model predictions might make a difference to a decision that is made in a practical application: for example, whether the planning process can take account of uncertainty in the predicted extent of flooding for the statutory design return period. For this second purpose it is necessary to communicate the meaning of the model predictions, and their associated uncertainties, to decision makers (e.g. Faulkner et al. 2007).
But, as we have seen, there can be no right answer to the estimation of uncertainty. Every estimate is conditional on the assumptions that are made, and in most applications there are many assumptions that must be made. In this case it might be useful to the communication process if the users, or particular groups of users, are introduced to the nature of those assumptions. In fact, it will generally facilitate the communication process if the users can be involved in making decisions about the relevant assumptions whenever possible. The collection of assumptions that underlie any particular application can be considered to be a form of "condition tree" (Beven and Alcock 2012, Beven et al. 2014). At each level of the condition tree the assumptions must be made explicit, forming an audit trail for the analysis. It has even been suggested that every uncertainty assessment should be labelled with the names of those who produced it (and, by extension, perhaps those who agreed the assumptions on which it is based).

Figure 4. A sample of events taken from the model evaluation period. Each event is treated as if it is either informative (shaded 95% prediction bounds) or disinformative (dotted 95% prediction bounds). The first event is evaluated (a posteriori) as disinformative, the last two as informative. Further details of this study can be found in Beven and Smith (2014).
Can we talk of confidence rather than uncertainty in model simulations?
Decisions about hydrological systems are made under uncertainty, and often severe uncertainty, all the time. Decision and policy makers are, however, far more interested in evidence than uncertainty. Evidence-based framing has become the norm in many areas of environmental policy (e.g. Boyd 2003). In the UK, the Government has considered standards for evidence (Intellectual Property Office 2011) and the Environment Agency has an Evidence Directorate and produces documents summarizing the evidence that underpins its corporate strategy. Clearly such an agency wants to have confidence in the evidence used in such policy framing. Confidence should be inversely related to error and uncertainty, but is often assessed without reference to quantifying uncertainty in either data or model results.
An example case study is the benchmarking exercise carried out to test 2D flood routing models (Environment Agency 2013). Nineteen models were tested on 12 different test cases, ranging from dam break to fluvial and urban flooding. All the test cases were hypothetical, with specified roughness parameters, even if in some of the cases the geometry was based on real areas. Some had observations available from laboratory test cases. Thus, confidence in this case represents agreement amongst models. It was shown that not all models were appropriate for all test cases, particularly those involving supercritical flow, and that some models that used simplified forms of the St Venant equations, while faster to run, had more limited applicability. Differences between models depended on model implementation and numerics, so that acceptability of a model in terms of agreement with other models was essentially a subjective judgment.
There is an implicit assumption in assessing confidence in this way that, in real applications to less than ideal datasets, the models that agree can be calibrated to give satisfactory simulations for mapping and planning purposes. While the report did recommend that future comparisons should also aim to assess the value of models in assessing uncertainty in the predictions, the impacts of epistemic uncertainty in defining the input, roughness parameters, and details of the geometry of the flow domain would seem to be more important than the differences between models in which we have confidence after such testing. In real applications confidence can only be assessed by comparison with observed data, while allowing for uncertainties in inputs. Even then, there is evidence that effective values of roughness parameters might change with the magnitude of an event, so that confidence in calibration might not carry over to more extreme events (Romanowicz and Beven 2003). Yet, for planning purposes, the Environment Agency is interested in mapping the extent of floods with annual exceedance probabilities (AEP) of 0.01 and 0.001. It is, of course, rather rare to have observations for floods within this range of AEP; more often we need to extrapolate to such levels.
It is possible to assess the uncertainty associated with such predictions and to visualize that uncertainty either as probability maps (e.g. Leedal et al. 2010, Neal et al. 2013) or as different line styles depending on the uncertainty in flood extent in different areas (Wicks et al. 2008). In some areas, where the flood fills the valley floor, the uncertainty in flood extent might be small, but the uncertainty in water depth, with its implications for damage calculations, might be important. In other, low-slope, areas the uncertainty in extent might be significant. The advantage of doing both estimates is that confidence can be given a scale, even if, as in the Intergovernmental Panel on Climate Change (IPCC), that scale is expressed in words rather than probability. In fact, the IPCC distinguishes a scale of confidence (from "very low" to "very high") from a scale of likelihood (from "exceptionally unlikely" to "virtually certain", based on a probability scale) (IPCC 2010). Confidence indicates how convergent the estimates of past and future change are at the current time; likelihood indicates the degree of belief in particular future outcomes. Thus the summary of the outcomes from IPCC5 states: "Ocean warming dominates the increase in energy stored in the climate system, accounting for more than 90% of the energy accumulated between 1971 and 2010 (high confidence). It is virtually certain that the upper ocean (0-700 m) warmed from 1971 to 2010, and it likely warmed between the 1870s and 1971. It is very likely that the Arctic sea ice cover will continue to shrink and thin and that Northern Hemisphere spring snow cover will decrease during the 21st century as global mean surface temperature rises." (IPCC 2013).

Now the IPCC will not assign any probability estimates to any of the model runs that contribute to their conclusions. They are described as projections, subject to both model limitations and conditional on scenarios of future greenhouse gas emissions. The future scenarios, and hence any probability statements, are necessarily incomplete. This has not, however, stopped the presentation of future projections in probabilistic terms in other contexts, such as those derived from an ensemble of regional model runs in the UK Climate Projections (UKCP09, see http://ukclimateprojections.defra.gov.uk). The outcomes from UKCP09 are being used to assess impacts on UK hydrology (e.g. Cloke et al. 2010, Bell et al. 2012, Kay and Jones 2012), but there is sufficient epistemic uncertainty associated with both the input scenarios and the climate model implementations to be concerned about expressions of confidence or likelihood in these cases, when the probabilities may be incomplete and we should be aware of the potential for the future to surprise (Beven 2011, Wilby and Dessai 2010). Incomplete probabilities are inconsistent with a risk-based decision-theoretic approach based on the exceedance probabilities of risk, although it might be possible to assess a range of exceedance curves under different assumptions about future scenarios (Rougier and Beven 2013).
We are often in this situation. Hence the need to agree assumptions and methodologies with potential users of model outcomes, as discussed in the last section. Consequently, any expressions of confidence or likelihood are conditional on the assumptions, a conditionality that depends not only on what has been included, but also on what might have been left out of an analysis. There will of course be epistemic uncertainties that are "unknown unknowns". Those we do not have to worry about until, for whatever reason, they are recognized as issues and become "known unknowns". More important are factors that are already "known unknowns", but which are not included in the analysis because of lack of knowledge or lack of computing power or some other reason. Confidence and likelihood need to reflect the sensitivity of potential decisions to such factors, since they are not easily quantified in uncertainty estimation.

An uncertain future?
So, while quantitative uncertainty estimation is valuable in assessing the range of potential outcomes consistent with an (agreed) set of assumptions, it will generally be the case that difficult-to-handle epistemic uncertainties will mean that the assessment is incomplete (for good epistemic reasons). Future surprises come from that incompleteness (e.g. Beven 2013). Assessments of evidence and expressions of confidence and likelihood should reflect the potential for surprise, and robust decisions need to be insensitive to both the assessed uncertainty and the potential for surprise (erring on the side of caution, risk aversion, or being precautionary). From a modeller's perspective this has the advantage that it will reduce the possibility of a future post-audit analysis showing that the model predictions were wrong, even if why that is the case might be obvious with hindsight (it is quite possible that this will be the case with the current generation of climate models as future improvements start to reduce the errors in predicting historical precipitation, for example).
From a decision maker's perspective, the issues are more problematic. If, even with a detailed (and expensive) assessment of uncertainty, there remains a potential for surprise, then just how risk averse or precautionary is it necessary to be in order to make robust decisions about the future? The answer is probably that we often cannot afford to be sufficiently robust in adapting to change; it will just be too expensive. The costs and benefits of protecting against different future extremes can be assessed, even if the probability of such an extreme might be difficult to estimate. In that situation, the controlling factor is likely to be the available budget (Beven 2011). That should not, of course, take away from the responsibility for ensuring that the science that underlies the evidence is as robust as possible, and communicated properly, even if the uncertainties are high and we cannot be very confident about future likelihoods in providing evidence to decision makers.