Uncertainty quantification for computer models with spatial output using calibration-optimal bases

The calibration of complex computer codes using uncertainty quantification (UQ) methods is a rich area of statistical methodological development. When applying these techniques to simulators with spatial output, it is now standard to use principal component decomposition to reduce the dimensions of the outputs in order to allow Gaussian process emulators to predict the output for calibration. We introduce the `terminal case', in which the model cannot reproduce observations to within model discrepancy, and for which standard calibration methods in UQ fail to give sensible results. We show that even when there is no such issue with the model, the standard decomposition on the outputs can and usually does lead to a terminal case analysis. We present a simple test to allow a practitioner to establish whether their experiment will result in a terminal case analysis, and a methodology for defining calibration-optimal bases that avoid this whenever it is not inevitable. We present the optimal rotation algorithm for doing this, and demonstrate its efficacy for an idealised example for which the usual principal component methods fail. We apply these ideas to the CanAM4 model to demonstrate the terminal case issue arising for climate models. We discuss climate model tuning and the estimation of model discrepancy within this context, and show how the optimal rotation algorithm can be used in developing practical climate model tuning tools.


Abstract
The calibration of complex computer codes using uncertainty quantification (UQ) methods is a rich area of statistical methodological development. When applying these techniques to simulators with spatial output, it is now standard to use principal component decomposition to reduce the dimensions of the outputs in order to allow Gaussian process emulators to predict the output for calibration. We introduce the 'terminal case', in which the model cannot reproduce observations to within model discrepancy, and for which standard calibration methods in UQ fail to give sensible results. We show that even when there is no such issue with the model, the standard decomposition on the outputs can and usually does lead to a terminal case analysis. We present a simple test to allow a practitioner to establish whether their experiment will result in a terminal case analysis, and a methodology for defining calibration-optimal bases that avoid this whenever it is not inevitable. We present the optimal rotation algorithm for doing this, and demonstrate its efficacy for an idealised example for which the usual principal component methods fail. We apply these ideas to the CanAM4 model to demonstrate the terminal case issue arising for climate models. We discuss climate model tuning and the estimation of model discrepancy within this context, and show how the optimal rotation algorithm can be used in developing practical climate model tuning tools.

Introduction
The design and analysis of computer experiments, now part of a wider crossdisciplinary endeavour called 'Uncertainty Quantification' or 'UQ', has a rich history in statistical methodological development as far back as the landmark paper by Sacks et al. (1989). The calibration of computer simulators, a term reserved for methods that locate simulator input values with outputs that are consistent with physical observations (the inverse problem), is a well studied problem in statistical science, with Kennedy and O'Hagan's Bayesian approach based on Gaussian processes the most widely used (Kennedy and O'Hagan, 2001).
The essence of the statistical approach to calibration is to combine a formal statistical model relating the computer simulator to real-world processes for which we have partial observations (Kennedy and O'Hagan, 2001;Goldstein and Rougier, 2009;Williamson et al., 2013), with a statistical representation of the relationship between inputs and outputs of the simulator based, typically, on Gaussian processes (Haylock and O'Hagan, 1996).
Extensions for computer simulators with spatio-temporal output have centred around projecting the output onto a basis and adapting calibration methods to the lower-dimensional projections of these fields. Though wavelets (Bayarri et al., 2007) and B-splines (Williamson et al., 2012) have been tried, the approach due to Higdon et al. (2008), based on the principal components of the simulator output, has become the default method. Statistical methodological developments in UQ have built on principal component methods (e.g. Wilkinson (2010); Chang et al. (2014Chang et al. ( , 2016), and they have seen wide application, particularly in the analysis of climate models (Sexton et al., 2011;Chang et al., 2014;Pollard et al., 2016).
What statisticians term calibration is referred to as 'tuning' in the climate modelling community, a process that has a huge influence on the projections made by each modelling centre and by the Intergovernmental Panel on Climate Change (Stocker et al., 2013). Each modelling centre submits integrations of their climate model for 4 different forcing scenarios (known as Representative Concentration Pathways) to each phase of the Coupled Model Intercomparison Project (Meehl et al., 2000), with the input parameters of the model 'tuned' prior to submission so that the model output compares favourably with certain key observations. The resulting integrations, and not the simulators themselves, are what most climate scientists call 'climate models' (i.e. simulators are not considered to be functions of these now fixed parameters). These integrations are used to discover physical mechanisms (Scaife et al., 2012), projected trends (Screen and Williamson, 2017), drivers of variability (Collins et al., 2010) and future uncertainty to aid policy making (Harris et al., 2006).
Despite the application of UQ methods to the calibration of 'previousgeneration' climate models, referred to in the papers above and many others, UQ is not used for tuning within any of the major climate modelling centres (Hourdin et al., 2017). Instead, climate model parameters are often explored individually and tuning done by hand and eye, with the parameters changed, and the new run either accepted or rejected based on heuristic comparison with the current 'best' integration. Different descriptions of these processes are offered by Mauritsen et al. (2012); Williamson et al. (2017);Hourdin et al. (2017).
This lack of uptake of state-of-the-art statistical methodology for calibration amongst some of the world's most important computer simulators should give us pause for thought. The 'off-the-shelf' methodology, Bayesian calibration with principal components, is widely used elsewhere, well published, and is applied to many lower resolution climate models within the climate science literature. Is the lack of uptake a communication issue, or are there features of our methodology that mean it doesn't scale up well to climate simulators?
In this paper we show how the terminal case, wherein a simulator cannot be satisfactorily calibrated, manifests in the inference of standard UQ methodologies. We then demonstrate that even when there is a good solution to the inverse problem, the use of standard basis representations of spatial output (e.g. principal components across the design) can and regularly do lead to the terminal case and incorrect inference. We develop a simple test to see whether an analysis will lead to the terminal case before performing the calibration and, when the terminal case is not guaranteed, provide a methodology for finding an optimal basis for calibration, via a basis rotation. The efficacy of our methodology is demonstrated through application to an idealised example, and its relevance to climate model tuning through application to the calibration of the atmosphere of the current Canadian climate model, CanAM4.
In Section 2, we review UQ methodologies for calibration and present the terminal case for scalar model output. Section 3 reviews the standard approach to handling spatial output and demonstrates the implications of the terminal case for these methods through an idealised example. Section 4 presents novel methods for finding optimal bases for calibration that overcome the terminal case issues and demonstrates the efficacy of calibrating with optimal bases for our example. In Section 5 we see that standard approaches always lead to terminal analyses in CanAM4, and show how our optimal basis methodology can be used in the process of climate model tuning. Section 6 contains discussion.

Calibration methodologies and the terminal case
We consider a computer simulator to be a vector-valued function f (θ, x), with input parameters θ that we wish to estimate/constrain, and 'control' or 'forcing' parameters, x, both of which can be altered to perform computer experiments. For example, x might represent future CO 2 concentrations in a climate model. f (·, x) simulates a physical system y(x), and we have access to measurements or observations z, of part or all of y. The goal of calibration methods is to use z to learn about θ. In what follows we remove the control parameters, x, to simplify the notation, as they are not involved in calibration, but in subsequent prediction.
The two statistical methodologies for calibration that we focus on here are Bayesian (or probabilistic) calibration (Kennedy and O'Hagan, 2001;Higdon et al., 2008), and history matching with iterative refocussing (Craig et al., 1996;Vernon et al., 2010;Williamson et al., 2017). Both begin with the same type of assumption, namely that there exists a best input setting, θ * , so that for mean-zero independent observation errors, e, and model discrepancy, η (though history matching differs in only requiring uncorrelated terms in (1) rather than independent terms).
Both methods require an emulator, usually a Gaussian process representation of function f (θ), trained using runs F = (f (θ 1 ), ..., f (θ n )) based on design X = (θ 1 , . . . , θ n ). For scalar f (·), the general model is where g(θ) is a vector of specified regressors, β their coefficients, and R(|θ − θ |; φ) a weakly stationary covariance function with parameters φ. The model is completed by specifying a prior on the parameters, π(β, φ), and posterior inference given F follows naturally with There are many variants on emulation, with some practitioners preferring no regressors (Chen et al., 2016), different types of correlation function (including no correlation) (Kaufman et al., 2011;Salter and Williamson, 2016), and different priors, π(β, φ), with some leading to partially analytic posterior inference (Haylock and O'Hagan, 1996). As history matching only requires posterior means and variances of the emulator, Bayes linear analogues are sometimes used (Vernon et al., 2010). Generalisations to multivariate Gaussian processes are natural (Conti and O'Hagan, 2010), and we address the difficulty with high dimensional output from Section 3 onwards.

Probabilistic calibration
Though the underlying statistical model and the emulator are similar for both history matching and probabilistic calibration, the assumptions placed upon θ * , and the resulting inference, are quite different. Probabilistic calibration places a prior on θ * , π(θ * ), and a Gaussian process prior for the discrepancy, η ∼ GP(0, Σ η ), before deriving the posterior π(θ * , η|F, z), and marginalising for θ * . The discussion of Kennedy and O'Hagan (2001), and the later paper by Brynjarsdóttir and O'Hagan (2014), argue that lack of identifiability between θ * and η mean that strong prior information on η or θ * is essential for effective probabilistic calibration to be possible.

History matching and iterative refocussing
Note that, given a discrepancy variance, probabilistic calibration must still give a posterior π(θ * |F, z) that integrates to 1, thus predetermining an analysis that will point to some region of parameter space Θ as being 'most likely'. This can be undesirable in some application areas, as often the goal is to find out if the simulator can get 'close enough' to the observations, so that experiments predicting the future can be trusted. Climate model tuning is a good example of this, where part of the goal in tuning is to find out whether it is the choice of parameters, or the parameterisation itself, that is leading model bias (Mauritsen et al., 2012;Hourdin et al., 2017).
The method of history matching and iterative refocussing allows the question of whether the model is fit for purpose to be answered as part of the calibration exercise, by altering the problem from one of looking for the best input directly, to one of trying to rule out regions of Θ that could not contain θ * . A model unfit for purpose would have all of Θ ruled out. The method defines an implausibility measure, I(θ), with where the expectations and variances of f (θ) are derived from the Gaussian process emulator description above, and are conditioned on the runs F. If I(θ) exceeds a threshold, T , that value of θ is considered implausible and ruled out, thus defining a membership function for a subspace Θ of Θ that is Not Ruled Out Yet (NROY), with Θ = {θ ∈ Θ : I(θ) ≤ T } . The choice of T will be problem dependent, though typically, if z is one-dimensional, Pukelsheim's three sigma rule (Pukelsheim, 1994) is used to set T = 9 (Craig et al., 1996;Williamson et al., 2015). For -dimensional z, Vernon et al.
(2010) define T = χ 2 ,0.995 , the 99.5th percentile of the χ 2 -distribution with degrees of freedom, or a conservative T can be derived through Chebysev's inequality.
A key principle behind history matching is its iterative nature. Following an initial set of runs, a 'wave' of history matching is conducted, leading to a certain percentage of Θ being ruled out. A new wave can then be designed within NROY space, and the procedure repeated, refocussing the search for possible θ * .
Discrepancy and observation error variances, Σ η and Σ e , are important in both probabilistic calibration and history matching. For the latter, (3), whilst a Normal assumption on e in calibration means Σ η and Σ e appear in the likelihood.
In this paper, we focus on optimal spatial calibration for both types of methodology, as the issues we shall identify in Section 3 apply equally to both, though manifest in different ways, as we shall illustrate now with our discussion of the terminal case.

The terminal case
Consider a computer simulator, f (θ), a discrepancy variance assessment Σ η , and an observation error variance Σ e , where both variance matrices are positive definite. We define the terminal case to occur when I(θ) > T , for T as above and for a perfect emulator, so that, in equation (3), and Var[f (θ)] = 0 for all θ. So, from a history matching perspective, the terminal case occurs when the model is too far from the observations at every point in parameter space according to the model discrepancy. Hence, all of Θ is ruled out, and the modellers must reconsider their simulator, or their error tolerance.
Within a probabilistic calibration framework, the terminal case implies a prior-data conflict so that, in some sense, Σ η has been 'misspecified' or the expert is 'wrong'. Lack of identifiability requires informative expert judgement for discrepancy (Brynjarsdóttir and O'Hagan, 2014), yet the difficulty in providing such judgements for complex computer simulators (Goldstein and Rougier, 2009) may mean that the terminal case would occur quite often in practice. It is therefore important to see how such prior-data conflict would manifest. spike to an unexpected (a priori) part of parameter space. It is often not efficient to run expensive simulators, such as climate models, that require expert time to run and manage, one run at a time (Williamson, 2015). The scientists that manage jobs on supercomputers, for example, require batches of runs that can be run in parallel. However, batch designs could be even worse here. Guided by the posterior density at each point, batch designs would be the near equivalent of one point at the MAP estimate, simply shifting the peak of the posterior to somewhere as yet unsampled.
Eventually, as we see from the bottom 4 panels, posterior uncertainty in f (θ) is sufficiently reduced, and π(θ * | F, z) settles on the 'least bad' value of θ, where f (θ) is closest to the observations (though around 30 standard deviations away). For simulators with input spaces of much higher dimensions (the climate models we work with have typically specified 10-30 parameters to focus on, though these would be a subset of several hundred), we are unlikely to ever be able to reduce emulator uncertainty to the extent that the posterior spike settles over the least bad parameter setting. Hence, under an iterative procedure such as this, we would continue to chase the best input throughout parameter space, constantly moving the spike as in a game of whack-a-mole, until we run out of resources.
Our illustration of the terminal case shows that though careful subjective prior information is required for model discrepancy in order to overcome the identifiability issues with the calibration model, if those judgements lead to a prior-data conflict via a terminal case, good calibration will not be possible, and it will take a great deal of resource (enough data to build a near perfect emulator everywhere) to discover this. It would seem more natural to first history match in order to check we are not in a terminal case, and, if not, perform a probabilistic calibration within NROY space as in Salter and Williamson (2016).
Whichever calibration method, or combination of them, is preferred, it is important to understand this terminal case, as we shall show that even for models that can reproduce observations exactly, tractable methods for calibrating high dimensional output can result in a terminal case analysis.

Calibration with spatial output
For spatial fields, the most common approach to emulation and calibration involves projecting the model output onto a low-dimensional basis, Γ, and emulating the coefficients, so that fewer emulators are required (Bayarri et al., 2007;Higdon et al., 2008;Wilkinson, 2010;Sexton et al., 2011) (although alternatives, such as emulating every grid box individually, have been applied, e.g. by Gu et al. (2016)).
Writing the model output f (θ i ) as a vector of length , so that F has dimension × n, the singular value decomposition (SVD) is used to give n eigenvectors that can be used as basis vectors (equivalently, finding the principal components) (Higdon et al., 2008;Wilkinson, 2010;Sexton et al., 2011;Chang et al., 2014Chang et al., , 2016. For the size of model output typically explored using these methods, Γ will not be of full rank as n << . This means that while F can be represented exactly by projection onto Γ, general -dimensional fields will not have a perfect representation on Γ. As the majority of the variability in F is usually explained by only the first few eigenvectors, the basis is truncated after q vectors, giving a basis Γ q = (γ 1 , . . . , γ q ) of dimension × q, often chosen so that more than 95% of F is explained by Γ q . Various rules-of-thumb are used dependent on the problem, e.g. Higdon et al. (2008)  In order to emulate the model, the runs are first centred by subtracting their mean, µ, from each column of F, giving the centred ensemble F µ (we use the term ensemble to mean the collection of runs, as is common in the study of climate models). F µ is then projected onto the basis Γ q , giving q coefficients associated with each parameter choice: Given q coefficients, a field of size is reconstructed via Emulators for the coefficients of the first q SVD basis vectors are then built: Given these emulators, calibration can either be performed using the entire -dimensional output, with emulator expectations and variances transformed to the -dimensional space of the original field (Wilkinson, 2010), or on its qdimensional basis representation, with the observations projected onto this basis (Higdon et al., 2008).
Calibration (via either history matching or probabilistic calibration) re-quires an informative prior process model for the spatial discrepancy, η. This could be a stationary process defined through a simple covariance function over the output dimensions, though a richer class of non-stationary process defined via kernel convolution is often used (Higdon, 1998;Chang et al., 2014Chang et al., , 2016. These approaches specify a number of knots over the spatial field and define discrepancy to be a mixture of kernels around each of these knots. As with any calibration problem, however, strong prior information for discrepancy processes is essential to overcome identifiability issues, as discussed in Section 2.3. The way to include this information has been to fix the correlation parameters of the kernels and to have an infomative prior for their variances. With such a prior, a terminal case analysis is just as possible as for the 1D example we presented earlier. Suppose that the prior on the process is strong enough to overcome identifiability issues and is such that we don't have a terminal case. When using a basis emulator to calibrate f (·), we may artificially induce a terminal case analysis, as reconstructions from coefficients on the basis are restricted to a q-dimensional subspace of -dimensional space. Further, it will not be clear whether our analysis implies that the model is incapable of reproducing z, or that this was due to a poor basis choice. The SVD basis chooses the q-dimensional subspace that explains the maximum amount of variability in F with the fewest number of basis vectors. This choice does not guarantee that important directions in F that are consistent with z are preserved.

Illustrative example
We illustrate this problem with an idealised example of a 6 parameter function f (θ) (detailed in Section S1), with output given over a 10 × 10 grid.
Observations, z, are given by a known input parameter setting, f (θ * ), with so that a calibration exercise should be able to identify θ * . In our example, the great majority of the input space leads to output that is biased away from z: the proportion of input space leading to output consistent with z is around 0.01%.
The first panel of Figure 2 shows the observations, z, with a strong signal on the main diagonal. The second panel is the mean of the output field over a maximin Latin hypercube sample of size 60 in Θ (i.e. the mean of ensemble F). The strong signal in the ensemble is a biased version of z. In a climate context, this is analogous to the Gulf Stream being observed in the incorrect place in model output.
We calculate the SVD basis Γ as described above. Over 95% of the ensemble variability is explained by projection onto the first four basis vectors, which we refer to as the 'truncated basis', Γ 4 . If we project z onto this basis and reconstruct the original field using these coefficients, via equations (4) and (5), we obtain the field given by the third panel of Figure 2: the distinctive pattern found in z has been lost. That is, spatial calibration with Γ 4 would ultimately rule out parameter space that contained the true coefficients due to poor reconstruction, suggesting that, for reconstructions of the field using Γ 4 , we are in the terminal case.
Fitting Gaussian process emulators to the coefficients given by projection of F µ onto the four basis vectors, the expectation and variance at θ is given by To probabilistically calibrate or history match on the original field, we re- and Var[f (θ)] in terms of the coefficient emulators. These where Γ −q contains the discarded basis vectors, and Σ −q is a diagonal matrix with the associated eigenvalues as the diagonal elements (Wilkinson, 2010).
Calibrating in the coefficient subspace requires projection of z, Σ η and Σ e onto Γ 4 . For example, the implausibility in (3) on the coefficients be-comes Using the 0.995 value of the chi-squared distribution with 100 degrees of freedom to history match via (3), we rule out the whole parameter space, Θ, and so we are in the terminal case. Hence probabilistic calibration gives peaked prediction at the incorrect value of θ * , consistent with the description given in Section 2.3 (see SM section S1.1, Figures S3, S4).
By history matching on the coefficients instead, using (7), and setting T using the chi-squared distribution with 4 degrees of freedom, we find an NROY space consisting of 3.8% of Θ. However, we rule out 58% of the parameter space that was consistent with z, as the important directions for comparing the model to observations are not contained in Γ 4 .
Whether we are calibrating on the original field, or on the coefficients, the 'best' result we are able to find is that given by the reconstruction of z with Γ 4 , given in the final panel of Figure 2. A natural method for satisfying the first goal is to minimise the error given when the observations are reconstructed using B. Define the recon- where to (3), and is the implausibility when we know the basis coefficients exactly (so that the emulator variance is 0).
As W will not generally be a multiple of the identity matrix, the SVD projection from (4) is not appropriate for R W (·, z). Therefore, (4) becomes with this projection minimising the error in · W (Section S2), hence the definition of the reconstruction error in (8).
We present everything in full generality for positive definite W. Therefore, B is an orthonormal basis if B T W −1 B = I n . A basis with this property can be obtained using generalised SVD (Jolliffe, 2002), with W = I giving the usual SVD decomposition: As a measure of whether emulators can be built, we use the proportion of variability explained by projection of the ensemble onto each basis vector The proportion of ensemble variability explained by B, The SVD basis maximises V k (B, F µ ) for each k, given the previous vectors and subject to orthogonality.
Prior to building emulators and performing calibration for a given ba-   We refer to plots of this type as VarMSE plots. The red line represents , and the blue line shows V(B k , F µ ), for each truncated basis, impossible. If we could emulate the coefficients for the 5th and 6th basis vectors, we would more accurately represent z, although this rapid decrease in the reconstruction error is a feature of our example, rather than a gen-eral property of the SVD basis, and therefore we still truncate at 95% for illustrative purposes.
The SVD basis aims to maximise the blue line for each basis vector added, whereas, for calibration, we require the red line to be below T . The problem of basis selection for calibration is one of trading off these two requirements, reducing R W (·, z) while ensuring that each V k (·, F µ ) is large enough to enable emulators to be built. Given that the full SVD basis may contain information and patterns that allow the observations to be more accurately represented, the information contained in this basis may be combined in such a way that the resulting basis is suitable for calibration, with important low-order patterns blended with those that explain more of the ensemble variability.

Rotating a basis
Performing a rotation of an ensemble basis B using an n×n rotation matrix, Λ, rearranges the signal from the ensemble, potentially allowing the new truncated basis to be a better representation of z. A general n × n rotation matrix Λ can be defined by composing n(n − 1)/2 matrices that give a rotation by an angle around each pair of dimensions (Murnaghan, 1962).
Our goal is to find Λ such that BΛ minimises R W ((BΛ) q , z), subject to constraints on V k (·, F µ ) that allow the projected coefficients to be emulated.
To directly define a rotation matrix Λ via optimisation requires a large number of angles to be found, even when the ensemble size is small. Instead, we take an iterative approach, selecting new basis vectors sequentially while minimising R W (·, z) at each step, in such a way that guarantees that the resulting basis is an orthogonal rotation of the original basis.

The optimal rotation algorithm
Given an orthogonal basis B for F µ with dimension × n; a positive definite proportion of the ensemble variability to be explained by the i th basis vector; the total proportion of ensemble variability to be explained by the basis v tot ; and a bound T (usually that implied by history matching, T = χ 2 ,0.995 ), we find an optimal basis for performing calibration as follows: 1. If R W (B, z) > T , stop and revisit the specification of W, or add more runs to F µ . Else set k = 1.

Let
Define the new normalised vector as 3. Find the residual basis given Γ * k , B k , and form the orthogonal rank n basis and return Γ * q as the truncated basis for calibration. Else, set k = k+1 and B = [B k ] n−k , and return to step 2.
Prior to applying the algorithm, we must specify an initial basis, B, a weight matrix, W, and the parameters v tot and v to control the amount of variability explained by each basis vector. We use the SVD basis (with respect to W) for B, however other choices are possible, e.g. we could apply Gram-Schmidt to the ensemble itself and rotate this, or apply a different scaling to the SVD basis.
At each step, our algorithm selects the linear combination of a given basis that minimises R W (·, z), subject to explaining a given percentage of ensemble variability, and given any previously selected basis vectors. If the defined truncation Γ * q satisfies R W (Γ * q , z) < T , then the algorithm terminates, as standard residual variance maximising basis vectors no longer lead to a terminal case analysis. We allow a basis to be identified that satis-fies our two goals: we do not rule out z, and have coefficients that can be emulated, if v is set appropriately. To optimise for λ k , we use simulated annealing (Yang Xiang et al., 2013), although any optimisation scheme that converges could be used.
The check in step 1 of our algorithm is due to the following result (proved in S2): Result 1 (Invariance of R W (·, ·) to rotation). For a rotation matrix Λ of dimension k × k, and a set of basis vectors B = (b 1 , . . . , b n ), we have Regardless of the rotation that is applied to B, we cannot reduce the reconstruction error below that given by the full basis originally. However, because the SVD basis is always truncated prior to this minimum value being reached, we can search for a rotation that rearranges the information from the SVD basis in such a way that satisfies incorporating important, potentially low-order, patterns into the q basis vectors that we emulate. Hence step 1 of the algorithm provides an important test as to whether our ensemble and uncertainty assessment, (F, W), are sufficient to avoid a terminal case analysis, and shows when a rotation exists, up to the choice of v.
Theorem 1. Γ * in step 3 of the optimal rotation algorithm is an orthogonal rotation of B.
The results and proofs required, and the proof of Theorem 1 itself, are found in Section S2. Given that B passes step 1 of the algorithm, existence of an optimal rotation depends on the choice of v: Theorem 2. At the k th iteration of the optimal rotation algorithm, given In this case the algorithm converges to γ * k .
Proof. By construction, as by construction c * k minimises the reconstruction error in the W norm.
In practice, when applying our algorithm to high dimensional model output, we have found that only a small number (three or fewer) of iterations have been required, hence v often has a low dimension. The values of v required will depend on the problem, with a different approach required when a small number of vectors explain the majority of the ensemble, compared to when a large proportion of the variability is spread across many SVD basis vectors. In the former case, the values of v may be relatively high, whilst in the latter they can be lower, relative to the proportion explained by the equivalent SVD basis vectors. A reasonable approach is to initially set v as half of the proportion explained by the corresponding SVD basis vectors, reducing these further if the resulting q is too large. As Theorem 2 shows, it is possible to set v in such a way that the algorithm is unable to find a suitable basis. If we cannot find a k th basis vector that satisfies the variability constraint, given Γ * k−1 , then a basis doesn't exist for this choice of v, and the specification needs revisiting: either v k needs to be decreased, or an earlier constraints needs relaxing.
The choice v tot is also a concern for the standard UQ approaches based on principal components. In our experience, using similar rules (e.g. 95% or 99%) to the SVD applications leads to 0-2 extra basis vectors required.
In an application, it may be desirable to include certain physical patterns, deemed to be important, in our basis B, which may not lie within the subspace defined by F µ . In this case, if we have p selected physical vectors, B p = (b 1 , . . . , b p ), combining these with the first n − p vectors of the residual basis will not necessarily explain all of the variability in F µ .
The algorithm may be applied to the n + p vectors given by the physical vectors and the full residual basis, giving a rotation of this space rather than of F µ . As truncation occurs after the majority of variability of F µ , v tot , is explained, the resulting truncated basis, while not strictly a rotation of the subspace defined by F µ , will exhibit similar qualities, and may be superior for representing z, if important physical patterns can be emulated when combined with signal from the ensemble.
To perform the algorithm with basis vectors from outside the subspace defined by F µ , rather than finding linear combinations of the residual basis at step k > 1, B = (B p , B ) is used at each step, with orthogonality imposed after each new basis vector has been selected, via Gram-Schmidt (as by Result S3, applying Gram-Schmidt does not affect R W (·, z)).

Idealised example continued
We now apply the optimal rotation algorithm to the example of Section 3.
We set v = (0.4, 0.1, 0.1), v tot = 0.95, and B as the SVD basis, with the projection of (4) used for consistency with Section 3.1, to show that rotation fixes the described problems. One iteration of the algorithm finds a basis that satisfies R W (Γ * q , z) < T , with q = 5 (i.e. we need the first 4 vectors of the residual basis so that Γ * q explains at least 95% of F µ ).
The reconstruction of z with this basis, and associated VarMSE plot, are shown in Figure 4. Now, our basis allows us to accurately represent z with the leading vectors, as the important patterns from low-order eigenvectors have been combined with the leading patterns (hence an additional vector Figure 4: The reconstruction of z using the truncated basis Γ * q , and the VarMSE plot for this basis, with the truncated SVD basis given by the red and blue dotted lines. being required to explain more than 95% of F µ ).
Performing history matching as before, and using the reconstructions of the original fields rather than the coefficients, we find that 31.5% of Θ is now in NROY space ( Figure S5). Performing our previous check on the accuracy of the match, we find that no runs consistent with z have been ruled out.
As we are no longer in the terminal case, we perform probabilistic calibration on the field. The posterior densities found by calibrating on Γ * q are shown in Figure S3, with the average simulator output given by samples from this posterior in the first plot of Figure 5. While the samples here are not consistent with z, as the off-diagonal is too strong, we have been able to identify runs where there is signal on the main diagonal. This is because the rotated basis allows for this direction of the output space to be searched.
The limited signal in the important directions from F has been extracted and used to guide calibration.
We continue the calibration by running a new design within NROY space.
This new design should contain more signal in the direction of z, and hence it should be possible to find a rotation that reduces R W (·, z) further than at the previous wave. We select 60 points from the wave 1 NROY space and run f (·) at these points to give the wave 2 ensemble. We perform a rotation, and emulate and calibrate using the wave 2 ensemble. History matching reduces NROY space to 3.1% of Θ ( Figure S7). If we instead perform probabilistic calibration, with zero density assigned to regions outside of the wave 1 NROY space, we find the average output field in the 2nd plot of Figure 5 (posteriors in Figure S6).
These results represent a large improvement over performing only one wave. We have ruled out the majority of Θ, allowing future runs to be focussed in this region. Probabilistic calibration is more accurate, with samples containing a strong diagonal, as with z.
Repeating the process, our wave 3 ensemble contains patterns more consistent with z than in previous waves, and hence the truncated SVD basis does not rule out the reconstruction of z, and no rotation is required. Following emulation for this basis, history matching leads to an NROY space consisting of 2% of Θ ( Figure S8). Probabilistic calibration (in the wave 2 NROY space) gives the average output in Figure 5 (posteriors in Figure S6), showing that our samples are now consistent with z.

Application to tuning climate models
In this section, we will demonstrate that optimal rotation is an important and necessary tool if attempting to perform UQ for climate model tuning.
However, as we will discuss, climate model tuning is not a solved problem, and it would be of limited value to simply show how calibration with our Figure 5: The mean output f (θ) for samples of θ from the probabilistic calibration posterior, for the wave 1 rotated basis, the wave 2 rotated basis and the wave 3 SVD basis.
method can lead to an improvement in one output field over the standard methods, without necessarily improving the whole model or addressing the concerns of the community. We will motivate our discussion using the current Canadian atmosphere model, CanAM4.
CanAM4 is an Atmospheric Global Climate Model, which integrates the primitive equations on a rotating sphere employing a spherical-harmonic spatial discretization truncated triangularly at total wavenumber 63 (T63), with 37 vertical levels spanning the troposphere and stratosphere (von Salzen et al., 2013). CanAM4 has a large number of adjustable, 'free', parameters of which 13 will be varied here. The climatological influence of each set of free parameters is determined from 5-year 'present-day' integrations with prescribed sea-surface temperatures and sea-ice. Model output is represented on the 'linear' 128×64 Gaussian grid corresponding to the model's T63 spectral resolution.
There are many output fields that must be checked for consistency with the observed climate when tuning the parameters of a climate model (in the case of CanAM4 there are more than 20). Here we focus on just 3 2D fields: TA is the temperature averaged over longitude for each latitude and vertical pressure level so that = 2368. There is also a temporal aspect to the output, with monthly values for every grid box; however, we remove this here by averaging over 5 years of June, July, August (JJA) output.
When evaluating and tuning the model, spatial anomaly plots are routinely examined to see how the model compares with observations. An anomaly plot shows the difference between the model and the observations with a blue-white-red colour scale set such that blue is 'too negative', white is 'alright' and red is 'too positive'. So, for example, in a temperature anomaly plot, red areas show where the model is too warm (for the modellers) compared to observations. Figure 6 shows anomaly plots for CLTO, Understanding what these structural errors are so that they might be addressed either as part of this phase of development or for the next is one of the major goals of tuning . However, joint estimation of model discrepancy variances and model parameters is not possible without strong prior information (Brynjarsdóttir and O'Hagan, 2014) due to lack of identifiability.
When working with CanAM4 then, our goal is to use history matching with a 'tolerance to error' discrepancy variance ) that aims to reduce the size of NROY space, so that, ultimately, in a calibration exercise we have strong prior information about θ * and some structured information on discrepancy. A formal methodology for achieving this is beyond the scope of this paper. However, we will demonstrate that optimal rotation is a crucial component for any attempt of this nature.
We designed 62 runs of CanAM4, varying 13 parameters and using a k-extended Latin Hypercube (Williamson, 2015). Figure 7 shows VarMSE plots for the output fields CLTO, RTMT and TA for this ensemble. The weight matrix W used for the reconstructions represents our tolerance to model error (we discussed its correspondence to model discrepancy (W = Σ η + Σ e ) in Section 4), and the red lines in these plots represent 2 alternatives based on interpreting the colour scales pertaining to the white regions in Figure 6. We assume that the white region represents 3 standard deviations (solid red line) and 1 standard deviation (dashed red line), and set a diagonal W accordingly. The solid red lines on each plot indicate that we have a terminal case analysis under the small model discrepancy.
The larger discrepancy indicates a terminal case in CLTO and RTMT, and that 35 basis vectors would be enough to adequately reconstruct TA.
However, the blue line in the TA plot shows that there is so little ensemble signal on the basis vector coefficients after arguably 20 (or fewer) basis vectors, that calibration on 35 basis vectors is not possible. If discrepancy were increased (an operation that involves scaling the red line until it lies below the bound T represented by the dashed horizontal line in the plots), all 3 panels demonstrate that the reconstruction error under SVD decreases too slowly, so that a large number of basis vectors, each with coefficients that are increasingly difficult to emulate due to the decreasing ensemble signal, would be required to avoid a terminal case analysis.
Suppose model discrepancy Σ η >> Σ e so that we can consider W = Σ η in the following. In order to use optimal rotation, we require W such that R W (B, z) < T , which is not true under our specification above for RTMT and CLTO. If we really believed our Σ η represented the climate model's ability to reproduce observed climate, then this indicates that we need a larger ensemble in order to explore the model's variability. In that case, it may be desirable to follow a procedure like the one we present here to design these runs.
In our case, we believe it is clear that we have misspecified model discrepancy. In fact, we used a place-holder tolerance to error, so this analysis indicates that we are not tolerant enough to model error (at this stage). To explore model discrepancy, we first perform a rotation under the W given above, using the algorithm without step 1 in order to find R W (Γ * q , z) as close to the reconstruction error of the untruncated SVD basis as possible, for small q and whilst retaining emulatability by setting v = (0.35, 0.1, 0.1) (as 3 rotated vectors is enough). Given this rotation, we then set where j < 0.995 is a tuning parameter. This ensures that when reconstructed with the new basis, the observations will not be ruled out, and hence we can identify an NROY space likely to contain runs as consistent with z as possible, given the limited information we have with 62 ensemble members. This has the effect of 'scaling' the reconstruction error for the rotated basis seen in Figure 8 below the horizontal dotted line at the point the basis is truncated. For our fields, we set j = 0.95 for RTMT, and j = 0.68 for the others (as j = 0.95 ruled out all of Θ for CLTO and TA).
We define NROY space as runs where θ is not ruled out using (3) for each of TA, CLTO and RTMT. This NROY space consists of 0.9% of Θ. We then design and run a new 50-member ensemble within this NROY space (discussed in Salter (2017), Section 6.3.5).
Upon inspection of the TA field for this wave 2 ensemble, we observe that every run contains the previously found strong warm bias in the Southern Hemisphere ( Figure S9). As our optimal basis choice permitted the search for runs not containing this structural bias, these results are evidence that this may be a structural error. In practice, how much evidence is required before the modellers are convinced that a particular bias is structured or not is a climate modelling decision. Certainly, we could repeat our wave 1 procedure within the current NROY space and run a wave 3 and so on. This has the benefit of increasing the density of points in Θ and the accuracy of emulators in key regions of Θ, thus insuring against possible 'spikes' in the model input space that would correct the bias.
Assuming our modellers were convinced to treat this feature as a structural bias, we demonstrate an approach to include this information within the iterative calibration procedure. We first revisit the specification of the TA discrepancy, selecting the region with the common warm bias shown in Figure S10, deemed to be a structural error, and increasing Σ η for the grid boxes in this region. To do this, we set W as a diagonal matrix with 100 for the grid boxes in this region, and 1s elsewhere on the diagonal (note this is one possible choice. We might, instead, increase the correlation between these gridboxes in W in addition).
We re-define the wave 1 NROY space so that it only depends on CLTO and RTMT (consisting of 41.4% of Θ), and then include NROY wave 1 runs with the wave 2 ensemble when rotating and building emulators for wave 2.
For TA, the optimal rotation algorithm is applied using the newly-designed W, with the discrepancy Σ η defined via (14), to ensure that z is not ruled out (W reflected our beliefs about the structure, not the magnitude, of Σ η ). History matching using the wave 2 bases and emulators leads to an NROY space containing 0.03% of Θ. Plots illustrating this NROY space for six of the more active parameters are shown in Figure 9. We see that the regions with the greatest density of points in NROY space are generally found towards the edges of the parameter ranges. From the lower left plots, it is easier to identify relationships between some of the parameters, e.g.
CBMF generally needs to be high while UICEFAC needs to be towards the centre of its allowable range.
The calibration of climate models, or even simply the search for structural biases, is a massive undertaking, and a full tuning is well beyond the UQ can help with the tuning procedure in providing tools (emulators) that allow Θ to be explored much more quickly than is currently possible at the modelling centres. However, as this application has demonstrated, using the off-the-shelf methods based on the SVD basis does not work for tuning in general. It did not work for the 3 fields we showed here, nor have the authors ever found climate model output for which the problems we identified here were not present. Our application demonstrates the optimal rotation algorithm as an effective solution to quickly find bases without these issues for calibration.

Discussion
In this paper, we highlighted the issue of terminal case analyses for the calibration of computer models. A terminal case analysis occurs when the prior assessment of model discrepancy is inconsistent with the computer simulator's ability to mimic reality, and leads either to useless and incorrect posterior distributions (using the fully Bayesian procedure) or ruling out all of parameter space (using history matching). We showed that even when the prior assessment of model discrepancy is not inconsistent with the ability of the simulator, dimension reduction of spatial output using the ensemblederived principal components (the SVD basis) often leads to a terminal case analysis.
We proposed a rotation of the SVD basis to highlight and incorporate important low-signal patterns that may be contained in the original SVD basis, giving a new basis that avoids the terminal case when this is possible.
We then presented an efficient algorithm for optimal rotation, guaranteeing to avoid the terminal case when the model discrepancy allows, whilst ensuring enough signal on leading basis vectors to permit the fitting of emulators.
We proved that our algorithm gives a valid rotation of the original basis, and established a fast test to see whether a given ensemble of model runs and discrepancy specification automatically leads to a terminal case analysis prior to rotation. Our methods are presented for models with spatial output, however, if basis methods were to be used for more general high dimensional output (e.g. spatio-temporal), the optimal rotation approach would not change if, for example, PCA were taken over the entire spatio-temporal output for the design, as in Higdon et al. (2008).
We demonstrated the efficacy of our method using an idealised applica- Supplemental Material: Uncertainty quantification for computer models with spatial output using calibration-optimal bases S1 Idealised example The spatial idealised example f (θ), introduced in Section 3.1, gives output over a 10 × 10 field, and has 6 input parameters defined on [−1, 1]: for π N (x 4 , 0.2, 0.1 2 ) the density function of the Normal distribution with mean 0.2 and variance 0.1 2 , and where Ψ 10×10 (0, 0.05 2 ) gives a sample from a Normal distribution with mean zero and variance 0.05 2 , at each location in the 10 × 10 grid. Figure S1 shows (ϕ 1 , . . . , ϕ 8 ), with ϕ 1 giving the pattern most consistent with the observations, and ϕ 2 the basis vector that dominates the ensemble. After evaluating f (θ), the output is vectorised so that = 100.
We define θ * as θ * = (0.7, 0.01, 0.01, 0.25, 0.8, −0.9) with the observed field, z, given by adding a sample from N(0, Σ e ) to f (θ * ), to represent observation error. We define Σ e using the squared exponen-tial correlation function over the 10 × 10 grid, with the spatial coordinates denoted by s i = (s i1 , s i2 ) for i = 1, . . . , 100. The i, j th entry of 100 × 100 matrix Σ e is therefore We model the discrepancy as Figure S1: The 8 orthogonal basis vectors used in the definition of the idealised function, with ϕ 1 shown in the top left, ϕ 2 to the right of this, etc.
with the (i, j) th value of Σ η given by for variances v i , v j , and a correlation function C(·, ·) between locations s i and s j . For C(·, ·), we again use the squared exponential correlation function, with the same correlation lengths as for Σ e . We define v i via where the set S contains the grid boxes on the main diagonal, as we are more interested in finding fields with output consistent with the observations in this region of the output. Figure S2 shows the true NROY space given Σ e and Σ η .

S1.1 Probabilistic calibration for the SVD basis
We performed probabilistic calibration following the Kennedy O'Hagan method, with fixed discrepancy as outlined in the previous section, uniform priors on the calibration parameters, and emulators as described in the main text.
The dotted lines on Figure S3 show the resulting densities when the truncated SVD basis, Γ 4 , is used to calibrate probabilistically on the field, for the input parameters x 1 , . . . , x 5 , and the ratio r = x 5 /(1.3 + x 6 ). There are peaks of density away from the true values (θ * , as shown by the red vertical lines), particularly for x 3 and r. For x 4 , the parameter that controls the strength of the main diagonal, the posterior density is relatively flat across the entire range of x 4 .
We sample from these posteriors and run the idealised function at these samples, to evaluate whether calibration with the truncated SVD basis has highlighted a region of parameter space that is 'close' to z. 16 samples are shown in Figure S4, demonstrating that the results suggest it is not possible to remove the off-diagonal pattern that was dominant in the ensemble.

S1.2 Calibration with the rotated bases
The solid lines in Figure S3 show the posterior distributions for x 1 , . . . , x 5 and r when the wave 1 rotated basis is used for probabilistic calibration, showing improvements (compared to the SVD basis) for x 4 and r.
At wave 2, there are peaks of density at or near to the true parameter values for all but x 5 and r, as shown by the solid lines in Figure S6. The dotted lines in this plot show the wave 3 posteriors. Although the peaks for x 2 , x 3 and x 4 are not as large, this wave offers an improvement for r (important for the strength of the main diagonal) and x 1 (as this parameter has no effect on the main or off-diagonal, a flat posterior is more accurate).
The averaged posterior samples in Figure 5 show that the posterior distributions at each wave have identified better regions of parameter space than previously, with the wave 3 samples being consistent with the observations. Figure S2: Density plot of the true NROY space (i.e. no emulation involved), for each pair of parameters. For each cell in a particular pairwise plot, we average across the remaining 4 parameters, and plot the proportion of these runs that are in NROY space. The axes are reversed for the lower left plots, with the colour scale set individually for each plot. The black point corresponds to θ * . Figure S3: The posterior distributions for x 1 , . . . , x 5 and r, when we calibrate probabilistically on the field with the SVD basis derived from the wave 1 ensemble, Γ 4 (dotted lines), and the wave 1 rotated basis (solid lines). The red vertical lines show the location of θ * . Figure S4: f (θ) at 16 samples of θ from the calibration posterior distribution, when we emulate and calibrate with the truncated SVD basis Γ 4 . Figure S5: Density plot for the wave 1 NROY space given by history matching using the rotated basis, for each pair of parameters. As in Figure S2, for each plot we average over the remaining parameters and plot the proportion in NROY space for each pair. The axes are reversed for the lower left plots, and the colour scales set differently. The black point corresponds to θ * . Figure S6: The posterior distributions for x 1 , . . . , x 5 and r, at wave 2 (solid lines) and wave 3 (dotted lines), with the red lines equal to θ * . Figure S7: Density plot for the wave 2 NROY space for each pair of parameters. Regions coloured grey indicate that there are no parameter settings in NROY space here, hence we see that we have significantly reduced space, compared to Figure S5.
in the W norm, where r is given by coefficients c = (c 1 , . . . , c n ) T with The vector c that minimises the reconstruction error with respect to the W norm is found by first writing (S4) in terms of c: A scalar can be differentiated by a vector θ, with symmetric matrix A, via Differentiating the reconstruction error with respect to c, with symmetric Setting equal to zero, and solving for c, we have With W = I , this is the usual projection equation:

Rotational invariance
Result 1 (Invariance of R W (·, ·) to rotation). For a rotation matrix Λ of dimension k × k, and a set of basis vectors B = (b 1 , . . . , b n ), we have Proof. We can rewrite the reconstruction error of the rotated basis, B k Λ, where in the second line, the inverse identity has been applied a second time with C = Λ T and D = B T k W −1 B k , giving the final result.

S2.1 Proof of Theorem 1
Before proving Theorem 1, we first prove the following results: Result S2 (Orthogonality of the residual basis). The residual basis, B , calculated from F µ and B p , is orthogonal to B p (with respect to the W norm).
Proof. First, we show the orthogonality (in W) of the columns of the residual ensemble, F , and the columns of B p : This zero matrix has dimension p × n, i.e. the basis vectors in B p are orthogonal with the vectors of F , with respect to the W norm. Using this, we obtain the result by considering the (generalised) singular value decomposition of F T : where U is an orthonormal n × n matrix, Σ is a diagonal n × n matrix, and V = B is an × n matrix with B T W −1 B = I n . From here, we have that where we have multiplied on the left by B T p W −1 . From (S5), we have The final p + 1 eigenvalues on the diagonal of Σ are zero (the p vectors in B p , and the ensemble mean, remove p + 1 degrees of freedom). Therefore, we are only interested in the leading n − p − 1 columns of B , as all of the variability in F µ has already been explained. By discarding the columns associated with zero eigenvalues, we have that because Σ is diagonal, and hence we have the result.
Result S3. When B is a basis for F µ , we can write B k = BΛ k for square Λ k , i.e. the residual basis at iteration k of the algorithm contains linear combinations of the vectors of B, and hence each vector selected by the algorithm is a linear combination of B.
Theorem 1. Γ * in step 3 of the optimal rotation algorithm is an orthogonal rotation of B.
We show that Λ T Λ = I n , i.e. Λ is a rotation matrix. We have The upper-left k × k block can be written as Similarly, by Result S2. Finally, and hence from (S8) we have Λ T Λ = I n , and Λ is a rotation matrix.

S2.2 Gram-Schmidt invariance
Gram-Schmidt orthonormalisation imposes orthonormality on basis vectors B = (b 1 , . . . , b n ) (Björck, 1967). It can be written in terms of matrices (Björck, 1994): where Γ is an l×n basis containing normalised, orthogonal vectors γ 1 , . . . , γ n , and R is an n × n upper-triangular matrix. Therefore, the j th new basis vector is a linear combination of the first j basis vectors of B.
Result S4 (Gram-Schmidt invariance). The reconstruction given by the first q vectors of the original basis is equal to the reconstruction given by the first q vectors of the orthogonal basis: R W (B q , z) = R W (Γ q , z), q = 1, . . . , n Proof. Using Γ T W −1 Γ = I n (i.e. we've imposed orthonormality with re-spect to the W norm), we have As R is invertible, and applying the identity (CD) −1 = D −1 C −1 for square matrices C and D, we have i.e. the reconstruction of z is the same with both B and Γ. The proof proceeds analogously for any truncation of these bases.
Alternatively, this result could be shown by proving that both B and Γ span the same subspace, and using that a basis gives unique representations of general fields (Kuttler, 2012). Figure S9: The TA anomaly for the standard run (left), and for run 005 of the new ensemble, the 'best' run in the wave 2 ensemble, in terms of minimising the root mean squared error. The large warm bias from the standard run has not been removed. Figure S10: The grid boxes where we deem there to be a potential structural error for TA, and hence where we increase the discrepancy, as described in Section 5.