A weak-constraint 4DEnsembleVar. Part I: formulation and simple model experiments

ABSTRACT 4DEnsembleVar is a hybrid data assimilation method whose purpose is not only to use ensemble flow-dependent covariance information in a variational setting, but also to avoid altogether the computation of tangent linear and adjoint models. This formulation has been explored in the context of perfect models. In this setting, all information from observations has to be brought back to the start of the assimilation window using the space-time covariances of the ensemble. In large models, localisation of these covariances is essential, but the standard time-independent localisation leads to serious problems when advection is strong. This is because observation information is advected out of the localisation area, having no influence on the update. This is part I of a two-part paper in which we develop a weak-constraint formulation in which updates are allowed at observational times. This partially alleviates the time-localisation problem. Furthermore, we provide, for the first time, a detailed description of strong- and weak-constraint 4DEnVar, including implementation details for the incremental form. The merits of our new weak-constraint formulation are illustrated using the Korteweg-de Vries equation (propagation of a soliton). The second part of this paper deals with experiments in larger and more complicated models, namely the Lorenz (1996) model and a shallow water equations model with simulated convection.


Introduction
The 4-dimensional ensemble-variational data assimilation (DA) scheme, 4DEnVar, is a hybrid DA method currently used (and still being researched) in Numerical Weather Prediction (NWP), and it is at the forefront of next-generation DA methods. As with other hybrid methods, the basic motivation behind 4DEnVar is to take the flow-dependent background error covariance matrix from sequential methods based on the Kalman Filter (KF), such as the Ensemble Transform Kalman Filter (ETKF; Bishop et al., 2001; Wang et al., 2004) and the Local Ensemble Transform Kalman Filter (LETKF; Hunt et al., 2007), and apply it in the 4-dimensional variational framework (4DVar) first proposed by Le Dimet and Talagrand (1986) and studied in Talagrand and Courtier (1987). What makes 4DEnVar different from all other hybrids, however, is that it removes the need to compute tangent linear models (TLMs) and adjoint models (AMs).
The use of ensemble information in a variational framework was proposed by Lorenc (2003) and Zupanski (2005). Since then, several formulations have been proposed, experimented on, and used by operational centres like the UK Met Office and the Canadian Meteorological Centre (Environment Canada). The idea of these hybrid methods is to overcome certain restrictions of the basic 4DVar, for instance the use of a static background error covariance matrix, which is usually a climatological error covariance matrix B_c. Although it is possible to generate slowly-varying B_c's (e.g. different covariances for different seasons), the fast variations (e.g. those within the time-frame of an assimilation window) are not captured. Hence, climatological or slowly-varying background matrices do not describe the flow-dependent errors of the day. Sequential ensemble-based methods, e.g. the LETKF, can describe these features. Nonetheless, the sample covariances obtained in these ensemble methods, which we denote P_b, contain sampling errors and are seldom full rank, a problem climatological covariances do not have. It seems logical, then, to find ways to combine the advantages of both families of methods. Clayton et al. (2013) developed a hybrid 4DVar which combines the climatological background error covariance matrix with the flow-dependent background error covariance matrix generated by the LETKF. Fairbairn et al. (2014) and Goodliff et al. (2015) then showed that a fully flow-dependent background error covariance matrix outperforms a climatological or hybrid covariance in the 4DVar framework (4DVar-Ben). The former used model II of Lorenz (2005) with 240 variables, while the latter used the 3-variable Lorenz (1963) model with increasing non-linearity.

*Corresponding author. e-mail: j.AmezcuaEspinosa@reading.ac.uk
A non-trivial requirement for the implementation of 4DVar is generating TLMs and AMs for both the evolution and observation operators. Liu et al. (2008) developed the 4DEnVar technique, which removes this need. After a series of matrix transformations, the role of these matrices can be taken over by four-dimensional space-time covariance matrices. In fact, the idea that an ensemble can be used to generate space-time covariances within an assimilation window was first proposed in the context of the Ensemble Kalman Smoother of Van Leeuwen and Evensen (1996) and Evensen and Van Leeuwen (2000).
A recent study by Lorenc et al. (2015) showed that 4DEnVar has difficulty matching, in larger systems, the performance of 4DVar with hybrid background covariances. The authors point to the fact that evolution and localisation of the covariance matrix do not commute: evolving a localised covariance matrix is not the same as evolving the covariance first and then localising it.
So far, work on 4DEnVar has been done in a perfect-model scenario, i.e. in a strong-constraint (SC) setting. In this work, we introduce a weak-constraint (WC) 4DEnVar. As explored by Tremolet (2006), the choice of control variables in WC4DVar is not unique. For reasons that will be thoroughly discussed in the paper, we use what we label an 'effective model error formulation', which stems from Tremolet's (2006) constant-bias formulation.
This work is presented in two parts. Part I describes the 4DEnVar formulation in detail. We discuss its advantages over traditional DA methods, as well as its shortcomings. We pay special attention to the problem of localising time cross-covariances using static localisation matrices. We illustrate this problem by experimenting with the Korteweg-de Vries (KdV) equation for the evolution of a soliton (see e.g. Zakharov and Faddeev, 1971). In part I we also introduce our WC4DEnVar and discuss a proper localisation implementation for this method. We show that performing updates at times other than the initial one (as one does in the SC case) partially alleviates the impact of incorrectly localised time cross-covariances. Simple DA experiments are performed in this model.
In part II we move into larger models. We start with a detailed exploration of parameters (e.g. observation frequency, observation density in space, localisation radii, etc.) using the well-known Lorenz (1996) chaotic model. Then we use a modified shallow water equations (SWE) model with simulated convection (Wursch and Craig, 2014), which allows us to test our method in a more realistic setting. This is a larger, more non-linear model and serves as a good test bed for convective data assimilation methods.
The layout of this paper is as follows. Sections 2 and 3 describe in detail the methodology of the 4DEnVar framework. Section 2 covers the SC case, and in Section 3 we introduce our WC4DEnVar formulation. In Section 4 we introduce the KdV model and use it to illustrate the time evolution of the background error covariance matrix, as well as the problems that come from static localisation of cross-time covariance matrices. In Section 5 we perform some brief DA experiments. Section 6 summarises the work of this part.

Strong-constraint 4DVar
In this section we formulate the SC4DVar method, which forms the basis of any subsequent approximation. Consider the discrete-time evolution of a dynamical system determined by the following equation:

x_t = m_{0→t}(x_0),  (1)

where x ∈ ℝ^{N_x} is the vector of state variables, and m_{0→t} is a map ℝ^{N_x} → ℝ^{N_x} which evolves the state variables from time 0 to time t. For the moment, we consider this model to be perfect. The initial condition x_0 is not known with certainty, and can be considered a random variable (rv). In particular, we consider it to be a Gaussian rv, i.e. x_0 ∼ N(x_{0,b}, B), where x_{0,b} is a background value or first guess (usually coming from a previously generated forecast), and B ∈ ℝ^{N_x×N_x} represents the background error covariance matrix, which is full rank and positive definite. One often replaces this matrix with a climatological background error covariance B_c. There are several ways to obtain B_c (e.g. Parrish and Derber, 1981; Yang et al., 2006); see e.g. Bannister (2008) for a review. The system is observed from time to time. For a given observational time, the observation equation is:

y_{o,t} = h(x_t) + η_t,  (2)

where y_o ∈ ℝ^{N_y} is the vector of observations and h: ℝ^{N_x} → ℝ^{N_y} is the observation operator. For simplicity we consider this operator constant through time. We use the term 'observation period' to denote the number of model time steps between consecutive observation vectors. The observations contain errors, which are represented by the rv η_t ∈ ℝ^{N_y}. We consider this error to be Gaussian and unbiased: η_t ∼ N(0, R), where R ∈ ℝ^{N_y×N_y} is the observation error covariance matrix. Let us consider a given time span from t = 0 to t = τ, which we label the 'assimilation window'. By incorporating information from the background at t = 0 and the observations within this window, we can find a better estimate (analysis) for the value of the state variables at every time step x_0, x_1, ..., x_τ.
Since the model is perfect (SC4DVar), this reduces to finding the minimiser of a cost function that only depends on the initial condition x_0:

J(x_0) = (1/2)(x_0 − x_{0,b})^T B^{−1} (x_0 − x_{0,b}) + (1/2) Σ_{t=0}^{τ} ρ_t [y_{o,t} − h(m_{0→t}(x_0))]^T R^{−1} [y_{o,t} − h(m_{0→t}(x_0))],  (3)

where the first term corresponds to the contribution from the background and the second term corresponds to the contributions from the observations. To avoid an extra index to distinguish between time instants with and without observations, we introduce the indicator function ρ_t = 1 if t is a time instant with observations, and ρ_t = 0 if t is a time instant with no observations. The complexity of minimising (3) depends on the particular characteristics of the operators h and m. For general non-linear operators, J(x_0) is not convex and may have multiple minima, leading to a difficult minimisation process. Another issue comes from the condition number of B: if this is large (i.e. the largest and smallest eigenvalues of B are separated by orders of magnitude), numerical minimisation methods can require many iterations to converge. To alleviate these problems one can use a preconditioned formulation, which reduces the ellipticity of the cost function. Moreover, an incremental formulation can be used to perform a quadratic approximation of the original cost function, leading to a problem with a unique minimum. This linearisation process can be iterated several times; these iterations are the so-called outer loops.
To write down the preconditioned incremental problem, let us express the initial condition as an increment δx_0 ∈ ℝ^{N_x} from the background value:

x_0 = x_{0,b} + δx_0,  (4)

and then precondition the increment:

δx_0 = B^{1/2} v_0,  (5)

where v_0 ∈ ℝ^{N_x} is the new control variable. Since B is full rank and symmetric, one can find the (unique) symmetric square root B^{1/2} ∈ ℝ^{N_x×N_x} without further complications. Then, one can approximate (3) by performing the following first-order Taylor expansion:

y_{o,t} − h(m_{0→t}(x_0)) ≈ d_t − H_t M_{0→t} B^{1/2} v_0,  (6)

where

d_t = y_{o,t} − h(m_{0→t}(x_{0,b}))  (7)

is the departure of the observations with respect to the evolution of the background guess throughout the assimilation window. To compute these departures one uses the full (generally non-linear) evolution and observation operators. Notice that two matrices appear in the expansion: H_t, the TL observation operator, and M_{0→t}, the TLM of the discrete-time evolution operator, which is also known as the transition matrix from time 0 to time t. Both Jacobian matrices are evaluated with respect to the reference trajectory, i.e. the evolution of x_{0,b}. Then, the preconditioned incremental form of SC4DVar can be written as:

J(v_0) = (1/2) v_0^T v_0 + (1/2) Σ_{t=0}^{τ} ρ_t (d_t − H_t M_{0→t} B^{1/2} v_0)^T R^{−1} (d_t − H_t M_{0→t} B^{1/2} v_0).  (8)

This is a quadratic function which only depends on v_0. In order to find its global minimum it suffices to find the gradient and set it equal to zero. The gradient of (8) is:

∇_{v_0} J = v_0 − Σ_{t=0}^{τ} ρ_t (H_t M_{0→t} B^{1/2})^T R^{−1} (d_t − H_t M_{0→t} B^{1/2} v_0).  (9)
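The structure of the incremental cost function (8) and its gradient (9) can be sketched for a toy linear model. The following is an illustrative sketch only: all matrices, dimensions, and the function name cost_and_grad are invented for the example, and do not come from the paper.

```python
import numpy as np

# Toy sketch of the preconditioned incremental SC4DVar cost and gradient.
# All matrices here are illustrative stand-ins for the TLM and observation
# Jacobians of a real model.
rng = np.random.default_rng(0)
Nx, Ny = 8, 4

B = 0.5 * np.eye(Nx)
B_half = np.linalg.cholesky(B)                 # any square root of B works
H = np.eye(Ny, Nx)                             # observe the first Ny variables
M = np.eye(Nx) + 0.05 * rng.standard_normal((Nx, Nx))  # one-step linear model
Rinv = np.eye(Ny) / 0.1

# Departures d_t = y_{o,t} - h(m_{0->t}(x_{0,b})), precomputed with the full
# operators; here just random placeholders at the observation times.
obs_times = [2, 5]
d = {t: rng.standard_normal(Ny) for t in obs_times}

def transition(t):
    """Transition matrix M_{0->t} of the linear toy model."""
    return np.linalg.matrix_power(M, t)

def cost_and_grad(v0):
    """Evaluate J(v0) and its gradient, eqs. (8)-(9)."""
    J = 0.5 * v0 @ v0
    g = v0.copy()
    for t in obs_times:
        G = H @ transition(t) @ B_half         # H_t M_{0->t} B^{1/2}
        r = d[t] - G @ v0                      # innovation residual
        J += 0.5 * r @ Rinv @ r
        g -= G.T @ Rinv @ r
    return J, g
```

Because the cost function is quadratic, setting the gradient to zero gives a linear system for the minimiser, which is the basis of the conjugate-gradient-type solvers used in practice.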

Strong-constraint 4DEnsembleVar
Writing TLMs and AMs for large and complex models is a complicated process. Moreover, the calculations involving both TLMs and AMs are computationally expensive. To alleviate these problems, 4DEnVar was introduced. Once more, let us consider an assimilation window from t = 0 to t = τ, with several observational times equally distributed throughout the window. Consider that at time t we have an ensemble of N_e forecasts. This ensemble can be initialised at time t = 0 with the analysis coming from an LETKF running in parallel to the 4DEnVar system. At any time we can express the background (or forecast) ensemble as the matrix:

X_{b,t} = [x_{b,t}^{(1)}, x_{b,t}^{(2)}, ..., x_{b,t}^{(N_e)}] ∈ ℝ^{N_x×N_e},  (10)

from which an ensemble mean can be calculated at any time as:

x̄_{b,t} = (1/N_e) X_{b,t} 1,  (11)

where 1 ∈ ℝ^{N_e} is a vector of ones. Finally, a normalised ensemble of perturbations can be computed as:

X̂_{b,t} = (N_e − 1)^{−1/2} (X_{b,t} − x̄_{b,t} 1^T).  (12)

In the EnVar framework we replace the background error covariance B at the beginning of an assimilation window by the sample estimator:

P_{b,0} = X̂_{b,0} (X̂_{b,0})^T.  (13)

In applications (e.g. NWP) it is often the case that N_e ≪ N_x, so P_{b,0} is a low-rank non-negative definite matrix, which has consequences we discuss later. A straightforward implementation of 4DEnVar is to equate B^{1/2} = X̂_{b,0}. Then, we can make the following substitution:

H_t M_{0→t} B^{1/2} → Ŷ_{b,t}.  (14)

An advantage of using ensemble estimators is that we can directly construct the ensemble of perturbations Ŷ_{b,t} ∈ ℝ^{N_y×N_e} using the full non-linear operators and the ensemble X_{b,t} in the way described, e.g., in Hunt et al. (2007). Briefly, one can write:

Y_{b,t} = [h(m_{0→t}(x_{b,0}^{(1)})), ..., h(m_{0→t}(x_{b,0}^{(N_e)}))],  (15)

and it follows that:

Ŷ_{b,t} = (N_e − 1)^{−1/2} (Y_{b,t} − ȳ_{b,t} 1^T),  (16)

and the gradient (9) can be written simply as:

∇_{v_0} J = v_0 − Σ_{t=0}^{τ} ρ_t (Ŷ_{b,t})^T R^{−1} (d_t − Ŷ_{b,t} v_0).  (17)

Note that up to this moment we have not applied localisation; hence the size of the control variable we solve for is v_0 ∈ ℝ^{N_e}. A graphic summary of the SC4DEnVar process is depicted in Fig. 1. This simple schematic depicts the three-stage process of SC4DEnVar. These are:

i. First, an ensemble of 'free' (DA-less) trajectories of the model is run for the length of the assimilation window.
In NWP applications these trajectories are available since meteorological centres often have an ensemble forecast system.
ii. Second, the 4DVar minimisation process is performed using 4-dimensional cross-time covariances instead of TLMs and AMs. These cross-time covariances are computed from the family of free model runs of step i.
iii. Finally, an LETKF (or any other ensemble KF) is run throughout the assimilation window to create new initial conditions for the next window. The mean of this LETKF is replaced by the solution of 4DEnVar, considered to be more accurate, i.e. the ensemble keeps its perturbations but is re-centred. In this implementation, two assimilation systems need to be operated and maintained. There are other hybrid systems, such as ECMWF's ensemble DA system (Bonavita et al., 2012), which are self-sufficient, but they require the use of TLMs and AMs.
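The ensemble quantities used above (the matrix of members, the mean, the normalised perturbations, and the equivalent-observation perturbations) can be sketched in a few lines. This is a minimal illustration with invented dimensions and an invented non-linear observation operator, not the paper's implementation:

```python
import numpy as np

# Minimal sketch of the ensemble quantities used in SC4DEnVar.
rng = np.random.default_rng(1)
Nx, Ne = 10, 5

Xb = rng.standard_normal((Nx, Ne))              # background ensemble at one time

xbar = Xb.mean(axis=1, keepdims=True)           # ensemble mean
Xhat = (Xb - xbar) / np.sqrt(Ne - 1)            # normalised perturbations
Pb = Xhat @ Xhat.T                              # sample covariance estimator

# Equivalent-observation perturbations, built by applying the full non-linear
# observation operator to each member (h here is purely illustrative).
def h(x):
    return x[:4] ** 2

Yb = np.column_stack([h(Xb[:, i]) for i in range(Ne)])
Yhat = (Yb - Yb.mean(axis=1, keepdims=True)) / np.sqrt(Ne - 1)
```

Note that Pb has rank at most N_e − 1, which is the low-rank issue discussed in the text.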
2.2.1. Applying localisation. The quality of P_{b,t} depends on the size of the ensemble. In particular, the quality of this estimator is quite poor when N_e ≪ N_x, or at least when N_e is smaller than the number of positive Lyapunov exponents of the system. A common practice is to eliminate spurious correlations by tapering the covariance matrix in the following manner (Whitaker and Hamill, 2002):

P̃_{b,t} = L_xx ∘ P_{b,t},  (18)

where ∘ denotes the Schur (element-wise) product, and L_xx ∈ ℝ^{N_x×N_x} is the so-called localisation matrix. The element {L_xx}_{ij} is a decreasing function of the physical distance between grid points i and j, usually chosen as the Gaspari and Cohn (1999) compact-support approximation to a Gaussian function.

Fig. 1. Schematic depicting the three-stage process of 4DEnVar. In part i an ensemble of 'free' (DA-less) trajectories of the model is run for the length of the assimilation window. In part ii the 4DVar minimisation process is performed using 4-dimensional cross-time covariances instead of tangent linear and adjoint models. In part iii an LETKF is run to the end of the assimilation window to create new initial conditions for the next window. The mean of this LETKF is replaced by the solution of 4DEnVar, considered to be more accurate.
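Schur-product localisation with the Gaspari and Cohn (1999) function can be sketched as follows. This is an illustrative sketch, with an assumed periodic 1-D domain and invented function names; the support half-width c is a free parameter.

```python
import numpy as np

# Gaspari-Cohn (1999) compact-support correlation function: a 5th-order
# piecewise rational approximation to a Gaussian, exactly zero beyond 2c.
def gaspari_cohn(r, c):
    z = np.abs(r) / c
    w = np.zeros_like(z)
    m1 = z <= 1.0
    m2 = (z > 1.0) & (z <= 2.0)
    w[m1] = (-0.25 * z[m1]**5 + 0.5 * z[m1]**4 + 0.625 * z[m1]**3
             - 5.0 / 3.0 * z[m1]**2 + 1.0)
    w[m2] = (z[m2]**5 / 12.0 - 0.5 * z[m2]**4 + 0.625 * z[m2]**3
             + 5.0 / 3.0 * z[m2]**2 - 5.0 * z[m2] + 4.0 - 2.0 / (3.0 * z[m2]))
    return w

def localisation_matrix(Ng, c):
    """L_xx for a periodic 1-D domain of Ng grid points."""
    i = np.arange(Ng)
    dist = np.abs(i[:, None] - i[None, :]).astype(float)
    dist = np.minimum(dist, Ng - dist)          # periodic (circular) distance
    return gaspari_cohn(dist, c)

# Tapered covariance: element-wise (Schur) product with the sample estimator
Ng = 15
L = localisation_matrix(Ng, c=2.0)
rng = np.random.default_rng(2)
Xhat = rng.standard_normal((Ng, 5)) / 2.0
Pb = Xhat @ Xhat.T
Pb_loc = L * Pb                                 # eq. (18)
```

The Schur product suppresses long-range sample correlations while (as a side effect) increasing the rank of the estimator.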
In the gradient of the 4DVar cost function the product B H^T ∈ ℝ^{N_x×N_y} appears. In the ensemble setting, this product corresponds to P^{b,t}_{xy}, which can be computed and localised as:

P̃^{b,t}_{xy} = L_xy ∘ (X̂_{b,t} (Ŷ_{b,t})^T),  (19)

which clearly requires a new matrix L_xy ∈ ℝ^{N_x×N_y}. For linear observation operators, the two localisation matrices are related simply as:

L_xy = L_xx H^T.  (20)

In order to write a localised version of (17) we require factorisations of both P̃_{b,t} and P̃^{b,t}_{xy}, i.e. square roots such that:

P̃_{b,t} = P̃^{1/2}_{b,t} (P̃^{1/2}_{b,t})^T.  (21)

Buehner (2005) showed that one can obtain the factorisation:

P̃^{1/2}_{b,t} = [diag(x̂^{(1)}_{b,t}) L_x^{1/2}, diag(x̂^{(2)}_{b,t}) L_x^{1/2}, ..., diag(x̂^{(N_e)}_{b,t}) L_x^{1/2}] ∈ ℝ^{N_x×N_e r},  (22)

where the operator diag converts a vector into a matrix with said vector in its main diagonal and zeros elsewhere, x̂^{(i)}_{b,t} is the i-th column of X̂_{b,t}, L_x^{1/2} ∈ ℝ^{N_x×r} is a square root of L_xx, and r = rank(L_xx). If the localisation only depends on the distance amongst grid points (i.e. there is no special cross-variable or vertical localisation) then r = N_g, i.e.
simply the number of grid points. Constructing L_x^{1/2} can be done relatively easily via the following eigenvalue decomposition:

L_xx = C Γ C^T,  (23)

where C ∈ ℝ^{N_x×N_x} is the orthogonal matrix of eigenvectors, Γ ∈ ℝ^{N_x×N_x} is a diagonal matrix containing the eigenvalues, and we have used the fact that L_xx is symmetric. Then:

L_x^{1/2} = C Γ^{1/2}.  (24)

If there are many variables per grid point, the decomposition can be performed by blocks. Also, it is often more convenient to work with a truncated localisation matrix with a given rank r < N_g. This is discussed thoroughly in Appendix 1.
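The eigendecomposition-based square root, with optional rank truncation, can be sketched as follows. The function name loc_sqrt and the example matrix (an exponential correlation on a periodic domain) are our own choices for illustration:

```python
import numpy as np

def loc_sqrt(Lxx, r=None, tol=1e-10):
    """Square root of a symmetric PSD localisation matrix via L = C Gamma C^T.

    Returns L^{1/2} in R^{Nx x r}, keeping the r leading eigenpairs
    (all eigenvalues above tol if r is None)."""
    gamma, C = np.linalg.eigh(Lxx)              # eigenvalues in ascending order
    idx = np.argsort(gamma)[::-1]               # reorder: largest first
    gamma, C = gamma[idx], C[:, idx]
    if r is None:
        r = int(np.sum(gamma > tol))            # effective rank
    gamma = np.clip(gamma[:r], 0.0, None)       # guard tiny negative round-off
    return C[:, :r] * np.sqrt(gamma)

# Example: exponential correlation on a periodic 1-D domain (positive definite)
Ng = 6
i = np.arange(Ng)
dist = np.minimum(np.abs(i[:, None] - i[None, :]),
                  Ng - np.abs(i[:, None] - i[None, :]))
Lxx = np.exp(-dist / 1.5)
Lhalf = loc_sqrt(Lxx)
```

Truncating to r < N_g trades a small localisation error for a large reduction in the number of columns of P̃^{1/2}, which sets the size of the extended control vector.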
In the case of P̃^{b,t}_{xy} we need to write:

P̃^{b,t}_{xy} = P̃^{1/2}_{b,t} (P̃^{1/2}_{y;b,t})^T,   P̃^{1/2}_{y;b,t} = [diag(ŷ^{(1)}_{b,t}) L_y^{1/2}, ..., diag(ŷ^{(N_e)}_{b,t}) L_y^{1/2}],  (25)

with X̂_{b,t} defined as before, and where L_y^{1/2} ∈ ℝ^{N_y×r} can easily be found as:

L_y^{1/2} = H L_x^{1/2}.  (26)

The actual implementation we use can be found in Appendix 1. It is closer to Lorenc (2003) and his so-called α-control-variable formulation. Wang et al. (2007) proved that both formulations are equivalent.
2.2.2. Localisation in 4-dimensional cross-time covariances. Recall that we have to compute the product B (M_{0→t})^T H^T, which we replace by the sample estimator X̂_{b,0} (Ŷ_{b,t})^T. Note that these covariances involve both the ensemble of state variables at time 0 and the ensemble of equivalent observations at time t. Hence, different products should be localised with different localisation matrices:

L^{0,0}_{xy} ∘ (X̂_{b,0} (Ŷ_{b,0})^T),  (27)
L^{0,t}_{xy} ∘ (X̂_{b,0} (Ŷ_{b,t})^T).  (28)

Computing the static matrix L^{0,0}_{xy} is not difficult since it requires no information about the flow. However, there is no simple way of computing an optimal L^{0,t}_{xy}, since that would require taking into account the dynamics of the model. Generally, the implementation of SC4DEnVar uses L^{0,0}_{xy} for all time steps, which can result in an unfavourable performance with respect to SC4DVar (Lorenc et al., 2015). This is a crucial impediment to an adequate performance of SC4DEnVar with long assimilation windows. We will illustrate this issue in detail in Section 4, where we provide a concrete example.

Weak-constraint 4DVar
Let us revisit the discrete-time dynamical system studied before, but this time without assuming a perfect model. The evolution equation becomes:

x_t = m_{(t−1)→t}(x_{t−1}) + ν_t.  (29)

In this case m_{(t−1)→t} is the map ℝ^{N_x} → ℝ^{N_x} which evolves the state variables from time t−1 to time t, and ν_t ∈ ℝ^{N_x} is a rv representing the model error. Let it be a Gaussian rv centred at zero (unbiased), i.e. ν_t ∼ N(0, Q), where Q ∈ ℝ^{N_x×N_x} is the model error covariance matrix. Tremolet (2006) provides a detailed account of the ways to include the model error term in the 4DVar cost function. As he points out, there is no unique choice of control variable in this case. For our implementation, we choose a variation of what he labels the model-bias control variable. Consider the following representation of the true value of the state variables at any time:

x_t = m_{0→t}(x_0) + β_t,  (30)

i.e. the state variables at any time can be written as the perfect evolution of the initial condition (given by the discrete-time map m_{0→t}) plus an effective model error β_t ∈ ℝ^{N_x}, which contains the accumulated effect of the real model errors at each model step, along with their evolution. While Tremolet (2006) considered the effective model error to be constant for all observational times throughout the assimilation window (β_t = β ∀ 0 ≤ t ≤ τ), we let it vary for different observational times. For a linear model m_{t'→t} = M_{t'→t}, the relationship between the statistical characteristics of the model error ν_t at every time step from 0 to t and those of the effective model error β_t at time t can be found in the following way:

β_t = Σ_{t'=1}^{t} M_{t'→t} ν_{t'},   so that   Q_t ≡ Cov(β_t) = Σ_{t'=1}^{t} M_{t'→t} Q (M_{t'→t})^T,

which is not an easy relationship, but later we will discuss simpler (and more modest) ways to approximate Q_t. For the moment we consider this error to be unbiased. Let us express the WC4DVar cost function as:

J(x_0, β_{1:τ}) = (1/2)(x_0 − x_{0,b})^T B^{−1} (x_0 − x_{0,b}) + (1/2) Σ_{t=1}^{τ} β_t^T Q_t^{−1} β_t + (1/2) Σ_{t=0}^{τ} ρ_t [y_{o,t} − h(m_{0→t}(x_0) + β_t)]^T R^{−1} [y_{o,t} − h(m_{0→t}(x_0) + β_t)],  (31)

where the function now depends upon x_0 and β_{1:τ}. Notice that all the (perfect) model evolutions m_{0→t} start at time 0. This is why we have chosen (30) as the definition of our control variables.
If instead we chose the model error increment ν_t at every time step as control variable, we would find a cost function with terms of the form m_{t'→t}(x_{t'}), i.e. perfect model evolutions initialised at different model time steps, where x_{t'} contains the effects of all the ν's at previous model time steps. This would complicate the problem considerably, especially when using ensembles.
As in the 4DVar case, we can write down a preconditioned incremental formulation. Recalling that:

x_t = m_{0→t}(x_{0,b} + δx_0) + β_t,  (32)

we express the control variables as:

δx_0 = B^{1/2} v_0,   β_t = Q_t^{1/2} v_t,  (33)

which allows us to perform the following first-order Taylor expansion:

y_{o,t} − h(m_{0→t}(x_0) + β_t) ≈ d_t − H_t (M_{0→t} B^{1/2} v_0 + Q_t^{1/2} v_t),  (34)

where

d_t = y_{o,t} − h(m_{0→t}(x_{b,0}))  (35)

as in the SC case. The incremental preconditioned form of the cost function for WC4DVar is:

J(v_{0:τ}) = (1/2) Σ_{t=0}^{τ} v_t^T v_t + (1/2) Σ_{t=0}^{τ} ρ_t [d_t − H_t (M_{0→t} B^{1/2} v_0 + Q_t^{1/2} v_t)]^T R^{−1} [d_t − H_t (M_{0→t} B^{1/2} v_0 + Q_t^{1/2} v_t)].  (36)

We can write the gradient as the following vector:

∇_{v_0} J = v_0 − Σ_{t=0}^{τ} ρ_t (H_t M_{0→t} B^{1/2})^T R^{−1} [d_t − H_t (M_{0→t} B^{1/2} v_0 + Q_t^{1/2} v_t)],
∇_{v_t} J = v_t − ρ_t (H_t Q_t^{1/2})^T R^{−1} [d_t − H_t (M_{0→t} B^{1/2} v_0 + Q_t^{1/2} v_t)],   t = 1, ..., τ,  (37)

where the control variable is v_{0:τ} ∈ ℝ^{(τ+1)N_x}. It would seem that the size of the problem has grown considerably with respect to the SC case. Nonetheless, let us look closer at the structure of (37). Remember that ρ_t = 1 only if t belongs to the set of time steps with observations (which we denote {t_obs}), and ρ_t = 0 otherwise. Hence the sum in the first block-element of ∇_{v_{0:τ}} J(v_{0:τ}) only has τ_obs (the total number of observational times) terms. For all the other block elements we have two terms: the first corresponding to the model error contribution and the second corresponding to the observational contribution. For all time steps without observations, this latter term is null. In these time steps there is nothing to make the model error differ from its background value, which is zero by construction. Therefore v_t = 0 for all time steps without observations, and consequently we can redefine our control variable to contain only the increments at t = 0 and at the observational times, i.e. v_{0:τ_obs} ∈ ℝ^{(1+τ_obs)N_x}. Hence, the number of control variables, as well as the computational cost of the problem, scales with the number of observational times included in a given assimilation window.
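The block structure of the WC gradient, with control variables only at the initial time and at observation times, can be sketched for a toy linear model. This is an illustrative sketch (invented matrices and dimensions), and for simplicity it uses the Q_t = tQ parametrisation discussed later in the paper:

```python
import numpy as np

# Toy sketch of the WC4DVar block gradient (eq. structure of (37)), keeping
# control variables only at t = 0 and at the observation times.
rng = np.random.default_rng(3)
Nx, Ny = 6, 3
obs_times = [2, 4]

B_half = 0.7 * np.eye(Nx)
Q = 0.05 * np.eye(Nx)
H = np.eye(Ny, Nx)
Rinv = np.eye(Ny) / 0.1
M = np.eye(Nx) + 0.02 * rng.standard_normal((Nx, Nx))
d = {t: rng.standard_normal(Ny) for t in obs_times}

def Q_half(t):
    # Simple Q_t = t Q parametrisation (Wiener-process-like accumulation)
    return np.sqrt(t) * np.linalg.cholesky(Q)

def grad(v0, v):
    """Gradient blocks; v is a dict {t: v_t} over observation times only."""
    g0 = v0.copy()
    g = {}
    for t in obs_times:
        G0 = H @ np.linalg.matrix_power(M, t) @ B_half   # H_t M_{0->t} B^{1/2}
        Gt = H @ Q_half(t)                               # H_t Q_t^{1/2}
        r = d[t] - G0 @ v0 - Gt @ v[t]
        g0 -= G0.T @ Rinv @ r
        g[t] = v[t] - Gt.T @ Rinv @ r
    return g0, g
```

Note that the dictionary over obs_times makes explicit that the cost scales with the number of observational times, not with the window length.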

Weak-constraint 4DEnsembleVar
As in the SC case, we want to find a way to avoid computing TLMs and AMs. In the absence of localisation, we substitute (B_t)^{1/2} = X̂_{b,t} and H (B_t)^{1/2} = Ŷ_{b,t} to yield the gradient of the cost function:

∇_{v_0} J = v_0 − Σ_{t∈{t_obs}} (Ŷ_{b,t})^T R^{−1} [d_t − Ŷ_{b,t}(v_0 + v_t)],
∇_{v_t} J = v_t − (Ŷ_{b,t})^T R^{−1} [d_t − Ŷ_{b,t}(v_0 + v_t)],   t ∈ {t_obs},  (38)

where the control variable is v_{0:τ_obs} ∈ ℝ^{(1+τ_obs)N_e}. As in the SC case, the actual implementation details can be found in Appendix 1.
In the SC case, the influence from observations at different times leads to a single increment computed at the initial time, x_{b,0} → x_{a,0}, and the analysis trajectory comes from the evolution of the new initial condition. Roughly speaking, the information from observations at time t produces changes at time 0 via the covariance L^{0,t}_{xy} ∘ (X̂_0 (Ŷ_t)^T). Since we do not have L^{0,t}_{xy} we simply use L^{0,0}_{xy}. This is not too inaccurate for observations occurring relatively close (in a time-distance sense) to the beginning of the assimilation window. However, using L^{0,0}_{xy} instead of L^{0,t}_{xy} can lead to considerable errors as t grows. Hence, this issue is of particular importance for long assimilation windows with observations close to the end of the window.
In the WC formulation we propose, an observation at time t influences the state variables both at t = 0 and at t = t_obs. So although we still have an incorrectly localised expression of the form L^{0,t}_{xy} ∘ (X̂_0 (Ŷ_t)^T), we also have a correctly localised term L^{t,t}_{xy} ∘ (X̂_t (Ŷ_t)^T), where obviously L^{t,t}_{xy} = L^{0,0}_{xy}. We have not solved the localisation problem, but we ameliorate it by allowing for increments at the time steps in which observations occur.
The way our formulation works, we have jumps (from background to analysis) at the initial time and at the times of the observations. We have not experimented with ways to obtain a smooth trajectory for the whole assimilation window. A first approach would be to modify the initial conditions, and then distribute the updates at observational times amongst the time steps between observations, following a procedure similar to the Incremental Analysis Update of Bloom et al. (1996). This would avoid the sharp jumps from background to analysis at the times of the observations.

A note on Q_t
We have not said anything about the computation of Q_t, i.e. the covariance matrix of the effective model error β_t. In principle, it could be calculated if we had access to two ensembles: one evolved without model error using (1), X^{0:τ}_{per}, and one evolved with model error using (29), X^{0:τ}_{imp}. For any time t we can use expression (30) to write the imperfect state in terms of the perfect state at that time plus an effective error β_t:

x^{imp}_t = x^{per}_t + β_t.  (39)

We can compute the first two moments of this error as:

E[β_t]   and   Q_t = E[(β_t − E[β_t])(β_t − E[β_t])^T],  (40)

and the previous expressions can be approximated by evaluating sample estimators:

β̄_t = (1/N_e) Σ_{i=1}^{N_e} (x^{imp,(i)}_t − x^{per,(i)}_t),  (41)

Q̂_t = (N_e − 1)^{−1} Σ_{i=1}^{N_e} (β^{(i)}_t − β̄_t)(β^{(i)}_t − β̄_t)^T.  (42)

There are two issues with these calculations. First of all, we need two ensembles running at the same time. This is not too demanding; in fact, we could divide a moderate-size ensemble into two groups and run these groups with and without model error, respectively. The major complication, however, is the evaluation of Q_t in an efficient way. Some of the difficulties include the size of the resulting matrix, as well as its rank. We have not explored this direction. Instead we have chosen a rather simple parametrisation of this error:

Q_t = t Q,  (43)

which can be interpreted as considering the accumulated model error to behave like a Wiener process. This is exact for a linear model with M = I, i.e. when the evolution model is the identity. A more precise approximation, or indeed an efficient way to compute the actual value, is beyond the scope of this paper.
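The paired-ensemble estimate of the effective model-error moments described above can be sketched as follows. This is an illustrative sketch with an invented linear model and per-step error covariance, not the paper's configuration:

```python
import numpy as np

# Sketch: sample estimate of the effective model-error moments from two
# paired ensembles, one evolved with the perfect model and one with an
# additive model error at every step.
rng = np.random.default_rng(4)
Nx, Ne, nsteps = 8, 40, 10

M = np.eye(Nx) + 0.01 * rng.standard_normal((Nx, Nx))  # illustrative model
Q_half_step = 0.1 * np.eye(Nx)                         # per-step error sqrt

X_per = rng.standard_normal((Nx, Ne))                  # common initial ensemble
X_imp = X_per.copy()
for _ in range(nsteps):
    X_per = M @ X_per                                  # perfect evolution, eq. (1)
    X_imp = M @ X_imp + Q_half_step @ rng.standard_normal((Nx, Ne))  # eq. (29)

beta = X_imp - X_per                 # realisations of the effective error
beta_mean = beta.mean(axis=1)        # sample mean, eq. (41)
Q_t_hat = np.cov(beta)               # sample covariance estimate, eq. (42)
```

For M close to the identity, the trace of this estimate grows roughly linearly with the number of steps, which is the intuition behind the Q_t = tQ parametrisation.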

The KdV equation
In this section we illustrate the effects of using a static localisation matrix to localise time cross-covariances. We perform experiments using the Korteweg-de Vries (KdV) system as our model. The KdV equation is a non-linear partial differential equation in one dimension:

∂u/∂t + u ∂u/∂s + ∂³u/∂s³ = 0,  (44)

where t denotes time, s is the spatial dimension, and u is a one-dimensional velocity field. This equation admits solutions called solitons, which correspond to the propagation of a single coherent wave; see e.g. Shu (1987) or Zakharov and Faddeev (1971). This system has been used before for DA experiments in the context of feature DA and alignment error (e.g. Van Leeuwen, 2003; Lawson and Hansen, 2005). This system can be challenging for DA, since the propagation of coherent structures makes the linear and Gaussian assumptions in both the background and observational errors less accurate. Nonetheless, the system is well suited to show the emergence of asymmetric off-diagonal elements in the time cross-covariance matrices. Hence, we will use it for illustrative purposes and simple DA experiments. More detailed experiments with more appropriate models are shown in part II of this paper.
We reduce (44) to a set of ordinary differential equations (ODEs) by discretising s into N_g grid points s = [s_1, ..., s_j, ..., s_{N_g}] separated by Δs, and performing a central, 2nd-order-accurate finite-difference approximation of both the dispersion and advection terms (after writing the latter in conservative form). Denoting u_j = u(s_j) and u = [u_1, ..., u_j, ..., u_{N_g}], we can write:

du_j/dt = −(u²_{j+1} − u²_{j−1})/(4Δs) − (u_{j+2} − 2u_{j+1} + 2u_{j−1} − u_{j−2})/(2Δs³).  (45)

We choose N_g = 15 grid points and Δs = 1, and allow for periodic boundary conditions, i.e. u_j ≡ u_{j mod N_g}, where mod denotes the modulo operation. We use a 4th-order Runge-Kutta method with time step Δt = 0.25 to integrate (45) numerically in time. This integration renders a discrete-time map m_{0→t}.
The initial condition of a perfect soliton is given by:

u(s, t=0) = 3A sech²((√A/2)(s − s_0)),  (46)

where s_0 = 5 is the centre of the soliton at t = 0, and 3A corresponds to the maximum velocity of the soliton. We choose A = 1 (the propagation speed), which, combined with our steps Δt = 0.25 and Δs = 1, yields a Courant number C = 0.75. Figure 2 is a Hovmoller plot showing the time evolution of the soliton through the domain. The horizontal axis represents different grid points, the vertical axis represents time (evolving from bottom to top), and the shading is proportional to the velocity at each grid point.
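The discretised model (45) with the soliton initial condition (46) can be sketched in a few lines. Grid parameters follow the text; the function names (tendency, rk4_step) and the exact form of the discretisation are our own reconstruction of the scheme described above:

```python
import numpy as np

# Sketch of the discretised KdV model: 2nd-order central differences for the
# conservative-form advection and the dispersion terms, integrated with RK4.
# Parameter values follow the text: Ng = 15, ds = 1, dt = 0.25, A = 1.
Ng, ds, dt, A = 15, 1.0, 0.25, 1.0

def tendency(u):
    """Right-hand side of the semi-discrete KdV system, eq. (45)."""
    up1, um1 = np.roll(u, -1), np.roll(u, 1)    # periodic neighbours
    up2, um2 = np.roll(u, -2), np.roll(u, 2)
    adv = (up1**2 - um1**2) / (4.0 * ds)        # (u^2/2)_s, conservative form
    disp = (up2 - 2.0*up1 + 2.0*um1 - um2) / (2.0 * ds**3)
    return -adv - disp

def rk4_step(u, dt):
    """One step of the 4th-order Runge-Kutta scheme."""
    k1 = tendency(u)
    k2 = tendency(u + 0.5 * dt * k1)
    k3 = tendency(u + 0.5 * dt * k2)
    k4 = tendency(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2.0*k2 + 2.0*k3 + k4)

# Soliton initial condition, eq. (46), centred at s0 = 5
s = np.arange(Ng) * ds
u0 = 3.0 * A / np.cosh(0.5 * np.sqrt(A) * (s - 5.0))**2
```

The conservative form of the advection term means the discrete scheme conserves the total "mass" sum(u) exactly on the periodic domain, a useful sanity check when integrating over many steps.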
For our experiments we require the discrete-time TLM (or transition matrix) M_{0→t} of the model. We first obtain the continuous-time TLM F = ∂f/∂u, where f denotes the right-hand side of (45). This is a hollow matrix with cyclic banded structure, in which the j-th row has all elements equal to zero except for:

{F}_{j,j+1} = −u_{j+1}/(2Δs) + 1/Δs³,   {F}_{j,j−1} = u_{j−1}/(2Δs) − 1/Δs³,
{F}_{j,j+2} = −1/(2Δs³),   {F}_{j,j−2} = 1/(2Δs³),  (47)

where we recall that the indices are modular. Obtaining the transition matrix M from the continuous-time TLM F involves the straightforward solution of a matrix differential equation; see e.g. Simon (2006).
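The Jacobian (47) and the transition matrix obtained by integrating dM/dt = F(u(t)) M alongside the state can be sketched as follows. This is our own illustrative reconstruction (it repeats the tendency of the discretisation described above, so the block is self-contained); integrating the TLM with the same RK4 stages as the state makes M exactly the Jacobian of the discrete-time map:

```python
import numpy as np

# Sketch: continuous-time TLM F = df/du of the discretised KdV model, and the
# transition matrix obtained by integrating dM/dt = F(u(t)) M with the same
# RK4 scheme used for the state. Parameters as in the text.
Ng, ds, dt = 15, 1.0, 0.25

def tendency(u):
    up1, um1 = np.roll(u, -1), np.roll(u, 1)
    up2, um2 = np.roll(u, -2), np.roll(u, 2)
    return (-(up1**2 - um1**2) / (4.0 * ds)
            - (up2 - 2.0*up1 + 2.0*um1 - um2) / (2.0 * ds**3))

def jacobian(u):
    """Continuous-time TLM F, eq. (47), with modular (periodic) indices."""
    F = np.zeros((Ng, Ng))
    for j in range(Ng):
        F[j, (j + 1) % Ng] = -u[(j + 1) % Ng] / (2.0 * ds) + 1.0 / ds**3
        F[j, (j - 1) % Ng] = u[(j - 1) % Ng] / (2.0 * ds) - 1.0 / ds**3
        F[j, (j + 2) % Ng] = -0.5 / ds**3
        F[j, (j - 2) % Ng] = 0.5 / ds**3
    return F

def rk4_state_and_tlm(u, M, dt):
    """One RK4 step of the coupled system du/dt = f(u), dM/dt = F(u) M."""
    k1, K1 = tendency(u), jacobian(u) @ M
    u2 = u + 0.5 * dt * k1
    k2, K2 = tendency(u2), jacobian(u2) @ (M + 0.5 * dt * K1)
    u3 = u + 0.5 * dt * k2
    k3, K3 = tendency(u3), jacobian(u3) @ (M + 0.5 * dt * K2)
    u4 = u + dt * k3
    k4, K4 = tendency(u4), jacobian(u4) @ (M + dt * K3)
    return (u + dt / 6.0 * (k1 + 2.0*k2 + 2.0*k3 + k4),
            M + dt / 6.0 * (K1 + 2.0*K2 + 2.0*K3 + K4))
```

The resulting M can be checked against finite differences of the non-linear map, which is a standard TLM correctness test.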

Covariance propagation
Let us start by obtaining the nature run of our system. As initial condition we use the profile described in (46), and we evolve the model until t = 100 (i.e. 400 model steps). The result of this run is illustrated in Fig. 2; this model run will later be used as the nature (or true) run of our DA experiments. The soliton propagates from left to right through the domain, taking about 15 time units (60 time steps) to complete a loop around the domain. We now examine the time propagation of a given B_c. We construct this matrix following the method of Yang et al. (2006), which we describe briefly in Appendix 2. We obtain a circulant matrix, whose normalised values for the j-th row are given in (48). Since we have the transition matrix of the model, we can explicitly compute both the covariance of the state at any time t:

B_t = M_{0→t} B_0 (M_{0→t})^T,  (49)

and the cross-covariance between time t = 0 and time t:

Cov(x_0, x_t) = B_0 (M_{0→t})^T.  (50)

We evaluate both (50) and (49) for different lead times t, and plot these matrices in Fig. 3. Each individual panel is a covariance matrix; the horizontal and vertical axes are grid-point locations. Let us start with the top row, which shows Cov(x_0, x_t) for different lead times t ∈ {0, 1, 5, 10, 15} (one for every column). The initial covariance is the circulant matrix of (48), but as time increases we notice that off-diagonal features become more prominent. We notice a region of high values (shown in blue) developing above the main diagonal and the appearance of negative values (shown in white) along the main diagonal. The resulting matrices are not symmetric from t = 1 onwards, and this asymmetry grows as time passes. This is particularly evident after 10 and 15 steps.
The third row shows B_t for different lead times t ∈ {0, 1, 5, 10, 15}. In this case, all the matrices are symmetric by construction. There is a development of (symmetric) off-diagonal negative elements (shown in white), but most of the information remains concentrated close to the main diagonal. The magnitude of the matrix elements increases as time passes: as expected, uncertainty grows in time.
Recall that we computed both Cov(x_0, x_t) and B_t using B_0 and the actual transition matrix M_{0→t}. The objective of 4DEnVar, however, is to compute these quantities using ensemble estimators, which contain spurious correlations if the ensemble size is small, leading to the need for localisation. The second and fourth rows of Fig. 3 show the effect of localising Cov(x_0, x_t) and B_t, respectively, using a localisation matrix L^{0,0}_{xx} which follows a Gaspari and Cohn (1999) function with half-width λ = 1 grid point (centred on the main diagonal). As we can see in row 2, the localised versions of Cov(x_0, x_t) can lose most of the important information, and this problem becomes more pronounced as the time lag increases. The localised versions of B_t, shown in the bottom row, lose less information, since the most distinguishable features were close to the main diagonal.
To better appreciate the effect of localisation with different half-widths, in Fig. 4 we depict the values of the 6th row of the covariance matrices. The left panel of the first row shows Cov(x_0, x_t)[6, :]. Each line shows a different lead time. Suppose we are observing all variables, and look at the red line in the figure. We see a peak corresponding to grid point 10: this means that, for a lead time of 15 time steps, the most useful information to update grid point 6 (at t = 0) comes from the observation at grid point 10. This is no surprise, since the general direction of the flow is from left to right. In fact, for all lead times the most important information to update grid point 6 comes from observations at grid points to its right. When localising the covariances this information is lost. This is not too damaging for relatively large-scale localisation (middle column), but becomes very evident for strict localisation (right column). In both of these panels, the shape of the localisation function is shown in light gray for reference.
The left panel of the second row of Fig. 4 shows the 6th row of B_t (we actually normalise it into a correlation-type matrix for ease of comparison, since the magnitudes in this matrix grow with time); again, different lines show different lead times. In all cases the dominant element is the one corresponding to grid point 6. New features do develop (with respect to the original B_0, i.e. the black line), but none is dominant. When localising with either a large (middle column) or a small (right column) half-width, the information is not lost as crucially as before.

Setup
For these experiments, we extend the nature run generated before until t = 200 (i.e. we have 800 model steps). We create synthetic observations by taking the value of the variables at all grid points and all model time steps, and adding an uncorrelated random error with zero mean and covariance R = σ_o² I, with σ_o² = 0.1. For the experiments we use subsets of these observations, both in time and space.
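The generation of synthetic observations described above can be sketched as follows. The nature-run array here is a random placeholder, while the sampling pattern (every 3rd grid point, a period in model steps, and R = σ_o² I) follows the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder nature run: Nt model steps, Nx grid points.
Nx, Nt = 15, 800
x_true = rng.standard_normal((Nt, Nx))

sigma_o2 = 0.1                      # observation-error variance, R = sigma_o2 * I
obs_every_steps = 5                 # observational period in model steps
obs_grid = np.arange(0, Nx, 3)      # every 3rd grid point -> 5 observed points

t_obs = np.arange(0, Nt, obs_every_steps)
# Sample the truth at observed times/points and add uncorrelated noise.
noise = np.sqrt(sigma_o2) * rng.standard_normal((t_obs.size, obs_grid.size))
y = x_true[np.ix_(t_obs, obs_grid)] + noise
```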
We observe every 3rd grid point; hence we have 5 observed and 10 unobserved grid points. This sparse observational density (in space) makes the situation challenging, and the quality of the off-diagonal elements in the covariance matrices is quite important to communicate information from observed to unobserved variables. We experiment with several observational periods, and report the results of 3 cases: observations every 2 model steps, every 5 model steps, and every 10 model steps.
We test the following formulations: 3DVar, SC4DVar, ETKS, LETKS, SC4DEnVar and LSC4DEnVar. The L at the beginning of the EnVar methods denotes the localised version. ETKS and LETKS (S for smoother instead of filter) denote the 'no-cost smoother' versions of ETKF and LETKF, respectively (see Kalnay and Yang, 2010 for a description).
Even though the nature run is generated with a perfect model, we also use WC methods for the sake of comparison. We use the effective-error WC4DVar, as well as WC4DEnVar and LWC4DEnVar. In this case, we model the (fictitious) effective model error as a zero-centred Gaussian random variable with Q = 0.01 B_c.
For the ensemble and hybrid methods we use a tiny ensemble size of N_e = 3. With this choice we try to mimic realistic situations where N_e ≪ N_x; this chosen value of N_e should render low-quality sample estimators. The adaptive inflation of Miyoshi (2011) is used, with an initial value of ρ_0 = 0.05. Inflation is applied multiplicatively, scaling the ensemble perturbations by √(1 + ρ), but following Miyoshi (2011) the inflation value is different for each grid point and evolves with time.
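A minimal sketch of multiplicative inflation with per-grid-point factors is given below. The adaptive update of ρ itself (Miyoshi, 2011) is not reproduced; only the application of a given ρ to the ensemble perturbations is shown:

```python
import numpy as np

def inflate(ensemble, rho):
    """Multiplicative covariance inflation: perturbations about the ensemble
    mean are scaled by sqrt(1 + rho). `rho` may be a scalar or a
    per-grid-point array (as in adaptive schemes); the mean is preserved
    and the sample covariance is multiplied by (1 + rho)."""
    mean = ensemble.mean(axis=0)
    pert = ensemble - mean
    return mean + np.sqrt(1.0 + rho) * pert

# Example: 3 members, 15 grid points, initial rho_0 = 0.05 everywhere.
rng = np.random.default_rng(1)
ens = rng.standard_normal((3, 15))
rho0 = np.full(15, 0.05)
inflated = inflate(ens, rho0)
```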
For the localised methods, a Gaspari-Cohn function with half-width λ is used. We explored several values and kept the optimal one, which depends on the period of observations in each experiment. For the localised EnVar methods, we truncated the localisation matrix at 11 eigenvalues; truncation with 9 to 15 kept eigenvalues was tried and gave similar results (not shown).

Fig. 4. Elements of the 6th row of the matrices Cov(x_0, x_t) (top row) and B_t (bottom row), for different times (different colors). The first column shows the unaltered elements, while the middle and right columns show these elements after being localised using Gaspari-Cohn functions of different half-widths (gray lines).
For the variational and hybrid methods, we use one observational time per assimilation window. We use the incremental pre-conditioned forms discussed earlier, and we do not use outer loops in any of these methods. We later performed experiments with 2 observational times per assimilation window and this did not change the main results (hence not shown).

Results
The results of these experiments are depicted in Figs. 5-7. Each figure has 3 columns and 2 rows. The top row corresponds to results for observed grid points, while the bottom row corresponds to results for unobserved grid points. Three panels are shown in each case: i. The left panel shows the time evolution of the truth (black line) over 41 ≤ t ≤ 49, as well as the analysis trajectories reconstructed by the different DA methods (shown in different colors). As an example of an observed variable we choose grid point 1; observations are shown as gray circles. As an example of an unobserved variable we show grid point 3.
ii. The middle panel shows the evolution of the analysis RMSE with respect to the truth (over 41 ≤ t ≤ 49), computed as

RMSE(t) = sqrt( (1/N) Σ_n (x_n^a(t) − x_n^true(t))² ),

where we use the mean of the analysis ensemble for (L)ETKS, and the unique analysis trajectory for the Var and EnVar methods. The index n runs over the N observed or unobserved grid points, depending on the case. The horizontal gray line shows the standard deviation of the observational error.
iii. The right panel shows summary statistics for the analysis RMSE. We eliminate the period 0 ≤ t ≤ 10 (40 time steps) as a transient. For the remaining time steps we compute the 1st, 2nd (median) and 3rd quartiles of the analysis RMSE. These are depicted, for each and every method, using an upward-pointing triangle, a circle, and a downward-pointing triangle respectively. Again, the standard deviation of the observational error is shown with a horizontal gray line.

Observations every 2 model steps.

The results are shown in Fig. 5. Most methods perform well at this observational frequency, including SC4DEnVar, especially for the unobserved variables. 3DVar performs slightly better than 4DVar (both SC and WC), which is understandable since the observations are frequent. Both ETKS and LETKS perform worse than the Var and EnVar methods. Localisation does not help the performance of ETKS. We explored some values of localisation and settled on λ = 1.0, although we did not perform an exhaustive search; this same value was used in the EnVar methods. The worst-performing method is WC4DEnVar (with no localisation), which is understandable since it contains the fictitious (and in this case unnecessary) model error, on top of the sample error from the ensemble part. Nonetheless, its performance is greatly improved by localisation.

Observations every 5 model steps.
We increase the observational period to 5 model steps; the results are shown in Fig. 6. We start noticing a distinction in the performance of the DA methods. As expected, 3DVar is now the worst-performing method for both observed and unobserved variables. Again, both SC4DVar and WC4DVar are the best methods, with SC4DEnVar and WC4DEnVar next. Localisation damages the performance of SC4DEnVar, leading to larger RMSE's in general; we started seeing this phenomenon with an observational period of Δt ≥ 4 model time steps. For WC4DEnVar, localisation improves the performance considerably. The performance of ETKS and LETKS is slightly better than that of the Var methods. Again, for (L)ETKS and the EnVar methods we used a localisation half-width of λ = 1.0.

Observations every 10 model steps.

This rather large observational period presents a DA challenge. The results are presented in Fig. 7. 3DVar is the worst-performing method, with an RMSE considerably higher than the observational error for both observed and unobserved variables. SC4DVar and WC4DVar perform considerably worse than for the previous observational frequencies, for both observed and unobserved variables. The observations are infrequent and the assimilation window is too long; hence, the linearisation we perform in the incremental form loses validity and would require the use of outer loops. In this case both ETKS and LETKS are amongst the best-performing methods, and localisation helps reduce the RMSE slightly.
SC4DEnVar is still the best method: the use of ensemble information allows the variational solution to find a global minimum with no need for outer loops. Nonetheless, the inclusion of localisation damages the performance of the method. WC4DEnVar performs very poorly when not localised. One might think that having more degrees of freedom (control variables) in the fitting always helps, but this is not the case when they provide wrong information. After localising, the performance improves considerably for both observed and unobserved variables.
For these experiments, the half-width used in the localisation of both (L)ETKS and the hybrid methods is λ = 3.

Conclusion and summary
In this paper we have explored the ensemble-variational DA formulation. As a motivation, we discussed the problem of localising time cross-covariances using static localisation matrices, and we illustrated this effect with the help of the KdV model. These time cross-covariances develop off-diagonal features that are erroneously eliminated when using localisation matrices that contain no information about the direction of the flow.
Our main contribution is the introduction of an ensemble-variational method in a weak-constraint scenario, i.e. considering the effect of model error. For simplicity, we have chosen as control variable the effective model error at the time of observation.
We have discussed the implementation of the model-error term in the cost function using ensemble information. Our formulation leads to updates at the beginning of the assimilation window and at the observation times within the window. We require only one ensemble, initialised at the beginning of the window, and the size of the problem does not scale with the number of model steps. However, we do require the correct statistics of this effective model error, for which we offer a modest approach. Our experiments also suggest that having updates at times other than the start of the window helps ameliorate (but does not solve) the problem of badly-localised time cross-covariances.
The results presented in this part I encourage us to explore the performance of our methods in a more thorough manner in more appropriate models. This is the subject of part II.
and let us discuss how fast the eigenvalues {γ_n} decay. For this, let us consider the half-width λ in three cases: a. The case λ → ∞ corresponds to no localisation, where L_xx becomes a matrix of ones. The rank of this matrix is 1: there is a single non-zero eigenvalue γ_1 = N_x, whose eigenvector is the vector of ones. This is equivalent to doing no localisation at all. b. The case λ = 0 corresponds to strict localisation, where the variables at a grid point are only affected by the observation at that very grid point. Then L_xx becomes the identity matrix. The rank of this matrix is N_x: there are N_x non-zero eigenvalues, all with a value of 1, and N_x eigenvectors.
c. Localisation matrices with finite λ > 0 fall somewhere in the middle. For the type of localisation used in this paper, where only the distance between grid points determines the localisation value, L_xx is a circulant matrix, and it follows that in this case the columns of C are Fourier modes: column 1 corresponds to a constant (zeroth harmonic), columns 2 and 3 correspond to the first harmonic (one column for the sine and one for the cosine), columns 4 and 5 to the second harmonic, and so forth. The associated eigenvalues in Γ decrease as the order of the harmonic increases. One can then retain only the first 1 + 2k columns of C and the first 1 + 2k eigenvalues of Γ.

Let us illustrate this in Fig. 8 with an example. For a system with N_x = 100 grid points, we generate localisation matrices L_xx using different half-widths λ = 1, 2, 3, 4, 5 (shown with different line styles in the figure), and we perform their spectral decompositions. The top panel shows the eigenvalues, in decreasing order of magnitude, for each of the different half-widths. The decay in magnitude is slow for λ = 1, but it becomes considerably faster as the half-width increases. The bottom panel shows the cumulative sum of these eigenvalues. For λ > 1 it quickly approaches its maximum value Σ_{n=1}^{N_x} γ_n = N_x, and many eigenvalues contribute only marginally to this sum, especially for larger localisation half-widths. Small eigenvalues can be discarded to speed up the computation, creating a simpler minimisation problem. The minimisation becomes easier for two reasons: first, the size of the control vector decreases; second, keeping all eigenvalues forces the minimisation to consider small grid-point-by-grid-point variations, which can make the problem harder.
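The spectral truncation described above can be sketched as follows. A Gaussian-shaped taper is used here as a stand-in for the Gaspari-Cohn function: any distance-based taper on a periodic grid yields a circulant matrix with the same Fourier eigenstructure (a leading constant mode followed by sine/cosine harmonic pairs):

```python
import numpy as np

# Circulant, distance-based localisation matrix on a periodic grid.
# (Gaussian taper as a stand-in for Gaspari-Cohn; unit diagonal.)
Nx, lam = 100, 4.0
i = np.arange(Nx)
d = np.abs(i[:, None] - i[None, :])
dist = np.minimum(d, Nx - d)
Lxx = np.exp(-0.5 * (dist / lam) ** 2)

# Spectral decomposition L_xx = C Gamma C^T; for a circulant matrix the
# eigenvectors are Fourier modes and the non-leading eigenvalues come in
# harmonic pairs.
gamma, C = np.linalg.eigh(Lxx)
order = np.argsort(gamma)[::-1]          # sort by decreasing magnitude
gamma, C = gamma[order], C[:, order]

# Keep the leading 1 + 2k modes (zeroth harmonic plus k harmonic pairs).
k = 5
n_keep = 1 + 2 * k
Lxx_trunc = C[:, :n_keep] @ np.diag(gamma[:n_keep]) @ C[:, :n_keep].T
```

Since the diagonal of L_xx is all ones, the eigenvalues sum to the trace N_x, matching the cumulative-sum plateau described above.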
To determine the number of eigenvalues to keep, a simple idea is to choose the smallest value n that fulfils

Σ_{j=1}^{n} γ_j ≥ c N_x, (58)

where the eigenvalues γ_j are ordered by decreasing magnitude, and 0 < c < 1 is a fraction of our choice. The closer this fraction is to one, the closer our cropped localisation matrix will be to the original one. The N_x on the right-hand side of (58) comes from the fact that the eigenvalues of L_xx sum to its trace, which equals N_x. In Fig. 9 we illustrate this choice for two fractions: 0.5 in the left panel and 0.9 in the right panel. For different total numbers of grid points N_x (lines with different colors), we show the required ratio n/N_x as a function of the localisation radius λ (the half-width of the Gaspari-Cohn function). As expected, this normalised quantity depends only on the localisation radius, and it is a decreasing function of it. This corresponds perfectly with our previous discussion of the two limiting cases.
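Criterion (58) can be sketched as a short routine; the decaying spectrum used in the example is synthetic:

```python
import numpy as np

def n_to_keep(gamma, c):
    """Smallest n with sum_{j=1}^{n} gamma_j >= c * sum_j gamma_j, with the
    eigenvalues taken in decreasing order of magnitude. For a localisation
    matrix with unit diagonal, the total sum equals N_x."""
    g = np.sort(gamma)[::-1]
    csum = np.cumsum(g)
    return int(np.searchsorted(csum, c * csum[-1]) + 1)

# Example with a synthetic, geometrically decaying spectrum.
gamma = np.exp(-0.1 * np.arange(100))
n90 = n_to_keep(gamma, 0.9)   # eigenvalues needed to cover 90% of the sum
```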
An example of the reduction in computational cost can be seen at λ = 4. If our desired fraction is 0.9, we need to keep around 11% of the total number of eigenvalues; if the desired fraction were 0.5, the required fraction would be reduced to 5%.

Fig. 9. Construction of L_xx in SC and WC4DEnVar. For different numbers of grid points N and different localisation half-widths (horizontal axis), the vertical axis shows the ratio of retained eigenvalues n over N needed to cover a given fraction of the total sum of eigenvalues. For the left panel this fraction is 0.5; in the right panel it is 0.9.