Rapid update cycling with delayed observations

In this paper we examine the fundamental issues associated with the cycling of data assimilation and prediction in the case where observations are received after a delay, but we seek to assimilate them immediately on receipt, or within a short time of receipt. We obtain the optimal solution to this problem in the linear and non-linear cases, and explore its relation to simplified strategies which are adaptations of contemporary methods for large-scale data assimilation. We also discuss the challenges facing such cycling in large-scale numerical weather prediction.


Introduction
In the traditional cycling of forecasts and data assimilation (DA) for numerical weather prediction, the DA step for the global model has occurred every 6 or 12 hours. This was appropriate for an era when data was concentrated at the main synoptic times, and the limited area models (LAMs) for which the global model provides boundary conditions were cycled every 6 hours. However, in recent years data has become dominated by sources which are essentially continuous in time, and centres such as the Met Office will soon cycle their highest resolution LAM every hour.
By increasing the frequency of global analyses (e.g. to every hour) global forecasts can be based on more recent data, which is not only desirable in itself but provides timely lateral boundary conditions (LBCs) for high resolution LAMs. In one study of the Met Office's 1.5 km LAM covering the British Isles (Tang et al., 2013) it was found that replacing 3-hour and 6-hour old LBCs by 3-hour and fresh LBCs improved the UK index (a basket of scores measuring forecast skill) by 1.5% (Bruce Macpherson, pers comm). Furthermore, by having more frequent analyses the analysis increments will be smaller, which will improve the validity of the linear approximations in DA schemes. More frequent analyses may also improve the affordability of DA methods as the computational load is distributed more evenly in time.
We will see that, because of the delay in receiving some data, to ensure that all the data which are received are also assimilated, the assimilation windows will need to overlap. We obtain the optimal solution to this problem, which involves manipulating simultaneously all the states in the window and their joint errors. We explore the relation between the optimal solution and simplified methods which are closer to current methods for large-scale DA.

* Corresponding author. e-mail: tim.payne@metoffice.gov.uk
We show that the current use of largely climatological prior error covariances may pose a challenge for high frequency cycling, and discuss how this may be overcome.

'Traditional' vs. 'Rapid Update' cycling
An immediate issue is that observations are not received instantaneously. For example, by 09Z on 18 June 2015 the Met Office had received over 80 million observations valid between 09Z on 15 June and 03Z on 18 June 2015, including around 0.8 million surface, 2.1 million aircraft and sonde, 12.4 million satwind and 14 million ATOVS observations. The delay between validity time and receipt for these observation types is recorded in Fig. 1. We see that to receive 95% of aircraft and sonde, surface, ATOVS and satwind observations took respectively 0.6, 1.5, 3.4 and 4.1 hours.
This presents a quandary for traditional cycling which aims to produce an analysis every 6 (at some centres every 12) hours. For definiteness consider 4D-Var (e.g. Li and Navon, 2001) with a 6-hour window [T-3,T+3], which generates an analysis at T-3 (in this discussion the units are hours).
One could perform the analysis at T+3 using all observations available by T+3, which would minimise the time delay to produce the analysis, but observations received after T+3 would not be assimilated. Alternatively, one could perform the analysis at T+7, by which time (in view of Fig. 1) almost all the observations valid in the window have been received, but the analysis is only available 4 hours after the end of the window and 10 hours after its beginning. To generate an estimate of state at T+7 we could run a 10-hour forecast from the analysis, but compared with the estimate of state at T-3 this will be degraded by model error.
Centres such as the Met Office mitigate these issues by performing each analysis twice, a 'late cut-off' analysis currently at about T+6, and an 'early cut-off' analysis at about T+3. This is illustrated somewhat schematically in Fig. 2, which shows two adjacent non-overlapping windows [3Z,9Z) and [9Z,15Z) (where $[T_1, T_2)$ denotes the time interval $T_1 \le t < T_2$). For example, at 8Z we receive observations valid between 4Z and 8Z. Considering the window [3Z,9Z), for the 'early cut-off' run we perform the data assimilation at 9Z and use the observations in the dark blue region, and for the 'late cut-off' run perform it at about 12Z and use virtually all observations ever valid in the window (combined light and dark blue region).
In principle the late cut-off analysis could make use of the early cut-off one to reduce its work load, as is done in the 'quasi-continuous' approach of Järvinen et al. (1996) and Veerse and Thépaut (1998), but at the Met Office the late cut-off analyses start again from scratch, making no use of the work done for the early cut-off analysis.
Having both early and late cut-off analyses goes some way to mitigating the shortcomings of 6-hourly cycling. However, the analyses are still 6 hours apart, which makes them insufficiently timely for some purposes, notably (considering the comments in Section 1) the LBCs for hourly LAM analyses; the analysis increment is much larger than would be the case with an hourly update, so nonlinearity can be a significant problem, especially for the linear model in 4D-Var; and the approach is inefficient insofar as the early cut-off analyses are not used as part of a cycle. In Fig. 3 we illustrate how we would like to deal with the same case: each hour we assimilate all observations received in the last hour, e.g. at 12Z we assimilate the observations received between 11Z and 12Z (green region); these are valid between 7Z and 12Z. In principle we do not re-assimilate the observations valid between 7Z and 12Z received at earlier times (blue, red, yellow and cyan regions in Fig. 3) as the information from these observations has been transferred to previous analyses and thereby to the background for this cycle.
In the context of global NWP, which provides among other outputs LBCs for hourly cycling of LAMs, an hourly update cycle is natural, but in principle the updates could be as short as one model time step. For the purposes of this paper we will refer to any cycling where observations are assimilated as soon as they are received, or as in Fig. 3 within some short time of receipt, as rapid update cycling (RUC).
We should note that the term 'Rapid Update Cycle' has been employed in the past to denote specific rapidly cycled NWP systems, for example by the National Centers for Environmental Prediction in the USA. In that case it referred to an operational regional forecast-analysis system over North America, where data was assimilated by 3D-Var (originally optimal interpolation) using non-overlapping windows of length one hour (Benjamin et al., 2004b; Benjamin et al., 2004a).

Optimal RUC
To examine the rapid update cycling problem further we will idealise it slightly by supposing that observations are valid at exact multiples of a time increment $\delta t$ (as opposed to continuously in time), and become available after delays of $0, \delta t, 2\delta t, \ldots, N\delta t$. We will suppose that observations received at time $k\delta t$ are
$$y^{(k)}_k,\; y^{(k)}_{k-1},\; \ldots,\; y^{(k)}_{k-N}. \qquad (1)$$
Superscripts denote when the observations are received and subscripts their validity time, the longest delay being $N\delta t$.
The first problem is to develop an optimal method for assimilating observations as soon as they are available. At time $k\delta t$ we seek to estimate $x_k, \ldots, x_{k-N}$, given observations (1) and our previous estimate of $x_{k-1}, \ldots, x_{k-N-1}$.
We will suppose that for $i = k, k-1, \ldots, k-N$ we have observation operators $h^{(k)}_i$ such that
$$y^{(k)}_i = h^{(k)}_i(x_i) + \nu^{(k)}_i, \qquad (2)$$
and for each $i$ a model $f_i$ with
$$x_{i+1} = f_i(x_i) + \omega_i, \qquad (3)$$
where the distributions of the errors $\nu^{(k)}_i$, $\omega_i$ are supposed known.

Notation and problem formulation
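The optimal method is obtained by formulating the problem in such a way that standard estimation theory can be applied. We will use the convention that the underlined vector $\underline{x}_k$ denotes the concatenation of the $N+1$ vectors $x_{k-N}, \ldots, x_k$,
$$\underline{x}_k = \begin{pmatrix} x_{k-N} \\ \vdots \\ x_k \end{pmatrix},$$
and similarly the underlined matrix $\underline{A}_k$ is formed from matrices $A_{i,j}$, $i, j \in \{k-N, \ldots, k\}$. If the $x_i$ are vectors of size $n \times 1$ and the $A_{i,j}$ are matrices of size $n \times n$, then $\underline{x}_k$, $\underline{A}_k$ are of size respectively $n(N+1) \times 1$ and $n(N+1) \times n(N+1)$. Define $\underline{y}_k$ to be the observations received at time $k\delta t$, so
$$\underline{y}_k = \begin{pmatrix} y^{(k)}_{k-N} \\ \vdots \\ y^{(k)}_k \end{pmatrix}.$$
We seek the conditional expectations of $\underline{x}_k$ and $\underline{x}_{k+1}$ given observations received up to time $k\delta t$,
$$E[\underline{x}_k \mid \underline{y}_0, \underline{y}_1, \ldots, \underline{y}_k], \qquad E[\underline{x}_{k+1} \mid \underline{y}_0, \underline{y}_1, \ldots, \underline{y}_k]. \qquad (6)$$
Note that, given $x_k$, and assuming $\omega_k$ has zero mean, the best estimate of $x_{k+1}$ before observations received at $k\delta t$ are assimilated is $f_k(x_k)$. If we now define
$$\underline{h}_k(\underline{x}_k) = \begin{pmatrix} h^{(k)}_{k-N}(x_{k-N}) \\ \vdots \\ h^{(k)}_k(x_k) \end{pmatrix}, \quad \underline{\nu}_k = \begin{pmatrix} \nu^{(k)}_{k-N} \\ \vdots \\ \nu^{(k)}_k \end{pmatrix}, \quad \underline{f}_k(\underline{x}_k) = \begin{pmatrix} x_{k-N+1} \\ \vdots \\ x_k \\ f_k(x_k) \end{pmatrix}, \quad \underline{\omega}_k = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ \omega_k \end{pmatrix}, \qquad (8)$$
then we may write (2) and (3) as respectively
$$\underline{y}_k = \underline{h}_k(\underline{x}_k) + \underline{\nu}_k, \qquad (9)$$
$$\underline{x}_{k+1} = \underline{f}_k(\underline{x}_k) + \underline{\omega}_k, \qquad (10)$$
which are in the standard form for observation and signal map equations in estimation theory (e.g. Jazwinski, 1970).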
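The whole-window formulation can be sketched in a few lines of NumPy. This is a minimal illustration only: the helper names `augmented_forecast` and `augmented_obs`, and the toy model and observation operators, are hypothetical and are not taken from the paper. The sketch assumes the block model simply shifts the window forward by one slot and applies the model to the newest state, and that each block of the observation map acts on a single state in the window.

```python
import numpy as np

def augmented_forecast(x_window, f):
    """Noise-free part of the block model.

    x_window holds the N+1 states [x_{k-N}, ..., x_k] as rows.
    The oldest state is dropped and f(x_k) is appended, giving
    the window [x_{k-N+1}, ..., x_k, f(x_k)] valid at k+1.
    """
    newest = f(x_window[-1])
    return np.vstack([x_window[1:], newest])

def augmented_obs(x_window, h_list):
    """Noise-free part of the block observation map.

    h_list[i] acts only on x_window[i], so each block of the
    observation vector depends on a single validity time.
    """
    return [h(x) for h, x in zip(h_list, x_window)]

# Toy set-up: every value below is an illustrative assumption.
N, n = 2, 3
f = lambda x: 0.9 * x                      # hypothetical linear model
h_list = [lambda x: x[:1]] * (N + 1)       # observe first component at each time
x_window = np.arange((N + 1) * n, dtype=float).reshape(N + 1, n)

x_next = augmented_forecast(x_window, f)
y_pred = augmented_obs(x_window, h_list)
```

Storing the window as rows of a 2-D array keeps the shift operation explicit; flattening the array row by row gives the concatenated vector of size n(N+1).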

Linear Gaussian case
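It is illuminating to first work out the details in the simplest case, where in (2) and (3) the observation operators $h^{(k)}_i$ and model $f_i$ are linear and the errors $\nu^{(k)}_i$, $\omega_i$ are zero-mean, Gaussian and uncorrelated.
In this case
$$\nu^{(k)}_i \sim N(0, R^{(k)}_i), \qquad \omega_i \sim N(0, Q_i)$$
(where $N(\mu, \Sigma)$ denotes normally distributed with mean $\mu$ and variance $\Sigma$), and setting
$$\underline{H}_k = \mathrm{diag}\big(H^{(k)}_{k-N}, \ldots, H^{(k)}_k\big), \qquad \underline{R}_k = \mathrm{diag}\big(R^{(k)}_{k-N}, \ldots, R^{(k)}_k\big),$$
$$\underline{M}_k = \begin{pmatrix} 0 & I & & \\ & \ddots & \ddots & \\ & & 0 & I \\ & & & M_k \end{pmatrix} \qquad (12)$$
and
$$\underline{Q}_k = \mathrm{diag}(0, \ldots, 0, Q_k), \qquad (13)$$
(9,10) become
$$\underline{y}_k = \underline{H}_k \underline{x}_k + \underline{\nu}_k, \qquad \underline{x}_{k+1} = \underline{M}_k \underline{x}_k + \underline{\omega}_k,$$
with $\underline{\nu}_k \sim N(0, \underline{R}_k)$ and $\underline{\omega}_k \sim N(0, \underline{Q}_k)$, and the problem of finding the conditional expectations (6) of $\underline{x}_k$ and $\underline{x}_{k+1}$ given observations received up to time $k$, which may be denoted $\hat{\underline{x}}_{k|k}$, $\hat{\underline{x}}_{k+1|k}$, is solved by a standard Kalman Filter, as in Table 1. The basic objects manipulated are whole windows of states and observations and the covariances of the errors in these objects. The symbols in Table 1 are whole-window analogues of their usual values, e.g. $\underline{P}_{k|k-1}$ and $\underline{P}_{k|k}$ are $n(N+1) \times n(N+1)$ prior and posterior error covariance matrices of the estimated $\underline{x}_k$. In special circumstances simplification is possible. For example, if new observations only occur in the final time slot, i.e. if the only observations which become available at $k$ are $y^{(k)}_k$ and there are no observations $y^{(k)}_i$ for $i < k$, [...]. The above assumes that the errors $\nu^{(k)}_i$, $\omega_i$ are zero-mean, Gaussian and uncorrelated. As noted in Anderson and Moore (1979), if we restrict attention to analysis-prediction equations of form (16) and (18) then we may drop the Gaussian assumption on $\nu^{(k)}_i$, $\omega_i$, merely requiring them to be zero-mean and uncorrelated, and (15)-(19) still minimises the expected error variance.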
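One cycle of a standard Kalman filter acting on whole windows can be sketched as follows. This is an illustration under stated assumptions rather than the scheme of Table 1 verbatim: the function names are hypothetical, the block model is assumed to shift the window by one slot and apply the model to the newest state, and model error is assumed to enter only in the newest slot. The toy example uses a scalar state (n = 1), N = 2 and every state in the window observed directly.

```python
import numpy as np

def block_shift_model(M_k, N):
    """Whole-window model: shift states up one slot, apply M_k to the newest."""
    n = M_k.shape[0]
    big = np.zeros((n * (N + 1), n * (N + 1)))
    for i in range(N):
        big[i * n:(i + 1) * n, (i + 1) * n:(i + 2) * n] = np.eye(n)
    big[N * n:, N * n:] = M_k
    return big

def block_model_error(Q_k, N):
    """Whole-window model error covariance: non-zero only in the newest slot."""
    n = Q_k.shape[0]
    big = np.zeros((n * (N + 1), n * (N + 1)))
    big[N * n:, N * n:] = Q_k
    return big

def kalman_cycle(x_prior, P_prior, y, H, R, M, Q):
    """Analysis step followed by forecast step on whole-window quantities."""
    S = H @ P_prior @ H.T + R                      # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_post = x_prior + K @ (y - H @ x_prior)       # analysis
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior
    x_next = M @ x_post                            # prior for the next window
    P_next = M @ P_post @ M.T + Q
    return x_post, P_post, x_next, P_next

# Toy example: all numbers are illustrative assumptions.
N = 2
M = block_shift_model(np.array([[1.0]]), N)
Q = block_model_error(np.array([[0.1]]), N)
H = np.eye(N + 1)
R = np.eye(N + 1)
x_post, P_post, x_next, P_next = kalman_cycle(
    np.zeros(N + 1), np.eye(N + 1), np.ones(N + 1), H, R, M, Q)
```

Note that the forecast covariance `P_next` acquires off-diagonal blocks even when `P_post` is diagonal, because the two newest states of the next window are both derived from the same analysed state.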

Variational equivalent of analysis step in linear Gaussian case
For large-scale data assimilation variational methods are almost universally used, so it is of interest to cast the optimal RUC analysis step (16) in variational form. Define the cost-function terms $J_b$, $J_q$ and $J_o$ as in (20)-(22), where $\hat{A}$ is the bottom right $Nn \times Nn$ submatrix of $\underline{P}_{k-1|k-1}$. Then the analysis step (16) is equivalent to the incremental update (23), where $\delta$ minimises the cost function (24). This is proved in Appendix A. The $J_b$ term (20) constrains the $N$ states in the intersection between the old and new windows by the inverse of the 'background error covariance matrix' $\hat{A}$. This 'big B' of size $Nn \times Nn$ is formed by taking the analysis error covariance from the previous stage and shearing off the oldest row and column.

General (nonlinear, non-Gaussian) case
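We saw in Section 3.1 how the problem of assimilating data immediately it becomes available can be cast into a standard signal model/observation model form (9) and (10), to which we can then apply well-established theory, e.g. Jazwinski (1970). In particular, given (9) and (10) we can compute (sequentially in $k$) the conditional pdfs $p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_k)$ and $p(\underline{x}_{k+1} \mid \underline{y}_0, \ldots, \underline{y}_k)$. The novel feature for us is that $\underline{f}_k$ and $\underline{h}_k$ in (9) and (10) have a special structure which leads to significant simplifications, in particular enabling us to express these pdfs in terms of the original (as opposed to block) variables.
In general, given the prior pdf $p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_{k-1})$ and the conditional pdf of the observations given the state $p(\underline{y}_k \mid \underline{x}_k)$, Bayes' theorem tells us that the posterior pdf $p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_k)$ is
$$p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_k) = \frac{1}{\mathcal{N}}\, p(\underline{y}_k \mid \underline{x}_k)\, p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_{k-1}), \qquad (25)$$
where the normalisation $\mathcal{N}$ is
$$\mathcal{N} = \int_U p(\underline{y}_k \mid \underline{x}_k)\, p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_{k-1})\, d\underline{x}_k$$
and the domain of integration $U = \mathbb{R}^{(N+1)n}$. We will suppose the basic process satisfies the Markov property
$$p(x_k \mid x_{k-1}, x_{k-2}, \ldots, x_0) = p(x_k \mid x_{k-1}).$$
Given $p(\underline{x}_{k-1} \mid \underline{y}_0, \ldots, \underline{y}_{k-1})$ (the posterior pdf at $k-1$) and $p(\underline{x}_k \mid \underline{x}_{k-1})$, the Chapman-Kolmogorov equation then gives us for the prior pdf
$$p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_{k-1}) = \int_U p(\underline{x}_k \mid \underline{x}'_{k-1})\, p(\underline{x}'_{k-1} \mid \underline{y}_0, \ldots, \underline{y}_{k-1})\, d\underline{x}'_{k-1}. \qquad (27)$$
We could cycle (27) and (25) to obtain the posterior pdf for every $k$. However, as mentioned there are simplifications arising in this case. By virtue of the fact that in (8) the $i$th sub-vector of $\underline{h}_k$ depends only on $x_{k-N+i-1}$, $p(\underline{y}_k \mid \underline{x}_k)$, the conditional pdf of the observations given the state, factors into
$$p(\underline{y}_k \mid \underline{x}_k) = \prod_{i=k-N}^{k} p\big(y^{(k)}_i \mid x_i\big). \qquad (28)$$
Additionally, the transition pdf $p(\underline{x}_k \mid \underline{x}'_{k-1})$ may be written
$$p(\underline{x}_k \mid \underline{x}'_{k-1}) = p(x_k \mid x'_{k-1}) \prod_{i=k-N}^{k-1} \delta(x_i - x'_i). \qquad (29)$$
Combined with (27) this gives
$$p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_{k-1}) = p(x_k \mid x_{k-1}) \int_{\mathbb{R}^n} p(\underline{x}_{k-1} \mid \underline{y}_0, \ldots, \underline{y}_{k-1})\, dx_{k-N-1}. \qquad (30)$$
We may cycle (30) and (25), (28) to obtain the posterior pdf for every $k$. For example, if the observation operators $h^{(k)}_i$ and model $f_k$ are linear, and the errors $\nu^{(k)}_i$, $\omega_i$ are Gaussian, then one may check that (30), (25) and (28) imply that the posterior pdf
$$p(\underline{x}_k \mid \underline{y}_0, \ldots, \underline{y}_k) \propto \exp\big[-(J_b + J_q + J_o)\big],$$
where $J_b$, $J_q$, $J_o$ are given by (20)-(22).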
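The alternation of a Bayes update with a Chapman-Kolmogorov prediction can be illustrated numerically on a grid. The sketch below is deliberately simplified and is not the windowed scheme itself: it assumes a single scalar state with no delayed observations (effectively N = 0), Gaussian densities, and a hypothetical linear model; all function names and numerical values are illustrative.

```python
import numpy as np

# Discretised scalar state space (grid limits and resolution are assumptions).
grid = np.linspace(-10.0, 10.0, 401)
dx = grid[1] - grid[0]

def gaussian(z, mean, var):
    """Gaussian density evaluated at z."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def bayes_update(prior, y, obs_var):
    """Posterior on the grid: likelihood times prior, then normalise."""
    likelihood = gaussian(y, grid, obs_var)     # p(y | x) for each grid point x
    unnormalised = likelihood * prior
    return unnormalised / (unnormalised.sum() * dx)

def chapman_kolmogorov(posterior, f, model_var):
    """Prior at the next time: integrate the transition pdf against the posterior."""
    # transition[i, j] = p(x_next = grid[i] | x = grid[j])
    transition = gaussian(grid[:, None], f(grid)[None, :], model_var)
    return transition @ posterior * dx

prior = gaussian(grid, 0.0, 4.0)
posterior = bayes_update(prior, y=1.0, obs_var=1.0)
next_prior = chapman_kolmogorov(posterior, lambda x: 0.9 * x, model_var=0.25)

posterior_mean = (grid * posterior).sum() * dx
```

For these Gaussian inputs the grid posterior mean can be checked against the closed-form value 4/(4+1) = 0.8, which is a useful sanity test before moving to non-Gaussian densities.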

Example
We illustrate some of the foregoing with a small example, in which the model is the 40-dimensional chaotic model proposed by Lorenz (1996):
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F, \qquad i = 1, \ldots, 40, \qquad (31)$$
with cyclic indices and forcing $F$. This system is integrated using fourth order Runge-Kutta with a time step of 0.05/6, during which time errors grow at a rate corresponding to order one hour in an atmospheric system. We therefore refer to time step $k$ as time $k$. The truth is obtained by integrating (31) and to each component of $x$ adding Gaussian model error every time step with variance $\sigma_q^2$. We suppose that at time $k$ eight observations have just become available, at points [...]. In this example we suppose there are no 'instantaneously available' observations (i.e. available at time $k$ and also valid at $k$). Each observation has Gaussian error with variance $\sigma_o^2$ where $\sigma_o = 0.546$. Every grid point is observed every 5 time steps and the observation network repeats itself exactly every 10 time steps. The system is well-observed and for the values of $\sigma_o$, $\sigma_q$ used here the departure from linearity is small enough that the foregoing linear theory can be well applied to the linearised model.
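A truth run of this kind can be generated with a short NumPy script. The sketch below is a hedged illustration: the forcing value 8.0 is the commonly used choice for this model and is an assumption here, as are the initial condition, the random seed and the function names. The time step 0.05/6 and the additive Gaussian model error at every step follow the description above, with the model error standard deviation set to 0.182 as one of the values used in the experiments.

```python
import numpy as np

def lorenz96_tendency(x, forcing):
    """dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F with cyclic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(x, dt, forcing):
    """One fourth-order Runge-Kutta step."""
    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_tendency(x + dt * k3, forcing)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def run_truth(x0, n_steps, dt, forcing, sigma_q, rng):
    """Integrate the model, adding Gaussian model error after every step."""
    x = x0.copy()
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = rk4_step(x, dt, forcing) + sigma_q * rng.standard_normal(x.size)
        trajectory.append(x.copy())
    return np.array(trajectory)

FORCING = 8.0          # assumption: conventional value, not stated above
DT = 0.05 / 6.0        # time step quoted in the text
rng = np.random.default_rng(0)
x0 = FORCING + 0.01 * rng.standard_normal(40)   # hypothetical initial state
truth = run_truth(x0, n_steps=100, dt=DT, forcing=FORCING, sigma_q=0.182, rng=rng)
```

Using `np.roll` for the neighbouring indices handles the cyclic boundary without any special cases at the ends of the state vector.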

Non-overlapping windows with different lags
Consider first traditional non-overlapping window strategies. Suppose we have a 6-hour cycle with 6-hour windows. For the window $[k-6, k)$ we wish to assimilate observations valid in $k-6 \le t < k$. For non-overlapping windows we use an optimal smoother, i.e. 4D-Var (Li and Navon, 2001) with model error correctly accounted for and correct cycling of background error covariances. For the window $[k-6, k)$ this produces analyses $\{x^a_{k-6}, \ldots, x^a_{k-1}\}$. As discussed in Section 2, the number of observations which are valid in the window and available for the analysis increases the longer the interval of time between the end of the window and when the analysis is performed, which we term the lag. For the present example the number of observations available at each time in the window if the analysis is performed at $k$, $k+2$, $k+4$ is shown in Table 2.

Fig. 4. RMS forecast error from most recent available analysis at times 0-5 hours into the next window following an assimilation window $[k-6, k)$, for lags 0 (black), 2 (blue) and 4 (green). Dashed lines denote RMS error at times before the analysis is performed. Also shown in red is the RMS error in the optimal RUC analysis using the data available at each time. In this example $\sigma_q = 0.182$.
Taking for example the lag=2 case, at times $k$ and $k+1$ the most recently available analyses are those in the window $[k-12, k-6)$ with last analysis at $k-7$, while at $k+2, \ldots, k+5$ the most recently available analysis is at $k-1$. Table 3 shows the validity time of the most recent analysis available at times $k, \ldots, k+5$ for lags of 0, 2 and 4 hours. Fig. 4 shows the RMS forecast error (using $\sigma_q = 0.182$ and averaged over 2000 cycles) in forecasts valid at times $t = k, k+1, \ldots, k+5$ taken from the latest available analysis using lag=0 (in black), lag=2 (in blue) and lag=4 (in green). Fig. 4 illustrates a point about non-overlapping windows made in Section 2 above, that we choose between a short lag between observations and when the analysis is performed, giving timely analyses but not using all the observations, or a longer lag using more observations, but which at any given time requires longer forecasts which will be more degraded by model error.
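The bookkeeping behind the most recently available analysis can be written as a small function. This sketch is derived from the description in the text (6-hour non-overlapping windows, with the analysis for a window ending at a given time performed one lag later) rather than copied from Table 3, and the function name is hypothetical.

```python
def latest_analysis_offset(t_offset, lag, window=6):
    """Validity time, relative to k, of the newest analysis available at k + t_offset.

    Windows end at k + m * window (m an integer) and the analysis for such a
    window is performed at k + m * window + lag. Its last analysed state is
    valid one step before the end of the window.
    """
    # Largest m whose analysis has already been performed at k + t_offset.
    m = (t_offset - lag) // window      # floor division also handles negatives
    return m * window - 1

for lag in (0, 2, 4):
    row = [latest_analysis_offset(t, lag) for t in range(6)]
    print(f"lag={lag}: offsets from k = {row}")
```

For lag=2 this gives an offset of -7 at times k and k+1 and -1 from k+2 onwards, in agreement with the case worked through above.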
For the lag=2 and lag=4 cases we also show (dashed lines) the RMS forecast error for times between the end of the analysis window and the time the analysis is performed.

Optimal RUC
Since in our example the longest delay in receipt of observations is 5 h, for the optimal RUC method of Section 3 we have $N = 5$. At time $j$ this produces analyses $\{x^a_{j-5}, \ldots, x^a_j\}$.

Table 3. Validity time of the most recent analysis available at times $k, \ldots, k+5$ for lags of 0, 2 and 4 hours.

A comparison of the observation usage of non-overlapping windows and optimal RUC was illustrated in Figs. 2 and 3. In optimal RUC all observations are used, as soon as they are received.
The red curve in Fig. 4 shows the RMS error in the RUC analysis $x^a_j$ for $j = k, \ldots, k+5$. From the foregoing we know this will always be less than the RMS error at $j$ using any available analysis using a non-overlapping window with any lag. Note however that this error can be greater than that from the lagged analyses run at a later time, e.g. in this example for the lag-2 and lag-4 analyses at time $k$. This is because the lagged analyses are using observations not available at time $k$.

Suboptimal methods for RUC
In Section 3 above we derived the optimal solution to the problem of assimilating data as soon as it becomes available, and saw that if the maximum delay is $N$ and the state is described by $n$ variables then this involves manipulating vectors of size $n(N+1)$ and their error covariances of size $n(N+1) \times n(N+1)$. If observation and model errors are uncorrelated in time then optimal data assimilation methods for non-overlapping windows only involve vectors and matrices of size $n$ and $n \times n$.
For large-scale systems manipulating vectors of size n(N +1) and more particularly matrices of size n(N + 1) × n(N + 1) may not be manageable. Furthermore, NWP centres already have methods implemented for non-overlapping windows (we will refer to these as 'traditional methods') and will naturally seek ways of adapting these to the RUC problem. Hence a topic of practical importance is the relationship between the optimal solution to RUC and traditional methods applied to RUC.
For simplicity we restrict attention to the case of linear forecast and observation operators where, as in Section 3.2, the errors are Gaussian and uncorrelated. We will designate the optimal solution for RUC in this case (i.e. Table 1) as Method 0. We will develop suboptimal methods for RUC based on traditional methods for non-overlapping windows, with Method 3 a 'naive' application of such a method to RUC, and Methods 2 and 1 adaptations of this which are progressively closer to the optimal solution. We will then examine the relation between the four methods.

'Traditional' methods as suboptimal methods for RUC
Suppose that at time $k$ we have prior estimates $x^b_{k-N}, \ldots, x^b_k$ and we have just received observations $y^{(k)}_{k-N}, \ldots, y^{(k)}_k$. A natural extension of the 4D-Var method (e.g. Li and Navon, 2001) as used for non-overlapping windows is to form analyses at $j = k-N, \ldots, k$ [...], where $B$ in (34) is the error covariance of $x^b_{k-N}$, if this is known, or some approximation otherwise; we return to this point below. All our suboptimal Methods 1-3 use (33) and (34) [...]

Method 2: Slightly better is to save and use the second analysed state, i.e. at time $k-N+1$, which will be at the beginning of the next window, giving us priors [...]

Method 1: Finally, we could follow the optimal solution and save and use all the analysed states, giving us priors [...]

The number of prior states which are simply analysis states from the previous cycle is therefore 0, 1 and $N$ for Methods 3, 2 and 1, respectively. The four RUC methods (with the optimal one of Section 3.2 labelled 'Method 0') are summarised in Table 4. The formation of backgrounds is shown in more detail for $N = 2$ in Table 5.

Table 5. The background states for the analysis at $t = 3$ given the analysed states available from $t = 2$, illustrated with $N = 2$ for Methods 0, 1, 2, 3. After the analysis at $t = 2$ we have analyses $x^a_0$, $x^a_1$ and $x^a_2$. The backgrounds $x^b_1$, $x^b_2$ and $x^b_3$ at $t = 3$ are formed from these analyses as shown in the table.

Covariance of analysis and background errors in methods 1-3
To compare the various methods we will need the covariance of their errors. We may write (33) and (34) as [...] (38), where [...] and where we use the notation $(U_k)_{\underline{i},\underline{j}}$ to denote the $i, j$ $n \times n$ submatrix of $U_k$. Denoting the truth by $x^t_k$ and [...], it follows that for all Methods 1-3 the analysis error covariance $A_k$ is, from (38), [...]. We note this depends both on $B_k$ and (via $K_k$ and $B^v_k$) the prescribed $B$.
The background error covariance $B_{k+1}$ depends on which method is used. For Method $\ell$ we may write [...] (12) and $M^{(2)}$ [...] Table 6. Since the error in [...]
In order to cycle Methods 1-3 we need to specify $B$ in (34). For the rest of this section, for the purposes of comparing the four methods, we will suppose that in Methods 1-3 we use $B = (B_k)_{\underline{1},\underline{1}}$, which can be obtained from (45) (as above, underlined subscripts refer to submatrices, so $(B_k)_{\underline{1},\underline{1}}$ is the top left $n \times n$ submatrix of $B_k$). It can be shown that for Methods 1 and 2 $(B_k)_{\underline{1},\underline{1}} = (A_{k-1})_{\underline{2},\underline{2}}$.

Relation between methods 0 and 1
An important comparison is between optimal Method 0 and suboptimal Method 1. They share the same background step (43) with the same $M_k$. In both cases the analysis step may be written in the form (38), though whereas in Method 0 the gain uses true $B_k$, i.e. the covariance of the error in $x^b_k$, for Method 1 the gain $K_k$ is (40). Because of the similarity in the structures of Methods 0 and 1 there is a simple and strong relation between their errors. If the sequences of background error covariances using Methods 0 and 1 are designated respectively $B_k$ and $\tilde{B}_k$, and we start from the same prior error covariance $\tilde{B}_N = B_N$, then [...]. This is proved in Appendix B.

Relation between methods in limit Q k → ∞
We have four RUC methods, the optimal and three suboptimal ones. We can cycle each as described above, for the suboptimal methods using B = (B k ) 1,1 (Section 5.2). A limiting case which exhibits some of the differences between them, in particular how information is saved from previous cycles, is obtained by letting model error covariance Q k → ∞ for all k.
For simplicity suppose $H = I$ and $R_k$ and $Q_k$ are independent of $k$. If $Q = \infty$ then after $N$ cycles for Method 0, one cycle for Methods 1 and 2, and immediately for Method 3, all knowledge of the initial background state $x^b_0$ and its error covariance is lost. In Table 7 we show the analysed state $x^a_{j+N}$ produced by the four methods for any $j \ge N$ if model error is infinite, and the corresponding background error and analysis error covariances.
The optimal Method 0 retains all the observation information ever received; at time j + N the estimate of state at any time between j and j + N is simply the average of all the observations ever received valid at that time. At the other extreme, Method 3 'forgets' all the observation information from previous cycles: at time j + N the estimate of state at any time between j and j + N is just the value of the observation valid at that time and received at time j + N . Methods 1 and 2 retain observation information from the previous cycle only at the initial time. These different behaviours are reflected in the analysis error covariances shown in Table 7. Comparing Methods 1 and 2, while in the limit Q → ∞ Method 1 analyses are no better than those of Method 2, we note in Table 7 that it has better backgrounds than Method 2.

Relation between methods in the limit Q k → 0
Suppose that $Q_k = 0$ for all $k$, and we are given an estimate $x^b_0$ of $x_0$ with error covariance $B_0$. Estimate the remaining states in the window $\{0, \ldots, N\}$ by [...]. If $Q_k = 0$ for all $k$ then Methods 1-3 simplify to their 'strong constraint' forms, in which (34) simplifies to [...]. Crucially, in the limit $Q \to 0$ Methods 0-3 coincide, so in the absence of model error the nominally 'suboptimal' Methods 1-3 are in fact optimal. This is proved in Appendix C.

Comparison of methods 0-3 for the example of Section 4
We may apply linearised versions of Methods 0-3 to our nonlinear chaotic example of Section 4. In an attempt to mitigate the effects of linearisation error one can formulate outer loop-style iterations for these strategies, which may be worth implementing in more non-linear systems (for the examples here they made negligible difference). Alternatively one could use the 'best linear approximation' (Payne, 2013). In our example of Section 4, at time $k$ observations have just become available which are valid at $k-1, \ldots, k-5$ so $N = 5$. The optimal Method 0 and suboptimal Methods 1-3 all provide analyses $x^a_k, x^a_{k-1}, \ldots, x^a_{k-5}$. (Since in our example no observations are instantaneously available, $x^a_k$ is here a forecast from $x^a_{k-1}$.) Each strategy is cycled 10,000 times and the first hundred cycles disregarded. Fig. 5 shows the RMS error in the analyses at $k-5, \ldots, k$, for the optimal Method 0 (black), and suboptimal Methods 1 (blue), 2 (green) and 3 (red), for $\sigma_q = 1.82$ and 0.455 (upper and lower set respectively).
As expected from the foregoing, the errors $E$ are ordered [...]. Furthermore, the analyses and therefore their errors converge as $\sigma_q \to 0$.

Impact of climatological background error covariances
A significant difference between the methods of the preceding sections and those used for large scale systems is that in the latter the prior error covariance (usually denoted B) is not cycled, but is either constant or is a convex combination of a constant and an estimate of cycled B (see (47) below). It is important to note that insofar as B is fixed it is advantageous to assimilate data as far as possible simultaneously in larger units rather than split it up into smaller volumes and assimilate it in smaller units. Intuitively, by assimilating many observations simultaneously the deficiencies of the fixed B are reduced.
Illustrating this point is complicated by the fact that increasing observation batch size tends to involve making other changes which themselves have an impact. If we compare cycling using non-overlapping windows with RUC then at no instant in time are the two methodologies assimilating the same observations (see Section 2). If we use 4D-Var to compare cycling with windows of length 1 (assimilating one observation every time step) with windows of length 2 (two observations every two time steps) then the latter has the advantage of covariances evolved through the window, which is a different point to the one being made.
If we compare assimilating two observations simultaneously every time step with assimilating one after the other, in the latter case we have to decide what B to use for the second observation. In Appendix D we show that if in a scalar system we assimilate two observations every cycle, and have a choice between assimilating them (a) simultaneously using a fixed background error covariance B, or (b) separately, the first with fixed background error covariance B 1 and the second with fixed background error covariance B 2 , then it is possible to choose B so that no matter how well B 1 and B 2 are chosen strategy (a) will always outperform strategy (b). We may contrast this result concerning the use of fixed background covariances with the fact that, if B is chosen optimally every cycle, and B 1 and B 2 are chosen optimally every cycle, the two strategies will produce identical (and optimal) results.
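The contrast between optimally chosen and fixed background variances can be seen in a scalar toy calculation. The sketch below is only an illustration of a single analysis, not the cycled result of Appendix D: it assumes the state is observed directly, and the function name and all numerical values are hypothetical. It shows that sequential assimilation reproduces the simultaneous analysis when the second background variance is the analysis variance from the first step, and that the two differ when the same fixed variance is reused.

```python
def assimilate(xb, b, observations):
    """Scalar analysis for background xb with variance b.

    observations is a list of (value, error_variance) pairs, each observing
    the state directly. Returns the analysis and its nominal error variance.
    """
    precision = 1.0 / b + sum(1.0 / r for _, r in observations)
    xa = (xb / b + sum(y / r for y, r in observations)) / precision
    return xa, 1.0 / precision

# Illustrative numbers only.
xb, b = 0.0, 2.0
obs1 = (1.0, 1.0)    # (y1, r)
obs2 = (3.0, 0.5)    # (y2, s)

# (a) both observations at once
xa_sim, a_sim = assimilate(xb, b, [obs1, obs2])

# (b) one after the other, second step using the updated variance
xa1, a1 = assimilate(xb, b, [obs1])
xa_seq_optimal, a_seq = assimilate(xa1, a1, [obs2])

# (b') one after the other, second step reusing the same fixed variance b
xa_seq_fixed, _ = assimilate(xa1, b, [obs2])

print(xa_sim, xa_seq_optimal, xa_seq_fixed)
```

A single realisation like this cannot show which fixed-variance strategy has the lower mean square error; it only demonstrates that the equivalence between batch sizes relies on the covariance being updated between the two steps.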
This means that if B is fixed then, in this respect, RUC is at a disadvantage compared with conventional cycling as now there are more cycles with fewer observations used every cycle. In practice this effect may be dwarfed by the advantages of RUC as discussed in Section 2. If not, the obvious remedy is to improve the cycling of the background error covariances.
As noted above, the effect is due to using a fixed B, and is removed if B is cycled properly. This is unattainable in current large-scale NWP, but centres such as the Met Office already employ a 'hybrid' B (Clayton et al., 2013)
$$B = \beta_c^2 B_c + \beta_e^2 B_e, \qquad (47)$$
where $B_c$ is fixed but $B_e$ is an estimate (from an ensemble) of the true prior error covariance, with $\beta_c^2 + \beta_e^2 = 1$. For best performance $\beta_e \to 1$ as ensemble size increases, with the fixed part having no weight in the limit of an infinitely large ensemble.
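The convex combination of a fixed and an ensemble-derived covariance can be sketched as below. This is a bare illustration of the weighting only: operational hybrid schemes also localise the ensemble covariance, which is omitted here, and the function name, ensemble size and random test data are assumptions.

```python
import numpy as np

def hybrid_covariance(B_c, ensemble, beta_e):
    """Weighted sum of a fixed covariance and a raw ensemble covariance.

    ensemble has shape (members, n). The weights satisfy
    beta_c**2 + beta_e**2 = 1. No localisation is applied.
    """
    beta_c_sq = 1.0 - beta_e ** 2
    B_e = np.cov(ensemble, rowvar=False)      # sample covariance, shape (n, n)
    return beta_c_sq * B_c + beta_e ** 2 * B_e

# Hypothetical inputs for illustration.
rng = np.random.default_rng(1)
n, members = 3, 20
B_c = np.eye(n)
ensemble = rng.standard_normal((members, n))

B_hybrid = hybrid_covariance(B_c, ensemble, beta_e=np.sqrt(0.5))
```

Setting `beta_e` to 0 recovers the fixed covariance and setting it to 1 uses the ensemble estimate alone, which corresponds to the large-ensemble limit described above.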
There are other possible ways of introducing adaptivity into B, such as the 'ensemble-variational integrated localised' method (Auligné et al., 2016) and the so-called variational Kalman filters (Auvinen et al., 2010). In the latter the limited-memory quasi-Newton method is used to build a low-storage approximation to the Hessian of the analysis cost function, which therefore approximates the inverse of the analysis error covariance matrix, and can also be used to evolve the covariance forward to approximate B at the next analysis time.

Concluding remarks
Rapid update cycling (RUC) is the process by which we assimilate observations into a model as soon as they become available, or more practically, within some (short) time interval δt. We have seen that if the greatest delay in receiving observations is N δt then the optimal solution to RUC at time k involves manipulating the (N + 1)n vectors x formed from x k−N , . . . , x k and the moments of the errors in x, such as their (N + 1)n × (N + 1)n error covariances.
Compared with 'traditional' cycling RUC makes more timely use of observations, which is particularly important for the provision of LBCs for LAMs. Another advantage of RUC is that the increments are smaller and hence linearisation error is reduced.
We have purposely concentrated on fundamental topics and avoided such practically important matters as efficiency and cost. The fact that for each analysis observation volumes and increments are smaller, and that we always have a recent analysis available, suggests that it should be possible to reduce the cost per analysis. On contemporary HPCs where the increased power comes through higher numbers of processors rather than increased clock speeds this is more important than the total cost per day.
We may adapt 'traditional' methods designed for non-overlapping windows to RUC. These methods are suboptimal, but in all cases considered in this paper (Methods 1, 2 and 3 in Section 5) they coincide with the optimal solution in the limit where model error vanishes.
Assimilating observations in smaller batches can be disadvantageous if climatological background error variances are used. This potentially poses a challenge for RUC, which could perhaps be met by improved cycling of error covariances.
(ii) By elementary algebra [...]. It follows from (A1)-(A3) and (13) that [...]. At a minimum of (24) we have $\nabla J(\delta) = 0$ so [...], and using this value of $\delta$ in (23) is equivalent to (16).

Appendix B. Proof of theorem of Section 5.3
If the sequences of background error covariances using Methods 0 and 1 are designated, respectively, $B_k$ and $\tilde{B}_k$, and we start from the same prior error covariance $\tilde{B}_N = B_N$, then for [...]
Proof. For simplicity we shall suppose that both $R$ and $B^v_k$ are non-singular and that $H = I$, so denoting Method 1 quantities with tildes we have $K = (B^{-1} + R^{-1})^{-1} R^{-1}$ and $\tilde{K} = ((B^v)^{-1} + R^{-1})^{-1} R^{-1}$. Hence from (17) and (42) we have [...] and therefore [...] (B2). Now consider the matrix [...]. Since $X$ is the sum of two positive semi-definite matrices it is positive semi-definite. Note also that since $R$ is positive definite [...]. The term in square brackets in (B2) is the Schur complement of $(B^v)^{-1} \tilde{B} (B^v)^{-1} + R^{-1}$ in $X$ so by Theorem 4.3 of Gallier (2010) [...]

Appendix C. Proof of equivalence of methods 0-3 if Q = 0
Set $V_k$ to be the first $n$ columns of the $n(N+1) \times n(N+1)$ matrix $U_k$, i.e. [...]
We suppose inductively that for some $k \ge N$ Method 0 and Methods 1-3 have the same $x^b_k$, $B_k$ and that [...]. This is true for $k = N$ by construction. For all methods we therefore have [...]. Using the 'Kalman identity' [...] it follows from (17), (42) that for all methods [...] (C3), and therefore from (16), (38) that in all cases [...] (C5). Recall that the superscript in $M^{(\ell)}$ in (43) [...]. Let $\tilde{x}^a_k$ denote the vector in square brackets in (C5) and $\tilde{A}_k$ the matrix in square brackets in (C3). It follows from (18), (19) and (43), (45) that both for Method 0 and for Methods $\ell = 1, 2, 3$ we have [...]. Therefore the inductive hypothesis holds for $k+1$, and therefore for all $k$.

Appendix D. 4D-Var with a fixed background error covariance: impact of observation batch size
Suppose we have a linear system, observations in some time interval $[0, T]$ and all errors are Gaussian. If we assimilate the observations using an optimal method, such as 4D-Var with correctly cycled prior and posterior error covariances, using $m$ assimilation windows $[0, t_1], [t_1, t_2], \ldots, [t_{m-1}, T]$ then the estimate of state at time $T$ is independent of $m$ and of how we choose $t_1 < t_2 < \cdots < t_{m-1}$. However, if instead of properly cycling the error covariances the background error covariances are fixed, it is often advantageous to assimilate data in larger batches.
We illustrate this by considering a case where $x$ is a scalar quantity, which evolves in time according to $x(i+1) = \mu x(i)$ for some constant $\mu$ (which is supposed known, so there is no model error) with $|\mu| > 1$. At each time $i$ we wish to assimilate an observation $y_1(i)$ with error variance $r_i$ and an observation $y_2(i)$ with error variance $s_i$. To make the problem analytically tractable we will suppose that $\{(r_i, s_i),\ i = 1, \ldots, \infty\}$ are drawn [...]

Table D1. Parameters for Equation (D1).

We compare two assimilation strategies using 4D-Var with a non-cycled background:
Simultaneous (batch size of 2): at each time $i$ we assimilate $y_1(i)$ and $y_2(i)$ simultaneously using fixed background error covariance $b$, i.e. $x^a = x^b + \delta$ where $\delta$ minimises
$$J(\delta) = \frac{1}{2}\left[\frac{\delta^2}{b} + \frac{(y_1 - x^b - \delta)^2}{r_i} + \frac{(y_2 - x^b - \delta)^2}{s_i}\right].$$
Sequential (batch size of 1): at each time $i$ we assimilate $y_1(i)$ using fixed background error covariance $b_1$ to give intermediate analysis $x^{a1}(i)$, then assimilate observation $y_2(i)$ using fixed background error variance $b_2$ to give final analysis $x^a(i)$, i.e. $x^{a1} = x^b + \delta_1$, $x^a = x^{a1} + \delta_2$, where $\delta_1$, $\delta_2$ minimise
$$J(\delta_1) = \frac{1}{2}\left[\frac{\delta_1^2}{b_1} + \frac{(y_1 - x^b - \delta_1)^2}{r_i}\right], \qquad J(\delta_2) = \frac{1}{2}\left[\frac{\delta_2^2}{b_2} + \frac{(y_2 - x^{a1} - \delta_2)^2}{s_i}\right].$$
We will show the following: we can choose $b$ so that, however $b_1$ and $b_2$ are chosen, the mean square error using the simultaneous method is lower than that using the sequential method (and strictly lower if $k \ge 2$ and $R_1 = R_2$).

D.1. Proof (summary)
(1) Denoting $E[(x^a_i - x^t_i)^2]$ by $A_i$ it is readily shown that for both the simultaneous and sequential methods [...] (D1), where the parameters $\beta_i$, $\rho_i$, $\sigma_i$ are as listed in Table D1.