State-of-the-art stochastic data assimilation methods for high-dimensional non-Gaussian problems

Abstract This paper compares several commonly used state-of-the-art ensemble-based data assimilation methods in a coherent mathematical notation. The study encompasses methods that are applicable to high-dimensional geophysical systems, such as the ocean and atmosphere, and that provide an uncertainty estimate. Most variants of Ensemble Kalman Filters, Particle Filters and second-order exact methods are discussed, including Gaussian Mixture Filters, while methods that require an adjoint model or a tangent linear formulation of the model are excluded. The detailed description of all the methods in a mathematically coherent way provides both novices and experienced researchers with a unique overview of, and new insight into, the workings and relative advantages of each method, theoretically and algorithmically, even leading to new filters. Furthermore, the practical implementation details of all ensemble and particle filter methods are discussed to show similarities and differences between the filters, helping users decide which method to use when. Finally, pseudo-codes are provided for all of the methods presented in this paper.


Introduction
Data assimilation (DA) is the science of combining observations of a system, including their uncertainty, with estimates of that system from a dynamical model, including its uncertainty, to obtain a new and more accurate description of the system including an uncertainty estimate of that description. The uncertainty estimates point to an efficient description in terms of probability density functions, and in this paper we discuss methods that perform DA using an ensemble of model states to represent these probability density functions.
Ensemble Kalman filters are currently highly popular DA methods that are applied to a wide range of dynamical models including oceanic, atmospheric, and land surface models. The increasing popularity of Kalman-Filter-based ensemble (EnKF) methods in these fields is due to the relative ease of the filter implementation, increasing computational power and the natural forecast error evolution in EnKF schemes with the dynamical model. Research on data assimilation methods that can be applied to high-resolution dynamical models and/or complex observation operators has seen major developments in the last decade, with the aims of understanding how existing ensemble methods cope with non-linearity in the models, developing new ensemble methods that are better suited to non-linear dynamical models, and exploring non-linear filters that are not limited to Gaussian distributions, such as particle filters or hybrids between particle and ensemble Kalman filters.
The origin of this paper lies in the EU-funded research project SANGOMA (Stochastic Assimilation for the Next Generation Ocean Model Applications). The project focused on generating a coherent and transparent database of the current ensemble-based data assimilation methods and on the development of data assimilation tools suitable for non-linear and high-dimensional systems, concentrating on methods that do not require tangent linear approximations of the model or its adjoint. The methods described within this paper have been applied in operational oceanography, for example in the TOPAZ system (Sakov et al., 2012) or the FOAM system of the UK Met Office (see Blockley et al., 2014). While TOPAZ is already using an EnKF, FOAM applies an optimal interpolation scheme that uses less dynamical information to estimate the error covariance matrix. That said, this paper is aimed at a very broad audience, and the data assimilation methods discussed here are not limited to applications to ocean or atmosphere models. Hence, the methods are presented without the context of any specific dynamical model, allowing the reader to make the most of each technique for their specific application.
A number of reviews have been published recently, each collating parts of the development in data assimilation: Bannister (2017) gives a comprehensive review of operational methods of variational and ensemble-variational data assimilation, Houtekamer and Zhang (2016) review the ensemble Kalman filter with a focus on application to atmospheric data assimilation, Law and Stuart (2012) review variational and Kalman filter methods, and Bocquet et al. (2010) present a review of concepts and ideas of non-Gaussian data assimilation methods and discuss various sources of non-Gaussianity. The merits of this paper lie in:
• a coherent mathematical description of the main methods that are used in the current data assimilation community for application to high-dimensional and non-Gaussian problems, allowing the reader to easily see the differences between the methods and compare them;
• the discussion of ensemble Kalman filters, particle filters, second-order exact filters and Gaussian Mixture Filters within the same paper using consistent notation;
• the inclusion of practical application aspects of these methods, discussing computational cost, parallelisation, localisation and inflation techniques;
• the provision of pseudo-code algorithms for all of the presented methods.
Along with the inclusion of recent developments, such as the error-subspace transform Kalman filter (ESTKF) and recent particle filters, this paper goes beyond earlier reviews (e.g. Tippett et al., 2003; Hamill et al., 2006; van Leeuwen, 2009; Houtekamer and Zhang, 2016; Bannister, 2017).
The paper is organised as follows: in Section 2 the common ground through Bayes theorem is established, in Section 3 a historical overview is given for both ensemble Kalman and particle filter fields, and in Section 4 we define the basic problem solved by all of the methods presented in this paper. In Section 5 we discuss the most popular types of ensemble Kalman filter methods.
Then, in Section 6, we discuss several particle filter methods that can be applied to high-dimensional problems. In Section 7, we describe ensemble filters with second-order accuracy, namely the particle filter with Gaussian resampling (PFGR), the non-linear ensemble transform filter (NETF), and the moment-matching ensemble filter. The Gaussian mixture filter is discussed in Section 8. The practical implementation of the filters including localisation, inflation, parallelisation and the computation cost as well as the aspect of non-linearity are discussed in Section 9. Finally, Appendix 1 provides pseudo codes for resampling techniques often used in particle filter methods, and Appendix 2 contains pseudo codes for all of the methods discussed in this paper.
We note that many of the filters discussed in this paper are freely available from the SANGOMA project website, along with many other tools valuable and/or necessary for data assimilation systems.

Notation
In the data assimilation community the currently most accepted notation is described in Ide et al. (1997). We adhere to this notation where possible while also making this paper accessible and intuitive not only to data-assimilation experts but also to a wider audience, including those who might like to explore data assimilation methods simply as tools for their specific needs. To this end, throughout this paper dimensions will always be described by capital $N$ with a subscript indicating the space in question, that is
• $N_x$: dimension of state space;
• $N_y$: dimension of observation space;
• $N_e$: dimension of ensemble/particle space.
Further, the time index is always denoted in parentheses in the upper right corner of the variables, i.e. $(\cdot)^{(m)}$, except for operators such as $\mathcal{M}$ (dynamical model) and $\mathcal{H}$ (observation operator) where it is in the lower right corner. However, we will omit the time index when possible to ease the notation. We will refer to each ensemble member (or each particle) by $\mathbf{x}_j$, where the index $j = 1, \ldots, N_e$ and $N_e$ is the total number of ensemble members (or particles).
When discussing Bayesian theory in Sections 2 and 6, purely random variables will be denoted by capital letters, and fixed, deterministic or observed quantities will be denoted by lowercase letters. Probability density functions will be denoted by $p(\cdot)$ and $q(\cdot)$, and we will use lowercase arguments in this context.
Throughout the paper, Greek letters will refer to various errors, e.g. observational or model errors. Finally, bold lowercase letters will denote vectors and bold uppercase letters will denote matrices.

Common ground through Bayes theorem
Various types of data assimilation methods, e.g. variational, ensemble Kalman filters, particle filters, etc., have originated from different fields and backgrounds, due to the needs of a particular community or application. However, all of these methods can be unified through Bayes theorem. In this section, we will give a summary of Bayes theorem showing how both ensemble Kalman filter (KF) methods and particle filter (PF) methods are linked in this context and what problem each of them solves. For an introduction to the Bayesian theory for data assimilation, the reader is referred to e.g. van Leeuwen and Evensen (1996) and Wikle and Berliner (2006).
Data assimilation is an approach for combining observations with model forecasts to obtain a more accurate estimate of the state and its uncertainty. In this context, we require
• data, that is observations $\mathbf{y}$ and a knowledge of their associated error distributions, and
• a prior, that is a model forecast of the state, $\mathbf{x}^f$, and knowledge of the associated forecast and model errors,
to obtain the posterior, i.e. the analysis state $\mathbf{x}^a$, and its associated error. The posterior can be computed through Bayes theorem, which states that
$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)}, \tag{1}$$
where $p(x|y)$ is the posterior or analysis probability density function, $p(y|x)$ is the observation probability density function, also called the likelihood, $p(x)$ is the prior or forecast probability density function, and $p(y)$ is the marginal probability density function of the observations, which can be thought of as a normalising constant. From now on, for ease of readability, we will abbreviate 'probability density function' to 'pdf'. Typically, data assimilation methods make the Markovian assumption for the dynamical model $\mathcal{M}$ and the conditional independence assumption for the observations. That is, we assume that the model state or the prior at time $m$, when conditioned on all previous states, only depends on the state at time $m-1$,
$$p(x^{(0:T)}) = p(x^{(0)}) \prod_{m=1}^{T} p(x^{(m)}|x^{(m-1)}). \tag{2}$$
Here, the superscript $0:T$ is to be read as the time indices from the initial time to time $T$, an interval that is typically called the assimilation window in data assimilation. Further, observations are also usually assumed to be conditionally independent, i.e. given the states they are assumed to be independent in time,
$$p(y^{(1:T)}|x^{(0:T)}) = \prod_{m=1}^{T} p(y^{(m)}|x^{(m)}). \tag{3}$$
Using Equations (2) and (3) we can rewrite Bayes theorem in Equation (1) as
$$p(x^{(0:T)}|y^{(1:T)}) \propto p(x^{(0)}) \prod_{m=1}^{T} p(y^{(m)}|x^{(m)})\, p(x^{(m)}|x^{(m-1)}). \tag{4}$$
The Markovian assumption allows us to use new observations as they become available by updating the previous estimate of the state process without having to start the calculations from scratch. This is called sequential updating and the methods described in this paper all follow this approach.
Ensemble Kalman filter methods solve this problem using Gaussian assumptions for both prior and likelihood pdf's. Multiplying two Gaussian pdf's leads again to a Gaussian pdf, i.e. the posterior or analysis pdf will also be Gaussian. The posterior pdf will have only one global maximum, which will correspond to the ensemble mean (and also to the mode and median, since the pdf is Gaussian). In other words, the posterior pdf in ensemble Kalman filter methods described in Section 5 is found in terms of the first two moments (mean and covariance) of the prior and likelihood pdf's. This is also true when ensemble Kalman filters are applied to non-linear dynamical models or observation operators, in which case the information from higher moments is ignored in the ensemble KF analysis update. This is a shortcoming of ensemble Kalman filters when applied to non-Gaussian problems. However, ensemble Kalman filter methods are in general robust when applied to non-linear models, and catastrophic filter divergence, where the filter deviates strongly from the observations while producing unrealistically small error estimates, occurs mainly due to sparse or inaccurate observations (Verlaan and Heemink, 2001; Tong et al., 2016). It should, of course, be realised that in non-linear settings the estimates of the posterior mean and covariance may be inaccurate.
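As a small numerical illustration of the Gaussian case described above, the following sketch evaluates Bayes theorem on a grid for a scalar state and compares the resulting posterior mean and variance with the closed-form Kalman update. The prior mean and variance, the observation and its error variance are arbitrary values chosen for illustration and are not taken from this paper; the observation is assumed to measure the state directly.

```python
import numpy as np

# Hypothetical scalar example: prior N(xf, pf), observation y = x + noise, noise ~ N(0, r)
xf, pf = 1.0, 2.0
y, r = 3.0, 1.0

# Bayes theorem on a grid: posterior is proportional to likelihood times prior
x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]
prior = np.exp(-0.5 * (x - xf) ** 2 / pf)
likelihood = np.exp(-0.5 * (y - x) ** 2 / r)
posterior = prior * likelihood
posterior /= posterior.sum() * dx  # normalisation plays the role of p(y)

grid_mean = (x * posterior).sum() * dx
grid_var = ((x - grid_mean) ** 2 * posterior).sum() * dx

# Closed-form Kalman update for the same scalar problem
k = pf / (pf + r)
kalman_mean = xf + k * (y - xf)
kalman_var = (1.0 - k) * pf

print(grid_mean, kalman_mean)
print(grid_var, kalman_var)
```

In this Gaussian setting the grid-based posterior and the Kalman update agree to numerical precision, and the posterior maximum coincides with its mean.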
In particle filter methods, the posterior is obtained using the prior and likelihood pdf's directly in Equation (1) without restricting them to being Gaussian. If both prior and likelihood are Gaussian, the resulting posterior or analysis pdf is also Gaussian and has a global maximum corresponding to the mean state. However, if either or both prior and likelihood pdf's are non-Gaussian then the resulting posterior pdf will also not be Gaussian. In other words, if the dynamical model or the mapping of the model variables to observation space is non-linear then particle filter methods will produce an analysis pdf which will provide knowledge of more than the first two statistical moments (mean and covariance), in contrast to ensemble Kalman filter methods. Thus, the analysis pdf could be skewed, multi-modal or of varying width in comparison to a Gaussian pdf. Hence, particle filters are, by design, able to produce analysis pdf's for non-Gaussian problems. While standard particle filter methods suffer from filter divergence for large problems, several particle filter variants have recently been developed that avoid this divergence.
In what follows, we will describe numerous filtering methods in Sections 5, 6, 7, and 8 and discuss how each method attempts to produce an analysis pdf for non-Gaussian and high-dimensional problems. However, we first provide an overview of the historical development of both ensemble Kalman filters and particle filter methods to show how these fields have evolved and what has given rise to the development of each of the methods.

History of filtering for data assimilation
Before we proceed to the main point of our paper, namely describing in unified notation the current state-of-the-art ensemble and particle filter methods for non-linear and non-Gaussian applications, their implementation, and their practical application, a short summary is in order on the historical development in both the ensemble Kalman filter and particle filter areas.

Development history of ensemble Kalman filters
Ensemble data assimilation (EnDA) started in 1994 with the introduction of the Ensemble Kalman filter (EnKF, Evensen (1994)). The use of perturbed observations was introduced a few years later simultaneously by Burgers et al. (1998) and Houtekamer and Mitchell (1998) to correct the previously too low spread of the analysis ensemble. This filter formulation defines today the basic 'Ensemble Kalman filter', which we will denote as the Stochastic Ensemble Kalman Filter, with a slightly different interpretation and implementation, as will be described later. The first alternative variant of the original EnKF was introduced by Pham et al. (1998a) in the form of Singular 'Evolutive' Interpolated Kalman (SEIK) filter. The SEIK filter formulates the analysis step in the space spanned by the ensemble and hence is computationally particularly efficient. In contrast to the EnKF, which was formulated as a Monte Carlo method, the SEIK filter was designed to find the analysis ensemble by writing each posterior member as a linear combination of prior members without using perturbed observations. Another ensemble Kalman filter that uses the space spanned by the ensemble was introduced with the Error-Subspace Statistical Estimation (ESSE) method (Lermusiaux and Robinson, 1999).
The filters mentioned above were all introduced for data assimilation in oceanographic problems. A new set of filter methods was introduced during the years 2001 and 2002 for meteorological applications. The Ensemble Transform Kalman Filter (ETKF, Bishop et al., 2001) was first introduced in the context of meteorological adaptive sampling. Further, the Ensemble Adjustment Kalman Filter (EAKF, Anderson, 2001) and the Ensemble Square Root Filter (EnSRF, Whitaker and Hamill, 2002) were introduced. The motivation for these three filters was to avoid the use of perturbed observations, which were found to introduce additional sampling error into the filter solution, with the meteorological community apparently being unaware of the development of the SEIK filter. The new filters were classified as ensemble square root Kalman filters and presented in a uniform notation by Tippett et al. (2003). Nerger et al. (2005a) further classified the EnKF and SEIK filters as error-subspace Kalman filters because the filters compute the correction in the error subspace spanned by the ensemble. This likewise holds for the ETKF, EAKF, and EnSRF; however, these filters do not explicitly use a basis in the error subspace but use the ensemble to represent the space. When the EAKF and EnSRF formulations are used to assimilate all observations at once, these filters exhibit a much larger computational cost compared to the ETKF. To reduce the cost, the original study on the EnSRF (Whitaker and Hamill, 2002) already introduced a variant in which observations are assimilated sequentially, which assumes that the observation errors are uncorrelated. A similar serial formulation of the EAKF was introduced by Anderson (2003). This sequential assimilation of observations was assessed by Nerger (2015), who showed that this formulation can destabilise the filtering process in cases when the observations have a strong influence.
With regard to the classification as an ensemble square root Kalman filter, the SEIK filter is the first filter method that was clearly formulated in square root form. The original EnKF uses the square root form only implicitly but an explicit square root formulation of the EnKF was presented by Evensen (2003).
The methods above all solve the original equations of the Kalman filter but use the sample covariance matrix of the ensemble to represent the state error covariance matrix. An alternative was introduced with the Maximum Likelihood Ensemble Filter (MLEF, Zupanski, 2005). This filter represents the first variant of the class of hybrid filters that were introduced in later years. The filter computes the maximum a posteriori solution (in contrast to the minimum-variance solution of the Kalman filter) by an iterative scheme. While the EnKFs were very successful in making the application of the Kalman filter feasible for the high-dimensional problems in oceanography and meteorology, the affordable ensemble size was always very limited. To counter the issue of sampling error in ensemble covariances (the ensemble-sampled covariance has a rank of not more than the ensemble size minus one while the applications were of very high state dimension), the method of covariance localisation was introduced by Houtekamer and Mitchell (1998) and Houtekamer and Mitchell (2001). Later, an alternative localisation was introduced for the ETKF (LETKF, Hunt et al., 2007), which uses a local analysis (also used previously, e.g. by Cohn et al. (1998)) where observations are down-weighted with increasing distance from the local analysis point through a tapering of the inverse observation covariances.
The relationship between the SEIK filter and the ETKF was investigated by Nerger et al. (2012a). The study leads to a new filter formulation, the Error-Subspace Transform Kalman Filter (ESTKF), which combined the advantages of both filter formulations.
The filters mentioned above represent main developments of the ensemble Kalman filters. However, there are many other developments, which are not included here. Some of them are discussed in the sections below, in particular with regard to localisation. Overall, while there are different reviews of selections of ensemble Kalman filters, a complete and coherent overview of the different methods is still missing.

Development history of particle filters
Particle filters, like ensemble Kalman filters, are variants of Monte Carlo methods in which the probability distribution of the model state given the observations is approximated by a number of particles; however, unlike ensemble Kalman filters, particle filters are fully non-linear data assimilation techniques. From a sampling point of view, Ensemble Kalman Filters draw samples directly from the posterior since the probability distribution function (pdf) is assumed to be a Gaussian. In a particle filter application, the shape of the posterior is not known, and hence one cannot sample directly from it. In its simplest form, samples are generated from the prior after which importance sampling is employed to turn them into samples from the posterior where each sample is weighted with its likelihood value.
Particle filters emerged before ensemble Kalman filters, and when Gordon et al. (1993) introduced resampling in the sequential scheme the method became mainstream in non-linear filtering. This basic scheme has been made more efficient for specific applications in numerous ways, such as looking ahead or adding small perturbations to resampled particles to avoid identical particles (see Doucet et al., 2001 for a very useful review of the many methods available at that time). Attempts to apply the particle filter to geophysical systems date back to 1996 (van Leeuwen and Evensen, 1996), with the first partially successful application by van Leeuwen (2003a). However, until recently, particle filters have been deemed to be computationally unfeasible for large-dimensional systems due to the filter degeneracy problem (Snyder et al., 2008; van Leeuwen, 2009). This means that the likelihood weights vary substantially between the particles when the number of independent observations is large, such that one particle obtains a weight close to one, while all the others have weights very close to zero. New developments in the field generated particle filter variants that have been shown to work for large-dimensional systems with a limited number of particles. These methods can be divided into two classes: those that use localisation (starting with van Leeuwen, 2003b; Bengtsson et al., 2003), followed more recently by local variants of the ensemble transform particle filter (ETPF, Reich, 2013) and the Local Particle Filter (Poterjoy, 2016a), and those that exploit the future observational information via proposal densities, such as the Implicit Particle Filter (Chorin and Tu, 2009), the Equivalent Weights Particle Filter (EWPF, van Leeuwen, 2010; van Leeuwen, 2011; Ades and van Leeuwen, 2013), and the Implicit Equal Weights Particle Filter (IEWPF, Zhu et al., 2016).
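The likelihood weighting of prior samples and the degeneracy described above can be illustrated with a minimal sketch. It assumes a standard Gaussian prior, an identity observation operator and independent observation errors, all of which are choices made here for illustration only. The effective sample size, $1/\sum_j w_j^2$, is used as a diagnostic; it is a common measure of weight collapse but is not introduced in the text above, and the function names are hypothetical.

```python
import numpy as np

def likelihood_weights(particles, y, r_var):
    """Normalised importance weights for prior samples.

    Assumes an identity observation operator and R = r_var * I.
    particles has shape (Ny, Ne), y has shape (Ny,).
    """
    log_w = -0.5 * np.sum((y[:, None] - particles) ** 2, axis=0) / r_var
    log_w -= log_w.max()  # shift to avoid numerical underflow
    w = np.exp(log_w)
    return w / w.sum()

def effective_sample_size(w):
    """Equals Ne for uniform weights and 1 when a single weight dominates."""
    return 1.0 / np.sum(w ** 2)

rng = np.random.default_rng(0)
n_e = 100
ess = {}
for n_y in (1, 10, 100, 1000):
    truth = rng.standard_normal(n_y)
    y = truth + rng.standard_normal(n_y)         # observations with unit error variance
    particles = rng.standard_normal((n_y, n_e))  # samples from the prior N(0, I)
    w = likelihood_weights(particles, y, 1.0)
    ess[n_y] = effective_sample_size(w)

print(ess)
```

With a fixed number of particles, the effective sample size drops towards one as the number of independent observations grows, which is the weight collapse that the localised and proposal-density variants aim to avoid.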
In another development, second-order exact filters have been developed that ensure that the first two moments of the posterior pdf are consistent with the particle filter, and higher-order moments are not considered. The first paper of this kind was the Particle Filter with Gaussian Resampling of Xiong et al. (2006), followed by the Merging Particle Filter (Nakano et al., 2007) and the Moment Matching Ensemble Filter (Lei and Bickel, 2011). All these filters seem to have been developed independently. The Non-linear Ensemble Transform Filter (Tödter and Ahrens, 2015) can be considered a local version of the filter by Xiong et al. (2006), ironically again developed independently. A further approximation to particle filtering is the Gaussian Mixture Filter first introduced in the geosciences by Bengtsson et al. (2003), followed by the adaptive Gaussian mixture filter variants (Hoteit et al., 2008;Stordal et al., 2011). The advantage of these filters over the standard particle filter is that each particle is 'dressed' by a Gaussian such that the likelihood weights are calculated using a covariance that is broader than the pure observational covariance, leading to better behaving weights at the cost of reducing the influence of the observations on the posterior pdf (see e.g. van Leeuwen, 2009).

The problem
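Consider the following non-linear stochastic discrete-time dynamical system at a time when observations are available:
$$\mathbf{x}^{(m)} = \mathcal{M}_m(\mathbf{x}^{(m-1)}) + \boldsymbol{\beta}^{(m)}, \tag{5}$$
$$\mathbf{y}^{(m)} = \mathcal{H}_m(\mathbf{x}^{(m)}) + \boldsymbol{\beta}_o^{(m)}, \tag{6}$$
where $\mathcal{M}_m$ is the dynamical model, $\mathcal{H}_m$ is the observation operator, $\boldsymbol{\beta}^{(m)} \in \mathbb{R}^{N_x}$ is the model noise (or error), distributed Gaussian with covariance matrix $\mathbf{Q}^{(m)}$, and $\boldsymbol{\beta}_o^{(m)} \in \mathbb{R}^{N_y}$ is the observation noise (or error), distributed Gaussian with covariance matrix $\mathbf{R}^{(m)}$.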
Then we can define an ensemble of model forecasts obtained using Equation (5) for each ensemble or particle member as follows,
$$\mathbf{X}^f = (\mathbf{x}_1^f, \mathbf{x}_2^f, \ldots, \mathbf{x}_{N_e}^f) \in \mathbb{R}^{N_x \times N_e}, \tag{7}$$
where the superscript $(\cdot)^f$ stands for forecast. The aim of the stochastic data assimilation methods is to produce a posterior pdf or analysis distribution of the state, $\mathbf{X}^a$, at the time of the observations by combining the ensemble model forecast $\mathbf{X}^f$ with the observations $\mathbf{y}$. In Section 5, we will discuss ensemble Kalman filter based methods and in Sections 6-8 we will discuss particle, second-order exact, and adaptive Gaussian mixture filter methods, all achieving this aim through different approaches.
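The construction of the forecast ensemble can be sketched as follows. The model `toy_model`, the noise covariance and all dimensions are hypothetical placeholders chosen for illustration; any non-linear model acting on the columns of the ensemble matrix could be substituted. The sketch also forms the ensemble mean, the perturbations and the sample covariance, whose rank cannot exceed $N_e - 1$.

```python
import numpy as np

def forecast_ensemble(x_a, model, q_sqrt, n_steps, rng):
    """Propagate every column of x_a (Nx, Ne) through n_steps model steps.

    After each step, Gaussian model noise with covariance Q = q_sqrt @ q_sqrt.T
    is added independently to every member.
    """
    x = x_a.copy()
    n_x, n_e = x.shape
    for _ in range(n_steps):
        noise = q_sqrt @ rng.standard_normal((n_x, n_e))
        x = model(x) + noise
    return x

def toy_model(x):
    # Hypothetical non-linear model acting column-wise (illustration only)
    return x + 0.1 * np.sin(x)

rng = np.random.default_rng(1)
n_x, n_e = 3, 20
x_a = rng.standard_normal((n_x, n_e))  # initial (analysis) ensemble
q_sqrt = 0.05 * np.eye(n_x)            # square root of the model error covariance
x_f = forecast_ensemble(x_a, toy_model, q_sqrt, n_steps=5, rng=rng)

x_mean = x_f.mean(axis=1, keepdims=True)  # ensemble mean
x_pert = x_f - x_mean                     # ensemble perturbations
p_f = x_pert @ x_pert.T / (n_e - 1)       # sample covariance, rank at most Ne - 1
```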

Ensemble Kalman filters
Given an initial ensemble $\mathbf{X}^{(0)} \in \mathbb{R}^{N_x \times N_e}$, the different proposed variants of the ensemble Kalman filter have the following steps in common:
• Forecast step: the ensemble members at each time step between the observations, $0 < k \le m$, are propagated using the full non-linear dynamical model,
$$\mathbf{x}_j^{f,(k)} = \mathcal{M}_k(\mathbf{x}_j^{f,(k-1)}) + \boldsymbol{\beta}_j^{(k)}, \tag{8}$$
where $\boldsymbol{\beta}_j^{(k)}$ is a realisation of the model noise and $j = 1, \ldots, N_e$ is the ensemble member index. The propagation starts from the previous analysis ensemble (for $k = 1$ this is $\mathbf{x}_j^{(0)}$).
• Analysis step: at the observation time $k = m$ the ensemble forecast mean and covariance are updated using the available observations to obtain a new analysis ensemble.
The various ensemble methods differ in the analysis step. Here we will discuss current methods applicable for large-dimensional systems, namely, the original ensemble Kalman filter (EnKF) (Evensen, 1994) with stochastic innovations (Burgers et al., 1998; Houtekamer and Mitchell, 1998), the singular evolutive interpolated Kalman filter (SEIK) (Pham et al., 1998a), the error-subspace statistical estimation (ESSE) (Lermusiaux and Robinson, 1999; Lermusiaux et al., 2002; Lermusiaux, 2007), the ensemble transform Kalman filter (ETKF) (Bishop et al., 2001), the ensemble adjustment Kalman filter (EAKF) (Anderson, 2001), the original ensemble square root filter (EnSRF) (Whitaker and Hamill, 2002) with synchronous and serial observation treatment, the square root formulation of the EnKF (Evensen, 2003), the error subspace transform Kalman filter (ESTKF) (Nerger et al., 2012a), and the maximum likelihood ensemble filter (MLEF) (Zupanski, 2005; Zupanski et al., 2008). We will present these methods in the square root form and point out the different ways the analysis ensemble is obtained in each of the methods. Tippett et al. (2003) give a uniform framework for EnSRFs, which we follow closely here. In the rest of this section, for ease of notation, we omit the time index $(\cdot)^{(k)}$ since all of the analysis operations are done at time $m$.
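The ensemble methods discussed in this section are based on the Kalman filter (Kalman, 1960), where the updated ensemble mean follows the Kalman update for the state, given by
$$\overline{\mathbf{x}}^a = \overline{\mathbf{x}}^f + \mathbf{K}\left(\mathbf{y} - \mathcal{H}(\overline{\mathbf{x}}^f)\right). \tag{9}$$
The ensemble covariance update follows the covariance update equation in the Kalman filter, given by
$$\mathbf{P}^a = (\mathbf{I} - \mathbf{K}\mathbf{H})\,\mathbf{P}^f, \tag{10}$$
where $\mathbf{K}$ is the Kalman gain given by
$$\mathbf{K} = \mathbf{P}^f \mathbf{H}^T \left(\mathbf{H}\mathbf{P}^f\mathbf{H}^T + \mathbf{R}\right)^{-1}. \tag{11}$$
The matrix $\mathbf{H}$ is the observation operator $\mathcal{H}(\cdot)$ linearised at the forecast mean $\overline{\mathbf{x}}^f$. Initially the Kalman filter was derived for a linear observation operator, but in the Extended Kalman Filter the non-linear observation operator is used as above.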
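Since for high-dimensional systems it is computationally not feasible to form the error covariance matrix $\mathbf{P}$, the analysis update of the covariance matrix in Equation (10) is formulated in a square root form by computing a transform matrix and applying it to the ensemble perturbation matrix, which is a scaled square root of $\mathbf{P}$. That is, the analysis ensemble is then given by
$$\mathbf{X}^a = \overline{\mathbf{X}}^a + \mathbf{X}'^a, \tag{12}$$
where $\overline{\mathbf{X}}^a = (\overline{\mathbf{x}}^a, \ldots, \overline{\mathbf{x}}^a) \in \mathbb{R}^{N_x \times N_e}$ is a matrix with the ensemble analysis mean in each column and the ensemble analysis perturbations $\mathbf{X}'^a$ are a scaled matrix square root of
$$\mathbf{P}^a = \frac{1}{N_e - 1}\,\mathbf{X}'^a (\mathbf{X}'^a)^T. \tag{13}$$
To obtain the general square root form we write, using (10),
$$\mathbf{X}'^a (\mathbf{X}'^a)^T = \mathbf{X}'^f \left[\mathbf{I} - \mathbf{S}^T \mathbf{F}^{-1} \mathbf{S}\right] (\mathbf{X}'^f)^T, \tag{14}$$
where $\mathbf{S} = \mathbf{H}\mathbf{X}'^f$ is the ensemble perturbation matrix in observation space and
$$\mathbf{F} = \mathbf{S}\mathbf{S}^T + (N_e - 1)\,\mathbf{R} \tag{15}$$
is the innovation covariance (scaled by $N_e - 1$, consistent with the unscaled perturbations). It is possible to use a slightly different way to calculate the matrix $\mathbf{S}$ using the non-linear observation operator as
$$\mathbf{S} = \mathcal{H}(\mathbf{X}^f) - \overline{\mathcal{H}(\mathbf{X}^f)},$$
where $\mathcal{H}(\mathbf{X}^f)$ applies the observation operator to each ensemble member and $\overline{\mathcal{H}(\mathbf{X}^f)}$ holds the ensemble mean of the mapped members in each column. This can be used in any of the ensemble Kalman filters discussed below. To find the updated ensemble analysis perturbations $\mathbf{X}'^a$ we need to compute the square root $\mathbf{T}$ of the matrix
$$\mathbf{T}\mathbf{T}^T = \mathbf{I} - \mathbf{S}^T \mathbf{F}^{-1} \mathbf{S}, \tag{16}$$
where $\mathbf{T}$ is called a transform matrix. Different ways exist to compute the transform matrix $\mathbf{T}$ and here we will discuss the current methods applicable to large-dimensional systems. For the ensemble-based Kalman filters presented in this paper we can write the analysis update as linear transformations using a weight vector $\mathbf{w}$ for the ensemble mean and a weight matrix $\mathbf{W}$ for the ensemble perturbations as
$$\overline{\mathbf{x}}^a = \overline{\mathbf{x}}^f + \mathbf{X}'^f \mathbf{w}, \tag{17}$$
$$\mathbf{X}'^a = \mathbf{X}'^f \mathbf{W}. \tag{18}$$
Notice that the ensemble analysis perturbation matrix, $\mathbf{X}'^a$, in Equation (18) has a zero mean by construction. Further, we note that for most of the methods discussed in this section, the matrix $\mathbf{W}$ is the transformation matrix $\mathbf{T}$ in Equation (16). However, this is not the case for EnKF, SEnKF and MLEF. Further, we can compute the analysis ensemble directly by
$$\mathbf{X}^a = \overline{\mathbf{X}}^f + \mathbf{X}'^f \left(\overline{\mathbf{W}} + \mathbf{W}\right), \tag{19}$$
where $\overline{\mathbf{W}} = (\mathbf{w}, \ldots, \mathbf{w}) \in \mathbb{R}^{N_e \times N_e}$. In the sections below we will derive the weight matrices for each of the ensemble-based Kalman filter methods we discuss. The updated ensemble can then be obtained using Equation (19).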
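The general square root update can be sketched numerically as follows. The sketch assumes a linear observation operator given as a matrix `h`, an innovation covariance of the form $\mathbf{S}\mathbf{S}^T + (N_e - 1)\mathbf{R}$, and uses the symmetric square root of $\mathbf{T}\mathbf{T}^T$ obtained from an eigendecomposition. This is only one admissible choice of $\mathbf{T}$ and is not meant to represent any particular filter of this section; the function name and the test problem are hypothetical.

```python
import numpy as np

def square_root_update(x_f, y, h, r):
    """Deterministic square root analysis with a symmetric transform matrix.

    x_f: forecast ensemble (Nx, Ne); y: observations (Ny,);
    h: linear observation operator (Ny, Nx); r: observation error covariance (Ny, Ny).
    """
    n_e = x_f.shape[1]
    x_mean = x_f.mean(axis=1, keepdims=True)
    x_pert = x_f - x_mean                    # forecast perturbations
    s = h @ x_pert                           # perturbations in observation space
    f = s @ s.T + (n_e - 1) * r              # scaled innovation covariance
    f_inv_s = np.linalg.solve(f, s)          # F^{-1} S
    tt = np.eye(n_e) - s.T @ f_inv_s         # T T^T = I - S^T F^{-1} S
    evals, evecs = np.linalg.eigh(tt)
    # Symmetric square root; tiny negative eigenvalues from rounding are clipped
    t = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    w = f_inv_s.T @ (y - h @ x_mean[:, 0])   # weight vector for the mean
    # Column j of (w[:, None] + t) holds the weights for analysis member j
    return x_mean + x_pert @ (w[:, None] + t)

rng = np.random.default_rng(2)
n_x, n_y, n_e = 4, 2, 10
x_f = rng.standard_normal((n_x, n_e))
h = rng.standard_normal((n_y, n_x))
r = 0.5 * np.eye(n_y)
y = rng.standard_normal(n_y)
x_a = square_root_update(x_f, y, h, r)
```

For a linear observation operator, the mean and sample covariance of `x_a` reproduce the Kalman filter mean and covariance computed from the sample forecast covariance, which is a convenient check of any transform matrix implementation.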
To aid simplicity in discussing the different methods we use the same letter for the variables with the same meaning, i.e. $\mathbf{W}$ is always the perturbation analysis transform matrix that transforms the forecast perturbations $\mathbf{X}'^f$ into the analysis perturbations $\mathbf{X}'^a$. Clearly, such variables do not necessarily have the same values for the various methods listed below. Thus, we subscript these variables common to all methods with a specific letter for each method. This letter is underlined in the title of each subsection that follows here, e.g. for EnKF we use $\mathbf{W}_N$. Note, though, that some of the variables can have the same values for different methods. At the end of this section we will provide a table of the common variables with their dimensions and whether they are equal to the same variable in a different method.

The Stochastic Ensemble Kalman filter (EnKF)
The Stochastic EnKF was introduced at the same time by Burgers et al. (1998) and Houtekamer and Mitchell (1998). It modifies the original, under-dispersive EnKF introduced by Evensen (1994) by adding measurement noise to the innovations, so that the filter maintains the correct spread in the analysis ensemble and filter divergence is prevented. Although the scheme was initially interpreted as perturbing observations, a more consistent interpretation is that the predicted observations are perturbed with the observation noise. The reason for this is that it does not make sense to perturb observations since they already contain measurement noise (errors), e.g. from measuring instruments, and thus have already departed from the true state of the system. Also, Bayes theorem (see Section 2) tells us that we need the probabilities of the states given this set of observations, not a perturbed set. The idea is that each ensemble member is statistically equivalent to the true state of the system, and the true observation is a perturbed measurement of the true state. So, to compare that observation with the predicted observations, the latter have to be perturbed with the measurement noise too to make this comparison meaningful. This reasoning is identical to that used in rank histograms, in which the predicted observations from the ensemble are perturbed so that they are statistically equivalent to the observations ranked among them.
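Each ensemble member individually is explicitly corrected using the Kalman filter equations, and hence the square root form is only implicit, as the transform matrix and its square root are never explicitly computed. In contrast to the other filters, the stochastic EnKF perturbs the predicted observations by forming a matrix
$$\mathbf{Y}^f = \mathcal{H}(\mathbf{X}^f) + \tilde{\mathbf{Y}},$$
where the observational noise (perturbation) matrix $\tilde{\mathbf{Y}}$ is given by
$$\tilde{\mathbf{Y}} = (\boldsymbol{\epsilon}_1, \boldsymbol{\epsilon}_2, \ldots, \boldsymbol{\epsilon}_{N_e}) \in \mathbb{R}^{N_y \times N_e},$$
with the noise vectors $\boldsymbol{\epsilon}_j$ drawn from a Gaussian distribution with mean zero and covariance $\mathbf{R}$. We also introduce the observation matrix $\mathbf{Y} = (\mathbf{y}, \mathbf{y}, \ldots, \mathbf{y}) \in \mathbb{R}^{N_y \times N_e}$ consisting of $N_e$ identical copies of the observation vector. The Stochastic EnKF uses the matrix $\mathbf{F}$ defined in Equation (15) with prescribed matrix $\mathbf{R}$ and proceeds by transforming all ensemble members according to
$$\mathbf{X}^a = \mathbf{X}^f + \mathbf{X}'^f \mathbf{S}^T \mathbf{F}^{-1} \left(\mathbf{Y} - \mathbf{Y}^f\right).$$
Similar to Equations (17)-(19), this can be written as
$$\mathbf{X}^a = \mathbf{X}^f + \mathbf{X}'^f \mathbf{W}_N,$$
with
$$\mathbf{W}_N = \mathbf{S}^T \mathbf{F}^{-1} \left(\mathbf{Y} - \mathbf{Y}^f\right).$$
Due to the use of the observation ensemble $\mathbf{Y}^f$, no explicit transformation of the ensemble mean needs to be performed. Algorithm 4 in Appendix 2 gives a pseudo-algorithm of the EnKF method.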
We note that, while the above description of the stochastic EnKF is widely accepted and implemented, it produces the correct posterior covariance only in a statistical sense because of the extra sampling errors. The ensemble mean is not affected by this sampling error provided that the observation noise matrix, Y, has zero mean. In the limit of infinite ensemble size, and when all sources of error (both observation and model) are correctly sampled, the stochastic EnKF does produce the correct posterior covariance (Whitaker and Hamill, 2002).
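The analysis step described above can be illustrated with a minimal NumPy sketch. It assumes a linear observation operator given as a matrix `H`, a sample forecast covariance scaled by 1/(N_e − 1), and noise columns that are re-centred to zero mean; the function name, argument names and the scaling of `F` are illustrative assumptions and need not match Equations (15)-(19) or Algorithm 4 exactly.

```python
import numpy as np

def stochastic_enkf_analysis(Xf, y, H, R, rng):
    """One stochastic EnKF analysis step (sketch, linear H assumed).

    Xf : (Nx, Ne) forecast ensemble      y : (Ny,) observation vector
    H  : (Ny, Nx) observation matrix     R : (Ny, Ny) observation error covariance
    """
    Ne = Xf.shape[1]
    xbar = Xf.mean(axis=1, keepdims=True)
    Xp = Xf - xbar                        # forecast perturbations
    S = H @ Xp                            # perturbations in observation space
    # One noise vector per member, drawn from N(0, R) and re-centred to zero mean
    E = rng.multivariate_normal(np.zeros(len(y)), R, size=Ne).T
    E -= E.mean(axis=1, keepdims=True)
    # Innovations: observation minus the *perturbed predicted* observations
    D = y[:, None] - (H @ Xf + E)
    # Assumed scaling: F = S S^T + (Ne - 1) R, so that Xp S^T F^{-1} is the Kalman gain
    F = S @ S.T + (Ne - 1) * R
    return Xf + Xp @ (S.T @ np.linalg.solve(F, D))
```

Because the noise columns are re-centred, the analysis mean in this sketch coincides with the Kalman update of the forecast mean, while the analysis spread is correct only in the statistical sense discussed above.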

The singular evolutive interpolated Kalman filter (SEIK)
The SEIK filter (Pham et al., 1998b; Pham, 2001) was the first filter method that allowed for non-linear model evolution and that was explicitly formulated in square root form. The filter uses the Sherman-Morrison-Woodbury identity (Golub and Van Loan, 1996) to rewrite $TT^T$ (Equation 16) as Note that the performance of this scheme depends on whether the product of the inverse of the observation error matrix, $R^{-1}$, and a given vector can be efficiently computed, which is for instance the case when we assume that the observation errors are uncorrelated. The SEIK filter computes the analysis step in the ensemble error subspace. This is achieved by defining a matrix where $A_E \in \mathbb{R}^{N_e \times (N_e-1)}$ is a matrix with full rank and zero column sums. Commonly, matrix $A_E$ is identified as where 0 is a matrix whose elements are equal to zero and 1 is a matrix whose elements are equal to one (Pham et al., 1998b).
Matrix $A_E$ implicitly subtracts the ensemble mean when the matrix $L$ is computed. In addition, $A_E$ removes the last column of $X^f$. Thus, $L$ is an $N_x \times (N_e - 1)$ matrix that holds the first $N_e - 1$ ensemble perturbations. The product of the square root matrices in the ensemble error space becomes now The matrix $T_E T_E^T$ is of size $(N_e - 1) \times (N_e - 1)$. The square root $T_E$ is obtained from the Cholesky decomposition of $(T_E T_E^T)^{-1}$. Then, the ensemble transformation weight matrices in Equations (17)-(19) are given by Here, the columns of $\Omega \in \mathbb{R}^{N_e \times (N_e-1)}$ are orthonormal and orthogonal to the vector $(1, \ldots, 1)^T$. $\Omega$ can be either a random or a deterministic rotation matrix. However, if a deterministic $\Omega$ is used, then Nerger et al. (2012a) show that a symmetric square root of $T_E T_E^T$ should be used for a more stable ensemble. Algorithm 5 in Appendix 2 gives a pseudo-algorithm of the SEIK method.

The error-subspace statistical estimation (ESSE)
The ESSE (Lermusiaux and Robinson, 1999; Lermusiaux et al., 2002; Lermusiaux, 2007) method is based on evolving an error subspace of variable size that spans and tracks the scales and processes where the dominant errors occur (Lermusiaux et al., 2002). Here, we follow the formulation of Lermusiaux (2007), adapted to the unified notation used in this paper.
The consideration of an evolving error subspace is analogous to the motivation of the SEIK filter. The main difference to other subspace filters mentioned here is how the ensemble matrix is truncated. That is, the full ensemble perturbation matrix X f at the current analysis time with columns . . . , N e is approximated by the fastest growing singular vectors. The full ensemble perturbation matrix is decomposed using the reduced or thin singular value decomposition (SVD), (e.g. p. 72, Golub and Van Loan, 1996), where U S ∈ R N x ×N e is an orthogonal matrix of left singular vectors, S ∈ R N e ×N e is a diagonal matrix with singular values on the diagonal, and V T S ∈ R N e ×N e is an orthogonal matrix of right singular vectors of X f . Next, normalised eigenvalues are computed via The matrices U S and E S are truncated to the leading eigenvalues. UsingŨ S ,Ẽ S with rankÑ e ≤ N e andÛ S ,Ê S with rankp <Ñ e where the similarity coefficient ρ is computed via and Tr(.) is the trace of a matrix. ρ measures the similarity between two subspaces of different sizes. The process of reducing the subspace is repeated until ρ is close to one, i.e. ρ > α where 1 − ≤ α ≤ 1 is a user selected scalar limit. 3 The dimension of the error subspace thus varies with time and in accord with model dynamics (Lermusiaux, 2007). Hence, in the following analysis update the reduced rank approximations are used where the right singular vector matrixṼ S is also truncated to have sizeÑ e ×Ñ e . The product of the square root matrices, using Equation (14), in the error subspace becomes where ensemble errors in observation space are given byS =

R.
The inverse of the $N_y \times N_y$ matrix $\tilde{F}$ is obtained by performing the eigenvalue decomposition (EVD) so that Equation (37) becomes Performing another EVD in Equation (39), the symmetric square root becomes Hence, the ensemble transformation weight matrices needed to form the ensemble analysis mean and analysis perturbations in Equations (17)-(19) are given by Note that, when computing the analysis ensemble mean and perturbations, the truncated ensemble perturbation matrix $\tilde{X}^f$ is used in pseudo-algorithm 6 in Appendix 2. The truncation to rank $\tilde{N}_e$ results in a reduction of the ensemble size. To prevent the ensemble from shrinking, Lermusiaux (2007) described an optional adaptive method to generate new ensemble members.

The ensemble transform Kalman filter (ETKF)
The ETKF (Bishop et al., 2001) was derived to explicitly transform the ensemble in a way that results in the correct spread of the analysis ensemble. Like the SEIK filter, the ETKF uses the Sherman-Morrison-Woodbury identity to write In contrast to the SEIK filter, $T_T T_T^T$ is of size $N_e \times N_e$ and hence represents the error subspace of dimension $N_e - 1$ indirectly by the full ensemble.
Currently, the most widespread method to compute the update in the ETKF appears to be the formulation of the LETKF by Hunt et al. (2007), which we describe here. By performing the EVD of the symmetric matrix $(T_T T_T^T)^{-1}$, with eigenvector matrix $U_T$, we obtain the symmetric square root Using this decomposition, the ensemble transformation weight matrices needed to form the ensemble analysis mean and analysis perturbations in Equations (17)-(19) are given by Using the symmetric square root produces a transform matrix which is closest to the identity matrix in the Frobenius norm (Hunt et al., 2007). Thus, the ETKF results in a minimum transform in the ensemble space, which is different from the notion of 'optimal transportation' used in the ETPF (see Section 6.3). The original publication introducing the ETKF (Bishop et al., 2001) did not specify the form of the matrix square root $T_T$. There are different possibilities to compute it, and taking a simple single-sided square root could lead to implementations with a biased transformation, such that the transformation by $W$ would not preserve the ensemble mean. With the symmetric square root, this bias is avoided. Livings (2005) proposed another variant, which first normalises the forecast observation ensemble perturbation matrix so that the observations are dimensionless with standard deviation one. Substituting (48) into (44) gives To find the square root form, we next perform the SVD In this case, the ensemble transformation weight matrices in Equations (17)-(19) become This formulation avoids the multiplication $S^T R^{-1} S$ and can hence prevent a possible loss of accuracy due to rounding errors. However, it also requires the computation of the square root of $R$, which can itself introduce rounding errors if $R$ is not diagonal.
Algorithm 7 in Appendix 2 gives a pseudo-algorithm of the ETKF method.
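As an illustration of the symmetric square root update, the following sketch follows the scaling convention of Hunt et al. (2007), in which the matrix $(N_e - 1) I + S^T R^{-1} S$ is decomposed by an EVD. This scaling, the function name `etkf_weights` and the variable names are assumptions made for the example; they may differ from the convention of Equations (15)-(19) and from Algorithm 7.

```python
import numpy as np

def etkf_weights(S, R, d):
    """ETKF weights with a symmetric square root (sketch).

    S : (Ny, Ne) forecast perturbations in observation space (zero row means)
    R : (Ny, Ny) observation error covariance
    d : (Ny,)   innovation of the ensemble mean, y - H(xbar)
    """
    Ne = S.shape[1]
    RinvS = np.linalg.solve(R, S)                      # R^{-1} S
    # EVD of the symmetric positive definite Ne x Ne matrix
    evals, U = np.linalg.eigh((Ne - 1) * np.eye(Ne) + S.T @ RinvS)
    A = (U / evals) @ U.T                              # its inverse, from the EVD
    w_mean = A @ (RinvS.T @ d)                         # weight vector for the mean
    # Symmetric square root of (Ne - 1) * A transforms the perturbations
    W_pert = np.sqrt(Ne - 1) * (U / np.sqrt(evals)) @ U.T
    return w_mean, W_pert
```

With forecast perturbations `Xp`, the analysis mean is `xbar + Xp @ w_mean` and the analysis perturbations are `Xp @ W_pert`. Because the symmetric square root maps the vector of ones onto itself, the transformed perturbations keep a zero mean.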

The ensemble adjustment Kalman filter (EAKF)
The EAKF was introduced by Anderson (2001). Similarly to the SEIK filter and the ETKF we require here that the matrix R −1 is readily available. Using scaled ensemble perturbations as discussed for the ETKF-formulation by Livings (2005) in Equations (48)-(49) we can write We perform the SVD on the scaled forecast ensemble observation perturbation matrixS Note that U A = U T , related to the similarity between the EAKF and the ETKF. We also use an EVD to obtain The decomposition in Equation (55) is usually performed as an SVD of the ensemble perturbation matrix X f , which approximates P f using N e ensemble members. Due to the ranks of the matrices decomposed in Equations (54) and (55) there are at most q = min(N e − 1, N y ) non-zero singular values in A and at most N e −1 non-zero eigenvalues in A . Thus, the matrices in the equations below can be truncated as follows: Then, the ensemble transformation weight matrices in Equations (17)-(19) are given by Note, that the EAKF perturbation weight matrix in Equation (56) is the same as applying the orthogonal matrix instead of the orthogonal matrix U T in the ETKF perturbation transform matrix given by Equation (51) (Tippett et al., 2003). The decomposition in Equation (55) is costly due to the size of the matrix to be decomposed. For this reason, the EAKF is typically applied with serial observation processing as will be described for the EnSRF in Section 5.7.
Algorithm 8 in Appendix 2 gives a pseudo-algorithm of the EAKF method.

The ensemble square root filter (EnSRF)
The EnSRF was introduced by Whitaker and Hamill (2002) to avoid the use of perturbed observations by a square root formulation. In the EnSRF the transform matrix is given by We first perform an EVD of F to obtain its inverse Then, we can write the ensemble analysis covariance as The diagonal matrix holding the singular values is of dimension R ∈ R N e ×N y and has thus at most min(N e , N y ) nonzero singular values. To reduce the computational cost for the case of high dimensional models with N e N y , we can truncate to get the much smaller matrix R ∈ R N e ×min(N e ,N y ) (see Table 1). The square root form for the ensemble analysis perturbations is given by and the ensemble transformation weight matrices needed to form the ensemble analysis mean and analysis perturbations in Equations (17)-(19) are given by where in Equation (61) we have post-multiplied the ensemble analysis perturbations by the orthogonal matrix of the left singular vectors U T R to ensure that the analysis ensemble is unbiased (Livings et al., 2008;Sakov and Oke, 2008).
Algorithm 9 in Appendix 2 gives a pseudo-algorithm of the EnSRF method.

EnSRF with serial observation treatment
The serial observation treatment in the EnSRF was introduced by Whitaker and Hamill (2002) together with the EnSRF assimilating all observations at once. The serial treatment reduces the computing cost. Hence, the EnSRF is typically not applied with the bulk update described above, but with serial treatment of observations, which is possible if R is diagonal. In this case, each single observation can be assimilated separately. Thus, F reduces to the scalar F and SS T to the scalar S 2 . For a single observation (N y = 1), the matrix G R becomes a vector given by All singular values of G R are zero except the first, which is its norm, where e is a vector with N e zero elements except the first, which is one. The first column of U R corresponds to the normalised vector S T The square root of the diagonal matrix in Equation (61) can be written as a sum of the identity matrix and a matrix proportional to ee T : Using Equation (65) and the fact that all columns of U are orthonormal, one obtains and the weight vector for the update of the ensemble mean is The equations above are then applied in a series over each single observation. The equations are likewise valid when the EAKF is formulated with a serial observation treatment.
Algorithm 10 in Appendix 2 gives a pseudo-algorithm of the EnSRF method with serial observation treatment.
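The serial treatment can be sketched as follows, using the scalar form of the update given by Whitaker and Hamill (2002): the ensemble mean is updated with the Kalman gain, and the perturbations with a gain reduced by a factor `alpha`. The sketch assumes a linear observation operator whose rows are stored in `H` and a diagonal $R$ whose diagonal is stored in `r_var`; these names are illustrative, and the loop is not meant to reproduce Algorithm 10 line by line.

```python
import numpy as np

def ensrf_serial(Xf, y, H, r_var):
    """Serial EnSRF: assimilate the observations one at a time (sketch).

    Xf : (Nx, Ne) forecast ensemble   y : (Ny,) observations
    H  : (Ny, Nx) observation matrix  r_var : (Ny,) diagonal of R
    """
    X = Xf.copy()
    Ne = X.shape[1]
    for i in range(len(y)):
        xbar = X.mean(axis=1)
        Xp = X - xbar[:, None]
        s = H[i] @ Xp                                  # obs-space perturbations, (Ne,)
        hph = s @ s / (Ne - 1)                         # scalar H P H^T
        k = Xp @ s / (Ne - 1) / (hph + r_var[i])       # Kalman gain vector
        # Reduced-gain factor for the perturbation update
        alpha = 1.0 / (1.0 + np.sqrt(r_var[i] / (hph + r_var[i])))
        xbar = xbar + k * (y[i] - H[i] @ xbar)         # mean update
        Xp = Xp - alpha * np.outer(k, s)               # perturbation update
        X = xbar[:, None] + Xp
    return X
```

For a linear observation operator and diagonal $R$, the mean and covariance after the loop agree with those of a single bulk Kalman update using the sample forecast covariance, which is a convenient check for an implementation.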

The square root formulation of the stochastic ensemble Kalman filter (SEnKF)
The SEnKF was introduced by Evensen (2003) as a square root formulation of the stochastic EnKF. Defining $Y$ as for the stochastic EnKF and using a matrix we obtain the matrix We could decompose $F_F$ using an EVD but this is costly if $N_y \gg N_e$ (Evensen, 2003). Instead, we assume that forecast and observation errors are uncorrelated, i.e. $SY^T \equiv 0$, so that Now we can use an SVD to decompose which has a much smaller computational cost than decomposing $F_F$ using an EVD when $N_y \gg N_e$. The ensemble transformation is then computed according to Equation (23) with the weight matrix given by Algorithm 11 in Appendix 2 gives a pseudo-algorithm of the EnKF in square root form.

The error-subspace transform Kalman filter (ESTKF)
The ESTKF has been derived from the SEIK filter (Nerger et al., 2012a) by combining the advantages of the SEIK filter and the ETKF. The ESTKF exhibits better properties than the SEIK filter, like a minimum ensemble transformation as the ETKF. However, unlike the ETKF, the ESTKF computes the ensemble transformation in the error subspace spanned by the ensemble rather than using the ensemble representation of it. That is, the error subspace of the dimension N e − 1 is represented directly in the ESTKF (similarly to the SEIK filter) while in the ETKF the error subspace is represented indirectly using the full ensemble of size N e . Similar to the SEIK filter, a projection matrix A K ∈ R N e ×N e −1 is used whose elements are defined by With this projection, the basis vectors for the error subspace are given by As for the matrix in the SEIK filter, the columns of matrix A K are orthonormal and orthogonal to the vector (1, . . . , 1) T . When the matrix L K is computed, the multiplication with A K implicitly subtracts the ensemble mean. Further, A K subtracts a fraction of the last column of X f from all other columns. In this way, the last column of X f is not just dropped as in the SEIK filter, but its information is distributed over the other columns. The product of the square root matrices in the error subspace becomes now By performing the EVD of the symmetric matrix (T K T T K ) −1 = U K K U T K we obtain the symmetric square root Then, the ensemble transformation weight matrices needed to form the ensemble analysis mean and perturbations in Equations (17)-(19) are given by Compared to the SEIK filter, both the matrices A E and are replaced by A K in the ESTKF. In addition, the ESTKF uses the symmetric square root of T K T T K . The use of A K leads to a consistent projection onto the error subspace and back onto the state space, while the symmetric square root ensures that the minimum transformation is obtained. 
It is also possible to apply the ESTKF with a random ensemble transformation. For this case, the rightmost matrix A K in Equation (79) is replaced by a random matrix with the same properties as the deterministic A K .
Algorithm 12 in Appendix 2 gives a pseudo-algorithm of the ESTKF method.
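The properties of the two projection matrices can be checked numerically. The sketch below builds $A_E$ as the identity stacked on a row of zeros minus a matrix with all entries $1/N_e$, as described for the SEIK filter, and builds $A_K$ from the element-wise definition as we recall it from Nerger et al. (2012a). That element-wise form is an assumption here, since it is not reproduced in the text above, and the function names are hypothetical.

```python
import numpy as np

def seik_projection(Ne):
    """A_E: identity stacked on a zero row, minus a matrix with entries 1/Ne."""
    A = np.vstack([np.eye(Ne - 1), np.zeros((1, Ne - 1))])
    return A - 1.0 / Ne

def estkf_projection(Ne):
    """A_K (assumed form): columns orthonormal and orthogonal to (1, ..., 1)^T."""
    a = 1.0 / (Ne * (1.0 / np.sqrt(Ne) + 1.0))
    A = np.full((Ne, Ne - 1), -a)          # off-diagonal entries of the first Ne-1 rows
    A[:Ne - 1, :] += np.eye(Ne - 1)        # diagonal entries become 1 - a
    A[Ne - 1, :] = -1.0 / np.sqrt(Ne)      # last row
    return A

Ne = 5
A_E, A_K = seik_projection(Ne), estkf_projection(Ne)
print(np.allclose(A_E.sum(axis=0), 0.0))           # zero column sums
print(np.allclose(A_K.T @ A_K, np.eye(Ne - 1)))    # orthonormal columns
print(np.allclose(np.ones(Ne) @ A_K, 0.0))         # orthogonal to the ones vector
```

Multiplying an ensemble matrix from the right by `A_E` returns the first $N_e - 1$ ensemble perturbations, whereas `A_K` spreads the information of the last member over all columns.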

The Maximum Likelihood Ensemble Filter (MLEF)
The MLEF (Zupanski, 2005; Zupanski et al., 2008) calculates the state estimate as the maximum of the posterior probability density function (pdf). This is in contrast to the other ensemble Kalman filter methods described in this paper, which are based on the minimum variance approach and thus target the mean. The maximum of the pdf is found by an iterative minimisation of the cost function using a generalised non-linear conjugate-gradient method.
The original MLEF (Zupanski, 2005) uses a second-order Taylor approximation to the analysis increments, which requires that the cost function is twice differentiable. However, this requirement is not necessarily satisfied in many real-life non-linear applications, for example where parameterisations of some processes are used in models, or for strongly non-linear observation operators. Here, we present the revised MLEF by Zupanski et al. (2008), which avoids this requirement by using a non-differentiable minimisation algorithm.
In contrast to all other ensemble filters discussed above, the MLEF maintains the actual state estimate separately from the ensemble, which is used to provide the measurement of estimation error. Thus, the analysis perturbations in the MLEF are computed for each ensemble member in a square root form without recentring them onto the analysis ensemble mean. Hence, this filter does not follow the same square root form as the filters described above and is presented last in this section.
In the MLEF, the ensemble analysis perturbations are defined using the difference between analysis and forecast for each ensemble member, x a j = x a j − x f j and not between ensemble analysis states and the analysis mean. They are found using generalised Hessian preconditioning in state space. A change of variable is performed as follows where the matrix G represents the inverse square root of the generalised Hessian estimated at the initial point of minimisation, and ξ j is a control variable defined in ensemble subspace. Matrix C is the covariance matrix Equation (82) can be written as a transformation of ensemble perturbations by where the elements of the weight vector w M, j ∈ R N e for ensemble member j are given by Now we use an EVD of We note, that in a linear case matrix G 1 2 is a square root of P a . Indeed the same decomposition and inversion was used to find the square root analysis perturbations for the ETKF, see Equation (45).
After successfully accomplishing the Hessian preconditioning, the next step in the iterative minimisation is to calculate the gradient in the ensemble-spanned subspace. The preconditioned generalised gradient at the k-th minimisation iteration is obtained by where Upon convergence we have obtained an optimal state analysis To complete the non-differential formulation of the MLEF, ensemble analysis perturbations are computed as follows Algorithm 13 in Appendix 2 gives a pseudo-algorithm of the MLEF method.

Summary of ensemble Kalman Filter methods
In this section, we have described the ten most popular ensemble Kalman filter methods that are applicable to high-dimensional non-Gaussian problems.
This collection of methods could be categorised in different ways, for example in deterministic ensemble filters, where the analysis is found through explicit mathematical transformations (SEIK, ETKF, EAKF, EnSRF, ESTKF, MLEF), and stochastic ensemble filters, where perturbed forecasted observations are used (EnKF, SEnKF). Burgers et al. (1998) and Houtekamer and Mitchell (1998) showed that in order to maintain sufficient spread in the ensemble and prevent filter divergence, the observations should be treated as random variables, i.e. perturbed, while our interpretation is slightly different, as described above. This stochasticity, of course, leads to extra sampling noise in the filters. On the other hand, Lawson and Hansen (2004) showed that for large ensemble sizes, stochastic filters can handle non-linearity better than the deterministic filters. This is due to the additional Gaussian observation spread normalising the ensemble update in the stochastic filter, which tends to erase the non-Gaussian higher moments non-linear error growth has generated. However, current computational power restricts us to small ensemble sizes for high-dimensional problems, in which case stochastic filters add another source of sampling error thus underestimating the analysis update (Whitaker and Hamill, 2002).
While all ensemble Kalman filter methods use low-rank approximations of the state error covariance matrix, some of the methods in this section are referred to as error-subspace ensemble filters because they operate directly in the error subspace spanned by the ensemble rather than using the ensemble representation of it. Such filters are the SEIK filter (see Section 5.2), the ESTKF (see Section 5.9), and ESSE (see Section 5.3). Nerger et al. (2005b) compare the stochastic EnKF with the SEIK filter in an idealised high-dimensional shallow water model with non-linear evolution, showing that the main difference between the filters lies in the efficiency of the representation of the covariance matrix $P$. In general, the EnKF will require a larger ensemble size $N_e$ to achieve the same performance as the SEIK filter. The relation of the ETKF and SEIK methods has been studied by Nerger et al. (2012a), where the ESTKF was also derived. Apart from computing the ensemble transformation in the error subspace in the case of the SEIK filter and the ESTKF, the three filters are essentially equivalent. However, for the SEIK filter it has been found that the application of the Cholesky decomposition can lead to unevenly distributed variance in the ensemble members. For this reason, the ESTKF and ETKF methods are preferable unless a random matrix is used in the SEIK filter. The methods described in this section each have nuances in which they differ from one another, as well as underlying common ground. Writing these methods in a unified mathematical notation allows us to see these algorithmic differences and commonalities more readily. Many of the filters described above share several variables; for some methods these variables have different sizes, while for others they have not only the same size but also the same value. Table 1 summarises the sizes of the common variables between the methods, and below we comment on whether they have the same value.
Apart from the matrices listed in Table 1, there are the finally resulting weight matrices $\overline{W}$, which updates the ensemble mean, and $W'$, which transforms the ensemble perturbations. The matrix $\overline{W}$ is identical for all filters that use it. Thus, for all filters except the EnKF, SEnKF and MLEF, which do not use this matrix, the mean of the analysis ensemble is identical. The EnKF and SEnKF are an exception because they do not explicitly transform the ensemble mean and introduce sampling error due to the perturbed observations. The maximum likelihood approach of the MLEF also results in a different analysis state estimate if the ensemble distribution is not Gaussian and the observation operator is non-linear. Further, the product $W' W'^T$ is identical for all filters except the EnKF, SEnKF and MLEF. Thus, the analysis covariance matrix $P^a$ will be identical for these filters.
In contrast to the equality of the matrix $\overline{W}$, the matrix $W'$ is different for almost all methods. Thus, while many methods yield the same analysis ensemble covariance matrix, their ensemble perturbations are distinct. In the ETKF and ESTKF methods, the matrices $T$ and $U$ and the matrix of eigenvalues have distinct dimensions. However, the ensemble transformation weight matrices $W'$ of both methods are identical (Nerger et al., 2012a).
In general, the choice of a particular ensemble method depends on a number of the system's components: the dynamical model at hand, the model error, the number and types of observations, and the ensemble size. Given these degrees of freedom, it is not possible to identify one data assimilation method as best suited for a general situation. In practice, operational meteorological applications most widely use the LETKF and the serial formulations of the EnSRF and EAKF, while in oceanography there are many applications of the stochastic EnKF, the SEIK filter, and the ESTKF. There are fewer applications of ESSE and the MLEF, despite the fact that these filters are algorithmically interesting because of the ensemble-size adaptivity of ESSE and the maximum-likelihood solution of the MLEF. From the algorithmic viewpoint, the stochastic EnKF will be useful if the stochasticity can be an advantage and if large ensembles can be used. Further, the SEIK, ETKF and ESTKF filters differ from the EnKF, EnSRF and EAKF also in the application of distinct localisation methods (see Section 9.1 for the discussion of localisation). The EnKF, EnSRF and EAKF allow for localisation in the state space, which could be advantageous for some observation types (Campbell et al., 2010). The serial formulation of the EnSRF and EAKF requires that the observation error covariance matrix is diagonal. Thus, these filters cannot be applied directly if the observation errors are correlated. A transformation into variables that are uncorrelated is possible in theory, but it is most likely not practical for large sets of observations.

Particle filters
In this section we will consider the standard particle filter followed by three efficient variants of the particle filter: the Equivalent Weights Particle Filter (EWPF, van Leeuwen, 2010; van Leeuwen, 2011; Ades and van Leeuwen, 2013), the Implicit Equal-Weights Particle Filter (Zhu et al., 2016) and the Ensemble Transform Particle Filter (Reich, 2013). Other variants of local particle filters are discussed in Section 9 on practical implementation. Another interesting particle filter for high-dimensional systems, the so-called implicit particle filter (Chorin and Tu, 2009; Chorin et al., 2010; Morzfeld et al., 2012; van Leeuwen et al., 2015), is not discussed here as it needs a 4D-Var-like minimisation for each particle. The Multivariate Rank Histogram Filter (MRHF, Metref et al., 2014b), based on the Rank-Histogram Filter of Anderson (2010) that performs well in highly non-Gaussian regimes, has recently been developed in the European project Sangoma. However, it is still under development for high-dimensional systems and its idea is only briefly described in Section 9.1.5. Often particle filters are defined as providing approximations of $p(x^{(0:m)} | y^{(1:m)})$, but we restrict ourselves to particle filters that are approximations of the marginal posterior pdf $p(x^{(m)} | y^{(1:m)})$, as there are at present no efficient algorithms for the former for high-dimensional geophysical systems, and we have forecasting in mind. Furthermore, for ease of presentation we take all earlier observations for granted, so that the marginal posterior at time m is denoted as $p(x^{(m)} | y^{(m)})$.

The standard particle filter
This particle filter is also known as the bootstrap filter or Sequential Importance Resampling (SIR). The probability density function (pdf) in particle filtering, represented by $N_e$ particles or ensemble members at time k, is given by where $x^{(m)} \in \mathbb{R}^{N_x}$ is the $N_x$-dimensional state of the system that has been integrated forward in time using the stochastic forward model and $\delta(x)$ is a Dirac delta function. We let time m be the time of the current set of observations, with the previous observation set at time 0. Then the stochastic forward model for times $0 < k < m$ for each particle $j = 1, \ldots, N_e$ is given by $x_j^{(k)} = M_k(x_j^{(k-1)}) + \beta_j^{(k)}$, where $\beta_j^{(k)} \in \mathbb{R}^{N_x}$ are random terms representing the Gaussian distributed model errors with mean zero and covariance matrix $Q$, and $M_k : \mathbb{R}^{N_x} \to \mathbb{R}^{N_x}$ is the deterministic model from time $k - 1$ to $k$. Thus, the model state transition from time $k - 1$ to $k$ is fully described by the transition density $p(x^{(k)} | x^{(k-1)}) = N(M_k(x^{(k-1)}), Q)$, which will be of later use. Using Bayes' theorem where the weights $w_j^{(m)}$ are given by and each $w_j^{(m-1)}$ is the product of all the weights from all time steps $0 < k \le m - 1$. The conditional pdf $p(y^{(m)} | x^{(m)})$ is the pdf of the observations given the model state $x^{(m)}$, which is often taken to be Gaussian To obtain equal-weight posterior particles one applies resampling, in which particles with high weights are duplicated, while particles with low weights are abandoned. Several schemes have been developed to perform resampling, and three of the most-used schemes are presented in Appendix 1.
The problem in high-dimensional spaces with a large number of independent observations is that these weights vary enormously over the particles, with one particle obtaining a weight close to one, while all the others have a weight very close to zero. This is the so-called degeneracy problem related to the 'curse of dimensionality': any resampling scheme will produce N e copies of the particle with the highest weight, and all variation in the ensemble has disappeared.
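The weighting, resampling and degeneracy described above can be illustrated with a short sketch. It assumes a Gaussian likelihood with diagonal $R$ and a linear observation operator `H`, and uses systematic resampling as one possible scheme (not necessarily identical to those in Appendix 1). The effective sample size $1 / \sum_j w_j^2$ is added here as a common diagnostic of degeneracy; it is not part of the description above, and all function names are illustrative.

```python
import numpy as np

def gaussian_log_likelihood(X, y, H, r_var):
    """log p(y | x_j) up to a constant, for diagonal R; X has shape (Nx, Ne)."""
    resid = y[:, None] - H @ X
    return -0.5 * np.sum(resid**2 / r_var[:, None], axis=0)

def normalised_weights(log_w):
    """Normalise weights given their logarithms."""
    w = np.exp(log_w - log_w.max())       # subtract the maximum for numerical stability
    return w / w.sum()

def effective_sample_size(w):
    """Ne for equal weights, close to 1 for a degenerate ensemble."""
    return 1.0 / np.sum(w**2)

def systematic_resample(w, rng):
    """Return the indices of the particles that are kept (with duplicates)."""
    Ne = len(w)
    positions = (rng.random() + np.arange(Ne)) / Ne
    idx = np.searchsorted(np.cumsum(w), positions)
    return np.minimum(idx, Ne - 1)        # guard against round-off in the cumulative sum

rng = np.random.default_rng(0)
Ne, Nx = 20, 200
X = rng.normal(size=(Nx, Ne))             # prior particles
y = rng.normal(size=Nx)                   # synthetic observations of every variable
for Ny in (2, 200):                       # few versus many independent observations
    H = np.eye(Nx)[:Ny]
    w = normalised_weights(gaussian_log_likelihood(X, y[:Ny], H, np.ones(Ny)))
    print(Ny, effective_sample_size(w))
```

In such an experiment the effective sample size typically collapses towards one as the number of independent observations grows, so that resampling returns many copies of a single particle.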
Hence, as mentioned at the beginning of this section, to apply a particle filter to a high-dimensional system additional information is needed to limit the search space of the filter. One option is to use localisation directly on the standard particle filter. Local particle filters, like the so-called Local Particle Filter (Poterjoy, 2016a) will be discussed in the section on localisation in particle filters. We next discuss the proposal-density particle filters since this technique could be applied to all filters and permits us to achieve equal-weights for the particles in a different way.

Proposal-density particle filters
To avoid that the ensemble degenerates we aim at ensuring that equally significant particles are picked from the posterior density. To do this we have to ensure that all particles end up in the high-probability area of the posterior pdf, and that they have very similar, or even equal, weights. For the former we can use a scheme that pulls the particles towards the observations. Several methods can be used for this, including traditional methods like 4DVar, a variational method, and ensemble Kalman filters and smoothers. However, the main ingredient in efficient particle filters is the step that ensures that the weights of the different particles are close before any resampling step.
We start by writing the prior at time m as follows: Without loss of generality but for simplicity we assume that the particle weights in the ensemble at the previous time step m − 1 are equal, so Using Equation (99) in Equation (98) leads directly to: hence, from Equation (93) the prior can be seen as a mixture density, with each density centred around one of the forecast particles.
One can now multiply the numerator and denominator of Equation (100) by the same factor q x (m) |x is defined as the collection of all particles at time m − 1, and the conditioning on j denotes that each particle does in general has a different parent to start from. This leads to , y (m) is the so-called proposal transition density, or proposal for short, whose support should be equal to or larger than that of p x (m) |x (m−1) . Note that the proposal density as formulated here is slightly more general than the usual q x (m) |x (m−1) j , y (m) through allowing for the explicit dependence on all particles at time m − 1.
Drawing from this density we find for the posterior: where w j are the particle weights given by Using Bayes' theorem, the numerator in the expression for the weights can be expressed as Therefore, the particle weight of ensemble member j can be written as: (m) .
In the so-called optimal proposal density (Doucet et al., 2000) For systems with a large number of independent observations these weights are again degenerate (see, e.g. Snyder et al., 2008;Ades and van Leeuwen, 2013;Snyder et al., 2015).
Several efficient particle filter schemes have been developed that utilise the proposal density to avoid this degeneracy. Here we discuss the Equivalent-Weights Particle Filter (EWPF) and the Implicit Equal-Weights Particle Filter (IEWPF). As mentioned, the Implicit Particle Filter (Chorin et al., 2010), which allows for an extension of the one-time-step optimal proposal particle filter to a full time window, uses a 4DVar-like method on each particle. Since it needs an adjoint of the underlying model, it is not discussed in this paper.
6.2.1. The equivalent-weights particle filter. The EWPF works as follows:
(1) Determine the optimal proposal weight $w_j \propto p(y^{(m)} | x_j^{(m-1)})$ for each particle. Note that these weights vary enormously in high-dimensional systems.
(2) Choose a target weight $w_{target}$ based on these weights that a certain percentage of particles can reach. For instance, if the target weight is set to the lowest of these weights, we keep 100% of the particles. A choice of 50% means that the target weight is set to the median value of these weights.
(3) Calculate the position in state space of each particle such that it has a weight exactly equal to the target weight. This is where the proposal density comes in. Note that some of the particles cannot reach this target weight no matter how we move them, and these are brought back into the ensemble via the resampling in step 5.
(4) Add a small random perturbation to each particle and recalculate its weight.
(5) Resample all particles such that their weights are equal again.
It is in step 3 that we use the fact that the proposal density is dependent on all previous particles, and not just particle j. This step is the main reason for the efficiency of the filter.
As an example, when the error in the model equations is additive Gaussian and the observation operator is linear, an analytical solution can be found for the maximum weight of each particle j or, equivalently, for the minimum of minus the log of that weight, called $\phi_j$:
$$\phi_j = -\log w_j^{(m-1)} + \frac{1}{2} d_j^T (HQH^T + R)^{-1} d_j, \qquad d_j = y^{(m)} - H M(x_j^{(m-1)}).$$
Then a target weight is set from these $\phi_j$'s. The target weight splits the ensemble of particles into two groups: those particles that have a higher optimal proposal weight, and those with a lower optimal proposal weight. The latter are abandoned at this point, and will be regenerated in the resampling step 5. For the retained particles, there is an infinite number of ways to move a particle in state space such that it reaches the target weight. In the EWPF that problem is solved by assuming
$$\hat{x}_j^{(m)} = M(x_j^{(m-1)}) + \alpha_j \Upsilon d_j,$$
in which $\alpha_j$ is a scalar, and $\Upsilon$ is defined as
$$\Upsilon = QH^T (HQH^T + R)^{-1}.$$
Under this assumption the number of solutions is reduced to two, and the two values for $\alpha_j$ are given by
$$\alpha_j = 1 \pm \sqrt{1 - b_j / a_j},$$
in which
$$a_j = \frac{1}{2} d_j^T R^{-1} H \Upsilon d_j$$
and
$$b_j = \frac{1}{2} d_j^T R^{-1} d_j + \log w_{target} - \log w_j^{(m-1)},$$
in which $w_j^{(m-1)}$ is the weight of particle j accumulated over previous time steps, included here for completeness. Note that $w_{target}$ is the target weight selected from the $\phi$'s in Equation (106) (e.g. if we choose to keep 80% of the particles, $-\log(w_{target}) = \phi_j$ with $j = 0.8 N_e$, where $\{\phi_j\}_{j=1,\ldots,N_e}$ is the sorted list of the optimal-proposal values of all particles) and that $\alpha_j = 1$ pushes the particle to its optimal-proposal weight position. The solution resembles the optimal proposal solution in which the deterministic part of the proposal is scaled to ensure equal weights. Also note the resemblance of the deterministic part to the form used in a Kalman filter when we replace Q with the ensemble covariance of the state.
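The linear-Gaussian case of step 3 can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the reference implementation: it assumes the innovations $d_j = y - HM(x_j)$ are already available, takes the cost of a moved particle to be the sum of the observation misfit and the model-error misfit of the deterministic move (the small random perturbation of step 4 is ignored), and uses hypothetical function names (`quad`, `ewpf_alpha`, `cost_after_move`) and arbitrary toy dimensions.

```python
import numpy as np

def quad(v, M):
    """Row-wise quadratic form v_j^T M v_j for an array v of shape (N_e, n)."""
    return np.einsum("ji,ik,jk->j", v, M, v)

def ewpf_alpha(d, H, Q, R, log_w_prev, keep_fraction):
    """Target cost and scaling alpha_j for innovations d_j stored as rows of d."""
    S_inv = np.linalg.inv(H @ Q @ H.T + R)
    R_inv = np.linalg.inv(R)
    K = Q @ H.T @ S_inv                        # plays the role of Upsilon
    phi = -log_w_prev + 0.5 * quad(d, S_inv)   # cost at the optimal-proposal position
    n_keep = max(1, int(round(keep_fraction * phi.size)))
    C = np.sort(phi)[n_keep - 1]               # C = -log(w_target)
    keep = phi <= C                            # particles that can reach the target
    a = 0.5 * quad(d, R_inv @ H @ K)
    b = 0.5 * quad(d, R_inv) - C - log_w_prev
    alpha = np.full(phi.size, np.nan)          # abandoned particles stay NaN
    alpha[keep] = 1.0 - np.sqrt(np.clip(1.0 - b[keep] / a[keep], 0.0, None))
    return phi, C, keep, alpha, K

def cost_after_move(d, H, Q, R, log_w_prev, alpha, K):
    """Minus log weight after the deterministic move M(x_j) + alpha_j K d_j."""
    dx = alpha[:, None] * (d @ K.T)            # displacement from M(x_j)
    resid = d - dx @ H.T                       # remaining observation misfit
    return (-log_w_prev + 0.5 * quad(resid, np.linalg.inv(R))
            + 0.5 * quad(dx, np.linalg.inv(Q)))

# Toy setting (all values are arbitrary illustration choices)
rng = np.random.default_rng(0)
N_e, N_x, N_y = 10, 4, 2
H = np.eye(N_y, N_x)
Q, R = 0.5 * np.eye(N_x), 0.3 * np.eye(N_y)
d = rng.normal(size=(N_e, N_y))
log_w_prev = np.full(N_e, -np.log(N_e))
phi, C, keep, alpha, K = ewpf_alpha(d, H, Q, R, log_w_prev, keep_fraction=0.8)
cost = cost_after_move(d, H, Q, R, log_w_prev, alpha, K)
```

In this sketch every retained particle ends up with the same cost `C`, while setting all `alpha` to one recovers the optimal-proposal costs `phi`. The root with the minus sign is chosen here; the other root would serve equally well for the equal-weight property.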
When the number of independent observations is large the optimal proposal density particle filter is degenerate, meaning that one particle gets a much larger weight than all the others. The EWPF is not degenerate because a set percentage of all particles has a similar weight (before the resampling step). The EWPF does not, however, converge to the posterior pdf for large $N_e$ because of this equivalent-weights construction, in which high-weight particles are moved such that their weight becomes lower, equal to the target weight. So the scheme is biased. However, the large-$N_e$ limit is not that relevant in practice as the affordable number of particles will be low, below say 10,000, and typically of O(20-100). In that setting, the Monte-Carlo error will be substantial, and the bias should be measured against the Monte-Carlo error. As long as the latter is larger than the former the scheme is a valid alternative in high-dimensional systems.
Algorithm 16 in Appendix 2 gives a pseudo-algorithm for the EWPF.
6.2.2. The implicit equal-weights particle filter. This scheme is very similar to that of the EWPF:
(1) Determine the optimal proposal weight $w_j \propto p(y^{(m)} | x_j^{(m-1)})$ for each particle. Note that these weights vary enormously in high-dimensional systems.
(2) Choose a target weight based on these weights that a certain percentage of particles can reach. Typically the target weight is chosen as the minimum of the maximal weights, so that all particles are kept.
(3) Draw a random perturbation vector for each particle, and add this to the particle position that leads to maximal weight. So far the scheme is the same as that used in the optimal proposal density.
(4) Scale each random vector such that each particle will reach the target weight.
(5) Resample the particles such that their weights are equal in case the kept percentage is lower than 100%.
The main difference between this scheme and the EWPF is that in the EWPF we scale the deterministic part of the optimal proposal to reach a target weight, while here we scale the random part of the optimal proposal.
The implicit part of our scheme follows from drawing samples implicitly from a standard Gaussian distributed proposal density $q(\xi)$ instead of the original one $q(x^{(m)} | x^{(m-1)}, y^{(m)})$, as in Chorin and Tu (2009). These two pdfs are related by
$$q(x^{(m)} | x^{(m-1)}, y^{(m)}) = \frac{q(\xi)}{\left\| \frac{dx}{d\xi} \right\|},$$
where $\| dx / d\xi_j \|$ denotes the absolute value of the determinant of the Jacobian matrix of the $R^{N_x} \to R^{N_x}$ transformation $x_j = g_j(\xi_j)$. The transformation $g_j(.)$ is now defined via the following implicit relation between the variables $x_j^{(m)}$ and $\xi_j^{(m)}$:
$$x_j^{(m)} = x_j^a + \alpha_j^{1/2} P^{1/2} \xi_j^{(m)},$$
with $x_j^a$ the mode of $q(x^{(m)} | x_j^{(m-1)}, y^{(m)})$, P a measure of the width of that pdf, and $\alpha_j$ a scalar that depends on $\xi_j^{(m)}$. The $\alpha_j$ are now chosen such that all particles get the same weight $w_{target}$, so the scalar $\alpha_j$ is determined for each particle from
$$w_j(\alpha_j) = w_{target}.$$
This ensures that the filter is not degenerate in systems with arbitrary dimensions and an arbitrary number of independent observations. Because of the target-weight construction the filter does not converge to the correct posterior pdf, and the same discussion as for the EWPF applies here, namely that as long as this bias is smaller than the Monte-Carlo error this filter is a valid candidate for high-dimensional non-linear filtering.
As an example we assume now that observation errors and model errors are Gaussian and that the observation operator $H \in R^{N_y \times N_x}$ is linear. Then we find that where and This leads to a complicated non-linear differential equation for $\alpha_j$ that involves the determinant of P. Since we are interested in high-dimensional problems we consider this equation in the limit of large state dimension $N_x$. In that limit it turns out that we can integrate this equation, leading to the much simpler equation (see Appendix in Zhu et al. (2016)): in which $\gamma_j = \xi_j^T \xi_j$. This equation can be solved approximately with numerical methods, such as Newton's method, but analytical solutions based on the so-called Lambert W function do exist. We do not elaborate on these here.
Algorithm 17 in Appendix 2 gives a pseudo-algorithm for the IEWPF.

Between observations: relaxation steps.
If the system is not observed at every time step, the schemes mentioned above can be used over the time window between observations. No analytical solutions can be obtained in this case so that the solution has to be found iteratively. However, this procedure is rather expensive as it typically involves solving a problem similar to a 4DVar on each particle. Thus, typically simpler schemes are employed between observation times. These schemes will be less efficient, although we can ensure that Bayes' theorem is fulfilled exactly for each particle.
In the following, we demonstrate the use of relaxation between observation times. We use the future observations to relax the particles at time k towards the observations at the next observation time m > k by using, instead of Equation (92), the modified forward model
$$x_j^{(k)} = M_k(x_j^{(k-1)}) + \tilde{\Upsilon} \left( y^{(m)} - H_k(x_j^{(k-1)}) \right) + \tilde{\beta}_j^{(k)},$$
where $\tilde{\beta}_j^{(k)} \in R^{N_x}$ are random terms representing the model error distributed according to a given covariance matrix $\tilde{Q}$, $M_k$ is the same deterministic model as in Equation (92), and $\tilde{\Upsilon}$ is a relaxation matrix given by
$$\tilde{\Upsilon} = \tau^{(k)} \tilde{Q} H_k^T R^{-1}.$$
Here, $\tau^{(k)}$ is a time dependent scalar that determines the strength of the relaxation, $y^{(m)} \in R^{N_y}$ is the vector of $N_y$ observations at time m and $H_k : R^{N_x} \to R^{N_y}$ is the observation operator mapping model space into observation space. Note that the observations $y^{(m)}$ exist at the later time m > k. The modified transition density is now given by
$$q(x_j^{(k)} | x_j^{(k-1)}, y^{(m)}) = N\left( M_k(x_j^{(k-1)}) + \tilde{\Upsilon} ( y^{(m)} - H_k(x_j^{(k-1)}) ), \tilde{Q} \right)$$
and the modified weights $w_j^{(k)}$ are accumulated as
$$w_j^{(k)} \propto w_j^{(k-1)} \frac{p(x_j^{(k)} | x_j^{(k-1)})}{q(x_j^{(k)} | x_j^{(k-1)}, y^{(m)})}.$$
This simple modification of the forward model to include information about future observations using a relaxation term is only consistent with Bayes' theorem when the weights that are introduced by this modification are properly taken into account, and it leads to efficient schemes if it is used in combination with an equal-weight scheme, like the EWPF or the IEWPF. Algorithm 15 in Appendix 2 gives a pseudo-algorithm of the relaxation step used in the EWPF and the IEWPF.
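A single relaxation step with its weight correction can be sketched in Python as follows. This is only an illustration under several assumptions that are not taken from the text: the observation operator is a linear matrix `H`, the proposal noise covariance equals the model error covariance `Q` (so that the Gaussian normalisation constants cancel in the weight ratio), and the relaxation matrix has the form `tau * Q @ H.T @ inv(R)`. The function name `relaxation_step` and the toy model are hypothetical.

```python
import numpy as np

def relaxation_step(x_prev, log_w, model, H, y, Q, R, tau, rng):
    """Advance particles (rows of x_prev) one step, relaxed towards the future observation y."""
    Q_inv = np.linalg.inv(Q)
    upsilon = tau * Q @ H.T @ np.linalg.inv(R)     # assumed form of the relaxation matrix
    nudge = (y - x_prev @ H.T) @ upsilon.T         # relaxation term for each particle
    beta = rng.multivariate_normal(np.zeros(Q.shape[0]), Q, size=x_prev.shape[0])
    x_new = model(x_prev) + nudge + beta
    step = nudge + beta                            # x_new minus the deterministic forecast
    # log p(x_new | x_prev) and log q(x_new | x_prev, y), up to a common constant
    log_p = -0.5 * np.einsum("ji,ik,jk->j", step, Q_inv, step)
    log_q = -0.5 * np.einsum("ji,ik,jk->j", beta, Q_inv, beta)
    return x_new, log_w + log_p - log_q

# Toy usage: linear damping model, two of three variables observed
model = lambda x: 0.9 * x
H = np.eye(2, 3)
Q, R = 0.1 * np.eye(3), 0.2 * np.eye(2)
y = np.array([1.0, -1.0])
x0 = np.random.default_rng(1).normal(size=(5, 3))
log_w0 = np.full(5, -np.log(5))
x_free, lw_free = relaxation_step(x0, log_w0, model, H, y, Q, R, 0.0, np.random.default_rng(2))
x_rel, lw_rel = relaxation_step(x0, log_w0, model, H, y, Q, R, 0.5, np.random.default_rng(2))
```

With `tau = 0` the proposal coincides with the original transition density and the weights remain unchanged; with `tau > 0` the particles are pulled towards the observation and the weights record the price of that pull.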
Note that it would also be possible to use other methods like ensemble smoothers or ensemble 4DVar-like methods to move particles between observations, but we will not elaborate on those here.

The Ensemble Transform Particle Filter (ETPF)
The idea of the Ensemble Transform Particle Filter (Reich, 2013) is to avoid resampling by finding a linear transportation map between the prior and the posterior ensemble such that the prior particles are minimally modified, while ensuring that the posterior particles have equal weight. We write each posterior particle as a linear combination of the prior particles as
$$x_j^a = \sum_{i=1}^{N_e} x_i^f t_{ij},$$
in which we ensure that the particles have the correct mean via
$$\sum_{i=1}^{N_e} t_{ij} = 1, \qquad \sum_{j=1}^{N_e} t_{ij} = N_e w_i.$$
This still leaves most of the $N_e^2$ elements $t_{ij}$ undetermined. These are found by minimising the movement from old to new particles, i.e. by minimising
$$J(T) = \sum_{i,j=1}^{N_e} t_{ij} \left\| x_i^f - x_j^f \right\|^2$$
under the condition that $t_{ij} \geq 0$. The above formulation is an example of an optimal transportation algorithm, see e.g. the review by Chen and Reich in van Leeuwen et al. (2015). This scheme can be combined with any proposal density discussed in the previous section.
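For a scalar state the transport problem has a simple closed-form solution, which makes the structure of the ETPF easy to inspect. The Python sketch below is restricted to this one-dimensional special case, in which the monotone (sorted) coupling is known to minimise a squared-distance cost; the general multivariate case requires a linear-programming solver instead. The function name `etpf_transport_1d` and the example numbers are hypothetical, and the constraints assumed are unit column sums and row sums equal to $N_e w_i$.

```python
import numpy as np

def etpf_transport_1d(x, w):
    """Transport matrix T[i, j] = t_ij for scalar particles x with normalised weights w.

    Uses the north-west corner rule on the sorted particles, which yields the
    monotone coupling (optimal in one dimension for a squared-distance cost).
    """
    n = len(x)
    order = np.argsort(x)
    supply = n * np.asarray(w, dtype=float)[order]   # required row sums N_e * w_i
    demand = np.ones(n)                              # required column sums
    T_sorted = np.zeros((n, n))
    i = j = 0
    while i < n and j < n:
        mass = min(supply[i], demand[j])
        T_sorted[i, j] = mass
        supply[i] -= mass
        demand[j] -= mass
        row_done = supply[i] <= 1e-12
        col_done = demand[j] <= 1e-12
        i += int(row_done)
        j += int(col_done)
    T = np.zeros((n, n))
    T[np.ix_(order, order)] = T_sorted               # undo the sorting
    return T

x = np.array([0.3, -1.2, 2.0, 0.9, -0.1])            # prior particles
w = np.array([0.05, 0.05, 0.5, 0.3, 0.1])            # importance weights
T = etpf_transport_1d(x, w)
x_a = x @ T                                          # x_a[j] = sum_i x[i] * t_ij
```

The equally weighted posterior particles `x_a` have the same mean as the weighted prior ensemble, and for uniform weights the transport matrix reduces to the identity, so that the particles are not moved.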
If the dynamical model is deterministic one needs to add some small random noise to the particles to avoid ensemble collapse. Typically this noise is assumed to be Gaussian with zero mean and covariance $P^a = h^2 P^f$, with $0 < h < 1$ a free parameter. This term is an ad-hoc addition related to inflation in Ensemble Kalman Filters.
Algorithm 14 in Appendix 2 gives a pseudo-algorithm of the ETPF.

Second-order exact ensemble Kalman filters
Several extensions to ensemble Kalman filters have been proposed to overcome the linearity or Gaussianity assumptions. A large number of filters exists that try to bridge an ensemble Kalman filter and particle filter by defining smoothly varying parameters that move the filter between these two extremes based on the degeneracy of the particle filter. In high-dimensional systems, however, all of these filters become ensemble Kalman filters as any particle filter contribution results in complete degeneracy. These filters (not discussed here) will become useful when localisation is applied.
In a non-linear, non-Gaussian case the ensemble Kalman filters will necessarily produce an analysis where the mean and covariance are biased due to the assumption of a Gaussian prior pdf (Lei and Bickel, 2011). Here, we will discuss four ensemble filters that concentrate on getting the first two moments of the posterior distribution correct in non-linear situations. These are the Particle Filter with Gaussian Resampling of Xiong et al. (2006), the Non-linear Ensemble Transform Filter (Tödter and Ahrens, 2015), the Moment-Matching Ensemble Filter (Lei and Bickel, 2011), and the Merging Particle Filter (Nakano et al., 2007).

Particle Filter with Gaussian Resampling (PFGR) and Non-linear Ensemble Transform Filter (NETF)
The Particle Filter with Gaussian Resampling (PFGR, Xiong et al. (2006)) introduced an explicit ensemble transformation matching the mean and covariance matrix. The Non-linear Ensemble Transform Filter (NETF, Tödter and Ahrens, 2015) is a recent reinvention of this algorithm formulated to obtain an ensemble transformation that is analogous to that of the ETKF.
In addition, the NETF was introduced with localisation, so that the filter can be applied to high-dimensional systems (see Section 9.1). The presentation here follows the more modern formulation of the NETF in analogy to the ETKF presented before. As a novel feature, the presented formulation avoids the explicit computation of the analysis state, that is given by the weighted ensemble mean.
The PFGR and the NETF are designed to exactly match the first two moments of the posterior pdf in Bayes theorem without assuming that the prior or likelihood are normally distributed. The forecast ensemble is transformed into an analysis ensemble by applying a weight vector to obtain the analysis mean state and a transform matrix to obtain analysis ensemble perturbations, analogous in form to a square root filter (Equations (17) to (19)).
As in most particle filters, the likelihood weights that arise from Bayes' theorem are used. For normally distributed observation errors, the weight of each member is at first given by
$$w_j \propto \exp\left( -\frac{1}{2} \left( y - H(x_j^f) \right)^T R^{-1} \left( y - H(x_j^f) \right) \right)$$
and then normalised so that the weights sum up to one. Before the weights are computed, the ensemble perturbations should be inflated by an inflation factor γ > 1 as in the ensemble-based Kalman filters (for inflation see Section 9.2). Using the weight vector $w = (w_1, \ldots, w_{N_e})^T$, the transform matrix T is defined through
$$T T^T = N_e \left[ \mathrm{diag}(w) - w w^T \right].$$
Here, diag(w) is a diagonal matrix that contains the weights $w_j$ on the diagonal. The factor $N_e$ was not present in the formulation by Xiong et al. (2006). It was introduced by Tödter and Ahrens (2015) to ensure that the ensemble has the correct analysis variance. As in the ensemble Kalman filters, the eigenvalue decomposition $TT^T = U \Sigma U^T$ yields the ensemble transformation
$$T = U \Sigma^{1/2} U^T.$$
Combining the weight vector and transform matrix as in Equation (19), the analysis ensemble is given by
$$X^a = X^f \left( \bar{W} + T \Lambda \right), \qquad \bar{W} = [w, \ldots, w].$$
Here, $\Lambda$ is a random matrix. Xiong et al. (2006) use a random matrix sampled from a normal distribution with mean zero and standard deviation one. They use $\Lambda$ because in Equation (130) they omit all eigenvalues that are very close to zero and need to restore an ensemble of full size. In contrast, Tödter and Ahrens (2015) use a mean-preserving orthogonal matrix (see Pham, 2001) analogous to that used in the SEIK filter. They also motivate the use of $\Lambda$ by the reduction of ensemble outliers and showed experimentally that the random transformation with mean-preserving properties leads to a more stable data assimilation process.
Note that the transformation in Equation (131) is applied to the ensemble matrix $X^f$ instead of the ensemble perturbation matrix $X'^f$, without subsequent addition of the analysis mean state (see e.g. Equation 19). This is possible because T implicitly subtracts the ensemble mean, while the multiplication of $X^f$ with the weight vector array adds the analysis mean state.
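The second-order exactness of this transformation can be checked numerically with a short Python sketch. It is a simplified illustration, not the algorithm of Appendix 2: inflation and localisation are omitted, the random matrix is replaced by the identity, the symmetric square root of $N_e[\mathrm{diag}(w) - ww^T]$ is used for T, and the analysis covariance is normalised by $N_e$. The function name `netf_analysis` and all dimensions are hypothetical choices.

```python
import numpy as np

def netf_analysis(Xf, HXf, y, R):
    """Analysis ensemble from forecast ensemble Xf (N_x, N_e) and its observed part HXf (N_y, N_e)."""
    N_e = Xf.shape[1]
    innov = y[:, None] - HXf
    # Gaussian likelihood weights, computed in log space for numerical stability
    log_w = -0.5 * np.sum(innov * np.linalg.solve(R, innov), axis=0)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    A = np.diag(w) - np.outer(w, w)
    evals, U = np.linalg.eigh(A)
    evals = np.where(evals > 1e-12, evals, 0.0)     # remove round-off in the null direction
    T = np.sqrt(N_e) * (U * np.sqrt(evals)) @ U.T   # symmetric square root of N_e * A
    W_bar = np.tile(w[:, None], (1, N_e))           # every column equals w
    return Xf @ (W_bar + T), w

rng = np.random.default_rng(3)
N_x, N_e = 6, 8
Xf = rng.normal(size=(N_x, N_e))
H = np.eye(3, N_x)
R = 0.5 * np.eye(3)
y = np.array([0.5, -0.2, 0.1])
Xa, w = netf_analysis(Xf, H @ Xf, y, R)

Xa_pert = Xa - Xa.mean(axis=1, keepdims=True)
Pa_ensemble = Xa_pert @ Xa_pert.T / N_e
Pa_weighted = Xf @ (np.diag(w) - np.outer(w, w)) @ Xf.T
```

In this sketch the mean of `Xa` equals the weighted forecast mean `Xf @ w`, and `Pa_ensemble` matches the weighted covariance `Pa_weighted`, although the transformation is applied to the full ensemble matrix rather than to the perturbations.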
For high-dimensional systems, a localisation of the analysis step is required. It was introduced by Tödter and Ahrens (2015) in analogy to the localisation of the ETKF and SEIK filters (see Section 9.1). Algorithm 18 in Appendix 2 gives a pseudo-algorithm of the PFGR and NETF, and Algorithm 22 shows the computation of the weights for Gaussian observation errors.

Moment-Matching Ensemble Filter (MMEF)
A stochastic algorithm that has second-order correct statistics was developed by Lei and Bickel (2011). In this moment-matching ensemble filter (MMEF) we generate an ensemble of perturbed pseudo-observations, $Y^f$, as in the SEnKF (see Equations 20 and 21) using $H(x_j)$ as variable in the density, so y is fixed. Then the analysis mean for each particle is generated using a corresponding pseudo-observation as follows
$$\bar{x}^a(y_j^f) = \sum_{k=1}^{N_e} w_k(y_j^f) x_k,$$
in which $w_k(y_j^f)$ is given by
$$w_k(y_j^f) = \frac{p(y_j^f | x_k)}{\sum_{l=1}^{N_e} p(y_j^f | x_l)}.$$
Similarly, the analysis mean for actual observations is computed via
$$\bar{x}^a(y) = \sum_{k=1}^{N_e} w_k(y) x_k.$$
Furthermore, we calculate equivalent expressions for covariances for perturbed and actual observations as follows
$$P^a(y_j^f) = \sum_{k=1}^{N_e} w_k(y_j^f) \left( x_k - \bar{x}^a(y_j^f) \right) \left( x_k - \bar{x}^a(y_j^f) \right)^T, \qquad P^a(y) = \sum_{k=1}^{N_e} w_k(y) \left( x_k - \bar{x}^a(y) \right) \left( x_k - \bar{x}^a(y) \right)^T.$$
Then each of the ensemble members or particles is updated via
$$x_j^a = \bar{x}^a(y) + P^a(y)^{1/2} P^a(y_j^f)^{-1/2} \left( x_j - \bar{x}^a(y_j^f) \right).$$
This filter gives the correct posterior mean and covariance in the large-ensemble limit (Lei and Bickel, 2011). To see this, note that $x_j - \bar{x}^a(y_j^f)$ is distributed according to $N(0, P^a(y_j^f))$, so $P^a(y_j^f)^{-1/2} ( x_j - \bar{x}^a(y_j^f) ) \sim N(0, I)$, and hence the distribution of $x_j^a$ is $N(\bar{x}^a(y), P^a(y))$. This filter cannot be used in high-dimensional systems, even when localisation is applied, because it needs the evaluation of several full covariance matrices. However, we can exploit the ensemble perturbations that are used to calculate these covariances, as in all ensemble Kalman filter schemes. The following was not discussed by Lei and Bickel (2011), but is a practical way to make the filter useful in high-dimensional systems.
We can express each covariance matrix $P^a(y_j^f)$ directly in terms of the forecast ensemble as where matrix $T(y_j^f)$ is given by The square root of this matrix is To find the inverse of this matrix we perform an SVD on the prior ensemble matrix and compute also the EVD on the much smaller square matrices $T(y_j^f)$. Using Equations (142) and (143) we find Hence, we can write the update equation of the MMEF as This expression is suitable for high-dimensional applications when the matrices $T(y_j^f)$ are computed with localisation.

Merging Particle Filter (MPF)
The merging particle filter generates several sets of posterior ensembles and merges them via a weighted average to obtain a new set of particles that has the correct mean and covariance but is more robust than the standard particle filter. Specifically, the method draws a set of q ensembles, each of size $N_e$, from the weighted prior ensemble at the resampling step. Denote each ensemble member as $x_{j,i}$ for ensemble member j in ensemble i. Then new merged ensemble members are generated via
$$x_j^a = \sum_{i=1}^{q} \alpha_i x_{j,i}.$$
To ensure that the new ensemble has the correct mean and covariance, the coefficients $\alpha_i$ need to fulfil the two conditions
$$\sum_{i=1}^{q} \alpha_i = 1, \qquad \sum_{i=1}^{q} \alpha_i^2 = 1,$$
where each $\alpha_i$ also has to be a real number.
When q > 3 there is no unique solution for the α's, while for q = 3 we find:
$$\alpha_1 = \frac{3}{4}, \qquad \alpha_2 = \frac{\sqrt{13} + 1}{8}, \qquad \alpha_3 = -\frac{\sqrt{13} - 1}{8}.$$
Although not discussed by Nakano et al. (2007), this scheme will be degenerate for high-dimensional problems. However, we can make the α's space-dependent when q > 3 and then apply localisation.
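The merging step can be illustrated with a few lines of Python. The sketch below assumes q = 3 with the coefficients 3/4, (√13 + 1)/8 and −(√13 − 1)/8, which satisfy both conditions (their sum and the sum of their squares equal one), and uses plain multinomial resampling for the q ensembles; the function name `merge_particles` and the scalar toy problem are hypothetical.

```python
import numpy as np

# Merging coefficients for q = 3: both their sum and their sum of squares equal one
alpha = np.array([0.75, (np.sqrt(13.0) + 1.0) / 8.0, -(np.sqrt(13.0) - 1.0) / 8.0])

def merge_particles(x, w, alpha, rng):
    """Draw len(alpha) resampled ensembles from particles x (N_e, N_x) and merge them."""
    n_ens = x.shape[0]
    idx = rng.choice(n_ens, size=(len(alpha), n_ens), p=w)   # q independent resamplings
    return np.einsum("i,ijk->jk", alpha, x[idx])             # weighted sum over the q draws

# Toy problem: prior N(0, 1), one observation y = 1 with unit error variance
rng = np.random.default_rng(4)
N_e = 20000
x = rng.normal(size=(N_e, 1))
w = np.exp(-0.5 * (x[:, 0] - 1.0) ** 2)
w /= w.sum()
x_merged = merge_particles(x, w, alpha, rng)

mean_w = w @ x[:, 0]                       # weighted (posterior) mean
var_w = w @ (x[:, 0] - mean_w) ** 2        # weighted (posterior) variance
```

Because each merged particle is a combination of three different resampled particles, the merged ensemble contains far fewer identical members than a plainly resampled one, while its mean and variance stay close to `mean_w` and `var_w` up to sampling error.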

Adaptive Gaussian mixture filter
Both ensemble Kalman and Monte Carlo-based techniques discussed in Sections 5 and 6, respectively, have their drawbacks. The Gaussian mixture filter (Anderson and Anderson, 1999;Bengtsson et al., 2003;Hoteit et al., 2008) attempts to avoid these by approximating an arbitrary form of the prior by combining multiple Gaussian priors. This gives it the advantage that both the local Kalman filter type correction step as well as the weighting and resampling step of a particle filter can be applied. This possibility also makes it applicable to highly non-linear and high dimensional systems. In this paper, we discuss the adaptive Gaussian mixture filter developed by Stordal et al. (2011) as a representative scheme, out of all Gaussian mixture filters that have been proposed.
In the Gaussian mixture filter, the prior distribution is approximated by a mixture density (Silverman, 1986) where each ensemble member forms the centre of a Gaussian density function where $N(x_j, \tilde{P})$ denotes a multivariate Gaussian kernel density with ensemble member $x_j$ as mean and covariance matrix $\tilde{P}^f = h^2 P^f$, in which $P^f$ is the covariance of the whole forecast ensemble and h is a bandwidth parameter. Stordal et al. (2011) discuss that the optimal choice of the bandwidth h is $h_{opt} \sim N_e^{-1/5}$ if we are only interested in the marginal properties of the individual components of x, but that it might be beneficial to choose $h > h_{opt}$ to reduce the risk of filter divergence, since the choice of the bandwidth parameter determines the magnitude of the Kalman filter update step. Thus, the parameter h is treated as a design parameter and is defined by the user. Note that each particle represents the mean of a Gaussian kernel and that the uncertainty associated with each particle is given by the covariance of that Gaussian kernel (Stordal et al., 2011).
If the likelihood is Gaussian, the posterior pdf is again a Gaussian mixture, now with pdf
$$p(x | y) = \sum_{j=1}^{N_e} w_j N(x_j^a, \tilde{P}^a).$$
Here, the weights $w_j$ are proportional to $N(y - Hx_j^f, R^a)$ with $\sum_j w_j = 1$ and $R^a = H \tilde{P}^f H^T + R$. So, compared to the particle filter, the covariance used in the weights is inflated with a term $H \tilde{P}^f H^T$, leading to more equal weights. Each mean $x_j^a$ and the covariance matrix $\tilde{P}^a$ are obtained using one of the EnKF variants.
In high-dimensional systems, the covariance matrices are never formed explicitly, and the algorithm in Stordal et al. (2011) cannot be used. Hoteit et al. (2008) used an update based on the SEIK filter (see Section 5.2). For a more modern formulation, we provide here an algorithm based on Stordal et al. (2011) but use an ETKF to avoid the explicit computation of $\tilde{P}$. First, the matrix is generated with $S = HX^f$ similar to Equation (44) This is used to update the mean of each Gaussian kernel by calculating the ETKF update on each of the prior particles as The new centres of the Gaussian mixture densities are now found as A square root of the posterior covariance of each Gaussian mixture density is found by Thus, for Equation (150) we have $\tilde{P}^a = (N_e - 1)^{-1} ZZ^T$, but to evaluate the equation one can use the square root Z, so it is not required to compute $\tilde{P}^a$ explicitly.
Until this point, the algorithm is the standard Gaussian mixture filter. The adaptive part of the filter was introduced by Stordal et al. (2011) and has been demonstrated to avoid filter divergence due to ensemble degeneration. Further, the adaptivity allows us to choose smaller values of the bandwidth parameter h. To stabilise the Gaussian mixture filter, we interpolate the original analysis weights with a uniform weight as
$$w_j^{\alpha} = \alpha w_j + (1 - \alpha) N_e^{-1}.$$
For the adaptivity, α is chosen to be
$$\alpha = \frac{N_{eff}}{N_e},$$
where $N_{eff} = 1 / ( \sum_{l=1}^{N_e} w_l^2 )$ is the effective ensemble size. To avoid ensemble degeneration one can further add a resampling step as in particle filters. It is performed if $N_{eff} < N_c$, with $N_c$ a value that can be chosen freely, for instance $N_c = 0.5 N_e$. The full scheme then becomes:
(a) When $N_{eff} \geq N_c$ no resampling is needed, so the weights are calculated as above and transported with each particle to the next set of observations.
(b) When $N_{eff} < N_c$ we resample according to any of the resampling schemes in Appendix 1. This leads to a new set of states for the centres of the Gaussian mixtures denoted $x_{j,(i)}$, in which j denotes the index of the state for resampling, and i its index after resampling. Note that several of the new states will coincide. To avoid identical samples we draw our final new ensemble from the Gaussian mixtures, as follows
$$x_i^a \sim N(x_{j,(i)}, \tilde{P}^a).$$
Further, we set $\tilde{P}^a = (N_e - 1)^{-1} X'^a (X'^a)^T$, but use this only in factorised form.
Note that in this scheme we never calculate a full state covariance.
It is important to realise what the adaptive part does. Indeed, by construction, the filter is not degenerate, but at the expense of strongly reducing the influence of the observations when α is small. In high dimensional systems with a large number of independent observations localisation is essential to avoid using the scheme as a sum of ensemble Kalman filters only.
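The effect of the adaptive weight interpolation can be seen in a small Python sketch. It assumes the interpolation $w_j^{\alpha} = \alpha w_j + (1 - \alpha)/N_e$ with $\alpha = N_{eff}/N_e$ and $N_{eff} = 1/\sum_l w_l^2$; the function name `adaptive_weights` is hypothetical.

```python
import numpy as np

def adaptive_weights(w):
    """Interpolate normalised weights w with uniform weights, using alpha = N_eff / N_e."""
    w = np.asarray(w, dtype=float)
    n_ens = w.size
    n_eff = 1.0 / np.sum(w ** 2)          # effective ensemble size
    alpha = n_eff / n_ens
    return alpha * w + (1.0 - alpha) / n_ens, alpha, n_eff

# Balanced weights are left untouched (alpha = 1) ...
w_uniform, alpha_uniform, _ = adaptive_weights(np.full(4, 0.25))
# ... while fully degenerate weights are pulled strongly towards uniform (alpha = 1 / N_e)
w_degenerate, alpha_degenerate, n_eff_degenerate = adaptive_weights([1.0, 0.0, 0.0, 0.0])
```

For the fully degenerate case with four members the sketch gives α = 0.25 and interpolated weights (0.4375, 0.1875, 0.1875, 0.1875): the collapse is avoided, but the information carried by the likelihood weights is largely discarded, which is the trade-off described above.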
In the scheme by Bengtsson et al. (2003), the mean of each Gaussian pdf is chosen at random from the ensemble, and the covariance in each Gaussian pdf is estimated from the ensemble members which are local in state space, including a localisation and smoothing step. Since the scheme has not been applied to high-dimensional systems it will not be discussed here.

Practical implementation of the ensemble methods
This section is devoted to issues related to the practical implementation of the ensemble methods. In particular, we address the need for localisation and inflation in some of the presented ensemble methods to counteract the issues arising from ensemble undersampling in large scale problems such as ocean and atmosphere prediction. We also discuss the computational cost of each method as presented and the parallelisation of ensemble data assimilation methods. We will conclude this section with a discussion on the suitability of the ensemble data assimilation methods applied to non-linear dynamical models.

Localisation in EnDA
The success of the EnDA methods is highly dependent on the size of the ensemble being adequate for the system we apply these methods to. Thus, for large scale problems, where the number of state variables is many orders of magnitude larger than the number of ensemble members, ensemble undersampling can cause major problems in EnDA methods: underestimated ensemble variance, filter divergence, and errors in estimated correlations, in particular spurious long-range correlations. In such cases, spatial localisation is a necessary tool to minimise the effect of undersampling.
Localisation damps long-range correlations, e.g. in the ensemble covariance matrix ('covariance localisation', see Section 9.1.2). This damping can be applied to the extent that only correlations over limited distances are kept and long-range correlations are erased in the analysis step. Thus, localisation decouples the analysis update at distant locations in a model grid. The underlying assumption of localisation is that the assimilation problem has in fact a local structure. This means that correlation length scales are much shorter than the extent of the model grid so that only correlations over short distances are relevant, while for long distances the sampling error in the ensemble-estimated covariance matrices dominates (see, e.g. Morzfeld et al., 2017). This seems to be fulfilled for many oceanic and atmospheric applications. For example, Patil et al. (2001) described a locally low dimension for atmospheric dynamics. The success of localised filters in oceanic and atmospheric data assimilation applications also shows that this condition is largely fulfilled, even though it is known that long-range correlations (teleconnections) exist in the atmosphere and ocean. However, if a modelling problem does not have a local structure, if too few observations are available, or if the observations only represent long-range properties of the system, localisation cannot be applied.
Localisation is usually applied either explicitly by considering only observations from a region surrounding the location of the analysis or implicitly by modifying P or R so that observations from beyond a certain distance cannot affect the analysis state. The way in which such localisation is applied is still an active field of research and many variants of localisation schemes have emerged over the last decade. There are two main types of spatial localisation techniques (or simply localisation) that are widely used in ensemble data assimilation: covariance localisation (also termed P- or B-localisation) and observation localisation (also denoted R-localisation). Both methods will be discussed here together with domain localisation, which is required for the application of observation localisation. In addition, a number of adaptive localisation schemes have been developed in recent years. A selection of these schemes is discussed in Section 9.1.4.
In general, all localisation schemes are empirical. While they improve the estimations by ensemble filters, they can disturb balances in the model state (Lorenc, 2003;Kepert, 2009). Further, the interaction of localisation with the serial observation processing usually applied with the EnSRF and EAKF methods can reduce the stability of these filters (Nerger, 2015).

Domain localisation. Domain localisation or local analysis is the oldest localisation technique. For ensemble Kalman filters it was first applied by Houtekamer and Mitchell (1998), but the method was also applied in earlier schemes of optimal interpolation (see Cohn et al., 1998). In domain localisation we only use the ensemble perturbations that belong to the domain $D_\gamma$ in which the analysis correction of the state estimate is computed. For example, this domain can be a vertical column of grid points or a single grid point. Thus, we use a linear transformation $D_\gamma$ to obtain the local ensemble perturbations $x'^f_{j,\gamma} = D_\gamma x'^f_j$, where $j = 1, \ldots, N_e$ and γ indexes the subdomains. To localise, we now only use observations within a specified distance (the localisation radius) around the local domain $D_\gamma$. This defines a local observation domain $\tilde{D}_\gamma$. Using the corresponding linear transformation $\tilde{D}_\gamma$ we can transform the observation error covariance R, the global observation vector y, and the global observation operator H analogously to Equation (162) to their local parts. Thus, we neglect observations that are outside of the domain $\tilde{D}_\gamma$. Then a general local analysis state is computed in the same form as the global analysis, where the weight matrices $\bar{W}_\gamma$ and $W'_\gamma$ are computed using local ensemble forecast perturbations and local observations from the domain $\tilde{D}_\gamma$. For a complete analysis update, a loop over all local analysis domains has to be performed with a local analysis update for each domain.
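The selection of local observations can be sketched in Python for a one-dimensional grid in which every grid point is its own local domain. This is a minimal illustration with hypothetical names (`local_observation_indices`, `restrict_observations`); the restriction to the local observation domain is implemented as index selection, which is one possible realisation of the linear transformation, and the grid, observations and radius are arbitrary.

```python
import numpy as np

def local_observation_indices(domain_pos, obs_pos, radius):
    """For each local domain (here a single grid point), indices of observations within radius."""
    return [np.flatnonzero(np.abs(obs_pos - g) <= radius) for g in domain_pos]

def restrict_observations(y, H, R, idx):
    """Local parts of the observation vector, observation operator and error covariance."""
    return y[idx], H[idx, :], R[np.ix_(idx, idx)]

grid = np.arange(10.0)                      # positions of the grid points
obs_grid_index = np.array([1, 4, 8])        # observed grid points
obs_pos = grid[obs_grid_index]
H = np.zeros((3, 10))
H[np.arange(3), obs_grid_index] = 1.0       # observation operator picking grid values
R = np.diag([0.1, 0.2, 0.3])
y = np.array([1.0, 2.0, 3.0])

local_idx = local_observation_indices(grid, obs_pos, radius=2.0)
y_loc, H_loc, R_loc = restrict_observations(y, H, R, local_idx[3])   # local domain at grid point 3
```

A full local analysis would loop over all entries of `local_idx` and compute one analysis update per local domain, each of which is independent of the others and can therefore be distributed over processors.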
Applying domain localisation allows significant savings in computing time since solving for the analysis update is not performed globally but on much smaller local domains. Accordingly, updates on the smaller scale domains can be done independently and therefore in parallel (Nerger et al., 2006) even if the observation domains overlap. In ensemble-based Kalman filters, domain localisation was used predominantly with filters that use the analysis error covariance matrix for the calculation of the gain like SEIK, ETKF, ESTKF, all discussed in detail in Section 5. In these algorithms, the forecast error covariance matrix is never explicitly computed. Examples of the application of domain localisation can be found, e.g. in Brusdal et al. (2003) and Testut et al. (2003).
Blindly using domain localisation can result in boxed analysis fields if neighbouring local domains are updated using significantly different observation sets. Thus, great care needs to be taken to choose domains so that they overlap sufficiently to produce smooth global analysis fields with minimal increase in computational cost. Today, domain localisation is typically applied with observation localisation (Hunt et al., 2007), which is discussed in Section 9.1.3.

Covariance localisation. Covariance localisation (also termed P-localisation or B-localisation, depending on whether the background covariance matrix is denoted P or B as in variational assimilation schemes) is a localisation method that is directly applied to the ensemble covariance matrix. The ensemble undersampling causes spurious cross-correlations between state variables. As realistic long-range correlations are typically small, the sampling errors are particularly pronounced for long distances. The direct covariance localisation can be used to reduce the long-range correlations in the forecast error covariance and hence damp the spurious correlations. In addition, the rank of the ensemble covariance is increased, giving more degrees of freedom to the analysis update (Whitaker and Hamill, 2002).
Typically, covariance localisation is applied by first forming a correlation matrix C and then taking a Schur product (an element by element matrix multiplication) of this correlation matrix and the forecast error covariance. Thus, given some $P^f$, our localised forecast error covariance will be
$$P^f_{loc} = C \circ P^f.$$
The localisation matrix C is usually formed of correlation functions with compact support similar in shape to a Gaussian function (e.g. Gaspari and Cohn, 1999). Practically, the computation of the covariance matrix $P^f$ can be avoided by applying the localisation matrix to the matrices $P^f H^T$ and $HP^f H^T$ (see Equation (11)). We note that, from all the ensemble-based Kalman filter methods presented in Section 5, covariance localisation can only be applied to the EnSRF and EAKF, since for these methods observations can be processed serially, and in the stochastic EnKF.
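The Schur-product localisation can be illustrated with a small Python sketch on a one-dimensional grid. It uses the fifth-order piecewise rational function of Gaspari and Cohn (1999) with support length 2c; the function name `gaspari_cohn`, the grid, the ensemble size and the value of c are illustrative choices, and the full covariance matrix is formed explicitly only because the example is tiny.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Compactly supported fifth-order correlation function; zero for distances >= 2c."""
    r = np.abs(np.asarray(dist, dtype=float)) / c
    out = np.zeros_like(r)
    near = r <= 1.0
    far = (r > 1.0) & (r < 2.0)
    rn, rf = r[near], r[far]
    out[near] = -0.25 * rn**5 + 0.5 * rn**4 + 0.625 * rn**3 - (5.0 / 3.0) * rn**2 + 1.0
    out[far] = (rf**5 / 12.0 - 0.5 * rf**4 + 0.625 * rf**3 + (5.0 / 3.0) * rf**2
                - 5.0 * rf + 4.0 - 2.0 / (3.0 * rf))
    return out

rng = np.random.default_rng(5)
N_x, N_e = 10, 3
X = rng.normal(size=(N_x, N_e))                      # small forecast ensemble
Xp = X - X.mean(axis=1, keepdims=True)               # ensemble perturbations
Pf = Xp @ Xp.T / (N_e - 1)                           # sample covariance, rank N_e - 1
pos = np.arange(N_x, dtype=float)
C = gaspari_cohn(pos[:, None] - pos[None, :], c=2.0) # localisation matrix
Pf_loc = C * Pf                                      # Schur (element-wise) product
```

The variances on the diagonal are unchanged, covariances at distances of 2c or more are set to zero, and the rank of `Pf_loc` exceeds that of the sample covariance `Pf`, in line with the added degrees of freedom mentioned in the text.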

Observation localisation.
In the case of square root filters, presented in Section 5, the full covariance matrix is never formed. Instead, only the ensemble perturbation matrix X f is calculated at each analysis step. Petrie and Dance (2010) showed that covariance-localisation for square root filters cannot be approximately decomposed into a square root of the correlation matrix ρ, thus covariance localisation cannot be applied. For such filters, e.g. SEIK, ETKF and ESTKF, the observation localisation is a more natural choice and is currently used instead of covariance localisation (Hunt et al., 2007;Miyoshi and Yamane, 2007;Janjić et al., 2011).
Observation localisation is applied by modifying the observation error covariance matrix R. More specifically, one modifies its inverse $R^{-1}$ so that the inverse observation variance decreases to zero with the distance of an observation from an analysis grid point. To be able to define the distance, it is necessary to perform the analysis with the domain localisation method as described in Section 9.1.1. An abrupt cutoff could be obtained by setting the inverse observation variances to zero beyond a given distance. This would be equivalent to the simple domain localisation of Section 9.1.1 and could result in non-smooth analysis updates. For a smooth analysis, Brankart et al. (2003), for example, proposed increasing the observation error variance with increasing distance from the analysis grid point. Hunt et al. (2007) proposed to use a gradual observation localisation in the LETKF acting on $R^{-1}$, which is likewise applicable with the SEIK filter and the ESTKF. In this case, elements of $R^{-1}$ are multiplied by a smoothly decreasing function of distance from the analysis grid point. This modification smoothly reduces the observation influence and excludes observations outside a defined radius by prescribing their error to be infinitely large. As for covariance localisation, the method uses a Schur product as
$$\tilde{R}^{-1} = \tilde{C} \circ R^{-1}.$$
Here, the same correlation function (Gaspari and Cohn, 1999) as for covariance localisation can be used to construct the localisation matrix. However, in contrast to covariance localisation, $\tilde{C}$ is not a correlation matrix as the values on the diagonal of this matrix vary with the distance between the observation and the local analysis domain. Then, the analysis update is computed as in the case of domain localisation, but using the weight-localised matrix $\tilde{R}$. For computational savings we would in practice also discard any observations with zero weight from the analysis computations. Both observation and covariance localisation can lead to similar assimilation results.
In general, the optimal localisation has been found to be a bit larger for covariance localisation than for observation localisation (Greybush et al., 2011). The reason for this difference lies in the different effect of the localisations in the Kalman gain as was explained by Nerger et al. (2012b).
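As an illustration of the observation weighting described above, the following minimal sketch evaluates the fifth-order piecewise rational function of Gaspari and Cohn (1999) and applies it to the diagonal of $R^{-1}$ for one analysis grid point. The function names, the one-dimensional geometry and the assumption of a diagonal $R$ are choices made for this example only and are not taken from any particular implementation.

```python
import numpy as np

def gaspari_cohn(dist, c):
    """Fifth-order piecewise rational function of Gaspari and Cohn (1999).

    dist: distances (any sign); c: length scale. Weights are zero for |dist| >= 2c.
    """
    z = np.abs(np.atleast_1d(np.asarray(dist, dtype=float))) / c
    w = np.zeros_like(z)
    inner = z <= 1.0
    outer = (z > 1.0) & (z < 2.0)
    zi, zo = z[inner], z[outer]
    w[inner] = (-0.25 * zi**5 + 0.5 * zi**4 + 0.625 * zi**3
                - (5.0 / 3.0) * zi**2 + 1.0)
    w[outer] = (zo**5 / 12.0 - 0.5 * zo**4 + 0.625 * zo**3
                + (5.0 / 3.0) * zo**2 - 5.0 * zo + 4.0 - 2.0 / (3.0 * zo))
    return w

def localise_obs_precision(r_inv_diag, obs_pos, grid_pos, c):
    """Weight the inverse observation variances for one analysis grid point.

    Assumes a diagonal R (illustrative). Observations with zero weight are
    discarded, as suggested in the text, and the mask of kept ones is returned.
    """
    weights = gaspari_cohn(obs_pos - grid_pos, c)
    keep = weights > 0.0
    return weights[keep] * r_inv_diag[keep], keep

# Four observations with error variance 0.25, analysis grid point at position 0.
obs_pos = np.array([0.0, 1.0, 2.0, 5.0])
r_inv_diag = np.full(4, 1.0 / 0.25)
r_inv_loc, keep = localise_obs_precision(r_inv_diag, obs_pos, grid_pos=0.0, c=1.5)
```

In this example the observation at distance 5 lies beyond the cutoff 2c = 3 and is dropped, while the remaining inverse variances decrease smoothly with distance.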

Adaptive localisation schemes.
The localisation methods described above are widely used and can be applied without much additional computing cost. However, the optimal localisation radius is a priori unknown and needs to be tuned in numerical experiments. For the tuning one performs several data assimilation experiments with different localisation radii, perhaps over shorter time periods, and selects the radius that results in the smallest estimation errors. Regarding the theoretical understanding of localisation, Kirchgessner et al. (2014) showed for the case of observation localisation, when each grid point is observed, that the optimal localisation radius should be reached when the sum over the observation weights equals the ensemble size. This finding allows for a simple form of adaptivity or a starting point for further tuning. Further, Perianez et al. (2014) showed that both the sampling error in the ensemble covariance matrix and the observation error influence the optimal localisation radius. As the sampling error has the largest influence when the true correlations are small, the dynamically generated correlations also influence the optimal localisation radius (Zhen and Zhang, 2014; Flowerdew, 2015).
To avoid the need for numerical tuning and to better adapt the localisation to the dynamically created correlation structure, several adaptive localisation methods have been developed, which we briefly mention here. A common approach is to damp the spurious correlations that are caused by sampling errors due to the small ensemble size. Anderson (2007) developed a hierarchical localisation method, in which the ensemble is partitioned into sub-ensembles. Then, the sub-ensembles are used to estimate the sampling errors. Bishop and Hodyss (2009) proposed an adaptive localisation method that uses a power of the correlations to damp small correlations and emphasise those correlations that are significant. This method can find correlations even at longer distances. Further, methods have been developed to find empirical localisation functions. In these methods, one attempts to find for a single observation the weight factor that minimises the deviation from a true solution (Anderson, 2012; Lei and Anderson, 2014; Flowerdew, 2015). These methods are typically tuned once based on observation system simulation experiments (OSSEs), in which one knows the true state. When the OSSEs are configured realistically, the obtained localisation functions should be applicable for the assimilation of real observations after the tuning.
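The criterion of Kirchgessner et al. (2014) can be turned into a simple starting value for the radius. The sketch below assumes a one-dimensional, regularly spaced grid on which every grid point is observed, and it uses a linear taper as a stand-in weight function; both are simplifications made for this example, and the function names are hypothetical. The radius is found by bisection so that the sum of the observation weights equals the ensemble size.

```python
import numpy as np

def linear_taper(dist, radius):
    """Illustrative weight function: 1 at zero distance, 0 at and beyond the radius."""
    return np.clip(1.0 - np.abs(dist) / radius, 0.0, None)

def radius_for_weight_sum(n_ens, spacing=1.0, taper=linear_taper,
                          r_lo=1e-6, r_hi=1e4, tol=1e-8):
    """Bisection for the radius at which the summed observation weights equal n_ens.

    Assumes one observation per grid point on a regular 1-D grid and a weight
    sum that grows monotonically with the radius.
    """
    def weight_sum(radius):
        n = int(np.floor(radius / spacing))
        offsets = spacing * np.arange(-n, n + 1)  # observed points within the radius
        return taper(offsets, radius).sum()

    for _ in range(200):
        mid = 0.5 * (r_lo + r_hi)
        if weight_sum(mid) < n_ens:
            r_lo = mid
        else:
            r_hi = mid
        if r_hi - r_lo < tol:
            break
    return 0.5 * (r_lo + r_hi)

radius = radius_for_weight_sum(n_ens=20)
```

For the linear taper with unit grid spacing the weight sum at an integer radius equals that radius, so the search returns a radius of about 20 grid points for 20 members. Other weight functions, such as that of Gaspari and Cohn (1999), can be passed through the `taper` argument.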
The major advantage of the methods proposed so far is that they are able to adaptively specify the localisation function or radius according to the dynamically generated covariance structure. However, the methods still need tuning, which can be even more costly than for the fixed covariance and observation localisation methods. For example, the method by Bishop and Hodyss (2009) requires the specification of an envelope function around the locations that are found by powering the correlations and the number of powers that are computed. Lei and Anderson (2014) also showed that the localisation function can change when it is applied iteratively such that a sufficient number of iterations have to be computed.
Apart from the adaptive localisation methods, further methods like spectral localisation (Buehner and Charron, 2007) and localisation in different variables (i.e. stream function, velocity potential, Kepert, 2006) have been developed. However, none of these methods are yet a standard for operational centres.

9.1.5. Localisation in particle filtering.
Several variants of the Particle Filter that explore localisation have been developed recently, following its success in Ensemble Kalman Filters. An issue with directly localising $R$ or using domain localisation is that the weight of each particle is a global property of the filter (van Leeuwen, 2009). That is, the same particle could have a high weight in one area and low weight in another, making it ambiguous whether this particle should be resampled or not. Keeping parts of a number of particles that all perform well in a certain area of the domain and parts of other particles in other areas of the domain would lead to balance problems between variables and sharp gradients in the fields. In contrast, when performing parameter estimation a smooth variation of parameter values is less likely to cause imbalances in the model variables, and localisation is straightforward, as pioneered by Vossepoel and Van Leeuwen (2006).
Particle filters that use a proposal density, such as the EWPF discussed in Section 6.2.1 indirectly use localisation through the model error covariance matrix Q. This localisation does not explicitly work on the weights but on how the states are updated, because a natural choice is to pre-multiply each update of a particle with that matrix. Since the model error covariance matrix will mainly contain short length-scale correlations related to missing or inaccurate physics at the model grid scale, each point in the state space is only influenced by observations within the radius set by that covariance matrix. In fact, as noted in Section 6.2.1, we do have the freedom to choose this matrix differently from Q, so other choices closer to our needs are possible. This is because the effects of this choice will be taken into account in the computation of the weights of each particle. This has not been explored in any detail in the literature.
Of the full particle filters, the ETPF (Reich, 2013) can easily be localised by taking for each grid point only observations close to that grid point into account and making the transformation matrix space-dependent to ensure smooth transitions between different regions. This can for example be achieved by calculating the transformation matrix in a limited number of grid points and interpolating that matrix between grid points. This would also reduce the number of computations, which would otherwise be prohibitive (see Section 9.4 on computational costs).
The PFGR and the NETF perform an ensemble transformation similar to the ETKF, but with a transform matrix $T$ computed from particle filter weights. Accordingly, observation localisation can be applied to the NETF (Tödter and Ahrens, 2015) by smoothing the weight matrix over space. This can also be applied to the MMPF in the high-dimensional implementation. The MPF can also be localised by making the weights local and using a systematic resampling method like Stochastic Universal Resampling (see Appendix 1). In practice, more might be needed, e.g. the extra averaging as advocated by Penny and Miyoshi (2016) described below.
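To make the resampling step concrete, the following is a minimal sketch of Stochastic Universal Resampling as it is commonly formulated: a single random offset is drawn and equally spaced pointers then select the particles. It is not a transcription of the pseudo-code in Appendix 1, and the function name is chosen for this example. For a localised filter the same routine would be called once per grid point with the local weights.

```python
import numpy as np

def stochastic_universal_resampling(weights, rng):
    """Return indices of the resampled particles.

    Particle i is selected either floor(N * w_i) or ceil(N * w_i) times,
    where N is the number of particles and w_i its normalised weight.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalise the weights
    n = w.size
    pointers = (rng.uniform() + np.arange(n)) / n   # one draw, n equally spaced pointers
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0                 # guard against round-off in the last entry
    return np.searchsorted(cumulative, pointers, side="right")

rng = np.random.default_rng(42)
indices = stochastic_universal_resampling([0.5, 0.25, 0.125, 0.125], rng)
counts = np.bincount(indices, minlength=4)
```

With these weights the first particle is always kept twice and the second once, while exactly one of the two low-weight particles survives, whichever random offset is drawn.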
Several localisation schemes have been proposed and discussed in the review of van Leeuwen (2009) and those will not be repeated here. The most obvious thing to do is to weight and resample locally, and somehow glue the resampled particles together via averaging at the edges between resampled local particles (van Leeuwen, 2003b). Recently, Penny and Miyoshi (2016) used this idea with more extensive averaging, and their scheme runs as follows. First, for each grid point j the observations close to that grid point are found and the weight of each particle i is calculated based on the likelihood of only those observations:
$$w_{i,j} \propto p(y_j | x_i),$$
in which $y_j$ denotes the set of observations within the localisation area. This is followed by resampling via Stochastic Universal Resampling to obtain ensemble members $x^a_{i,j}$ with $i = 1, \ldots, N_e$ for each grid point j. As mentioned before, the issue is that two neighbouring grid points can have different sets of particles, and smoothing is needed to ensure that the posterior ensemble consists of smooth particles. This smoothing is performed for each grid point j for each particle i by averaging over the $N_p$ neighbouring points within the localisation area around grid point j, in which $j_k$ for $k = 1, \ldots, N_p$ denotes the grid point index for those points in the localisation area around grid point j. The resampling via Stochastic Universal Resampling is done such that the weights are sorted before resampling, so that high-weight particles are joined up to reduce spurious gradients. While this scheme does solve the degeneracy problem in simple one-dimensional systems, it is unclear if it will work well in complex systems such as the atmosphere, in which fronts can easily be smoothed out and non-linear balances broken, see e.g. the discussion in van Leeuwen (2009).
A new scheme has recently been proposed in Poterjoy (2016a), which involves a very careful process of ensuring smooth posterior particles and retaining non-linear relations. The filter processes each observation sequentially, as follows. First, adapted weights are calculated for the first element $y_1$ of the observation vector, as
$$\tilde{w}_i = \alpha\, p(y_1 | x_i) + 1 - \alpha.$$
These weights are then normalised by their sum $\tilde{W}$. Then we resample the ensemble according to these normalised weights to form particles $x^k_i$.
Here, α is an important parameter in this scheme, with α = 1 leading to standard weighting, and α = 0 leading to all weights being equal to 1. Its importance lies in the fact that the weights are always larger than 1 − α, so even a value close to 1, say α = 0.99, leads to a minimum weight of 0.01. That might seem small, but it means that particles that are more than 1.7 observational standard deviations away from the observations have their weights cut off to something close to 1 − α. This seriously limits the influence the observation can have on the ensemble. Furthermore, the influence of α does depend on the size of the observational error, which is perhaps not what one would like. It is included to avoid losing any particle. Now, we do the following for each grid point j. For each member i we calculate a weight
$$\tilde{w}_{i,j} = \alpha\, \rho(y_1, x_j, r)\, p(y_1 | x_i) + 1 - \alpha\, \rho(y_1, x_j, r),$$
in which ρ(.) is the localisation function with localisation radius r. The normalised weights for this grid point, $w_{i,j}$, are obtained by dividing $\tilde{w}_{i,j}$ by the summed weights over all the particles. Note, again, the role played by α. Then, the posterior mean for this observation at this grid point is calculated as
$$\bar{x}_j = \sum_{i=1}^{N_e} w_{i,j}\, x_{i,j},$$
in which $x_{i,j}$ is grid point j of particle i. Next, a number of scalars, $c_j$, $r_{1,j}$ and $r_{2,j}$, are calculated that ensure smooth posterior fields (see Poterjoy, 2016a, for their definitions), so that the final estimate becomes
$$x^a_{i,j} = \bar{x}_j + r_{1,j}\,(x^k_{i,j} - \bar{x}_j) + r_{2,j}\,(x_{i,j} - \bar{x}_j),$$
in which $x^k_{i,j}$ is grid point j of the resampled particle. This procedure is followed for each grid point so that at the end we have an updated set of particles that have incorporated the first observation. As a next step the whole process is repeated for the next observation, with a small change that $\tilde{w}_i$ is multiplied by $\tilde{w}_i$ from the previous observation, until all observations have been assimilated. In this way, the full weight of all observations is accumulated in the algorithm. Now the importance of α comes to full light: without α the ensemble would collapse because the $\tilde{w}$'s would be degenerate when observations are accumulated.
The final estimate shows that each particle at grid point j is the posterior mean at that point plus a contribution from the deviation of the posterior resampled particle from that mean and a contribution from the deviation of the prior particle from that mean. So each particle is a mixture of posterior and prior particles, and departures from the prior are suppressed. When α = 1, so for a full particle filter, we find for grid points at the observation locations that $c_j = 0$ because $\rho(y_1, x_j, r) = 1$ there. Accordingly, $r_{2,j} = 0$ and $r_{1,j} \approx 1$, and indeed the scheme gives back the full particle filter.
Between observation locations it can be shown that the particles have the correct first- and second-order moments, but higher-order moments are not conserved. To remedy this a probabilistic correction is applied at each grid point as follows. The prior particles are dressed by Gaussians with width 1 and weighted by the likelihood weights to generate the correct posterior pdf. The posterior particles are dressed in the same way, each with weight $1/N_e$. Then the cumulative distribution functions (cdf's) for the two densities are calculated using a trapezoidal rule integration. A cubic spline is used to find the prior cdf values at each prior particle i, denoted by cdf(i). Then a cubic spline is fitted to the other cdf, and the posterior particle i is found as the inverse of its cdf at value cdf(i). See Poterjoy (2016a) for details. The result of this procedure is that higher-order moments are brought back into the ensemble between observed points. This scheme, although rather complicated, is the only local particle filter scheme that has been applied to high-dimensional geophysical systems based on primitive equations, in Poterjoy and Anderson (2016b). (van Leeuwen (2003b) applied a local particle filter to a high-dimensional quasi-geostrophic system, but that system is quite robust to sharp gradients as it does not allow gravity waves.)
Another interesting local particle filter is the Multivariate Rank Histogram Filter (Metref et al., 2014a). The idea is to write the posterior pdf in terms of an observed marginal multiplied by a set of conditional pdfs. For example, for a 3-dimensional system in which variable $x_1$ is observed we have:
$$p(x_1, x_2, x_3 | y) = p(x_1 | y)\, p(x_2 | x_1)\, p(x_3 | x_1, x_2).$$
The filter now uses the rank-histogram idea of Anderson (2010) on each component, resulting in a fully non-Gaussian update of each component. Localisation can be easily applied directly in this algorithm as it is a transformation algorithm and the transformation can be made local.
Unfortunately, this procedure becomes too expensive when the system is high dimensional. However, via a so-called mean-field approximation we suppress the conditioning on non-observed variables, so that we find:
$$p(x_1, x_2, x_3 | y) \approx p(x_1 | y)\, p(x_2 | x_1)\, p(x_3 | x_1).$$
This will make the algorithm parallelisable and suitable for high-dimensional applications, although that has not been explored yet.
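The effect of α on the weights can be checked numerically. The sketch below assumes that the adapted weight has the form α p(y|x) + 1 − α, which is consistent with the properties stated in the text (standard weighting for α = 1, unit weights for α = 0, and a lower bound of 1 − α), and it assumes an unnormalised Gaussian likelihood. The function names and the use of the effective sample size as a diagnostic are choices made for this example.

```python
import numpy as np

def adapted_weights(departures, alpha):
    """Adapted weights for one observation.

    departures: (y - H(x_i)) / sigma_obs for each particle i.
    Assumed form: alpha * likelihood + 1 - alpha, with an unnormalised
    Gaussian likelihood (illustrative assumption).
    """
    likelihood = np.exp(-0.5 * np.asarray(departures, dtype=float) ** 2)
    return alpha * likelihood + 1.0 - alpha

def effective_sample_size(weights):
    """Effective ensemble size 1 / sum(w_i^2) of the normalised weights."""
    w = weights / weights.sum()
    return 1.0 / np.sum(w ** 2)

# Particles between 0 and 5 observational standard deviations from the observation.
departures = np.linspace(0.0, 5.0, 11)
w_standard = adapted_weights(departures, alpha=1.0)
w_adapted = adapted_weights(departures, alpha=0.9)
ess_standard = effective_sample_size(w_standard)
ess_adapted = effective_sample_size(w_adapted)
```

Under these assumptions no adapted weight falls below 1 − α, so the effective sample size for α < 1 is larger than for standard weighting, at the price of a reduced influence of the observation.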

Ensemble covariance inflation
In practice, an ensemble Kalman filter can diverge from the truth due to systematic underestimation of the error variances in the filter, possibly caused by model errors or ensemble undersampling as discussed in Section 9.1. In particular, overestimating long-range correlations will reduce the estimated variance too strongly. Regardless of the cause, underestimating the uncertainty leads to a filter that is overly confident in the state estimate. Thus, the analysis step of the filter puts increasingly more weight on the ensemble background estimate than on the observations and, at some point, it disregards observations completely. Localisation is one method to reduce the undersampling. However, for high-dimensional systems, localisation alone is not sufficient to ensure a stable assimilation process and covariance inflation is applied to further increase the sampled variance and thus stabilise the filter. In addition, the inflation can partly account for model error in case of an imperfect model (Pham et al., 1998b; Hamill, 2001; Anderson, 2001; Whitaker and Hamill, 2002; Hunt et al., 2007).
Most common is a fixed multiplicative covariance inflation (Anderson and Anderson, 1999). The method uses the inflation factor r to perform a multiplicative inflation for each ensemble member $x^{a,f}_j$. With $j = 1, \ldots, N_e$ being ensemble member indices, it is given by
$$x'_j = \bar{x} + r\,(x_j - \bar{x}), \qquad (179)$$
where $x_j$ stands for either the analysis or the forecast member, $\bar{x}$ is the corresponding ensemble mean, and r usually is chosen to be slightly greater than one. The specification of an optimal inflation factor may vary according to the size of the ensemble (Hamill, 2001; Whitaker and Hamill, 2002) and the choice of r will depend on various factors, such as dynamics of the model, type of the ensemble filter used as well as the length scale of covariance localisation. Related to covariance inflation is the so-called 'forgetting factor' ρ introduced by Pham et al. (1998b). The forgetting factor is usually chosen to be slightly lower than one and is typically applied in the square root filters like the ETKF, SEIK and ESTKF. For example, in the ETKF it is applied to $TT^T$, e.g. in Equation (44), by multiplying the identity term inside the inverse that defines $TT^T$ by ρ (Equation (180)). In this way, the inflation and forgetting factors are related as $\rho = r^{-2}$. Equation (180) allows one to apply inflation in a computationally very efficient way because $TT^T$ is much smaller than the ensemble states to which the inflation is applied in Equation (179). Next to the multiplicative inflation, an additive inflation has been proposed. The multiplicative inflation leads to an inflation that is relative to the variance level. Thus, large variances will be inflated much more than small variances. This behaviour can be avoided with additive inflation (Ott et al., 2004), which can also be applied in combination with the multiplicative inflation. In additive inflation, all variances are inflated by the same amount, rather than by a relative factor. This difference can be useful if the variances vary strongly, as in this case the additive inflation acts more strongly on the very small variances.
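The relation ρ = r⁻² between the inflation factor and the forgetting factor can be verified numerically. The sketch below makes two assumptions that are not taken from the equations of this paper: the ensemble perturbations are scaled by 1/√(N_e − 1), and the analysis covariance is written as X (I + SᵀR⁻¹S)⁻¹ Xᵀ with S = HX and a linear observation operator H. All function and variable names are illustrative.

```python
import numpy as np

def inflate(ensemble, r):
    """Multiplicative inflation: scale the perturbations about the mean by r."""
    mean = ensemble.mean(axis=1, keepdims=True)
    return mean + r * (ensemble - mean)

def scaled_perturbations(ensemble):
    """Perturbation matrix scaled by 1/sqrt(N_e - 1) (assumed normalisation)."""
    n_ens = ensemble.shape[1]
    mean = ensemble.mean(axis=1, keepdims=True)
    return (ensemble - mean) / np.sqrt(n_ens - 1)

def analysis_covariance(X, H, R_inv, identity_factor=1.0):
    """P^a = X (identity_factor * I + S^T R^-1 S)^-1 X^T with S = H X."""
    S = H @ X
    n_ens = X.shape[1]
    A = np.linalg.inv(identity_factor * np.eye(n_ens) + S.T @ R_inv @ S)
    return X @ A @ X.T

rng = np.random.default_rng(0)
ens = rng.normal(size=(6, 5))          # 6 state variables, 5 members
H = rng.normal(size=(3, 6))            # linear observation operator (illustrative)
R_inv = np.eye(3) / 0.5
r = 1.05
rho = r ** -2

# Route 1: inflate the ensemble members, then compute the analysis covariance.
Pa_inflated = analysis_covariance(scaled_perturbations(inflate(ens, r)), H, R_inv)
# Route 2: leave the ensemble untouched and scale the identity term by rho.
Pa_forgetting = analysis_covariance(scaled_perturbations(ens), H, R_inv,
                                    identity_factor=rho)
```

Both routes give the same analysis covariance under these assumptions, but the second one only modifies a matrix of size N_e × N_e instead of all ensemble states.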
The optimal strength of the inflation is usually determined by tuning experiments, i.e. running experiments with different inflation values and analysing which value results in the smallest estimation errors. Usually a single fixed value of r or ρ is chosen for all grid points. This situation is mainly motivated by the fact that a manual tuning of spatially varying inflations is not feasible for high-dimensional models. To avoid the tuning, several adaptive inflation methods have been proposed. Brankart et al. (2003) proposed to use the relation to estimate a temporally variable forgetting factor ρ for multiplicative inflation. This equation is one of the statistical consistency relations in observation space that Kalman filters should fulfil (Desroziers et al., 2005). Further, Anderson (2009) proposed a method to adaptively estimate spatially and temporally varying inflation factors. This method also aims to fulfil Equation (181) but uses Bayesian estimation to obtain the inflation values. All of these adaptive methods do assume that we have a very good knowledge of the error covariance of the observations. Apart from adaptively inflating the ensemble spread, adaptive inflation of observation errors has been proposed by Minamide and Zhang (2017) for assimilating all-sky satellite brightness temperatures. An alternative to the inflation can be to explicitly account for the sampling error caused by the finite ensemble size as is done in the finite-size ensemble transform Kalman filter (Bocquet, 2011). This method, while still denoted 'Kalman filter', requires the iterative minimisation of a cost functional and is hence distinct from the Ensemble Kalman filter variants in Section 5, which compute a one-step analysis update.

Parallelisation of EnDA
The need to integrate an ensemble of model states leads to large computational costs, because instead of computing a single model integration as in normal modelling applications an ensemble of O(10-100) members has to be propagated. To reduce the time to perform the costly computations one can apply parallelisation of the data assimilation program and then use high-performance computers with a large number of processors to perform the computations. The ensemble integrations as the most costly part of the computations can be easily parallelised. In fact, the integration of each ensemble state is independent from the other states. Thus, this step could be parallelised by simply starting the numerical model N e times. Each model state has to be initialised from a different restart file and one has to store the final state of each model integration to keep the information on the forecast ensemble. Subsequently to the ensemble forecasts, one starts the data assimilation program, which reads the ensemble information from the files, computes the analysis step, and writes a set of new restart files to prepare the next forecast phase. The computations of the analysis step can also be parallelised as is outlined below. This implementation scheme of data assimilation can be termed 'offline coupling' (Nerger and Hiller, 2013). While being flexible, the frequent writing and reading of the large files holding the ensemble states can take a significant amount of time.
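The independence of the ensemble integrations can be illustrated with a small sketch. Here a toy Lorenz-63 model integrated with a forward Euler scheme stands in for the full numerical model, and a thread pool stands in for the separate model executables or MPI tasks that would be used in practice; the function names, step size and number of workers are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def forecast_member(state, n_steps=100, dt=0.01):
    """Integrate one ensemble member with a toy Lorenz-63 model (forward Euler)."""
    x, y, z = state
    for _ in range(n_steps):
        dx = 10.0 * (y - x)
        dy = x * (28.0 - z) - y
        dz = x * y - (8.0 / 3.0) * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    return np.array([x, y, z])

def forecast_ensemble(ensemble, max_workers=4):
    """Propagate all members concurrently; ensemble has shape (n_ens, 3).

    Each member is integrated independently, so no communication is needed
    until the forecast ensemble is collected for the analysis step.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return np.array(list(pool.map(forecast_member, ensemble)))

rng = np.random.default_rng(3)
initial = np.array([1.0, 1.0, 20.0]) + 0.1 * rng.normal(size=(8, 3))
forecast_parallel = forecast_ensemble(initial)
forecast_serial = np.array([forecast_member(member) for member in initial])
```

The concurrent and the serial loops give identical forecast ensembles, which reflects that the order and placement of the member integrations do not matter.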
A more sophisticated parallelisation of the ensemble data assimilation problem with a high-dimensional ocean model was discussed by Keppenne and Rienecker (2002) and Keppenne and Rienecker (2003). This method applied a domain-decomposition to the model and then integrated several ensemble states concurrently. The forecast ensemble was then collected by the use of the parallelisation technique SHMEM, which was also used for exchanging data in between processors during the analysis step of the EnKF applied in this study. Keeping the analysis step and the ensemble forecasts within one program reduced the overall computing time because the writing and reading of model state files is reduced.
The analysis step of the ensemble filters can also be parallelised using parallelisation methods like the Message Passing Interface (MPI, Gropp et al., 1994). The parallelisation differs depending on whether localisation is used and on which of the filters is used. For the filter methods that assimilate all observations at once (in contrast to the serial observation processing of the EAKF and EnSRF), using the domain-decomposition of a model was found to be more efficient than using ensembles which are distributed over several processors, because the amount of data that has to be exchanged using MPI is smaller for domain-decomposition (Nerger et al., 2005a).
For the ensemble Kalman filters with domain localisation, the local analysis update is independent for each local domain. Thus, this part is naturally parallel and can be distributed with MPI, the shared-memory standard OpenMP, or a combination of both. However, because the observation domains have a larger spatial extent they can reach into the grid domain held by neighbouring processors. The local analysis step needs the difference (innovation) between the observation and the corresponding part of the observed state vector. These differences need to be first computed by the processor that holds the sub-domain and then exchanged in between the different processors computing the analysis step. This computation of the observation innovations and their exchange using MPI is only required once before the loop over all local analysis domains can be computed in parallel (Nerger and Hiller, 2013). The cost for these operations depends on the total number of observations and on their distribution over the model grid. For many observations this can limit the parallel speedup of the analysis update as was shown for the localised SEIK filter by Nerger and Hiller (2013).
The EAKF and EnSRF are typically applied with serial observation processing and covariance localisation. In this case, the parallelisation of the analysis step has to take into account that for each assimilated observation the full model state has to be updated. Hence, also the innovation differences between the not yet assimilated observations and the corresponding observed model state change after each update. Anderson (2007) proposed to let each processor separately update the innovations so that the required parallel communication is limited. This parallelisation does not take the localisation into account. Taking into account that the localisation results in a limited reach of the observation influence, Wang et al. (2013) proposed another parallelisation strategy.
The analysis step of the ensemble Kalman filters requires only the model states. This allows for a generic coupling between the model and the analysis step. In particular, one can implement filter algorithms such that they can be coupled in the same way with different models. This allows one to build generic frameworks for ensemble data assimilation (Nerger et al., 2005a;Nerger and Hiller, 2013;Browne and Wilson, 2015). In the generic form, the ensemble forecast can still be computed by concurrent parallel model forecasts. The transfer of the forecast state information can then be performed either directly in memory by subroutine calls (Nerger and Hiller, 2013) or by parallel communication using MPI (Nerger, 2004;Browne and Wilson, 2015). These strategies allow a tight 'online' coupling of the model and the data assimilation code that computes the analysis updates. The coupling can be achieved with minimal changes in the model code.
For the implementation of the EWPF and IEWPF different parallelisation schemes are applicable for the computations at each nudging step in between observations (Equations (120) and (123)) and at observation time for the EWPF between Equations (106) to (111) and for the IEWPF between Equations (112) to (114). Before the observation time, the computations for the random forcing $\tilde{\beta}^{(m)}_j$ in Equation (120) are independent for each particle since a different forcing is drawn from the covariance for each of them. Similarly, the nudging term in Equation (120) and the update of the weights are independent for each particle. Thus, these operations can be performed in parallel and there is no need to gather all particles on a single process. The computation of the matrix ϒ in Equations (108) and (117) is computationally the most expensive part. When the observation operator does not change over time, this matrix can be precomputed before beginning the assimilation. The downside of this approach is that this matrix can be huge and requires a lot of memory if the state dimension and number of observations are large. Otherwise, since the same matrix is used by all particles, it is possible to distribute the computation to all processes allocated for the particles, e.g. using a parallel matrix solver. At observation time, most of the computations are again independent for all particles. Only the maximum weight obtained from Equations (106) and (118) for EWPF and IEWPF, respectively, must be exchanged over all processes holding particles, so that the target weight $w_{target}$ can be computed. Further parallelisation, e.g. to use the domain decomposition of the model, might also be possible. However, the matrix $Q$ is frequently implemented in form of operators. As the parallelisation is always dependent on the particular implementation of the matrix $Q$, it cannot be generalised for all models.

Computational cost
In Section 5, we presented various ensemble-based Kalman filter methods in a clean mathematical way for ease of comparison and clarity. In Appendix 2, we give practical and precise pseudo-algorithms on how to implement each method. Providing detailed operation counts for all the ensemble methods presented in this paper would be too lengthy but, more importantly, the actual operation count would depend on many details such as operators $H$ and $R$, which are case specific for the model and observations. The operation counts provided here have been obtained by counting them in the pseudo codes in Appendix 2.
Generally, the leading order of operation counts in the different filters are those that scale with third order in any of the dimensions $N_x$, $N_y$, and $N_e$. For the SEIK, ETKF, ESTKF, and the EAKF methods, the leading order of operation count is $O(N_y N_e^2 + N_e^3 + N_x N_e^2)$ if the observation error covariance matrix is diagonal. The main cost is the update of the ensemble by multiplying with the weight matrix in Equation (19), which has a complexity of $O(N_x N_e^2)$. Computing the matrix $TT^T$, e.g. in Equation (25), involves multiplications of matrix $S$ with $R^{-1}$, which has a complexity of $O(N_y N_e^2)$. Finally, computing the transform matrix $T$ by a Cholesky decomposition or an EVD has a computational complexity of $O(N_e^3)$, because the decomposed matrix has size $N_e \times N_e$. While the leading order of operation counts is identical for all four filters, the SEIK and ESTKF are in general computationally faster than the ETKF, or the bulk formulation of the EAKF, due to details in the algorithms. For the EAKF, computing the SVDs of the matrices $X^f$ and $\tilde{S}$, whose costs scale with $O(N_x N_e^2)$ and $O(N_y N_e^2)$, respectively, increases the computing time without changing the leading order of operation counts. Thus, the leading order of operation count does not reflect the computing speed.
The serial observation handling that is usually applied in the EAKF and EnSRF leads to an operation count of $O(N_y N_x N_e)$ in the leading order. Because only the ensemble updates are of third order complexity in the serial update, it can be faster than the bulk updates that assimilate all observations at once. This is even the case when localisation is used. However, in combination with localisation, the stability of the serial formulations can be deteriorated (Nerger, 2015).
The leading order operation count of the stochastic EnKF with perturbed forecasted observations is $O(N_y N_e^2 + N_y^2 N_e + N_y^3 + N_x N_e^2)$. Here again the ensemble update, which scales as $O(N_x N_e^2)$, is usually the most costly operation. However, the EnKF is usually more costly than the filters mentioned before because of the inversion of the $N_y \times N_y$ matrix F F (Equation (69)), which has a complexity of $O(N_y^3)$. Parallelising this inversion can help to reduce the computing time. A computing cost of $O(N_y^3)$ also occurs for the bulk formulation of the EnSRF due to the EVD computed in Equation (58).
When localisation is used, the change in the cost compared to the global formulation depends on the localisation method used. For covariance localisation (Section 9.1.2), the cost for computing the weights in matrix $C$ and for applying it to $P^f$ or the matrices $P^f H^T$ and $H P^f H^T$ is added. For observation localisation (Section 9.1.3), the cost to compute the analysis for a local analysis domain with the bulk update methods is $O(N_{y,\gamma} N_e^2 + N_e^3 + N_{x,\gamma} N_e^2)$, where $N_{y,\gamma}$ is the number of local observations and $N_{x,\gamma}$ is the size of a local state vector that is corrected. Because both $N_{y,\gamma}$ and $N_{x,\gamma}$ are usually much smaller than the global dimensions $N_y$ and $N_x$, a single local analysis update is cheaper than the global update. However, the local analysis update has to be computed for each local analysis domain. Thus, the cost for the analysis with observation localisation is usually significantly higher than the global analysis. However, the local analysis can be easily parallelised to reduce the computing time as was described in Section 9.3.
The computing cost in the ETKF can be reduced using a projection matrix $A$ analogous to the SEIK and ESTKF methods. For the ETKF, this projection is a square matrix with diagonal entries of $1 - 1/N_e$ and off-diagonal entries of $-1/N_e$. The advantage of using $A$ is that one can avoid the explicit computation of X in favour of applying $A$ to smaller matrices when evaluating the analysis equations (Nerger et al., 2012a).
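The leading-order counts quoted above can be compared for a given configuration with a few lines of code. The sketch below only evaluates the scaling expressions from the text, without any prefactors, so the numbers indicate orders of magnitude rather than run times. The function names and the example dimensions (one local analysis domain per grid point with 100 local observations each) are assumptions made for this illustration.

```python
def bulk_sqrt_filter_cost(n_x, n_y, n_e):
    """Leading-order count for SEIK, ETKF, ESTKF and bulk EAKF with diagonal R."""
    return n_y * n_e**2 + n_e**3 + n_x * n_e**2

def stochastic_enkf_cost(n_x, n_y, n_e):
    """Leading-order count for the stochastic EnKF, including the N_y^3 inversion."""
    return n_y * n_e**2 + n_y**2 * n_e + n_y**3 + n_x * n_e**2

def localised_cost(n_domains, n_x_local, n_y_local, n_e):
    """Observation localisation: one bulk update per local analysis domain."""
    return n_domains * bulk_sqrt_filter_cost(n_x_local, n_y_local, n_e)

# Illustrative dimensions: 10^6 state variables, 10^4 observations, 50 members.
n_x, n_y, n_e = 10**6, 10**4, 50
cost_global = bulk_sqrt_filter_cost(n_x, n_y, n_e)
cost_enkf = stochastic_enkf_cost(n_x, n_y, n_e)
cost_local = localised_cost(n_domains=n_x, n_x_local=1, n_y_local=100, n_e=n_e)
```

For these dimensions the global square root update needs about 2.5 × 10^9 operations, the stochastic EnKF about 10^12 because of the N_y^3 term, and the localised analysis about 3.8 × 10^11 in total, although each of its 10^6 local updates is cheap and independent of the others.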
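The saving from the projection matrix can be seen in a short sketch. It builds the matrix with diagonal entries 1 − 1/N_e and off-diagonal entries −1/N_e and checks that applying it to the small matrix HX gives the same result as first forming the perturbation matrix and then applying the observation operator. A linear observation operator H, the random test matrices and all names are assumptions made for this example.

```python
import numpy as np

def projection_matrix(n_ens):
    """Matrix with diagonal entries 1 - 1/N_e and off-diagonal entries -1/N_e."""
    return np.eye(n_ens) - np.full((n_ens, n_ens), 1.0 / n_ens)

rng = np.random.default_rng(1)
n_state, n_obs, n_ens = 1000, 20, 8
X = rng.normal(size=(n_state, n_ens))     # ensemble of model states
H = rng.normal(size=(n_obs, n_state))     # linear observation operator (illustrative)
A = projection_matrix(n_ens)

# Route 1: form the large perturbation matrix explicitly, then observe it.
S_direct = H @ (X - X.mean(axis=1, keepdims=True))
# Route 2: observe the ensemble, then apply A to the small n_obs x n_ens matrix.
S_projected = (H @ X) @ A
```

Multiplying an ensemble matrix by this matrix from the right subtracts the ensemble mean from every column, so the second route never needs the N_x × N_e perturbation matrix.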
The computational cost for the particle-based non-linear filter NETF (Section 7.1) is similar to that of the ETKF since the analysis is performed in the N e -dimensional subspace spanned by the ensemble members. In addition, the NETF does not compute an inverse matrix thus avoiding computational instabilities caused by small singular values, which are sometimes neglected in ETKF implementations for that reason (Sakov et al., 2012). If localisation is applied to the NETF, the local analysis computations are independent and can be evaluated in parallel as for the ETKF. The generation of random rotation matrices consumes additional resources; however, it is possible to resort to a collection of precalculated random matrices since they only depend on ensemble size N e .

Ensemble data assimilation and non-linearity
The original EnKF was developed to overcome stability problems of the extended Kalman filter (see Jazwinski, 1970) that were discovered with ocean data assimilation applications (Evensen, 1993). Due to its use of an ensemble to propagate the state error covariance matrix, the EnKF is suited for nonlinear models in this phase. However, the analysis step is based on the Kalman filter and is only optimal for Gaussian distributions. Obviously, a non-linear model forecast will transform a Gaussian distribution into a non-Gaussian distribution. Hence, the optimality of the Kalman filters is no longer preserved and the estimated analysis state and the error estimates will be sub- optimal. This is a common issue for all ensemble filters whose analysis step is based on the equations of the Kalman filter. Nonetheless, the many existing data assimilation studies with non-linear models, e.g. of the ocean or atmosphere, with different formulations of the ensemble Kalman Filters show that these filters are rather stable with regard to non-linearity. Second-order accurate ensemble filters, like the NETF, MMEF and MPF in Section 7 as well as the adaptive Gaussian mixture filter described in Section 8 avoid the assumption that the forecast ensemble has a Gaussian distribution. Thus, they should be better suited for non-linear systems. When the methods are applied with localisation, they can also be applied with large systems (e.g. the NETF in Tödter et al., 2016). However, filters like the NETF are still approximations to the full non-linear analysis that is performed by particle filters.
Particle filters do not rely on any assumption on the error distribution of the state estimate. However, the observation errors are frequently assumed to be Gaussian, as is the case for the particle filters presented in Section 6. Additionally, while the EWPF and IEWPF do not require knowledge of forecast errors, they both require good knowledge of model errors, i.e. Q. Of course, good knowledge of model errors is always beneficial to forecasting irrespective of the data assimilation method used, but for the application of the EWPF and IEWPF model errors are essential. While standard particle filters suffer from the curse of dimensionality when applied to large systems, the EWPF and IEWPF are designed to work for high-dimensional models, including those which are highly non-linear, with a small number of particles, e.g. Ades and van Leeuwen (2013) and Zhu et al. (2016). However, when applying the relaxation scheme between observations in an EWPF, the relaxation term Υ in Equation (120) has to be chosen very carefully to suit the model at hand, by selecting an appropriate relaxation strength function ρ and covariance matrix. The relaxation term can be chosen to be constant between observation times, but that would not be a good idea if the system experiences oscillations between observations. In that case, the strength term can be chosen to increase linearly over the interval up to the next observation towards which the particles are nudged, or to be non-linear with maximum strength close to the observations.
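The three choices of relaxation strength mentioned above (constant, linearly increasing, and non-linear with its maximum close to the observation) can be sketched as simple schedules. This is an illustration only: the function name, the parameters `max_strength` and `sharpness`, and the exponential form of the non-linear option are assumptions, and the sketch covers only the scalar strength ρ, not the full relaxation term of Equation (120).

```python
import numpy as np


def relaxation_strength(step, n_steps, max_strength=1.0, kind="linear", sharpness=4.0):
    """Scalar relaxation strength at model step `step` of `n_steps` between observations.

    Hypothetical schedules: `step = n_steps` corresponds to the observation time.
    """
    tau = step / n_steps  # fraction of the observation window already elapsed
    if kind == "constant":
        return max_strength
    if kind == "linear":
        return max_strength * tau
    if kind == "exponential":
        # Stays weak for most of the window and rises steeply near the observation.
        return max_strength * np.expm1(sharpness * tau) / np.expm1(sharpness)
    raise ValueError(f"unknown schedule: {kind}")


n_steps = 10
for kind in ("constant", "linear", "exponential"):
    values = [relaxation_strength(k, n_steps, kind=kind) for k in range(n_steps + 1)]
    print(kind, np.round(values, 3))
```

All three schedules reach `max_strength` at the observation time; they differ in how early the particles start to be pulled towards the observation.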
In all local particle filters that we discussed, the posterior particles are linear combinations of the prior particles. This has the potential to break non-linear balances between variables in the model. However, the linear combinations are typically formed such that only prior particles that are close to each other in state space, and hence quite similar, are combined. So this is not necessarily a disadvantage.

Summary and conclusion
This overview paper provides a coherent algorithmic summary and highlights differences between many currently used ensemble data assimilation methods that can be applied to high-dimensional and non-linear problems such as ocean or weather prediction, including well-known ensemble-based Kalman filters as well as recently developed particle filter methods and the Gaussian mixture filter.
We have presented these methods in a mathematically coherent way, allowing the reader to compare many methods easily. In particular, we have presented all ensemble-based Kalman filter methods in the form of a square root filter. In addition, we have included practical pseudo-algorithms for all methods since, for computational reasons, many of them would not be implemented in the form in which they are mathematically described. For some of the particle filters and for the Gaussian mixture filter we have presented the theory along with the step-by-step algorithm.
Finally, we have discussed important issues for practical implementation of the ensemble methods including various methods of localisation, inflation, parallelisation, computational cost and ensemble applicability to non-linear problems.
Concluding, a wealth of ensemble-based data-assimilation methods have been developed, and although they seem quite different in theory, the numerical implementations are quite similar. The implementations turn out to be quite similar to those for the particle filters, even those that explore a proposal density, where the state covariances that play an essential role in the Kalman filters are replaced by the covariance of the model errors. The main difference is that the state covariances are evolving over time and are always of low rank, while the model error covariance is given and of full rank but sparse. This means that different numerical algorithms need to be used to solve the equations when the system of interest has a high dimension.

Code availability
We note that many of the algorithms here have been efficiently implemented as a part of the Sangoma project and are freely available to everyone on the project website http://sourceforge.net/p/sangoma/ along with many other tools useful to data assimilation. Table 2 provides a list of available filters. Please note that the filter implementation was done independently from this paper so that not all filters described here are available. For simplicity these filters have been implemented without parallelisation and are hence only usable for moderately large problems with a state dimension of $O(10^5)$.
Further, all of these analysis methods have been implemented in at least one of the toolboxes connected to the Sangoma project: EMPIRE, OAK, SESAM, OPENDA, BELUGA/SEQUOIA, NERSC and PDAF. For example, the set of filters listed in Table 2, plus the SEIK filter (Section 5.2) with localisation, are available in a parallelised implementation for high-dimensional problems in the freely available data assimilation framework PDAF (Nerger et al., 2005a; Nerger and Hiller, 2013). Further, the EWPF (Section 6.2.1) and IEWPF (Section 6.2.2) are available in EMPIRE (Browne and Wilson, 2015).
1. www.data-assimilation.net/.
2. The discussion of the increasingly growing developments in hybrid data assimilation methods is beyond the scope of this paper; instead, we refer the reader to a very recent review article by Bannister (2017), and papers by Frei and Künsch (2013) and Chustagulprom et al. (2016) aiming to bridge particle and ensemble Kalman filter methods.
3. Note that $\rho = 1$ if $p = \tilde{N}_e$.
4. Many of the analysis methods discussed in this paper, including MRHF, have been implemented in Sangoma and are available for free to download from www.data-assimilation.net, as well as many other data assimilation tools for diagnostics, utilities etc.
5. Interestingly, the ECMWF is using an ensemble of 4DVars for their weather forecasting scheme, and it is relatively easy to turn this into a set of particles using 4DVar as proposal (see e.g. van Leeuwen et al., 2015).
6. The model error covariance matrices are usually assumed to be equal, i.e. Q = Q.

A.2. Stochastic universal resampling (SUR)
Stochastic universal resampling is also known as systematic resampling. It performs resampling in the same way as the basic random resampling algorithm, except that instead of drawing each $u_j$ independently from $U(0, 1)$ for $j = 1, \ldots, N_e$, it draws a single uniform random number $u \sim U[0, 1/N_e]$ and sets $u_j = u + (j - 1)/N_e$ (Bolić et al., 2003). Given the weights $\{w_j\}_{j=1}^{N_e}$ associated with the ensemble of particles, where the sum of the weights is equal to one, the total number of particles $N_e$ and the number of particles to be generated $\tilde{N}_e$, we generate an index of the sampled particles using Algorithm 2.
Algorithm 2 Algorithm of stochastic universal resampling
function SUR($w$, $N_e$, $\tilde{N}_e$)
    $\hat{w}_1 \leftarrow w_1$
    for $j \leftarrow 2$ to $N_e$ do
        $\hat{w}_j \leftarrow \hat{w}_{j-1} + w_j$    (compute cumulative weights)
    end for
    $u \sim U[0, 1/N_e]$    (generate a random number)
    $c \leftarrow 1$
    for $j \leftarrow 1$ to $\tilde{N}_e$ do
        while $u > \hat{w}_c$ do
            $c \leftarrow c + 1$
        end while
        $I_j \leftarrow c$    (assign an index of the sampled particle)
        $u \leftarrow u + 1/N_e$
        $c \leftarrow 1$
    end for
    return $I$
end function
The required input for the SUR is: $w \in \mathbb{R}^{N_e}$, a vector of particle weights; $N_e$, the total number of particles in the filter; and $\tilde{N}_e$, the number of particles to be sampled. The method returns an index $I \in \mathbb{R}^{\tilde{N}_e}$, which can then be used to select the sampled particles $x^*_j = x_{I(j)}$ for $j = 1 : \tilde{N}_e$. Note that this method has a lower sampling noise than probabilistic resampling since only one random variable is drawn.
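A minimal NumPy sketch of stochastic universal resampling is given below. It is an illustration rather than the Sangoma implementation; the function and argument names are hypothetical, indices are zero-based, and the pointer spacing is taken as one over the number of particles to be drawn, which is an assumption that coincides with the $1/N_e$ spacing of Algorithm 2 when $\tilde{N}_e = N_e$.

```python
import numpy as np


def stochastic_universal_resampling(weights, n_out=None, rng=None):
    """Return zero-based indices of the resampled particles.

    weights : particle weights (normalised here as a safeguard)
    n_out   : number of particles to draw; defaults to len(weights)
    rng     : optional numpy.random.Generator
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    n_out = weights.size if n_out is None else n_out

    # Cumulative weights, the analogue of w-hat in Algorithm 2.
    cumulative = np.cumsum(weights / weights.sum())

    # One random number, then equally spaced pointers u_j = u + (j - 1) / n_out.
    u0 = rng.uniform(0.0, 1.0 / n_out)
    pointers = u0 + np.arange(n_out) / n_out

    # First index c with pointer <= cumulative[c], as in the while loop.
    indices = np.searchsorted(cumulative, pointers, side="left")
    # Guard against round-off in the last cumulative weight.
    return np.minimum(indices, weights.size - 1)


w = np.array([0.1, 0.2, 0.3, 0.4])
idx = stochastic_universal_resampling(w, rng=np.random.default_rng(1))
print(idx)
```

The sampled ensemble is then obtained by indexing the columns of the state ensemble with `idx`. With equally spaced pointers, each particle is copied either $\lfloor \tilde{N}_e w_j \rfloor$ or $\lceil \tilde{N}_e w_j \rceil$ times, which is the source of the low sampling noise.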

A.3. Residual resampling (RR)
The RR algorithm samples the particles in two parts. In the first part the number of replications of particles is calculated, but since the method does not guarantee that the number of resampled particles is $N_e$, the residual $N_r$ is computed. The second step requires resampling, which produces $N_r$ of the final $\tilde{N}_e$ particles. In Algorithm 3 this is done by PR, but other resampling techniques can be used.
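The two parts of residual resampling can be sketched as follows. This follows the standard textbook formulation, which is an assumption with respect to the details of Algorithm 3: the deterministic part replicates particle $j$ $\lfloor \tilde{N}_e w_j \rfloor$ times, and the remaining $N_r$ particles are drawn by probabilistic resampling from the normalised residual weights. Function and argument names are hypothetical and indices are zero-based.

```python
import numpy as np


def residual_resampling(weights, n_out=None, rng=None):
    """Return zero-based indices of the resampled particles.

    weights : particle weights (normalised here as a safeguard)
    n_out   : number of particles to draw; defaults to len(weights)
    rng     : optional numpy.random.Generator
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    n_out = weights.size if n_out is None else n_out

    # Part 1: deterministic replication, floor(n_out * w_j) copies of particle j.
    expected = n_out * weights
    copies = np.floor(expected).astype(int)
    indices = np.repeat(np.arange(weights.size), copies)

    # Part 2: draw the N_r missing particles from the residual weights.
    n_residual = n_out - copies.sum()
    if n_residual > 0:
        residual = expected - copies
        residual = residual / residual.sum()
        extra = rng.choice(weights.size, size=n_residual, p=residual)
        indices = np.concatenate([indices, extra])
    return indices


idx = residual_resampling([0.1, 0.2, 0.3, 0.4], rng=np.random.default_rng(3))
print(idx)
```

Only the second part is random, so the sampling noise is confined to the $N_r$ residual particles; the call to `rng.choice` could be replaced by any other resampling scheme, for instance stochastic universal resampling.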
The required input for the RR is: $w \in \mathbb{R}^{N_e}$, a vector of particle weights; $N_e$, the total number of particles in the filter; and $\tilde{N}_e$, the number of particles to be sampled. The method returns an index $I \in \mathbb{R}^{\tilde{N}_e}$ of the sampled particles.
Set tolerance; ($1 \times 1$)
repeat
Find min. covar. matr.
$q_1 \leftarrow q_1 + 1$ ($1 \times 1$)
$U_{q_1} \leftarrow U(1 : q_1, :)$ ($N_x \times q_1$)
$E_{q_1} \leftarrow E(1 : q_1, 1 : q_1)$ ($q_1 \times q_1$)