Stochastic parameterization identification using ensemble Kalman filtering combined with maximum likelihood methods

Abstract For modelling geophysical systems, large-scale processes are described through a set of coarse-grained dynamical equations while small-scale processes are represented via parameterizations. This work proposes a method for identifying the best possible stochastic parameterization from noisy data. State-of-the-art sequential estimation methods such as Kalman and particle filters do not achieve this goal successfully because both suffer from the collapse of the posterior distribution of the parameters. To overcome this intrinsic limitation, we propose two statistical learning methods. They are based on the combination of the ensemble Kalman filter (EnKF) with either the expectation–maximization (EM) or the Newton–Raphson (NR) used to maximize a likelihood associated to the parameters to be estimated. The EM and NR are applied primarily in the statistics and machine learning communities and are brought here in the context of data assimilation for the geosciences. The methods are derived using a Bayesian approach for a hidden Markov model and they are applied to infer deterministic and stochastic physical parameters from noisy observations in coarse-grained dynamical models. Numerical experiments are conducted using the Lorenz-96 dynamical system with one and two scales as a proof of concept. The imperfect coarse-grained model is modelled through a one-scale Lorenz-96 system in which a stochastic parameterization is incorporated to represent the small-scale dynamics. The algorithms are able to identify the optimal stochastic parameterization with good accuracy under moderate observational noise. The proposed EnKF-EM and EnKF-NR are promising efficient statistical learning methods for developing stochastic parameterizations in high-dimensional geophysical models.


Introduction
The statistical combination of observations of a dynamical model with a priori information of physical laws allows the estimation of the full state of the model even when it is only ⋆ Corresponding author.e-mail: pulido@unne.edu.arpartially observed.This is the main aim of data assimilation (Kalnay, 2002).One common challenge of evolving multi-scale systems in applications ranging from meteorology, oceanography, hydrology and space physics to biochemistry and biological systems is the presence of parameters that do not rely on known physical constants so that their values are unknown and unconstrained.Data assimilation techniques can also be forc 0000 Tellus mulated to estimate these model parameters from observations (Jazwinski, 1970;Wikle and Berliner, 2007).
There are several multi-scale physical systems which are modelled through coarse-grained equations.The most paradigmatic cases being climate models (Stensrud, 2009), large-eddy simulations of turbulent flows (Mason and Thomson, 1992), and electron fluxes in the radiation belts (Kondrashov et al., 2011).These imperfect models need to include subgrid-scale effects through physical parameterizations (Nicolis, 2004).In the last years, stochastic physical parameterizations have been incorporated in weather forecast and climate models (Palmer, 2001;Shutts, 2015;Christensen et al., 2015).They are called stochastic parameterizations because they represent stochastically a process that is not explicitly resolved in the model, even when the unresolved process may not be itself stochastic.The forecast skill of ensemble forecast systems has been shown to improve with these stochastic parameterizations (Ibid.).Deterministic integrations with models that include these parameterizations have also been shown to improve climate features (see e.g.Lott et al. 2012).In general, stochastic parameterizations are expected to improve coarse-grained models of multi-scale physical systems (Katsoulakis et al., 2003;Majda and Gershgorin, 2011).However, the functional form of the schemes and their parameters, which represents small-scale effects, are unknown and must be inferred from observations.The development of automatic statistical learning techniques to identify an optimal stochastic parameterization and estimate its parameters is, therefore, highly desirable.
One standard methodology to estimate physical model parameters from observations in data assimilation techniques, such as the traditional Kalman filter, is to augment the state space with the parameters (Jazwinski, 1970).This methodology has also been implemented in the ensemble-based Kalman filter (see e.g.Anderson 2001).The parameters are constrained through their correlations with the observed variables.
The collapse of the parameter posterior distribution found in both ensemble Kalman filters (Delsole and Yang, 2010;Ruiz et al., 2013a;b;Santitissadeekorn and Jones, 2015) and particle filters (West and Liu, 2001) is a major contention point when one is interested in estimating stochastic parameters of nonlinear dynamical models.Hereinafter, we refer as stochastic parameters to those that define the covariance of a Gaussian stochastic process (Delsole and Yang, 2010).In other words, the sequential filters are, in principle, able to estimate deterministic physical parameters, the mean of the parameter posterior distri-bution, through the augmented state-space procedure, but they are unable to estimate stochastic parameters of the model, because of the collapse of the corresponding posterior distribution.
Using the Kalman filter with the augmentation method, Delsole and Yang (2010) proved analytically the collapse of the parameter covariance in a first-order autoregressive model.They proposed a generalized maximum likelihood estimation using an approximate sequential method to estimate stochastic parameters.Carrassi and Vannitsem (2011) derived the evolution of the augmented error covariance in the extended Kalman filter using a quadratic in time approximation that mitigates the collapse of the parameter error covariance.Santitissadeekorn and Jones (2015) proposed a particle filter blended with an ensemble Kalman filter and use a random walk model for the parameters.
This technique was able to estimate stochastic parameters in the first-order autoregressive model, but a tunable parameter in the random walk model needs to be introduced.
The Expectation-Maximization (EM) algorithm (Dempster et al., 1977;Bishop, 2006) is a widely used methodology to maximize the likelihood function in a broad spectrum of applications.One of the advantages of the EM algorithm is that its implementation is rather straigthforward.Wu (1983) showed that if the likelihood is smooth and unimodal, the EM algorithm converges to the unique maximum likelihood estimate.Accelerations of the EM algorithm have been proposed for its use in machine learning (Neal and Hinton, 1999).Recently, it was used in an application with a highly nonlinear observation operator (Tandeo et al., 2015).The EM algorithm was able to estimate subgrid-scale parameters with good accuracy while standard ensemble Kalman filter techniques failed.It has also been applied to the Lorenz-63 system to estimate model error covariance (Dreano et al., 2017).
In this work, we combine for stochastic parameterization identification these two independent methodologies: the ensemble Kalman filter (Evensen, 1994;2003) for the state-estimate with maximum likelihood estimators, the EM (Dempster et al., 1977;Bishop, 2006) and the Newton-Raphson (NR) algorithms (Cappé et al., 2005).The derivation of the technique is explained in detail and simple terms so that readers that are not from those communities can understand the basis of the methodologies, how they can be combined, and hopefully foresee potential applications in other geophysical systems.The learning statistical techniques are suitable to infer the functional form and the parameter values of stochastic parameterizations in chaotic spatio-temporal dynamical systems.They are eval-uated here on a two-scale spatially extended chaotic dynamical system (Lorenz, 1996) to estimate deterministic physical parameters, together with additive and multiplicative stochastic parameters.Pulido et al. (2016) evaluated methods based on the EnKF alone to estimate subgrid-scale parameters in a two-scale system: they showed that an offline estimation method is able to recover the functional form of the subgrid-scale parameterization, but none of the methods was able to estimate the stochastic

Hidden Markov model
A hidden Markov model is defined by a stochastic nonlinear dynamical model M that evolves in time the hidden variables where k stands for the time index.The dynamical model M depends on a set of deterministic and stochastic physical parameters denoted by Ω.We assume an additive random model error, The notation E () stands for the expectation operator, E [f (x)] ≡ f (x)p(x)dx with p being the probability density function of the underlying process X.
The observations at time k, y k ∈ R M , are related to the hidden variables through the observational operator H, where ǫ k is an additive random observation error with observa- Our K for the observation window is evaluated in the numerical experiments.The estimation technique may be applied in sucessive K-windows.For stochastic parameterizations in which the parameters are sensitive to processes of different time scales, a batch method may also be required to capture the sensitivity to slow processes.

Expectation-maximization algorithm
The EM algorithm maximizes the log-likelihood of observations as a function of the statistical parameters θ in the presence of a hidden state x0:K 1 , l(θ) = ln L(y1:K; θ) = ln p(x0:K, y1:K; θ)dx0:K. (3) An analytic form for the log-likelihood function, (3), can be obtained only in a few ideal cases.Furthermore, the numerical evaluation of (3) may involve high-dimensional integration of the complete likelihood (integrand of (3)).Given an initial guess of the statistical parameters θ, the EM algorithm maximizes the log-likelihood of observations as a function of the statistical parameters in successive iterations without the need to evaluate the complete likelihood.
Using Jensen inequality a lower bound for the log-likelihood is obtained, If we choose q(x0:K) = p(x0:K|y1:K ; θ ′ ), the equality is satisfied in (5), therefore p(x0:K|y1:K ; θ ′ ) is an upper bound to Q and so it is the q function that maximises Q(q, θ).
1 We use the notation ";", p(y 1:K ; θ) instead of conditioning "|" to emphasize that θ is not a random variable but a parameter.NR maximization and EM are point estimation methods so that θ is indeed assumed to be a parameter (Cappé et al., 2005).
From ( 5) we see that if we maximize Q(q, θ) over θ, we find a lower bound for l(θ).The idea of the EM algorithm is to first find the probability density function q that maximizes Q, the conditional probability of the hidden state given the observations, and then to determine the parameter θ that maximizes Q.
Hence, the EM algorithm encompasses the following steps: Expectation: Determine the distribution q that maximizes Q.
This function is easily shown to be q * = p(x0:K|y1:K; θ ′ ) (see (5); Neal and Hinton 1999).The function q * is the conditional probability of the hidden state given the observations.In practice, this is obtained by evaluating the conditional probability at θ ′ .
Maximization: Determine the statistical parameters θ * that maximize Q(q * , θ) over θ.The new estimation of the statistical parameters is denoted by θ * while the (fixed) previous estimation by θ ′ .The expectation step is a function of these old statistical parameters θ ′ .The part of function Q to maximize is given by where we use the notation Jazwinski, 1970).While the function that we want to maximize is the log-likelihood, the intermediate function ( 6) of the EM algorithm to maximize is the expectation of the joint distribution conditioned to the observations.

2.2..2. Expectation-maximization for a hidden Markov model
The joint distribution of a hidden Markov model using the definition of the conditional probability distribution reads The model state probability density function can be expressed as a product of the transition density from t k to t k+1 using the definition of the conditional probability distribution and the Markov property, The observations are mutually independent and are conditioned on the current state (see ( 2)) so that p(y1: Then, replacing ( 8) and ( 9) in ( 7) yields If we now assume a Gaussian hidden Markov model, and that the covariances R k and Q k are constant in time, the logarithm of the joint distribution ( 10) is then given by The Markov hypothesis implies that model error is not correlated in time.Otherwise, we would have cross terms in the model error summation of ( 11).The assumption of a Gaussian hidden Markov model is central to derive a closed form for the statistical parameters that maximize the intermediate function.
However, the dynamical model and observation operator may have nonlinear dependencies so that the Gaussian assumption is not strictly held.We therefore consider an iterative method in which each step is an approximation.In general, the method will converge through sucessive approximations.For severe nonlinear dependencies in the dynamical model, the existence of a single maximum in the log-likelihood is not guaranteed.In that case, the EM algorithm may converge to a local maximum.
We consider (11) as a function of the statistical parameters in this Gaussian state-space model.As mentioned, the statistical parameters, which are in general denoted by θ, are x0, P0, Q, R, and the physical parameters from M. In this way, the loglikelihood function is written as In this Gaussian state-space model, the maximum of the intermediate function in the EM algorithm, (6), may be determined analytically from Note that θ ′ is fixed in (13).We only need to find the critical values of the statistical parameters Q and R. The physical parameters are appended to the state, so that their model error is  13), is In the case of the observation error covariance, the solution is Therefore we can summarize the EM algorithm for a hidden Markov model as: Expectation: The required set of expectations given the observations must be evaluated at θi, i being the iteration number, specifically, E ( Maximization: Since we assume Gaussian distributions, the optimal value of θi+1 can be determined analytically, which in our model are Q and R, as derived in ( 14) and ( 15).These equations are evaluated using the expectations determined in the Expectation step.
The basic steps of this EM algorithm are depicted in Fig.
the ensemble transform Kalman filter (Hunt et al., 2007) and the Rauch-Tung-Striebel smoother (Cosme et al., 2012;Raanes, 2016) 2 .A short description of these methods is given in the Appendix.The empirical expectations are determined using the smoothed ensemble member states at t k , x s m (t k ).For instance, where Ne is the number of ensemble members.Then, using these empiral expectations R and/or Q are computed from ( 14) and/or (15).
The EM algorithm applied to a linear Gaussian state space model using the Kalman filter was first proposed by Shumway and Stoffer (1982).Its approximation using an ensemble of draws (Monte Carlo EM) was proposed in Wei and Tanner (1990).It was later generalized with the extended Kalman filter and Gaussian kernels by Ghahramani and Roweis (1999).The use of the EnKF and the ensemble Kalman smoother permits the extension of the EM algorithm to nonlinear high-dimensional dynamical models and nonlinear observation operators.

Newton-Raphson
The EM algorithm is highly versatile and can be readily implemented.However, it requires the optimal value in the maximization step to be computed analytically which limits the range of its applications.If physical parameters of a nonlinear model need to be estimated, an analytical expression for the optimal statistical parameter values may not be available.Another approach to find an estimate of the statistical parameters consists in maximizing an approximation of the likelihood function l(θ) with respect to the parameters, (3).This maximization may be conducted using standard optimization methods (Cappé et al., 2005).
Following Carrassi et al. (2017), the observation probability density function can be decomposed into the product 2 In principle what is required in (6) is p(x 0:K |y 1:K ) so that a fixedinterval smoother needs to be applied.However, it has been shown by Raanes that the Rauch-Tung-Striebel smoother and the ensemble Kalman smoother, a fixed-interval smoother, are equivalent even in the nonlinear, non-Gaussian case.
with the convention y1:0 = {∅}.In the case of sequential application of NR maximization in successive K-windows, the a priori probability distribution p(x0) can be taken from the previous estimation.For such a case, we leave implicit the conditioning in ( 17) on all the past observations, p(y1:K; θ) = p(y1:K|y:0; θ), y:0 = {y0, y−1, y−2, • • • } which is called contextual evidence in Carrassi et al. (2017).The times of the evidencing window, 1 : K, required for the estimation are the only ones that are kept explicit in (17).
Replacing ( 17) in (3) yields If we assume Gaussian distributions and linear dynamical and observational models, the integrand in ( 18) is exactly the analysis distribution given by a Kalman filter (Carrassi et al., 2017).
The likelihood of the observations conditioned on the state at each time is then given by and the prior forecast distribution, where is the analysis state -filter mean state estimate-at time k − 1, and P f k is the forecast covariance matrix of the filter.
The resulting approximation of the observation likelihood function which is obtained replacing ( 19) and ( 20) in ( 18), is where C stands for the constants independent of θ and the observational operator is assumed linear, H = H.Equation ( 21) is exact for linear models M = M, but just an approximation for nonlinear ones.As in EM, the point we made is that we expect that the likelihood in the iterative method can converge through sucessive approximations.
The evaluation of the model evidence ( 21) does not require the smoother.The forecasts x f k in ( 21) are started from the analysis -filter state estimates.In this case, the initial statistical parameters x0 and P0 need to be good approximations (e.g. an estimation from the previous evidencing window) or they need to be estimated jointly to the other potentially unknown parameters Ω, R, and Q.Note that (21) does not depend explicitly on Q because the forecasts x f k already include the model error.The steps of the NR method are sketched in Fig. 1b.For all the cases in which we can find an analytical expression for the maximization step of the EM algorithm, we can also derive a gradient of the likelihood function (Cappé et al., 2005).However, we apply the NR maximization in both cases; when the EM maximization step can be derived analytically but also when it cannot.Thus, we implement a NR maximization based on a so-called derivative-free optimization method, i.e. a method that does not require the likelihood gradient, to be described in the next section.parameterization which has to be indentified.This parameterization should represent the effects of small-scale variables on the large-scale variables.In this way, the coarse-grained onescale model with a physical parameterization with tunable deterministic and stochastic parameters is adjusted to account for the (noisy) observed data.We evaluate whether the EM algo-rithm and the NR method are able to determine the set of optimal parameters, assuming they exist.

Design of the numerical experiments
The synthetic observations are taken from the known nature integration by, see (2), with H = I, i.e. all the state is observed.Futhermore, we as-

Perfect-model experiments
In the perfect-model experiments, we use the one-scale Lorenz-96 system and a physical parameterization that represents subgrid-scale effects.The nature integration is conducted with this model and a set of "true" physical parameter values.
These parameters characterize both deterministic and stochastic processes.By virtue of the perfect model assumption, the model used in the estimation experiments is exactly the same as the one used in the nature integration except that the physical parameter values are assumed to be unknown.Although for simplicity we call this "perfect model experiment", this experiment could be thought as a model selection experiment with parametric model error in which we know the "perfect functional form of the dynamical equations" but the model parameters are completely unknown and they need to be selected from noisy observations.
We have included in the one-scale Lorenz-96 model a physical parameterization which is taken to be, where a noise term, ηj (t), of the form, has been added to each deterministic parameter.Equation ( 25) represents a random walk with standard deviation of the process σj, the stochastic parameters, and νj (t) is a realization of a Gaussian distribution with zero mean and unit variance.The parameterization ( 24) is assumed to represent subgrid-scale effects, i.e. effects produced by the small-scale variables to the large-scale variables (Wilks, 2005).
Update of θ (i) Q (i+1) from Eq. ( 14) R (i+1) from Eq. ( 15) .Each column of the matrix X k is an ensemble member state X k ≡ x1:N e (t k ) at time k.Subscript (i) means i-th iteration.A final application of the filter may be required to obtain the updated analysis state at i + 1.The function llik is the log-likelihood calculation from (21).

Model-identification experiments
In the model-identification experiments, the nature integration is conducted with the two-scale Lorenz-96 model (Lorenz, 1996).The state of this integration is taken as the true state evolution.The equations of the two-scale Lorenz-96 model, "true" model, are given by N equations of large-scale variables Xn, with n = 1, . . ., N ; and NS equations of small-scale variables Ym, given by where m = 1, . . ., NS.The two set of equations, ( 26) and ( 27), are assumed to be defined on a periodic domain, X−1 ≡ XN−1, X0 ≡ XN , XN+1 ≡ X1, and Y0 ≡ YN S , YN S +1 ≡ Y1, YN S +2 ≡ Y2.
The imperfect model used in the model-identification experiments is the one-scale Lorenz-96 model ( 23) with a parameterization (24) meant to represent small-scale effects (right-hand side of ( 26)).

Numerical experiment details
As used in previous works (see e. atmosphere considering the error growth rates; Lorenz, 1996).
The number of ensemble members is set to Ne = 50.The number of assimilation cycles (observation times) is K = 500.This is the "evidencing window" (Carrassi et al., 2017) in which we seek for the optimal statistical parameters.The measurement variance error is set to αR = 0.5 except otherwise stated.We do not use any inflation factor, since the model error covariance matrix is estimated.
The optimization method used in the NR maximization is "newuoa" (Powell, 2006).This is an unconstrained minimization algorithm which does not require derivatives.It is suitable for control spaces of about a few hundred dimensions.This derivative-free method could eventually permit to extend the NR maximization method to cases in which the state evolution (1) incorporates a non-additive model error.

Perfect-model experiment: Estimation of model noise parameters
The nature integration is obtained from the one-scale Lorenz-96 model ( 23) with a constant forcing of a0 = 17 without higher orders in the parameterization; in other words a one-scale Lorenz-96 model with an external forcing of F = 17.Information quantifiers show that for an external forcing of F = 17, the Lorenz-96 model is in a chaotic regime with maximal statistical complexity (Pulido and Rosso, 2017).The true model noise covariance is defined by Q t = α t Q I with α t Q = 1.0 (true parameter values are denoted by a t superscript).The observations are taken from the nature integration and perturbed using (22).
A first experiment examines the log-likelihood ( 21) as a function of αQ for different true measurement errors, α t R = 0.1, 0.5, 1.0 (Fig. 2a).A relatively smooth function is found with a well-defined maximum.The function is better conditioned for the experiments with smaller observational noise, αR. Figure 2b shows the log-likelihood as a function of αQ and αR.The darkest shading is around (αQ, αR) ≈ (1.0, 0.5).
However, note that because of the asymmetric shape of the loglikelihood function (Fig. 2a), the darker red region is shifted toward higher αQ and αR values.The up-left bottom-right orientation of the likelihood pattern in the plane αQ and αR reveals a correlation between them: the larger αQ, the smaller αR for the local maximal likelihood.
We conducted a second experiment using the same observations but the estimation of model noise covariance matrix is performed through the NR method.The control space is of 8x8=64 dimensions, i.e. the full Q model error covariance matrix is estimated (note that N = 8 is the model state dimension).Figure 3a depicts the convergence of the log-likelihood function in three experiments with evidencing window K = 100, 500 and 1000.
shown in Fig. 3b.As the number of cycles used in a single batch process increases, the estimation error diminishes.
The convergence of the EM algorithm applied for the estimation of model noise covariance matrix only (8x8=64 dimensions) is shown in Fig. 4.This work is focused on the estimation of physical parameters so that the observation error covariance matrix is assumed to be known.The method would allow to estimate it jointly through (15), however this is beyond the main aim of this work.This is similar to the previous experiment, using the EM instead of the NR method.In 10 iterations, the EM algorithm achieves a reasonable estimation, which is not further improved for larger number of iterations.The obtained log-likelihood value is rather similar to the NR method.
The noise in the log-likelihood function diminishes with longer evidencing windows.Comparing the standard Ne = 50 experiments with Ne = 500 in Fig. 4a, the noise also diminishes by increasing the number of ensemble members.Increasing the number of members does not appear to impact on the estimation of off-diagonal values, but it does so on the diagonal stochastic parameter values (Fig. 5a and b).The error in the estimates is about 7% in both diagonal and off-diagonal terms of the model noise covariance matrix for K = 100, and lower than 2% for the K = 1000 cycles case (Fig. 5).

Perfect-model experiment: Estimation of deterministic and stochastic parameters
A second set of perfect-model experiments evaluates the estimation of deterministic and stochastic parameters from a physical parameterization.The model used to generate the synthetic observations is ( 23) with the physical parameterization (24).
The deterministic parameters to conduct the nature integration are fixed to a t 0 = 17.0, a t 1 = −1.15,and a t 2 = 0.04 and the model error variance in each parameter is set to σ t 0 = 0.5, σ t 1 = 0.05, and σ t 2 = 0.002 respectively.The true parameters are governed by a stochastic process (25).This set of deterministic parameters is a representative physical quadratic polynomial parameterization, which closely resembles the dynamical regime of a two-scale Lorenz-96 model with F = 18 (Pulido and Rosso, 2017).The observational noise is set to αR = 0.5.
An augmented state space of 11 dimensions is used, which is composed by appending to the 8 model variables the 3 physical parameters.The evolution of the augmented state is represented by (1) for the state vector component and a random walk for the parameters.The EM algorithm is then used to estimate the additive augmented state model error Q which is an 11x11 covariance matrix.Therefore, the smoother recursion gives an estimate of both the state variables and deterministic parameters.
The recursion formula for the model error covariance matrix (and the parameter covariance submatrix) is given by ( 14).true value and it presents the lowest sensitivity.The estimation of the stochastic parameters by the EM algorithm converges rather precisely to the true stochastic parameters (Fig. 6b).
The convergence requires of about 80 iterations.The estimated model error for the state variables is in the order of 5 × 10 −2 .This represents the additive inflation needed by the filter for an optimal convergence.It establishes a lower threshold for the estimation of additive stochastic parameters.
A similar experiment was conducted with NR maximization for the same synthetic observations.A scaling of Sσ = (1, 10, 100) was included in the optimization to increase the condition number.A good convergence was obtained with the optimization algorithm.The estimated optimal parameter values are σ0 = 0.38 σ1 = 0.060 σ2 = 0.0025 for which the log-likelihood is l = −491.The estimation is reasonable with a relative error of about 25%.

Model-identification experiment: Estimation of the deterministic and stochastic parameters
As a proof-of-concept model-identification experiment, we now use synthetic observations with an additive observational noise of αR = 0.5 taken from the nature integration of the twoscale Lorenz-96 model with F = 18.On the other hand, the one-scale Lorenz-96 model is used in the ensemble Kalman filter with a physical parameterization that includes the quadratic polynomial function, (24), and the stochastic process (25).The deterministic parameters are estimated through an augmented state space while the stochastic parameters are optimized via the algorithm for the maximization of the log-likelihood function.
The model error covariance estimation is constrained for these experiments to the three stochastic parameters alone.Figure 7a shows the estimated deterministic parameters as a function of   tially the functional form of the effects of the small-scale variables.Using a constrained random walk appears to give the best simulation.

Conclusions
Two methods, the EnKF-EM and EnKF-NR, have been intro- The estimation based on the expectation-maximization algorithm gives very promising results in these medium-sized experiments (≈100 parameters).About 50 iterations are needed to achieve an estimation error lower than 10% using 100 observation times.Using a longer observation time inverval, the accuracy is improved.The estimation of stochastic parameters included the case of additive, i.e. a0, and multiplicative parameters, i.e. a1Xn and a2X 2 n .The number of ensemble members has a strong impact on the stochastic parameter variance, while the length of the observation time interval appears to have a stronger impact on the stochastic parameter correlations.
The estimation based on the NR method also presents good convergence for the perfect-model experiment with an additive stochastic parameter.For the more realistic model-identification experiments, the model evidence presents some noise which may affect the convergence.For higher dimensional problems, optimization algorithms that use the gradient of the likelihood to the statistical parameters need to be implemented.Moreover, the use of simulated annealing or other stochastic gradient optimization techniques suitable for noisy cost functions would be required.
Both estimation methods can be applied to a set of different dynamical models to address which one is more reliable given a set of noisy observations; the so called "model selection" problem.A comparison of the likelihood from the different models with the optimal parameters gives a measure of the model fidelity to the observations.Majda and Gershgorin (2011)  The increase of data availability in many areas has fostered the number of applications of the ensemble Kalman filter.In particular, it has been used for influenza forecasting (Shaman et al., 2013) and for determining a neural network structure (Hamilton et al., 2013).The increase in spatial and temporal resolution of data offers great opportunities for understanding multi-scale strongly-coupled systems such as atmospheric and oceanic dynamics.This has lead to the proposal of purely datadriven modeling which uses past observations to reconstruct the dynamics through the ensemble Kalman filter without a dynamical model (Hamilton et al., 2016;Lguensat et al., 2017).
The use of automatic statistical learning techniques that can use measurements for improvement of multi-scale models is also a promising venue.Following this recent stream of research, in this work we propose the coupling of the EM algorithm and NR method with the ensemble Kalman filter which may be applicable to a wide range of multi-scale systems to improve the representation of the complex interactions between different scales.given by a constrained random walk with optimal EM parameters.
empirical mean and covariance of the a priori hidden state are x f m (t k ), where X f (t k ) is a matrix with the ensemble member perturbations, x f m (t k ) − x f (t k ), as the m-ith column.To obtain the estimated hidden state, called analysis state, the observations are combined statistically with the a priori model state using the Kalman filter equations.In the case of the ensemble transformed Kalman filter (Hunt et al., 2007), the analysis state is a linear combination of the Ne ensemble member perturbations, x a = x f + X f w a , P a = X f Pa (X f ) T . (A2) The optimal ensemble member weights w a are obtained considering the distance between the projection of member states to the observational space, y f m ≡ H(x f m ), and observations y.These weights and the analysis covariance matrix in the perturbation space are All the quantities in (A2) and (A3) are at time t k so that the time dependence is omitted for clarity.A detailed derivation of (A2) and (A3) and a thorough description of the ensemble transformed Kalman filter and its numerical implementation can be found in Hunt et al. (2007).
To determine each ensemble member of the analysis state, the ensemble transformed Kalman filter uses the square root of the analysis covariance matrix, thus it belongs to the so-called square-root filters, where the perturbations of w a m are the columns of W a = [(Ne − 1) Pa ] 1/2 .The analysis state is evolved to the time of the next available observation t k+1 through the dynamical model equations which give the a priori or forecasted state, The smoother determines the probability density function of a dynamical model conditioned to a set of past and future observations, i.e. p(x k |y1:K ), based on the Gaussian assumption.
Applying the Rauch-Tung-Striebel smoother retrospective formula to each ensemble member (Cosme et al., 2012), where K s (t k ) = P a (t k )M T k→k+1 [P f (t k+1 )] −1 , and M k→k+1 being the linear tangent model.For the application of the smoother in conjunction with the ensemble transformed Kalman filter, the smoother gain is reexpressed as (A7) In practice, the peusdo-inversion of the forecast state perturbation matrix X f required in (A7) is conducted through singular value decomposition.
component of the subgrid-scale effects.In the present work, the results show that the NR and EM techniques are able to uncover the functional form of the subgrid-scale parameterization while succesfully determining the stochastic parameters of the representation of subgrid-scale effects.This work is organized as follows.Section 2 briefly introduces the EM algorithm and derives the marginal likelihood of the data using a Bayesian perspective.The implementation of the EM and NR likehood maximization algorithms in the context of data assimilation using the ensemble Kalman filter is also discussed.Section 3 describes the experiments which are based on the one-and two-scale Lorenz-96 systems.The former includes simple deterministic and stochastic parameterizations to represent the effects of the smaller scale to mimic the twoscale Lorenz-96 system.Section 4 focuses on the results: Section 4.1 discusses the experiments for the estimation of model noise.Section 4.2 shows the results of the estimation of deterministic and stochastic parameters in a perfect-model scenario.Section 4.3 shows the estimation experiments for an imperfect model.The conclusions are drawn in Section 5.
estimation problem: Given a set of observation vectors distributed in time, {y k , k = 1, . . ., K}, a nonlinear stochastic dynamical model, M, and a nonlinear observation operator, H, we want to estimate the initial prior distribution p(x0), the observation error covariance R k , the model error covariance Q k , and deterministic and stochastic physical parameters Ω of M. Since the EM literature also uses the term parameter for the covariances, we need to distinguish them from deterministic and stochastic model parameters in this work.We refer to the parameters of a subgrid-scale parameterization (in the physical model) as physical parameters, including deterministic and stochastic ones.While the parameters of the likelihood function are referred to as statistical parameters.These include the deterministic and stochastic physical parameters, as well as the initial prior distribution, the observation error covariance and the model error covariance.The estimation method we derive is based on maximum likelihood estimation.Given a set of independent and identically distributed (iid) observations from a probability density function represented by p(y1:K|θ), we seek to maximize the likelihood function L(y1:K; θ) as a function of θ.We denote {y1, • • • , yK} by y1:K and the set of statistical parameters to be estimated by θ: the deterministic and stochastic physical parameters Ω of the dynamical model M as well as observation error covariances R k , model error covariances Q k and the initial prior distribution p(x0).In practical applications, the statistical moments R k , Q k and P0 are usually poorly constrained.It may thus be convenient to estimate them jointly with the physical parameters.The dynamical model is assumed to be nonlinear and to include stochastic processes represented by some of the physical parameters in Ω.The estimation technique used in this work is a batch method: a set of observations taken along a time interval is used to estimate the model state trajectory that is closest to them, considering measurement and model errors with a least-square criterion to be established below.The simultaneous use of observations distributed in time is essential to capture the interplay of the sev-eral statistical parameters and physical stochastic parameters included in the estimation problem.The required minimal length included in Q.The x0, P0 are at the initial time so that they are obtained as an output of the smoother which gives a Gaussian approximation of p(x k |y1:K) with k = 0, • • • , K. The smoother equations are shown in the Appendix.Differentiating (11) with respect to Q and R and applying the expectation conditioned to the observations, we can determine the root of the condition, (13), which gives the maximum of the intermediate function.The value of the model error covariance, solution of ( the Gaussian case.Hence, this expectation step involves the application of a foward filter and a backward smoother.

A
first set of numerical experiments consists of twin experiments with a perfect model in which we first generate a set of noisy observations using the model with known parameters.Then, the maximum likelihood estimators are computed using the same model with the synthetic observations.Since we know the true parameters, we can evaluate the error in the estimation and the performance of the proposed algorithms.A second set of experiments applies the method for model identification.The (imperfect) model represents the multi-scale system through a set of coarse-grained dynamical equations and an unknown stochastic physical parameterization.The model-identification experiments are imperfect model experiments in which we seek to determine the stochastic physical parameterization of the small-scale variables from observations.In particular, the "nature" or true model is the two-scale Lorenz-96 model and it is used to generate the synthetic observations, while the imperfect model is the one-scale Lorenz-96 model forced by a physical Fig. 1.(a) Flowchart of the EM algorithm (left panel).(b) NR flowchart (right panel).Each column of the matrix X k is an ensemble member state X k ≡ x1:N e (t k ) at time k.Subscript (i) means i-th iteration.A final application of the filter may be required to obtain the updated analysis state at i + 1.The function llik is the log-likelihood calculation from (21).
g., Wilks 2005; Pulido et al. 2016), we set N = 8 and M = 256 for the large-and smallscale variables respectively.The constants are set to the standard values b = 10, c = 10 and h = 1.The ordinary differential equations (26)-(27) are solved by a fourth-order Runge-Kutta algorithm.The time step is set to dt = 0.001 for integrating (26) and (27).For the model-identification experiments, we aim to mimic the dynamics of the large-scale equations of the two-scale Lorenz-96 system with the one-scale forced by a physical parameterization (24).In other words, our nature is the two-scale model, while our imperfect coarsegrained model is the forced one-scale model.For this reason, we take 8 variables for the one-scale Lorenz-96 model for the perfect-model experiments (as the number of large-scale variables in the model-identification experiments).Equations (23) are also solved by a fourth-order Runge-Kutta algorithm.The time step is also set to dt = 0.001.The EnKF implementation we use is the ensemble transform Kalman filter(Hunt et al., 2007) without localization.A short description of the ensemble transform Kalman filter is given in the Appendix.The time interval between observations (cycle) is 0.05 (an elapsed time of 0.2 represents about 1 day in the real

TheFig. 2 .
Fig.2.Log-likelihood function as a function of (a) model noise for three true observational noise values, α t R = 0.1, 0.5, 1.0; and as a function of (b) model noise (αQ) and observational noise (αR) for a case with α t Q = 1.0 and α t R = 0.5.Darker red shading represents larger log-likelihood.

FigureFig. 3 .Fig. 4 .Fig. 5 .
Figure6ashows the estimation of the mean deterministic parameters as a function of the EM iterations.The estimation of the deterministic parameters is rather accurate; a2 has a small

Fig. 6 .
Fig. 6.(a) Estimated mean deterministic parameters, ai, as a function of the EM iterations for the perfect-model parameter experiment.(b) Estimated stochastic parameters, σi.

Fig. 7 .
Fig. 7. (a) Estimated deterministic parameters as a function of the EM iterations for the model-identification experiment.Twenty experiments with random initial deterministic and stochastic parameters are shown.(b) Estimated stochastic parameters.(c) Loglikelihood function.

Fig. 8 .
Fig. 8. (a) Log-likelihood as a function of the σ0 parameter at the σ1 and σ2 optimal values for the NR estimation (green curve) and with the optimal values for the EM estimation (blue curve) for the imperfect-model experiment.(b) Analysis RMSE as a function of the σ0 parameter.
duced to characterize physical parameterizations in stochastic nonlinear multi-scale dynamical systems from noisy observations, which include the estimation of deterministic and stochastic parameters.Both methods determine the maximum of the observation likelihood -maximum of the model evidence-in a time interval in which a set of spatio-temporally distributed observations are available.They use the ensemble Kalman filter to combine observations with model predictions.The methods are first evaluated in a controlled model experiment in which the true parameters are known and then, in the two-scale Lorenz-96 dynamics which is represented with a stochastic coarse-grained model.The methods do not require neither inflation factors nor any other tunable parameters.The performance of the methods is excellent, even in the presence of moderate observational noise.
seeked to improve imperfect models by adding stochastic forcing and used a measure from information theory that gives the closest model distribution to the observed probability distribution.The model-identification experiments in the current work can be viewed as pursuing a similar objective, stochastic processes are added to the physical parameterization to improve the model representation of the unresolved processes.A sequential Monte Carlo filter is used between observations so that their error is accounted in the estimation.In both cases, the methodologies are based on Gaussian assumptions.Hannart et al. (2016) proposed to apply the observation likelihood function, model evidence, that results from assimilating a set of observations, for the detection and attribution of climate change.They suggest to evaluate the likelihood in two possible model configurations, one with the current anthropogenic forcing scenario (factual world) and one with the preindustrial forcing scenario (contrafactual world).If the evidencing window where the observations are located includes, for instance, an extreme event then one could determine the fraction of attributable risk as the fraction of the change in the observation likelihood of the extreme event which is attributable to the anthropogenic forcing.

Fig. 9 .
Fig. 9. (a) Scatterplot of the true small-scale effects in the two-scale Lorenz-96 model as a function of a large-scale variable (coloured dots) and scatterplot of the deterministic parameterization with optimal parameters (white dots).(b) Scatterplot from the stochastic paramerization with optimal parameters obtained with the EM algorithm and (c) with the NR method.(d) Scatterplotgiven by a constrained random walk with optimal EM parameters.