The iterated auxiliary particle filter

We present an offline, iterated particle filter to facilitate statistical inference in general state space hidden Markov models. Given a model and a sequence of observations, the associated marginal likelihood L is central to likelihood-based inference for unknown statistical parameters. We define a class of"twisted"models: each member is specified by a sequence of positive functions psi and has an associated psi-auxiliary particle filter that provides unbiased estimates of L. We identify a sequence psi* that is optimal in the sense that the psi*-auxiliary particle filter's estimate of L has zero variance. In practical applications, psi* is unknown so the psi*-auxiliary particle filter cannot straightforwardly be implemented. We use an iterative scheme to approximate psi*, and demonstrate empirically that the resulting iterated auxiliary particle filter significantly outperforms the bootstrap particle filter in challenging settings. Applications include parameter estimation using a particle Markov chain Monte Carlo algorithm.


Introduction
Particle filtering, or sequential Monte Carlo (SMC), methodology involves the simulation over time of an artificial particle system (ξ i t ; t ∈ {1, . . . , T }, i ∈ {1, . . . , N}). It is particularly suited to numerical approximation of integrals of the form where X = R d for some d ∈ N, T ∈ N, x 1:T := (x 1 , . . . , x T ), μ 1 is a probability density function on X, each f t a transition density on X, and each g t is a bounded, continuous, and nonnegative function. Algorithm 1 describes a particle filter, using which an estimate of (1) can be computed as Algorithm 1 A Particle Filter 1. Sample ξ i 1 ∼ μ 1 independently for i ∈ {1, . . . , N}. 2. For t = 2, . . . , T , sample independently f (x t−1 , x t )g(x t , y t )dx 1:T dy 1:T , where μ : X → R + is a probability density function, f : X × X → R + a transition density, g : X × Y → R + an observation density and A and B measurable subsets of X T and Y T , respectively. Statistical inference is often conducted upon the basis of a realization y 1:T of Y 1:T for some finite T , which we will consider to be fixed throughout the remainder of the article. Letting E denote expectations w.r.t. P, our main statistical quantity of interest is L := E[ T t=1 g(X t , y t )], the marginal likelihood associated with y 1:T . In the above, we take R + to be the nonnegative real numbers, and assume throughout that L > 0.
Running Algorithm 1 with corresponds exactly to running the bootstrap particle filter (BPF) of Gordon, Salmond, and Smith (1993), and we observe that when (4) holds, the quantity Z defined in (1) is identical to L, so that Z N defined in (2) is an approximation of L. In applications where L is the primary quantity of interest, there is typically an unknown statistical parameter θ ∈ that governs μ, f , and g, and in this setting the map θ → L(θ ) is the likelihood function. We continue to suppress the dependence on θ from the notation until Section 5.
The accuracy of the approximation Z N has been studied extensively. For example, the expectation of Z N , under the law of the particle filter, is exactly Z for any N ∈ N, and Z N converges almost surely to Z as N → ∞; these can be seen as consequences of Del Moral (2004, Theorem 7.4.2). For practical values of N, however, the quality of the approximation can vary considerably depending on the model and/or observation sequence. When used to facilitate parameter estimation using, for example, particle Markov chain Monte Carlo (Andrieu, Doucet, and Holenstein 2010), it is desirable that the accuracy of Z N be robust to small changes in the model and this is not typically the case.
In Section 2 we introduce a family of "twisted HMMs, " parameterized by a sequence of positive functions ψ := (ψ 1 , . . . , ψ T ). Running a particle filter associated with any of these twisted HMMs provides unbiased and strongly consistent estimates of L. Some specific definitions of ψ correspond to wellknown modifications of the BPF, and the algorithm itself can be viewed as a generalization of the auxiliary particle filter (APF) of Pitt and Shephard (1999). Of particular interest is a sequence ψ * for which Z N = L with probability 1. In general, ψ * is not known and the corresponding APF cannot be implemented, so our main focus in Section 3 is approximating the sequence ψ * iteratively, and defining final estimates through use of a simple stopping rule. In the applications of Section 5, we find that the resulting estimates significantly outperform the BPF, and exhibit some robustness to both increases in the dimension of the latent state space X and changes in the model parameters. There are some restrictions on the class of transition densities and the functions ψ 1 , . . . , ψ T that can be used in practice, which we discuss.
This work builds upon a number of methodological advances, most notably the twisted particle filter (Whiteley and Lee 2014), the APF (Pitt and Shephard 1999), block sampling (Doucet, Briers, and Sénécal 2006), and look-ahead schemes (Lin et al. 2013). In particular, the sequence ψ * is closely related to the generalized eigenfunctions described in Whiteley and Lee (2014), but in that work the particle filter as opposed to the HMM was twisted to define alternative approximations of L. For simplicity, we have presented the BPF in which multinomial resampling occurs at each timestep. Commonly employed modifications of this algorithm include adaptive resampling (Kong, Liu, and Wong 1994;Liu and Chen 1995) and alternative resampling schemes (see, e.g., Douc, Cappé, and Moulines 2005). Generalization to the time-inhomogeneous HMM setting is fairly straightforward, so we restrict ourselves to the timehomogeneous setting for clarity of exposition.

Twisted Models and the ψ-Auxiliary Particle Filter
Given an HMM (μ, f , g) and a sequence of observations y 1:T , we introduce a family of alternative twisted models based on a sequence of real-valued, bounded, continuous, and positive functionsψ := (ψ 1 , ψ 2 , . . . , ψ T ). Letting, for an arbitrary transition density f and function ψ, f (x, ψ ) := X f (x, x )ψ (x )dx , we define a sequence of normalizing functions(ψ 1 ,ψ 2 , . . . ,ψ T ) on X byψ t (x t ) := f (x t , ψ t+1 ) for t ∈ {1, . . . , T − 1},ψ T ≡ 1, and a normalizing constantψ 0 := X μ(x 1 )ψ 1 (x 1 )dx 1 . We then define the twisted model via the following sequence of twisted initial and transition densities: , . . . , T },(5) and the sequence of positive functions which play the role of observation densities in the twisted model. Our interest in this family is motivated by the following invariance result. We denote by 1 the sequence of constant functions equal to 1 everywhere.
Proposition 1. If ψ is a sequence of bounded, continuous and positive functions, and Proof. We observe that and the result follows.
From a methodological perspective, Proposition 1 makes clear a particular sense in which the L.H.S. of (1) is common to an entire family of μ 1 , ( f t ) t∈{2,...,T } and (g t ) t∈{1,...,T } . The BPF associated with the twisted model corresponds to choosing in Algorithm 1; to emphasize the dependence on ψ, we provide in Algorithm 2 the corresponding algorithm and we will denote approximations of L by Z N ψ . We demonstrate below that the BPF associated with the twisted model can also be viewed as an APF associated with the sequence ψ, and so refer to this algorithm as the ψ-APF. Since the class of ψ-APF's is very large, it is natural to consider whether there is an optimal choice of ψ, in terms of the accuracy of the approximation Z N ψ : the following proposition describes such a sequence.
2. For t = 2, . . . , T , sample independently for t ∈ {1, . . . , T − 1}. Then, Z N ψ * = L with probability 1. Proof. It can be established that and so we obtain from (6) that g ψ * 1 ≡ψ * 0 and g ψ * t ≡ 1 for t ∈ {2, . . . , T }. Hence, with probability 1. To conclude, we observe that Implementation of Algorithm 2 requires that one can sample according to μ ψ 1 and f ψ t (x, ·) and compute g ψ t pointwise. This imposes restrictions on the choice of ψ in practice, since one must be able to compute both ψ t andψ t pointwise. In general models, the sequence ψ * cannot be used for this reason as (8) cannot be computed explicitly. However, since Algorithm 2 is valid for any sequence of positive functions ψ, we can interpret Proposition 2 as motivating the effective design of a particle filter by solving a sequence of function approximation problems.
Alternatives to the BPF have been considered before (see, e.g., the "locally optimal" proposal in Doucet, Godsill, and Andrieu 2000 and the discussion in Del Moral 2004, Section 2.4.2). The family of particle filters we have defined using ψ are unusual, however, in that g ψ t is a function only of x t rather than (x t−1 , x t ); other approaches in which the particles are sampled according to a transition density that is not f typically require this extension of the domain of these functions. This is again a consequence of the fact that the ψ-APF can be viewed as a BPF for a twisted model. This feature is shared by the fully adapted APF of Pitt and Shephard (1999), when recast as a standard particle filter for an alternative model as in Johansen and Doucet (2008), and which is obtained as a special case of Algorithm 2 when ψ t (·) ≡ g(·, y t ) for each t ∈ {1, . . . , T }. We view the approach here as generalizing that algorithm for this reason.
It is possible to recover other existing methodological approaches as BPFs for twisted models. In particular, when each element of ψ is a constant function, we recover the standard BPF of Gordon, Salmond, and Smith (1993). Setting ψ t (x t ) = g(x t , y t ) gives rise to the fully adapted APF. By taking, for some k ∈ N and each t ∈ {1, . . . , T }, ψ corresponds to a sequence of look-ahead functions (see, e.g., Lin et al. 2013) and one can recover idealized versions of the delayed sample method of Chen, Wang, and Liu (2000) (see also the fixed-lag smoothing approach in Clapp and Godsill 1999), and the block sampling particle filter of Doucet, Briers, and Sénécal (2006). When k ≥ T − 1, we obtain the sequence ψ * . Just as ψ * cannot typically be used in practice, neither can the exact look-ahead strategies obtained by using (9) for some fixed k. In such situations, the proposed look-ahead particle filtering strategies are not ψ-APFs, and their relationship to the ψ * -APF is consequently less clear. We note that the offline setting we consider here affords us the freedom to define twisted models using the entire data record y 1:T . The APF was originally introduced to incorporate a single additional observation, and could therefore be implemented in an online setting, that is, the algorithm could run while the data record was being produced.

Asymptotic Variance of the ψ-APF
Since it is not typically possible to use the sequence ψ * in practice, we propose to use an approximation of each member of ψ * . To motivate such an approximation, we provide a central limit theorem, adapted from a general result due to Del Moral (2004, Chap. 9). It is convenient to make use of the fact that the estimate Z N ψ is invariant to rescaling of the functions ψ t by constants, and we adopt now a particular scaling that simplifies the expression of the asymptotic variance. In particular, we let Proposition 3. Let ψ be a sequence of bounded, continuous, and positive functions. Then We emphasize that Proposition 3, whose proof can be found in the Appendix, follows straightforwardly from existing results for Algorithm 1, since the ψ-APF can be viewed as a BPF for the twisted model defined by ψ. For example, in the case ψ consists only of constant functions, we obtain the standard asymptotic variance for the BPF From Proposition 3, we can deduce that σ 2 ψ tends to 0 as ψ approaches ψ * in an appropriate sense. Hence, Propositions 2 and 3 together provide some justification for designing particle filters by approximating the sequence ψ * .

Classes of f and ψ
While the ψ-APF described in Section 2 and the asymptotic results just described are valid very generally, practical implementation of the ψ-APF does impose some restrictions jointly on the transition densities f and functions in ψ. Here we consider only the case where the HMM's initial distribution is a mixture of Gaussians and f is a member of F, the class of transition densities of the form where M ∈ N, and (a k ) k∈{1,...,M} and (b k ) k∈{1,...,M} are sequences of mean and covariance functions, respectively and (c k ) k∈{1,...,M} a sequence of R + -valued functions with M k=1 c k (x) = 1 for all x ∈ X. Let define the class of functions of the form where M ∈ N, C ∈ R + , and (a k ) k∈{1,...,M} , (b k ) k∈{1,...,M} and (c k ) k∈{1,...,M} are a sequence of means, covariances, and positive real numbers, respectively. When f ∈ F and each ψ t ∈ , it is straightforward to implement Algorithm 2 since, for each can be computed explicitly and f ψ t (x, ·) is a mixture of normal distributions whose component means and covariance matrices can also be computed. Alternatives to this particular setting are discussed in Section 6.

Recursive Approximation of ψ *
The ability to compute f (·, ψ t ) pointwise when f ∈ F and ψ t ∈ is also instrumental in the recursive function approximation scheme we now describe. Our approach is based on the following observation.
Proof. The definition of ψ * provides that ψ * Let (ξ 1:N 1 , . . . , ξ 1:N T ) be random variables obtained by running a particle filter. We propose to approximate ψ * by Algorithm 3, for which we define ψ T +1 ≡ 1. This algorithm mirrors the backward sweep of the forward filtering backward smoothing recursion which, if it could be calculated, would yield exactly ψ * .

Algorithm 3 Recursive function approximations
For t = T, . . . , 1: Choose ψ t as a member of on the basis of ξ 1:N t and ψ 1:N t .
One choice in Step 2 of Algorithm 3 is to define ψ t using a nonparametric approximation such as a Nadaraya-Watson estimate (Nadaraya 1964;Watson 1964). Alternatively, a parametric approach is to choose ψ t as the minimizer in some subset of of some function of ψ t , ξ 1:N t and ψ 1:N t . Although a number of choices are possible, we focus in Section 5 on a simple parametric approach that is computationally inexpensive.

The Iterated Auxiliary Particle Filter
The iterated auxiliary particle filter (iAPF), Algorithm 4, is obtained by iteratively running a ψ-APF and estimating ψ * from its output. Specifically, after each ψ-APF is run, ψ * is reapproximated using the particles obtained, and the number of particles is increased according to a well-defined rule. The algorithm terminates when a stopping rule is satisfied.
Algorithm 4 An iterated auxiliary particle filter with parameters (N 0 , k, τ ) 1. Initialize: set ψ 0 to be a sequence of constant functions, l ← 0. 2. Repeat: (a) Run a ψ l -APF with N l particles, and setẐ l ← Z N l ψ l . (b) If l > k and sd(Ẑ l−k:l )/mean(Ẑ l−k:l ) < τ, go to 3. (c) Compute ψ l+1 using a version of Algorithm 3 with the particles produced. The rationale for Step 2(d) of Algorithm 4 is that if the sequenceẐ l−k:l is monotonically increasing, there is some evidence that the approximations ψ l−k:l are improving, and so increasing the number of particles may unnecessarily increase computational cost. However, if the approximationsẐ l−k:l have both high relative standard deviation in comparison to τ and are oscillating then reducing the variance of the approximation of Z and/or improving the approximation of ψ * may require an increased number of particles. Some support for this procedure can be obtained from the log-normal CLT of Bérard, Del Moral, and Doucet (2014): under regularity assumptions, log Z N ψ is approximately a N (−δ 2 ψ /2, δ 2 ψ ) random variable and so P(

Approximations of Smoothing Expectations
Thus far, we have focused on approximations of the marginal likelihood, L, associated with a particular model and data record y 1:T . Particle filters are also used to approximate so-called smoothing expectations, that is, π (ϕ) := E[ϕ(X 1:T ) | {Y 1:T = y 1:T }] for some ϕ : X T → R. Such approximations can be motivated by a slight extension of (1), where ϕ is a real-valued, bounded, continuous function. We can write π (ϕ) = γ (ϕ)/γ (1), where 1 denotes the constant function x → 1. We define below a well-known, unbiased, and strongly consistent estimate γ N (ϕ) of γ (ϕ), which can be obtained from Algorithm 1. A strongly consistent approximation of π (ϕ) can then be defined as γ N (ϕ)/γ N (1).
The definition of γ N (ϕ) is facilitated by a specific implementation of step 2. of Algorithm 1 in which one samples t−1 , ·), for each i ∈ {1, . . . , N} independently. Use of, for example, the Alias algorithm (Walker 1974(Walker , 1977 gives the algorithm O(N) computational complexity, and the random variables is unbiased and strongly consistent, and a strongly consistent approximation of π (ϕ) is The ψ * -APF is optimal in terms of approximating γ (1) ≡ Z and not π (ϕ) for general ϕ. Asymptotic variance expressions akin to Proposition 3, but for π N ψ (ϕ), can be derived using existing results (see, e.g., Del Moral and Guionnet 1999 ;Chopin 2004;Künsch 2005;Douc and Moulines 2008) in the same manner. These could be used to investigate the influence of ψ on the accuracy of π N ψ (ϕ) or the interaction between ϕ and the sequence ψ which minimizes the asymptotic variance of the estimator of its expectation.
Finally, we observe that when the optimal sequence ψ * is used in an APF in conjunction with an adaptive resampling strategy (see Algorithm 5), the weights are all equal, no resampling occurs and the ξ i t are all iid samples from P(X t ∈ · | {Y 1:T = y 1:T }). This at least partially justifies the use of iterated ψ-APFs to approximate ψ * : the asymptotic variance σ 2 ψ in (10) is particularly affected by discrepancies between ψ * and ψ in regions of relatively high conditional probability given the data record y 1:T , which is why we have chosen to use the particles as support points to define approximations of ψ * in Algorithm 3.

Applications and Examples
The purpose of this section is to demonstrate that the iAPF can provide substantially better estimates of the marginal likelihood L than the BPF at the same computational cost. This is exemplified by its performance when d is large, recalling that X = R d . When d is large, the BPF typically requires a large number of particles to approximate L accurately. In contrast, the ψ * -APF computes L exactly, and we investigate below the extent to which the iAPF is able to provide accurate approximations in this setting. Similarly, when there are unknown statistical parameters θ , we show empirically that the accuracy of iAPF approximations of the likelihood L(θ ) are more robust to changes in θ than their BPF counterparts.
Unbiased, nonnegative approximations of likelihoods L(θ ) are central to the particle marginal Metropolis-Hastings algorithm (PMMH) of Andrieu, Doucet, and Holenstein (2010), a prominent parameter estimation algorithm for general state space hidden Markov models. An instance of a pseudo-marginal Markov chain Monte Carlo algorithm (Beaumont 2003;Andrieu and Roberts 2009), the computational efficiency of PMMH depends, sometimes dramatically, on the quality of the unbiased approximations of L(θ ) (Andrieu and Vihola 2015; Lee and Łatuszyński 2014;Sherlock et al. 2015;Doucet et al. 2015) delivered by a particle filter for a range of θ values. The relative robustness of iAPF approximations of L(θ ) to changes in θ , mentioned above, motivates their use over BPF approximations in PMMH.

Implementation Details
In our examples, we use a parametric optimization approach in Algorithm 3. Specifically, for each t ∈ {1, . . . , T }, we compute numerically a regularized version of and then set where c is a positive real-valued function, which ensures that f ψ t (x, ·) is a mixture of densities with some nonzero weight associated with the mixture component f (x, ·). This is intended to guard against terms in the asymptotic variance σ 2 ψ in (10) being very large or unbounded. We chose (15) for simplicity and its low computational cost, and it provided good performance in our simulations. For the stopping rule, we used k = 5 for the application in Section 5.2, and k = 3 for the applications in Sections 5.3 and 5.4. We observed empirically that the relative standard deviation of the likelihood estimate tended to be close to, and often smaller than, the chosen level for τ . A value of τ = 1 should therefore be sufficient to keep the relative standard deviation around 1 as desired (see, e.g., Doucet et al. 2015;Sherlock et al. 2015). We set τ = 0.5 as a conservative choice for all our simulations apart from the multivariate stochastic volatility model of Section 5.4, where we set τ = 1 to improve speed. We performed the minimization in (15) under the restriction that was a diagonal matrix, as this was considerably faster and preliminary simulations suggested that this was adequate for the examples considered.
We used an effective sample size based resampling scheme (Kong, Liu, and Wong 1994;Liu and Chen 1995), described in Algorithm 5 with a user-specified parameter κ ∈ [0, 1]. The effective sample size is defined as ESS(W 1 , . . . , where R is the set of "resampling times. " This reduces to Algorithm 2 when κ = 1 and to a simple importance sampling algorithm when κ = 0; we use κ = 0.5 in our simulations. The use of adaptive resampling is motivated by the fact that when the effective sample size is large, resampling can be detrimental in terms of the quality of the approximation Z N .

Linear Gaussian Model
A linear Gaussian HMM is defined by the following initial, transition, and observation Gaussian densities: it is possible to implement the fully adapted APF (FA-APF) and to compute explicitly the marginal likelihood, filtering and smoothing distributions using the Kalman filter, facilitating comparisons. We emphasize that implementation of the FA-APF is possible only for a restricted class of analytically tractable models, while the iAPF methodology is applicable more generally. Nevertheless, the iAPF exhibited better performance than the FA-APF in our examples.

Relative Variance of Approximations of Z When d is Large
We consider a family of linear Gaussian models where m = 0, . . , d} for some α ∈ (0, 1). Our first comparison is between the relative errors of the approximationsẐ of L = Z using the iAPF, the BPF, and the FA-APF. We consider configurations with d ∈ {5, 10, 20, 40, 80} and α = 0.42 and we simulated a sequence of T = 100 observations y 1:T for each configuration. We ran 1000 replicates of the three algorithms for each configuration and report box plots of the ratioẐ/Z in Figure 1.
For all the simulations, we ran an iAPF with N 0 = 1000 starting particles, a BPF with N = 10,000 particles and an FA-APF with N = 5000 particles. The BPF and FA-APF both had slightly larger average computational times than the iAPF with these configurations. The average number of particles for the final iteration of the iAPF was greater than N 0 only in dimensions d = 40 (1033)   Fixing the dimension d = 10 and the simulated sequence of observations y 1:T with α = 0.42, we now consider the variability of the relative error of the estimates of the marginal likelihood of the observations using the iAPF and the BPF for different values of the parameter α ∈ {0.3, 0.32, . . . , 0.48, 0.5}. In Figure 2, we report box plots ofẐ/Z in 1000 replications. For the iAPF, the length of the boxes are significantly less variable across the range of values of α. In this case, we used N = 50,000 particles for the BPF, giving a computational time at least five times larger than that of the iAPF. This demonstrates that the approximations of the marginal likelihood L(α) provided by the iAPF are relatively insensitive to small changes in α, in contrast to the BPF. Similar simulations, which we do not report, show that the FA-APF for this problem performs slightly worse than the iAPF at double the computational time.

Particle Marginal Metropolis-Hastings.
We and simulated a sequence of T = 100 observations. Assuming only that A is lower triangular, for identifiability, we performed Bayesian inference for the 15 unknown parameters {A i, j : i, j ∈ {1, . . . , 5}, j ≤ i}, assigning each parameter an independent uniform prior on [−5, 5]. From the initial point A 1 = I 5 , we ran three Markov chains A BPF 1:L , A iAPF 1:L , and A Kalman 1:L of length L = 300,000 to explore the parameter space, updating one of the 15 parameters components at a time with a Gaussian random walk proposal with variance 0.1. The chains differ in how the acceptance probabilities are computed, and correspond to using unbiased estimates of the marginal likelihood obtain from the BPF, iAPF or the Kalman filter, respectively. In the latter case, this corresponds to running a Metropolis-Hastings (MH) chain by computing the marginal likelihood exactly. We started every run of the iAPF with N 0 = 500 particles. The resulting average number of particles used to compute the final estimate was 500.2. The number of particles N = 20,000 for the BPF was set to have a greater computational time, in this case A BPF 1:L took 50% more time than A iAPF 1:L to simulate. In Figure 3, we plot posterior density estimates obtained from the three chains for 3 of the 15 entries of the transition matrix A. The posterior means associated with the entries of the matrix A were fairly close to A itself, the largest discrepancy being around 0.2, and the posterior standard deviations were all around 0.1. A comparison of estimated Markov chain autocorrelations for these same parameters is reported in Figure 4, which indicates little difference between the iAPF-PMMH and Kalman-MH Markov chains, and substantially worse performance for the BPF-PMMH Markov chain. The integrated autocorrelation time of the Markov chains provides a measure of the asymptotic variance of the individual chains' ergodic averages, and in this regard the iAPF-PMMH and Kalman-MH Markov chains were practically indistinguishable, while the BPF-PMMH performed between 3 and 4 times worse, depending on the parameter. The relative improvement of the iAPF over the BPF does seem empirically to depend on the value of δ. In experiments with larger δ, the improvement was still present but less   pronounced than for δ = 0.25. We note that in this example, ψ * is outside the class of possible ψ sequences that can be obtained using the iAPF: the approximations in are functions that are constants plus a multivariate normal density with a diagonal covariance matrix, while the functions in ψ * are multivariate normal densities whose covariance matrices have nonzero, offdiagonal entries.
To compare the efficiency of the iAPF and the BPF within a PMMH algorithm, we analyzed a sequence of T = 945 observations y 1:T , which are mean-corrected daily returns computed from weekday close exchange rates r 1:T +1 for the pound/dollar from 1/10/81 to 28/6/85. These data have been previously analyzed using different approaches, for example, in Harvey, Ruiz, and Shephard (1994) and Kim, Shephard, and Chib (1998). We wish to infer the model parameters θ = (α, σ, β) using a PMMH algorithm and compare the two cases, where the marginal likelihood estimates are obtained using the iAPF and the BPF. We placed independent inverse Gamma prior distributions IG(2.5, 0.025) and IG(3, 1) on σ 2 and β 2 , respectively, and an independent Beta(20, 1.5) prior distribution on the transition coefficient α. We used (α 0 , σ 0 , β 0 ) = (0.95, √ 0.02, 0.5) as the starting point of the three chains: X iAPF 1:L , X BPF 1:L and X BPF L . All the chains updated one component at a time with a Gaussian random walk proposal with variances (0.02, 0.05, 0.1) for the parameters (α, σ, β). X iAPF 1:L has a total length of L = 150,000 and for the estimates of the marginal likelihood that appear in the acceptance probability we use the iAPF with N 0 = 100 starting particles. For X BPF 1:L and X BPF 1:L we use BPFs: X BPF 1:L is a shorter chain with more particles (L = 150,000 and N = 1000), while X BPF 1:L is a longer chain with fewer particles (L = 1,500,000, N = 100). All chains required similar running time overall to simulate. Figure 5 shows estimated marginal posterior densities for the three parameters using the different chains.
In Table 3, we provide the adjusted sample size of the Markov chains associated with each of the parameters, obtained by dividing the length of the chain by the estimated integrated autocorrelation time associated with each parameter. We can see an improvement using the iAPF, although we note that the BPF-PMMH algorithm appears to be fairly robust to the variability of the marginal likelihood estimates in this particular application.
Since particle filters provide approximations of the marginal likelihood in HMMs, the iAPF can also be used in alternative parameter estimation procedures, such as simulated maximum likelihood (Lerman and Manski 1981;Diggle and Gratton 1984). The use of particle filters for approximate maximum likelihood estimation (see, e.g., Kitagawa 1998;Hürzeler and Künsch 2001) has recently been used to fit macroeconomic models (Fernández-Villaverde and Rubio-Ramírez 2007). In Figure 6  we show the variability of the BPF and iAPF estimates of the marginal likelihood at points in a neighborhood of the approximate MLE of (α, σ, β) = (0.984, 0.145, 0.69). The iAPF with N 0 = 100 particles used 100 particles in the final iteration to compute the likelihood in all simulations, and took slightly more time than the BPF with N = 1000 particles, but far less time than the BPF with N = 10,000 particles. The results indicate that the iAPF estimates are significantly less variable than their BPF counterparts and may therefore be more suitable in simulated maximum likelihood approximations.

Multivariate Stochastic Volatility Model
We consider a version of the multivariate stochastic volatility model defined for and g(x, ·) = N (·; 0, exp(diag(x))), where m, φ ∈ R d and the covariance matrix U ∈ R d×d are statistical parameters. The matrix U is the stationary covariance matrix associated with (φ, U ). This is the basic MSV model in Chib, Omori, and Asai (2009, Sec. 2), with the exception that we consider a nondiagonal transition covariance matrix U and a diagonal observation matrix.
We analyzed two 20-dimensional sequences of observations y 1:T and y 1:T , where T = 102 and T = 90. The sequences correspond to the monthly returns for the exchange rate with respect to the US dollar of a range of 20 different international currencies, in the periods 3/2000-8/2008 (y 1:T , precrisis) and 9/2008-2/2016 (y 1:T , post-crisis), as reported by the Federal Reserve System (available at http://www. federalreserve.gov/releases/h10/hist/). We infer the model parameters θ = (m, φ,U ) using the iAPF to obtain marginal likelihood estimates within a PMMH algorithm. A similar study using a different approach and with a set of six currencies can be found in Liu and West (2001).
The aim of this study is to showcase the potential of the iAPF in a scenario where, due to the relatively high dimensionality of the state space, the BPF systematically fails to provide reasonable marginal likelihood estimates in a feasible computational time. To reduce the dimensionality of the parameter space we consider a band diagonal covariance matrix U with nonzero entries on the main, upper, and lower diagonals. We placed independent inverse Gamma prior distributions with mean 0.2 and unit variance on each entry of the diagonal of U , and independent symmetric triangular prior distributions on [−1, 1] on the correlation coefficients ρ ∈ R 19 corresponding to the upper and lower diagonal entries. We place independent Uniform(0, 1) prior distributions on each component of φ and an improper, constant prior density for m. This results in a 79-dimensional parameter space. As the starting point of the chains, we used φ 0 = 0.95 · 1, diag(U 0 ) = 0.2 · 1 and for the 19 correlation coefficients we set ρ 0 = 0.25 · 1, where 1 denotes a vector of 1s whose length can be determined by context. Each entry of m 0 corresponds to the logarithm of the standard deviation of the observation sequence of the relative currency.
We ran two Markov chains X 1:L and X 1:L , corresponding to the data sequences y 1:T and y 1:T , both of them updated one component at a time with a Gaussian random walk proposal with standard deviations (0.2 · 1, 0.005 · 1, 0.02 · 1, 0.02 · 1) for the parameters (m, φ, diag(U ), ρ). The total number of updates for each parameter is L = 12, 000 and the iAPF with N 0 = 500 starting particles is used to estimate marginal likelihoods within the PMMH algorithm. In Figure 7, we report the estimated smoothed posterior densities corresponding to the parameters for the Pound Sterling/U.S. Dollar exchange rate series. Most of the posterior densities are different from their respective prior densities, and we also observe qualitative differences between the pre-and post-crisis regimes. For the same parameters, sample sizes adjusted for autocorrelation are reported in Table 4. Considering the high-dimensional state and parameter spaces, these are satisfactory. In the later steps of the PMMH chain, we recorded an average number of iterations for the iAPF of around 5 and an average number of particles in the final ψ-APF of around 502. To compare this with the BPF, we found that with N = 10 7 particles, at a cost of roughly 200 times the iAPF-PMMH per step, a BPF-PMMH Markov chain accepted only 4 out of 1000 proposed moves, each of which attempted to update a single parameter.
The aforementioned qualitative change of regime seems to be evident looking at the difference between the posterior expectations of the parameter m for the post-crisis and the pre-crisis chain, reported in Figure 8. The parameter m can be interpreted as the period average of the mean-reverting latent process of the  log-volatilities for the exchange rate series. Positive values of the differences for close to all of the currencies suggest a generally higher volatility during the post-crisis period.

Discussion
In this article, we have presented the iAPF, an offline algorithm that approximates an idealized particle filter, whose marginal likelihood estimates have zero variance. The main idea is to iteratively approximate a particular sequence of functions, and an empirical study with an implementation using parametric optimization for models with Gaussian transitions showed reasonable performance in some regimes for which the BPF was not able to provide adequate approximations. We applied the iAPF to Bayesian parameter estimation in general state-space HMMs by using it as an ingredient in a PMMH Markov chain. It could also conceivably be used in similar, but inexact, noisy Markov chains; Medina-Aguayo, Lee, and Roberts (2015) showed that control on the quality of the marginal likelihood estimates can provide theoretical guarantees on the behavior of the noisy Markov chain. The performance of the iAPF marginal likelihood estimates also suggests they may be useful in simulated maximum likelihood procedures. In our empirical studies, the number of particles used by the iAPF was orders of magnitude smaller than would be required by the BPF for similar approximation accuracy, which may be relevant for models in which space complexity is an issue.
In the context of likelihood estimation, the perspective brought by viewing the design of particle filters as essentially a function approximation problem has the potential to significantly improve the performance of such methods in a variety of settings. There are, however, a number of alternatives to the parametric optimization approach described in Section 5.1, and it would be of particular future interest to investigate more sophisticated schemes for estimating ψ * , that is, specific implementations of Algorithm 3. We have used nonparametric estimates of the sequence ψ * with some success, but the computational cost of the approach was much larger than the parametric approach. Alternatives to the classes F and described in Section 3.2 could be obtained using other conjugate families, (see, e.g., Vidoni 1999). We also note that although we restricted the matrix in (15) to be diagonal in our examples, the resulting iAPF marginal likelihood estimators performed fairly well in some situations, where the optimal sequence ψ * contained functions that could not be perfectly approximated using any function in the corresponding class. Finally, the stopping rule in the iAPF, described in Algorithm 4 and which requires multiple independent marginal likelihood estimates, could be replaced with a stopping rule based on the variance estimators proposed in Lee and Whiteley (2015). For simplicity, we have discussed particle filters in which multinomial resampling is used; a variety of other resampling strategies (see Douc, Cappé, and Moulines 2005, for a review) can be used instead.