Research Article

Sampling by divergence minimization

Received 07 May 2022
Accepted 05 Mar 2023
Published online: 21 Apr 2023

Abstract

We introduce a Markov Chain Monte Carlo (MCMC) method that is designed to sample from target distributions with irregular geometry using an adaptive scheme. In cases where targets exhibit non-Gaussian behavior, we propose that adaptation should be regional rather than global. Our algorithm minimizes the information projection component of the Kullback–Leibler (KL) divergence between the proposal and target distributions to encourage proposals that are distributed similarly to the regional geometry of the target. Unlike traditional adaptive MCMC, this procedure rapidly adapts to the geometry of the target’s current position as it explores the surrounding space without the need for many preexisting samples. The divergence minimization algorithms are tested on target distributions with irregularly shaped modes and we provide results demonstrating the effectiveness of our methods.

1. Introduction

Markov Chain Monte Carlo (MCMC) is a class of algorithms designed to efficiently and effectively sample from a diverse set of target distributions (Brooks et al. Citation2011). Classical MCMC methods perform excellently when the target is well-behaved and unimodal. However, when targets exhibit unusual geometry or have multiple modes, core techniques such as random walk Metropolis (RWM) tend to perform poorly. These are the challenges that motivate much active MCMC research. In this paper, we propose an algorithm that specifically aims to effectively sample from irregular and non-Gaussian target distributions.

For targets with atypical geometry, adaptive MCMC has proven to outperform classical MCMC (Haario, Saksman, and Tamminen Citation2001; Andrieu and Thoms Citation2008; Atchadé et al. Citation2011). One of the core ideas driving adaptive MCMC is that a proposal distribution similar in shape to the target distribution will produce higher quality samples than a generic proposal. In adaptive Random Walk Metropolis (aRWM), this is accomplished by proposing with the empirical covariance matrix of the samples produced up to the current iteration. Thus, the proposals improve as the algorithm progresses until, eventually, the empirical covariance matrix approaches that of a hypothetical globally optimal sampler. Convergence to the target distribution can be upheld using the principles of containment and diminishing adaptation, or finite adaptation (Roberts and Rosenthal Citation2007; Rosenthal Citation2011). However, aRWM does have its limitations. When the target distribution exhibits highly irregular, non-Gaussian geometry, aRWM may not perform well because the single optimal Gaussian proposal distribution it uses may not reflect the local geometry in all regions of the target distribution.

In this work, we expand on the idea that a global optimal proposal distribution may not be effective and we instead discuss the idea of region-specific sampling. We introduce an algorithm designed to sample effectively from unusual geometries by exploiting local information about the target distribution. Instead of waiting for samples to be produced in order to trigger adaptation, we use ideas from the recently popular stochastic variational inference class of methods (Salimans, Kingma, and Welling Citation2015). By measuring the similarity between the target and proposal distributions using the Kullback–Leibler (KL) Divergence, we use gradients to devise an update rule that is not reliant on having an existing large batch of samples.

More specifically, consider a target distribution $p$ and a family of proposal distributions $q \in Q$ (Yamano Citation2009). Noting that the KL Divergence between two distributions is asymmetric, we specifically focus on the I-projection of the KL Divergence, defined as $D(q\|p) = E_q\left[\log \frac{q}{p}\right]$. The KL Divergence has two sides, the I- and M-projections, but we rely on the I-projection as it tends to produce an underdispersed $q$ that locks onto a specific mode of $p$, whereas minimizing the M-projection tends to produce a $q$ that overestimates the support of $p$ (Murphy Citation2012). In other words, through minimization, the I-projection produces a distribution similar to the local geometry of $p$. Such a distribution is equipped to rapidly produce samples from oddly shaped regions of a target distribution. We term this approach the divergence minimization (DM) sampler.

In addition, the DM sampler is inherently modular and can be easily integrated into other MCMC methods. As an example, we implement the DM sampler as the non-tempered chain of a two-chain parallel tempering algorithm, which we call Scout MCMC. Recall that parallel tempering executes multiple chains simultaneously on the target with different levels of tempering, then randomly swaps positions of the chains to improve mode discovery (Swendsen and Wang Citation1986; Geyer Citation1991). The DM sampler component of Scout MCMC improves on the basic parallel tempering procedure by offering immediate adaptation following swap moves.

Finally, we recognize that at each iteration, the covariance matrix produced by the gradient update rule represents a proposal distribution adept at sampling from its local region. This generated proposal distribution can reasonably be used for nearby points, assuming some degree of continuity. Thus, we introduce a two-stage extension to the DM sampler and Scout MCMC. In the first stage, we gather proposal distributions using the DM sampler or Scout MCMC. Next, we use these proposal distributions to characterize a non-adaptive Metropolis-Hastings algorithm.

Before discussing the specifics of the algorithms in Secs. 2 and 3, we first discuss a number of relevant related works.

1.1. Related works

The DM sampler draws inspiration from Titsias and Dellaportas (Citation2019). In that work, the authors optimize an objective function composed of the product of the entropy function and the average proposal acceptance rate. The proposals for the adaptive MCMC algorithm are then based on gradient updates that aim to maximize this function, producing a wide range of proposals that maintain a balance between entropy and acceptance rate. In addition to supporting significant adaptation at early stages of a chain, Gradient-based Adaptive MCMC also allows for adaptation upon rejecting a proposal, a noteworthy feature as most adaptive algorithms do not directly consider the information offered by rejected samples.

While the entropy function is a general function applied to the entire distribution, the algorithm that we present in this paper is based on the premise that, in cases with difficult geometry, it is necessary to focus on specific local regions instead of the entire target distribution when attempting to sample from it. This is accomplished by leveraging the I-projection of the target over the set of proposal distributions. The I-projection underestimates the support of the target distribution and will home in on one area, as opposed to the entropy function, which attempts to discover a range of samples from the entire target distribution at once (Shannon Citation1948; Murphy Citation2012). This regional behavior helps to overcome limitations of the Gaussian function class typically used for proposal distributions by reducing the current region of interest into manageable pieces.

Within the broader adaptive MCMC literature, there are a number of methods that use non-parametric strategies to construct adaptive proposal distributions that asymptotically converge to the target distribution. In Gilks, Best, and Tan (Citation1995), the Adaptive Rejection Metropolis Sampling (ARMS) algorithm iteratively constructs a piecewise linear proposal density in the log-space that approximates the target density. As the sampler progresses, new support points are collected that improve the approximation of the proposal to the target. The piecewise linear approximation can then be used in the Metropolis-Hastings step within a Gibbs sampler. Martino, Read, and Luengo (Citation2015) build on the ARMS algorithm by improving the mechanism by which support points are included in the construction of the proposal density, such that all regions of the target support are eligible for inclusion. Meyer, Cai, and Perron (Citation2008) present an alternative specification of ARMS that uses piecewise quadratic functions to construct the proposal density. The ARMS family of algorithms has proven effective at improving the efficiency of Gibbs samplers, particularly in settings where the conditional distributions are not well-known and thus cannot be sampled from directly. To contrast with our own proposal, we note that the ARMS algorithms are designed to construct proposals that approximate univariate targets, whereas our focus is on leveraging the correlation structure found in local regions of a multivariate target distribution.

Parallel tempering is another popular method, and is related to our Scout MCMC algorithm. In parallel tempering, multiple chains are run simultaneously on the target distribution with different levels of tempering applied. The intuition is that the highly tempered chains can more easily cross low-probability boundaries, and their positions can subsequently be randomly swapped with the non-tempered chain to improve mixing (Swendsen and Wang Citation1986; Geyer Citation1991). However, parallel tempering does have its limitations. From a computational perspective, executing many chains but ultimately only using the samples from the non-tempered chain is burdensome.

Using parallel chains for similar purposes, in Craiu, Rosenthal, and Yang (Citation2009), the authors introduce the algorithm Inter-chain Adaptation (INCA), which uses multiple stages of sampling. The first stage involves sampling the state space with parallel chains to partition the state space, while the second stage uses these predetermined regions as a guide to sample from the target distribution. The acceptance probabilities of new proposed points then depend on the region in which the current and proposed points reside.

Finally, the Jumping Adaptive Multimodal Sampler (JAMS) algorithm addresses the challenges of multimodal sampling by front-loading the computational burden of mode discovery, using optimization techniques to search for modes and subsequently incorporating this information into the sampling phase (Pompe, Holmes, and Łatuszyński Citation2019). In the sampling phase, dedicated "jump moves" are used to move between modes directly. Once in a mode, any sampler adept at unimodal sampling can be employed. Similar to how we incorporate the DM sampler into parallel tempering to form Scout MCMC, one could envision an algorithm that links the DM sampler to the JAMS algorithm to address both the irregular geometry and the multimodal sampling challenges at once.

2. Divergence minimization sampler

We propose that the challenge of sampling from irregular geometries can be overcome by focusing on smaller regions of a given target distribution that are simpler and can be adequately sampled from using common proposal distributions such as the Gaussian proposal. This region-specific sampling scheme requires addressing two core issues: identifying regions of interest and determining how best to sample from these regions. At the most granular level, each individual point in the space could constitute its own region. The rationale is that every point has its own unique surrounding geometry and thus there exists some optimal way to generate a new sample when starting at each and every point.

The latter challenge characterizes the problem of identifying this optimal sampling procedure. To address this issue, we propose using the I-projection component of the KL Divergence as a similarity measure between the target and proposal distributions to construct proposals with similar geometry to the region around the current point (Murphy Citation2012). Defining the target distribution as $p$ and the family of proposal distributions as $q \in Q$, the I-projection is $D(q\|p) = E_q\left[\log \frac{q}{p}\right]$. In the context of an MCMC proposal, we consider the family of proposal distributions to be Gaussian, and the objective is to determine the covariance matrix that characterizes the Gaussian with minimal divergence from the target distribution at the current point. Such a proposal can be defined as $q(y|x) \sim N(x, LL^T)$ where $x$ is the current position, $y = x + L\epsilon$ is the proposal, $\epsilon \sim N(0, 1)$, and $L$ is the Cholesky factor of the proposal covariance matrix (Higham Citation2009).
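To make the proposal mechanism concrete, the reparameterized draw can be sketched in a few lines of NumPy. This is a minimal illustration under the definitions above, not the authors' released implementation; the helper name `propose` is ours.

```python
import numpy as np

# Minimal sketch: draw a proposal y = x + L @ eps from
# q(y | x) = N(x, L L^T), with L the Cholesky factor.
def propose(x, L, rng):
    eps = rng.standard_normal(x.shape[0])  # eps ~ N(0, I)
    return x + L @ eps, eps

rng = np.random.default_rng(0)
x = np.zeros(2)          # current position
L = 2.0 * np.eye(2)      # e.g. the initial scaling sigma = 2
y, eps = propose(x, L, rng)
```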

2.1. Objective

To find a proposal distribution that minimizes the divergence with the local geometry of the target distribution, we consider gradient updates performed at each iteration of the MCMC chain. Meanwhile, we must be cognizant of the acceptance rate. In essence, we want both a small I-projection, so that the proposal and the target are similar, and a reasonably high acceptance rate, so that we are able to use the samples from our proposals. As such, we propose the following objective function, balancing the exponential of the negative I-projection and the average acceptance rate of the proposal: $s(x) = \exp[-\beta D(q\|p)] \cdot \int \alpha(x, y; L)\, q(y|x)\, dy$

In the above, $\beta$ is a hyperparameter that balances the impact of the I-projection with the average Metropolis acceptance rate defined by $\alpha(x, y; L) = \min\left\{1, \frac{p(y)}{p(x)}\right\}$ where $x$ is the current position, $y$ is the proposal, and $L$ is the proposal distribution Cholesky factor (Brooks et al. Citation2011). Notice the negative inside the exponential term of $s(x)$. As the I-projection is non-negative, the negative exponent bounds the exponential term between 0 and 1, with the maximum obtained when $D(q\|p) = 0$. Also note that the average acceptance rate ranges between 0 and 1. As a result of these bounds, $s \in [0, 1]$ and is maximized when we have high acceptance rates with a proposal that is similar to the target. Thus, the problem of identifying a suitable proposal distribution has been reduced to maximizing $s(x)$, where the optimal proposal distribution at any given $x$ can be characterized by the corresponding optimal Cholesky factor $L_x$ at the global optimum.

To make the objective function easier to manipulate, instead of optimizing $s(x)$, we can optimize the logarithm of $s(x)$. That is:

$\log s(x) = -\beta D(q\|p) + \log \int \alpha(x, y; L)\, q(y|x)\, dy$
$\quad = -\beta E_q\left[\log \frac{q(y|x)}{p(y)}\right] + \log E_q[\alpha(x, y; L)]$
$\quad = -\beta E_q[\log q(y|x)] + \beta E_q[\log p(y)] + \log E_q[\alpha(x, y; L)]$
$\quad = \beta H_q + \beta E_q[\log p(y)] + \log E_q[\alpha(x, y; L)]$

where $H_q$ denotes the entropy of $q$.

The above statement of $\log s(x)$ contains expectations entangled with both the $p$ and $q$ distributions, which precludes a closed-form solution. In particular, notice that the final term is the logarithm of an expectation. Such a term is certainly not ideal for optimization purposes. The most advisable path forward to maximize $\log s(x)$ is to instead bound it below using Jensen's inequality. We can then optimize the lower bound instead of the objective directly. Thus we have:

$\log s(x) \geq \beta H_q + \beta E_q[\log p(y)] + E_q[\log \alpha(x, y; L)]$
$\quad = \beta H_q + \beta E_q[\log p(y)] + E_q\left[\log \min\left\{1, \frac{p(y)}{p(x)}\right\}\right]$
$\quad = \beta H_q + \beta E_q[\log p(y)] + E_q[\min\{0, \log p(y) - \log p(x)\}]$
$\quad = \beta H_q + \beta E_\epsilon[\log p(x + L\epsilon)] + E_\epsilon[\min\{0, \log p(x + L\epsilon) - \log p(x)\}] =: J(x)$

J(x) can be used as a lower bound for the objective function and for optimization. However, while J(x) is certainly simpler than logs(x), a general closed form solution of the maximum at each value of x is not attainable. Instead, we turn to iterative optimization methods. We choose gradient ascent as a generally accessible method to maximize J(x).
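To illustrate what is being optimized, $J(x)$ can be estimated by simple Monte Carlo. The sketch below is illustrative rather than the authors' code; the function name `lower_bound_J` and its defaults are our own, and $H_q$ uses the standard entropy formula for $N(x, LL^T)$.

```python
import numpy as np

def lower_bound_J(log_p, x, L, beta, n_mc=100, rng=None):
    """Monte Carlo estimate of the lower bound J(x).

    H_q is the Gaussian entropy (k/2) log(2 pi e) + sum_i log L_ii, and
    the two expectations over eps ~ N(0, I) are approximated by sampling.
    """
    rng = rng or np.random.default_rng()
    k = x.shape[0]
    H_q = 0.5 * k * np.log(2.0 * np.pi * np.e) + np.sum(np.log(np.diag(L)))
    eps = rng.standard_normal((n_mc, k))
    log_p_y = np.array([log_p(x + L @ e) for e in eps])
    accept_term = np.minimum(0.0, log_p_y - log_p(x))
    return beta * H_q + beta * log_p_y.mean() + accept_term.mean()

# Example with a standard 2D Gaussian target:
log_p = lambda z: -0.5 * z @ z
print(lower_bound_J(log_p, np.zeros(2), np.eye(2), beta=0.2))
```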

Gradient ascent requires specifying the gradient of $J(x)$ with respect to the Cholesky factor $L$. We leave the detailed derivation of the approximate gradient of $J(x)$ to Appendix A.1 and present the final result here:

$\nabla_L J(x) = \beta\, \mathrm{diag}\left(\frac{1}{L_{11}}, \ldots, \frac{1}{L_{kk}}\right) + \frac{1}{J}\sum_{j=1}^{J} \beta\, \frac{\nabla p(x + L\epsilon_j)}{p(x + L\epsilon_j)}\, \epsilon_j^T + \frac{1}{J}\sum_{j=1}^{J} \nabla_L \min\{0, \log p(x + L\epsilon_j) - \log p(x)\}$

where $x$ is the current position, $L$ is the current value of the Cholesky factor, and the $\epsilon_j$ are a sample of $J$ standard normal draws used to approximate the gradients of the expectations found in $J(x)$.

Note further that the interior of the second summation in $\nabla_L J(x)$ reduces to the following two cases depending on the value of $\epsilon_j$:

$\nabla_L \min\{0, \log p(x + L\epsilon_j) - \log p(x)\} = \begin{cases} 0 & \text{if } \log p(x + L\epsilon_j) \geq \log p(x) \\ \frac{\nabla p(x + L\epsilon_j)}{p(x + L\epsilon_j)}\, \epsilon_j^T & \text{if } \log p(x + L\epsilon_j) < \log p(x). \end{cases}$

The above gradient characterizes the gradient update rule $L_{t+1} = L_t + \gamma \nabla_L J(x)$, where $t$ is the time step of gradient ascent and $\gamma$ is the step size used to maximize $J(x)$. Here we make the practical note that, due to the presence of the $p(x + L\epsilon_j)^{-1}$ term in the gradient, $\epsilon_j$ values that result in proposals with negligible density can cause the gradient to explode. We thus set a large threshold value $h$ to catch elements of the gradient matrix with absolute values greater than $h$, and set the offending values to $\pm h$ respectively. This event is rare in practice but more common in tail geometries, where the fraction of potentially offending proposals is higher.
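A sketch of this clipped gradient estimate is given below. The function assumes access to $\log p$ and its gradient ($\nabla p / p$); restricting the update to the lower triangle of the matrix is our own choice to keep $L$ a valid Cholesky factor, as the text does not spell this detail out, and the helper name `grad_J` is ours.

```python
import numpy as np

def grad_J(log_p, grad_log_p, x, L, beta, n_mc=10, h=5000.0, rng=None):
    """Sketch of the approximate gradient of J(x) with respect to L.

    grad_log_p(y) returns grad p(y) / p(y), i.e. the gradient of log p.
    Elements exceeding h in absolute value are set to +/- h, as in the
    text (the default h = 10 / gamma = 5000 for gamma = 0.002).
    """
    rng = rng or np.random.default_rng()
    k = x.shape[0]
    G = beta * np.diag(1.0 / np.diag(L))          # entropy term
    log_p_x = log_p(x)
    for _ in range(n_mc):
        eps = rng.standard_normal(k)
        y = x + L @ eps
        score = grad_log_p(y)
        G += beta * np.outer(score, eps) / n_mc   # grad of beta*E[log p(y)]
        if log_p(y) < log_p_x:                    # acceptance-rate term,
            G += np.outer(score, eps) / n_mc      # zero otherwise
    G = np.tril(G)              # our choice: keep the update triangular
    return np.clip(G, -h, h)    # threshold offending elements at +/- h
```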

Now, if we use this procedure to identify a value of $L$ that maximizes $J(x)$ for each point $x$, call these $L_x$, we could then characterize a Metropolis-Hastings algorithm using these Cholesky factors.

However, we recognize that a great number of steps would be necessary to optimize J(x) to within some small error threshold. As gradient updates can be computationally expensive, executing a complete run of gradient ascent at every iteration of an MCMC algorithm would be untenable.

We propose that instead of fully optimizing J(x) at every iteration, a process that requires many expensive steps, we perform one step of gradient ascent at every MCMC iteration. This will provide approximations of the point-wise optimal sampler discussed so far with the following justifications. First, we note that the early steps of gradient ascent tend to be the most influential and thus a complete run of gradient ascent is not absolutely necessary. Secondly, in practical contexts, changes in geometry are typically gradual which implies that nearby points experience similar behavior, and by extension, similar gradients. While the proposal distribution is not fully optimized at every iteration, on aggregate, the proposal distributions become more optimal as iterations progress.

2.2. Algorithm details

We now gather the results of the above discussions into a complete algorithm summary. The divergence minimization sampler's objective function and gradient update rule produce a series of covariance matrices for generating Gaussian proposals within an MCMC framework. Consistent with the acceptance rule in the objective function, we incorporate a Metropolis rule for proposal acceptance. At each iteration, we accept the proposal $y$ from the current position $x_t$ with probability $\alpha(x_t, y\,|\,C_t) = \min\left\{1, \frac{p(y)}{p(x_t)}\right\}$ where $t$ is the current MCMC iteration and $C_t$ is the current Cholesky factor of the proposal distribution's covariance matrix. Note that $C_t$ represents the partially optimized Cholesky factor, as opposed to the fully optimized $L_x$ used previously. We reserve discussion of convergence issues for Sec. 2.4.

Algorithm 1. Divergence Minimization Sampler with Perpetual Adaptation

1: Inputs (defaults): target $p(x)$, balancing parameter $\beta$ (0.2), initial point $x_0$, step size $\gamma$ (0.002), step threshold $h$ ($10/\gamma$), initial scaling $\sigma$ (2), iterations $M$
2: Initialize: $C_0 := \sigma I$
3: for $t = 0, \ldots, M$ do
4:  Generate $\epsilon_t \sim N(0, 1)$
5:  Propose $y = x_t + C_t \epsilon_t$
6:  Compute $G = \nabla_L J(x_t)$
7:  Accept $y$ with probability $\alpha(x_t, y\,|\,C_t)$
8:  Update $x_{t+1} = y$ if accepted or $x_{t+1} = x_t$ if rejected
9:  If element $|G_{ij}| > h$, set $G_{ij} = \mathrm{sign}(G_{ij}) \cdot h$
10:  Update Cholesky factor: $C_{t+1} \leftarrow C_t + \gamma G$
11: end for

Algorithm 1 summarizes the DM sampler with perpetual adaptation. In its most basic form, the initial Cholesky factor is set to a diagonal matrix with equal scaling along each dimension, though a more complex initialization would also be valid. Furthermore, parameters including the step size and balancing parameter are constant and supplied as inputs, although they could be adapted along with the Cholesky factor.
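For readers who prefer code to pseudocode, the loop can be rendered compactly in NumPy. This is an illustrative sketch of Algorithm 1 rather than the authors' released Python implementation; it reuses the `grad_J` helper sketched earlier and adopts the default inputs listed above.

```python
import numpy as np

def dm_sampler(log_p, grad_log_p, x0, M=30000, beta=0.2, gamma=0.002,
               sigma=2.0, n_mc=10, seed=0):
    """Sketch of Algorithm 1 (perpetual adaptation); h = 10 / gamma."""
    rng = np.random.default_rng(seed)
    h = 10.0 / gamma
    k = len(x0)
    C = sigma * np.eye(k)                 # C_0 := sigma * I
    x = np.array(x0, dtype=float)
    samples = np.empty((M, k))
    for t in range(M):
        eps = rng.standard_normal(k)
        y = x + C @ eps                   # propose
        G = grad_J(log_p, grad_log_p, x, C, beta, n_mc=n_mc, h=h, rng=rng)
        if np.log(rng.uniform()) < min(0.0, log_p(y) - log_p(x)):
            x = y                         # Metropolis accept/reject
        C = C + gamma * G                 # one gradient-ascent step on C
        samples[t] = x
    return samples
```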

2.3. Divergence minimization: a case study

To understand the behavior of the DM sampler, we examine a case study using a single banana distribution. The banana distribution is a unimodal distribution with non-Gaussian contours. For context, the contours of this distribution are presented in Figure 1. The banana distribution is known to be a difficult distribution to sample from with basic MCMC algorithms because of its quickly changing local geometry, especially in the two tails (Haario, Saksman, and Tamminen Citation1999; Citation2001).

Figure 1. Banana distribution contours. (Note: lighter contours indicate higher density, red dot indicates the origin).


An intuitive way to understand the behavior of the DM sampler is to examine its samples. Figure 2 presents parallel results from an adaptive Random Walk Metropolis (aRWM) run and a DM sampler run. The aRWM algorithm used in this case study and subsequent examples is described by Roberts and Rosenthal (Citation2009). Each algorithm was run for 30,000 iterations with the first 1,000 removed as burn-in. Visually, we notice in the DM sampler results in Figure 2 that the interior of the contours is evenly explored, whereas aRWM has blank gaps within the tails of the contours.

Figure 2. Banana distribution samples. The aRWM samples are more sparse and there are gaps in the tails whereas the DM sampler produces more samples in the tails that reach further outwards. (Note: Red dot indicates starting points and blue dots indicate samples).


Quantitatively, we compare the algorithms using the acceptance rate and the expected squared jumping distance (ESJD). The ESJD balances the goal of a high acceptance rate with the increased exploration of larger steps and is defined as $\mathrm{ESJD} = \sum_{t=2}^{M} \|x_t - x_{t-1}\|_2^2$ (Roberts and Rosenthal Citation2001; Gelman and Pasarica Citation2007). The DM sampler produced an acceptance rate of 72.25% with an ESJD of 2.4, compared to the 8.95% acceptance rate and ESJD of 7.3 of aRWM. Since the DM sampler has the higher acceptance rate but the lower ESJD, it must take smaller steps and is perhaps less efficient in terms of exploration than aRWM. With that said, more careful steps suggest a lower risk of missing regions of interest.
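For reference, the ESJD can be computed directly from the chain output; the snippet below follows the definition as stated (rejected proposals repeat a point and thus contribute zero).

```python
import numpy as np

def esjd(samples):
    """ESJD as defined in the text: the sum over the chain of squared
    Euclidean distances between consecutive samples."""
    return float(np.sum(np.diff(samples, axis=0) ** 2))
```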

Such behaviors can be explained by examining the contours of the proposal distribution at different points of the target distribution. Figure 3 presents the contours of the final proposal distribution of the aRWM run centered at the final sample. In other words, they are the contours of the covariance matrix of all samples generated. Notice that the contours have largely failed to adapt to the specific geometry of the target distribution. They have simply expanded so that all regions of meaningful density in the target are covered by the proposal distribution at any given time, but have not conformed to the unique geometry of the target distribution (Bédard Citation2007). One might expect that we could achieve similar success with even a simple RWM algorithm, given a large enough proposal distribution.

Figure 3. Proposal distribution contours from the final iteration of aRWM centered at the final sample imposed on the banana distribution contours. Notice that the contours do not match the behavior of the target distribution.


We contrast this behavior with that demonstrated by the DM sampler in Figure 4. The DM sampler delivers on the promise of adaptation to local behavior, as illustrated by the contours closely matching the region of interest. The proposal distributions benefit from adapted covariance matrices that align with the current tail, resulting in a dramatically reduced likelihood of bad proposals as compared to the aRWM proposals.

Figure 4. Sample DM sampler contours from the same algorithm execution. Notice that the proposal contours in the given iterations conform to the local geometry of the target distribution.


In summary, the contour plots demonstrate the intended behavior of the DM sampler: adaptation to local regions of interest. While it is common for algorithms to simply adapt to the scale of a target distribution, the DM sampler adapts to the behavior of the target distribution, a completely different and much more challenging task that is especially useful for target distributions with unique geometry. Furthermore, while in this instance aRWM appears to outperform in efficiency, we must question whether adapting only to the scale of a target distribution remains viable in higher dimensions, where density becomes increasingly sparse.

2.4. Convergence and finite adaptation

In a standard adaptive scheme, the algorithm typically involves certain technical conditions (such as diminishing adaptation and containment, or finite adaptation) to guarantee convergence to the target distribution (Roberts and Rosenthal Citation2007; Rosenthal Citation2011). In this work, we argue that certain target distributions, such as those with unusual geometry or those with many unique modes, lend themselves to perpetual adaptation as no single Gaussian proposal distribution could hope to sample well in all regions of interest. The banana example in the previous section is a good example to illustrate this. As the banana distribution is clearly non-Gaussian, a single non-adapting Gaussian proposal cannot appropriately orient itself in the apex and in both tails. However, the consequence of embracing perpetual adaptation is that the standard convergence framework for adaptive MCMC is no longer compatible.

Our goal is to sample efficiently using region-specific proposal distributions while still fulfilling the requirements of the standard convergence framework. As such, we propose a two phase approach that limits adaptation to a finite number of iterations and subsequently transfers the lessons learned in adaptation to a Metropolis-Hastings framework. By limiting adaptation to a finite number of iterations, convergence of the non-adaptive phase to the target distribution is guaranteed (Roberts and Rosenthal Citation2007; Rosenthal Citation2011).

Recall that the basis of the DM sampler is to approximate the optimal proposal distribution characterized by the Cholesky factor $L_x$ that maximizes the objective function $s(x)$. As discussed in Sec. 2.1, if we knew the values of all $L_x$, we could produce a simple Metropolis-Hastings algorithm with defined proposal distributions. Of course, we have seen that optimizing $s(x)$ is difficult for a single point, let alone all points in space. Fortunately, the procedure described in Algorithm 1 constructs Cholesky factors $C_t$ at each iteration $t$ to approximate the given location's optimal proposal structure for sampling. If we record these Cholesky factors after each iteration, they can act as a proxy for the optimal Cholesky factor for nearby points as well, assuming some degree of continuity. Thus, after generating a collection of points and their associated Cholesky factors in the adaptive phase, in each iteration of the non-adaptive phase we select the Cholesky factor associated with the closest adaptive-phase sample to construct a proposal covariance matrix for the current iteration. The algorithm thus proposes points from the distribution $q(y|x_t) \sim N(x_t, C_tC_t^T)$ and accepts with the following rule: $\alpha_f(x_t, y\,|\,C_t, \tilde{C}_y) = \min\left\{1, \frac{p(y)\, q(x_t|y)}{p(x_t)\, q(y|x_t)}\right\}$ where $x_t$ is the current position, $y$ is the proposal, $q(y|x_t) \sim N(x_t, C_tC_t^T)$, $q(x_t|y) \sim N(y, \tilde{C}_y\tilde{C}_y^T)$, and $C_t$ and $\tilde{C}_y$ are the Cholesky factors from the adaptive-phase iterations that correspond to the points closest to $x_t$ and $y$ respectively. In other words, instead of calculating a new Cholesky factor for every new point, we select the point from our adaptive phase that is closest to the new point and use its corresponding (approximate) Cholesky factor. This non-adaptive phase adheres to the standard validity criteria of a non-adaptive Metropolis-Hastings algorithm. A complete algorithm summary of this scheme is presented in Algorithm 2.
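The proposal-selection step of the non-adaptive phase amounts to a nearest-neighbor lookup followed by a standard Metropolis-Hastings ratio. The helpers below are an illustrative sketch (the names are ours); the distance is Euclidean, matching $d(\cdot, \cdot)$ in Algorithm 2.

```python
import numpy as np

def nearest_factor(z, bank_points, bank_factors):
    """Return the stored Cholesky factor whose adaptive-phase point is
    closest (in Euclidean distance) to z."""
    i = np.argmin(np.sum((bank_points - z) ** 2, axis=1))
    return bank_factors[i]

def log_q(y, x, C):
    """log N(y; x, C C^T) up to an additive constant, with C triangular."""
    z = np.linalg.solve(C, y - x)
    return -0.5 * z @ z - np.sum(np.log(np.abs(np.diag(C))))

def log_accept_ratio(log_p, x, y, C_x, C_y):
    # Metropolis-Hastings log ratio log[p(y) q(x|y) / (p(x) q(y|x))]
    # with region-specific proposal factors C_x and C_y.
    return (log_p(y) + log_q(x, y, C_y)) - (log_p(x) + log_q(y, x, C_x))
```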

We test the finite adaptation variant of the DM sampler on the banana distribution presented in Sec. 2.3. Essentially, once the adaptive phase completes, we consolidate the samples and covariance matrices and then begin the non-adaptive phase at the last adaptive-phase sample. In Figure 5, we present 20,000 samples generated from the non-adaptive phase. In addition to the visual indication of the non-adaptive samples covering the relevant portions of the state space, we note that the acceptance rate is 55.93%, the proportion of samples in the left side of the distribution is 51%, and the sample mean of $[0.31, 8.04]$ is approaching the true mean. These diagnostics indicate the algorithm is sampling well and is converging to the target distribution as expected.

Figure 5. Finite adaptation DM sampler variant. Samples presented only include those from the non-adaptive phase.


We note that the finite adaptation version of the DM sampler performs similarly to the perpetually adapting version with the added benefit of adhering to established convergence criteria.

Algorithm 2. Divergence Minimization Sampler with Finite Adaptation

1: Inputs (defaults): target $p(x)$, balancing parameter $\beta$ (0.2), initial point $x_0$, step size $\gamma$ (0.002), step threshold $h$ ($10/\gamma$), initial scaling $\sigma$ (2), iterations $M$, finite adaptation threshold $F$ ($M/2$), finite subsample size $s$ ($M/20$)
2: Initialize: $C_0 := \sigma I$
3: Adaptive Phase
4: for $t = 0, \ldots, F$ do
5:  Generate $\epsilon_t \sim N(0, 1)$
6:  Propose $y = x_t + C_t \epsilon_t$
7:  Compute $G = \nabla_L J(x_t)$
8:  Accept $y$ with probability $\alpha(x_t, y\,|\,C_t)$
9:  Update $x_{t+1} = y$ if accepted or $x_{t+1} = x_t$ if rejected
10:  If element $|G_{ij}| > h$, set $G_{ij} = \mathrm{sign}(G_{ij}) \cdot h$
11:  Update Cholesky factor: $C_{t+1} \leftarrow C_t + \gamma G$
12: end for
13: Let $S$ be a sample of $s$ points from $\{0, 1, \ldots, F\}$
14: Non-Adaptive Phase
15: for $t = F + 1, \ldots, M$ do
16:  Select $C_t := C_i$ where $i \in S$, such that $d(x_i, x_t)$ is minimized
17:  Generate $\epsilon_t \sim N(0, 1)$
18:  Propose $y = x_t + C_t \epsilon_t$
19:  Select $\tilde{C}_y := C_j$ where $j \in S$, such that $d(x_j, y)$ is minimized
20:  Accept $y$ with probability $\alpha_f(x_t, y\,|\,C_t, \tilde{C}_y)$
21:  Update $x_{t+1} = y$ if accepted or $x_{t+1} = x_t$ if rejected
22: end for

Remark.

We now comment on the choice of the Metropolis acceptance rule for the original DM sampler and discuss an alternative that could motivate future work. Recall that each iteration of the adaptive phase triggers the gradient update rule. Any proposal under this framework will be asymmetric, which at first glance would suggest the use of a Metropolis-Hastings acceptance rule (Hastings Citation1970). Suppose for a moment that we were to consider a Metropolis-Hastings rule; in other words, we replace the acceptance rule $\alpha$ with the following: $\alpha^*(x_t, y\,|\,C_t) = \min\left\{1, \frac{p(y)\, q(x_t|y)}{p(x_t)\, q(y|x_t)}\right\}$ where $x_t$ is the current position, $y$ is the proposal, $C_t$ is the Cholesky factor of the proposal covariance matrix, $q(y|x_t) \sim N(x_t, C_tC_t^T)$, and $q(x_t|y) \sim N(y, (C_t + \gamma\nabla_L J(x_t))(C_t + \gamma\nabla_L J(x_t))^T)$. The distribution $q(x_t|y)$ accounts for the gradient step made in the process of moving from $x_t$ to $y$, reflecting the asymmetry involved in returning from $y$ to $x_t$. In this case, reversibility would be upheld at each individual iteration without introducing any finite adaptation (Roberts and Smith Citation1994; Bai, Roberts, and Rosenthal Citation2011; Craiu et al. Citation2015). However, the proposal kernels across iterations are not necessarily identical. Concretely, visiting, leaving, and then returning to a point can result in different proposal kernels at the same point due to the use of only a single gradient step at each iteration. Thus, each individual step under the hypothetical Metropolis-Hastings setup would be reversible but, in aggregate, the entire chain may not be. This perpetual adaptation represents a departure from the established convergence theory. In this paper, we have instead decided to proceed with a finite adaptation scheme that does guarantee convergence to the target distribution.

3. Scout MCMC

The DM sampler is designed as a general procedure for rapid adaptation to local geometry. Given the self-contained setup, it can be combined with other MCMC frameworks such as parallel tempering to produce more complex samplers. In this section, we introduce an extension that combines the DM Sampler with a two-chain parallel tempering setup. Specifically, the untempered first chain uses the DM sampler, and the second chain is tempered by either a factor provided by the user or one proportional to the number of dimensions (Tawn, Roberts, and Rosenthal Citation2019). Such an approach benefits from both the regional adaptation of the DM sampler and the global exploration of parallel tempering swap moves. Given the single tempered chain searching for new regions of density, we term this approach Scout MCMC.

3.1. Algorithm details

Scout MCMC generates proposals for the main chain from $q(y_t|x_t) \sim N(x_t, C_tC_t^T)$ and accepts with probability $\alpha(x_t, y_t\,|\,C_t) = \min\left\{1, \frac{p(y_t)}{p(x_t)}\right\}$. It then adapts the main chain Cholesky factor by adding $\gamma \nabla_L J(x_t)$ as before. Next, a proposal for the scout chain is generated from $q(c_t|s_t) \sim N(s_t, \sigma_s I)$, which is accepted according to the rule $\alpha_s(s_t, c_t) = \min\left\{1, \frac{p(c_t)^\tau}{p(s_t)^\tau}\right\}$.

Finally, Scout MCMC considers swapping $x_{t+1}$ and $s_{t+1}$ every $k$ iterations according to the swap rule $\alpha_{\mathrm{swap}}(x, s) = \min\left\{1, \frac{p(x)^\tau\, p(s)}{p(x)\, p(s)^\tau}\right\}$.
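Both the scout move and the swap move reduce to a few lines once written in log space. The sketch below is illustrative (the helper names are ours); note that the swap log-ratio simplifies to $(1 - \tau)(\log p(s) - \log p(x))$.

```python
import numpy as np

def scout_step(log_p, s, tau, sigma_s, rng):
    """Tempered RWM move for the scout chain: propose c ~ N(s, sigma_s * I)
    and accept with probability min{1, (p(c)/p(s))^tau}."""
    c = s + np.sqrt(sigma_s) * rng.standard_normal(s.shape[0])
    return c if np.log(rng.uniform()) < tau * (log_p(c) - log_p(s)) else s

def maybe_swap(log_p, x, s, tau, rng):
    """Swap rule: exchange x and s with probability
    min{1, p(x)^tau p(s) / (p(x) p(s)^tau)}."""
    log_ratio = (1.0 - tau) * (log_p(s) - log_p(x))
    return (s, x) if np.log(rng.uniform()) < log_ratio else (x, s)
```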

Algorithm 3. Scout MCMC with Perpetual Adaptation

1: Inputs (defaults): target $p(x)$, balancing parameter $\beta$ (0.2), temperature $\tau$ (0.1), initial point $x_0$, step size $\gamma$ (0.002), step threshold $h$ ($10/\gamma$), initial scaling $\sigma$ (2), tempered scaling $\sigma_s$ (9), iterations $M$, swap frequency $k$ (20)
2: Initialize: $C_0 := \sigma I$, $s_0 = x_0$
3: for $t = 0, \ldots, M$ do
4:  Main Chain Step
5:  Generate $\epsilon_t \sim N(0, 1)$
6:  Propose $y_t = x_t + C_t \epsilon_t$
7:  Compute $G = \nabla_L J(x_t)$
8:  Accept $y_t$ with probability $\alpha(x_t, y_t\,|\,C_t)$
9:  Update $x_{t+1} = y_t$ if accepted or $x_{t+1} = x_t$ if rejected
10:  If element $|G_{ij}| > h$, set $G_{ij} = \mathrm{sign}(G_{ij}) \cdot h$
11:  Update Cholesky factor: $C_{t+1} \leftarrow C_t + \gamma G$
12:  Scout Step
13:  Propose $c_t \sim N(s_t, \sigma_s I)$ and accept with probability $\min\{1, p(c_t)^\tau / p(s_t)^\tau\}$
14:  Update $s_{t+1} = c_t$ if accepted or $s_{t+1} = s_t$ if rejected
15:  Swap Step
16:  if $t \equiv 0 \bmod k$ then
17:   Swap $x_{t+1}$ and $s_{t+1}$ with probability $\min\left\{1, \frac{p(s_{t+1})\, p(x_{t+1})^\tau}{p(x_{t+1})\, p(s_{t+1})^\tau}\right\}$
18:  end if
19: end for

Algorithm 3 provides pseudocode for the implementation of Scout MCMC. Once again, control over step size and initial scaling is determined by the user to allow flexibility between targets. For example, depending on the expected global region of interest, the tempered chain scaling can be adjusted. Additional details can be added such as adapting the scaling of the tempered chain or varying the limit on the frequency of swap moves.

3.2. Finite adaptation

Similar to the DM sampler, we present a two-phase finitely adapting variant of Scout MCMC. The first phase is the procedure presented in Algorithm 3. In the second, non-adaptive phase, the structure of the scout chain does not change. However, the main chain follows the same process as the finitely adapting DM sampler where the Cholesky factor corresponding to the nearest iteration of the adapting phase is used to construct proposal distributions in the non-adaptive phase. This reduces the non-adapting phase to a Metropolis-Hastings algorithm. We present the pseudocode associated with the finitely adapting Scout MCMC in Algorithm 4.

Algorithm 4. Scout MCMC with Finite Adaptation

1: Inputs (defaults): target $p(x)$, balancing parameter $\beta$ (0.2), temperature $\tau$ (0.1), initial point $x_0$, step size $\gamma$ (0.002), step threshold $h$ ($10/\gamma$), initial scaling $\sigma$ (2), tempered scaling $\sigma_s$ (9), iterations $M$, finite adaptation threshold $F$ ($M/2$), swap frequency $k$ (20), finite subsample size $s$ ($M/20$)
2: Initialize: $C_0 := \sigma I$, $s_0 = x_0$
3: Adaptive Phase
4: for $t = 0, \ldots, F$ do
5:  Main Chain Step
6:  Generate $\epsilon_t \sim N(0, 1)$
7:  Propose $y_t = x_t + C_t \epsilon_t$
8:  Compute $G = \nabla_L J(x_t)$
9:  Accept $y_t$ with probability $\alpha(x_t, y_t\,|\,C_t)$
10:  Update $x_{t+1} = y_t$ if accepted or $x_{t+1} = x_t$ if rejected
11:  If element $|G_{ij}| > h$, set $G_{ij} = \mathrm{sign}(G_{ij}) \cdot h$
12:  Update Cholesky factor: $C_{t+1} \leftarrow C_t + \gamma G$
13:  Scout Step
14:  Propose $c_t \sim N(s_t, \sigma_s I)$
15:  Accept $c_t$ with probability $\min\{1, p(c_t)^\tau / p(s_t)^\tau\}$
16:  Update $s_{t+1} = c_t$ if accepted or $s_{t+1} = s_t$ if rejected
17:  Swap Step
18:  if $t \equiv 0 \bmod k$ then
19:   Swap $x_{t+1}$ and $s_{t+1}$ with probability $\min\left\{1, \frac{p(s_{t+1})\, p(x_{t+1})^\tau}{p(x_{t+1})\, p(s_{t+1})^\tau}\right\}$
20:  end if
21: end for
22: Let $S$ be a sample of $s$ points from $\{0, 1, \ldots, F\}$
23: Non-Adaptive Phase
24: for $t = F + 1, \ldots, M$ do
25:  Main Chain Step
26:  Select $C_t := C_i$ where $i \in S$, such that $d(x_i, x_t)$ is minimized
27:  Generate $\epsilon_t \sim N(0, 1)$
28:  Propose $y = x_t + C_t \epsilon_t$
29:  Select $\tilde{C}_y := C_j$ where $j \in S$, such that $d(x_j, y)$ is minimized
30:  Accept $y$ with probability $\alpha_f(x_t, y\,|\,C_t, \tilde{C}_y)$
31:  Update $x_{t+1} = y$ if accepted or $x_{t+1} = x_t$ if rejected
32:  Scout Step
33:  Propose $c_t \sim N(s_t, \sigma_s I)$ and accept with probability $\min\{1, p(c_t)^\tau / p(s_t)^\tau\}$
34:  Update $s_{t+1} = c_t$ if accepted or $s_{t+1} = s_t$ if rejected
35:  Swap Step
36:  if $t \equiv 0 \bmod k$ then
37:   Swap $x_{t+1}$ and $s_{t+1}$ with probability $\min\left\{1, \frac{p(s_{t+1})\, p(x_{t+1})^\tau}{p(x_{t+1})\, p(s_{t+1})^\tau}\right\}$
38:  end if
39: end for

In the next section, we present examples demonstrating that similar to the DM sampler, the finite adaptation version of Scout MCMC performs similarly in practice to the perpetually adapting version but has the theoretical advantage of adhering to established convergence criteria.

4. Examples

In this section, we examine the performance of the DM Sampler and Scout MCMC using a variety of target distributions. We focus on distributions with atypical geometry as well as a subset of multimodal distributions that have contiguous modes. It is important to note that traditional diagnostics such as effective sample size (ESS) can be misleading in the case of multimodal distributions (Turner and Neal Citation2017; Elvira, Martino, and Robert Citation2018). ESS specifically may prefer a sample that fails to leave the initial mode over a sample that explores modes separated by a low-probability chasm. ESJD is arguably a better diagnostic, as it increases with increased step size and acceptance rate, both being favorable behaviors. Recall that the ESJD is defined as $\mathrm{ESJD} = \sum_{t=2}^{M} \|x_t - x_{t-1}\|_2^2$.

Given that there is a lack of consensus on appropriate diagnostics for targets with more than one mode, we have selected target distributions with easily computed true expected values to use as reference points for the simulations. Going forward, we will refer to the true expected value as E[X], the estimated expected value with an MCMC sample as Ê[X], and the Euclidean distance between the true and estimated values as d(E[X],Ê[X]).

As an example of a target distribution with easily computed expectations, the basis vector target that will be discussed in detail consists of a mixture of Gaussian distributions where each component Gaussian lies on one of the basis vectors and all are equidistant from the origin. This leads to a target with negligible density at the origin but with an expected value that is simply at the origin itself. A similar but much more challenging target consisting of a mixture of banana distributions presents a target with a mean at the origin that also has complex geometry. In these instances, we can use the distance from the sample mean to the origin, the true mean, to evaluate algorithm performance.

In the following examples, we compare the DM Sampler and Scout MCMC with standard Random Walk Metropolis (RWM), adaptive RWM (aRWM), the Metropolis-adjusted Langevin algorithm (MALA), and Parallel Tempering (PT). For clarity, RWM generates proposals with a single shared Gaussian distribution, aRWM generates proposals using the empirical covariance matrix up to the current iteration, MALA uses gradient information to improve proposals, and PT executes multiple RWM chains on the target distribution with different levels of tempering applied (Swendsen and Wang Citation1986; Geyer Citation1991; Roberts and Rosenthal Citation1998). For consistency, we match the maximum tempering level used by parallel tempering to the level used by the scout chain in Scout MCMC. We also execute two versions of parallel tempering: one with 2 chains to match Scout MCMC, and one with 5 or 10 chains as would be more likely in practice. Finally, we include both the fully adaptive versions of the DM Sampler and Scout MCMC along with the variants that limit adaptation and transition to a second non-adaptive phase. The code used to generate the following examples, along with implementations of each algorithm in Python, is provided to supplement the discussion.

4.1. Double banana distribution

The first distribution we consider is an extension of the banana distribution examined in Sec. 2.3. Specifically, we consider a pair of banana distributions with overlap in the tails. This results in two primary modes along the curves of the two bananas as well as two secondary modes at the intersections. Figure 6 provides the contours of this distribution.

Figure 6. Double banana distribution contours. (Note: lighter contours indicate higher density, red dot indicates the origin).


All of the algorithms are run for 50,000 iterations with the first 1,000 samples discarded as burn-in. Table 1 presents the results of this experiment. For this specific distribution, the target mean is $[0, 25]$.

Table 1. Double banana target results. We see here that aRWM and Scout MCMC produced sample means that are closest to the true mean. While aRWM has a greater ESJD value, Scout MCMC has a greater acceptance rate.

Notice that the samples of aRWM and Scout MCMC are closest to the mean of the distribution. It is worth noting that there is negligible density at the mean, as illustrated by Figure 6, so the ability to recover the correct mean indicates that both bananas have been visited. In comparison, standard RWM, MALA, and parallel tempering have not achieved the level of success of the other algorithms. Even with 5 chains, parallel tempering has largely failed to converge within the 50,000 iterations. The distinction between aRWM and Scout MCMC lies in the efficiency diagnostics. We see that Scout MCMC accepts over 10 times as many proposals as aRWM, though it has a smaller ESJD. A large acceptance rate is not necessarily indicative of a better algorithm, but with both algorithms performing similarly, this could indicate that Scout MCMC produces higher quality proposals. Since ESJD reflects both the acceptance rate and the step size of each move, and Scout MCMC has a much higher acceptance rate, aRWM must propose moves with much greater step sizes than Scout MCMC. This is expected, however, as Scout MCMC uses a user-specified cooldown period during which the main chain makes local moves and does not swap with the scout chain. Finally, we note that the finite adaptation variants of the DM sampler and Scout MCMC both perform similarly to their fully adapting counterparts, though they tend to accept fewer proposals.

In addition to sample diagnostics, we also examine the samples themselves in Figure 7. A notable observation is the dramatic imbalance of the parallel tempering samples as well as the DM sampler samples in Figure 7. RWM also experiences slight imbalance but, more notably, does not reach far into the tails within the number of iterations. Most surprising is that MALA has not managed to reach the bottom banana regardless of the choice of parameters. We attribute this largely to the gradient pulling proposals away from the tails and thus hampering exploration. The contrast between the DM sampler and Scout MCMC samples highlights the regulating ability of the scout chain to help the DM sampler escape from extreme regions. In this example, the left tail in Figure 7 could be considered an extreme region. Both the aRWM and Scout MCMC plots exhibit desirable sampling behavior, as the samples are well dispersed over the target and seemingly balanced. However, the primary difference between the two algorithms in this example is the relative concentration of samples due to aRWM's tendency to reject proposals and stay at the same points, whereas Scout MCMC produces a larger number of unique points.

Figure 7. Double Banana Samples. PT and the DM sampler have trouble moving away from tail regions. The best performing algorithms are aRWM and Scout MCMC as they achieve the most accurate sample mean and highest ESJD values. (Note: Red dot indicates starting points and blue dots indicate samples).


Finally, we plot the samples generated by the finite variants of the DM sampler and Scout MCMC in Figure 8. The samples presented largely match those of the fully adaptive versions. This indicates that the bank of covariance matrices generated in the adaptive phase is sufficient to produce region-specific samples as intended.

Figure 8. Double Banana Samples from the finite variants of the DM Sampler and Scout MCMC. Both algorithm variants perform in line with their perpetually adapting counterparts.


4.2. Basis vector distribution

As described briefly in the prelude to this section, the basis vector target consists of a series of normal distributions along the basis vectors in $\mathbb{R}^4$. Formally, the distribution is a mixture of the following normal distributions: $N(10e_1, I_4)$, $N(-10e_1, I_4)$, $N(10e_2, I_4)$, $N(-10e_2, I_4)$, $N(10e_3, I_4)$, $N(-10e_3, I_4)$, $N(10e_4, I_4)$, and $N(-10e_4, I_4)$, where $e_i$ is the basis vector in the $i$th direction and $I_4$ is the 4D identity matrix.
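In log space this target is a straightforward log-sum-exp over the eight components. The sketch below is illustrative (the function name is ours) and drops the shared mixture-weight and normalizing constants, which MCMC does not require.

```python
import numpy as np
from scipy.special import logsumexp

def log_p_basis(x, radius=10.0):
    """Unnormalized log density of the 4D basis vector target: an equal
    mixture of N(+/- radius * e_i, I_4) for i = 1, ..., 4."""
    d = x.shape[0]
    means = np.vstack([radius * np.eye(d), -radius * np.eye(d)])
    # log N(x; m, I) = -0.5 * ||x - m||^2 + const; the constant is shared
    # across components and can be dropped.
    return logsumexp(-0.5 * np.sum((x - means) ** 2, axis=1))
```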

The key feature of this distribution is that the expected value is at the origin but there is negligible density there. As such, any MCMC algorithm that hopes to be successful must be able to cross a vast low-probability desert to move between modes. Each algorithm was run for 40,000 iterations with the first 2,000 iterations discarded as burn-in. The results of the 4D basis vector target are presented in Table 2.

Table 2. 4D basis vector target results. Both versions of Scout MCMC and PT produced sample means that were closest to the true mean. PT slightly outperforms Scout MCMC in terms of ESJD but both variants of Scout MCMC have slightly closer means and higher acceptance rates.

Note that all of RWM, aRWM, MALA, and the DM sampler produced sample means that were far from the origin, the true mean. This indicates that these algorithms did not visit all of the modes in a balanced manner and likely got stuck in one or more select modes. This behavior is not unexpected, however, as these algorithms have no mechanism to cross low probability boundaries. Scout MCMC and parallel tempering, in contrast, produced sample means approaching the origin with Scout MCMC outperforming both cases of parallel tempering. In this instance, there does not appear to be any major impact from increasing the number of chains from two to five in parallel tempering aside from a larger ESJD. Finally, we note that the finite variants of the DM sampler and Scout MCMC perform in line with their respective fully adapting versions, validating their specification as non-adapting approximations.

To confirm that all eight modes were visited (as opposed to, say, two opposing modes with a mean equal to the origin), we examine the trace plots of parallel tempering with 5 chains and Scout MCMC in Figure 9. Each trace plot represents one of the dimensions, and for this target distribution, we should see the trace plots reaching −10, 0, and 10 in all dimensions. From the plots, it is clear that both algorithms are capable of moving between modes frequently and that all modes have been visited. It is at this juncture that we turn to the point of efficiency. Notice the difference in acceptance rate and ESJD between Scout MCMC and parallel tempering in Table 2. Scout MCMC accepts almost twice as many proposals as parallel tempering, even though parallel tempering takes larger steps. This is in part due to parallel tempering having no limit on the frequency of swap moves, whereas Scout MCMC only considers swapping every 20 iterations.

Figure 9. 4D basis vector trace plots. Notice that both PT and Scout MCMC are able to reach −10, 0, and 10 in all four dimensions.


4.3. Banana bunch distribution

The next target consists of a mixture of 12 banana distributions in $\mathbb{R}^3$ arranged such that there is even less interaction than in the double banana example. We call this distribution the banana bunch. As this example is in $\mathbb{R}^3$, we cannot simply present the contours. However, the distribution can be understood as the mixture of three groups. The projection of the target on each pair of axes (x-y, x-z, and y-z) appears as the contours in Figure 10.

Figure 10. Projections of the banana bunch distribution on each pair of axes (x-y, x-z, and y-z). (Note: lighter contours indicate higher density, red dot indicates the origin).


Combining all three groups results in the intersection of the apexes of two component distributions at ±40 along each axis. This results in a target with six modes, each far from the origin, and 24 tails extending from the modes toward each other. Once again, we capitalize on the symmetry of our targets and find the expected value to be at the origin. Though the origin has negligible density, scatter plots of samples projected down to planes composed of the basis vectors can be slightly misleading. Figure 11 presents a sample generated directly from the target distribution. The points seemingly at the origin are actually "above" and "below" the origin with respect to the axis missing from the respective plot.

Figure 11. True banana bunch samples. Presented are samples generated directly from the target distribution projected onto the x-y, x-z, and y-z plane.


Given the more complex nature of this distribution, we increase the number of samples for all algorithms to 100,000 with the first 1,000 discarded as burn-in. In addition to acceptance rate, first moment, and ESJD, we also consider the second moment, the expectation of the element-wise square of the samples. Such an expectation assesses how well the tails have been explored. A sample that concentrates too heavily in the 6 modes will overshoot this expectation even if it successfully produces a mean near the origin. The true value of this expectation is $[400, 400, 400]$. Table 3 presents the results of our experiment.

Table 3. Banana bunch results. Although most algorithms come quite close to the true expected mean, not all are successful in finding the squared mean. The best performers are Scout MCMC, the finite variant of Scout MCMC, and PT with five chains. However, even with five chains, PT has a much less favorable ESJD than either Scout MCMC variant, each utilizing only two chains.

Perhaps the most surprising result in Table 3 is that the sample mean produced by each algorithm is quite close to the true expected value, although aRWM and Scout MCMC are clearly the best performers on that front. Regarding the squared expected value, Scout MCMC performs the best, even though the results for parallel tempering with 5 chains present a convincing case for its convergence properties. We note, however, that parallel tempering with 2 chains fails to match the performance of either Scout MCMC or parallel tempering with 5 chains. This suggests that 2 chains are not generally sufficient without a more nuanced strategy on the main chain, such as using the DM sampler in Scout MCMC. Finally, with respect to ESJD, we note that aRWM and Scout MCMC are clearly the best performers on this efficiency metric. However, we warn the reader that aRWM accepted less than 2% of proposals, did not produce a sample second moment close to the true second moment, and finished the algorithm with a covariance matrix with diagonal $[862, 1000, 209]$. This is an indication that as the dimension increases, aRWM relies heavily on expanding its reach to cover the whole region of interest rather than conforming to the shape of the target distribution. The cost of this behavior is that the proposals are not always of high quality, and one must hope that enough proposals are generated to extract a decent set of good samples in a limited amount of time. In this case, the enormous proposal distribution was not sufficient to fully explore the tails leading in from the modes toward the origin, thus producing an inaccurate sample second moment.

To be complete, however, we must also examine the plots of samples (projected to 2D) to understand whether our algorithms truly explore all modes and tails. Once again, we refer the reader to Figure 11 to view the expected behavior. Given that RWM, MALA, and the DM sampler have no mechanisms for inter-mode movement, and that the numerical results reflect this fact, we focus on the remaining algorithms from here. We also focus on the 5-chain version of parallel tempering given that it outperformed the 2-chain version.

The visual results of Figure 12 reflect the behaviors noted by the diagnostics in Table 3. Adaptive RWM struggles to explore the tails as well as the z-axis, which results in the poor squared expected value. Parallel tempering, in contrast, performs better in the tails but does not explore the distribution evenly within the specified number of iterations, which manifests in the poorer expected value. In addition, there are a number of samples well beyond the modes that are the result of a swap from a tempered chain to the main chain. With Scout MCMC and its finite variant, we see the most appropriate distinction between mode and tail concentrations, which manifests in the best squared expectation. They also do so with a higher frequency of points than aRWM or parallel tempering. The highly efficient proposals are realized even when the large distance between modes of the basis vector example is combined with the unusual geometry of the banana examples, illustrating the ability of Scout MCMC to deliver on the promises of a multimodal sampler that excels at rapid adaptation to local geometry.

Figure 12. Banana Bunch Samples (projected onto the x-y, x-z, and y-z planes). aRWM produces much sparser plots that reflect its inability to explore the tails of the bananas. PT, despite having more samples, still results in inaccurate sample statistics because of its imbalance, with many points skewing to one side. The two Scout MCMC variants produce the best results, with a large and balanced set of samples as well as strong exploration of the banana tails.


4.4. Bayesian linear regression with horseshoe priors

The final example target distribution is motivated by a more practical scenario. MCMC methods are frequently used in Bayesian inference, and considerable work is devoted to developing methods that can sample from complex posterior distributions. To test the DM sampler and Scout MCMC in such a situation, we construct a Bayesian linear regression scenario with half of the coefficients set to zero. In Bayesian settings, sparsity in regression coefficients can be induced using a horseshoe prior (Carvalho, Polson, and Scott Citation2009). The horseshoe prior is known to be difficult to sample from given its use of the Cauchy distribution. Our example uses 20 independent variables and 1,000 observations generated from a standard normal distribution. The true values of the first 10 coefficients are $[10, 5, 2.5, 2, 1, 0.75, 0.5, 0.25, 0.2, 0.1]$ and the second set of 10 coefficients are all set to 0. In addition to the 20 coefficients, there is one error variance parameter, 20 local horseshoe coefficient parameters, and one global horseshoe parameter, for a total of 42 parameters to sample.
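Since the paper does not state the exact parameterization, the sketch below is one plausible rendering of the 42-parameter log posterior: half-Cauchy(0, 1) priors on the 20 local scales and the global scale, a flat prior on $\log\sigma$, and scale parameters sampled on the log scale with the corresponding Jacobian terms. All of these specifics are our assumptions for illustration.

```python
import numpy as np

def log_posterior(theta, X, y):
    """Illustrative horseshoe-regression log posterior.

    theta packs 42 parameters: beta (20), log lambda (20), log tau,
    log sigma. Assumed priors: beta_j ~ N(0, (lambda_j * tau)^2),
    lambda_j, tau ~ half-Cauchy(0, 1), flat prior on log sigma.
    """
    p = X.shape[1]
    beta = theta[:p]
    lam = np.exp(theta[p:2 * p])      # local scales, in log space
    tau = np.exp(theta[2 * p])        # global scale
    sigma = np.exp(theta[2 * p + 1])  # error standard deviation
    resid = y - X @ beta
    # Gaussian likelihood (up to constants).
    ll = -y.shape[0] * np.log(sigma) - 0.5 * resid @ resid / sigma ** 2
    # Horseshoe prior on the coefficients.
    scale = lam * tau
    lp_beta = -np.sum(np.log(scale)) - 0.5 * np.sum((beta / scale) ** 2)
    # half-Cauchy(0, 1) priors plus log-Jacobians of the exp() transforms.
    lp_lam = np.sum(-np.log1p(lam ** 2) + np.log(lam))
    lp_tau = -np.log1p(tau ** 2) + np.log(tau)
    return ll + lp_beta + lp_lam + lp_tau
```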

We execute each of the algorithms for 50,000 iterations, with the first 5000 discarded as burn-in. We modify some of the algorithms to accommodate the higher dimension of this example. For parallel tempering, we execute a 2-chain version as before but replace the 5-chain version with a 10-chain version. For the DM sampler, we adapt the step size to shrink upon rejecting a sample (a minimal sketch of this rule follows below). Finally, for both the DM sampler and Scout MCMC, we only execute the adaptive versions. To compare across algorithms, we consider the acceptance rate, the distance between the true and estimated regression coefficients, and the ESJD. Table 4 presents the results of the experiment.
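As a concrete illustration of the rejection-triggered shrinkage mentioned above, a minimal sketch is given below; the multiplicative factor of 0.9 and the function name are hypothetical, as we do not prescribe a particular schedule here.

```python
def update_step_size(gamma: float, accepted: bool, shrink: float = 0.9) -> float:
    """Shrink the DM sampler step size gamma after a rejected proposal.

    The shrink factor 0.9 is an illustrative assumption; the text above
    only specifies that the step size shrinks upon rejection.
    """
    return gamma if accepted else gamma * shrink
```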

Table 4. Bayesian linear regression with horseshoe priors results.

As Table 4 illustrates, the majority of the algorithms tested produced posterior means that are close to the true coefficient values, with aRWM being a notable exception. Also of interest are the ESJD values. Of the algorithms that did converge to the true parameter values, only the DM sampler and Scout MCMC had reasonably high ESJD values, suggesting that they sampled most effectively from the posterior. This example illustrates that both algorithms introduced here can be competitive with established MCMC methods in a practical setting.

We conclude this section by returning to the notion of efficient and effective sampling. We find that aRWM may be efficient from an ESJD perspective and parallel tempering is effective as a way to explore different modes, but neither proves adequate across the board. Instead, Scout MCMC proves itself an efficient and effective "smart" sampler by combining rapid regional adaptation with heavy tempering for mode swapping.

5. Discussion and future work

In this paper we have introduced an algorithm designed to rapidly adapt to the local behavior of a given target distribution. Such adaptation is accomplished through the minimization of the information projection (I-projection) side of the KL divergence between the target distribution and the proposal distribution family. By combining this divergence minimization sampler with one highly tempered chain to create Scout MCMC, we illustrate how the DM sampler may be integrated into other existing MCMC approaches to combine algorithmic strengths. Finally, we leverage the adaptation of the DM sampler and Scout MCMC in a two-stage algorithm that uses the covariance matrices produced by the DM sampler or Scout MCMC in a first phase to initialize a follow-up Metropolis-Hastings phase that adheres to standard convergence criteria. This finite adaptation algorithm continues to use optimized local samplers to sample efficiently from local geometries without needing perpetual adaptive steps.

Sampling from target distributions with irregular geometry is well established as a difficult problem. Our algorithms address this challenge by focusing on the local behavior of the target distribution at the current chain position. This strategy offers a number of key advantages. First, at any given iteration, local covariance structures can often be more informative than the global covariance. Our algorithms are also able to adapt to the target without a large reserve of previous samples, which is critical when sampling from distributions with rapid changes in curvature. Finally, our DM sampler is designed to be modular. Here we have illustrated how the DM sampler can be integrated into a parallel tempering algorithm, as well as a Metropolis-Hastings algorithm, and we have studied how the combined approaches merge the strengths of each individual component. The DM sampler could similarly be used in conjunction with other, more sophisticated algorithms. For example, combining the JAMS algorithm of Pompe, Holmes, and Łatuszyński (2019) with the DM sampler could be effective in multimodal settings by allowing for rapid adaptation immediately after executing a jump move into a new mode. The DM sampler could also be combined with Multiple-Try algorithms to select search directions (Liu, Liang, and Wong 2000; Martino 2018).

We have presented the baseline DM sampler algorithm in this paper and believe that there is much room for future research. For example, the criteria required for a non-diminishing, perpetually adaptive algorithm to converge remain under-explored. Moreover, one might study whether there is an optimal frequency for adaptation and swapping, or whether certain target geometries are more or less challenging to explore. Smaller changes, such as adapting the step size and other fixed parameter inputs, are also possibilities. Finally, there is certainly room to explore different objective functions in the DM sampler. Possibilities include testing alternative similarity measures or replacing the regularizing term to encourage different behaviors. Such modifications could further improve the performance of the DM sampler and Scout MCMC beyond what has been demonstrated in this paper.

Code availability statement

The implementations of the algorithms discussed here along with all code used to generate examples can be accessed at https://github.com/AmeerD/Scout-MCMC.

Acknowledgements

We thank the anonymous referee for very helpful comments which have improved our paper and helped emphasize its position in the broader adaptive MCMC literature.

Disclosure statement

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Appendix

A.1. Approximating the gradient

Since
$$J(x) = \beta H_q + \beta\,\mathbb{E}_{\epsilon}\!\left[\log p(x + L\epsilon)\right] + \mathbb{E}_{\epsilon}\!\left[\min\{0,\ \log p(x + L\epsilon) - \log p(x)\}\right],$$
the gradient of $J(x)$ with respect to $L$ is
$$\begin{aligned}
\nabla_L J(x) &= \nabla_L\,\beta H_q + \nabla_L\,\beta\,\mathbb{E}_{\epsilon}\!\left[\log p(x + L\epsilon)\right] + \nabla_L\,\mathbb{E}_{\epsilon}\!\left[\min\{0,\ \log p(x + L\epsilon) - \log p(x)\}\right] \\
&= \beta\,\nabla_L H_q + \beta\,\mathbb{E}_{\epsilon}\!\left[\nabla_L \log p(x + L\epsilon)\right] + \mathbb{E}_{\epsilon}\!\left[\nabla_L \min\{0,\ \log p(x + L\epsilon) - \log p(x)\}\right].
\end{aligned}$$

Consider each of the three terms of $\nabla_L J(x)$ individually:

Term 1: $\beta\,\nabla_L H_q$. We can use the form of the entropy of a multivariate normal distribution to evaluate this gradient:
$$\beta\,\nabla_L H_q = \beta\,\nabla_L\!\left(\frac{k}{2}\log(2\pi e) + \frac{1}{2}\log\left(|L|\,|L^{T}|\right)\right) = \beta\,\nabla_L\!\left(\frac{k}{2}\log(2\pi e) + \frac{1}{2}\sum_{i=1}^{k}\log L_{ii}^{2}\right) = \beta\,\nabla_L\!\left(\sum_{i=1}^{k}\log L_{ii}\right) = \beta\,\mathrm{diag}\!\left(\frac{1}{L_{11}},\ldots,\frac{1}{L_{kk}}\right).$$

Term 2: $\beta\,\mathbb{E}_{\epsilon}\!\left[\nabla_L \log p(x + L\epsilon)\right]$. First, note the following:
$$\beta\,\mathbb{E}_{\epsilon}\!\left[\nabla_L \log p(x + L\epsilon)\right] = \beta\,\mathbb{E}_{\epsilon}\!\left[\frac{1}{p(x + L\epsilon)}\,\nabla p(x + L\epsilon)\,\epsilon^{T}\right].$$
The expectation on the right-hand side does not simplify cleanly. However, the interior of the expectation is simple enough to evaluate for a given $\epsilon$. As such, we can draw a number of $\epsilon_j \sim N(0, I)$ at each iteration and compute an unbiased estimate of $\beta\,\mathbb{E}_{\epsilon}\!\left[\nabla_L \log p(x + L\epsilon)\right]$ by simple Monte Carlo. That is, at each iteration, we compute
$$\beta\,\mathbb{E}_{\epsilon}\!\left[\nabla_L \log p(x + L\epsilon)\right] \approx \frac{1}{J}\sum_{j=1}^{J}\frac{\beta}{p(x + L\epsilon_j)}\,\nabla p(x + L\epsilon_j)\,\epsilon_j^{T}.$$

Term 3: $\mathbb{E}_{\epsilon}\!\left[\nabla_L \min\{0,\ \log p(x + L\epsilon) - \log p(x)\}\right]$. As with the second term, this expectation does not simplify, but we can produce an unbiased estimate by relying on a series of draws $\epsilon_j \sim N(0, I)$ in a given iteration:
$$\mathbb{E}_{\epsilon}\!\left[\nabla_L \min\{0,\ \log p(x + L\epsilon) - \log p(x)\}\right] \approx \frac{1}{J}\sum_{j=1}^{J}\nabla_L \min\{0,\ \log p(x + L\epsilon_j) - \log p(x)\}.$$
However, the presence of the minimum operator means the summands require a case analysis, unlike the previous component. The two cases separate naturally according to whether $\log p(x + L\epsilon_j) \geq \log p(x)$.

In the first case, if $\log p(x + L\epsilon_j) \geq \log p(x)$, acceptance of the proposal under a Metropolis framework is guaranteed and
$$\nabla_L \min\{0,\ \log p(x + L\epsilon_j) - \log p(x)\} = 0.$$
If instead $\log p(x + L\epsilon_j) < \log p(x)$, then the Metropolis ratio is less than 1 and we have
$$\nabla_L \min\{0,\ \log p(x + L\epsilon_j) - \log p(x)\} = \nabla_L\!\left(\log p(x + L\epsilon_j) - \log p(x)\right) = \frac{1}{p(x + L\epsilon_j)}\,\nabla p(x + L\epsilon_j)\,\epsilon_j^{T}.$$

Consolidating the three terms, to search for an optimal local proposal distribution we perform the following gradient-based update at each iteration of the MCMC chain (we omit the iteration subscript $t$ on $x$ and on $L$ inside the gradient for clarity):

$$L_{t+1} = L_t + \gamma\,\nabla_L J(x),$$
where
$$\nabla_L J(x) = \beta\,\mathrm{diag}\!\left(\frac{1}{L_{11}},\ldots,\frac{1}{L_{kk}}\right) + \frac{1}{J}\sum_{j=1}^{J}\frac{\beta}{p(x + L\epsilon_j)}\,\nabla p(x + L\epsilon_j)\,\epsilon_j^{T} + \frac{1}{J}\sum_{j=1}^{J}\nabla_L \min\{0,\ \log p(x + L\epsilon_j) - \log p(x)\}.$$
Note further that the summand in the Term 3 sum reduces to the following two cases depending on the value of $\epsilon_j$:
$$\nabla_L \min\{0,\ \log p(x + L\epsilon_j) - \log p(x)\} =
\begin{cases}
0 & \text{if } \log p(x + L\epsilon_j) \geq \log p(x), \\[4pt]
\dfrac{1}{p(x + L\epsilon_j)}\,\nabla p(x + L\epsilon_j)\,\epsilon_j^{T} & \text{if } \log p(x + L\epsilon_j) < \log p(x).
\end{cases}$$

Note that the current position is denoted $x$, the proposal is $y = x + L\epsilon$, the standard multivariate normal draws are $\epsilon_j \sim N(0, I)$, and $\gamma$ is the predetermined step size.
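For concreteness, a minimal NumPy sketch of this update is given below. The function and argument names (dm_gradient_step, log_p, grad_log_p, J) are our own, and the final lower-triangular projection is an added assumption to keep $L$ a valid Cholesky factor; the repository linked above contains the full implementation.

```python
import numpy as np

def dm_gradient_step(x, L, log_p, grad_log_p, beta, gamma, J=10, rng=None):
    """Perform one stochastic gradient update of the Cholesky factor L
    using the three-term gradient of J(x) derived above."""
    rng = np.random.default_rng() if rng is None else rng
    k = x.shape[0]

    # Term 1: entropy gradient, beta * diag(1/L_11, ..., 1/L_kk).
    grad = beta * np.diag(1.0 / np.diag(L))

    log_p_x = log_p(x)
    for _ in range(J):
        eps = rng.standard_normal(k)          # epsilon_j ~ N(0, I_k)
        y = x + L @ eps                       # proposal y = x + L * epsilon_j
        # grad_L log p(x + L eps) = grad log p(y) eps^T by the chain rule.
        score = np.outer(grad_log_p(y), eps)

        # Term 2: Monte Carlo estimate of beta * E[grad_L log p(x + L eps)].
        grad += beta * score / J

        # Term 3: zero when log p(y) >= log p(x); otherwise the same score matrix.
        if log_p(y) < log_p_x:
            grad += score / J

    # Keep L lower triangular (our assumption; the derivation above does not
    # address the off-diagonal entries of the unconstrained gradient).
    grad = np.tril(grad)
    return L + gamma * grad                   # ascent step: L_{t+1} = L_t + gamma * grad
```

Here $J$, the number of $\epsilon$ draws per iteration, trades the variance of the two Monte Carlo terms against per-iteration cost.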
