The Hamming Ball Sampler

ABSTRACT We introduce the Hamming ball sampler, a novel Markov chain Monte Carlo algorithm, for efficient inference in statistical models involving high-dimensional discrete state spaces. The sampling scheme uses an auxiliary variable construction that adaptively truncates the model space allowing iterative exploration of the full model space. The approach generalizes conventional Gibbs sampling schemes for discrete spaces and provides an intuitive means for user-controlled balance between statistical efficiency and computational tractability. We illustrate the generic utility of our sampling algorithm through application to a range of statistical models. Supplementary materials for this article are available online.


Introduction
Statistical inference of high-dimensional discrete-valued vectors or matrices underpins many problems across a variety of applications including language modeling, genetics, and image analysis. Bayesian approaches for such models typically rely on the use of Markov chain Monte Carlo (MCMC) algorithms to simulate from the posterior distribution over these objects. The effective use of such techniques requires the specification of a suitable proposal distribution that allows the MCMC algorithm to fully explore the discrete state space while maintaining sampling efficiency. While there have been intense efforts to design optimal proposal distributions for continuous state spaces, generic approaches for high-dimensional discrete state models have received relatively less attention but some examples include the classic Swendsen-Wang algorithm (Swendsen and Wang 1987) for Ising/Potts models and more recent sequential Monte Carlo methods (Schäfer and Chopin 2013).
In this article, we consider Bayesian inference using MCMC for an unobserved latent discrete-valued discrete sequence or matrix X ∈ X , where each element x i j ∈ {1, . . . , S}, given observations y = [y 1 , . . . , y N ]. We will assume that the observations are conditionally independent given X and model parameters θ so that the joint distribution factorizes as p(y, X, θ ) = [ N i=1 p(y i |X, θ )]p (X, θ ). We further assume that the posterior distribution p(X, θ|y) has a complex dependence structure so that standard MCMC schemes, such as a (Metropolis-within) Gibbs Sampler, using θ ← p(θ|X, y), (1) are both intractable because exhaustive summation over the entire state space of X has exponential complexity. A popular and tractable alternative is to employ blockconditional (Metropolis-within) Gibbs sampling in which subsets x i of X are updated conditional on other elements being fixed using θ ← p(θ|X, y), (4) where X −i denotes the elements excluding those in x i . Typical block structures might be rows/columns of X, when it is a matrix, or sub-blocks when X is a vector. While blockconditional sampling approaches are often convenient (they may be of closed form allowing for Gibbs sampling without resort to Metropolis-Hastings steps), in high dimensions, major alterations to the configuration of X maybe difficult to achieve as this must be done via a succession of small (possibly low probability) incremental changes. Conditional sampling may lead to an inability to escape from local modes in the posterior distribution particularly if the elements of X exhibit strong correlations with each other and together with θ .
To address these problems, we propose a novel and generic MCMC sampling procedure for high-dimensional discrete-state models, named the "Hamming ball sampler. " This sampling algorithm employs auxiliary variables that allow iterative sampling from slices of the model space. Marginalization within these model slices is computationally feasible and, by using sufficiently large slices, it is also possible to make significant changes to the configuration of X. The proposed sampling algorithm spans a spectrum of procedures that contains the marginal and block-conditional Gibbs sampling strategies as extremes. At the same time, it allows the user to express many more novel schemes so that to select the one that best balances statistical efficiency and computational tractability.
We demonstrate the utility of the sampling procedure with three different statistical models where exhaustive enumeration is impossible for realistic datasets and illustrate the considerable benefits over standard sampling approaches.

Theory
In this section, we describe the theoretical foundations of the proposed sampling algorithm (Sections 2.1-2.4) and we discuss computational complexity and sampling efficiency (Section 2.5).

Construction
The Hamming ball sampler considers an augmented joint probability model that can be factorized as p(y, X, θ, U) = p(y, X, θ )p (U|X), where the extra factor p(U|X) is a conditional distribution over an auxiliary variable U, which lives in the same space and has the same dimensions as X. The conditional distribution p(U|X) is chosen to be a uniform distribution over a neighborhood set H m (X) centered at X, where I(·) denotes the indicator function and the normalizing constant Z m is the cardinality of H m (X).
The neighborhood set H m (X) will be referred to as a Hamming ball since it is defined through Hamming distances so that Here, d(x i , u i ) denotes the Hamming distance j I(u i j = x i j ) and the pairs (u i , x i ) denote nonoverlapping subsets of corresponding entries in (U, X) such that ∪ P i=1 u i = U and ∪ P i=1 x i = X. Also, the parameter m denotes the maximal distance or radius of each individual Hamming ball set. For instance, these pairs can correspond to different matrix columns so that x i will be the ith column of X and u i the corresponding column of U. Hence, the Hamming ball H m (X) would consist of all matrices whose columns are at most m elements different to X.
Furthermore, the auxiliary factor p(U|X) factorizes across the P blocks {u i , x i } P p=1 as follows: where Z i,m is the volume of the individual Hamming ball set which, in case each block x i consists of K elements, is equal to Z i,m = M = m j=0 (S − 1) j K j . Importantly, the Z i,m is independent of the exact values of u i and x i (i.e., the volume of the Hamming ball does not depend on where we are in the model space). This is critical for the application of Gibbs sampling later on.
Notice that the above factorization is just a consequence of the product factorization of the indicator function I(U ∈ H m (X)) across the blocks. More precisely, this indicator function satisfies the following factorization and symmetry properties: where the second one is direct consequence of the symmetry of the Hamming distance d(u i , x i ).

Gibbs Sampling
The principle behind the Hamming ball sampler is that the use of Gibbs sampling for the augmented joint probability distribution p(y, X, θ, U) admits the target posterior distribution p(X, θ|y) as a by-product (since marginalization over U recovers the target distribution). Specifically, the Hamming ball sampler alternates between the steps: (θ, X) ← p(θ, X|y, U).

Restricted State Space
Crucially, the restricted state space defined by the Hamming ball, which has been injected into the model via the auxiliary factor p(U|X), means that the conditional distribution p(X|θ, U, y) can be tractably computed as where we used Equation (10), p(θ, U, y) = p(θ,U,y) Zm and p(θ, U, y) = X ∈H m (U) p(y, X , θ ) is the normalizing constant found by exhaustive summation over all admissible matrices inside the Hamming ball H m (U). Through careful selection of m, the cardinality of H m (U) will be considerably less than the cardinality of X so that exhaustive enumeration of all elements inside the Hamming ball would be computationally feasible.
where the subsets (x, u) correspond to columns of the matrix. Panel (b) illustrates a block strategy for the application of Hamming ball sampling when X is a D × 1 vector split into random blocks of size K.
Overall, the proposed construction uses the auxiliary variable U to define a slice of the model given by H m (U). Sampling of (θ, X) is performed within this sliced part of the model through p(θ, X|y, U). At each iteration, this model slice randomly moves via the resampling of U in step (11), which simply sets U to a random element from H m (X). This resampling step allows for random exploration that is necessary to ensure that the overall sampling scheme is ergodic. The amount of exploration depends on the radius m so that U can differ from the current state of the chain, say X (t ) , at most in mP elements, that is, the maximum Hamming distance between U and X (t ) is mP. Similarly, the subsequent step of drawing the new state, say X (t+1) , is such that at maximum X (t+1) can differ from U in mP elements and overall it can differ from the previous X (t ) at most in 2mP elements. Figure 1(a) graphically illustrates the workings of the Hamming ball sampler.
Furthermore, it is worth noting that when the restriction in the state space is relaxed, that is, when the radius m becomes large enough, the Hamming ball samplers reduce to the (Metropolis-within) Gibbs and marginal schemes outlined in (1)-(2) and (3). More precisely, if the size of each block x i is K and we assume a Hamming radius equal to the block size, that is, m = K, then the Hamming ball set H m (U) is completely unrestricted and becomes equal to X . In such case the restricted conditional p(X|θ, U, y) becomes equal to the exact (unrestricted) conditional p(X|θ, y) since I(X ∈ H m (U)) = 1 for any X, U ∈ X . Similarly, the restricted p(θ, U, y) normalizing constant of p(X|θ, U, y) reduces to the exact marginal p(θ, y) and therefore the Hamming ball sampler schemes reduce to the algorithms described by (1)-(2) and (3), respectively.

Selection of Blocks
The application of the Hamming ball sampler requires the selection of the subsets or blocks {x 1 , . . . , x P }. This selection will depend on the conditional dependencies specified by the statistical model underlying the problem to be addressed. For some problems, such as the tumor deconvolution mixture model considered later, there may exist a natural choice for these subsets (e.g., columns of a matrix) that can lead to efficient implementations. For instance, under a suitable selection of blocks the posterior conditional p(X|θ, U, y) could be fully factorized, that is, p(X|θ, U, y) = P i=1 p(x i |θ, u i , y), or have a simple Markov dependence structure so that exact simulation of X would be feasible. In contrast, for unstructured models, where X is just a large pool of fully dependent discrete variables (stored as a D-dimensional vector), we can divide the variables into randomly chosen blocks x i , i = 1, . . . , P, so that they have equal length K = length(x i ). (If D/P is not an integer, then the final block x P will have size smaller than K.) In such cases, exact simulation from p(X|θ, U, y) may not be feasible and instead we can use the Hamming ball operation to sequentially sample each block. More precisely, this variant of the algorithm can be based on the iteration (11), (13)-(14) with the only difference that the steps (11) and (14) are now split into P sequential conditional steps, where the posterior conditional p( where we made use of the factorization of the auxiliary distribution p(U|X) from (8). The above scheme can be thought of as a block Hamming ball sampler, which incorporates standard block Gibbs sampling (see iterations (4)-(5)) as a special case obtained when the radius m is equal to the block size K. In a purely block Hamming ball scheme we will have m < K and in general the parameters (m, K ) can be used to jointly control algorithmic performance ( illustrative examples are given in Section 3.1 and in the supplementary information at Section S1). This scheme is illustrated in Figure 1(b) and used later in the sparse linear regression application.

Computational Time Complexity and Sampling Efficiency
We now discuss the relative computational time complexity of the Hamming ball sampler compared to the block Gibbs sampler. For simplicity, we assume that the conditional posterior distribution p(X|θ, U, y) factorizes across P blocks of size K.
The computational time complexity of the Hamming ball sampler scales with the Hamming radius m, block size K, and the number of blocks P according to O(MP) where M = m j=0 (S − 1) j K j . The standard block Gibbs sampler is a special case of the Hamming ball sampler where the Hamming distance is always the same as the block size (m = K) and therefore has computational time complexity of O(S K P) (where S K = K j=0 (S − 1) j K j ). The block Gibbs sampler is therefore only practical for small values of block size K since we must enumerate all possible values for that block in each Gibbs sampling sweep. The Hamming ball sampler provides flexibility in that we can choose to use larger block sizes K and simultaneously use a smaller Hamming distance m to limit the possible values to maintain a manageable computational time cost. However, to understand if this flexibility can prove useful, we must also consider the relative sampling efficiencies of the two approaches.
Sampling efficiency characterizes the sampler in terms of its ability to fully explore the probability distribution and traverse between different modes. It depends on both the algorithm as well as the landscape of the probability distribution under exploration. Therefore, an ideal sampling algorithm needs to strike a balance between computational time complexity and sampling efficiency to realize the largest number of (effective) independent posterior samples in each unit of computational time. We use the following measure to quantify overall efficiency: Overall efficiency ∝ #Effective sample size in n iterations time to perform n iterations = s n T n , which expresses a balance between how effective the algorithm is in producing independent samples and how fast it runs. Notice that for real inference problems it is ordinarily very difficult to compute the above measure (in particular, when the posterior distribution has an unknown number of modes, the computation of the effective sample size is very challenging and standard autocorrelation measures can be misleading) to rank the different sampling schemes. For benchmarking purposes, we can develop simulated scenarios where the properties of the posterior landscape can be determined in advanced. Using an illustrative example, similar to that given in Section 3.1, in which the true posterior distribution was bimodal, we applied a range of Hamming ball sampling schemes with different combinations of parameters (m, K ) to explore the posterior. This included schemes where m = K that correspond to block Gibbs samplers. We approximated s n by counting the number of times each sampling scheme was able to switch from one mode to the other. Supplementary Figure S1 shows the performance for each of the sampling schemes and we note that the overall best performing sampler was for (m, K ) = (1, 13) and none of the block Gibbs samplers were among the most effective sampling schemes. We concluded that while block Gibbs samplers, for a given block size, had the best sampling efficiency, the need to consider all enumerations of the latent variables means they also have the worse computational time complexity. We shall discuss in further detail in the forthcoming examples but, by limiting the number of computations with the Hamming distance criterion, the Hamming ball samplers can achieve a better balance between the computational complexity and sampling efficiency.

Examples
We now illustrate the utility of our Hamming ball sampling scheme through its application to two statistical models: in mixture modeling for tumor deconvolution (Section 3.3) and Bayesian variable selection in linear regression (Section 3.2). We also give an example based on factorial hidden Markov models in the supplementary information. Simulated examples also are used to characterize the properties of the Markov chain Monte Carlo sampling approaches tested. Our emphasis is on evaluating the performance of Hamming ball-based sampling versus conventional block Gibbs sampling approaches rather than the quality of the model specifications. However, we first illustrate the use of the Hamming ball sampler through a simulated toy example.

An Illustrative Toy Example
Here, we give a simple example that illustrates the differences between the Hamming ball sampler and a standard Gibbs sampler.
We consider a simple observation model that generates a scalar output y i given a vector of covariates z i according to where X = (x 1 , . . . , x D ) is a latent binary vector that follows a prior distribution such that Pr( This is a simplified version of a variable selection problem in linear regression where, for instance, the regression coefficients are assumed to have known values that are fixed to unity. Given a set of examples {y i , z i } n i=1 , our goal is to estimate the posterior distribution over X while the parameter σ 2 is also taken as known. In the next section, we will consider a much more realistic variable selection problem, but the current toy model suffices to clearly illustrate the differences between the Hamming ball sampler and the standard Gibbs sampler. For this illustration, we simulated datasets where n = 200, D = 20 and the covariates for the ith example where such that z i,d ∼ U (0, 1) with d = 1, . . . , 10 while the remaining covariates where exact replicas, that is, z i,10+d = z d , d = 1, . . . , 10. Then, each response was generated according to y i = z i,6 + η i , η ∼ N (0, σ 2 ). The overall simulation creates two completely symmetric modes in the posterior distribution over X so that the 6th and the 16th covariates could provide equally good explanation of the observed responses. More precisely, the first mode is such that (x 6 = 1, x 16 = 0) and the second one is such that (x 6 = 0, x 16 = 1), while for both modes the remaining x d 's are zero.
We simulated three different datasets by varying the level of the noise variance according to σ 2 = 0.5, 2, 5. In the first case (σ 2 = 0.5), we have a very informative likelihood function that induces a posterior distribution over X where the two modes are very sharply picked and there is little probability in the regions between these modes. In the second case (σ 2 = 2), we have a more diffuse posterior distribution where the two modes are less sharply picked, while in the third case (σ 2 = 5) the posterior becomes even more diffuse and closer to the uniform prior.
To estimate the posterior over X, we consider the simplest possible Hamming ball sampler that jointly samples X using a Hamming ball of radius m = 1, and also the simplest possible Gibbs sampler that in each iteration sequentially samples an x d conditional on the remaining elements of X. Both samplers have the same computational complexity O(D) and the actual running times are roughly the same. They essentially differ on how they allocate their computational resources. Figure 2 shows the evolution of the state vector X for both algorithms and all three cases when performing 1000 sampling iterations (collected after 100 burn-in samples). Clearly, the Hamming ball sampler is able to mix well between the modes in all three cases, since it is able to simultaneously make at most 2m = 2 changes in X during each sweep, and provides a correct estimation of the posterior distribution.
In contrast, the Gibbs sampler becomes stuck in local posterior modes in the first two instances as it cannot simultaneously flip two bits in X and jump between modes. The Gibbs sampler is only able to mix well in the presence of large observation noise variance σ 2 = 5 when the posterior distribution is diffuse and there is sufficient probability mass for the Gibbs sampler to traverse between the two posterior modes. In this case, the posterior distribution is near-independent in the latent variables and the Gibbs sampler becomes very efficient since it performs exhaustive local exploration by giving a chance to all D bits to get flipped in a full sweep. The Hamming ball sampler can flip less than D bits in a single sweep, and therefore for very diffuse posterior distributions it will be less effective than Gibbs sampling. In contrast, for correlated posterior distributions Hamming ball sampling can be more effective than Gibbs sampling since it has the ability to perform joint updates and jump between modes.
The above simple illustrative example demonstrates the potential advantages of the Hamming ball sampler over standard Gibbs approaches. In practice, a mixture of Gibbs and Hamming ball sampling moves could be used to balance global and local exploration of the posterior distribution.

Sparse Linear Regression
We now consider variable selection problems with sparse linear regression models. Applications of such models can arise in problems such as expression quantitative trait loci (eQTL) analysis that is concerned with the association of genotype measurements, typically single nucleotide polymorphisms (SNPs), with phenotype observations (see, e.g., O'Hara et al. 2009; Bottolo and Richardson 2010; Peltola, Marttinen, and Vehtari 2012, and references therein). The interest lies in the posterior distribution over some binary vector X that encodes the selection over the covariates and can inform us as to which covariates are most important for defining the observed response variable. Typically X is assumed to be sparse so that only a few covariates contribute to the explanation of the observations.

.. Setup
We consider a dataset where y i ∈ R is the observed response and z i ∈ R D is the vector of the corresponding covariates. We can collectively store all responses in an N × 1 vector y (assumed to be normalized to have zero mean) and the covariates in an N × D design matrix Z. We further assume that from the total D covariates there exists a small unknown subset of relevant covariates that generates the response. This is represented by a D-dimensional unobserved binary vector X that indicates the relevant covariates and follows an independent Bernoulli prior distribution, where π 0 is assigned a conjugate Beta prior, Beta(π 0 |α π 0 , b π 0 ), and (α π 0 , b π 0 ) are hyperparameters. Given X, a Gaussian linear regression model takes the form where Z X is the N × D X design matrix, with D X = D d=1 x d , having columns corresponding to x d = 1 and β X is the respective D X × 1 vector of regression coefficients. The regression coefficients β X and the noise variance σ 2 are assigned a conjugate normal-inverse-gamma prior where (g, α σ , b σ ) are hyperparameters. Notice that the particular choice g(Z T X Z X ) −1 for the covariance matrix, where is g is scalar hyperparameter, corresponds to the so-called g-prior (Zellner 1986).

.. Posterior Inference
Based on the above form of the prior distributions we can analytically marginalize out the parameters θ = (π 0 , β x , σ 2 ) and obtain the marginalized joint density (Bottolo and Richardson 2010): and (·) denotes the Gamma function. The hyperparameters of the prior were set to fixed values as follows. The hyperparameters of InvGa(σ 2 |α σ , b σ ) were set to α σ = 0.1 and b σ = 0.1, which leads to a vague prior. The scalar hyperparameter for the g-prior were chosen to g = N as also used by Bottolo and Richardson (2010). Finally, the hyperparameters for the Beta prior, Beta(π 0 |α π 0 , b π 0 ), were set to the values α π 0 = 0.001 and b π 0 = 1, which favors sparse configurations for the vector X.
The sampling algorithm we use is a block Hamming ball sampler, which consists of a combination of a block Gibbs sampler with Hamming ball moves. More precisely, at each iteration where the current value of X is X (t ) , we randomly divide the D covariates into blocks of size K so that we have P = D/K separate blocks. Then, we iteratively visit each block x i and sample a new value for its elements based on a Hamming ball move. The whole scheme follows the iteration given below: 1. Randomly initialize X (0) and set t = 0. 2. At iteration t + 1 = 1, . . . , T randomly divide the elements in X into P = D/K random blocks so that x i denotes the elements of the ith block. 3. for i = 1, . . . , P (a) Sample auxiliary variables u (t+1) The above block Hamming ball sampler was applied to all examples assuming a fixed block size K = 10 and different values for the Hamming radius, m = 1 − 3 (named HB1-3). We also applied various block Gibbs samplers, which sample between 1 and 3 elements (named BG1-3) of X at a time and are obtained as special cases of the block Hamming ball sampling algorithm by setting K = m. For all MCMC algorithms, we used a burn-in phase of 100 iterations and then we collected 100,000 samples during the main sampling phase.

.. Results
We simulated a regression dataset with N = 100 responses and D = 1200 covariates in which there were two relevant covariates that fully explain the data while the remainder were noisy redundant inputs. As a consequence this sets up a challenging model exploration problem as only two out of 2 1200 possible models represent the possible truth.
We did this by first generating a 100 × 600 design matrix Z such that [Z] i j is sampled uniformly from {0, 1, 2}. We set the binary latent vector X to be everywhere zero apart from x 11 = 1, which was the only relevant covariate. The regression coefficients β were set to the vector of ones and the responses were generated according to or Subsequently, to create a completely symmetric mode in the posterior distribution over X, we replicated the covariates so that to finally obtain a new design matrix Z = [Z, Z], which had size 100 × 1200. In this way the 11th and the 611th covariates could provide equally good explanations of the observed responses y and therefore an efficient MCMC algorithm should identify both covariates and switch them frequently during sampling. We applied our MCMC sampling schemes to sample from the posterior distribution over this massive space of possible models. Figure 3 compares the relative performance of the various sampling schemes . The trace plots show the running marginal posterior inclusion probabilities of the two relevant variables x 11 and x 611 , which converge to the expected values of 0.5 with the Hamming ball samplers but not with the block Gibbs samplers. This indicates that the Hamming ball schemes were mixing well, able to identify the two relevant variables and frequently switched between their inclusions. In contrast, the block Gibbs samplers exhibited strong correlation effects (stickiness) that impaired their efficiency.
For such a high-dimensional problem, the performance of the simplest Hamming ball sampler (HB1) was particularly outstanding as it used the least CPU time and achieved a lower integrated autocorrelation time than BG1 and BG2. The performance can be explained by the fact that the Hamming ball sampling schemes can handle a large block of variables at a time but do not require exhaustive enumeration of all possible latent variable combinations within each block. This provides an important computational saving for sparse problems where most combinations will have low probability and the reason why the HB1 sampler was particularly effective for this example.

.. eQTL Analysis
We tackled a real expression quantitative trait (eQTL) data analysis problem by considering a subset of the eQTL dataset from Myers et al. (2007)  and gene expression measurements of neuropathologically human brain samples. The full dataset corresponds to a whole-genome genotyping and expression analysis on a series of 193 neuropathologically normal human brain samples using the Affymetrix GeneChip Human Mapping 500K Array Set and Illumina HumanRefseq-8 Expression BeadChip platforms; see Myers et al. (2007) for complete details. (This dataset is freely available from http://www.bios. unc.edu/research/genomic_software/seeQTL/data_source, which provides a unified data depository and summary page for several genome-wide eQTL datasets.) We specifically consider the task of discovering cisassociations for a certain gene, called KIF1B, on chromosome 1 so that the expression of KIF1B across samples was treated as the response in the Bayesian regression model. Local cisassociations had already been identified for KIF1B by Myers et al. (2007) but we sought uncover additional nonlocal associations that were still on the same chromosome. We used a subset of the dataset in Myers et al. (2007) consisting of 10,000 SNPs from chromosome 1 that we considered to be the set of covariates. These covariates were used to explain the gene expression of the gene KIF1B that exists on the same chromosome. The gene expression values of KIF1B across all samples were treated as the responses in the linear regression model. These responses were centered to have zero mean and scaled to have unit variance.
The set of 10,000 covariates was obtained from the initial large pool of 23,979 SNPs in chromosome 1 by identifying 9998 SNPs with no significant correlation with the gene expression of KIF1B (based on the empirical pairwise Pearson correlation coefficient less than 0.3). We then randomly chose two SNPs with Pearson correlation coefficient greater than 0.3 to complete the set. The goal is to identify if the samplers are able to identify one or both of these SNPs. Each covariate value for a certain SNP and sample was encoded based on three integer values in the range {0, 1, 2} so that 0 corresponds to genotype AA, 1 to genotype AB, and 2 to genotype BB. This encoding was already the data format in the data repository and it is standard for encoding SNPs in eQTL analysis.
Based on the results of the simulated data study, we tried two sampling schemes: (i) a block Gibbs sampler (BG2) and (ii) a Hamming ball sampler (HB1). These were chosen because they were sufficiently computationally inexpensive to operate on this problem involving 10,000 covariates. For both algorithms, we used a burn-in phase of 1000 iterations and then we collected 10,000 samples during the main sampling phase. Figure 4(a) shows the distribution of the log-likelihoods for both samplers. The distributions are remarkably similar demonstrating that both methods were exploring similar regions of the posterior space. However, while the block Gibbs sampler took over 26 hr to complete, the Hamming ball sampler took just 7 hr. Figure 4(b) verifies that both methods produce nearidentical output in terms of the posterior probabilities estimated. Both sampling approaches identified the SNP rs12120191 (one of the two artificially included SNPs) as being a putative eQTL for KIF1B. This was also found by Myers et al. (2007) and is unsurprising since the SNP actually lies in the KIF1B coding region!

Tumor Deconvolution Through Mixture Modeling
We now turn to a real world application in cancer genome sequence analysis. Tumor samples are genetically heterogenous and typically contain an unknown number of distinct cell subpopulations. Current DNA sequencing technologies ordinarily produce data that come from an aggregation of these subpopulations thus, to gain insight into the latent genetic architecture, statistical modeling must be applied to deconvolve and identify the constituent cell populations and their mutation profiles.

.. Setup
We assume that the data y = {r i , d i } N i=1 consist of N pairs of read counts where r i corresponds to the number of sequence reads corresponding to the variant allele at the ith locus and d i is the total number of sequence reads covering the mutation site. The distribution of the variant allele reads given the total read count follows a Binomial distribution where the variant allele frequency is given by φ i = (1 − e)p i + e(1 − p i ) and e is a sequence read error rate and p i = The parameter θ is a K × 1 vector denoting the proportion of the observed data sample attributed to each of the K tumor subpopulations whose genotypes are given by a K × N binary matrix X. We specify the prior probabilities over θ as and γ k ∼ Gamma(α/K, 1), k = 1, . . . , K.
This hierarchical representation is equivalent to a marginal prior distribution θ |α ∼ Dirichlet(α/K, . . . , α/K ), which induces a sparsity forcing values of θ to be close to zero when α ≤ 1 allowing us to do automatic model selection for the number of tumor subpopulations. The use of the auxiliary variables γ is for computational convenience and only the normalized parameters θ have a physical interpretation. We further specify the prior probabilities over X as and This framework is similar to that recently adopted by Zare et al. (2014) and Xu et al. (2015).

.. Posterior Inference
We used a Metropolis within Gibbs sampling approach incorporating our Hamming ball auxiliary variable construction to sample from the posterior p(θ, X, f |y). First, we define, v k = log(γ k ), k = 1, . . . , K, then to update the weight parameters θ (in the implementation we actually work with v) we used a mixture proposal of an independent proposal drawn from the prior distribution (in this case, the Log-Gamma distribution as v k = log(γ k )) and a random walk proposal drawn from a Normal distribution centered on the current value of θ : The parameter is set to a small value to allow occasional proposals from the prior and potentially facilitate larger joint changes to (θ, X).
We accept the joint proposal using a Metropolis-Hastings step as follows: where the superscript t indicates the iteration number, r ∼ Uniform(0, 1) and the acceptance probability is computed according to and g(v k |α/K, 1) is the probability density function of the loggamma distribution with parameters (α/K, 1). The above sampling move consists of a marginal Hamming ball sampler operation, as defined in Section 2.2, where together with the parameters v we jointly sample the full mutation matrix X from its restricted posterior conditional distribution, which factorizes across the columns. More precisely, the columns of X are considered as the blocks in the Hamming ball construction so that a new value for each column x i is proposed according to the following (and accepted or rejected jointly with v): where the normalizing constant on the denominator is truncated by the Hamming ball constraint potentially substantially reducing the computations required. For example, for (m, K ) = (1, 10), the number of summation terms is 11 but for (m, K ) = (3, 10) this increases to 176.
The subclonal mutation frequencies f are updated in a separate Gibbs step according to where I(·) denote the indicator function. Finally, the Hamming ball auxiliary variables U are updated column-wise by sampling uniformly from the Hamming ball centered on each column of the mutation matrix x i according to the distribution in (8).
In all simulations, we use a setting of = 0.01 for the mixture proposal. We used a burn-in phase of 10,000 iterations (including an initial 1000 iteration tuning phase) and took 100,000 samples during the main sampling run. We calibrated the variance of the proposal distribution σ 2 v during the tuning phase to achieve an acceptance rate of between 10% and 40% but disallowed the variance from falling below 0.01 or going above 10.
We compared three posterior sampling approaches: (i) our Hamming ball-based sampling scheme as defined above, (ii) a conventional block Gibbs sampling strategy that proceeds by conditionally updating one column x i at a time with the remaining columns X −i and the weights θ fixed, and (iii) a fully marginalized sampling strategy where X was marginalized through exhaustive summation over all column configurations (note, this corresponds to the Hamming ball sampler with m = K).

.. Results
For the simulation study, we considered a simulated data example, illustrated in Figure 5(a), where the observed sequence data are generated so that it can be equally explained by two different latent genetic architectures. This is an interesting example as one configuration corresponds to a linear phylogenetic relationship between cell types and the other to a branched phylogeny and represent fundamentally different evolutionary pathways. For this example, we would expect an efficient sampler to identify both configurations and to be able to move freely between the two during the simulation revealing the possibility of the existence of dual physical explanations for the observed data.
We simulated data by forward simulation from this assumed system. Specifically, we assumed the sequencing read depth d = 800, which represents a high coverage typical of a target resequencing experiment (note that our objective here is to explore the properties of the inference algorithm as supposed to the accuracy of the generative model or the experimental designs. Therefore, we have not exhaustively explored the interactions between these factors, which will be the subject of future investigation) and the following parameters: 1 1 1 1 1 1 1  1 1 1 1 1 1 0 0 0  1 1 1 0 0 where X 0 is a version of X 0 shown in Figure 5(a) with replicated columns, r i ∼ Binomial(d, p 0i ), i = 1, . . . , 9, where p 0 = θ T 0 X 0 to give values of 405,397,393,239,245,247,123,121,123] .
We chose hyperparameter values of α = 1, f 0 = 0.5, ( f α , f β ) = (0.5, 0.5), and e = 10 −2 . Figure 5(b) and 5(c) displays trace plots of the largest component weight, max (θ ), and the relative computational times for the three sampling schemes. We use max (θ ) as meaningful visualization of the multidimensional mutation matrix X is challenging. All Hamming ball samplers were effective at identifying both modes but the efficiency of the mode switching depends on the Hamming ball size m. This effectiveness can be attributed to the fact that the Hamming ball schemes can jointly propose to change up to 2mN bits across all N columns of the current X. Furthermore, conditional updates of θ can be made by marginalizing over a range of mutation matrices. In fact, for m ≥ K/2, each iteration of the Hamming ball sampler allows any element of X to be changed but, unlike the fully marginalized sampling procedure (m = K = 8), it is more computationally tractable if the number of mutations is large as exhaustive enumeration is not required. The conditional updates employed by the block Gibbs sampler require significantly less computational effort but the approach is prone to being trapped in single posterior mode and our simulations show that it failed to identify the mode corresponding to the linear phylogeny structure (max θ = 0.4).

.. Real Data Analysis
We next sought to demonstrate the method using real mutation sequencing data. We extended the model to allow each tumor sample s to have its own mixture weights θ (s) over a common pool of K possible cell types defined by X and performed MCMC inference over the joint distribution p(θ (1) , . . . , θ (13) , X|y (1) , . . . , y (13) ).
We obtained the breast cancer data from Zare et al. (2014). This dataset consists of 17 confirmed somatic variants sequenced in 12 tumor samples and 1 normal tissue sample obtained from a single breast cancer patient (10 samples from 3 primary sections and 2 samples from a metastasis). As in Zare et al. (2014), we modified the model and inference algorithm to account for the multiple datasets by assuming that the data for each tumor sample are generated from a common mutation matrix X but each tumor sample s ∈ {1, . . . , 12} is associated with its own weight parameters θ s . Figure 6 shows heatmaps of the model fit residuals calculated as Residual i j = Observed VAFr i j /d i j − Posterior mean of φ i j for the ith mutation of the jth sample for each of the samplingbased algorithms and compared to the six-component model given in Zare et al. (2014). In all instances, the Bayesian model fits are superior to the point estimates given by the EM algorithm of Zare et al. (2014), which shows some extreme discrepancies and there is little difference between the block Gibbs and Hamming ball samplers. Figure 7 shows the posterior distribution of the largest two values of θ for each tumor sample given by each of the samplers. The posterior distributions given by the Hamming ball sampler tend to exhibit greater variance and captures multiple modes demonstrating that it is exploring the posterior model space better than the block Gibbs sampler. In fact, the posterior distribution given by the block Gibbs sampler tends to be highly concentrated in relatively small areas of the posterior space but this area of the space cannot correspond to the only values of (θ, X) consistent with the data as the Hamming ball sampler is able to explore a greater range of parameter values that give residual fits as good as the block Gibbs sampler (see Figure 6). In combination, the two results show that there are indeed a large range of possible model configurations that can give rise to the same observed data and that the Hamming ball sampler better captures the true statistical uncertainty than the point estimates of Zare et al. (2014) or a standard block Gibbs sampler. Note that in this case a Hamming distance m = 4 gives approximately the same result as m = 10 (the exact marginal sampling approach) but with a substantial computational saving. Overall, we demonstrate two important points. First, misleading inferences can be made about the latent subclonal architecture of a tumor not only through poor data quality or model misspecification but also an inadequate inference procedure. In this case, the use of conditional updates in a standard block Gibbs sampling approach is inappropriate due to correlations between θ and X. Second, the statistically most efficient marginal sampling scheme is prohibitively expensive to run for large problems. Our Hamming ball sampling strategies provides access to a continuum of intermediate sampling schemes that lie between the extremes offered by the block Gibbs and marginal approaches. This gives control to the analyst to determine a suitable inferential procedure that is appropriate for their resources and time frame.

Related Methods
The Hamming ball sampler provides a generic sampling scheme for statistical models involving high-dimensional discrete latent state-spaces that generalizes and extends conventional block Gibbs sampling approaches. It can be considered as a type of slice sampling that is suitable for discrete combinatorial state spaces. It differs from regular slice samplers that are based on uniform sampling in the subgraph of the probability distribution (see Robert and Casella (2005) for a review), since now the model slicing is constructed directly around a local neighborhood set centered on an auxiliary variable that is resampled based also on the same neighborhood set principle. Furthermore, the slicing idea in Hamming ball sampling leads to a computationally tractable algorithm for discrete state spaces while for continuous high-dimensional spaces it would be intractable. To see this, notice that if we were to slice based on the set {U : ||U − X|| ≤ m} where U, X ∈ R D and || · || denotes a continuous norm, the step of sampling X would require simulating from a truncated high-dimensional continuous distribution, which is typically infeasible.
The neighborhood approach used by the Hamming ball sampler resembles certain types of purely M-H schemes, particularly the Metropolized Shotgun Stochastic Search (M-SSS) procedure (see sec. 4.1 in Hans, Dobra, and West 2007) that has been applied in sparse linear regression. More precisely, M-SSS can be viewed as using the restricted space defined by the Hamming Ball set H m (X) (where radius was set to m = 1 in Hans, Dobra, and West 2007) to construct directly a proposal distribution in a accept/reject M-H step to sample X without the use of the intermediate auxiliary variable U. Specifically, if X (t ) is the value of the current state we can define a proposal Q(X |X (t ) ) by slicing the exact model probability distribution around X (t ) according to where Z m (X (t ) ) = X ∈H m (X (t ) ) p(y, X , θ ). Sampling then proceeds by proposing a certain X from the above instrumental distribution, which is then accepted (i.e., the next state is set to otherwise X is rejected and X (t+1) = X (t ) . Notice that Z m (X (t ) ) is the probability mass or volume of the sliced part of the model around X (t ) , which is the state we are starting from, while similarly Z m (X ) is the volume around the state we attempt to move into. Thus, to satisfy detailed balance the above M-H probability needs to take into consideration the probability volume around the current state and the volume around the proposed state.
The above scheme has two main differences with the Hamming ball sampler. First, controlling the acceptance rate in the above scheme, when sampling X, is difficult and sampling might exhibit unpredictably low acceptance rate so that the chain can get stuck to the same state for many iterations. In contrast, the Hamming ball sampler has exactly the same computational cost with the above pure M-H algorithm, but through the specific auxiliary variable construction we can derive a Gibbs sampler that always accepts any proposed X (t+1) .
A second important difference is that the amount of possible exploration (per iteration) based on the above purely M-H scheme is half the amount of exploration of the Hamming ball sampler for the same value of the radius m and the same computational cost. This is because now the next state X (t+1) can differ from the current X (t ) in at most mP elements. In contrast, in the Hamming ball sampler the next (always accepted) X (t+1) can differ from X (t ) by up to 2mP elements.

Extensions of the Hamming Ball Sampler
A generalization of the algorithm is obtained by allowing blockwise random maximal Hamming distances.
More precisely, if we assume a varying radius for the individual Hamming balls, then the conditional distribution over U is now uniform on the generalized Hamming ball H m (X) denotes the set of maximal distances for each subset of variables. Furthermore, we can allow m, at each iteration, to be randomly drawn from a distribution p(m), which leads us to the following generalization of the algorithm: (θ, X) ← p(θ, X|y, U, m), where again the second step can be implemented either by applying two conditional Gibbs steps or a joint M-H step. This scheme remains valid since essentially it is Gibbs sampling in an augmented probability model where we added the auxiliary variables (U, m). Furthermore, via the randomness over m we can again interpret standard block Gibbs Samplers as special cases. When p(m) is such that at each iteration it randomly picks a single index i, sets m i = length(x i ) and for the rest j = i sets m j = 0, the algorithm reduces to a block Gibbs sampler. Since, in such a case, x i would be freely sampled from its conditional p(x i |X −i , θ, y) while the remaining blocks X −i = {x j : j = i} would be frozen to their current values because the maximal distances of their individual Hamming balls are zero.
In practice, using a varying or random m could lead to more efficient variations of the Hamming ball sampler where, for instance, the vector m could be automatically tuned during sampling to focus computational effort in regions of the sequence where there is most uncertainty in the underlying latent structure of X.
A further extension can be obtained by using a more complex form for the auxiliary conditional distribution p(U|X). For instance, a simple generalization is to assume an auxiliary distribution of the form where the parameter λ ≥ 0 controls the variance of the distribution, When λ = 0, the above reduces to the uniform distribution while as λ increases p(U|X) places more and more probability mass toward the center X in a spherically symmetric way. Given that all vector x i 's have size K, the overall normalizing constant can be written as Z m = M P λ where M λ = m j=0 e −λ j (S − 1) j K j , which is a weighted volume of each block-specific Hamming ball that still remains independent from x i . Notice that in the above scheme λ is a parameter that the user could tune to optimize sampling performance.
Other more complex forms of the auxiliary distribution could be possible (for instance, this distribution could depend on the data y). However, from practical point of view, it is critical that this distribution factorizes across the blocks, similarly to (31), and it has an analytically computed normalizing constant so that exact simulation is straightforward. In our simulations, we experiment with the simplest (and perhaps the most practical) Hamming ball schemes where the Hamming radius is fixed and the auxiliary distribution is uniform and leave the above more elaborate cases for future work.

Conclusions
In our investigations, we have applied the Hamming ball sampling scheme to three different statistical models and shown benefits over standard Gibbs samplers. Importantly, the Hamming ball sampler gives the statistical investigator control over the balance between statistical efficiency and computational tractability through an intuitive mechanism-the specification of the Hamming Ball radius and the block design strategywhich is important for Big Data applications where the volume of data precludes exact analysis. Yet, we have also demonstrated that in many problems, basic Hamming ball samplers (m = 1 or m = 2) that are computationally inexpensive can still give relatively good performance compared to standard block Gibbs sampling alternatives.
Throughout we have provided pure and unrefined Hamming ball sampler implementations. In actual applications, the proposed methodology should not be seen as a single universal method for speeding up MCMC but a novel addition to the toolbox that is currently available to us. For example, the computations performed within each Hamming ball update are often trivially parallelizable, which would allow the user to take advantage of any special hardware for parallel computations, such as graphics processing units (Lee et al. 2010;Suchard et al. 2010). In addition, Hamming ball sampling updates can also be used alongside standard Gibbs sampling updates as well as within parallel tempering schemes in Evolutionary Monte Carlo algorithms (Brooks et al. 2011).
Finally, we believe the ideas presented here can have applications in many areas not yet explored, such as Bayesian nonparametrics (e.g., in the Indian Buffet Process; Griffiths and Ghahramani 2005) and Markov random fields. Further investigations are being conducted to develop the methodology for these statistical models.

Supplementary Materials
Supplementary information details blocking strategies and applications to factorial hidden markov models.