Posterior Contraction Rate of Sparse Latent Feature Models with Application to Proteomics

The Indian buffet process (IBP) and the phylogenetic Indian buffet process (pIBP) can be used as prior models to infer latent features in a data set. The theoretical properties of these models are under-explored, however, especially in high-dimensional settings. In this paper, we show that under a mild sparsity condition, the posterior distribution of the latent feature matrix, generated via IBP or pIBP priors, converges to the true latent feature matrix asymptotically. We derive the posterior convergence rate, referred to as the contraction rate. We show that the convergence holds even when the dimensionality of the latent feature matrix increases with the sample size, thereby making the posterior inference valid in the high-dimensional setting. We demonstrate the theoretical results using computer simulation, in which a parallel-tempering Markov chain Monte Carlo method is applied to overcome computational hurdles. The practical utility of the derived properties is demonstrated by inferring the latent features in a reverse phase protein arrays (RPPA) dataset under the IBP prior model. The software and dataset reported in the manuscript are provided at http://www.compgenome.org/IBP.

Introduction

In many applications, the goal is to explain the variability of the observed data with a latent binary feature matrix Z_{p×K}, where each column of Z represents a latent feature that includes a subset of the p objects. The number of latent features K is unknown and is inferred as well.
Bayesian nonparametric latent feature models such as the Indian buffet process (IBP) (Griffiths and Ghahramani, 2006, 2011) can be used to define the prior distribution of a binary latent feature matrix with arbitrarily many columns. In many applications (such as Chu et al., 2006) these priors could lead to desirable posterior inference. An important property of IBP is that the corresponding distribution maintains exchangeability across the rows that index the experimental units, making posterior inference relatively simple and easy to implement. However, sometimes the rows of the latent feature matrix must follow a group structure, such as in phylogenetic inference.
To address such needs, the phylogenetic Indian buffet process (pIBP) (Miller et al., 2008) has been developed to allow different rows to be partially exchangeable.
Despite the increasing popularity of IBP and pIBP prior models in applications such as cancer and evolutionary genomics, few theoretical results have been established for posterior inference based on these models. For example, from a frequentist view, it is important to investigate the asymptotic convergence of the posterior distribution of the latent feature matrix under IBP and pIBP priors. Existing literature on the theory of Bayesian posterior consistency includes, for example, Schwartz (1965), Barron et al. (1999) and Ghosal et al. (2000). Chen et al. (2016) is a motivating work exploring theoretical properties of the posterior distribution of the latent feature matrix based on IBP or pIBP priors. They explored the asymptotic behavior of IBP- or pIBP-based posterior inference, where the sample size n increases at a much faster rate than the number of objects p_n, i.e., the dimensionality of Z. This might be hard to achieve in some real applications. We consider important extensions of Chen et al. (2016). In particular, we consider properties of posterior inference based on IBP and pIBP priors in high dimensions and with sparsity. Under a similar high-dimensional and sparse setting, a related work is Pati et al. (2014), where the authors studied the asymptotic behavior of the sparse Bayesian factor models discussed in West (2003). These models concern continuous latent features, which differ from binary feature models such as IBP and pIBP.
High-dimensional inference is now routinely needed in many applications, such as genomics and proteomics. Due to the reduced cost of high-throughput biological experiments (e.g., next-generation sequencing), a large number of genomic elements (such as genes) can be measured in a relatively short amount of time and at low cost for a large number of patients. In our application, the number of genomic elements n is the sample size, the number of patients p is the number of rows, or the number of objects, in the latent feature matrix, and the number of latent features K is assumed unknown. Depending on the particular research question, in some applications genes can be the objects and patients can be the samples. When p = p_n becomes large relative to n, the sparsity of the feature matrix critically ensures the efficiency and validity of statistical inference.
We will show that under the sparsity condition, the requirement for posterior convergence can be relaxed from p_n^3 = o(n) (Chen et al., 2016) to p_n (log p_n)^2 = o(n) (see Remark 2). The proposed sparsity condition can be reasonably interpreted and checked in practice. For example, in genomics and proteomics applications, our sparsity condition means that the number of features shared by different patients is small, i.e., the patients are heterogeneous. This is different from some published sparsity conditions that involve more complicated mathematical expressions, possibly in terms of the properties of complex matrices, which are difficult to check in real-world applications.
The rest of the paper is organized as follows. Section 2 introduces the latent feature model and the IBP/pIBP priors. Section 3 establishes the posterior contraction rate of sparse latent feature models under IBP and pIBP priors, which is the main theoretical result of this paper. Section 4 proposes an efficient posterior inference scheme based on Markov chain Monte Carlo (MCMC) simulations. Section 5 provides both simulated and real-world proteomics examples that support the theoretical derivations. We conclude the paper with a brief discussion in Section 6. Some technical details are provided in the supplement.

Notation and Probability Framework
In this section, we first introduce some notation, and then specify the hierarchical model including the sampling model and the prior model. In particular, the sampling model is the latent feature model, and the prior model is the IBP mixture or pIBP mixture.

Notation
Throughout the paper, we denote by p(·) and P(·) probability density functions (pdf) and probability mass functions (pmf), respectively. Specifically for the latent feature matrix Z, we use Π(Z) and Π(Z | X) to denote the prior and posterior distributions of Z, respectively. The likelihood of the data given Z is denoted by p(X | Z). For two sequences a_n and b_n, the notation a_n = O(b_n) means there exist a positive real number C and a constant n_0 such that a_n ≤ C b_n for all n ≥ n_0; the notation a_n = o(b_n) means that for every positive real number ε there exists a constant n_0 such that a_n ≤ ε b_n for all n ≥ n_0. For a matrix A, ‖A‖ denotes the spectral norm, defined as the largest singular value of A. Finally, C is a generic notation for positive constants whose value might change depending on the context but is independent of other quantities.

Latent Feature Model
Suppose that X_{n×p} is the collection of observed data. Each row x_i = (x_{i1}, · · · , x_{ip}) represents a single observation of p objects, for i = 1, · · · , n, where the x_i's are independent. Assume that the mechanism generating X can be characterized by latent features through the model

X^T = Z A + E.    (1)

Here Z = (z_{jk})_{p×K} denotes the latent binary feature matrix; each entry z_{jk} ∈ {0, 1} indicates that object j possesses feature k (z_{jk} = 1) or not (z_{jk} = 0). The loading matrix is A = (a_{ki})_{K×n}, with each entry being the contribution of the k-th feature to the i-th observation. We assume a_{ki} i.i.d. ∼ N(0, σ_a^2). The error matrix is E = (e_{ji})_{p×n}, where the e_{ji}'s are independent Gaussian errors, e_{ji} ∼ N(0, σ^2). After integrating out A, we obtain, for each observation of the p objects,

x_i^T | Z i.i.d. ∼ N_p(0, σ_a^2 Z Z^T + σ^2 I_p), i = 1, · · · , n,    (2)

where N_p is a p-variate Gaussian distribution. Therefore, the conditional distribution of X given Z is p(X | Z) = ∏_{i=1}^n p(x_i | Z). Without loss of generality, we always assume that σ^2 = σ_a^2 = 1. One of the primary interests is to conduct appropriate estimation of Z Z^T, which is usually called the similarity matrix since each entry of Z Z^T is the number of features shared by two objects.
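To make the sampling model concrete, here is a minimal Python sketch (with hypothetical dimensions, and taking σ^2 = σ_a^2 = 1 as above) that simulates data from Equation (1) and checks that the rows of X have the marginal covariance Z Z^T + I_p of Equation (2).

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, n = 50, 10, 5000                     # hypothetical dimensions for illustration

# A binary feature matrix Z (p objects by K features).
Z = (rng.random((p, K)) < 0.1).astype(float)

# Latent feature model: X^T = Z A + E with a_ki, e_ji ~ N(0, 1).
A = rng.standard_normal((K, n))            # loading matrix, K x n
E = rng.standard_normal((p, n))            # Gaussian errors, p x n
X = (Z @ A + E).T                          # observed data, n x p

# Marginally, each row x_i ~ N_p(0, Z Z^T + I_p); compare sample and model covariance.
sample_cov = X.T @ X / n
model_cov = Z @ Z.T + np.eye(p)
print(np.linalg.norm(sample_cov - model_cov, 2))   # small for large n (spectral norm)
```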

Prior Distributions Based on IBP and pIBP
In the latent feature model (Equation 1), it remains to specify the prior for the binary feature matrix Z. IBP and pIBP are popular prior choices for binary matrices with an unbounded number of columns. IBP assumes exchangeability among the objects, while pIBP introduces dependency among the entries of the k-th column of Z through a rooted tree T. See Figure 1 for an example of the tree. IBP is a special case of pIBP in which the root node is the only internal node of the tree. The construction and the pmf of IBP are described and derived in detail in Griffiths and Ghahramani (2011). For pIBP, only a brief definition is given in Miller et al. (2008). For the proof of the main theoretical result of this paper, we propose a construction of pIBP in a similar way as IBP and derive the pmf of pIBP. We first introduce some notation. Let Z_{p×K} denote the collection of binary matrices with p rows and K (K ∈ N_+) columns such that none of the columns consists of all 0's. Let Z_p = ∪_{K=1}^∞ Z_{p×K}, and let 0 denote the p-dimensional vector whose elements are all 0's, i.e., 0 = (0, 0, · · · , 0)^T.
In the following sections, we also regard 0 as a p × 1 matrix when needed. Both IBP and pIBP are defined over Z_p^0 = {0} ∪ Z_p. It can be shown that, with probability 1, a draw from IBP or pIBP has only finitely many columns. For the construction of IBP and pIBP, we introduce more notation as follows. Denote by Z̄_{p×K̄} the collection of all binary matrices with p rows and K̄ columns (where the columns can be all zeros). We define a many-to-one mapping G(·) : Z̄_{p×K̄} → Z_p^0. For a binary matrix Z̄ ∈ Z̄_{p×K̄}, if all columns of Z̄ are 0's, then G(Z̄) = 0; otherwise, G(Z̄) is obtained by deleting all zero columns of Z̄. For the purpose of performing inference on the similarity matrix Z Z^T, it suffices to focus on the set of equivalence classes induced by G(·). Two matrices Z̄_1, Z̄_2 ∈ Z̄_{p×K̄} are G-equivalent if G(Z̄_1) = G(Z̄_2), and in this case the similarity matrices induced by Z̄_1 and Z̄_2 are the same, Z̄_1 Z̄_1^T = Z̄_2 Z̄_2^T. We now turn to a constructive definition of pIBP. The pIBP can be constructed in the following three steps by taking the limit of a finite feature model.
Step 1. First, fix K̄ and generate π_k | α i.i.d. ∼ Beta(α/K̄, 1) for k = 1, · · · , K̄. The columns of Z̄ are conditionally independent given π and T, where P(z̄_k | π_k, T) is determined as in Miller et al. (2008); when the root node is the only internal node of the tree, pIBP reduces to IBP. The marginal probability of Z̄ is

P_K̄(Z̄ | α, T) = ∏_{k=1}^{K̄} ∫_0^1 P(z̄_k | π_k, T) (α/K̄) π_k^{α/K̄ − 1} dπ_k.

Step 2. Next, for any Z ∈ Z_p^0 with K columns, we define a probability distribution (for K̄ ≥ K) Π_K̄(Z | α, T) = Σ_{Z̄ ∈ Z̄_{p×K̄}: G(Z̄) = Z} P_K̄(Z̄ | α, T), with P_K̄(Z̄ | α, T) defined in Step 1. That is, we collapse all binary matrices in Z̄_{p×K̄} that are G-equivalent.
Step 3. Finally, for any Z ∈ Z_p^0, define Π(Z | α, T) = lim_{K̄→∞} Π_K̄(Z | α, T). Here Π(Z | α, T) is the pmf of pIBP under G-equivalence classes.
Based on the three steps of constructing pIBP, we derive the pmf of pIBP given α and T .
Details of the derivation are given in Supplementary Section S1. Let S(T) denote the total edge length of the tree structure (see Figure 1) and ψ(·) denote the digamma function. For Z ∈ Z_p^0 with K columns, we have

Π(Z | α, T) = (α^K / K!) exp{−α [ψ(S(T) + 1) − ψ(1)]} ∏_{k=1}^K λ_k,

where λ_k ≡ λ(z_k, T) = ∫_0^1 P(z_k | π_k, T) π_k^{−1} dπ_k (see Supplementary Section S1) and z_k is the k-th column of Z.
Assume α ∼ Gamma(1, 1). After integrating out α, we obtain the pmf of the pIBP mixture, Π(Z | T) = ∏_{k=1}^K λ_k · [1 + ψ(S(T) + 1) − ψ(1)]^{−(K+1)}. For notational simplicity, we suppress the conditioning on T hereafter when we discuss pIBP.
When S(T) = p and λ_k = ∫_0^1 π_k^{m_k − 1} (1 − π_k)^{p − m_k} dπ_k, pIBP reduces to IBP, where m_k = Σ_{j=1}^p z_{jk} denotes the number of objects possessing feature k. The pmf of the IBP mixture then follows by substituting these quantities into the pIBP mixture pmf; in this case λ_k = B(m_k, p − m_k + 1) = (m_k − 1)! (p − m_k)! / p!.

Figure 1: An example of a tree structure T, which is a directed graph with random variables at the nodes (marked as circles). Entries of the k-th column of Z, the z_jk's, are at the leaves. The lengths of all edges of T, the t_i's and η_l's, are marked on the figure. In particular, the η_l's represent the lengths between each leaf (z_jk, in black) and its parent node (ζ_l, in red). The total edge length S(T) is the summation of the lengths of all edges of T. In this example, S(T) = Σ_{1≤i≤6} t_i + (2η_1 + η_2 + 3η_3 + η_4 + 4η_5 + η_6). The condition in case (2) of Lemma 1 in Section 3 means inf_{1≤l≤6} η_l ≥ η_0 for some η_0 > 0.
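In the IBP special case, λ_k is thus a Beta function. The following minimal check (with hypothetical values of p and m_k) verifies the identity numerically.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta

p, m_k = 50, 4                                        # hypothetical values
numeric, _ = quad(lambda u: u**(m_k - 1) * (1 - u)**(p - m_k), 0.0, 1.0)
closed_form = beta(m_k, p - m_k + 1)                  # B(m_k, p - m_k + 1) = (m_k-1)!(p-m_k)!/p!
print(np.isclose(numeric, closed_form))               # True
```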

Posterior Contraction Rate under the Sparsity Condition
In this section, we establish the posterior contraction rate of IBP mixture and pIBP mixture under a sparsity condition. All the proofs are given in the supplement.
The sparsity condition (Definition 1) is stated for a sequence of binary matrices {Z*_n, n = 1, 2, . . .}: for every feature k, the number of objects possessing that feature, m_kn = Σ_{j=1}^{p_n} z*_jkn, is bounded by s_n, where s_n is small relative to p_n.
The condition indicates that as the sample size increases, the number of objects possessing any feature is upper bounded and must be relatively small compared to the total number of objects.
Such an assumption may be assessed via simulation studies and then applied to real-world applications. Examples will be provided later on. Similar but stricter assumptions are made in Castillo et al. (2012) and Pati et al. (2014), under different contexts.
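As a simple illustration (not necessarily the exact generator used in Section 5), the following sketch produces a binary matrix whose column sums satisfy m_k ≤ s, the sparsity requirement above; the sizes are hypothetical.

```python
import numpy as np

def generate_sparse_Z(p, K, s, rng=None):
    """Generate a p x K binary matrix with at most s ones per column (m_k <= s)."""
    rng = np.random.default_rng(rng)
    Z = np.zeros((p, K), dtype=int)
    for k in range(K):
        m_k = rng.integers(1, s + 1)                   # at least one object per feature
        rows = rng.choice(p, size=m_k, replace=False)  # which objects possess feature k
        Z[rows, k] = 1
    return Z

Z_star = generate_sparse_Z(p=50, K=10, s=10, rng=0)
assert Z_star.sum(axis=0).max() <= 10                  # the sparsity condition m_k <= s
```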
Next, we review the definition of posterior contraction rate.
Definition 2 (Posterior Contraction Rate). Let {Z*_n, n = 1, 2, . . .} represent a sequence of true latent feature matrices, where each Z*_n has p_n rows. For each n, the observations are generated as x_i^T i.i.d. ∼ N(0, Z*_n Z*_n^T + I_{p_n}) for i = 1, 2, · · · , n. Denote by Π(· | X) the posterior distribution of the latent feature matrix under the IBP mixture or pIBP mixture prior. If

E_{Z*_n} Π(‖Z_n Z_n^T − Z*_n Z*_n^T‖ ≥ C ε_n | X) → 0

as n → ∞, where ‖·‖ represents the spectral norm and C is a positive constant, then we say the posterior contraction rate of Z_n Z_n^T to the true Z*_n Z*_n^T is ε_n under the spectral norm.
For the proof of the main theorem, we derive the following lemma. The lemma establishes the lower bound of Π(Z_n = Z*_n) for a sequence of binary matrices {Z*_n, n = 1, 2, . . .} satisfying the sparsity condition, where Π(·) represents the pmf of the IBP mixture or pIBP mixture.
Lemma 1. Consider a sequence of binary matrices {Z*_n : Z*_n ∈ {0} ∪ Z_{p_n×K*_n}, n = 1, 2, . . .} that are sparse under Definition 1. The parameters m_kn and s_n are defined accordingly (for 0 matrices, let s_n = K*_n = 1). We have Π(Z_n = Z*_n) ≥ exp(−C s_n K*_n log(p_n + 1)) for some positive constant C, if either of the following two cases is true: (1) Z_n follows the IBP mixture; (2) Z_n follows the pIBP mixture, and the minimal length between each leaf and its parent node is lower bounded by η_0 > 0 (see Figure 1).
To consider a very extreme case, when s_n = log(p_n + 1), the lower bound exp(−C K*_n (log(p_n + 1))^2) is much larger than exp(−C K*_n p_n).
We present the main theorem of this paper, which shows that for a sequence of true latent feature matrices Z*_n satisfying the sparsity condition in Definition 1, the posterior distribution of the similarity matrix Z_n Z_n^T converges to Z*_n Z*_n^T. The theorem eventually leads to the main theoretical result in Remark 2 later.

Theorem 1. Consider a sequence of binary matrices {Z*_n, n = 1, 2, . . .} that are sparse under Definition 1, and for each n let the observations be generated as x_i^T i.i.d. ∼ N(0, Z*_n Z*_n^T + I_{p_n}) for i = 1, 2, · · · , n. Let

ε_n = [max{√p_n, √(s_n K*_n log(p_n + 1))} / √n] · max{1, ‖Z*_n Z*_n^T‖}.

If ε_n → 0 as n → ∞, then

E_{Z*_n} Π(‖Z_n Z_n^T − Z*_n Z*_n^T‖ ≥ C ε_n | X) → 0

as n → ∞ for some positive constant C. In other words, ε_n is the posterior contraction rate under the spectral norm.
Corollary 1. For the sequence of sparse binary matrices {Z*_n, n = 1, 2, . . .} considered in Theorem 1, suppose in addition that the number of features possessed by each object (the non-zero entries of each row of Z*_n) is upper bounded by q_n. Given the same assumptions as in Theorem 1, except for replacing ε_n → 0 by ε̃_n = [max{√p_n, √(s_n K*_n log(p_n + 1))} / √n] · s_n q_n → 0 as n → ∞, we have that ε̃_n is a valid posterior contraction rate.

Remark 2. Consider the second case of Corollary 1. If (1) s_n K*_n log(p_n + 1) = O(p_n) and (2) s_n q_n = O(log p_n), then ε̃_n → 0 whenever p_n (log p_n)^2 = o(n), and ε̃_n is a valid posterior contraction rate. In other words, to ensure posterior convergence, we only need n to increase a little bit faster than p_n, given the assumptions (1), (2) and the condition in the second case of Corollary 1.

Posterior Inference Based on MCMC
We have specified the hierarchical model, including the sampling model p(X | Z) and the prior models for the parameters, Π(Z | α) and p(α). In particular, p(X | Z) is the latent feature model (Equation 2), Π(Z | α) is the IBP or pIBP prior (Equation 3), and p(α) = Gamma(1, 1). For the theoretical results in Section 3, we integrate out α. For posterior inference, we keep α so that the conditional distributions can be obtained in closed form.
We use Markov chain Monte Carlo simulations to generate samples from the posterior, (Z, α) ∼ Π(Z, α | X). After sufficiently many iterations, the samples of Z drawn from the Markov chain approximately follow Π(Z | X). Gibbs sampling transition probabilities can be used to update Z and α, as described in Griffiths and Ghahramani (2011) and Miller et al. (2008).
To overcome trapping of the Markov chain in local modes in the high-dimensional setting, we use parallel tempering Markov chain Monte Carlo (PTMCMC) (Geyer, 1991), in which several Markov chains at different temperatures run in parallel and interchange their states. In particular, the target distribution of the Markov chain indexed by temperature T is Π_T(Z, α | X) ∝ p(X | Z)^{1/T} Π(Z | α) p(α), i.e., the likelihood is tempered. Parallel tempering helps the original Markov chain (the Markov chain whose temperature is T = 1) avoid getting stuck in local modes and approximate the target distribution efficiently.
We give an algorithm below for sampling from Π(Z, α | X) where Π(Z | α) follows IBP. The algorithm describes in detail how PTMCMC can be combined with the Gibbs sampler in Griffiths and Ghahramani (2011). The algorithm iterates Step 1 and Step 2 in turn.
Step 1 (Updating Z and α). We update Z by row. For each row j, we iterate through the columns k = 1, . . . , K. We first decide whether to drop the k-th column of Z, z_k: the column is dropped if and only if m_{(−j),k} = Σ_{j'≠j} z_{j'k} = 0. In other words, if feature k is not possessed by any object other than j, then the k-th column of Z is dropped, regardless of whether z_jk = 0 or 1.
If the k-th column is not dropped, we sample z_jk from its tempered conditional posterior P^(T)(z_jk | X, z_{−(jk)}), where z_{−(jk)} denotes the k-th column of Z excluding z_jk. After updating all entries in the j-th row, we add a random number K^+_j of new columns (features) to Z. The K^+_j new features are possessed only by object j, i.e., only the j-th entry is 1 while all other entries are 0. Let Z^+ denote the feature matrix after the K^+_j columns are added to the old feature matrix. The conditional posterior distribution of K^+_j is not available in closed form, so we work with an approximation by truncating P^(T)(K^+_j | · · ·) at a level K^+_max, similar to the idea of truncating a stick-breaking prior in Ishwaran and James (2001). The value K^+_max is the maximum number of new columns (features) that can be added to Z each time we update the j-th row. Denote by P̃^(T)(K^+_j | · · ·) the truncated conditional posterior (Equation 4). Lastly, we update α. Given Z, the observed data X and α are conditionally independent, which implies that the conditional posterior distribution of α at any temperature T is the same. We sample α from its conditional posterior, α | Z ∼ Gamma(1 + K, 1 + H_p), where K is the current number of features and H_p = Σ_{j=1}^p 1/j.
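The following sketch illustrates the collapsed update of a single entry z_jk at temperature T (the new-feature and α updates are omitted). It assumes, consistent with the α update above, that only the likelihood p(X | Z) is tempered, and it uses the standard IBP prior odds m_{(−j),k}/p from Griffiths and Ghahramani (2011); variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_lik(X, Z):
    """Collapsed log-likelihood log p(X | Z): rows of X are iid N_p(0, Z Z^T + I_p)."""
    p = X.shape[1]
    cov = Z @ Z.T + np.eye(p)
    return multivariate_normal.logpdf(X, mean=np.zeros(p), cov=cov).sum()

def update_zjk(X, Z, j, k, T=1.0, rng=None):
    """Tempered Gibbs update of z_jk (assumes the column has m_{(-j),k} > 0)."""
    rng = np.random.default_rng(rng)
    p = Z.shape[0]
    m_minus = Z[:, k].sum() - Z[j, k]                       # m_{(-j),k}
    log_weights = np.empty(2)
    for value in (0, 1):
        Z[j, k] = value
        log_prior = np.log(m_minus / p if value == 1 else 1.0 - m_minus / p)
        log_weights[value] = log_prior + log_lik(X, Z) / T  # only the likelihood is tempered
    prob_one = 1.0 / (1.0 + np.exp(log_weights[0] - log_weights[1]))
    Z[j, k] = int(rng.random() < prob_one)
    return Z
```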
Step 2 (Interchanging States across Parallel Chains). We sort the Markov chains in descending order by their temperatures T. The next step of PTMCMC is interchanging states between adjacent Markov chains. Let (Z^(T), α^(T)) denote the state of the Markov chain indexed by temperature T. Suppose we run N parallel chains with descending temperatures; for each pair of adjacent chains, a swap of their states is proposed and accepted with the standard Metropolis probability.
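Complementing Step 2, here is a sketch of the swap move between adjacent chains, again under the assumption that only the likelihood is tempered (so the untempered prior factors cancel in the Metropolis ratio); loglik[i] stands for log p(X | Z) evaluated at the current state of chain i, and the names are illustrative.

```python
import numpy as np

def swap_adjacent_chains(states, loglik, temps, rng=None):
    """Propose swapping the states of each pair of adjacent tempered chains."""
    rng = np.random.default_rng(rng)
    for i in range(len(states) - 1):
        # log acceptance ratio: (l_{i+1} - l_i) * (1/T_i - 1/T_{i+1})
        log_ratio = (loglik[i + 1] - loglik[i]) * (1.0 / temps[i] - 1.0 / temps[i + 1])
        if np.log(rng.random()) < min(0.0, log_ratio):
            states[i], states[i + 1] = states[i + 1], states[i]
            loglik[i], loglik[i + 1] = loglik[i + 1], loglik[i]
    return states, loglik
```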

Simulation Studies
We conduct simulations to examine the convergence of the posterior distribution of Z_n Z_n^T (under the IBP mixture prior) to the true similarity matrix Z*_n Z*_n^T. We consider several true similarity matrices with different sample sizes n and explore the contraction of the posterior distribution.
Hereinafter, we suppress the index n.
Simulations under the Sparsity Condition of Z*

In this simulation, the true latent feature matrix Z* = (z*_jk)_{p×K*} is randomly generated under the sparsity condition in Definition 1 (i.e., m_k ≤ s for k = 1, . . . , K*, where s is relatively small compared with p). We set s = 10, K* = 10, and p = 50, 100, or 150. We use s/p to measure the sparsity (as in Definition 1).
Once Z* is generated, the samples x_1, x_2, · · · , x_n are generated under the sampling model (Equation 2), where the x_i's are rows of X. For each value of p, we conduct 4 simulations, with sample sizes n = 20, 50, 80, or 100.
Given simulated data X, we use the PTMCMC algorithm introduced in Section 4 to sample from the posterior distribution Π(Z | X), setting K^+_max in (4) to 10. We set the number of parallel Markov chains to N = 11 with geometrically spaced temperatures; namely, the ratio between adjacent temperatures is β = T_{i+1}/T_i = 1.2, with T_1 = 1. We run 1,000 MCMC iterations.
The chains converge quickly and mix well based on basic MCMC diagnostics.
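For reference, the geometric temperature ladder described above (T_1 = 1, adjacent ratio β = 1.2, N = 11 chains) can be generated in one line:

```python
import numpy as np

N, beta = 11, 1.2                      # number of chains and temperature ratio
temps = beta ** np.arange(N)           # T_1 = 1, T_{i+1} / T_i = beta
```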
We repeat the simulation scheme 40 times, each time generating a new data set with a new random seed and applying the PTMCMC algorithm for inference. We report two summaries, K and ‖Z Z^T − Z* Z*^T‖, in Table 1. For fixed p, Z Z^T converges to Z* Z*^T as the sample size increases. When n = 100, ‖Z Z^T − Z* Z*^T‖ is very close to 0 and K is close to the truth K* = 10. On the other hand, for fixed n, increasing p does not make the posterior distribution of Z Z^T significantly less concentrated at the true Z* Z*^T, implying that the inference is robust to a high-dimensional Z as long as the true matrix Z* is sparse. This verifies the theoretical results reported earlier.

Sensitivity of Sparsity
In this part, we use simulation results to demonstrate the effect of the sparsity of Z* on the posterior convergence. We set p = 50, K* = 10 and n = 100, and use different s = 10, 15, 18, 20, 23, 25, reducing the sparsity of Z* gradually. We generate Z* and X the same way as in the previous simulation. We set the number of parallel Markov chains to N = 11 and K^+_max = 10. To increase the frequency of interchange between adjacent chains, we reduce β to 1.15. Table 2 reports the simulation results. As Z* becomes less sparse, the posterior distribution of Z becomes less concentrated on Z*, in terms of both K and ‖Z Z^T − Z* Z*^T‖. Specifically, for s ∈ {15, 20, 25}, when s increases by 5, ‖Z Z^T − Z* Z*^T‖ inflates more than 10-fold.

Application to Proteomics
We apply the binary latent feature model to the analysis of a reverse phase protein arrays (RPPA) dataset from The Cancer Genome Atlas (TCGA, https://tcga-data.nci.nih.gov/tcga/), downloaded using TCGA Assembler 2 (Wei et al., 2017). The RPPA dataset records the levels of protein expression based on incubating a matrix of biological samples on a microarray with specific antibodies that target the corresponding proteins (Sheehan et al., 2005; Spurrier et al., 2008).
We focus on patients categorized into 5 different cancer types: breast cancer (BRCA), diffuse large B-cell lymphoma (DLBC), glioblastoma multiforme (GBM), clear cell kidney carcinoma (KIRC) and lung adenocarcinoma (LUAD). Data on n = 157 proteins are available. We randomly choose p = 100 patients for our analysis, with an equal number of 20 patients for each cancer type. Note that we consider proteins as experimental units and patients as objects in the data matrix, with the aim of allocating latent features to patients (not proteins). This will become clear later when we report the inference results. The rows of the data matrix X_{n×p} are standardized, with Σ_{j=1}^p x_ij = 0 for i = 1, . . . , n. Since the patients are from different cancer types, we expect the data to be highly heterogeneous and the number of common features that two patients share to be small, i.e., the feature matrix is sparse. We fit the binary latent feature model under the IBP prior Π(Z | α) with α ∼ Gamma(1, 1).
In addition, rather than fixing σ^2 = σ_a^2 = 1, we now assume inverse-Gamma priors on them, σ^2 ∼ IG(1, 1) and σ_a^2 ∼ IG(1, 1). We run 1,000 MCMC iterations, as in the simulation studies. We also repeat the MCMC algorithm 3 times with different random seeds and do not observe substantial differences across the runs. We report point estimates (Ẑ, α̂, σ̂^2, σ̂_a^2) according to the maximum a posteriori (MAP) estimate, (Ẑ, α̂, σ̂^2, σ̂_a^2) = arg max_{(Z, α, σ^2, σ_a^2)} p(Z, α, σ^2, σ_a^2 | X). For the most popular features, we perform a gene set enrichment analysis (GSEA) of the genes associated with the top-loading proteins (Table S1) against the KEGG pathway database (Kanehisa and Goto, 2000). The GSEA analysis reports the enriched pathways and the corresponding genes, which are listed in Table S2 in the supplement. Among the findings, LUAD seems to have two subgroups, possessing mostly feature 1 or 2, respectively. These correspond to cell death and cell cycle functions, which suggests that these two subgroups of cancer could be related to abnormal cell death and cell cycle regulation.
We also note that there are other informative features besides the five mentioned above.
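As a practical note on the point estimates above, the MAP draw can be selected by scanning the saved MCMC samples; the helper below is a minimal sketch, and log_joint (evaluating log p(Z, α, σ^2, σ_a^2 | X) up to a constant) is a hypothetical placeholder.

```python
import numpy as np

def select_map_draw(draws, log_joint):
    """Return the saved draw (Z, alpha, sigma2, sigma2_a) with the largest log joint density."""
    scores = [log_joint(*draw) for draw in draws]
    return draws[int(np.argmax(scores))]
```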

Conclusion and Discussion
Our main contributions in this paper are (1) establishing the posterior contraction rate of latent feature models under IBP and pIBP priors in a high-dimensional setting with sparsity, (2) providing a constructive definition of pIBP and deriving its pmf, and (3) developing a PTMCMC-based posterior inference scheme and demonstrating it on simulated data and an RPPA proteomics dataset. There are several directions along which we plan to investigate further. First, since the assumptions made on the true latent feature matrix Z*_n are quite mild, the posterior convergence in Theorem 1 only holds when p_n = o(n). It is of interest whether posterior convergence still holds when p_n increases faster, e.g., when p_n grows at the same rate as n or faster. As a trade-off, results with a faster-increasing p_n would likely require additional assumptions on Z*_n, such as Assumption 3.2 (A3) in Pati et al. (2014). It is also of interest to explore whether the contraction rate in Theorem 1 can be further improved with additional assumptions. This is closely related to the problem of minimax rate optimal estimators for Z*_n Z*_n^T, or more broadly, the covariance matrix of random samples, which has been partially addressed in Pati et al. (2014).
Another potential direction for further investigation is to extend the latent feature model (1) to a more general latent factor model, in which the binary matrix Z is replaced by a real-valued factor matrix G, with the binary matrix Z used to indicate the sparsity of G; see, e.g., Knowles and Ghahramani (2011). To prove posterior convergence for such a model, Lemma 1 would need to be modified based on the factor loading matrix, along the lines of Lemma 9.1 in Pati et al. (2014).
Throughout this paper we measure the difference between similarity matrices by the spectral norm. Other matrix norms, such as the Frobenius norm, may be explored. Our current results focus on the posterior convergence of Z_n Z_n^T rather than Z_n itself due to the identifiability issue of Z_n arising from (2). A future direction is to investigate to what extent Z*_n can be estimated, and a Hamming-distance-like measure between feature matrices can be considered.
Finally, we are working on general hierarchical models that embed sparsity into the model construction.

Supplementary Materials

S1 Derivation of the PMF of pIBP
In Step 1 of the construction of pIBP, recall that π_k | α i.i.d. ∼ Beta(α/K̄, 1) and that, given π and T, the columns of Z̄ are conditionally independent. Here P(z̄_k | π_k, T) is defined by the universal tree structure T for any k = 1, . . . , K̄ according to Miller et al. (2008).
Denote z̄_k = (z_{1k}, z_{2k}, · · · , z_{pk})^T. For each k, we treat the tree as a directed graph with random variables at the interior nodes and the z_jk's at the leaf nodes, j = 1, . . . , p. See Figure S1.
Given π_k, for any variable x at a parent node (including the root node) and any variable y at a child node (including a leaf node, in which case y = z_jk), if the edge between x and y has length t, then

P(y = 1 | x = 1) = 1 and P(y = 1 | x = 0) = 1 − e^{−γ_k t},    (S1)

where γ_k = − log(1 − π_k). The value of the random variable at the root node is always fixed at 0.
The joint distribution of all random variables on the tree (nodes and leaves) is uniquely defined by these conditional probabilities (see Figure S1), and so is P(z̄_k | π_k, T). Note that given π and T, the columns of Z̄ are conditionally independent, i.e., P_K̄(Z̄ | α, T) is column-exchangeable: if Z̄_1 is generated by permuting the columns of Z̄, then P_K̄(Z̄ | α, T) = P_K̄(Z̄_1 | α, T).
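For illustration, the following sketch simulates one column z̄_k of the pIBP given π_k on a small hypothetical tree, using the conditional probabilities (S1): a child of a 0-valued parent becomes 1 with probability 1 − e^{−γ_k t} = 1 − (1 − π_k)^t, and a child of a 1-valued parent is always 1.

```python
import numpy as np

def simulate_pibp_column(parent, edge_len, is_leaf, pi_k, rng=None):
    """Simulate one pIBP column on a tree given as parent pointers (a sketch).

    Nodes 0..M-1 are ordered so that parent[v] < v; parent[0] = -1 marks the root,
    which is fixed at 0.  Along an edge of length t, a child of a 0-valued parent
    becomes 1 with probability 1 - (1 - pi_k)**t; a child of a 1-valued parent is 1.
    """
    rng = np.random.default_rng(rng)
    value = np.zeros(len(parent), dtype=int)           # root stays 0
    for v in range(1, len(parent)):
        if value[parent[v]] == 1:
            value[v] = 1
        else:
            value[v] = int(rng.random() < 1.0 - (1.0 - pi_k) ** edge_len[v])
    return value[is_leaf]                               # leaf values give z_k

# Hypothetical tree: root (0), one internal node (1), three leaves (2, 3, 4).
parent = np.array([-1, 0, 1, 1, 0])
edge_len = np.array([0.0, 0.5, 0.5, 0.5, 1.0])          # root-to-leaf lengths all equal 1
is_leaf = np.array([False, False, True, True, True])
z_k = simulate_pibp_column(parent, edge_len, is_leaf, pi_k=0.3, rng=1)
```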

In Step 2, for Z ∈ Z_p^0 with K columns,

Π_K̄(Z | α, T) = (K̄ choose K) P_K̄(Z̄* | α, T),

where Z̄* is such that the first K columns of Z̄* are exactly the same as Z and the remaining K̄ − K columns are all 0. For notational simplicity we still denote Z̄* by Z̄. In Step 3, to obtain the pmf Π(Z | α, T), we take the limit lim_{K̄→∞} Π_K̄(Z | α, T). We consider two cases: (1) Z ∈ Z_p and (2) Z = 0.

Figure S1: An example of a tree structure T, which is a directed graph with random variables (the ξ's, ζ's and z's) at the nodes. In particular, entries of the k-th column of Z, the z_jk's, are at the leaves. The length of the edge between ζ_4 and ξ_1 is t_1, and similarly for t_2 and η_3. The total edge length from the root to any leaf is always equal to 1; for example, t_1 + t_2 + η_3 = 1. Thus S(T) ≤ p. The conditional probabilities (S1) imply that once the variable at a node becomes 1, all its child nodes (including leaf nodes) become 1. Solid circles represent that the variables at the corresponding positions are 1; hollow circles denote 0's. Bold segments are drawn for the edges connecting 1's. The leaves are divided into 6 subgroups according to their common parent nodes, the ζ_l's.
We first consider Z ∈ Z_p. Denote Z = (z_1, z_2, · · · , z_K). We consider P_{1,K̄}(Z | α, T) and P_{2,K̄}(Z | α, T) separately. For P_{2,K̄}(Z | α, T), we need to calculate P(0 | π_k, T). In the case where the leaf variables z̄_k are all 0, the random variables at all nodes of the tree must take value 0 (see Figure S1).

S2 Proof of Lemma 1
For the proof of Lemma 1, we first introduce some definitions and notation. We say two leaf nodes belong to the same subgroup if they share the same parent node. Let L denote the total number of subgroups. Denote by node l the common parent node of subgroup l, M_l the number of leaf nodes in subgroup l, and η_l the length between node l and any leaf node in subgroup l, l = 1, . . . , L. We have p = Σ_{l=1}^L M_l. Furthermore, denote by m_kl the number of leaf nodes in column k that belong to subgroup l and equal 1, i.e., m_kl = Σ_{j ∈ subgroup l} z_jk. We have Σ_{l=1}^L m_kl = m_k. See Figure S1 for an example.
Next, we give a preparatory lemma.

Proof of Lemma 2.
Let Ω denote the event that the variables at all interior nodes of the tree take value 0. For example, in Figure S1, all interior nodes (in blue and red) would take value 0 (hollow circles). The probability of Ω follows from the conditional probabilities (S1), which implies the statement of Lemma 2.

Proof of Lemma 1. For notational simplicity, we omit the index n in K*_n, p_n, s_n, m_kn, Z_n, Z*_n, etc., in the following proof. We use C, C_1, C_2, . . . , C_7 to represent positive constants. It suffices to prove the result for pIBP, as IBP is a special case of pIBP in which the length between each leaf node and its parent node (the root) is always 1.
We first consider Z* ∈ Z_{p×K*}, i.e., none of the columns of Z* consists of all 0's. Based on the assumption in case (2), η_l ≥ η_0 for all 1 ≤ l ≤ L and all n. Lemma 2 then implies the desired lower bound for some positive constant C_1, considering that S(T) ≤ p (see Figure S1 for an explanation).

S3 Proof of Theorem 1
For the proof of Theorem 1, we first introduce two preparatory lemmas from the literature.
Lemma 3 (Theorem 1 in Chen et al. 2016). Let Z be a collection of binary matrices that contains Z*. Consider a family of probability measures indexed by Z ∈ Z, i.e., {µ_Z : Z ∈ Z}. For any subset U of Z and any testing function φ based on X, the posterior probability of U can be bounded in terms of the prior mass and the error probabilities of φ, where E_Z(·) means taking expectation in the case where X ∼ µ_Z.
Lemma 4 (Remark 5.40.2 in Vershynin 2012). Suppose that p-dimensional random variables x_i i.i.d. ∼ N(0, Σ) for i = 1, 2, · · · , n, and let Σ̂ = Σ_{i=1}^n x_i^T x_i / n denote the sample covariance matrix. Then for every t ≥ 0, with probability at least 1 − 2 exp(−c t^2),

‖Σ̂ − Σ‖ ≤ max{δ, δ^2} ‖Σ‖, where δ = C √(p/n) + t/√n,

and C and c are positive constants.
Proof of Theorem 1. Let Σ_n = Z_n Z_n^T + I_{p_n} and Σ*_n = Z*_n Z*_n^T + I_{p_n} denote the model covariance matrix and the true covariance matrix, respectively. For any Z̄*_n ∈ Z̄_{p_n×K̄_n} with Z*_n = G(Z̄*_n) ∈ Z_{p_n}^0, we have (1) Z̄*_n Z̄*_n^T = Z*_n Z*_n^T, and (2) E_{Z̄*_n}(·) = E_{Z*_n}(·), considering the binary factor model in which the data distribution depends on Z only through Z Z^T. Based on the above two facts, it suffices to prove the result for Z*_n ∈ Z_{p_n}^0. In the following proof, C represents the positive constant in Theorem 1, and C', C_1, C_2, . . . , C_11 represent other positive constants.
Condition (S7) is due to ε_n → 0 as n → ∞; condition (S8) is due to √(p_n/n) ≤ √(p_n/n) max{1, ‖Z*_n Z*_n^T‖} ≤ ε_n → 0 as n → ∞; and condition (S9) is due to √(s_n K*_n log(p_n + 1)/n) ≤ √(s_n K*_n log(p_n + 1)/n) max{1, ‖Z*_n Z*_n^T‖} ≤ ε_n → 0 as n → ∞. Since C_2 ε_n ≥ ε_n, we can rewrite C ε_n ≥ C' ε_n for some positive constant C' ≤ C/C_2, and it suffices to establish the contraction statement with C' ε_n. For notational simplicity, we omit the index n in K*_n, p_n, s_n, m_kn, Z_n, Z*_n, Σ_n, Σ*_n, etc., in the following proof.

Lemma 3 implies that, to establish the posterior bound, it suffices to construct a testing function φ with suitably small error probabilities.
Define the testing function φ accordingly. We first show that one of the error terms in Lemma 3 vanishes, considering the two scenarios ‖Σ‖ ≥ 2‖Σ*‖ and ‖Σ‖ < 2‖Σ*‖ separately.
Next, it remains to bound the other error term. Applying Lemma 4 with t = √p gives δ = C_9 √(p/n) + t/√n = C_10 √(p/n). Due to condition (S8), δ → 0 as n → ∞; thus max{δ, δ^2} = δ = C_10 √(p/n) for sufficiently large n. Combining the results in (S11) and (S12), by Lemma 3 we obtain the desired bound for C' ≥ 2 max{C_8, C_10}. By choosing C ≥ C' C_2, we obtain the statement of the theorem. The proof for Z* ∈ Z_{p×K*} is now complete.
Similar to the proof for Z* ∈ Z_{p×K*}, if ε_n → 0 as n → ∞, then the same conclusion holds as n → ∞ for some positive constant C. The proof is now complete.

S4 Proof of Corollary 1
Proof of Corollary 1. For notational simplicity, we omit the index n. Let Z* = (z_1, z_2, · · · , z_p)^T, so that z_a^T is the a-th row of Z*. For each a, Σ_{b=1}^p z_a^T z_b represents the total number of features shared by object a and all the other objects. If z_a has at most q non-zero entries and each column of Z* has at most s non-zero entries, then Σ_{b=1}^p z_a^T z_b ≤ s q for every a, which implies that ‖Z* Z*^T‖ ≤ s q. Since s, q ≥ 1, we immediately have ε_n ≤ ε̃_n. Thus, ε̃_n → 0 implies ε_n → 0 (as n → ∞). According to Theorem 1, ε̃_n is a valid posterior contraction rate given that ε̃_n → 0 as n → ∞.

Tables

Table S1: Genes (gene symbol:entrez id) associated with the top 10 proteins with the largest loadings for the 5 most popular features. A mark means the gene is included in the feature.