Scaled Process Priors for Bayesian Nonparametric Estimation of the Unseen Genetic Variation

Abstract There is a growing interest in the estimation of the number of unseen features, mostly driven by biological applications. A recent work brought out a peculiar property of the popular completely random measures (CRMs) as prior models in Bayesian nonparametric (BNP) inference for the unseen-features problem: for fixed prior's parameters, they all lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. CRMs are thus not a flexible prior model for the unseen-features problem and, while the Poisson posterior distribution may be appealing for analytical tractability and ease of interpretability, its independence from the sampling information makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior's parameters. In this article, we introduce the stable-Beta scaled process (SB-SP) prior, and we show that it allows one to enrich the posterior distribution of the number of unseen features arising under CRM priors, while maintaining its analytical tractability and interpretability. That is, the SB-SP prior leads to a negative Binomial posterior distribution, which depends on the sampling information through the sample size and the number of distinct features, with corresponding estimates being simple, linear in the sampling information and computationally efficient. We apply our BNP approach to synthetic data and to real cancer genomic data, showing that: (i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; (ii) it provides improved coverage with respect to a BNP approach under CRM priors. Supplementary materials for this article are available online.


Introduction
The problem of estimating the number of unseen features generalizes the popular unseen-species problem [Orlitsky et al., 2016], and its importance has grown dramatically in recent years, driven by applications in biological sciences [Ionita-Laza et al., 2009, Gravel, 2014, Zou et al., 2016, Chakraborty et al., 2019]. Consider a generic population in which each individual is endowed with a finite collection of W-valued features, with W possibly being an infinite space, and denote by p_i the probability that an individual has feature w_i ∈ W for i ≥ 1. The unseen-features problem assumes N ≥ 1 observable random samples Z_1:N = (Z_1, . . . , Z_N) from the population, such that Z_n = (A_{n,i})_{i≥1} are independent Bernoulli random variables with unknown parameters (p_i)_{i≥1}. Then, the goal is to estimate the number of hitherto unseen features that would be observed if M ≥ 1 additional samples were collected, i.e.
with 1 being the indicator function. The unseen-species problem arises under the assumption that each individual is endowed with only one feature, i.e. a species. A wide range of approaches have been developed to estimate U, including Bayesian methods [Ionita-Laza et al., 2009, Masoero et al., 2021], jackknife [Gravel, 2014], linear programming [Zou et al., 2016], and variations of Good-Toulmin estimators [Orlitsky et al., 2016, Chakraborty et al., 2019].
In biological sciences, we may think of individuals as organisms and of features as groups to which organisms belong, with each group being defined by any difference in the genome relative to a reference genome, i.e. a (genetic) variant. In human biology, the estimation of U arises in the context of optimal allocation of resources between quantity and quality in genetic experiments: spending resources to sequence a greater number of genomes (quantity), which reveals more about variation across the population, or spending resources to sequence genomes with increased accuracy (quality), which reveals more about individual organisms' genomes. Accurate estimates of U are critical in the experimental pipeline towards the goal of maximizing the usefulness of experiments under the trade-off between quantity and quality [Laird, 2010, Zou et al., 2016]. While in human biology the cost of sequencing has decreased in recent years [Schwarze et al., 2020], the expense remains nontrivial, and it is still critical in fields where scientists work with relatively limited budgets, e.g. non-human and non-model organisms [Souza et al., 2017]. Other applications arise in precision medicine [Momozawa and Mizukami, 2020], microbiome analysis [Sanders et al., 2019], single-cell sequencing [Zhang et al., 2020] and wildlife monitoring [Johansson et al., 2020].

Our contributions
We introduce a Bayesian nonparametric (BNP) approach to the unseen-features problem, which relies on a novel prior distribution for the unknown (p_i)_{i≥1}. Completely random measures (CRMs) [Kingman, 1992] provide a broad class of nonparametric priors for feature sampling problems, the most popular being the stable-Beta process prior [James, 2017, Broderick et al., 2018]. In a recent work, Masoero et al. [2021] brought out a peculiar feature of CRM priors in the unseen-features problem: they all lead to a Poisson posterior distribution of U, given Z_1:N and fixed prior's parameters, which depends on Z_1:N only through the sample size N. Despite the broadness of the class of CRM priors, such a common Poisson posterior structure makes CRMs an inflexible prior model for the unseen-features problem. While the Poisson posterior distribution may be appealing in principle, making posterior inferences analytically tractable and easy to interpret, its independence from Z_1:N makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of the unknown prior's parameters. A somewhat similar scenario occurs in BNP inference for the unseen-species problem under a Dirichlet process (DP) prior [Ferguson, 1973], and led to the use of the Pitman-Yor process (PYP) prior [Pitman and Yor, 1997] for enriching the posterior distribution of the number of unseen species, while maintaining the analytical tractability and interpretability of the DP prior [Lijoi et al., 2007].
We show that scaled process (SP) priors, first introduced in James et al. [2015], allow one to enrich the posterior distribution of U arising under CRM priors. Under SP priors, we characterize the posterior distribution of U as a mixture of Poisson distributions that may include, through the mixing distribution, the whole sampling information in terms of the number of distinct features and their frequencies. While this is appealing in principle, it may come at the cost of analytical tractability and interpretability, which are critical for a concrete use of SP priors. Then, we introduce the stable-Beta SP (SB-SP) prior, which provides a sensible trade-off between the amount of sampling information introduced in the posterior distribution of U, and the analytical tractability and interpretability of the posterior inferences. In particular, we characterize the SB-SP prior as the sole SP prior for which the posterior distribution of U, given Z_1:N and fixed prior's parameters, depends on Z_1:N through the sample size N and the number K_N of distinct features; the SB-SP may thus be considered as the natural counterpart of the PYP for the unseen-features problem. Under the SB-SP prior, the posterior distribution of U, as well as of a refinement of U that deals with the number of unseen rare features, is a negative Binomial distribution, whose parameters depend on N, K_N and the prior's parameters. Corresponding Bayesian estimates of U, with respect to a squared loss function, are simple, linear in K_N and computationally efficient.
We present an empirical validation of the effectiveness of our BNP methodology, both on synthetic and real data. As for real data, we consider cancer genomic data, where the goal is to estimate the number of new (genomic) variants to be discovered in future unobservable samples. In cancer genomics, accurate estimates of the number of new variants are of particular importance, as they might help practitioners understand the site of origin of cancers, as well as the clonal origin of metastasis, and in turn be a useful tool to develop effective clinical strategies [Chakraborty et al., 2019, Huyghe et al., 2019]. We make use of data from the cancer genome atlas (TCGA), and focus on the challenging scenario in which the sample size N is particularly small, and also small with respect to the extrapolation size M. Such a scenario is of interest in genomic applications, where only a few samples of a rare cancer might be available. We show that our BNP methodology outperforms the most popular parametric and nonparametric competitors, both classical (frequentist) and Bayesian, in terms of estimation accuracy of U and of a refinement of U for rare features. In addition, with respect to the BNP approach under the stable-Beta process prior [Masoero et al., 2021], our approach provides improved coverage for the estimation. This is empirical evidence of the effectiveness of replacing the Poisson posterior distribution with the negative Binomial posterior distribution, which allows one to better exploit the sampling information.

Organization of the paper
In Section 2 we show how SP priors allow one to enrich the posterior distribution of U arising under CRM priors. In Section 3 we introduce and investigate the SB-SP prior in the context of the unseen-features problem: i) we characterize the SB-SP prior in the class of SP priors, providing its predictive distribution; ii) we apply the SB-SP prior to the unseen-features problem, providing the posterior distribution of U and a BNP estimator. Section 4 contains illustrations of our method. In Section 5 we discuss our approach, a multivariate extension of it, and future research directions. Proofs and additional experiments are in the Appendix.

Scaled process priors for feature sampling problems
For a measurable space of features W, we assume N ≥ 1 observable individuals to be modeled as a random sample Z_1:N from the {0, 1}-valued stochastic process Z(w) = Σ_{i≥1} A_i δ_{w_i}(w), w ∈ W, where (w_i)_{i≥1} are features in W and (A_i)_{i≥1} are independent Bernoulli random variables with unknown parameters (p_i)_{i≥1}, p_i being the probability that an individual has feature w_i, for i ≥ 1. That is, Z is a Bernoulli process with parameter ζ = Σ_{i≥1} p_i δ_{w_i}, denoted as BeP(ζ). BNP inference for feature sampling problems relies on the specification of a prior distribution on the discrete measure ζ, leading to the BNP-Bernoulli model, namely ζ is a discrete random measure on W whose law takes on the interpretation of a prior distribution for the unknown feature composition of the population. By de Finetti's theorem, the random variables Z_n's in (1) are exchangeable with directing measure given by the law of ζ [Aldous, 1983]. In this section, we show how SP priors for ζ [James et al., 2015] allow one to enrich the posterior distribution of the number of unseen features arising under CRM priors.
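The Bernoulli-process sampling model above is straightforward to simulate once a feature-probability sequence (p_i)_{i≥1} is fixed. The sketch below uses an arbitrary truncated power-law choice of (p_i), an assumption made only for illustration (it is not the paper's prior); it draws N samples and counts the number K_N of distinct observed features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical truncated feature probabilities (p_i): a power-law decay,
# chosen only for illustration.
p = 0.3 / np.arange(1, 2001) ** 1.2

def draw_bernoulli_process(n, p, rng):
    """Draw n i.i.d. Bernoulli-process samples Z_1:n: row j, column i is the
    indicator A_{j,i} that individual j displays feature w_i."""
    return rng.random((n, p.size)) < p

Z = draw_bernoulli_process(10, p, rng)
K_N = int(Z.any(axis=0).sum())  # number of distinct features in the sample
```

In this representation each row of `Z` is the sequence of indicators (A_{n,i})_{i≥1} (truncated), in line with the one-to-one correspondence between Z_n and the points {(A_{n,i}, W_i)}_{i≥1}.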

CRM priors for Bernoulli processes
CRMs provide a standard tool to define nonparametric prior distributions on the parameter ζ of the Bernoulli process Z. Consider a homogeneous CRM µ_0 on W, i.e. µ_0 = Σ_{i≥1} ρ_i δ_{W_i}, where the ρ_i's are (0, 1)-valued random atoms such that Σ_{i≥1} ρ_i < +∞, while the W_i's are i.i.d. W-valued random locations independent of the ρ_i's. The law of µ_0 is characterized, through the Lévy-Khintchine formula, by the Lévy intensity measure ν_0(ds, dw) = λ_0(s) ds P(dw) on (0, 1) × W, where: i) λ_0 is a measure on (0, 1), which controls the distribution of the ρ_i's, and such that ∫_{(0,1)} min{s, 1} λ_0(s) ds < +∞; ii) P is a non-atomic measure on W, which controls the distribution of the W_i's. For short, µ_0 ∼ CRM(ν_0).
See Appendix A for an account of CRMs [Kingman, 1992, Chapter 8]. Note that, since P is non-atomic, the random atoms W_i's are almost surely distinct, that is to say different features cannot coincide almost surely. The law of µ_0 provides a natural prior distribution for the parameter ζ of the Bernoulli process. The Beta and the stable-Beta processes are popular examples of µ_0 ∼ CRM(ν_0), for suitable specifications of ν_0. A comprehensive posterior analysis of CRM priors is presented in James [2017]. In the next proposition, we recall the predictive distribution of CRM priors [James, 2017, Proposition 3.2].
Proposition 1. Let Z_1:N be a random sample from (1) with ζ ∼ CRM(ν_0). If Z_1:N displays K_N = k distinct features {W*_1, . . . , W*_{K_N}}, each feature W*_i appearing exactly M_{N,i} = m_i times, then the conditional distribution of Z_{N+1}, given Z_1:N, coincides with the distribution of

Note that each Z_n is uniquely determined by the Bernoulli variables (A_{n,i})_{i≥1} and the random features (W_i)_{i≥1}; in other terms, there exists a one-to-one correspondence between Z_n and the sequence of points {(A_{n,i}, W_i)}_{i≥1}. Finally, note that, although the values of the features' labels W_i are immaterial, the features W_i's are assumed to be random. This is in line with the BNP literature on species sampling models, where the species' labels are assumed to be random [Pitman, 1996].

SP priors for Bernoulli processes
Consider a homogeneous CRM µ = Σ_{i≥1} τ_i δ_{W_i} on W, where the τ_i's are non-negative and such that Σ_{i≥1} τ_i < +∞, and the W_i's are i.i.d. and independent of the τ_i's. We denote by ν(ds, dw) = λ(s) ds P(dw) on R_+ × W, with ∫_{R_+} min{s, 1} λ(s) ds < +∞, the Lévy intensity measure of µ. Let ∆_1 > ∆_2 > . . . be the decreasingly ordered τ_i's, and consider the corresponding discrete random measure [Ferguson and Klass, 1972, pg. 1636], and let G_a be the conditional distribution of (∆_{i+1}/∆_1)_{i≥1} given ∆_1 = a. Moreover, let ∆_{1,h} denote a random variable whose distribution has a density function where h is a non-negative function and f_{∆_1} is the density function of ∆_1; the resulting discrete random measure µ_{∆_{1,h}} is a SP. For short, µ_{∆_{1,h}} ∼ SP(ν, h). The law of µ_{∆_{1,h}} is a prior distribution for the parameter ζ of the Bernoulli process. The next proposition characterizes the predictive distribution of SP priors. See also James et al. [2015, Proposition 2.2] for a posterior analysis of SP priors.
Proposition 2. Let Z_1:N be a random sample from (1) with ζ ∼ SP(ν, h). If Z_1:N displays K_N = k distinct features {W*_1, . . . , W*_{K_N}}, each feature W*_i appearing exactly M_{N,i} = m_i times, then the conditional distribution of ∆_{1,h}, given Z_1:N, has a density function of the form Moreover, the conditional distribution of Z_{N+1}, given (∆_{1,h}, Z_1:N), coincides with the distribution of where: i) the "new" features W_i's are i.i.d. according to P(dw); ii) the A_{N+1,i}'s are independent Bernoulli random variables with parameters J_i's, respectively, such that J_i | ∆_{1,h} is distributed according to the corresponding conditional density.

See Appendix B for the proof of Proposition 2. The marginalization of (5) with respect to (4) leads to the predictive distribution of SP priors: i) Z_{N+1} displays "new" features W_i's, and the posterior distribution of statistics of "new" features, given Z_1:N, is determined by the law of (∆_{1,h}, Z_{N+1}); ii) Z_{N+1} displays "old" features W*_i's, and the posterior distribution of statistics of "old" features, given Z_1:N, is determined by the law of (∆_{1,h}, Σ_{1≤i≤K_N} A_{N+1,i} δ_{W*_i}). Because of (4) and (5), the law of (∆_{1,h}, Z_{N+1}) may include the whole sampling information, depending on the specification of ν and h, and hence the posterior distribution of statistics of "new" features, given Z_1:N, also includes such information. As a corollary of Proposition 2, the posterior distribution of the number of unseen features, given Z_1:N and fixed prior's parameters, is a mixture of Poisson distributions that may include the whole sampling information; in particular, the amount of sampling information in the posterior distribution is uniquely determined by the mixing distribution, namely by the conditional distribution of ∆_{1,h}, given Z_1:N.
SP priors thus allow one to enrich the Poisson posterior structure arising from CRM priors, in terms of both a more flexible distribution and the inclusion of more sampling information than the sole sample size N, though they may lead to unwieldy posterior inferences due to the marginalization with respect to (4).
The use of the sampling information in the predictive structure of SPs somewhat resembles that of Poisson-Kingman (PK) models [Pitman, 2006]. PK models form a broad class of nonparametric priors for species sampling problems. The DP prior is a PK model whose predictive distribution is such that: i) the conditional probability that the (N+1)-th draw is a "new" species, given N observable samples, depends only on the sample size; ii) the conditional probability that the (N+1)-th draw is an "old" species, given N observable samples, depends on the sample size, the number of distinct species and their frequencies. Such a behaviour resembles that of CRM priors, i.e. Proposition 1. PK models allow one to include more sampling information in the probability of discovering a "new" species arising under the DP prior, which typically comes at the cost of the analytical tractability of posterior inferences for the number of unseen species [Bacallado et al., 2017]. Such a behaviour resembles that of SP priors, i.e. Proposition 2. The PYP prior is arguably the most popular PK model. It stands out for enriching the probability of discovering a "new" species arising under the DP prior, by including the sampling information on the number of distinct species, while maintaining the analytical tractability and interpretability of the DP prior.
Stable-Beta Scaled Process (SB-SP) priors for the unseen-features problem
In Section 2 we showed how SP priors allow one to enrich the Poisson posterior structure of the number of unseen features arising under CRM priors, e.g. the Beta and the stable-Beta process priors. While this is an appealing property, it may lead to a lack of analytical tractability and interpretability of posterior inferences, thus making SP priors not of practical interest in applications. In this section, we introduce and investigate a peculiar SP prior, which is referred to as the SB-SP prior, and we show that: i) it leads to a negative Binomial posterior distribution for the number of unseen features, which generalizes the Poisson distribution while maintaining its analytical tractability and interpretability; ii) it leads to a posterior distribution for the number of unseen features, which depends on the sampling information through the sample size and the number of distinct features. The SB-SP prior thus provides a sensible trade-off between the enrichment of the Poisson posterior structure of the number of unseen features arising under CRM priors and the analytical tractability and interpretability of posterior inferences.
In particular, we characterize the SB-SP prior as the sole SP prior for which the posterior distribution of the number of unseen features depends on the observable sample only through the sample size and the number of distinct features. The SB-SP may thus be considered as a natural counterpart of the PYP for the unseen-feature problem.
For any non-negative function h, a S-SP on W is defined as the SP with law SP(ν_σ, h). S-SP priors generalize the Beta process prior, which is recovered by setting h to be the identity function, and then letting σ → 0 [James et al., 2015]. The predictive distribution of ζ ∼ SP(ν_σ, h) is obtained from Proposition 2. In the next theorem, we characterize the S-SP priors as the sole SP priors for which the conditional distribution of ∆_{1,h}, given Z_1:N, depends on Z_1:N only through the sample size N and the number K_N of distinct features in Z_1:N.
Theorem 1. Let Z_1:N be a random sample from (1) with ζ ∼ SP(ν, h), and let Z_1:N display K_N distinct features with corresponding frequencies (M_{N,1}, . . . , M_{N,K_N}). Moreover, let ν(ds, dw) = λ(s) ds P(dw), and let f_{∆_{1,h}} be the density function of ∆_{1,h}. If f_{∆_{1,h}} > 0 on R_+ and the functions λ and f_{∆_{1,h}} are continuously differentiable, then the conditional distribution of ∆_{1,h}, given Z_1:N, depends on Z_1:N only through N and K_N if and only if ν = ν_σ.
See Appendix C for the proof of Theorem 1. We recall from Section 2 that the conditional distribution of ∆_{1,h}, given Z_1:N, uniquely determines the amount of sampling information included in the posterior distribution of statistics of "new" features. Then, according to Theorem 1, S-SP priors are the sole SP priors for which the posterior distribution of the number of unseen features, given Z_1:N and fixed prior's parameters, depends on Z_1:N only through N and K_N. As a corollary of Theorem 1, the Beta process prior is the sole S-SP prior for which the posterior distribution of statistics of "new" features depends on Z_1:N only through N. Analogous predictive characterizations are well-known in species sampling problems, and they are typically referred to as "sufficientness" postulates [Bacallado et al., 2017]. In particular, the DP prior is characterized as the sole species sampling prior for which the conditional probability that the (N+1)-th draw is a "new" species, given N observable samples, depends only on the sample size [Regazzini, 1978]. Moreover, the PYP prior is characterized as the sole species sampling prior for which the conditional probability that the (N+1)-th draw is a "new" species, given N observable samples, depends only on the sample size and the number of distinct species in the sample [Zabell, 2005]. Theorem 1 provides a "sufficientness" postulate in the context of feature sampling problems.
As a noteworthy example of S-SPs, we introduce the SB-SP. The SB-SP is a S-SP obtained by a suitable specification of the non-negative function h. In particular, for any c, β > 0 let h_{c,β} be defined by where Γ(·) denotes the Gamma function. Then a SB-SP on W is defined as the SP with law SP(ν_σ, h_{c,β}). For short, we denote the law of a SB-SP by SB-SP(σ, c, β). The SB-SP prior generalizes the Beta process prior, which is recovered by setting c = 0 and β = 1, and then letting σ → 0. According to the construction of SPs, the distribution of ∆_{1,h_{c,β}} has a density function obtained by combining (6) and (7); this is a polynomial-exponential tilting of the density function (6). In particular, ∆_{1,h_{c,β}}^{-σ} is distributed according to a Gamma distribution with shape (c + 1) and rate β. Such a straightforward distribution for ∆_{1,h_{c,β}} is at the core of the analytical tractability of posterior inferences under the SB-SP prior; this fact will become clear in the application of the SB-SP prior to the problem of estimating the number of unseen features. The next proposition characterizes the predictive distribution of the SB-SP prior.
Proposition 3. Let Z_1:N be a random sample from (1) with ζ ∼ SB-SP(σ, c, β). If Z_1:N displays K_N = k distinct features {W*_1, . . . , W*_{K_N}}, each feature W*_i appearing exactly M_{N,i} = m_i times, then the conditional distribution of ∆_{1,h_{c,β}}, given Z_1:N, has a density function of the form where γ_0^{(N)} = σ Σ_{1≤i≤N} B(1 − σ, i), with B(·, ·) being the (Euler) Beta function. Moreover, the conditional distribution of Z_{N+1}, given (∆_{1,h_{c,β}}, Z_1:N), coincides with the distribution of where: ii) the A_{N+1,i}'s are independent Bernoulli random variables with parameters J_i's, respectively, such that each J_i | ∆_{1,h_{c,β}} is distributed according to a density function of the form

See Appendix C for the proof of Proposition 3. According to Equation (8), the conditional distribution of ∆_{1,h_{c,β}}, given Z_1:N, depends on Z_1:N only through the sample size N and the number K_N of distinct features in Z_1:N. This agrees with Theorem 1, implying that the posterior distribution of the number of unseen features, given Z_1:N and fixed prior's parameters, depends on Z_1:N only through N and K_N. Because of (8) and (9), the posterior distribution of statistics of "new" features stands out for analytical tractability, thus being competitive with that arising from CRMs, e.g. the Beta and the stable-Beta processes. In particular, from Equation (9), the conditional distribution of Z_{N+1}, given (∆_{1,h_{c,β}}, Z_1:N), is a Poisson distribution that depends on Z_1:N only through N. Then, from (8), its marginalization with respect to the conditional distribution of ∆_{1,h_{c,β}}, given Z_1:N, leads to a negative Binomial posterior distribution. Such an appealing property arises from the peculiar form of h_{c,β} that, combined with ν_σ, leads to a conjugacy property for the conditional distribution of ∆_{1,h_{c,β}}, given Z_1:N.
That is, the conditional distribution of ∆_{1,h_{c,β}}^{-σ}, given Z_1:N, is a Gamma distribution with shape (K_N + c + 1) and rate (β + γ_0^{(N)}), which is the distribution of ∆_{1,h_{c,β}}^{-σ} with shape and rate updated through Z_1:N. The next proposition establishes the distribution of a random sample Z_1:N from a SB-SP prior. See Appendix C for details.
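The conjugate update above is simple to evaluate in practice. The sketch below computes γ_0^{(N)} = σ Σ_{1≤i≤N} B(1 − σ, i) via log-Gamma functions for numerical stability, and returns the posterior shape and rate of ∆_{1,h_{c,β}}^{-σ}; the parameter values in the example are arbitrary, for illustration only.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Euler Beta function B(a, b), via log-Gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def posterior_gamma_params(sigma, c, beta, N, K_N):
    """Posterior shape and rate of Delta_{1,h_{c,beta}}^{-sigma} given Z_1:N:
    the prior Gamma(c + 1, beta) is updated to Gamma(K_N + c + 1, beta + gamma0_N),
    with gamma0_N = sigma * sum_{i=1..N} B(1 - sigma, i)."""
    gamma0_N = sigma * sum(exp(log_beta(1.0 - sigma, i)) for i in range(1, N + 1))
    return K_N + c + 1, beta + gamma0_N

# Illustrative values only (not fitted to any data).
shape, rate = posterior_gamma_params(sigma=0.5, c=1.0, beta=1.0, N=10, K_N=25)
```

With N = 0 the function returns the prior parameters (c + 1, β), consistent with the Gamma(c + 1, β) distribution of ∆_{1,h_{c,β}}^{-σ} stated above.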
Proposition 4. Let Z 1:N be a random sample from (1) with ζ ∼ SB-SP(σ, c, β). The probability that Z 1:N displays a particular feature allocation of k distinct features with frequencies (m 1 , . . . , m k ) is

BNP inference for the unseen-features problem
Now, we apply the SB-SP prior to the unseen-features problem. For any N ≥ 1 let Z_1:N be an observable sample modeled as the BNP-Bernoulli model (1), with ζ ∼ SB-SP(σ, c, β). Moreover, under the same model as the Z_n's, for any M ≥ 1 let (Z_{N+1}, . . . , Z_{N+M}) be an additional unobservable sample. Then, the unseen-features problem calls for the estimation of U_N^{(M)}, namely the number of hitherto unseen features that would be observed in (Z_{N+1}, . . . , Z_{N+M}). As a generalization of the unseen-features problem (11), for r ≥ 1 we consider the estimation of U_N^{(M,r)}, namely the number of hitherto unseen features that would be observed with prevalence r in (Z_{N+1}, . . . , Z_{N+M}).
Of special interest is r = 1, which concerns rare (unique) features. The next theorem characterizes the posterior distributions of U_N^{(M)} and U_N^{(M,r)}, given Z_1:N. We denote by NegativeBinomial(n, p) the negative Binomial distribution with parameters n and p ∈ (0, 1).
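To fix ideas, when both the pilot sample and a follow-up sample are available as 0/1 presence matrices, the target quantities U_N^{(M)} and U_N^{(M,r)} can be computed exactly by counting. A minimal sketch, using a toy matrix assumed only for illustration:

```python
import numpy as np

def unseen_features(Z_obs, Z_new, r=None):
    """Number of features absent from the pilot Z_obs that appear in the
    follow-up Z_new (U_N^{(M)}); if r is given, count only those appearing
    exactly r times in Z_new (U_N^{(M,r)}). Rows index samples, columns
    index features, entries are 0/1."""
    Z_obs, Z_new = np.asarray(Z_obs), np.asarray(Z_new)
    unseen = ~Z_obs.any(axis=0)          # features not in the pilot sample
    counts = Z_new.sum(axis=0)           # prevalence in the follow-up sample
    hit = (counts == r) if r is not None else (counts >= 1)
    return int((unseen & hit).sum())

Z_obs = [[1, 0, 0, 0],
         [1, 1, 0, 0]]
Z_new = [[0, 0, 1, 0],
         [0, 1, 1, 0]]
print(unseen_features(Z_obs, Z_new))       # feature 3 is new -> 1
print(unseen_features(Z_obs, Z_new, r=2))  # it appears twice -> 1
```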
Theorem 2. Let Z_1:N be a random sample from (1) with ζ ∼ SB-SP(σ, c, β), and let Z_1:N display K_N distinct features. Then the posterior distributions of U_N^{(M)} and, for any index of prevalence r ≥ 1, of U_N^{(M,r)}, given Z_1:N, are negative Binomial distributions whose parameters depend on Z_1:N only through N and K_N.

The next theorem provides the large-M asymptotic behaviour of the posterior distributions (13) and (14), showing that the number of unseen features has a power-law growth in M. The same growth in M holds under the stable-Beta process prior [Masoero et al., 2021, Proposition 2], though the limiting distribution is degenerate.
Theorem 3. Let Z_1:N be a random sample from (1) with ζ ∼ SB-SP(σ, c, β), and let Z_1:N display K_N distinct features. Then, as M → +∞, where W_N is a Gamma random variable with shape (K_N + c + 1) and rate (β + γ_0^{(N)}), and where W_{N,r} is a Gamma random variable with shape (K_N + c + 1) and rate Γ(r + 1)(β + γ_0^{(N)}).

Experiments
Over the last decade, genomics has witnessed an extraordinary improvement in data availability due to the advent of next generation sequencing technologies. Thanks to larger and richer datasets, researchers have started uncovering the role and impact of rare genetic variants in heritability and human disease [Hernandez et al., 2019, Momozawa and Mizukami, 2020]. The development of methods for estimating the number of new genomic variants to be observed in future studies is an active research area, as it can aid the design of effective clinical procedures in precision medicine [Ionita-Laza et al., 2009, Zou et al., 2016], enhance understanding of cancer biology [Chakraborty et al., 2019], and help to optimize sequencing procedures [Rashkin et al., 2017, Masoero et al., 2021]. Here, we consider datasets of individual genomic sequences. Following common practice, we assume that an underlying fixed and idealized genomic sequence (the "reference") is given. Then, each coordinate of an individual sequence reports the presence (1) or absence (0) of variation at a given locus with respect to the reference. All variants are treated equally, namely, any expression differing from the underlying reference at a given locus counts as a variant. We make use of our methodology to estimate the number of genomic loci at which variation was not observed in the original sample, and is going to be observed in (at least one of) M additional datapoints.
We find in our experiments that the estimates of the total number of new variants to be observed produced using the SB-SP-Bernoulli model, hereafter referred to as SSB, tend to be more accurate than those of other available methods in the literature. This phenomenon is particularly evident when the sample size N of the training set is small, and when the extrapolation size M is large with respect to N. Moreover, the SSB model is particularly effective in estimating the number of new rare variants, e.g. variants appearing only once in the additional unobservable samples. Accurate estimation of rare variants is particularly important, as these are believed to be largely responsible for the heritability of human disease [Rashkin et al., 2017, Chakraborty et al., 2019]. To benchmark the quality of the SSB, we consider a number of competing methodologies for the feature prediction problem available in the literature: i) jackknife estimators (J) [Gravel, 2014]; ii) a linear programming method (LP) [Zou et al., 2016]; iii) variations of Good-Toulmin estimators (GT) [Chakraborty et al., 2019]. We also compare our empirical findings to a BNP estimator obtained under the stable-Beta process prior (3BB), which was introduced in Masoero et al. [2021]. We complete our analysis with a thorough investigation on synthetic data in Appendix F and Appendix G, as well as on additional real data from the gnomAD database [Karczewski et al., 2020] in Appendix H.

Empirics and evaluation metrics
For the SSB method to be useful, we need to estimate the underlying, unknown parameters of the SB-SP prior. To learn these prior's parameters, we adopt an empirical Bayes procedure, which consists in maximizing the marginal distribution (10). In particular, we maximize numerically Equation (10) with respect to the parameters β > 0, c > 0 and σ ∈ (0, 1) of the SB-SP prior, and use the resulting values to produce our estimators. That is, we let (β̂, ĉ, σ̂) = arg max_{(β,c,σ)} p^{(N)} and plug these values into the BNP estimators (13) and (14). To assess the accuracy of our estimates, we consider the percent deviation of the estimate from the truth to be the achieved accuracy; that is, the accuracy of the estimator Û_N^{(M)} is its percent deviation from the true value U_N^{(M)} (Equation (19)).
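A minimal implementation of the accuracy computation, under the reading of Equation (19) as the percent deviation of the estimate from the truth (the exact normalization of (19) is not reproduced here, so this is a sketch of the metric, not the paper's definitive formula):

```python
def percent_deviation(estimate, truth):
    """Percent deviation of an estimate from the true (nonzero) value;
    lower is better. One natural reading of the metric in Equation (19)."""
    return 100.0 * abs(estimate - truth) / truth

acc = percent_deviation(90.0, 100.0)  # -> 10.0
```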

Estimating the number of new variants in cancer genomics
Following the empirical study of Chakraborty et al. [2019], we make use of data from the Cancer Genome Atlas (TCGA), the largest publicly available cancer genomics dataset, containing somatic mutations from 10,295 patients and spanning 33 different cancer types. We partition the samples into 33 smaller datasets according to the cancer-type annotation of each patient. See Chakraborty et al. [2019] and Masoero et al. [2021, Appendix F] for details on the data and the experimental setup. For each cancer type, we retain a small fraction of the data for purposes of training, and consider the task of estimating the number of new variants that will be observed in a follow-up sample given a pilot sample. We validate our estimates by comparing the estimate Û_N^{(M)} of the number of distinct variants to the true value, obtained by extrapolating to the remaining data. To assess the variability and error in our estimates, we repeat the experiment for every cancer type on S = 1,000 subsets of the data, each obtained by randomly subsampling without replacement from the full sample.
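The subsampling scheme above can be sketched as follows: given a full 0/1 variant matrix, each replicate draws a pilot of size N without replacement and records the true number of variants unseen in the pilot but present in the held-out remainder. The matrix below is synthetic, assumed only for illustration.

```python
import numpy as np

def replicate_truths(Z_full, N, S, rng):
    """For S random pilot subsamples of size N (drawn without replacement),
    return the true number of variants absent from the pilot but present in
    the held-out remainder -- the target used to score the estimators."""
    Z_full = np.asarray(Z_full)
    truths = []
    for _ in range(S):
        idx = rng.permutation(Z_full.shape[0])
        pilot, rest = Z_full[idx[:N]], Z_full[idx[N:]]
        truths.append(int((~pilot.any(axis=0) & rest.any(axis=0)).sum()))
    return truths

rng = np.random.default_rng(1)
Z_full = rng.random((50, 200)) < 0.05   # synthetic 0/1 variant matrix
vals = replicate_truths(Z_full, N=10, S=20, rng=rng)
```

The spread of `vals` across replicates is what Figure 1's accuracy distributions summarize, once each truth is paired with the corresponding estimate.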
We find that the SSB and 3BB methods perform particularly well when the training sample size N is small compared to the extrapolation sample size M. This setting is relevant in the context of cancer genomics, as scientists are interested in understanding the "unexploited potential" of the genetic information, especially for rare cancer subtypes [Chakraborty et al., 2019, Huyghe et al., 2019].
To compare and quantify the performance of the available methodologies in this setting, we report in Figure 1 the distribution of the estimation accuracy when retaining only N = 10 samples for training and extrapolating to the largest possible sample size M for which we can compute the accuracy metric (Equation (19)). We report results for the 10 cancer types with the largest number of samples in the original dataset. For each cancer type and for each method, the distribution of the estimation accuracy is obtained by considering its performance across the S = 1,000 replicates. Across all cancer types, the estimates obtained from the SSB method achieve higher accuracy.
We show in Figure 2 the behavior of $\hat U_N^{(i)}$ for five different cancer types as $i = 1, \ldots, M$. Again, we let N = 10, and M be the largest possible extrapolation value, as dictated by the dataset size. We report the estimates obtained from a fixed sample of size N = 10, as well as the variability around such estimates obtained by re-fitting each model, iteratively leaving one datapoint out from the sample.
In this setting, the SSB method outperforms competing methods in terms of estimation accuracy.
Moreover, the variability in the estimates arising from re-fitting the model on subsets of the data provides a useful measure of uncertainty in such estimation.

Estimating the number of new rare variants in cancer genomics
[Figure 2 caption] For each method and cancer type, we retain N = 10 random samples and use them to estimate up to the largest possible size. We fit each model on the full sample, as well as N = 10 additional times by iteratively leaving one datapoint out from the training sample. The solid black line is the true number of features that would have been observed (vertical axis) for any extrapolation size N + M (horizontal axis), for a fixed ordering of the data. Shaded regions report the prediction range obtained from the estimates from the leave-one-out fits.

In recent years, the cancer genomics research community has become increasingly interested in studying rare variants, i.e. variants observed in as few as one patient. Evidence suggests that rare deleterious variants can have far stronger effect sizes than common variants [Rasnic et al., 2020] and can play an important role in the development of cancer.
For example, in breast cancer, it is well accepted that the risk conferred by a variant is inversely related to its prevalence: the rarer the variant, the higher the risk [Wendt and Margolin, 2019].
Therefore, effective identification and discovery of rare variants is an active, ongoing research area [Lawrenson et al., 2016, Lee et al., 2019]. This phenomenon is not limited to breast cancer, but is progressively being studied across different cancer types. See, e.g., the recent works on ovarian [Phelan et al., 2017], skin [Goldstein et al., 2017], prostate [Nguyen-Dumont et al., 2020] and lung [Liu et al., 2021] cancers and references therein. In downstream analysis, these estimates could be useful for planning and designing future experiments, e.g. informing scientists on the number of new samples to be collected in order to observe a target number of new variants, or for power analysis considerations in rare-variant association tests [Rashkin et al., 2017].
The BNP framework considered here allows us to estimate the number of new rare variants to be discovered. While Zou et al. [2016] did not consider the problem of estimating rare variants, it is straightforward to obtain an estimate for this quantity from their framework. Indeed, for every prevalence x ∈ [0, 1], the LP method estimates the histogram h(x), which counts the number of variants appearing with prevalence x in the population, and an estimate of the number of new variants appearing r times follows from the binomial sampling model assumption. We show in Figure 4 that the SSB method provides better estimates than the 3BB and LP methods.
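Under the binomial sampling model just described, the expected number of new variants observed exactly r times among M additional samples can be sketched as follows. This is a plausible reading of the construction, not the verbatim estimator of Zou et al. [2016]: a variant with prevalence x must be unseen in the first N samples (probability $(1-x)^N$) and then appear r times among the next M.

```python
from math import comb

def new_variants_with_count(h, N, M, r):
    """Expected number of *new* variants (unseen in the first N samples)
    appearing exactly r times among M additional samples, given a
    prevalence histogram h = {x: count}.

    Hypothetical sketch of the binomial-sampling argument in the text;
    the exact LP-based estimator may differ in details.
    """
    return sum(
        count * (1.0 - x) ** N * comb(M, r) * x ** r * (1.0 - x) ** (M - r)
        for x, count in h.items()
    )
```

For a single variant with prevalence 0.5, no pilot sample (N = 0) and one follow-up sample, the expected count at r = 1 is 0.5.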

Coverage and calibrated uncertainties
One of the benefits of the BNP approach is that it automatically yields a notion of variability of the estimate of $U_N^{(M)}$ via posterior credible intervals. We here check whether these intervals produce a useful notion of uncertainty, by investigating their calibration. For α ∈ (0, 1), we say that a 100 × α% credible interval is calibrated if it contains the true value of interest, arising from hypothetical repeated draws, 100 × α% of the times. We here assess the calibration of a 100 × α% credible interval for $U_N^{(M)}$ by computing $w_N^{(M)}(\alpha)$, the fraction of the S experiments in which the true value was contained in the 100 × α% credible interval. The closer $w_N^{(M)}(\alpha)$ is to α, the better calibrated the credible intervals. We compute the same quantity for the 3BB method using the results in Masoero et al. [2021]. Although still not perfect, we find that the posterior predictive intervals obtained from the SSB method are better calibrated than the ones under the 3BB method (see Figure 5).

While the Poisson posterior distribution arising under CRM priors makes the posterior inferences analytically tractable and of easy interpretability, its independence from $Z_{1:N}$ makes the BNP approach a questionable oversimplification, with posterior inferences being completely determined by the estimation of unknown prior parameters. In this paper, we introduced the SB-SP prior, and showed that: i) it enriches the posterior distribution of the number of unseen features arising under CRM priors, which results in a negative Binomial distribution whose parameters depend on the sample size and the number of distinct features; ii) it maintains the same analytical tractability and interpretability as CRM priors, which results in BNP estimators that are simple, linear in the sampling information and computationally efficient. The effectiveness of the SB-SP prior is showcased through an empirical analysis on synthetic and real data.
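The empirical coverage $w_N^{(M)}(\alpha)$ just defined is simply the fraction of replicates whose credible interval covers the truth; a minimal sketch (function name is illustrative):

```python
def coverage(intervals, truths):
    """Empirical coverage w(alpha): the fraction of replicates whose
    credible interval [lo, hi] contains the corresponding true value.

    `intervals` is a list of (lo, hi) pairs, one per replicate; `truths`
    the matching list of true values.  Well-calibrated 100*alpha% intervals
    yield a value close to alpha.
    """
    hits = sum(lo <= t <= hi for (lo, hi), t in zip(intervals, truths))
    return hits / len(truths)
```

Comparing this fraction to α across a grid of levels gives the calibration assessment reported in Figure 5.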
Under the SB-SP prior, we found that estimates of the unseen number of features are accurate, and they outperform the most popular competitors in the challenging scenario where the sample size N is particularly small, and also small with respect to the extrapolation size M .
Our approach admits an extension to the multiple-feature setting, which takes into account the many forms of variation, e.g. single nucleotide changes, tandem repeats, insertions and deletions, copy number variations [Zou et al., 2016]. We briefly describe the multiple-feature setting, and defer to Appendix E for details. It is assumed that each feature $w_i$ comes with a characteristic, i.e. the form of variation, chosen among q > 1 characteristics. For N ≥ 1, the observable sample $Z_{1:N}$ is such that, for each feature $w_i$, either all the $A_{i,j}$'s are equal to 0, i.e. $w_i$ does not display variation, or only one $A_{i,j}$ is equal to 1 with probability $p_{i,j}$, i.e. $w_i$ displays variation with characteristic j. That is, Z is a multivariate Bernoulli process with parameter $\zeta = \sum_{i\geq 1} p_i \delta_{w_i}$.
The stable-Beta-Dirichlet process prior for ζ is a multivariate generalization of the stable-Beta process prior [James, 2017], and it leads to a Poisson posterior distribution for the number of unseen features, given Z 1:N , which depends on Z 1:N only through N . In Appendix E we introduce a scaled version of the stable-Beta-Dirichlet process, and show that it leads to a negative Binomial posterior distribution for the number of unseen features, which depends on Z 1:N through N and the number of distinct features in Z 1:N .
SP priors have been introduced in James et al. [2015] and, to the best of our knowledge, since then no other works have further investigated such a class of priors. To date, the peculiar predictive properties of SP priors appear to be unknown in the BNP literature. Our work on the unseen-features problem is the first to highlight the great potential of SP priors in BNP, showing that they provide a critical tool for enriching the predictive structure of the popular CRM priors [James, 2017, Broderick et al., 2018]. We believe that SPs may be of interest beyond the unseen-features problem, and more generally beyond the broad class of feature sampling problems. CRM priors, and in particular the Beta and stable-Beta process priors, have been widely used in several contexts, with a broad range of applications in topic modeling, analysis of social networks, binary matrix factorization for dyadic data, analysis of choice behaviour arising from psychology and marketing surveys, graphical models, and analysis of similarity judgement matrices. See Griffiths and Ghahramani [2011] and references therein for details. In all these contexts, SP priors may be more effective than CRM priors, as they allow one to better exploit the sampling information in posterior inferences.
Among applications of SP priors beyond feature sampling problems, it is worth mentioning the use of SP priors as hierarchical (or latent) priors in models of unsupervised learning [Griffiths and Ghahramani, 2011, Section 5], the most popular being Gaussian latent feature modeling. Unlike feature sampling problems, where the values of the features' labels $W_i$'s are immaterial, in Gaussian latent feature modeling the values of the $W_i$'s become material. That is, under the Gaussian latent feature model with a SP prior, observations are assumed to be modeled as a multivariate Gaussian distribution, whose mean depends on latent features that are modeled with a SP prior, thus making the values of the features' labels $W_i$'s of critical importance for the analysis. Bayesian factor analysis [Knowles and Ghahramani, 2011] provides another context where SP priors may be usefully applied as hierarchical priors. Within the context of factor analysis, we also mention the work of Ayed and Caron [2021] with applications to network analysis. There, the authors exploit CRM priors to recover the latent community structure in a network between individuals, and the features' labels describe the level of affiliation of a certain individual to a latent community. In such a context, we believe that SP priors may be used in place of CRM priors, with the advantage of introducing a richer predictive structure. In this respect, our work paves the way to promising directions of future research, in terms of both methods and applications.

[…] and Tamara Broderick were supported in part by the DARPA I2O LwLL program, an NSF CAREER Award, and ONR award N00014-17-1-2072.

A A brief account on completely random measures
In this section we provide a short account of completely random measures (CRMs). For a more exhaustive treatment refer to Daley and Vere-Jones [2008], Kingman [1992]. Let us denote by W a Polish space equipped with its Borel σ-field W, and let us also indicate by $B_{R_+}$ the Borel σ-field of the positive real line $R_+$. Denote by $M_W$ the space of all boundedly finite measures on (W, W); in other words, $\mu \in M_W$ iff $\mu(A) < +\infty$ for any bounded set A ∈ W. The space $M_W$ is usually assumed to be equipped with a proper Borel σ-algebra, which is induced by the so-called weak-hash convergence and denoted here as $\mathcal{M}_W$ (see Daley and Vere-Jones [2008] for details). A random measure µ on (W, W) is said to be a completely random measure if the random variables $\mu(A_1), \ldots, \mu(A_n)$ are independent for any choice of bounded and disjoint sets $A_1, \ldots, A_n \in W$ and for any n ≥ 1. Kingman [1967] proved that a CRM may be decomposed as the sum of three main components: i) a deterministic drift u, namely a deterministic measure on (W, W); ii) a part with random jumps at fixed locations; iii) a completely random component $\mu_c = \sum_{i \geq 1} \tau_i \delta_{W_i}$, with random jumps $\tau_i$ at random locations $W_i$. See Daley and Vere-Jones [2008] for a proof.
Following standard practice in the nonparametric literature, in this paper we deal with CRMs without deterministic drift and without fixed atoms, namely we assume that $\mu \equiv \mu_c$. In this case $\mu = \mu_c$ is characterized through the Lévy–Khintchine representation of its Laplace functional: for any measurable function $f : W \to R_+$,
$$\mathbb{E}\Big[e^{-\int_W f(w)\,\mu_c(dw)}\Big] = \exp\Big\{-\int_{R_+\times W}\big(1 - e^{-s f(w)}\big)\,\nu(ds, dw)\Big\}, \qquad (21)$$
where ν is a measure on $R_+ \times W$ and is referred to as the Lévy intensity of the CRM $\mu_c$. The measure ν is also required to satisfy the condition
$$\int_{R_+\times A}\min\{s, 1\}\,\nu(ds, dw) < +\infty$$
for any bounded A ∈ W. The representation (21) is of paramount importance to prove all our posterior results, and it clarifies the pivotal role of ν in the determination of the distributional properties of $\mu_c$. Kallenberg [2010] provides a very general decomposition for such a measure ν as follows: $\nu(ds, dw) = \lambda_w(ds)\Lambda(dw)$, where Λ is a σ-finite measure on (W, W) and $\lambda_w$ is a transition kernel, i.e., $w \mapsto \lambda_w(B)$ is measurable for any $B \in B_{R_+}$ and $\lambda_w$ is a measure on $(R_+, B_{R_+})$ for any w ∈ W. When $\lambda_w(ds) \equiv \lambda(ds)$ does not depend on w ∈ W, we say that the CRM is homogeneous, which is tantamount to saying that the atoms $W_i$'s and the jumps $\tau_i$'s are independent random variables. In BNP problems, it is common to suppose that $\Lambda(dw) = \alpha P(dw)$, where P is a probability measure on (W, W) and α > 0. Two remarkable examples of CRMs are the σ-stable process, which can be recovered by choosing $\lambda(ds) = \sigma s^{-1-\sigma}\,ds$, and the gamma process, which corresponds to the choice $\lambda(ds) = s^{-1}e^{-s}\,ds$. See also Lijoi and Prünster [2010] for additional details and connections with the BNP literature.
In Section E, we will make use of multivariate CRMs to define a multivariate extension of the Bernoulli process model, called the Bernoulli process model with a condiment. For this reason we now specify what we mean by a multivariate CRM. A vector $\mu = (\mu_1, \ldots, \mu_q)$ of random measures is said to be a multivariate CRM if the random vectors $(\mu_1(A_j), \ldots, \mu_q(A_j))$, for $j = 1, \ldots, n$, are independent for any choice of bounded and disjoint Borel sets $A_1, \ldots, A_n \in W$ and for any n ≥ 1.
A decomposition similar to the one stated in Equation (20) holds true for multivariate CRMs as well [Kallenberg, 2010]. In the present paper we focus on multivariate CRMs which are functionals of marked Poisson point processes on $R_+^q \times W$, namely $\mu_j = \sum_{i\geq 1}\tau_{i,j}\delta_{W_i}$ for $j = 1, \ldots, q$, where $(\tau_i)_{i\geq 1} = ((\tau_{i,1}, \ldots, \tau_{i,q}))_{i\geq 1}$ are random jumps in $R_+^q$ and $(W_i)_{i\geq 1}$ is a sequence of random atoms in W. Such a multivariate CRM has the following Lévy–Khintchine representation, which generalizes Equation (21):
$$\mathbb{E}\Big[e^{-\sum_{j=1}^q\int_W f_j(w)\,\mu_j(dw)}\Big] = \exp\Big\{-\int_{R_+^q\times W}\big(1 - e^{-\sum_{j=1}^q s_j f_j(w)}\big)\,\nu^{(q)}(ds, dw)\Big\} \qquad (22)$$
for arbitrary measurable functions $f_1, \ldots, f_q : W \to R_+$. The intensity measure $\nu^{(q)}$ in (22) is required to satisfy
$$\int_{R_+^q\times A}\min\{1, \|s\|\}\,\nu^{(q)}(ds, dw) < +\infty$$
for any bounded A ∈ W, having denoted by $\|s\|$ the Euclidean norm of the vector $s := (s_1, \ldots, s_q)$.

B Posterior analysis for SP priors: proofs and details
In the present section we derive the marginal, posterior and predictive distributions for the Bernoulli process model under a scaled process (SP) prior. Specifically, throughout the section we focus on the following statistical model:
$$Z_1, \ldots, Z_N \mid \mu_{\Delta_{1,h}} \ \overset{iid}{\sim}\ \mathrm{BeP}(\mu_{\Delta_{1,h}}), \qquad \mu_{\Delta_{1,h}} \sim \mathrm{SP}(\nu; h), \qquad (23)$$
where $\mu_{\Delta_{1,h}}$ has been defined at the beginning of Section 2.2. In Subsection B.1 we provide some lemmas regarding SP priors; Subsection B.2 is then concerned with the Bayesian posterior analysis of the model in (23).

B.1 Preparatory lemmas
Some preparatory lemmas are required before the posterior analysis. The first lemma provides the reader with the conditional distribution of µ ∆ 1,h given ∆ 1,h .
Lemma 1. Let $\mu_{\Delta_{1,h}} \sim \mathrm{SP}(\nu, h)$, governed by the Lévy intensity measure $\nu(ds, dw) = \lambda(s)\,ds\,P(dw)$ on $R_+ \times W$. The conditional distribution of $\mu_{\Delta_{1,h}}$, given $\Delta_{1,h}$, equals the one of a CRM on (W, W) with Lévy intensity
$$\Delta_{1,h}\,\lambda(\Delta_{1,h}\,s)\,\mathbb{1}_{(0,1)}(s)\,ds\,P(dw). \qquad (24)$$

Proof. Recall the construction of a SP prior, as detailed in Section 2.2. It starts from an underlying CRM $\mu = \sum_{i\geq 1}\tau_i\delta_{W_i}$ with intensity ν on $R_+ \times W$. Moreover, having denoted by $\Delta_1 > \Delta_2 > \cdots$ the decreasingly ordered jumps $\tau_i$'s of µ, one considers the measure obtained by rescaling µ by its largest jump, namely
$$\mu_{\Delta_1} := \sum_{i\geq 2}\frac{\Delta_i}{\Delta_1}\,\delta_{W_i},$$
whose jumps take values in (0, 1); the SP process is then defined by a change of measure of the largest jump $\Delta_1$, replaced with the distribution of $\Delta_{1,h}$. As a consequence, it is sufficient to prove that $\mu_{\Delta_1} \mid \Delta_1$ is a CRM with Lévy intensity $\Delta_1\lambda(\Delta_1 s)\mathbb{1}_{(0,1)}(s)\,ds\,P(dw)$.
In order to prove this, recall that $(\Delta_i)_{i\geq 2} \mid \Delta_1$ are the points of a Poisson process with intensity $\lambda(s)\mathbb{1}_{(0,\Delta_1)}(s)\,ds$, thanks to the representation by Ferguson and Klass [1972]. Therefore, the conditional distribution of $\mu_{\Delta_1}$, given $\Delta_1$, may be found by a simple evaluation of the Laplace functional.
To this end, consider a measurable function $f : W \to R_+$ and compute
$$\mathbb{E}\Big[e^{-\int_W f(w)\,\mu_{\Delta_1}(dw)}\,\Big|\,\Delta_1\Big] = \exp\Big\{-\int_W\int_0^{\Delta_1}\big(1 - e^{-(s/\Delta_1) f(w)}\big)\,\lambda(s)\,ds\,P(dw)\Big\} = \exp\Big\{-\int_W\int_0^{1}\big(1 - e^{-u f(w)}\big)\,\Delta_1\,\lambda(\Delta_1 u)\,du\,P(dw)\Big\},$$
where the last equality follows from the change of variable $u = s/\Delta_1$; this is exactly the Laplace functional of a CRM having Lévy intensity (24).
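The Ferguson–Klass [1972] representation invoked above can be simulated directly for the σ-stable case: the ordered jumps are obtained by inverting the Lévy tail $T(a) = \int_a^{+\infty}\sigma s^{-1-\sigma}ds = a^{-\sigma}$ at the arrival times of a unit-rate Poisson process, giving $\Delta_i = \Gamma_i^{-1/\sigma}$. A minimal sketch (names are illustrative):

```python
import random

def stable_jumps(sigma, n_jumps, seed=0):
    """Ferguson--Klass draws of the n largest jumps of a sigma-stable CRM:
    invert the Levy tail T(a) = a**(-sigma) at unit-rate Poisson arrivals.

    Returns jumps in decreasing order; jumps[0] is the largest jump Delta_1.
    """
    rng = random.Random(seed)
    jumps, gamma_i = [], 0.0
    for _ in range(n_jumps):
        gamma_i += rng.expovariate(1.0)  # cumulative exponential arrivals
        jumps.append(gamma_i ** (-1.0 / sigma))  # invert the tail
    return jumps
```

Since the arrival times are increasing and the exponent $-1/\sigma$ is negative, the returned jumps are automatically decreasing.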
We now provide the reader with a sufficient condition to ensure that each Z n in (23) is almost surely finite, for any n ≥ 1.
Lemma 2. Consider the model in Equation (23). If
$$\mathbb{E}\Big[\int_0^1 s\,\Delta_{1,h}\,\lambda(\Delta_{1,h}\,s)\,ds\Big] < +\infty, \qquad (25)$$
then each $Z_n$ displays almost surely finitely many features, i.e. $\sum_{i\geq 1}A_{n,i} < \infty$ almost surely, for every n ≥ 1.
Proof. For a fixed n ≥ 1, it is sufficient to show that condition (25) entails $\mathbb{E}[\sum_{i\geq 1}A_{n,i}] < +\infty$. The expected value in the previous formula may be computed as follows:
$$\mathbb{E}\Big[\sum_{i\geq 1}A_{n,i}\Big] = \mathbb{E}\big[\mu_{\Delta_{1,h}}(W)\big] = \mathbb{E}\Big[\int_0^1 s\,\Delta_{1,h}\,\lambda(\Delta_{1,h}\,s)\,ds\Big],$$
where we have applied the Campbell theorem [Kingman, 1992] and Lemma 1 to evaluate the total mass $\mu_{\Delta_{1,h}}(W)$ of $\mu_{\Delta_{1,h}}$. As a consequence, condition (25) is sufficient for the finiteness of the Bernoulli process $Z_n$.

B.2 Posterior analysis
We start with the marginal distribution of the observations $Z_{1:N}$ induced by the model. Our derivation closely follows the proof in James [2017]. The marginal distribution is the counterpart of the "exchangeable feature probability function" (EFPF) for the Indian Buffet Process (IBP; see, e.g., Broderick et al. [2013]).

Proof. From the result shown in Lemma 1, we know that, conditionally on a known value of $\Delta_{1,h} = a$, the random measure $\mu_{\Delta_{1,h}}$ is completely random. Therefore, we can exploit the result in James [2017, Proposition 3.1] to characterize the marginal distribution of the feature counts $m_{N,1}, \ldots, m_{N,K_N}$. This is given, up to a combinatorial factor, by
$$p(m_{N,1}, \ldots, m_{N,K_N} \mid a)\ \propto\ e^{-\sum_{n=1}^N\varphi_n(a)}\,\prod_{i=1}^{K_N}\int_0^1 s^{m_{N,i}}(1-s)^{N-m_{N,i}}\,a\,\lambda(as)\,ds, \qquad (26)$$
with $\varphi_n(a) = \int_0^1 s(1-s)^{n-1}\,a\,\lambda(as)\,ds$. Moreover, the posterior distribution of the random measure $\mu_{\Delta_{1,h}}$, conditionally given $Z_{1:N}$ and $\Delta_{1,h}$, equals the distribution of
$$\mu_N^* + \sum_{i=1}^{K_N}J_i\,\delta_{W_i^*}, \qquad (28)$$
where: i) $\mu_N^*$ is a CRM with Lévy intensity $(1-s)^N\,\Delta_{1,h}\,\lambda(\Delta_{1,h}s)\,\mathbb{1}_{(0,1)}(s)\,ds\,P(dw)$; ii) $J_{1:K_N}$ are $K_N$ independent random jumps, independent of $\mu_N^*$, with density on [0, 1] proportional to $s^{m_{N,i}}(1-s)^{N-m_{N,i}}\lambda(\Delta_{1,h}s)$.

Proof. Again, leveraging the result shown in Lemma 1, we know that, conditionally on a known value of $\Delta_{1,h}$, the measure $\mu_{\Delta_{1,h}}$ is completely random. Therefore, we can simply apply James [2017, Theorem 3.1] to obtain the posterior distribution of $\mu_{\Delta_{1,h}} \mid (\Delta_{1,h}, Z_{1:N})$ as described in Equation (28).
Finally, the posterior distribution of the largest jump ∆ 1,h conditionally on the observations Z 1:N derived in Equation (27) follows by direct application of Bayes' theorem, recognizing that f ∆ 1,h is the prior distribution for ∆ 1,h , and the distribution in (26) as the likelihood of the observations Z 1:N |∆ 1,h .
Last, we prove the predictive characterization provided in Proposition 2, which has a pivotal role in our analysis, as it is the conceptual starting point for studying the predictive behavior of the model, and it again follows from [James, 2017].
Proof of Proposition 2. We consider $\zeta \overset{d}{=} \mu_{\Delta_{1,h}}$, thus we are dealing with the model (23). The posterior distribution of $\Delta_{1,h}$ in (8) follows from (27), by the argument used in Proposition 6. In order to prove the characterization in Equation (9), we use once again the fact that, conditionally on a known value of $\Delta_{1,h}$, $\mu_{\Delta_{1,h}}$ is a completely random measure (see Lemma 1). Thus, we can exploit the results in [James, 2017] to characterize the predictive distribution of $Z_{N+1}$ given the sample $Z_{1:N}$ and the jump $\Delta_{1,h}$; more specifically, the form of the predictive distribution in (9) follows from the predictive characterization of James [2017], conditionally on $\Delta_{1,h}$.

C Posterior analysis for SB-SP priors: proofs and details
Here we provide details and proofs of the results in Section 3.1, i.e. a full Bayesian analysis for the SB-SP prior. More specifically we prove Theorem 1, then we move to characterize the posterior distribution of ∆ 1,h c,β , marginal, predictive and posterior distributions of the SB-SP model.

C.1 Proof of Theorem 1
The posterior density of $\Delta_{1,h}$, given $Z_{1:N}$, is proportional to
$$f_{\Delta_{1,h}}(a)\,\prod_{n=1}^N e^{-\varphi_n(a)}\,\prod_{i=1}^{K_N}\int_0^1 s^{m_{N,i}}(1-s)^{N-m_{N,i}}\,a\,\lambda(as)\,ds,$$
where we used the notation $\varphi_n(a) = \int_0^1 s(1-s)^{n-1}\,a\,\lambda(as)\,ds$. Hence, there exists a normalizing factor $c(m_{N,1}, \ldots, m_{N,k}, N, K_N)$, depending on the sample size N, the number of distinct features $K_N$ and the frequency counts, which makes the expression above a proper density; call the resulting identity (31). If the posterior density $g_{\Delta_{1,h}\mid Z_{1:N}}(a)$ does not depend on $m_{N,1}, \ldots, m_{N,K_N}$, then the function obtained by collecting the terms involving the frequency counts depends only on $K_N$, N and a, but not on the frequency counts themselves. Therefore, (31) boils down to (32), and, as a consequence, the function on the right hand side of (32) does not depend on $a \in R_+$. Note that, since $f_{\Delta_{1,h}}$ and λ are functions of class $C^1(R_+)$, i.e., differentiable with continuous derivative, w is also of class $C^1(R_+)$ with respect to the variable a. Thus, we can take the derivative of (34) and set it equal to 0, obtaining an equation in which a constant C (with respect to a) appears. If one takes the derivative of the resulting equation twice with respect to a, one obtains $\lambda(a)(1 - NC) = a\,\lambda'(a)\,C$, which is an ordinary differential equation in λ that can be solved by separation of variables. In particular, we get
$$\lambda(a) = \alpha\,a^{(1-NC)/C}. \qquad (35)$$
The exponent of a in (35) should satisfy $\int_0^{+\infty}\min\{1, a\}\lambda(a)\,da < +\infty$, from which it is easy to realize that $-2 < (1-NC)/C < -1$; hence, writing $(1-NC)/C = -1-\sigma$,
$$\lambda(a) = \alpha\,a^{-1-\sigma},$$
where α > 0 and σ ∈ (0, 1). The reverse implication of the theorem is trivially true, hence the proof is completed.
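The separation-of-variables step above can be sanity-checked numerically: with $\lambda(a) = \alpha a^{-1-\sigma}$, the ODE $\lambda(a)(1-NC) = a\lambda'(a)C$ holds for $C = 1/(N-1-\sigma)$, one consistent choice of the constant (this particular value of C is our own back-substitution, not stated in the text).

```python
def check_ode(alpha=2.0, sigma=0.3, N=5, a=1.7, eps=1e-6):
    """Numerically verify that lam(a) = alpha * a**(-1 - sigma) solves
    lam(a) * (1 - N*C) = a * lam'(a) * C with C = 1/(N - 1 - sigma).

    Returns the absolute residual of the identity (should be ~0);
    the derivative is approximated by a central difference.
    """
    lam = lambda x: alpha * x ** (-1.0 - sigma)
    dlam = (lam(a + eps) - lam(a - eps)) / (2 * eps)  # central difference
    C = 1.0 / (N - 1 - sigma)
    return abs(lam(a) * (1 - N * C) - a * dlam * C)
```

The residual stays at numerical-noise level across parameter choices, consistent with the closed-form solution $\lambda(a) \propto a^{(1-NC)/C}$.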

C.2 Detailed derivation of the distribution of ∆ 1,h c,β
We first derive explicitly the distribution of the largest jump given in Equation (6). This follows from direct application of the law of the largest jump, when the Lévy measure is $\lambda_\sigma(s)\,ds = \sigma s^{-\sigma-1}\mathbb{1}_{R_+}(s)\,ds$.
The law of the largest jump is $F_{\Delta_1}(a) = \exp\{-\int_a^{+\infty}\lambda_\sigma(s)\,ds\} = \exp\{-a^{-\sigma}\}$ for a > 0. Having denoted by $f_{\Delta_1}$ the density function of $F_{\Delta_1}$, we get
$$f_{\Delta_1}(a) = \sigma\,a^{-1-\sigma}\,e^{-a^{-\sigma}}, \qquad a > 0.$$
From direct inspection, we recognize that this is the density function of $\Delta_1 = T^{-1/\sigma}$, where T is a Gamma random variable with parameters (1, 1). The mixing measure is then obtained by tilting the density $f_{\Delta_1}$ as follows:
$$f_{\Delta_{1,h_{c,\beta}}}(a)\ \propto\ a^{-\sigma c}\,e^{-\beta a^{-\sigma}}\,f_{\Delta_1}(a).$$
By integration, we get the normalizing constant:
$$\int_0^{+\infty}\sigma\,a^{-\sigma(c+1)-1}\,e^{-(1+\beta)a^{-\sigma}}\,da = \frac{\Gamma(c+1)}{(1+\beta)^{c+1}},$$
so that $\Delta_{1,h_{c,\beta}}^{-\sigma}$ is a Gamma random variable with parameters $(c+1, 1+\beta)$.
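The identification $\Delta_1 = T^{-1/\sigma}$ with T ~ Gamma(1, 1) (i.e. a unit-rate exponential) can be checked by Monte Carlo against the CDF $F_{\Delta_1}(a) = \exp\{-a^{-\sigma}\}$; a minimal sketch:

```python
import random
from math import exp

def largest_jump_cdf_check(sigma=0.5, a=1.2, n=200_000, seed=1):
    """Monte Carlo check that Delta_1 = T**(-1/sigma), with T ~ Gamma(1, 1)
    (i.e. Exponential(1)), has CDF F(a) = exp(-a**(-sigma)).

    Returns the absolute gap between the empirical and the exact CDF at a.
    """
    rng = random.Random(seed)
    draws = (rng.expovariate(1.0) ** (-1.0 / sigma) for _ in range(n))
    empirical = sum(d <= a for d in draws) / n
    return abs(empirical - exp(-a ** (-sigma)))
```

With 200,000 draws the gap is on the order of the Monte Carlo error, a few parts in a thousand.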

C.3 Posterior distribution of SB-SP priors
Here we characterize the posterior distribution of SB-SP priors: the result is not included in the paper, but we think it is useful to have a full picture on SB-SP priors from a Bayesian viewpoint.
Proposition 7. For N ≥ 1, let $Z_{1:N}$ be a random sample modeled as the BNP-Bernoulli model (1), with $\zeta = \mu_{\Delta_{1,h_{c,\beta}}}$. Suppose that $Z_{1:N}$ displays $K_N = k$ distinct features $W_1^*, \ldots, W_k^*$, with feature $W_i^*$ appearing exactly $M_{N,i} = m_i$ times in the sample. Then the conditional distribution of $\Delta_{1,h_{c,\beta}}$, given $Z_{1:N}$, has a density function of the form
$$g_{\Delta_{1,h_{c,\beta}}\mid Z_{1:N}}(a) = \frac{\big(1+\beta+\gamma_0^{(N)}\big)^{k+c+1}}{\Gamma(k+c+1)}\,\sigma\,a^{-\sigma(k+c+1)-1}\,e^{-(1+\beta+\gamma_0^{(N)})a^{-\sigma}}, \qquad (37)$$
where $\gamma_0^{(n)} = \sigma\sum_{1\leq i\leq n}B(1-\sigma, i)$, with B(·, ·) denoting the (standard) Beta function. Moreover, the conditional distribution of ζ, given $(\Delta_{1,h_{c,\beta}}, Z_{1:N})$, coincides with the distribution of
$$\mu_N^* + \sum_{i=1}^{k}J_i\,\delta_{W_i^*}, \qquad (38)$$
where: i) $\mu_N^*$ is a CRM with Lévy intensity $\sigma\,\Delta_{1,h_{c,\beta}}^{-\sigma}(1-s)^N s^{-1-\sigma}\mathbb{1}_{(0,1)}(s)\,ds\,P(dw)$; ii) $J_1, \ldots, J_k$ are independent random jumps, independent of $\mu_N^*$, with $J_i \sim \mathrm{Beta}(m_i - \sigma, N - m_i + 1)$, where Beta denotes the beta distribution.
Proof. We apply Proposition 6, which describes the general posterior distribution of a SP process. We first compute the posterior distribution (8) of the largest jump conditionally on observations Z 1:N .
To do so we specialize (27) to our case, and we first compute the exponent $\varphi_n(a)$. In our case the Lévy density equals $\lambda_\sigma(s) = \sigma s^{-\sigma-1}$ and the mixing density of $\Delta_{1,h_{c,\beta}}$ is provided in Equation (36); thus the exponent $\varphi_n$ takes the form
$$\varphi_n(a) = \int_0^1 s(1-s)^{n-1}\,a\,\lambda_\sigma(as)\,ds = \sigma\,a^{-\sigma}\,B(1-\sigma, n).$$
Recalling the shorthand notation $\gamma_0^{(N)} = \sigma\sum_{n=1}^N B(1-\sigma, n)$, the posterior density of $\Delta_{1,h_{c,\beta}}$, given $Z_{1:N}$, is proportional to
$$f_{\Delta_{1,h_{c,\beta}}}(a)\,e^{-\gamma_0^{(N)}a^{-\sigma}}\,a^{-\sigma K_N},$$
where $f_{\Delta_{1,h_{c,\beta}}}$ has been specified in (36). As a consequence, we get the posterior density in (37). The characterization of the posterior distribution in (38) is an easy consequence of Proposition 6, by a specialization of this result with the choice $\lambda(s) = \lambda_\sigma(s) = \sigma s^{-\sigma-1}$ for the underlying Lévy intensity.
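The constant $\gamma_0^{(N)} = \sigma\sum_{n=1}^N B(1-\sigma, n)$ used throughout can be computed directly via the Gamma function; a minimal helper (names are illustrative):

```python
from math import gamma

def beta_fn(x, y):
    """Standard Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return gamma(x) * gamma(y) / gamma(x + y)

def gamma0(N, sigma):
    """gamma_0^(N) = sigma * sum_{n=1}^{N} B(1 - sigma, n), the constant
    appearing in the posterior of the largest jump."""
    return sigma * sum(beta_fn(1.0 - sigma, n) for n in range(1, N + 1))
```

As a check, $B(1-\sigma, 1) = 1/(1-\sigma)$, so $\gamma_0^{(1)} = \sigma/(1-\sigma)$; for σ = 1/2 this equals 1.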

C.4 Proof of Proposition 3
The predictive characterization is a simple consequence of the general characterization in Proposition 2 with the SB-SP specifications λ(s) = λ σ (s) = σs −σ−1 .

C.5 Proof of Proposition 4
We apply Proposition 5 to obtain the marginal distribution under the SB-SP prior. Conditionally on $\Delta_{1,h_{c,\beta}} = a$, using the form of $\varphi_n(a)$ derived in (41), the marginal distribution is given by
$$p^{(N)}(a)\ \propto\ e^{-\gamma_0^{(N)}a^{-\sigma}}\,\prod_{i=1}^{K_N}\int_0^1 \sigma\,a^{-\sigma}\,s^{m_{N,i}-\sigma-1}(1-s)^{N-m_{N,i}}\,ds,$$
which may be written in terms of the Beta function as follows:
$$p^{(N)}(a)\ \propto\ e^{-\gamma_0^{(N)}a^{-\sigma}}\,\big(\sigma\,a^{-\sigma}\big)^{K_N}\,\prod_{i=1}^{K_N}B(m_{N,i}-\sigma,\,N-m_{N,i}+1).$$
Last, we obtain the marginal distribution in Equation (10) by randomizing with respect to the mixing distribution of the largest jump given in Equation (36). We need to compute
$$\int_0^{+\infty}a^{-\sigma K_N}\,e^{-\gamma_0^{(N)}a^{-\sigma}}\,f_{\Delta_{1,h_{c,\beta}}}(a)\,da = \frac{(1+\beta)^{c+1}}{\Gamma(c+1)}\cdot\frac{\Gamma(K_N+c+1)}{\big(1+\beta+\gamma_0^{(N)}\big)^{K_N+c+1}},$$
and the thesis now follows.

D Estimation of the unseen features via SB-SP priors: proofs
Here we detail the proofs of Section 3.2, which is devoted to the unseen-features problem under the SB-SP prior.

D.1 Proof of Theorem 2
We first focus on the proof of (13), i.e. the posterior distribution of $U_N^{(M)}$, where we have applied the tower property of the conditional expectation. We now observe that, conditionally on $Z_{1:N}$ and $\Delta_{1,h_{c,\beta}}$, the random variable $U_N^{(M)}$ may be represented via the representation given in Proposition 3. Here, independently across i, $A_{N+m,i}$ is a Bernoulli random variable with parameter $\rho_i$, conditionally on the random measure $\mu_{\Delta_{1,h_{c,\beta}}} = \sum_{i\geq 1}\rho_i\delta_{W_i}$ with Lévy intensity $\sigma\,\Delta_{1,h_{c,\beta}}^{-\sigma}(1-s)^N s^{-1-\sigma}\mathbb{1}_{(0,1)}(s)\,ds\,P(dw)$. We now focus on the evaluation of the expected value in Equation (42), where we apply the independence of the Bernoulli random variables $A_{N+m,i}$'s, conditionally on $\mu_{\Delta_{1,h_{c,\beta}}}$. We then recall that $\mu_{\Delta_{1,h_{c,\beta}}}$ is a CRM with a known Lévy measure and that the $A_{N+m,i}$'s are Bernoulli with parameter $\rho_i$. We replace the resulting expression in Equation (42), and the result follows by integrating with respect to the posterior distribution of $\Delta_{1,h_{c,\beta}}^{-\sigma}$, given in Equation (37). It is now easy to see that, conditionally on $Z_1, \ldots, Z_N, \Delta_{1,h_{c,\beta}}$, the random variable $U_N^{(M)}$ admits the representation used in (44). We then evaluate the expected value appearing in (44): since $\mu_{\Delta_{1,h_{c,\beta}}}$ is a CRM with a known Lévy measure, the expected value can be computed in closed form, where we used the notation introduced in the statement of the theorem, i.e. $\rho_N^{(M,r)} = \sigma\binom{M}{r}B(r-\sigma, M+N-r+1)$. Then the probability generating function in Equation (44) is obtained by integrating with respect to the posterior distribution of the largest jump provided in (37).
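The quantity $\rho_N^{(M,r)} = \sigma\binom{M}{r}B(r-\sigma, M+N-r+1)$ appearing in the statement of Theorem 2 is easy to compute numerically. The check below uses the identity $\sum_{r=1}^{M}\rho_N^{(M,r)} = \gamma_0^{(N+M)} - \gamma_0^{(N)}$, which follows from the integral representation $\gamma_0^{(n)} = \sigma\int_0^1 s^{-\sigma-1}(1-(1-s)^n)\,ds$ together with the binomial theorem (the identity itself is our own consistency check, not stated in the text).

```python
from math import comb, gamma

def rho(N, M, r, sigma):
    """rho_N^(M, r) = sigma * C(M, r) * B(r - sigma, M + N - r + 1),
    the quantity defined in the statement of Theorem 2."""
    b = gamma(r - sigma) * gamma(M + N - r + 1) / gamma(M + N + 1 - sigma)
    return sigma * comb(M, r) * b
```

Summing over r recovers the increment of the $\gamma_0$ sequence, which makes the Poisson/negative-Binomial bookkeeping of the theorem easy to verify numerically.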

D.2 Proof of Theorem 3
In order to prove this result, we first exploit the Lévy continuity theorem to obtain a convergence in distribution, and later strengthen this result to show that the convergence also holds in the almost-sure sense. For the convergence in distribution, thanks to Theorem 2, we consider the characteristic function of $U_N^{(M)}/M^\sigma$, where t ∈ R and $K_N$ is the number of distinct features in $Z_{1:N}$. We can exploit Masoero et al. [2021, Lemma 1] to determine the asymptotic expansion of $\gamma_0^{(M+N)}$ as M → +∞, having used the big-O notation. Thus, using the asymptotic expansion of the exponential function, one obtains an expression which converges to the characteristic function of a gamma random variable with parameters $(K_N + c + 1, (1+\beta+\gamma_0^{(N)})/\Gamma(1-\sigma))$. In order to prove convergence in the almost sure sense, we exploit the corresponding results proved for the stable beta-Bernoulli process in Masoero et al. [2021, Theorem 2] for the statistic $U_N^{(M)}$. We first notice that, if we condition on the value of the largest jump $\Delta_{1,h_{c,\beta}}$, then the SB-SP-Bernoulli process is driven by a completely random measure whose asymptotic behavior is analogous to that of the stable beta-Bernoulli process. Thus, specializing the almost sure convergence results given in Masoero et al. [2021, Theorem 2], a posteriori, $U_N^{(M)}/M^\sigma$ converges almost surely to the random variable $\Delta_{1,h_{c,\beta}}^{-\sigma}\Gamma(1-\sigma)$, with respect to the conditional probability P given $Z_{1:N}$. The probability limit for the model in which the largest jump is random is obtained by observing that the posterior distribution of $\Delta_{1,h_{c,\beta}}^{-\sigma}\Gamma(1-\sigma)$ is a Gamma with the parameters above; thus the a.s. convergence in (17) follows. For the second statement, consider the corresponding characteristic function, where t ∈ R. Thanks to the well-known asymptotic relation for the ratio of gamma functions, $\Gamma(x+a)/\Gamma(x+b) \sim x^{a-b}$ as x → +∞, the characteristic function under study boils down to an expression which converges, as M → +∞, to the characteristic function of a gamma random variable with parameters as in the thesis.
The almost sure statement of (18) goes along similar lines, indeed one can exploit the convergence theorems proved by Masoero et al. [2021] to state that Exactly as before, one can conclude that where the posterior distribution of the limiting random variable is a gamma with the same parameters as in the statement of the theorem (Equation (18)).

E Multivariate extension
In the present section we discuss the multivariate version of the Bernoulli process, which we call the Bernoulli process with a condiment or the simple multinomial process, using the terminology of James [2017]. We first review the model of James [2017] and the associated prior, called the stable-Beta-Dirichlet process, then we move to introduce a new scaled prior for the model. In both cases, we determine closed-form results for the prediction of new features with condiments. These models are extremely important in genomics to account for the presence of variants at certain genomic loci with a specific characteristic (or condiment). See, e.g., Lee et al. [2016].

E.1 Bernoulli process with a condiment
The IBP process with a condiment has been introduced by James [2017]. The Bernoulli process with a condiment assumes that each observation Z is a multivariate $\{0,1\}^q$-valued stochastic process $Z = \sum_{i\geq 1}A_i\delta_{w_i}$, where $(w_i)_{i\geq 1}$ are features in W and $(A_i)_{i\geq 1}$ are independent simple multinomial random variables with parameter vector $p_i = (p_{i,1}, \ldots, p_{i,q})$ as $i = 1, 2, \ldots$. Here $|p_i|$ represents the probability that an individual displays feature $w_i$, while $p_{i,j}$ is the probability that the individual exhibits feature $w_i$ with condiment $j \in \{1, \ldots, q\}$. Thus, Z is termed a simple multinomial process with parameter $\zeta = \sum_{i\geq 1}p_i\delta_{w_i}$, and it is denoted by MP(ζ). In order to carry out BNP inference, we need to specify a distribution for the discrete measure ζ. Thus, we obtain a multivariate version of the model (1):
$$Z_1, \ldots, Z_N \mid \zeta\ \overset{iid}{\sim}\ \mathrm{MP}(\zeta), \qquad \zeta \sim \mathscr{Z}, \qquad (46)$$
where $\mathscr{Z}$ denotes the distribution of the discrete random measure ζ.

E.2 Priors based on multivariate CRMs
In this section we consider a class of priors $\mathscr{Z}$ in (46) defined by James [2017] and based on a multivariate extension of CRMs (see Daley and Vere-Jones [2008]). In particular, consider a multivariate CRM on W. As for the simple CRMs of Section A, the multivariate extension of a CRM is characterized by its Lévy–Khintchine representation:
$$\mathbb{E}\Big[e^{-\sum_{j=1}^q\int_W f_j(w)\,\mu_j(dw)}\Big] = \exp\Big\{-\int_{R_+^q\times W}\big(1 - e^{-s_1 f_1(w)-\cdots-s_q f_q(w)}\big)\,\lambda^{(q)}(s_1, \ldots, s_q)\,ds_1\cdots ds_q\,P(dw)\Big\}$$
for arbitrary measurable functions $f_1, \ldots, f_q : W \to R_+$, where P is a probability measure on W.
The multivariate Lévy intensity $\lambda^{(q)}$ is assumed to satisfy the integral condition
$$\int_{R_+^q}\min\{1, \|s\|\}\,\lambda^{(q)}(s_1, \ldots, s_q)\,ds_1\cdots ds_q < +\infty,$$
where $\|s\|$ is the Euclidean norm of the vector s. When $\lambda^{(q)}(s_1, \ldots, s_q)$ concentrates on $S_q$, the law of µ may be employed as a distribution for the parameter ζ of the simple multinomial process in (46).
Note that in Theorem 4, $M_{N,i,j}$ is the random number of times feature $W_i^*$ has been observed out of $Z_{1:N}$ with condiment $j \in \{1, \ldots, q\}$, while $m_i = \sum_{j=1}^q m_{i,j}$ is the number of times feature $W_i^*$ has been observed out of the sample.
For any N ≥ 1, let $Z_{1:N}$ be an observable sample modeled as the multinomial model in (46), with ζ ∼ mSBD(α, κ + α; γ; ϑ). Moreover, under the same model, for M ≥ 1 let $Z_{N+1:N+M} = (Z_{N+1}, \ldots, Z_{N+M})$ be an additional and unobserved sample. We now define the number of hitherto unobserved features with condiment $\ell \in \{1, \ldots, q\}$ that will be recorded out of $Z_{N+1:N+M}$ as $U_{N,\ell}^{(M)}$. Posterior inference for such a quantity could have potential interest in genomics to account for the presence of a variant with certain biological characteristics (condiment). The next theorem provides the posterior distribution of $U_{N,\ell}^{(M)}$.

Theorem 5. For any N ≥ 1, let $Z_{1:N}$ be a random sample modeled as the BNP simple multinomial process model (46), with ζ ∼ mSBD(α, κ + α; γ; ϑ). Suppose that $Z_{1:N}$ displays $K_N = k$ distinct features.

Proof. Fix t in a neighborhood of the origin. Here, independently across i, $A_{N+m,i,\ell}$ is a Bernoulli random variable with parameter $\rho_{i,\ell}$, conditionally on the random measure $\mu = \sum_{i\geq 1}\rho_i\delta_{W_i}$ with Lévy intensity $\lambda^{(q)}(s)\,ds_1\cdots ds_q\,P(dw)$ as specified in (53). Thus, the expected value in (52) boils down to a product of conditional expectations, where we used the fact that each $A_{m+N,i,\ell}$ is a Bernoulli random variable with parameter $\rho_{i,\ell}$, conditionally on the random measure µ, and that these random variables are conditionally independent.
We now exploit the Laplace functional of the multivariate CRM µ, where $\lambda^{(q)}$ has been specified in (53). The integrals over $S_q$ in (54) may be easily evaluated (see, e.g., [Gradshteyn and Ryzhik, 2007, Formula 4.635.2]). By substituting the resulting expression in (54), we obtain exactly the probability generating function of a Poisson random variable. As a consequence of Theorem 5, one can define a BNP estimator of $U_{N,\ell}^{(M)}$ with respect to a squared loss function. We point out that, for computational convenience, one may rewrite this estimator as an expected value taken with respect to two independent random variables with beta distributions. The equality (57) may be easily proved by direct computation.

E.3 Scaled stable-Beta-Dirichlet prior for multinomial processes
From Theorems 4-5, it is apparent that, under the stable-Beta-Dirichlet process, the conditional distribution of a statistic involving hitherto unobserved features depends on the initial sample Z_{1:N} only through the sample size N, and not on other sample statistics. This behavior resembles what happens for the Bernoulli process model described in the main paper when the prior ζ in (1) is a CRM. We therefore introduce a multivariate analogue of the stable-Beta scaled prior, termed the scaled stable-Beta-Dirichlet process, with the goal of enriching the predictive structure. We introduce a discrete random measure depending on the random jump ∆_{1,h_{c,β}}, which has been defined in the main paper as a polynomial-exponential tilting of the density function (6), with density as shown in (36). The scaled stable-Beta-Dirichlet random measure µ_{∆_{1,h_{c,β}}} is an almost surely discrete random measure consisting of q components µ_{∆_{1,h_{c,β}},j} = Σ_{i≥1} ρ_{i,j} δ_{W_i} for j = 1, . . . , q.
Conditionally on the jump ∆_{1,h_{c,β}}, the multivariate random measure µ_{∆_{1,h_{c,β}}} is completely random with Lévy intensity λ^{(q),∆_{1,h_{c,β}}}(s) ds_1 · · · ds_q P(dw), where 0 < σ < 1 and γ_j > 0 for any j = 1, . . . , q. We write µ_{∆_{1,h_{c,β}}} ∼ S-mSBD(σ, γ; h_{c,β}). A remarkable property of this model is that Σ_{j=1}^{q} µ_{∆_{1,h_{c,β}},j} is distributed as the stable-Beta scaled process prior, i.e., |µ_{∆_{1,h_{c,β}}}| ∼ SB-SP(σ, c, β). This property may be easily proved by means of Laplace functionals. Note that one could introduce an additional mass parameter in the model, but this is irrelevant for carrying out posterior inference in the stable case.

E.3.1 Posterior Analysis
We now provide posterior, predictive, and marginal characterizations for the multivariate model (46) under the scaled stable-Beta-Dirichlet process prior specification for ζ. The results presented here may be proved by exploiting [James, 2017, Section 5], conditionally on ∆_{1,h_{c,β}}, and then by marginalizing over the mixing distribution (58). We omit the details.

E.3.2 Estimation of the unseen features with a condiment
For any N ≥ 1, let Z_{1:N} be an observable sample modeled as the simple multinomial model in (50), and for M ≥ 1 let U_{N,ℓ}^{(M)} count the number of hitherto unobserved features with condiment ℓ ∈ {1, . . . , q} that will be recorded out of the additional sample.
Then, the posterior distribution of U_{N,ℓ}^{(M)} may be derived as follows. Fix t in a neighborhood of the origin; by an application of the tower property, the problem reduces to the evaluation of the inner expected value in (67). From Theorem 7, the A_{N+m,i,ℓ}'s are independent random variables for m = 1, . . . , M, and each A_{N+m,i,ℓ} is a Bernoulli random variable with parameter ρ_{i,ℓ}, conditionally on the random measure µ_{∆_{1,h_{c,β}}} = Σ_{i≥1} ρ_i δ_{W_i} with Lévy intensity (62). Proceeding along the same lines as in the proof of Theorem 5, the conditional expected value under study may be written by applying (55). The integral over S^q appearing in (68) may be evaluated resorting to [Gradshteyn and Ryzhik, 2007, Formula 4.635.2]. By substituting the resulting expression in (68), one obtains the probability generating function in (67). The conclusion follows by a marginalization with respect to the posterior distribution of ∆_{1,h_{c,β}}^{−σ}, which is a gamma random variable (see (60)); this yields the probability generating function of a negative binomial distribution, as in the statement.
As a consequence of Theorem 9, one may define the BNP estimator Û_{N,ℓ}^{(M)} of U_{N,ℓ}^{(M)} under a squared loss function. For computational purposes, we finally note that the parameter ψ_ℓ may be written as an expected value with respect to two independent random variables X and Y having beta distributions X ∼ Beta(γ_ℓ, |γ| − γ_ℓ) and Y ∼ Beta(1 − σ, N + 1).
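The representation of ψ_ℓ as an expectation over the two beta laws above lends itself to a simple Monte Carlo approximation. The sketch below uses hypothetical values for (γ_ℓ, |γ|, σ, N) and a placeholder integrand `g`, since the exact functional form inside the expectation is given in the paper's display equations and not reproduced in this excerpt:

```python
import numpy as np

def beta_expectation(g, a1, b1, a2, b2, n_draws=100_000, seed=0):
    """Monte Carlo estimate of E[g(X, Y)] for independent
    X ~ Beta(a1, b1) and Y ~ Beta(a2, b2)."""
    rng = np.random.default_rng(seed)
    x = rng.beta(a1, b1, size=n_draws)
    y = rng.beta(a2, b2, size=n_draws)
    return g(x, y).mean()

# Beta laws from the text: X ~ Beta(gamma_l, |gamma| - gamma_l),
# Y ~ Beta(1 - sigma, N + 1).  The integrand x*y is a placeholder,
# NOT the paper's exact expression for psi_l.
gamma_l, gamma_tot, sigma, N = 1.0, 3.0, 0.5, 100  # hypothetical values
psi_hat = beta_expectation(lambda x, y: x * y,
                           gamma_l, gamma_tot - gamma_l,
                           1.0 - sigma, N + 1)
```

For the product integrand the estimate can be cross-checked against the closed form E[X]E[Y] = (γ_ℓ/|γ|) · (1 − σ)/(N + 2 − σ), which makes the routine easy to validate before plugging in the paper's actual expression.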

F Synthetic experiments from the model
We now analyze empirically the properties of the SB-SP-Bernoulli model used in Section 3; we use the acronym SSB for brevity in the captions. That is, we consider the hierarchical model detailed in (1), with µ ∼ SB-SP(σ, c, β). The predictive characterization detailed in Proposition 3, together with Equation (13), provides an algorithm to sample N observations from the model: given β > 0, σ ∈ (0, 1), c > 0,
• at every step n = 1, . . . , N, conditionally on the previous n − 1 samples Z_{1:n−1} showing K_{n−1} distinct features, each feature k = 1, . . . , K_{n−1} with frequency m_k: sample a random number of newly observed features, and, for each previously observed feature i = 1, . . . , K_{n−1}, sample whether it is displayed again. In particular, for the first step, K_0 = 0.
[Figure: number of distinct features (y-axis) as a function of the sample size N (x-axis). We repeat the same experiments as in Figure 6, but now fix β = 1, σ = 0.2, and vary c across subplots.]
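The sequential sampling scheme above can be sketched in a few lines. Since the exact predictive rates from Proposition 3 and Equation (13) are not reproduced in this appendix, `new_rate` and `old_prob` below are hypothetical placeholders chosen only to mimic the right qualitative behavior (power-law growth of new features, frequency-driven re-display of old ones):

```python
import numpy as np

def new_rate(n, beta=1.0, sigma=0.2, c=1.0):
    # placeholder Poisson rate for brand-new features at step n:
    # power-law growth in n, as suggested by Theorem 3
    return beta * (n + c) ** (sigma - 1.0)

def old_prob(m_k, n, sigma=0.2, c=1.0):
    # placeholder inclusion probability for a feature seen m_k times
    return (m_k - sigma) / (n + c)

def sample_features(N, seed=0):
    rng = np.random.default_rng(seed)
    counts = []     # counts[k] = frequency m_k of feature k so far
    history = []    # number of distinct features after each step
    for n in range(1, N + 1):
        # previously observed features are re-displayed with prob old_prob
        for k in range(len(counts)):
            if rng.random() < old_prob(counts[k], n):
                counts[k] += 1
        # a Poisson number of brand-new features appears
        for _ in range(rng.poisson(new_rate(n))):
            counts.append(1)
        history.append(len(counts))
    return counts, history
```

Replacing the two placeholder functions with the paper's predictive formulas yields the sampler actually used in the experiments.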

F.1.1 The role of σ
We start by analyzing the role of σ in Figure 6. As suggested by the asymptotic behavior analyzed in Theorem 3, σ directly controls the asymptotic rate of growth of the number of distinct features: as σ increases, the expected number of variants increases, approaching a linear behavior as σ → 1. This behavior is reminiscent of the tail parameter of the stable beta-Bernoulli process [Teh and Gorur, 2009, Broderick et al., 2012].

F.1.2 The role of c
We now move to the analysis of the polynomial tilting parameter c, as suggested by the predictive distribution given in Equation (13).

F.2 Predictive behavior of the number of new features from the posterior

Next, we perform a slightly different exercise from the one described above. We still assume the parameters to be known, and we investigate how the posterior predictive behavior varies as we change the number of training samples N with respect to a total sampling "capacity" L. Intuitively, for a fixed value of this sampling capacity, N + M = L, the expected number of observed features from the model should be independent of the choice of N and M. However, we expect the distribution (e.g., the posterior variance) to concentrate as N increases relative to M. To perform this experiment we proceed as follows: we fix β, c, σ and, for each of 2000 replicates, we let K = U

F.3 Estimation of the parameters
Next, we move to the more interesting scenario in which the parameters are unknown and need to be inferred from the data. The natural way to estimate the unknown parameters is to maximize a likelihood criterion, such as the marginal distribution of the feature counts m_1, . . . , m_K given in Equation (10). We found this method to work well both on real data, as displayed in Section 4, and on synthetic data. We report some results in Figures 11 and 12. In general, and not surprisingly, the precision of our estimates increases with larger sample sizes.
[Figure: posterior predictive number of new features given Z_{1:N} (y-axis) as a function of the sample size N (x-axis). We fix β = 1, c = 20, σ = 0.5, and learn the parameters for different training sizes N ∈ {100, 400, 700, 1000} across subplots, for a total sequencing capacity L = 5000.]
In our synthetic experiments, as expected, the values maximizing the marginal likelihood converge to the underlying true values of the data-generating process as the sample size N → ∞. By visual inspection, we find that the negative marginal likelihood is indeed a convex function in each argument, with a unique, well-defined minimum (see Figures 13 to 15).
When most of the features are very rare (e.g., they appear once or twice in the sample), we found that an alternative empirical Bayes approach, akin to the one adopted in Masoero et al. [2021], worked better, as further discussed in Appendix G.
Figure 13: We draw a synthetic dataset of size N = 10 000 from an SSB with parameters β = 5, σ = 0.1, c = 3. In the left subplot, we plot the value of the negative marginal likelihood (vertical axis) as we vary the value of β (horizontal axis), keeping σ = 0.1 and c = 3 fixed at their true values. We repeat the same procedure, now varying σ and keeping β = 5, c = 3 fixed at their true values, in the central subplot. Last, in the right subplot, we inspect the marginal likelihood as we vary the value of c, keeping β = 5, σ = 0.1 fixed. We then minimize the negative log-likelihood numerically, and report in each subplot with a blue cross the numerical value of the corresponding hyperparameter (horizontal axis) together with the corresponding marginal likelihood value (vertical axis).
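The numerical minimization described above can be organized as in the following sketch. Here `neg_log_marginal` is a hypothetical smooth surrogate used only to illustrate the workflow; the actual objective is the negative log of the marginal distribution (EFPF) in Equation (10). The box constraints encode β > 0, σ ∈ (0, 1), c > 0:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal(params, true=(5.0, 0.1, 3.0)):
    # stand-in surrogate objective with a unique minimum at `true`;
    # replace with the negative EFPF of Equation (10) in practice
    beta, sigma, c = params
    return (beta - true[0]) ** 2 + (sigma - true[1]) ** 2 + (c - true[2]) ** 2

res = minimize(
    neg_log_marginal,
    x0=np.array([1.0, 0.5, 1.0]),
    bounds=[(1e-6, None), (1e-6, 1 - 1e-6), (1e-6, None)],  # beta>0, 0<sigma<1, c>0
    method="L-BFGS-B",
)
beta_hat, sigma_hat, c_hat = res.x
```

With the true EFPF plugged in, the same call recovers the blue crosses reported in Figures 13 to 15.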

G Synthetic experiments from Zipf distributions
To compare the predictive performance of the SB-SP-Bernoulli process proposed in Section 3.1 to existing competing methods, we also consider synthetic data from Zipf-distributed frequencies (see Figure 16). That is, we imagine that there exists a countable number of features in the population and that, for some ξ > 0, feature k is observed independently of any other feature with probability π_k = (k + 1)^{−ξ}. An observation X is then a binary vector in which, conditionally on the frequencies π = (π_1, π_2, . . .), the k-th coordinate is a Bernoulli random variable with parameter π_k. We perform our simulations as follows: we fix a total sequencing capacity of L = 2000 and draw L i.i.d. samples from the model, following the recipe given in Equation (71). For simulation purposes, we only consider the first K = 10^5 features to have non-zero probability, i.e., π_k = 0 for all k > K.
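The Zipf data-generating process just described can be implemented directly; the sketch below draws L i.i.d. binary observations over the first K features, with π_k = (k + 1)^{−ξ}:

```python
import numpy as np

def sample_zipf_features(L, xi, K=10**5, seed=0):
    """Draw L i.i.d. binary observations; feature k (k = 1..K) is present
    independently with probability pi_k = (k + 1)^(-xi); pi_k = 0 for k > K."""
    rng = np.random.default_rng(seed)
    pi = (np.arange(1, K + 1) + 1.0) ** (-xi)
    X = rng.random((L, K)) < pi          # broadcast Bernoulli draws per column
    return X.astype(np.int8), pi
```

Splitting the resulting L rows into N training and M = L − N test observations reproduces the experimental protocol used throughout this section.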
We compare the estimates of our proposed SB-SP-Bernoulli model (Section 3.1) to the stable beta-Bernoulli process (3BP), the linear program of Zou et al. [2016], the first four orders of the Jackknife estimator originally proposed in Burnham and Overton [1978] and recently employed in the genomics context by Gravel [2014], and the Good-Toulmin estimator, recently used in Chakraborty et al. [2019], with the two alternative smoothing choices described in Orlitsky et al. [2016]. Estimates for Bayesian methods are obtained by using the posterior predictive mean for the number of new variants conditionally on the observed sample, with hyperparameters learned by numerically maximizing the marginal distribution (EFPF) of the feature counts, as described in Section 4.1.
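As a point of reference, the Good-Toulmin estimator mentioned above can be written in a few lines in its classical unsmoothed form, with Φ_j denoting the number of features observed in exactly j of the N samples and t = M/N; the smoothed variants of Orlitsky et al. [2016], which control the variance when t > 1, are not reproduced here:

```python
import numpy as np

def good_toulmin(Phi, N, M):
    """Unsmoothed Good-Toulmin estimate of the number of new features in
    M additional samples.  Phi[j-1] = number of features seen exactly j
    times among the N training samples."""
    t = M / N
    j = np.arange(1, len(Phi) + 1)
    return -np.sum(((-t) ** j) * np.asarray(Phi, dtype=float))
```

For example, with Φ = (10, 5) and M = N the estimate is Φ_1 − Φ_2 = 5, illustrating the alternating-sign structure that makes the unsmoothed estimator unstable for large t.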
As expected, we find the nonparametric Bayesian estimators to do particularly well for larger values of the exponent ξ, that is, when most features are exceedingly rare. The SB-SP-Bernoulli and the three-parameter beta-Bernoulli processes performed comparably on these datasets, both in terms of estimation accuracy and uncertainty quantification, as displayed in Figure 17. While all methods improve their performance with larger sample sizes, we find that the BNP estimators (SSB, 3BP) provide relatively more accurate results for smaller sample sizes (e.g., N = 10, N = 50 in Figures 18 and 19). The performance of the BNP methods exceeds that of competing methods for larger values of the exponent (ξ ∈ {1.2, 1.4, 1.6}), while higher-order Jackknife estimators and linear programs tend to do better for smaller values of the exponent (ξ ∈ {0.8, 1}).

H Additional experiments on the gnomAD dataset

H.1 Experimental setup
In order to run our experiments, we use data from the gnomAD (genome aggregation database) discovery project [Karczewski et al., 2020], the largest and most comprehensive publicly available human genome dataset. We follow the same experimental setup adopted in Masoero et al. [2021], which we briefly summarize in this section. The gnomAD dataset contains 125'748 exome sequences (i.e., protein-coding regions of the genome) from 8 main populations. Sample size varies widely across subpopulations: e.g., the "Other" subgroup counts about 3'000 observations, while the "South Asian" subgroup contains almost 16'000 individuals (see Karczewski et al. [2020] for additional details). For privacy reasons, not all individual sequences are accessible. Hence, in order to run our analysis, we generate synthetic data which closely resembles the true data, as follows. For every subpopulation with N individuals and every position j = 1, . . . , K in the exome, we have access to the total number of individuals N_j showing variation at position j. We compute the empirical frequency of variation at site j, θ̂_j := N_j/N, for all j = 1, . . . , K. Our data is then generated by sampling independent Bernoulli random vectors X_1, . . . , X_N, with X_n = [x_{n,1}, . . . , x_{n,K}]. The entries of each vector are independent Bernoulli random variables, x_{n,j} ∼ Bernoulli(θ̂_j).
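The synthetic-data generation just described amounts to sampling Bernoulli vectors with biases given by the empirical frequencies. In the sketch below, `variant_counts` (the per-site counts N_j) is a hypothetical toy input; in the actual experiments it comes from the gnomAD summary data for each subpopulation:

```python
import numpy as np

def sample_population(variant_counts, N, n_samples=None, seed=0):
    """Draw synthetic binary genomes from empirical variant frequencies.
    variant_counts[j] = number of individuals (out of N) showing variation
    at site j; each synthetic entry is x_{n,j} ~ Bernoulli(N_j / N)."""
    rng = np.random.default_rng(seed)
    theta_hat = np.asarray(variant_counts, dtype=float) / N
    n_samples = N if n_samples is None else n_samples
    return (rng.random((n_samples, theta_hat.size)) < theta_hat).astype(np.int8)
```

Retaining the first N rows of the output for training and predicting on the remainder reproduces the train/test split used in the experiments below.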

H.2 Results from the gnomAD data
For each of the eight subpopulations in the data, we performed the following experiment. Let θ̂ = [θ̂_1, . . . , θ̂_{K_max}] ∈ [0, 1]^{K_max} denote the "genetic signature" of the population, with θ̂_k = N_k/N_tot, where N_tot is the total number of individuals in the population and N_k is the number of individuals in the population displaying variant k, 1 ≤ N_k ≤ N_tot. Then, for each population, we generate S = 50 datasets by drawing N_tot i.i.d. binary random vectors of length K_max as described above, with biases given by θ̂. We then retain N ∈ {50, 100} observations of each dataset for training, and try to predict the number of new variants that would be observed if we were to sample M = N_tot − N additional observations.
In a nutshell, the findings on this data are similar to the results obtained on the MSK-IMPACT cancer data. In particular, we find that when the sample size N is small, the proposed SB-SP-Bernoulli model leads to predictions that are often comparable to or more accurate than those of competing methods.
First, we report the accuracy metric v for each subpopulation. Next, we provide boxplots that report the (aggregated) accuracy metric v.

H.3 Additional boxplots
Since in Figure 23 and Figure 24 we aggregate results in which N is consistent across all populations but M differs, we also report boxplots for each subpopulation individually.
[Figure: same as Figure 24, but only for the "Other" subpopulation.]