Generalized Bayes Quantification Learning under Dataset Shift

Abstract Quantification learning is the task of prevalence estimation for a test population using predictions from a classifier trained on a different population. Quantification methods assume that the sensitivities and specificities of the classifier are either perfect or transportable from the training to the test population. These assumptions are inappropriate in the presence of dataset shift, when the misclassification rates in the training population are not representative of those for the test population. Quantification under dataset shift has been addressed only for single-class (categorical) predictions and assuming perfect knowledge of the true labels on a small subset of the test population. We propose generalized Bayes quantification learning (GBQL) that uses the entire compositional predictions from probabilistic classifiers and allows for uncertainty in true class labels for the limited labeled test data. Instead of positing a full model, we use a model-free Bayesian estimating equation approach to compositional data using Kullback–Leibler loss functions based only on a first-moment assumption. The idea will be useful in Bayesian compositional data analysis in general as it is robust to different generating mechanisms for compositional data and allows 0's and 1's in the compositional outputs, thereby including categorical outputs as a special case. We show how our method yields existing quantification approaches as special cases. An extension to an ensemble GBQL that uses predictions from multiple classifiers, yielding inference robust to inclusion of a poor classifier, is discussed. We outline a fast and efficient Gibbs sampler using a rounding and coarsening approximation to the loss functions. We establish posterior consistency, asymptotic normality and valid coverage of interval estimates from GBQL, which to our knowledge are the first theoretical results for a quantification approach in the presence of local labeled data. We also establish a finite-sample posterior concentration rate. Empirical performance of GBQL is demonstrated through simulations and analysis of real data with evident dataset shift. Supplementary materials for this article are available online.


Introduction
Classifiers are most commonly developed with the goal of obtaining accurate predictions for individual units. However, in some applications, the objective is not individual-level predictions, but rather to learn about population-level distributions of a given outcome. Examples include sentiment analysis for Twitter users (Giachanou and Crestani 2016), estimating the prevalence of chronic fatigue syndrome (Valdez et al. 2018), and cause of death distribution estimation from verbal autopsies (King et al. 2008; McCormick et al. 2016; Serina et al. 2015; Byass et al. 2012; Miasnikof et al. 2015).
The task of predicting the population distribution p(y) of unobserved true outcomes (labels) y based on observed (and possibly high dimensional) covariates x has been termed quantification (Forman 2005; Bella et al. 2010; González et al. 2017; Pérez-Gállego et al. 2019) in the machine learning literature. Since the covariates are usually passed through a "trained" classification algorithm A to obtain predicted labels a := a(x), quantification can be viewed as prevalence estimation using these predicted labels, and mathematically can be formulated as solving for p(y) from the identity
p(a) = Σ_y p(a | y) p(y).    (1)
Existing quantification methods assume that the misclassification rates p(a | y) of the classifier are either negligible or transportable from the training to the test population. Since
p(a | y) = ∫ p(a | x) p(x | y) dx    (2)
and p(a | x), the prediction map for the trained classifier, is the same for a given x irrespective of whether x is in the training or the test population, implicit in the transportability assumption is the assumption that p_tr(x | y) = p_test(x | y) (Pérez-Gállego et al. 2019). The marginal distributions of the outcomes p_tr(y) and p_test(y) are allowed to be different. Dataset shift occurs when the classifier is trained using data or information (like expert knowledge, Kalter et al. 2015) from a population different from the test population of interest, resulting in both p_tr(y) ≠ p_test(y) and p_tr(x | y) ≠ p_test(x | y) (Moreno-Torres et al. 2012) (as illustrated in Figure 4 for the real application of Section 6). It is evident from Equation (2) that under dataset shift, as p_tr(x | y) ≠ p_test(x | y), we will not generally have p_tr(a | y) = p_test(a | y), and all the aforementioned quantification learning methods will be biased.
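As a concrete illustration of identity (1) and of the bias induced by dataset shift, the following minimal sketch uses hypothetical numbers (M, M_tr, and p_true are made up for illustration, not taken from any dataset in this article). It recovers p from the marginal predicted-class distribution only when the test-population misclassification matrix is used.

```python
# Minimal numeric sketch of identity (1); M, M_tr, and p_true are hypothetical.
import numpy as np

M = np.array([[0.8, 0.1, 0.1],    # row i gives p(a = j | y = i) on the test population
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
p_true = np.array([0.5, 0.3, 0.2])   # test-population prevalence p(y)

p_a = M.T @ p_true                   # p(a = j) = sum_i p(a = j | y = i) p(y = i)
print(np.linalg.solve(M.T, p_a))     # recovers p_true when the test-population M is used

# Dataset shift: plugging in a training-population matrix M_tr != M biases the answer.
M_tr = np.array([[0.90, 0.05, 0.05],
                 [0.05, 0.90, 0.05],
                 [0.05, 0.05, 0.90]])
print(np.linalg.solve(M_tr.T, p_a))  # no longer equals p_true
```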
When limited validation data with known labels are available from the test set, Datta et al. (2018) proposed population-level Bayesian transfer learning (BTL), a quantification approach for dataset shift. BTL uses this limited labeled data to estimate the misclassification rates p(a | y) on the test set, while using classifier-predicted labels for the abundant unlabeled test data to estimate p(a). The two estimation pieces are combined to solve for p(y) from (1) using a hierarchical Bayesian model. BTL only assumes transportability of the misclassification rates from the labeled test data to the unlabeled test data. Even the marginal distribution of y in the labeled test set is allowed to be different from that in the unlabeled test set.
BTL uses a multinomial model requiring a single-class (categorical) prediction for each instance. Statistical classifiers are often probabilistic (McCullagh and Nelder 1989; Murphy et al. 2006; Specht 1990), producing a compositional prediction, a vector of prediction probabilities for every class. To apply BTL, these compositional predictions need to be transformed to categorical predictions using the plurality rule (most probable category). This categorization leads to information loss, and Bella et al. (2010) showed, in a setting without dataset shift, that quantification using the compositional class probability estimates can outperform such a practice. To our knowledge, there is no quantification method for dataset shift that utilizes the compositional predictions from probabilistic classifiers.
In this article, we generalize Bayesian quantification under dataset shift to use the entire compositional prediction distributions from classifiers. Rather than positing a complete likelihood for compositional data, we use a Kullback-Leibler divergence loss equivalent to a Bayesian-style estimating equation for compositional data. The advantages of using this loss function over proper likelihoods for compositional data are manifold, including robustness to model misspecification, coherence of the estimating equations for the labeled and unlabeled sets, and allowing 0's and 1's in the compositions, ensuring use of the same loss function for categorical, compositional, or mixed-type predictions from the classifiers. Estimates of p(y) can be obtained using generalized or Gibbs posteriors that update prior beliefs using loss functions without requiring full distributional specification (Shawe-Taylor and Williamson 1997; McAllester 1999; Chernozhukov and Hong 2003; Bissiri, Holmes, and Walker 2016).
Our second innovation concerns allowing uncertainty in the true labels in the labeled test set. Such uncertainty is not uncommon. For example, physicians may be uncertain about the final cause of death (McCormick et al. 2016), or labels may be produced by aggregating crowd-sourced responses (Bragg et al. 2013). Existing quantification approaches do not allow for uncertainty in the true labels of the labeled test instances, as it is not clear how to define and estimate misclassification rates when the true labels are probabilistic.
We use belief-based mixture modeling (Szczurek et al. 2010) to represent true-label uncertainty as prior class probabilities, and extend the notion of misclassification rates to such compositional true labels. This contribution is of independent importance, as it offers a generalized Bayes estimating equation approach for regressing a compositional outcome on a compositional covariate without requiring full model specification or transformation of the data, while allowing 0's and 1's in both variables, and comes with an efficient Gibbs sampler. To our knowledge, this is novel. There has been very little work on using KLD-loss-based generalized Bayes for compositional data. The few existing approaches (Kessler and Munkin 2015; Yuan, Chappell, and Bailey 2007) only consider a compositional outcome, not compositional predictors, and have not been theoretically studied.
We refer to our method as generalized Bayes quantification learning (GBQL). We show that GBQL subsumes existing quantification approaches (CC, PA, ACC, APA, BTL) as special cases. Like BTL, we extend GBQL to an ensemble approach that can utilize predicted labels from multiple classifiers to produce an ensemble quantification that is robust to inclusion of poor classifiers in the group. We devise a fast and efficient Gibbs sampler for GBQL, harmonizing the KL loss function with conjugate priors and using a simple coarsening and rounding approximation to the likelihood.
Our final contribution is a thorough theoretical study of GBQL. To our knowledge, there is no supporting theory about the accuracy of quantification under dataset shift. There is substantial existing large-sample theory for generalized posteriors and related approaches (Chernozhukov and Hong 2003; Zhang et al. 2006; Jiang and Tanner 2008; Bhattacharya et al. 2019; Miller 2019). These results can be applied contingent upon identifiability of the parameters specifying the loss function. Identifiability is challenging for quantification under dataset shift, as it involves two different loss functions, one each for the labeled and unlabeled datasets, which on their own are both incapable of identifying the parameters. Our central result is identifiability of the parameters for GBQL, which uses both loss functions. Using this, we prove asymptotic consistency of the Gibbs posterior and asymptotic normality of the posterior mean, and provide asymptotically well-calibrated confidence intervals. We also prove a finite-sample rate result on posterior concentration. All the theory relies only on a correct first-moment assumption and is thus robust to misspecification of the full model. We also extend the theory to accommodate the practical modifications used to implement the Gibbs sampler, and to ensemble GBQL for multiple classifiers.
The rest of the article is organized as follows. Our method, various extensions, and connections to existing approaches are offered in Section 2. Theoretical properties are discussed in Section 3. Bayesian implementation and computational considerations are presented in Section 4. We show the robustness of our method through simulations in Section 5, and in Section 6 we demonstrate its performance on the problem of deriving the cause-specific rates of child deaths using the PHMRC verbal autopsy dataset.

Method
Let U denote an unlabeled dataset of N instances randomly sampled from our test population of interest. These data do not come with the true class labels y_r ∈ {1, . . . , C}, where C is the total number of categories, but using some pre-trained classifier algorithm A, one can predict labels a_r = a(x_r) for r = 1, . . . , N. We do not assume that the training data for the algorithm are available, nor do we assume knowledge of the covariates x for the test set, as long as a(x) is available to us. Our target of interest is p = p_U(y) = (p_1, . . . , p_C)', the distribution of the outcome y in our population of interest U, that is, p_i = P(y_r = i) for r ∈ U. We further assume availability of a dataset L of size n from our population of interest with both true labels y_r and predicted labels a_r. Because true labels are potentially expensive to obtain, we assume n ≪ N. We do not assume the distribution of y in L to be representative of our whole population, as true labels may only be available for a convenient sample, that is, possibly p_L(y) ≠ p_U(y). We only assume transportability of the conditional distribution p(x | y) from L to U. This transportability assumption p_L(x | y) = p_U(x | y) is more likely to hold even if the marginal distributions of y are different between L and U. For example, even if the marginal cause-of-death distributions are different for hospital and community deaths, given a cause y, the symptoms x observed in the patient are likely to have similar distributions in both settings. The transportability assumption implies from (2) that p(a | y) is also the same for L and U, as the prediction p(a | x) from a trained classifier remains the same given x irrespective of the population x is drawn from.
Bayesian transfer learning (BTL) (Datta et al. 2018) assumes that the predictions are deterministic, that is, the a_r's are categorical. Under transportability, if M = (M_ij) = (p(a_r = j | y_r = i), r ∈ U ∪ L) is the misclassification matrix of the classifier on the test population, then the marginal distribution of a_r, r ∈ U, is given by M'p. BTL essentially uses the labeled data L to estimate M, the unlabeled data U to estimate M'p, and uses the two pieces to solve for p. This is done using the data model
a_r | y_r = i ∼ Multinomial(1, M_{i*}), r ∈ L;   a_r ∼ Multinomial(1, M'p), r ∈ U,    (3)
with M_{i*} denoting the ith row of M. A Bayesian framework is used with priors for p and M.

Bayesian Estimating Equations for Compositional Data
The major limitation of BTL is its reliance on multinomial distributions for modeling the data in Equation (3). This restricts its use to cases where the predicted labels a_r are categorical. Classifiers are often probabilistic, producing a compositional prediction a_r = (a_r1, . . . , a_rC)' with 0 ≤ a_rj and Σ_j a_rj = 1. To use BTL for such probabilistic outputs, one would require unnecessary categorization. Instead, we will generalize the model-based BTL to a Bayesian estimating equation based quantification method for compositional labels. Central to BTL's estimation of population class probabilities ("quantification") is the assumption of transportability of the conditional distribution between L and U, that is,
p_L(a_r | y_r = i) = p_U(a_r | y_r = i),    (4)
that is, a_r | y_r = i ∼ Multinomial(1, M_{i*}) for all r ∈ L ∪ U.    (5)
The distributional assumption (5) can also be viewed as a first-moment assumption
E(a_rj | y_r = i) = M_ij for all r ∈ L ∪ U and j = 1, . . . , C.    (6)
The two viewpoints are equivalent for the categorical a_r used in BTL, but Equation (6) is more general as it is no longer restricted to categorical data. For compositional a_r, rather than specifying p(a_r | y_r = i), we only make the general first-moment assumption (6). This is similar to the first-moment assumption in the PA and APA approaches. The challenge is of course how to do Bayesian estimation without a full model specification.
First focusing on labeled instances r ∈ L, we consider the following loss function to connect the parameter M to our data (a_r, y_r):
ℓ_L(M) = Σ_{r ∈ L} D_KL(a_r || M_{y_r *}),    (7)
where D_KL(p||q) is the Kullback-Leibler divergence (KLD) between two distributions p and q. There are several reasons to choose the KLD loss function. First, if Equation (6) is true for some M = M_0, then
E_{M_0}( dℓ_L/dM |_{M = M_0} ) = 0.    (8)
To see this, observe that −dℓ_L/dM is the derivative of a multinomial log-likelihood. Hence, E_{M_0}(dℓ_L/dM) = 0 when the a_r are categorical. However, this derivative is only a linear function of a_r, and hence the expectation remains unchanged when we switch to compositional a_r with the same conditional mean. So, the loss function ℓ_L leads to a set of unbiased estimating equations (Liang and Zeger 1986) for compositional data. The second advantage of using KLD is that, with the convention x log x = 0 when x = 0, it seamlessly accommodates 0's and 1's in a_r. Finally, minimizing Equation (7) is equivalent to maximizing the pseudo (multinomial) log-likelihood Σ_{r ∈ L} Σ_j a_rj log M_{y_r j}. If only inference on M were of interest, frequentist optimization of Equation (7) or GEE using its derivative could be executed. Using the rich theory of estimating equations, the resulting estimate of M has been shown to be consistent (Papke and Wooldridge 1996; Mullahy 2015), and such frequentist approaches have been commonly used in the econometrics literature for regression with a compositional outcome.
However, the primary interest in quantification learning is estimation of p, and accurate estimation of the nuisance parameter M is only an important intermediate step. The unlabeled dataset U is the only one informing estimation of p, and using (4) and (6), the marginal first-moment condition for a_r in U is given by
E(a_r) = M'p, r ∈ U.    (9)
This harmonizes with the loss function
ℓ_U(M, p) = Σ_{r ∈ U} D_KL(a_r || M'p).    (10)
The loss function ℓ_U for the marginal distribution of the predicted labels is coherent with the loss function ℓ_L for their conditional distribution, as they are based off of coherent moment conditions (6) and (9). Assuming Equations (4) and (6) hold for some true p_0 and M_0, following the same logic used in Equation (8), we can show that the derivative of ℓ_U has expectation zero at the true (M_0, p_0), that is, the derivative is once again an estimating equation. However, if we only considered ℓ_U without bringing in ℓ_L, M and p cannot be identified. For example, ℓ_U(M, p) = ℓ_U(I, M'p).
Hence, we will consider the joint loss function ℓ = ℓ_L + ℓ_U, as adding ℓ_L helps to identify M, which in turn makes p identifiable.
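To make these loss functions concrete, here is a short sketch (not the authors' code) of ℓ_L, ℓ_U, and the joint loss for compositional predictions; the 0 log 0 = 0 convention handles exact 0's and 1's in the predictions.

```python
# Sketch of the KLD losses (7) and (10) and the joint loss; row i of M is p(a | y = i).
import numpy as np

def kld(a, q):
    """D_KL(a || q) for probability vectors, with the convention 0 * log 0 = 0."""
    mask = a > 0
    return float(np.sum(a[mask] * (np.log(a[mask]) - np.log(q[mask]))))

def loss_labeled(M, a_L, y_L):
    """ell_L(M): sum over labeled instances of D_KL(a_r || M[y_r, :])."""
    return sum(kld(a, M[y]) for a, y in zip(a_L, y_L))

def loss_unlabeled(M, p, a_U):
    """ell_U(M, p): sum over unlabeled instances of D_KL(a_r || M'p)."""
    q = p @ M                      # mixture of the rows of M, i.e., M'p
    return sum(kld(a, q) for a in a_U)

def joint_loss(M, p, a_L, y_L, a_U):
    return loss_labeled(M, a_L, y_L) + loss_unlabeled(M, p, a_U)
```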
Bayesian inference using only loss functions, without full model specification, is now well-established. For any reasonable choice of a loss function ℓ(θ | data) and prior π(θ), a Gibbs posterior is defined as the distribution
ν(θ) ∝ exp(−α ℓ(θ | data)) π(θ)    (12)
for some α > 0, provided the normalizing constant exists. The idea of updating prior beliefs through loss functions via Equation (12) has developed independently in multiple fields, dating back at least to Vovk (1990). This posterior is interpreted as the distribution ν for θ minimizing α E_ν(ℓ(θ | data)) + D_KL(ν, π). Gibbs posteriors (also known as pseudo- or generalized posteriors) have been widely used to derive generalization errors in the PAC-Bayesian framework (Shawe-Taylor and Williamson 1997; McAllester 1999; Catoni 2003). Functionals of the posterior in Equation (12) have been referred to as Laplace-type estimators (LTE) or quasi-Bayesian estimators (QBE) in Chernozhukov and Hong (2003). Jiang and Tanner (2008) used Gibbs posteriors for high-dimensional variable selection. The case where exp(−ℓ) is a fractional likelihood has received extra attention, with the literature demonstrating the utility of fractional posteriors over full posteriors especially under model misspecification (Walker and Hjort 2001; Zhang et al. 2006; Bhattacharya et al. 2019). Bissiri, Holmes, and Walker (2016) showed that, given the loss and the prior, (12) is the unique update that is invariant to sequentially updating with each additional data point or jointly updating using all data points. The parameter α is related to calibration of credible intervals based on Gibbs posteriors and its choice will be discussed in Section 3.1. The problem of quantification learning under dataset shift using compositional predicted labels has not been studied using a Bayesian or generalized Bayes framework. We use a_L and a_U to respectively denote {a_r}_{r ∈ L} and {a_r}_{r ∈ U}, and similar notation for collections of the other variables. The two loss functions ℓ_L and ℓ_U have the same functional form, leading to the Gibbs posterior
ν(M, p | a_L, a_U, y_L) ∝ exp{−α (ℓ_L(M) + ℓ_U(M, p))} π(M) π(p).
If all a_r were categorical, this posterior with α = 1 is identical to the one from the BTL model (3). However, using estimating equations and generalized Bayes, we now have a unified framework for Bayesian quantification for categorical, compositional, or mixed-type a_r without having to specify full models for the different data types. In subsequent sections we illustrate how this generalized Bayes framework naturally lends itself to accommodating uncertainty in true labels, multiple classifiers, and shrinkage priors.
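As a sketch of how a generalized-Bayes update of the form (12) can be evaluated in practice, the snippet below combines any loss of (M, p), such as the joint KLD loss above, with Dirichlet log-priors. The learning rate alpha and the flat Dirichlet concentration used here are illustrative choices, not values prescribed by the article.

```python
# Sketch of the unnormalized log Gibbs posterior in (12): -alpha * loss + log prior.
import numpy as np
from scipy.stats import dirichlet

def log_gibbs_posterior(M, p, loss, alpha=1.0, prior_conc=1.0):
    """loss is any callable of (M, p); Dirichlet priors on p and on each row of M."""
    log_prior = dirichlet.logpdf(p, np.full(len(p), prior_conc))
    log_prior += sum(dirichlet.logpdf(row, np.full(len(row), prior_conc)) for row in M)
    return -alpha * loss(M, p) + log_prior
```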

Advantages Over Full Bayesian Modeling of Compositional Data
Before discussing these extensions, we highlight the advantages of a generalized Bayes framework over a fully Bayesian approach for quantification learning. There are fundamental hurdles to extending the model in Equation (3) when some or all a_r are compositional. The Dirichlet distribution and its generalizations (Hijazi and Jernigan 2009; Wong 1998; Tang and Chen 2018) are standard models for compositional data. However, there are several issues with specifying a Dirichlet model for a_r in quantification learning.
1. We allow the a_r to take 0 and 1 values, as the predictions can be categorical or sparse-compositional (prediction for some classes to be exactly 0). Dirichlet distributions do not support 0's and 1's, and would require forcing the a_rj's to lie strictly in (0, 1) using some arbitrary cutoff. Alternatively, one can use the zero-inflated Dirichlet distribution (Tang and Chen 2018) to formally account for the presence of 0's, which leads to a significant increase in the number of parameters. A related point is that single-class classifiers can be viewed as a subclass of probabilistic classifiers, with the predicted distribution being degenerate. Hence, if using two classifiers, one with compositional predictions and one with single-class predictions, use of the Dirichlet model for the former and a multinomial model for the latter is discordant.
2. Our generalized Bayes approach has a coherence property required for quantification learning. The conditional expectation model (6) for the labeled data leads to the marginal model (9) for the unlabeled data. This is central to identification of p. Specifying a_r | y = i as a Dirichlet distribution (or its variants) will endow a_r with a mixture-Dirichlet marginal distribution, which presents a computational challenge in posterior sampling. Our pseudo-likelihood for a_r nicely harmonizes with conjugate Dirichlet priors for the parameters M and p, leading to an efficient Gibbs sampler.
3. Fully specified Dirichlet distributions are susceptible to model misspecification. The generalized Dirichlet distribution (Wong 1998) can be used to broaden the model class; however, the increased model complexity comes with added computational burden.
Finally, as an alternative to Dirichlet-based likelihoods, one can log-transform the data and use a multivariate normal or skew-normal distribution to fully model the log-ratio coordinates of the compositional a_r (Comas-Cufí, Martín-Fernández, and Mateu-Figueras 2016). However, a transformation-free approach is generally more desirable. Also, a model on the transformed compositional a_r would be discordant with the multinomial model for categorical a_r. The transformations also generally do not allow for 0's and 1's.

Quantification Using Uncertain True Labels
As stated in Section 1, in many applications, there is uncertainty in some or all of the true labels in the labeled test set L. For example, a panel of physicians may fail to unanimously agree on a single cause of death, and only provide a subset of the list of causes from which they believe the individual was equally likely to die. No existing quantification approach can work with uncertainty in true labels. In this section, we generalize the notion of misclassification rates to uncertain true labels and extend GBQL accordingly.
Following the belief-based modeling framework of Szczurek et al. (2010), we let b_ri represent the a priori probability that instance r belongs to label i. Then b_r is constrained such that 0 ≤ b_ri and Σ_{i=1}^C b_ri = 1. For r ∈ L we no longer observe the y_r's but observe the belief vector b_r. Cases where the true label is identified with complete certainty can be subsumed by writing b_r = e_i when y_r = i, with e_i denoting the vector with 1 at the ith component and zeros elsewhere. We can generalize the conditional first-moment condition (6) to
E(a_r | b_r) = M'b_r.
So, our loss function for L is now
ℓ_L(M) = Σ_{r ∈ L} D_KL(a_r || M'b_r).
The loss for the unlabeled data remains the same, and generalized Bayes proceeds using the loss ℓ_L + ℓ_U with this generalized choice of ℓ_L. Appealing to the motivation of generalized Bayes (Chernozhukov and Hong 2003; Zhang et al. 2006; Bissiri, Holmes, and Walker 2016), we can see that the Gibbs posterior is the probability measure which, as n, N → ∞ and n/N → ξ, minimizes the corresponding limiting Bayes risk.
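A sketch of the labeled-data loss under uncertain labels: the only change from the exact-label case is that the comparison distribution becomes the mixture M'b_r.

```python
# Sketch of ell_L(M) with compositional true labels b_r (belief vectors).
import numpy as np

def kld(a, q):
    mask = a > 0
    return float(np.sum(a[mask] * (np.log(a[mask]) - np.log(q[mask]))))

def loss_labeled_beliefs(M, a_L, b_L):
    """Each labeled instance contributes D_KL(a_r || M'b_r) = D_KL(a_r || sum_i b_ri M[i, :])."""
    return sum(kld(a, b @ M) for a, b in zip(a_L, b_L))
```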

Ensemble Quantification Incorporating Multiple Classifiers
There may be k = 1, . . . , K predictions for each instance, corresponding to K classifiers. Datta et al. (2018) have shown the advantage of incorporating multiple algorithms for quantification when only categorical predictions are available, and their ensemble quantification can easily be extended to compositional settings. For the ensemble approach, a fundamental observation is that each algorithm is expected to have its own sensitivities and specificities. Representing the kth algorithm's prediction for instance r as a^k_r and the corresponding misclassification matrix as M^k, the conditional first-moment assumption (6) becomes
E(a^k_r | y_r = i) = M^k_{i*}.
For the unlabeled data, we will now have the labels satisfying the marginal first-moment condition E(a^k_r) = M^k'p. Hence, each of the K predictions for the unlabeled test data U informs about the same parameter p (our estimand), and we define ensemble GBQL using the sum of the losses for the individual algorithms:
ℓ_ens(M^1, . . . , M^K, p) = Σ_{k=1}^K { ℓ_L(M^k) + ℓ_U(M^k, p) },
where the kth summand is computed with the predictions a^k_r. Ensemble GBQL offers a unified framework for combining information from probabilistic classifiers (compositional a_r) and deterministic ones (categorical a_r) like clinical classifiers for causes of death.
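The ensemble loss is simply the sum of per-classifier losses sharing a single p. A minimal sketch, assuming exact labels y_r in the labeled set:

```python
# Sketch of the ensemble GBQL loss: classifier k contributes its own labeled and
# unlabeled KLD losses through its own matrix M_k, while all classifiers share p.
import numpy as np

def kld(a, q):
    mask = a > 0
    return float(np.sum(a[mask] * (np.log(a[mask]) - np.log(q[mask]))))

def ensemble_loss(M_list, p, a_L_list, y_L, a_U_list):
    total = 0.0
    for M, a_L, a_U in zip(M_list, a_L_list, a_U_list):
        total += sum(kld(a, M[y]) for a, y in zip(a_L, y_L))   # labeled part for classifier k
        total += sum(kld(a, p @ M) for a in a_U)               # unlabeled part for classifier k
    return total
```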

Shrinkage Towards Default Quantification Methods
We now discuss how existing quantification approaches are special cases of GBQL with specific choices of degenerate priors for M. We will leverage this property to construct shrinkage priors in data-scarce settings.
The simplest quantification approach is called classify and count (CC) (Forman 2005). CC requires a single predicted class j for each instance, so that a_rj ∈ {0, 1}. The CC estimate of p_i is
p̂_i^CC = (1/N) Σ_{r ∈ U} a_ri.
The probabilistic average (PA) approach (Bella et al. 2010) extended this to allow probabilistic predictions. The PA estimate, p̂_i^PA, is obtained in the same manner as p̂_i^CC, but does not require a_rj ∈ {0, 1}. It is clear from above that CC (or PA) produces a biased estimate, as E(p̂^CC) = E(p̂^PA) = E(a_r) = M'p, which generally does not equal p unless M = I (i.e., the classifier is perfect) or p is a stationary distribution for M.
Adjusted classify and count (ACC) (Forman 2005) accounts for the classifier being imperfect even for the training population. ACC relies on cross-validation using training data splits to estimate the true positive and false positive rates (tpr and fpr) of the classifier (for the base case of C = 2), and proposes
p̂_1^ACC = (p̂_1^CC − fpr) / (tpr − fpr).    (15)
Bella et al. (2010) developed an adjusted version of the PA estimate (APA) similar to ACC but for probabilistic predictions. ACC, APA, and their multi-class extensions (Hopkins and King 2010) are inappropriate for quantification in the presence of dataset shift, as the fpr and tpr estimated from the training data will not be representative of those in the test population (Pérez-Gállego et al. 2019). Also, p̂_i^ACC is not guaranteed to be in [0, 1], although Hopkins and King (2010) correct for this using constrained optimization.
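For reference, here is a small sketch of these baseline estimators with hypothetical two-class inputs (CC is PA applied to one-hot predictions); note how the ACC correction (15) can fall outside [0, 1].

```python
# Sketch of the CC/PA and ACC/APA baselines; inputs below are hypothetical.
import numpy as np

def pa_estimate(a):
    """PA (or CC when rows of a are one-hot): average the predicted compositions."""
    return a.mean(axis=0)

def acc_estimate(a, tpr, fpr):
    """ACC/APA correction (15) for class 1, using training-data tpr and fpr;
    the result is not guaranteed to lie in [0, 1]."""
    p1_cc = pa_estimate(a)[0]
    return (p1_cc - fpr) / (tpr - fpr)

a = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]])
print(pa_estimate(a), acc_estimate(a, tpr=0.8, fpr=0.2))
```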
To make the connection between these methods and GBQL, we first consider the scenario where n = 0, that is, when there is no labeled test set to estimate dataset shift. Consider a sequence {Π_u(M) | u = 1, 2, . . .} of priors for M such that Π_u converges in distribution to δ(M^pr), a degenerate prior at some pre-fixed transition matrix M^pr. Then the Gibbs posterior ν_u of GBQL using the prior π(p) Π_u(M) converges in distribution to a limiting posterior for p determined by M^pr. If M^pr = I, then for any prior choice of p, the mean of this limiting posterior is the CC (or PA) estimate. For categorical a_r, this result was proved in Datta et al. (2018), and shows that E_{lim_u ν_u}(p) = p̂^CC; that is, using priors Π_u(M) shrinking toward the degenerate prior at I, inference from GBQL becomes identical to inference from classify and count (Forman 2005) when there is no labeled dataset. Analogously, for the same settings, when the a_r are compositional, the posterior mean from GBQL becomes identical to the probabilistic average (Bella et al. 2010). Extending the argument to settings with multiple predictions, it is straightforward to see that E_{lim_u ν_u}(p) = (1/K) Σ_{k=1}^K p̂^{k,PA}; that is, the posterior mean from our ensemble classifier coincides with the average of the CC or PA estimates for the K classifiers.
Alternatively, if the misclassification matrix M^tr for the training data is available and can be trusted for the test data, then one can use M^pr = M^tr. Then the posterior lim_{u→∞} ν_u(p) coincides with the implicit likelihood in adjusted classify and count (for categorical a_r) and in adjusted probabilistic average (for compositional a_r). To see this, note that the ACC estimate (15), for 2 classes and categorical a_r's, relies on the principle that
E(p̂_1^CC) = p_1 tpr + (1 − p_1) fpr.
This is equivalent to E(a_r1) = p_1 M^tr_11 + p_2 M^tr_21, or E(a_r) = M^tr'p, that is, assuming Equation (5) with M = M^tr. Thus, using Π(M) ≈ δ(M = M^tr) in GBQL is a better implementation of ACC or APA, ensuring that the posterior mean of p is guaranteed to be a vector of probabilities lying in [0, 1]. This is not assured in their current implementation based on the direct correction (15).
Hence, in the absence of a local labeled set, a prior for M concentrated around I or M^tr makes estimates from GBQL nearly coincide with these existing methods (Figure 1). GBQL in fact provides a probabilistic framework around these existing quantification approaches.
When labeled data are present, instead of using M = I (i.e., no adjustment, as in CC and PA) or M = M^tr (i.e., transportability of the conditional distributions between the training and test data, as used in ACC and APA), GBQL estimates an unstructured M, only assuming transportability of the conditional distributions from the limited labeled test data L to all test data. However, quantification projects like burden-of-disease estimation using nationwide surveys are often multiyear endeavors. At the initial stages of such projects, L, consisting of hospital deaths with clinically diagnosed causes, can be very small. With very limited labeled data, estimating both M and p precisely with vague priors is ill-advised, as M involves C(C − 1) parameters. In such settings, the above-established link between GBQL and the existing quantification methods can be exploited to choose shrinkage priors for stabilizing estimation of M. For example, one can use the priors M_{i*} ∼ Dirichlet(γ_ui (M^pr_{i*} + ε_u 1)) for small ε_u or large γ_ui. This prior concentrates around δ(M = M^pr) if either ε_u → 0 or γ_ui → ∞; hence, for no or a small labeled dataset, the estimates from GBQL will shrink toward those from CC or PA (if M^pr = I) or toward ACC or APA (if M^pr = M^tr). With limited labeled data, these shrinkage priors make a bias-variance trade-off, yielding estimates with higher precision. The benefits of such shrinkage priors over noninformative priors have been demonstrated in Datta et al. (2018). Finally, as more and more labeled data are collected, in the next section we show that any reasonable choice of prior (including all these shrinkage priors) leads to desirable asymptotic and finite-sample properties of the GBQL estimate.
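As an illustration, the Dirichlet concentration parameters for such a shrinkage prior can be built as follows; gamma and eps below are hypothetical values standing in for γ_ui and ε_u.

```python
# Sketch of the shrinkage-prior concentration for the rows of M: row i of the
# returned matrix gives the Dirichlet parameters gamma * (M_pr[i, :] + eps),
# concentrating on M_pr as eps -> 0 or gamma -> infinity.
import numpy as np

def shrinkage_concentration(M_pr, gamma=50.0, eps=0.01):
    return gamma * (M_pr + eps)

M_pr = np.eye(3)                        # shrink toward CC/PA behavior (M_pr = I)
print(shrinkage_concentration(M_pr))
```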

Parametric Modeling of Misclassification Rates
An alternative to using the shrinkage priors would be to incorporate domain knowledge about the misclassification rates via informative priors in a parametric model for M. One example of such prior knowledge is the impossibility of occurrence of some true-label/predicted-label class pairs. This can especially be true for a domain-knowledge-driven classifier. A clinically driven cause-of-death classifier like the Expert Algorithm (Kalter et al. 2015) is unlikely to produce entirely improbable predicted causes given a true cause of death. Hence, the misclassification rates for a large set of class-label pairs can likely be set to 0 a priori. Such sparse models for M can drastically reduce the number of parameters.
Let S_i = {j | M_ij ≠ 0} denote the known support set for true class i. Then we can use an uninformative prior for the nonzero entries of M via the uniform sparse-Dirichlet prior
M_{i,S_i} ∼ Dirichlet(δ 1_{|S_i|}), with M_ij = 0 for j ∉ S_i,
where 1_d denotes the d-dimensional vector of ones and δ > 0. We can easily modify the Gibbs sampler presented in Section 4 for such a sparse model to ensure conjugate updates. The only change would be replacing the Dirichlet updates for M_{i*} with Dirichlet updates for M_{i,S_i}, as the M_ij for j ∉ S_i are set to 0.
Strategies other than sparsity can also be adopted to reduce the number of parameters. Consider an example where the classifier is nearly perfect for the training population, and the test population is a mixture population where some cases are similar to the training data for which the classifier will be accurate, and some are from a different population for which the classifier reduces to a random guess. In such cases, M can be modeled as an equicorrelated (only one parameter) or row-equicorrelated (C parameters) matrix.
Finally, assuming homogeneous misclassification rates for the entire population may be inappropriate for some applications. We can then model M_ij as a function of a small set of covariates U. Datta et al. (2018) considered this extension for categorical labels, using a logit regression model of M on U, and proposed a Gibbs sampler using the Polya-Gamma sampler of Polson, Scott, and Windle (2013). We can adopt such a model for GBQL, and formulate a Gibbs sampler with conjugate updates using the data augmentation scheme we present in Section 4 combined with the Polya-Gamma data augmentation.

Theory
In this section we establish asymptotic and finite-sample guarantees for GBQL for the general case from Section 2.3 where the true labels in L are observed with uncertainty b_r. This subsumes the case of Section 2.1 with exact labels y_r. The Gibbs posterior for GBQL is given by
ν(M, p) ∝ exp{−α (ℓ_L(M) + ℓ_U(M, p))} π(M) π(p),    (16)
with ℓ_L(M) = Σ_{r ∈ L} D_KL(a_r || M'b_r) and ℓ_U(M, p) = Σ_{r ∈ U} D_KL(a_r || M'p). Our quantification approach is grounded only in correct specification of the conditional first-moment assumption (6), in the spirit of the moment-condition-based quasi-Bayesian estimators of Chernozhukov and Hong (2003) and the misclassification-loss based approach for variable selection of classifiers of Jiang and Tanner (2008).
A central component driving the theory of Gibbs posteriors is some assumption about identifiability of the parameters. Parameters are identifiable if they maximize the likelihood for the M-closed case, minimize the KL divergence to the true distribution for the M-open case, and minimize the loss function for the M-free case. As we see in Equation (16), we have two loss functions for GBQL. The loss function ℓ_L for the labeled data (a_r, b_r) in L is different from the loss function ℓ_U for the unlabeled data a_r in U. Each loss on its own is incapable of identifying the parameter of interest p, as ℓ_L does not even depend on p, and ℓ_U(M, p) = ℓ_U(I, M'p). To our knowledge, there has not been any application of the general theory of Gibbs posteriors to a setting similar to quantification learning, requiring more than one type of loss function to identify the estimand.
More generally, there is no theory on model-free Gibbs posteriors for compositional data using cross-entropy (KLD) loss. Related Bayesian methodological works are the Gibbs samplers for compositional regression (Kessler and Munkin 2015) and for multiple toxicity grades in the context of early-phase clinical trials (Yuan, Chappell, and Bailey 2007). However, these methods only consider a compositional outcome, and not a compositional predictor as we do in Section 2.3. Also, their approach was motivated by fractional multinomial regression and not by loss-function-based generalized Bayes, and did not come with any theoretical guarantees.
Our main result, which leads to all the subsequent asymptotic and finite-sample guarantees, is identifiability of p from the loss ℓ_L + ℓ_U. We introduce the following notation for the theory. We will use M̃ and p̃ to denote the free parameters in M and p respectively, that is, M̃ excludes the last column of M and p̃ excludes the last element of p. M̃ and p̃ are bijective functions of M and p respectively, so we will use them interchangeably. Let θ = (M̃, p̃); then θ is supported on the compact set
Θ = { θ = (M̃, p̃) : M̃_ij ≥ 0, Σ_{j=1}^{C−1} M̃_ij ≤ 1 for all i; p̃_i ≥ 0, Σ_{i=1}^{C−1} p̃_i ≤ 1 }.
Switching to M̃ and p̃ ensures that the parameter space has a nonempty interior.
Let p_0 and M_0 denote the true values and θ_0 = (M̃_0, p̃_0), an interior point in Θ. We first state and discuss our assumptions for the theory.

Assumption 1. Let S_C denote the C-dimensional probability simplex with corners e_i, the C × 1 vector with 1 at the ith position and 0's elsewhere. Then for any arbitrarily small neighborhood N_i of e_i in S_C, P_L(b_r ∈ N_i) > 0 for every i = 1, . . . , C.

Assumption 1 states that the true data-generating distribution of the compositional labels b_r for L has positive mass at each of the C corners of the simplex S_C. Each corner of the simplex represents a cause-category. Mass at the ith corner is needed to estimate the ith row of M from L. So the assumption ensures that there are data to estimate each row of M. To interpret Assumption 1, consider the special case where we observe the true labels y, and the predicted labels a are categorical. Then Assumption 1, along with M_0 being an interior point, ensures that for large enough n, for every (i, j) pair, there are cases in L for whom the true class is i and the predicted class is j. This is of course necessary to estimate the misclassification rate M_ij. Thus, Assumption 1 can be interpreted as a positivity assumption ensuring that the limited labeled test set can estimate the sensitivities and specificities of the classifier for all class pairs.

Assumption 2. M_0 is non-singular.
Assumption 2 is a separability assumption necessary for quantification. If there exist two probability vectors p^0 and p^1 such that M_0'p^0 = M_0'p^1, then ℓ_U(M_0, p^0) = ℓ_U(M_0, p^1), so it would be impossible to identify p based on the predicted labels. A trivial example of this is a 2-class setting with M_11 = M_21 = 1 and M_12 = M_22 = 0. Then all labels will be predicted as class 1, and it would not be possible to distinguish the true positives from class 1 from the false positives from class 2; that is, classes 1 and 2 will not be separable. This separability assumption has long been discussed in the finite mixture model literature (Teicher 1963; Yakowitz and Spragins 1968), but has not been explicitly discussed in the context of quantification.
Under these two assumptions, we have the following result asserting that, with enough data, the loss function is minimized close to the true parameter value.

Theorem 1 (Identifiability for quantification learning). Let
f_N(θ) = (ℓ_{L,n}(M̃) + ℓ_{U,N}(θ)) / N and f(θ) = ξ E_L[D_KL(a || M'b)] + E_U[D_KL(a || M'p)],
where ξ = lim n/N. Under Assumptions 1 and 2, for any ε > 0 there exists κ > 0 such that the following holds for f_N: with probability tending to one, f_N(θ) > f(θ_0) + κ for every θ outside an ℓ_1 ball of radius ε around θ_0.
The formal proof is provided in the supplementary materials; we briefly outline the ideas used. We can write ν_N(θ) ∝ exp(−α ℓ_{L,n}(M̃) − α ℓ_{U,N}(θ)) π(p̃, M̃), where the subscripts n and N are added to indicate the dependence of ℓ_L, ℓ_U, and ν on the sample sizes. When the loss functions f_N are convex and converge pointwise to some f, one can use the rich theory of convex functions to establish identifiability by showing that f is minimized at the true value θ_0. In our case, f_N = (ℓ_{L,n} + ℓ_{U,N})/N converges point-wise to f = ξ E_L(D_KL(a || M'b)) + E_U(D_KL(a || M'p)), where ξ = lim n/N. However, neither the f_N's nor f is convex because of the M'p term, ruling out direct application of this result.
We first show in Lemma S2 of the supplement that ℓ_L is convex in M̃ and use the convexity results to show that ℓ_L can identify M̃_0, that is, outside of any neighborhood around the true value M̃_0, the empirical loss function ℓ_{L,n}/n has a higher value than the limiting loss function E_L(D_KL(a || M'b)). A complementary result to this is that of weak identifiability of (M̃_0, p̃_0) from the nonconvex ℓ_U. Lemmas S3 and S4 of the supplementary materials state that θ_0 = (M̃_0, p̃_0) is one of the minimizers of the loss function ℓ_U/N and its limit E_U(D_KL(a || M'p)). Combining these results for the two losses, we have that for any θ = (M̃, p̃), f_N(θ) is greater than f(θ_0) unless M̃ lies in an infinitesimally small neighborhood around M̃_0 and θ is also a minimizer of ℓ_U. Thus, use of the local labeled set L via the loss function ℓ_{L,n} helps to identify M, as the posterior is guaranteed to concentrate around M̃_0. As M̃ concentrates around M̃_0, the loss ℓ_{U,N}(M, p) becomes capable of identifying p_0, as (M̃_0, p̃_0) is a minimizer of ℓ_U and M_0 is nonsingular by the separability assumption.
Subsequent to establishing identifiability, one can leverage general results of Gibbs posteriors (Chernozhukov and Hong 2003;Miller 2019) to show posterior asymptotic consistency of the Gibbs posterior for GBQL (Theorem 2), and asymptotic normality of the Gibbs posterior mean (Theorem 3).
Theorem 3 (Asymptotic normality). The mean θ̄ of the Gibbs posterior distribution (16) is asymptotically normal; that is, √N (θ̄ − θ_0) converges in distribution to N(0, J(θ_0)^{-1} Σ(θ_0) J(θ_0)^{-1}).

The proofs of these results are provided in the supplementary material. Proof of the existence and positive definiteness of the matrices Σ(θ_0) and J(θ_0) of Theorem 3 is provided in Lemma S5 of the supplementary material.

Coverage of Interval Estimates
Gibbs posteriors, or even full posteriors under model misspecification, generally do not offer well-calibrated credible intervals (Kleijn et al. 2012). Calibration of credible intervals can be guaranteed under the generalized information equality, that is, when Σ(θ_0) = J(θ_0) in Theorem 3 (Chernozhukov and Hong 2003). This equality is satisfied for correctly specified likelihoods and other estimators like generalized method of moments, but typically is not satisfied for fractional posteriors or M-estimation type approaches like ours. Consequently, there is a large body of work on the choice of the scaling parameter α (also known as the learning rate or inverse temperature) in (16) to ensure desirable properties of interval estimates from Gibbs posteriors. Bissiri, Holmes, and Walker (2016) proposed choosing α by matching expected unit losses from the data and the prior. This strategy does not work with flat priors. Other strategies considered include endowing α with a hyperprior, which requires choosing the hyper-parameters, or using a prior that has conjugate structure to the loss, which requires establishing the equivalence of the prior loss with m units of data loss. Holmes and Walker (2017) considered the special case of power likelihoods and developed related ideas of choosing the power parameter based on matching expected information from the Gibbs power posterior with that from a full Bayesian update. This relies on knowledge of a full parametric model for a Bayesian update and does not generalize from power posteriors to loss-function based Gibbs posteriors. More importantly, choosing α based on information matching offers no guarantee about calibration of credible intervals.
The choice of α is also discussed in the PAC-Bayes literature (Guedj 2019). While theoretical bounds often use some oracle value of α, practical strategies include cross-validation, which can be computationally demanding, or integration over α. SafeBayes (Grünwald 2011, 2012; Grünwald et al. 2017) recommends optimizing over α using expected posterior loss or posterior predictive loss from power likelihoods, which may not generalize to arbitrary loss functions. Again, these strategies provide no guarantees about coverage probabilities of the credible intervals.
Syring and Martin (2019) calibrated α directly using bootstrapped data distributions to get the desired coverage of credible intervals. However, the algorithm requires posterior sampling for each choice of α, each confidence level, and each bootstrap sample. Also, their theory proves the existence of an α* which ensures calibrated intervals, but does not guarantee reaching α* with the proposed iterative algorithm.
As an alternative to choosing α to get well-calibrated credible intervals, Chernozhukov and Hong (2003) proposed fixing α and obtaining delta-method style confidence intervals around the Gibbs posterior mean. These guarantee well-calibrated coverage probability of all functionals of parameters for any confidence level.
For GBQL, while we can use any of the aforementioned strategies for choosing α, we adopt the delta-method approach of Chernozhukov and Hong (2003) due to its minimal computational overhead and established asymptotic coverage guarantee. We prove the following result on guaranteed coverage of all parameter functionals for GBQL using fully data-driven confidence intervals. The technical statement of the result is provided in Section S2 of the supplementary material, and proved therein.
Theorem 4. For any differentiable function g(θ), one can compute t(g, L, U), a deterministic function of g and the data (L, U), such that, with θ̄ denoting the Gibbs posterior mean and C_{g,s} = z_{1−s/2} t(g, L, U) for any 0 < s < 1, where z_s is the sth quantile of a standard normal variable, we have
P( g(θ_0) ∈ [g(θ̄) − C_{g,s}, g(θ̄) + C_{g,s}] ) → 1 − s.

Finite-Sample Concentration Rate
In addition to the asymptotic results above, we also prove a finite-sample result on the posterior concentration rate. Let P^0_N denote the probability measure generating a_r ∈ U and (a_r, b_r) ∈ L, and let p_N(θ) denote the pseudo-likelihood exp{−(ℓ_{L,n}(M̃) + ℓ_{U,N}(θ))}. We define
D_{N,α}(θ, θ_0) = (1/(α − 1)) log ∫ (p_N(θ)/p_N(θ_0))^α dP^0_N.    (17)
For the M-closed case, that is, when the class {p_N(θ) | θ ∈ Θ} contains the true data-generating model, D_{N,α}(θ, θ_0) is (up to a constant) the widely used Rényi divergence. For the M-open case, that is, when {p_N(θ) | θ ∈ Θ} is a family of misspecified likelihoods, Bhattacharya et al. (2019) used (17) as a valid divergence measure of θ from θ_0 to derive finite-sample posterior concentration rates. Two conditions needed for the validity of (17) as a divergence were θ_0 being an interior point of Θ, and θ_0 minimizing −∫ log(p_N(θ)) dP^0_N. For GBQL, we are using estimating equations (the M-free case) for the compositional true and predicted labels. We also consider θ_0 to be an interior point of the parameter space, and under Assumptions 1 and 2, we show in Lemma S5 part (iv) that −∫ log(p_N(θ)) dP^0_N is minimized at θ_0. This ensures the validity of D_{N,α}(θ, θ_0) as a divergence. We now have the following finite-sample result.

Theorem 5 (Posterior concentration rate). Let ε := ε_N > 0 be such that B_ε(θ_0), an ℓ_1 ball of radius ε around θ_0, lies in the interior of Θ, and let Π_N(p, M) be any (possibly N-dependent) prior that gives positive mass of at least exp(−NRε) to B_ε(θ_0) for some universal constant R. Then, under Assumptions 1 and 2, for any α ∈ (0, 1) and constants D > 1 and t > 0, with P^0_N-probability at least 1 − O(1/N) the Gibbs posterior assigns mass at most exp(−O(Nε)) outside α-divergence neighborhoods of θ_0 of radius of order ε.

Theorem 5 establishes the (outer) probability of 1 − O(1/N) of the Gibbs posterior for GBQL concentrating in an α-divergence neighborhood of the truth at an exponential (1 − exp(−O(N))) rate. The rate is the same as that for fractional posteriors established in Bhattacharya et al. (2019). We show that the prior mass condition of assigning at least exp(−NRε) prior probability to the ball B_ε(θ_0) is satisfied by the Dirichlet priors for M and p. This leads to the following nearly parametric rate (up to logarithmic terms) for the concentration of the posterior in α-divergence neighborhoods.
Corollary 1. Using independent Dirichlet priors for p and for (the rows of) M, the conclusion of Theorem 5 holds, for some universal constant M, with a nearly parametric concentration rate up to logarithmic terms.

Ensemble GBQL
Finally, all the theory extends to the ensemble quantification of Section 2.4. One thing to note for ensemble GBQL is that the same dataset is used with different classifiers to get predictions. Hence, these K sets of predictions will not be independent. Also, the labeled data b_r for r ∈ L are the same ones used in the loss function for each classifier. The theory accommodates these dependencies. We state the result informally here, and provide the technical version in Section S2 and the proof in Section S3.
Corollary 2 (Ensemble GBQL). If there are K predictions available for each instance from K classifiers, and Assumptions 1 and 2 are satisfied for each classifier, then with θ = (M̃^1, . . . , M̃^K, p̃) we can establish posterior consistency, asymptotic normality, coverage of confidence intervals, and posterior concentration rate results for ensemble GBQL analogous to Theorems 2-5.

Gibbs Sampler Using Rounding and Coarsening
We first outline the Gibbs sampler steps when using only one classifier. The sampler for ensemble GBQL is detailed in the Supplement. The Gibbs posterior ν is given by
ν(M, p) ∝ π(M) π(p) Π_{r ∈ L} Π_j (Σ_i b_ri M_ij)^{a_rj} Π_{r ∈ U} Π_j (Σ_i p_i M_ij)^{a_rj}.    (18)
When all a_r are categorical, the polynomial expansion of (Σ_i p_i M_ij)^{Σ_r a_rj} enabled an efficient latent variable Gibbs sampler in Datta et al. (2018). When the a_rj are fractions, this advantage is lost, as fractional polynomials do not have such convenient expansions. Additionally, since we now allow uncertainty in the true labels, we also need to consider the extra fractional expansion terms (Σ_i b_ri M_ij)^{a_rj}.
We can use any off-the-shelf sampler to generate samples from Equation (18). However, to enable fast and efficient sampling, we propose a data-augmented Gibbs sampler. We first switch from ν to ν_round, where the probabilistic output a_rj is replaced by ⌈T a_rj⌉, where T is an integer and ⌈·⌉ denotes the ceiling of a real number. Consider now the following generative model: for each instance r and pseudo-replicate t = 1, . . . , T_r, with T_r = Σ_j ⌈T a_rj⌉, latent classes z_rt ∼ Multinomial(1, p) for r ∈ U and z_rt ∼ Multinomial(1, b_r) for r ∈ L, and pseudo-labels d_rt | z_rt = i ∼ Multinomial(1, M_{i*}). The rounded generalized posterior ν_round is then the proper Bayesian posterior using the likelihood p(d_U, d_L | b_L, M, p) for any realization of the d_rt's satisfying Σ_t I(d_rt = j) = ⌈T a_rj⌉. To obtain samples of p and M from ν_round, instead of using this marginalized likelihood, we can equivalently introduce z_L and z_U as latent variables and use the joint likelihood p(d_U, d_L, z_U, z_L | b_L, M, p). This joint likelihood decomposes nicely and is conducive to a Gibbs sampler with standard Dirichlet priors on M and p.
However, since we artificially inflate the sample size by an order of T by switching from a_r to ⌈T a_r⌉, instead of sampling from ν_round we sample from the coarsened posterior
ν_coarse(M, p) ∝ p(d_U, d_L | b_L, M, p)^{1/T} π(M) π(p).    (20)
As ν_round is based on the proper likelihood p(d_U, d_L | b_L, M, p) arising from a mixture of categorical distributions, ν_coarse can be expressed as a fractional (coarsened) posterior (Bhattacharya et al. 2019; Ibrahim et al. 2015). The conditional coarsening algorithm (Miller and Dunson 2019) introduces latent variables to devise Gibbs samplers for such coarsened posteriors for mixture likelihoods, just as one would do for proper posteriors. While the coarsened posterior has not been established to be exactly equal to the stationary distribution of the conditional coarsening Gibbs sampler, Miller and Dunson (2019) provided heuristic justification for the algorithm by drawing a connection with the geometric mean of posteriors based on subsampled data. Empirical evidence was also provided by comparing conditional coarsening with direct importance sampling. As our ν_round is also based on a proper mixture likelihood, we outline below a conditional coarsening-based Gibbs sampler for sampling from ν_coarse. Our own simulations, detailed in Section S4.2, reinforce the claim of Miller and Dunson (2019) about the accuracy of conditional coarsening for sampling from coarsened posteriors. The GBQL conditional coarsening Gibbs sampler generates posterior samples nearly indistinguishable from direct samples from the coarsened posterior, while being substantially faster.
We use generic Dirichlet priors M_{i*} ∼ Dirichlet(V_{i*}) and p ∼ Dirichlet(v), where V and v are respectively a matrix and a vector of positive hyperparameters. Specific choices with desirable shrinkage properties are discussed in Section 2.5. This yields closed-form Gibbs updates: the latent classes z_rt have categorical full conditionals, and the rows M_{i*} and the vector p have Dirichlet full conditionals. If there are hyperparameters γ in V and v that need to be assigned a prior, they can be sampled using a Metropolis-Hastings step. We note that the full conditional distributions for the z_rt with r ∈ U and d_rt = j are identical, which enables them to be jointly sampled. Furthermore, the z_rt for r ∈ L do not need to be updated if there is an i such that b_ri = 1.
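The following is a schematic sketch of one sweep of this data-augmented sampler, not the authors' implementation: rounding creates ⌈T a_rj⌉ pseudo-labels per predicted class, latent true classes are drawn from their categorical full conditionals, and coarsening is implemented here by scaling the pseudo-count sufficient statistics by 1/T (matching the 1/T power on the rounded likelihood) before the conjugate Dirichlet updates.

```python
# Schematic sketch (assumptions noted above) of one sweep of the rounding-and-
# coarsening Gibbs sampler; V (C x C matrix) and v (length-C vector) are the
# Dirichlet prior hyperparameters, T the rounding/coarsening factor.
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(M, p, a_U, a_L, b_L, V, v, T):
    C = len(p)
    cnt_M = np.zeros((C, C))            # pseudo-counts of (latent true class i, predicted class j)
    cnt_p = np.zeros(C)                 # latent true-class pseudo-counts from the unlabeled set
    for a_set, w_set, unlabeled in ((a_U, None, True), (a_L, b_L, False)):
        for r, a in enumerate(a_set):
            w = p if unlabeled else w_set[r]      # mixing weights for the latent class
            d = np.ceil(T * a).astype(int)        # rounded pseudo-counts per predicted class j
            for j in np.nonzero(d)[0]:
                prob = w * M[:, j]
                prob = prob / prob.sum()
                z = rng.multinomial(d[j], prob)   # latent true-class counts for these pseudo-labels
                cnt_M[:, j] += z
                if unlabeled:
                    cnt_p += z
    M_new = np.vstack([rng.dirichlet(V[i] + cnt_M[i] / T) for i in range(C)])
    p_new = rng.dirichlet(v + cnt_p / T)
    return M_new, p_new
```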

Choice of Coarsening Factor
Our Gibbs sampler relies on rounding and coarsening ν using an integer factor T. In this section we discuss theory guiding the choice of this factor T. Kessler and Munkin (2015) used a similar data-augmented Gibbs sampling approach for compositional regression. However, their approach only incorporates the rounded likelihood for the pseudo-data d_r and does not coarsen. Rounding inflates the sample size by a factor of T, resulting in underestimation of the posterior variance, and the coarsening is needed to adjust for this. We first show that a coarsening adjustment with a factor T growing with the sample size ensures asymptotic equivalence of the rounded and coarsened posterior ν_coarse with the original posterior ν.
We denote the coarsened and rounded version of the loss function f_N using a factor T as f̃_N. As |⌈T a_rj⌉/T − a_rj| ≤ 1/T for every r and j, for any (M, p) in the interior of the parameter space f̃_N goes to the same point-wise limit f as T = T_N → ∞. This immediately leads to the analogues of the asymptotic results of Theorems 2-4 for ν_coarse.
Corollary 3. Let ν_coarse,N denote the rounded and coarsened generalized posterior using a factor T_N with T_N → ∞. Then, under Assumptions 1 and 2, we have: (i) ν_coarse,N(B_ε(θ_0)) → 1 in probability for any ℓ_1 ball B_ε(θ_0) of radius ε around θ_0, for any ε > 0 such that the prior π(p, M) gives positive support to B_ε(θ_0). (ii) The Gibbs posterior mean θ̄_coarse using ν_coarse is asymptotically normal, that is, √N (θ̄_coarse − θ_0) converges in distribution to N(0, J(θ_0)^{-1} Σ_T(θ_0) J(θ_0)^{-1}), where Σ_T(θ_0) is a positive definite matrix created by replacing the population distribution of a_r with that of ⌈T_N a_r⌉/T_N in Σ(θ_0) of Theorem 3. (iii) For any differentiable function g(θ), one can compute interval estimates of g(θ) with valid asymptotic coverage in the same way as in Theorem 4, using the same J and a Σ estimate that replaces the samples a_r with ⌈T_N a_r⌉/T_N.
The asymptotic results only need T_N → ∞. For practical guidance on the choice of T_N, we now look at how the finite-sample posterior concentration rate for ν_coarse compares with that of ν in Theorem 5.
Theorem 6 (Coarsened posterior concentration rate). Let ν_{N,T} denote the coarsened posterior of (20) with rounding and coarsening factor T = O(N^β) for some β ≥ 1. Then, under Assumptions 1 and 2 and the conditions of Theorem 5, with universal constants R and R_0 ≥ 1, ν_{N,T} concentrates in α-divergence neighborhoods of θ_0 at the same exponential rate as in Theorem 5, up to the constant R_0, and with the same order of exception probability.

It is clear that when T = O(N^β) for β ≥ 1, the coarsened posterior concentrates at the same rate (up to a constant) around the true value with the same probability as the uncoarsened posterior, while when T = ∞ the probability in Theorem 6 is the same as that in Theorem 5, which is not surprising as the coarsened posterior with T = ∞ is the original uncoarsened posterior. Larger T would involve creating and updating higher-dimensional pseudo-data in the Gibbs sampler. Hence, we recommend using T = O(N), the smallest scaling which ensures the same concentration rate as the original posterior.

Simulations
We conduct multiple simulation studies to (a) assess the accuracy of GBQL in estimating p in the presence of moderate amounts of labeled data, (b) compare the robustness of estimating-equation-based GBQL to a Dirichlet model-based approach under different data-generating mechanisms, (c) compare computational efficiency with Dirichlet models, and (d) assess estimation accuracy when there is uncertainty in the true labels in L.
To mimic the real data application we present in Section 6, we used N = 1000, n = 300, C = 5, p_L = E_L(y_r) = (1/C, . . . , 1/C)', and the following four different values of p, representing each of the four countries in the PHMRC dataset (Section 6):
p1 = (0.20, 0.19, 0.27, 0.27, 0.07)
p2 = (0.11, 0.11, 0.40, 0.29, 0.09)
p3 = (0.09, 0.18, 0.52, 0.19, 0.02)
p4 = (0.13, 0.30, 0.35, 0.19, 0.03)
We generated true labels y_r | p ∼ Multinomial(1, p), r ∈ U, and y_r | p_L ∼ Multinomial(1, p_L), r ∈ L. For the first analyses, we allow for full knowledge of these labels for r ∈ L, which means that b_r | y_r = i equals e_i for r ∈ L. We then simulated outputs a_r | y_r directly from a model, so that we know the true data-generating mechanism of the dataset shift. We use two data-generating mechanisms for a_r | y_r. The first mechanism corresponds to a zero-inflated Dirichlet mixture model, with a_r | y_r = i obtained by normalizing independent Gamma(5 M_ij, 1) variables across j (components with M_ij = 0 set exactly to 0). The second data-generating mechanism introduced subject-level over-dispersion, replacing the Gamma shape parameter 5 M_ij above with τ_r M_ij, where τ_r is a subject-specific over-dispersion parameter generated from the mixture distribution τ_r ∼ 0.5 · Unif(0.1, 1) + 0.5 · Unif(10, 20). Instances with τ_r ≤ 1 will have responses a_rj close to 0 and 1, while instances with large τ_r will have a_rj clustered closer to the nonzero entries of M.
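A sketch of one draw of a_r under the two mechanisms, assuming (as described above) that the zero-inflated Dirichlet draws are produced by normalizing independent Gamma(shape, 1) variables with zero-shape components set to 0; M_row stands for the row of M corresponding to the true class y_r.

```python
# Sketch of the two simulation mechanisms for a_r | y_r; M_row is hypothetical input.
import numpy as np

rng = np.random.default_rng(1)

def draw_prediction(M_row, overdispersed=False):
    if overdispersed:
        # subject-specific over-dispersion tau_r ~ 0.5 Unif(0.1, 1) + 0.5 Unif(10, 20)
        tau = rng.uniform(0.1, 1.0) if rng.random() < 0.5 else rng.uniform(10.0, 20.0)
        shape = tau * M_row
    else:
        shape = 5.0 * M_row
    g = np.where(shape > 0, rng.gamma(np.maximum(shape, 1e-12)), 0.0)
    return g / g.sum()
```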
We compare GBQL estimates of p with estimates from a Bayesian Dirichlet mixture model which assumes the first data-generating mechanism as the truth. For both the Dirichlet model and GBQL, we used Dirichlet priors for M shrinking toward I, and an uninformative Dirichlet prior for p. Since the Dirichlet distribution does not support zeros, for running the Dirichlet model, 0 values were replaced with ε = 0.001 and each a_r was re-normalized. Posterior sampling for this model was performed using RStan Version 2.19.2 (Stan Development Team 2019). Note that this model becomes misspecified under the second true data-generating mechanism.
For both models, we ran three chains, each with a total of 6000 draws and a burn-in of 1000 draws. We used the posterior mean of p as p̂.
To compare estimates of p, we use a chance-corrected version of the normalized absolute accuracy (NAA) (Gao and Sebastiani 2016) for estimating a compositional vector. NAA is defined as
NAA = 1 − Σ_i |p̂_i − p_i| / {2 (1 − min_i p_i)}.
To represent random guessing of p with a score of 0, and perfect estimation of p with a score of 1, we follow Flaxman et al. (2015) and use the chance-corrected NAA, CCNAA = (NAA − 0.632)/(1 − 0.632). We repeat our simulations 500 times for each choice of p and show the average CCNAA across these simulations in Figure 2. For case 1 (left panel), when the likelihood is correctly specified for the Dirichlet model, both methods produce accurate estimates of p and have approximately the same CCNAA. When we introduce over-dispersion to the distribution of a_r | y_r = i (right panel), we see that the performance of GBQL is hardly affected, and it substantially outperforms the now-misspecified Dirichlet model in all cases.
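For completeness, a sketch of these accuracy metrics as defined above:

```python
# Sketch of NAA and its chance-corrected version CCNAA used to score estimates of p.
import numpy as np

def naa(p_hat, p_true):
    return 1.0 - np.abs(p_hat - p_true).sum() / (2.0 * (1.0 - p_true.min()))

def ccnaa(p_hat, p_true):
    return (naa(p_hat, p_true) - 0.632) / (1.0 - 0.632)

p_true = np.array([0.20, 0.19, 0.27, 0.27, 0.07])   # p1 from the simulation settings
print(ccnaa(p_true, p_true))                        # perfect estimation scores 1
```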
When we investigated the Stan output for the Dirichlet models, many of the chains failed to converge when the likelihood was misspecified, as indicated by the Gelman-Rubin diagnostic (R̂) values in Table 1. Furthermore, on average the Stan Dirichlet model took nearly 200 times longer to run than the GBQL method (Table 1). For GBQL, all R̂ values indicated convergence. Thus, GBQL accurately estimates p, removes the need to correctly specify the likelihood, is fast, and does not require fine-tuning for the posterior samples to converge.
We now examine the behavior of GBQL in the case of uncertain labels. To induce this uncertainty, we generate the compositional b_r from the following over-dispersed Dirichlet distributions: b_r ~ Dirichlet(τ_r p) for r ∈ U and b_r ~ Dirichlet(τ_r p_L) for r ∈ L, where τ_r ~ 0.5 · Unif(0.1, 1) + 0.5 · Unif(10, 20), and then generate y_r | b_r ~ Multinomial(1, b_r). The data-generating process for the a_r is the same as in the simulations with known labels. The compositional b_r are used as the uncertain labels for r ∈ L. Figure 3 plots the average CCNAA from GBQL with known labels y against the CCNAA of GBQL with unknown labels b, for each value of p and data-generating mechanism. Introducing uncertainty in the labels results in slightly lower (up to 10%) CCNAA values, indicating the small price we pay for the added uncertainty.
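A short sketch of this uncertain-label mechanism, continuing the earlier simulation sketch (pL, n, and C as defined there):

```r
## Over-dispersed Dirichlet beliefs b_r centred at pL for the labeled set L,
## with the true label drawn from the belief vector.
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }

gen_belief <- function(center) {
  tau <- if (runif(1) < 0.5) runif(1, 0.1, 1) else runif(1, 10, 20)
  rdirichlet1(tau * center)
}
b_L <- t(replicate(n, gen_belief(pL)))                      # n x C uncertain labels
y_L <- apply(b_L, 1, function(b) sample(1:C, 1, prob = b))  # y_r | b_r ~ Multinomial(1, b_r)
```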

Additional Simulation Studies
We conducted additional simulation studies comparing the estimates, MCMC convergence, and computation time of the Gibbs sampler with a direct implementation of GBQL in Stan; comparing the performance of our Gibbs sampler for different choices of the coarsening factor; evaluating the coverage probabilities of our interval estimates; and comparing the sparse model of Section 2.6 with the full model.
These are provided in Section S4 of the supplementary material.

PHMRC Dataset Analysis
High-quality cause-of-death information is lacking for 65% of the world's population due to the scarcity of diagnostic autopsies in low- and middle-income countries (LMICs) (Nichols et al. 2018). Estimates of subnational and national cause-specific mortality fractions (CSMF) and burden-of-disease numbers therefore rely to a large extent on simple aggregation (classify-and-count) of verbal-autopsy-predicted causes of death. We apply GBQL to improve estimation of CSMF using predicted cause-of-death data from verbal autopsy classifiers.
An example of dataset shift arises in the Population Health Metrics Research Consortium (PHMRC) gold standard dataset (Murray et al. 2011), which contains 168 reported symptoms and gold-standard underlying causes of death for adults in 4 countries. There are 21 causes of death in total, which are aggregated into 5 broader cause-of-death categories. Figure 4 shows the percentage of subjects within each country and cause of death who report each symptom: the x-axis enumerates the entire list of symptoms x and the y-axis plots p(x | y) for each symptom x. With no dataset shift, we would expect the conditional response rates for each question within each cause of death to be similar across countries. However, as the country-specific lines are quite distinct in each sub-figure, this assumption is clearly violated. This leads to poor performance of verbal autopsy classifiers trained on symptoms and cause-of-death labels from 3 countries when predicting the cause-of-death distribution for the remaining country (McCormick et al. 2016).
We now apply GBQL using limited local data from each country to improve CSMF estimation from verbal autopsy classifiers trained on data from the other 3 countries. The numbers of observations within India, Mexico, Philippines, and Tanzania are 2973, 1586, 1259, and 2023, respectively. To address country-specific dataset shift, for each country we used the three remaining countries as training data for four methods commonly used for cause-of-death prediction: InterVA (Byass et al. 2012), InSilicoVA (McCormick et al. 2016), NBC (Miasnikof et al. 2015), and Tariff (Serina et al. 2015). The first three methods are probabilistic, while Tariff produces a score for each cause that needed to be normalized to lie in [0, 1]. Model training was done using the openVA package version 1.0.8 (Li, McCormick, and Clark 2019). We considered both compositional predictions and classifications (single-class categorical predictions based on the plurality rule). For GBQL in the test country, we then sampled labeled data L of varying sizes (n = 25, 100, 200, 400) to investigate the effect of increasing the number of known labels. Sampling was performed such that p_L = (1/5, ..., 1/5), as in Section 5. For comparison, we obtained estimates using the probabilistic average (PA, Bella et al. 2010) method for compositional predictions, which should align with the GBQL estimate for n = 0 (Section 2.5) for our choice of priors, as well as estimates using the adjusted PA (APA) method. We repeated this 500 times for each size of n. Results for the average CCNAA when using compositional predictions are shown in Figure 5(a). When no labeled instances are available, the APA method performs worse than the PA method across almost all countries and algorithms, demonstrating why, in the presence of dataset shift, it is not appropriate to estimate M using the training data.
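One simple way to draw the labeled subset so that p_L = (1/5, ..., 1/5) is stratified sampling of n/5 deaths per broad cause. A hypothetical sketch, where test_labels denotes the gold-standard causes (coded 1 to 5) for the test country:

```r
## Draw a balanced labeled subset L of size n from the test country.
## test_labels is a hypothetical vector of gold-standard causes (1..5).
sample_labeled <- function(test_labels, n, C = 5) {
  per_class <- n / C                                  # assumes n is a multiple of C
  unlist(lapply(1:C, function(i) sample(which(test_labels == i), per_class)))
}
## L_idx <- sample_labeled(test_labels, n = 100)      # indices of the labeled deaths
```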
We see that obtaining n = 25 labeled instances (an average of only 5 labeled deaths per class) does not yield any improvement over having no labeled test data (n = 0). However, increasing this to 100 labels (an average of 20 labeled deaths per class) leads to a large increase in CCNAA, indicating substantial improvement in the estimation of p across all countries and algorithms. As there are 168 covariates used for building these classifiers, using just 100 observations to build a reliable classifier would be difficult, if not impossible. Quantification accuracy continues to increase with a larger number of labeled observations across all countries and algorithms, although the extent of this improvement is quite variable. Figure 5(b) compares the CCNAA for GBQL using compositional predictions versus GBQL using single-class categorical predictions. Using the original compositional scores offers an improvement over categorization for all algorithms except Tariff for Philippines and Tanzania. Figure 5(a) also shows that classifier performance varies widely across settings. For example, among the PA estimates, InSilicoVA is best for Philippines, whereas NBC is most accurate for Tanzania.
We now look at the performance of our ensemble method, which uses predictions from all four algorithms. Figure 6 shows the CCNAA for the ensemble GBQL and for GBQL using individual classifiers, for different numbers of labeled observations and each country. With only 25 labeled observations, the ensemble CCNAA is approximately the average of the CCNAA for each of the other algorithms, which is what we would expect, as for n = 0 it is exactly the average, as discussed in Section 2.5. With more labeled observations, the ensemble begins either to outperform all of the individual methods or to achieve CCNAA very close to that of the top-performing method. Importantly, the ensemble method significantly outperforms the worst method for all combinations of country, output format, and number of labeled observations, showing that including multiple algorithms and using the ensemble quantification protects against inadvertently selecting the worst algorithm.
Finally, to illustrate the efficacy of GBQL even when true labels are observed with uncertainty, we create a toy dataset by randomly pairing individuals within a country in the PHMRC data. To introduce label uncertainty, for a pair of individuals r1 and r2 we let b_{r1,i} = b_{r2,i} = (1/2)(I(y_{r1,i} = 1) + I(y_{r2,i} = 1)). By using two individuals, each with a single (but possibly different) true label, we create two individuals with uncertain observed labels in such a way that the total number of individuals with a given cause remains the same in this new dataset as in the actual PHMRC dataset. This data generation satisfies the assumption that p(y_r = i | b_r) = b_{ri}. We then used these beliefs instead of the true labels as input to our method. Figure 7 compares the CCNAA for the individual methods across each value of n for compositional predictions when using the known labels versus representing uncertainty in the labels through beliefs. The performance of GBQL is similar for both types of inputs; the CCNAA were slightly worse when labels are observed with uncertainty, as in Section 5.
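The pairing construction is easy to emulate; a hypothetical sketch, where y is an indicator matrix of gold-standard causes for one country:

```r
## Randomly pair individuals within a country and give both members of a pair
## the averaged indicator as their belief vector b (uncertain label).
make_paired_beliefs <- function(y) {
  n    <- nrow(y)
  perm <- sample(n)                                # random pairing
  b    <- y
  for (k in seq(1, n - 1, by = 2)) {
    r1 <- perm[k]; r2 <- perm[k + 1]
    b[r1, ] <- b[r2, ] <- (y[r1, ] + y[r2, ]) / 2  # b = (1/2)(I(y_r1) + I(y_r2))
  }
  b                                                # cause totals are unchanged
}
```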

Discussion
Quantification is an important and challenging problem that has only recently gained the attention it deserves. The commonly used methods, CC (Forman 2005), ACC (Forman 2005), PA (Bella et al. 2010), and APA (Bella et al. 2010), share an important limitation: they do not account for dataset shift. In the absence of local labeled data, GBQL with specific choices of priors yields model-based analogs of each of these methods and provides a probabilistic framework around these approaches to conduct inference beyond point estimation. In the presence of local test data, GBQL leverages it and substantially improves quantification over these previous approaches. In such settings, GBQL extends BTL (Datta et al. 2018), which does not allow uncertainty in either the predicted or the true labels. In summary, GBQL generalizes all of these methods, allowing for unified treatment of both categorical and compositional classifier output, incorporation of training data (through priors) and labeled test data, and uncertain knowledge of labeled data classes.
Appealing to the generalized Bayes framework, our Bayesian estimating equations and the KLD loss functions rely only on a simple first-moment assumption for compositional data that circumvents the need for full model specification even in a Bayesian setting. The loss-function approach easily extends to harmonize output from multiple classifiers, leading to a unified ensemble method that is a pragmatic solution guarding against the inadvertent inclusion of a poorly performing classifier in the pool of algorithms. The Bayesian paradigm enables the use of shrinkage priors to inform the estimation of M and p when only limited labeled data from the test set are available. The GBQL Gibbs posterior can be approximated using our customized rounded and coarsened Gibbs sampler, leading to fast posterior sampling compared to off-the-shelf samplers.
There was no theory justifying the BTL method and, more generally, to our knowledge there is no theory for quantification learning methods under dataset shift. We offer a comprehensive theory for GBQL, including posterior consistency, asymptotic normality, valid coverage of interval estimates, and finite-sample concentration rates. All results require only a first-moment assumption and are robust to lack of knowledge of the full data distribution. Finally, extensive simulations and the PHMRC data analysis show that GBQL is robust to model misspecification and uncertainty in true labels, and significantly improves quantification in the presence of dataset shift.
The estimating equation (13) is of independent importance beyond quantification learning. It offers a novel and direct method for generalized Bayes regression of a compositional response a_r on a compositional covariate b_r, allowing 0's and 1's in both variables and without requiring a full distributional specification (like the Dirichlet distribution) or data transformations (like the hard-to-interpret log-ratio transformations) customary in the analysis of compositional data. We do not pursue this further here as it is beyond the scope of the article, but we point out that an efficient posterior sampling algorithm for such a Bayesian composition-on-composition regression follows directly from a part of the coarsened sampler. Similarly, all of the general theoretical results for quantification learning presented here, that is, asymptotic consistency, normality, validity of coverage, and finite-sample concentration rates, can immediately be applied to this setting to ensure analogous guarantees for the Bayesian composition-on-composition regression.
A future direction is to extend the methodology to a continuous version of the problem; that is, instead of predicting probabilities for C categories, for each datapoint the classifiers now predict a = (a^(1), ..., a^(S)), a sample of S predicted (real-valued) labels. A simple solution for the continuous case would be to discretize the domain into C bins B_1, ..., B_C and compute the empirical proportion of predicted labels in each bin, thereby transforming the sample-valued data into compositional data and subsequently applying GBQL. A more general solution that avoids unnecessary discretization would be to write the continuous version of Equation (1) as f(a) = Σ_y f(a | y) p(y), where f(a) and f(a | y) denote, respectively, the marginal and conditional densities. This leads to the moment equations E(a^j) = Σ_y E(a^j | y) p(y) for all j ≥ 1 for which E(a^j) exists. We can estimate q_j = E(a^j) from the unlabeled set U using the empirical j-th moment of the predicted labels. Since the moment generating function uniquely defines the distribution, letting q = (q_1, q_2, ...) and M = (M_ij), we can solve for the quantity of interest p(y) = p as the minimizer of some norm ||q − Mp|| subject to p lying on the simplex. We can also extend this to continuous true labels y by using the equation f(a) = ∫ f(a | y) f(y) dy. Expressing f(y) generally as a mixture density, f(y) = Σ_{h=1}^H w_h f_h(y; μ_h) for some known densities f_h, we can solve for the unknown parameters μ_h and the unknown weights w_h using the moment equations q_k = Σ_h w_h ∫ m_{tk} f_h(t; μ_h) dt, where m_{tk} = E(a^k | y = t).
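As an illustration of the moment-matching step, the sketch below solves min_p ||q − Mp||_2 over the simplex by quadratic programming using the quadprog package; the helper solve_p, the choice of the Euclidean norm, the small ridge term, and the pooled empirical-moment estimator in the closing comment are all illustrative assumptions rather than a fixed recipe.

```r
## Simplex-constrained least squares for p given empirical moments q and a
## (hypothetical) moment matrix M with M[j, i] = E(a^j | y = i).
library(quadprog)

solve_p <- function(q, M) {
  C <- ncol(M)
  D <- 2 * crossprod(M) + 1e-8 * diag(C)          # small ridge for numerical stability
  d <- as.vector(2 * crossprod(M, q))
  A <- cbind(rep(1, C), diag(C))                  # sum-to-one (equality), then p_i >= 0
  b <- c(1, rep(0, C))
  solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b, meq = 1)$solution
}

## One natural estimator of q_j from the unlabeled set pools all predicted samples:
## q[j] <- mean(a_samples^j), with a_samples the pooled a_r^(s), r in U, s = 1..S.
```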

Supplementary materials
A Supplementary Materials file containing proofs of all the theoretical results, additional details on the Gibbs sampler, and more numerical studies is available online.