
Abstract

Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods assume that we observe all confounders, variables that affect both the causal variables and the outcome variables. This assumption is standard but it is also untestable. In this article, we develop the deconfounder, a way to do causal inference with weaker assumptions than the traditional methods require. The deconfounder is designed for problems of multiple causal inference: scientific studies that involve multiple causes whose effects are simultaneously of interest. Specifically, the deconfounder combines unsupervised machine learning and predictive model checking to use the dependencies among multiple causes as indirect evidence for some of the unobserved confounders. We develop the deconfounder algorithm, prove that it is unbiased, and show that it requires weaker assumptions than traditional causal inference. We analyze its performance in three types of studies: semi-simulated data around smoking and lung cancer, semi-simulated data around genome-wide association studies, and a real dataset about actors and movie revenue. The deconfounder is an effective approach to estimating causal effects in problems of multiple causal inference. Supplementary materials for this article are available online.

1 Introduction

Here is a frivolous, but perhaps lucrative, causal inference problem. Table 1 contains data about movies. For each movie, the table shows its cast of actors and how much money the movie made. Consider a movie producer interested in the causal effect of each actor; for example, how much does revenue increase (or decrease) if Oprah Winfrey is in the movie?

Table 1 The TMDB dataset of movie earnings.

To solve this problem, the producer wants to use the potential outcomes approach to causal inference (Rubin 1974, 2005; Imbens and Rubin 2015). Following this methodology, she associates each movie with a potential outcome function, yi(a). This function maps each possible cast a to the revenue movie i would have made with that cast. (The cast a is a binary vector with one element per actor; each element encodes whether the actor is in the movie.) The potential outcome function encodes, for example, how much money Star Wars would have made if Robert Redford had replaced Harrison Ford as Han Solo. When doing causal inference, the producer’s goal is to estimate something about the population distribution of Yi(a). For example, she might consider a particular cast a and estimate the expected revenue of a movie with that cast, E[Yi(a)].

Traditionally, causal inference from observational data is a difficult enterprise and requires strong assumptions. The challenge is that the dataset is limited; it contains the revenue of each movie, but only at its assigned cast. However, the producer’s problem is not a traditional causal inference problem. While causal inference usually considers a single possible cause, such as whether a subject receives a drug or a placebo, our producer faces a problem of multiple causal inference, where each actor might causally contribute to the revenue. This article shows how multiple causal inference can be easier than traditional causal inference. Thanks to the multiplicity of causes, the producer can make causal inferences under weaker assumptions than the traditional approaches require.

Let’s discuss the producer’s inference in more detail: how can she calculate E[Yi(a)]? Naively, she subsets the data in Table 1 to those with cast equal to a, and then computes a Monte Carlo estimate of the revenue. This procedure is unbiased when E[Yi(a)]=E[Yi(a) | Ai=a].

But there is a problem. The data in Table 1 hide confounders, variables that affect both the causes and the effect. For example, every movie has a genre, such as comedy, action, or romance. This genre has an effect on both who is in the cast and the revenue. (E.g., action movies cast a certain set of actors and tend to make more money than comedies.) When left unobserved, the genre of the movie produces a statistical dependence between whether an actor is cast and the revenue; this dependence biases the causal estimates, E[Yi(a) | Ai=a] ≠ E[Yi(a)].

Thus, the main activities of traditional causal inference are to identify, measure, and control for confounders. Suppose the producer measures confounders wi for each movie. Then inference is simple: use the data (now with confounders) to take Monte Carlo estimates of E[E[Yi(a) | Wi, Ai=a]]; this iterated expectation “controls” for the confounders. But the problem is that whether the estimate is equal to E[Yi(a)] rests on an uncheckable assumption: there are no other confounders. For many applied causal inference problems, this assumption is a leap of faith.

We develop the deconfounder, an alternative method for the producer who worries about missing a confounder. First the producer finds and fits a good latent-variable model to capture the dependence among actors. It should be a factor model, one that contains a per-movie latent variable that renders the assigned cast conditionally independent. (Probabilistic principal component analysis (Tipping and Bishop 1999) is a simple example, but there are many others.) Given the model, she then estimates the per-movie variable for each cast in the dataset; this estimated variable is a substitute for unobserved confounders. Finally, she controls for the substitute confounder and obtains valid causal inferences.

All methods for causal inference rely on assumptions. The deconfounder makes two. First, it assumes that the fitted latent-variable model is a good model of the assigned causes. This assumption is testable, and we will use predictive checks to assess how well the fitted model captures the data. Second, it assumes that there are no unobserved single-cause confounders, variables that affect one cause (e.g., actor) and the potential outcome function (e.g., revenue). While this assumption is not testable, it is weaker than the usual assumption of unconfoundedness, which requires no unobserved confounders.

Subject to the assumptions, the deconfounder provides valid causal inferences because it capitalizes on the dependency structure of the observed casts. It uses patterns of how actors tend to appear together in movies as indirect evidence for confounders in the data.

Beyond making movies, many causal inference problems, especially with observational data, also qualify as multiple causal inference. Such problems arise in many fields.

  • Genome-wide association studies (GWAS). In GWAS, biologists want to know how genes causally connect to traits (Stephens and Balding 2009; Visscher et al. 2017). The assigned causes are alleles on the genome, often encoded as either being common (“major”) or uncommon (“minor”), and the effect is the trait under study. Confounders, such as shared ancestry among the population, bias naive estimates of the effect of genes. We study GWAS problems in Section 6.2.

  • Computational neuroscience. Neuroscientists want to know how the electrical activity of neurons produces observed behavior, such as limb movement (Churchland et al. 2012). The possible causes are multiple measurements about the brain’s activity, for example, one per neuron, and the effect is a measured behavior. Confounders, particularly through dependencies among neural activity, bias the estimated connections between brain activity and behavior.

  • Social science. Sociologists and policy-makers want to know how social programs affect social outcomes, such as poverty levels and upward mobility (Morgan and Winship 2015). However, individuals may enroll in several such programs, blurring information about their possible effects. In social science, controlled experiments are difficult to engineer; using observational data for causal inference is typically the only option.

  • Medicine. Doctors want to know how medical treatments affect the progression of disease. The multiple causes are medications and procedures; the outcome is a measurement of a disease (e.g., a lab test). There are many confounders—such as when and where a patient is treated or the treatment preferences of the attending doctor—and these variables bias the estimates of effects. While gold-standard data from clinical trials are expensive to obtain, the abundance of electronic health records could inform medical practices.

  • Recommender systems. Technology companies want to know whether recommending different items to a user will increase revenue. The multiple causes are the recommendation of each item; the outcome is the total revenue of the company. However, the past purchase history of the users affects both which items are recommended and which items they buy, that is, the revenue. Users’ past purchase history thus confounds the observed effect of recommendation.

All of these problems of causal inference can use the deconfounder. Fit a good factor model of the assigned causes, infer substitute confounders, and use the substitutes in causal inference.

1.1 Related Work

The deconfounder relates to several threads of research in causal inference.

1.1.1 Probabilistic Modeling for Causal Inference

Several lines of work use probabilistic modeling to aid causal inference. Mooij et al. (2010) use Gaussian processes to depict causal mechanisms; Zhang and Hyvärinen (2009) study post-nonlinear causal models and their identifiability; Mckeigue et al. (2010) build on sparse methods to infer causal structures; Moghaddass, Rudin, and Madigan (2016) use factor models to generalize the self-controlled case series method to multiple causes and multiple outcomes. Louizos et al. (2017) use variational autoencoders to infer unobserved confounders from proxy variables; Shah and Meinshausen (2018) develop projection-based techniques for high-dimensional covariance estimation under latent confounding; Frot, Nandy, and Maathuis (2019) use linear factor models for robust causal structure learning with hidden variables; and Kaltenpoth and Vreeken (2019) leverage information-theoretic principles to differentiate causal and confounded connections.

With a related goal, Tran and Blei (2017) build implicit causal models. Like the GWAS example in Section 6.2, they take an explicit causal view of genome-wide association studies (gwas), treating the single-nucleotide polymorphisms (snps) as the multiple causes. They connect implicit probabilistic models and nonparametric structural equation models for causal inference (Pearl 2009), and develop inference algorithms for capturing shared confounding. Heckerman (2018) studies the same scenario with linear regression, where observing many causes makes it possible to account for shared confounders. Multiple causal inference and latent confounding were also formalized by Ranganath and Perotte (2018), who take an information-theoretic approach.

Most of these articles use Pearl’s framework (Pearl 2009); they hypothesize a causal graph with confounders, causes, and outcomes. This article complements these works. We develop the deconfounder in the potential outcomes framework (Rubin 1974, 2005; Imbens and Rubin 2015).

1.1.2 Analyzing GWAS

In gwas, latent population structure is an important unobserved confounder. Pritchard et al. (2000) propose a probabilistic admixture model for unsupervised ancestry inference. Price et al. (2006) and Astle and Balding (2009) estimate the unobserved population structure using the principal components of the genotype matrix. Yu et al. (2006) and Kang et al. (2010) estimate the population structure via the “kinship matrix” on the genotypes. Song, Hao, and Storey (2015) and Hao, Song, and Storey (2015) rely on factor analysis and admixture models to estimate the population structure. GTEx Consortium et al. (2017) adopt a similar idea to study the effect of genetic variations on gene expression levels. These methods can be seen as variants of the deconfounder (see Appendix A in the supplementary materials). The deconfounder gives them a rigorous causal justification, provides principled ways to compare them, and suggests an array of new approaches. We study gwas data in Section 6.2.

1.1.3 Assessing the Unconfoundedness Assumption

Rosenbaum and Rubin (1983) demonstrate that unconfoundedness and a good propensity score model are sufficient to perform causal inference with observational data. Many subsequent efforts assess the plausibility of unconfoundedness. For example, Robins, Rotnitzky, and Scharfstein (2000), Gilbert, Bosch, and Hudgens (2003), and Imai and Van Dyk (2004) develop sensitivity analysis in various contexts, though focusing on data with a single cause. In contrast, this work uses predictive model checks to assess unconfoundedness with multiple causes. More recently, Sharma, Hofman, and Watts (2016) leverage auxiliary outcome data to test for confounding; Janzing and Schölkopf (2018a, 2018b) and Liu and Chan (2018) develop tests for non-confounding in multivariate linear regression; Cinelli et al. (2019) develop sensitivity analysis for linear causal models; and Franks, D’Amour, and Feller (2019) design flexible sensitivity analysis for causal inference with one binary treatment. Here, we work without auxiliary data, focus on causal estimation (as opposed to testing), and move beyond linear models and a single treatment.

1.1.4 The (Generalized) Propensity Score

Schneeweiss et al. (2009), McCaffrey, Ridgeway, and Morral (2004), Lee, Lessler, and Stuart (2010), and many others develop and evaluate different models for assigned causes. In particular, Chernozhukov et al. (2017) introduce a semiparametric assignment model; they propose a principled way of correcting for the bias that arises when regularizing or overfitting the assignment model. The work in this article introduces latent variables into the assignment model. The multiplicity of causes enables us to infer these latent variables and then use them as substitutes for unobserved confounders.

1.1.5 Traditional Causal Inference With Multiple Treatments

Lopez and Gutman (2017), McCaffrey et al. (2013), Zanutto, Lu, and Hornik (2005), Rassen et al. (2011), Lechner (2001), and Feng et al. (2012) extend matching, subclassification, and weighting to multiple treatments, always assuming no unobserved confounders. This work relaxes that assumption to no unobserved single-cause confounders.

1.2 This Article

Section 2 reviews traditional causal inference, sets up multiple causal inference, and presents the deconfounder. Section 3 describes the identification strategy of the deconfounder and its main assumptions. Section 4 discusses the practical details of the deconfounder and presents the full algorithm. Section 5 answers some questions a reader might have. Section 6 presents three empirical studies, two semi-synthetic and one real. Section 7 further develops the theory around the deconfounder and establishes causal identification. Section 8 concludes the article.

2 Multiple Causal Inference With the Deconfounder

In this section, we discuss the problem of multiple causal inference and develop the deconfounder.

2.1 Multiple Causal Inference

We first describe multiple causal inference. In the data, there are m possible causes, encoded in a vector a = (a1, …, am). We can consider a variety of types: real-valued causes, binary causes, integer causes, and so on. In the example of movie revenue, the causes are binary: aj encodes whether actor j is in the movie.

For each individual i (movie) there is a potential outcome function that maps configurations of causes to the outcome (revenue). We focus on real-valued outcomes. For the ith movie, the potential outcome function maps each possible cast to the log of the movie’s revenue had it had that cast, yi(a): {0,1}^m → ℝ.

The goal of causal inference is to characterize the sampling distribution of the potential outcomes Yi(a) for each configuration of the causes a. This distribution provides causal inferences, such as the expected outcome for a particular array of causes (a particular cast of actors) μ(a)=E[Yi(a)] or the average effect of individual causes (how much a particular actor contributes to revenue).

To help make causal inferences, we draw data from the sampling distribution of assigned causes ai (the cast of movie i) and realized outcomes yi(ai) (its revenue). The data is D = {(ai, yi(ai)) : i = 1, …, n}. Note we only observe the outcome for the assigned causes yi(ai), which is just one of the values of the potential outcome function. But we want to use such data to characterize the full distribution of Yi(a) for any a; this is the “fundamental problem of causal inference” (Holland 1986).

To estimate μ(a), consider using the data to calculate conditional Monte Carlo approximations of E[Yi(a) | Ai=a]. These estimates are simply averages of the outcomes for each configuration of the causes. But this approach may not be accurate. There might be unobserved confounders—hidden variables that affect both the assigned causes Ai and the potential outcome function Yi(a). When there are unobserved confounders, the assigned causes are correlated with the observed outcome. Consequently, Monte Carlo estimates of μ(a) are biased, E[Yi(a) | Ai=a] ≠ E[Yi(a)]. (1)

We can estimate E[Yi(a) | Ai=a] with the dataset; but the goal is to estimate E[Yi(a)].

Suppose we measure covariates xi and append them to each data point, D = {(ai, xi, yi(ai)) : i = 1, …, n}. If these covariates contain all confounders then E[E[Yi(a) | Xi, Ai=a]] = E[Yi(a)]. (2)

With augmented data, estimate the left side with Monte Carlo; thus, estimate E[Yi(a)].
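As a concrete illustration of Equations (1) and (2), the sketch below simulates a single binary cause with one binary confounder; all numbers are invented for illustration. The naive contrast E[Y | A=1] − E[Y | A=0] is biased, while averaging the within-stratum contrasts over the distribution of the confounder recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical binary confounder x (e.g., genre: 1 = action, 0 = comedy)
x = rng.binomial(1, 0.5, n)
# Cause assignment depends on x: "action movies" cast this actor more often
a = rng.binomial(1, 0.2 + 0.6 * x)
# Outcome depends on both; the true causal effect of a is 1.0
y = 1.0 * a + 2.0 * x + rng.normal(0, 1, n)

# Naive contrast: biased because x is ignored (Equation (1))
naive = y[a == 1].mean() - y[a == 0].mean()

# Adjusted contrast: average E[Y | X=v, A=a] over the distribution of X
# (the iterated expectation of Equation (2))
adjusted = sum(
    (y[(x == v) & (a == 1)].mean() - y[(x == v) & (a == 0)].mean()) * (x == v).mean()
    for v in (0, 1)
)
```

With these settings the naive contrast lands well above the true effect of 1.0, while the adjusted contrast is close to it.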

Equation (2) is true when X captures all confounders. More precisely, it is true under the assumption of unconfoundedness (Rosenbaum and Rubin 1983; Imai and Van Dyk 2004): conditional on observed X, the assigned causes are independent of the potential outcomes, Ai ⊥ Yi(a) | Xi, for all a. (3)

The nuance is that Equation (3) needs to hold for all possible a’s, not only for the value of Yi(a) at the assigned causes. Unconfoundedness implies no unobserved confounders.

Equation (2) underlies the practice of causal inference: find and measure the confounders, estimate conditional expectations, and average. In the introduction, for example, we pointed out that the genre of the movie is a confounder to causal inference of movie revenues. The genre affects both which cast is selected and the potential earnings of the film. But the assumption that there are no unobserved confounders is significant. One of the central challenges around causal inference from observational data is that unconfoundedness is untestable—it fundamentally depends on the entire potential outcome function, of which we only observe one value (Holland 1986).

2.2 The Deconfounder

We develop the deconfounder, an algorithm that uses the multiplicity of causes to infer unobserved confounders. There are three steps. First, find a good latent variable model of the assignment mechanism. (A good model is one that accurately captures the joint distribution of the causes.) Second, use the model to infer the latent variable for each individual. Finally, use the inferred variable as a substitute for unobserved confounders and form causal inferences.

We explain the method and discuss why and when it provides unbiased causal inferences.

In the first step of the deconfounder, define and fit a probabilistic factor model to capture the joint distribution of causes p(a1, …, am). A factor model posits per-individual latent variables Zi, which we call local factors, and uses them to model the assigned causes. The model is Zi ~ p(· | α), i = 1, …, n; Aij | Zi ~ p(· | zi, θj), j = 1, …, m, (4) where α parameterizes the distribution of Zi and θj parameterizes the per-cause distribution of Aij. Notice that Zi can be multi-dimensional. Factor models encompass many methods from Bayesian statistics and probabilistic machine learning. Examples include probabilistic PCA (Tipping and Bishop 1999), mixture models (McLachlan and Basford 1988), mixed-membership models (Pritchard et al. 2000; Blei, Ng, and Jordan 2003; Erosheva 2003; Airoldi et al. 2008), and deep generative models (Neal 1990; Kingma and Welling 2013; Rezende and Mohamed 2015; Mohamed and Lakshminarayanan 2016; Ranganath et al. 2015; Ranganath, Tran, and Blei 2016; Tran et al. 2017).

We can fit the factor model using any appropriate method, such as maximum likelihood estimation or Bayesian inference. Exact fitting is not required; we can use approximate methods like the EM algorithm, Markov chain Monte Carlo, or variational inference. What the deconfounder requires is that the fitted factor model provide an accurate approximation of the population distribution p(a).
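One simple way to gauge how well a fitted factor model approximates p(a) is a held-out predictive score. The sketch below (on invented simulated data) fits probabilistic PCA at several latent dimensions and compares average held-out log-likelihoods, which sklearn's PCA.score computes under the Tipping–Bishop model; a badly underfit model scores worse.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n, m, k_true = 2000, 20, 3

# Simulated real-valued causes drawn from a k_true-dimensional factor model
z = rng.normal(size=(n, k_true))
theta = rng.normal(size=(k_true, m))
A = z @ theta + rng.normal(scale=0.5, size=(n, m))
train, heldout = A[:1500], A[1500:]

# Compare candidate factor models by average held-out log-likelihood;
# PCA.score evaluates the probabilistic PCA likelihood of Tipping and Bishop
scores = {k: PCA(n_components=k).fit(train).score(heldout) for k in (1, 3, 10)}
```

Here the model with the true latent dimension scores clearly better than the underfit one-dimensional model. (This is only a crude stand-in for the predictive checks developed later in the article.)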

In the next step, use the fitted factor model to calculate the conditional expectation of each individual’s local factors, ẑi = E_M[Zi | Ai = ai]. We emphasize that this expectation is taken under the fitted model M (not the population distribution). We can use approximate expectations.

In the final step, condition on ẑi as a substitute confounder and proceed with causal inference. For example, estimate E[Yi(a)]=E[E[Yi(a) | Ẑi,Ai=a]].
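The three steps can be sketched end to end on simulated data. This is a stylized linear example, with all names and numbers invented for illustration: a hidden multi-cause confounder drives all m causes and the outcome; plain PCA stands in for the probabilistic factor model; and the effect of a single cause is estimated with and without the substitute confounder.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, m = 5000, 30

# Unobserved multi-cause confounder u affects every cause and the outcome
u = rng.normal(size=n)
A = u[:, None] + rng.normal(size=(n, m))          # m real-valued assigned causes
y = 1.0 * A[:, 0] + 2.0 * u + rng.normal(size=n)  # true effect of cause 0 is 1.0

# Steps 1-2: fit a one-factor model to the causes and compute the
# substitute confounder z_hat for each individual
z_hat = PCA(n_components=1).fit_transform(A)

# Step 3: control for z_hat when estimating the effect of cause 0
naive = LinearRegression().fit(A[:, [0]], y).coef_[0]
adjusted = LinearRegression().fit(
    np.column_stack([A[:, 0], z_hat[:, 0]]), y
).coef_[0]
```

The naive coefficient absorbs the confounding through u, while the coefficient that controls for the substitute confounder is close to the true effect of 1.0.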

Why is this strategy sensible? Assume the fitted factor model captures the (unconditional) distribution of assigned causes p(ai1, …, aim). This means that all causes are conditionally independent given the local latent factors, p(ai1, …, aim | zi) = ∏_{j=1}^{m} p(aij | zi). (5)

Now make an additional assumption: there are no single-cause confounders, variables that affect just one of the assigned causes and the potential outcome function. (More precisely, we need to have observed all the single-cause confounders.) With this assumption, the independence statement of Equation (5) implies unconfoundedness, Ai ⊥ Yi(a) | Zi, and unconfoundedness justifies causal inference. In summary, if the factor model captures the distribution of assigned causes—a testable proposition—then we can use ẑi as a variable that contains the (multi-cause) confounders.

The graphical model in Figure 1 justifies the deconfounder and reveals its assumptions. Suppose we observe a Zi such that the conditional independence in Equation (5) holds. Further suppose there exists an unobserved multi-cause confounder Ui (illustrated in red), which connects to multiple assigned causes and the outcome. If such a Ui exists then the causes would be dependent, even conditional on Zi. (This fact comes from d-separation.) But such dependence leads to a contradiction, specifically that Equation (5) does not hold. Thus, Ui cannot exist.

Fig. 1 A graphical model argument for the deconfounder. The punchline is that if Zi renders the Aij’s conditionally independent then there cannot be a multi-cause confounder. The proof is by contradiction. Assume conditional independence holds, p(ai1, …, aim | zi) = ∏_j p(aij | zi); if there exists a multi-cause confounder Ui (red) then, by d-separation, conditional independence cannot hold (Pearl 1988). Note we cannot rule out the single-cause confounder Si (blue).

There is a nuance. The conditional independence in Equation (5) cannot rule out the existence of an unobserved single-cause confounder, denoted Si in Figure 1. Even if such a confounder exists, the conditional independence still holds.

Here is the punchline. If we find a factor model that accurately represents the distribution of causes then that model can provide a variable that captures the unobserved multiple-cause confounders. The reason is that the multiple-cause confounders induce dependence among the causes; a good factor model provides a variable that renders the causes conditionally independent; thus, that variable captures the confounders. This is the blessing of multiple causes.

3 The Identification Strategy of the Deconfounder

How does the deconfounder identify potential outcomes? The classical strategy for causal identification is that unconfoundedness, together with stable unit treatment value assumption (sutva) and overlap, identifies the potential outcomes (Imbens 2000; Hirano and Imbens 2004; Imai and Van Dyk 2004). The deconfounder continues to assume sutva and overlap, but it weakens the unconfoundedness assumption.

Roughly, unconfoundedness requires that there are no unobserved confounders. To weaken this assumption, the deconfounder constructs a substitute confounder that captures multiple-cause confounders. (The proof is in Section 7.) Uncovering multi-cause confounders from data weakens the unconfoundedness assumption to one of no unobserved single-cause confounders.

Thus, the deconfounder relies on three main assumptions: (1) sutva (Rubin 1980, 1990); (2) no unobserved single-cause confounders; (3) overlap (Imai and Van Dyk 2004).

3.1 Stable Unit Treatment Value Assumption (SUTVA)

The stable unit treatment value assumption (sutva) requires that the potential outcomes of one individual are independent of the assigned causes of another individual. It assumes that there is no interference between individuals and there is only a single version of each assigned cause. See Rubin (1980, 1990) and Imbens and Rubin (2015) for discussion of this assumption.

3.2 No Unobserved Single-Cause Confounders

“No unobserved single-cause confounders” requires that we observe any confounders that affect only one of the causes; see Figure 1. (The precise technical definition is in Definition 4 of Section 7.)

This assumption is weaker than the classical assumption of unconfoundedness, which requires “no unobserved confounders.” That said, whether the assumption is plausible depends on the particulars of the problem. Note that “no unobserved single-cause confounders” reduces to “no unobserved confounders” when there is only one cause; all confounders are single-cause in this case.

When might “no unobserved single-cause confounders” be plausible? Consider the movie-actor example. One possible confounder is the reputation of the director. Famous directors have access to a circle of capable actors; they also tend to make good movies with large revenues. If the dataset contains many actors, it is likely that several are in the circle of capable actors; the director’s reputation is a multi-cause confounder. (If only one actor in the dataset is capable then the director’s reputation is a single-cause confounder.)

Or consider the gwas problem. If a confounder affects SNPs—and we observe 100,000 SNPs per individual—then the confounder may be unlikely to have an effect on only one. The same reasoning can apply to other settings—medications in medical informatics data, neurons in neuroscience recordings, and vocabulary terms in text data.

By the same token, “no unobserved single-cause confounders” may not be satisfied when there are very few assigned causes. Consider the neuroscience problem of inferring the relationship between brain activity and animal behavior, but where the scientist only records the activity of a small number of neurons. While it is unlikely that a confounder affects only one neuron in the brain, it is more plausible that a confounder affects only one of the observed neurons. This would violate “no unobserved single-cause confounders.”

In domains where “no unobserved single-cause confounders” is likely not satisfied, we suggest performing sensitivity analysis (Robins, Rotnitzky, and Scharfstein 2000; Gilbert, Bosch, and Hudgens 2003; Imai and Van Dyk 2004; Cinelli et al. 2019; Franks, D’Amour, and Feller 2019) on the deconfounder estimates. Sensitivity analysis assesses the robustness of the estimates against unobserved single-cause confounding. In the context of gwas, Section 6.2 will illustrate the effect of violating “no unobserved single-cause confounders.”

3.3 Overlap

The final main assumption of the deconfounder is that the substitute confounder Zi satisfies the overlap condition, p(Ai ∈ A | Zi) > 0 for all sets A with positive measure, i.e., p(A) > 0. (6)

Overlap asserts that, given the substitute confounder, the conditional probability of any vector of assigned causes is positive. This assumption is sometimes stated as the second half of ignorability (Imai and Van Dyk 2004).

The potential outcome Yi(a) is not identifiable if the substitute confounder does not satisfy overlap. When overlap is limited, that is, when p(Ai ∈ A | Zi) is small for all values of Zi, the deconfounder estimates of the potential outcome Yi(a) will have high variance.

For probabilistic factor models, the overlap condition is usually satisfied. For example, probabilistic PCA assumes Aij | Zi ~ N(ziᵀθj, σ²). The normal distribution has support over the real line, which ensures P(Ai ∈ A | Zi) > 0 for all A with positive measure. That said, as the dimensionality of Zi increases, overlap often becomes increasingly limited (D’Amour et al. 2017). For example, probabilistic PCA returns increasingly small σ², which signals that P(Ai ∈ A | Zi) is small.
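This shrinkage is easy to observe. The sketch below (simulated data, illustrative dimensions) fits probabilistic PCA at increasing latent dimension K and records the fitted σ²; sklearn's PCA exposes it as noise_variance_, the average of the discarded eigenvalues under the Tipping–Bishop model.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n, m = 1000, 40

# Simulated causes with a true 3-dimensional factor structure plus noise
A = rng.normal(size=(n, 3)) @ rng.normal(size=(3, m)) + rng.normal(size=(n, m))

# sigma^2 under probabilistic PCA shrinks as K grows, signalling
# increasingly limited overlap
sigma2 = {k: PCA(n_components=k).fit(A).noise_variance_ for k in (1, 5, 20, 39)}
```

As K approaches the number of causes, the fitted noise variance collapses toward zero, which is one symptom of the limited-overlap regime discussed above.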

We can enforce overlap by constraining the allowable family of factor models. With continuous causes, we restrict to models with continuous densities on ℝ. (We assume the causes are full rank, that is, that no two causes are measurable functions of each other; if such a pair exists, merge them into a single cause.) With discrete causes, we restrict to factor models with support on the whole space of causes and a Zi of lower dimension than the causes.

Alternatively, we can merge highly correlated causes as a preprocessing step. For example, consider two causes that are always assigned the same value, for example, two actors who either both appear in a movie or both do not. We can merge them into one cause. This merging step prevents the deconfounder from extrapolating to assigned causes for which the data carry little evidence. We can also resort to classical strategies for causal inference under limited overlap, for example, subsampling the population (Crump et al. 2009).
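A minimal version of this preprocessing step, assuming binary cause columns and treating "highly correlated" as "identical", can collapse duplicate columns with np.unique:

```python
import numpy as np

# Hypothetical binary cast matrix (rows = movies, columns = actors)
# where columns 0 and 3 are identical: those two actors always co-occur
A = np.array([[1, 0, 1, 1],
              [0, 1, 0, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 1]])

# np.unique over columns merges perfectly correlated (identical) causes;
# col_index maps each original cause to its merged cause
A_merged, col_index = np.unique(A, axis=1, return_inverse=True)
```

After merging, the two always-co-occurring actors become a single cause, and col_index records which merged cause each original actor belongs to.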

How can we assess overlap with respect to the substitute confounder? With a fitted factor model, we can analyze the conditional distribution of the assigned causes given the substitute confounder, P(Aij | Zi), for all individuals i. A conditional with low variance or low entropy signals limited overlap and the possibility of high-variance causal estimates.

3.4 The Deconfounder Is Unbiased

We have described the main assumptions of the deconfounder. With sutva, overlap, and no unobserved single-cause confounders, we use the deconfounder to estimate causal quantities. Note that point identification of causal quantities requires further assumptions; see Section 7 for a discussion of these additional assumptions.

The deconfounder (informal version of Theorem 6). Assume sutva and no unobserved single-cause confounders. Then the deconfounder provides an unbiased estimate of the average causal effect: E[Yi(a1, …, am)] = E_{X,Z}[E[Yi | Ai1 = a1, …, Aim = am, Xi, Zi]], (7) where Zi denotes the substitute confounder constructed from the factor model.

The theorem relies on two properties of the substitute confounder: (1) it captures all multi-cause confounders; (2) it does not capture mediators. By its construction from probabilistic factor models, the substitute confounder captures all multi-cause confounders; again, see the graphical model argument in Figure 1. Moreover, the substitute confounder is constructed from the observed causes alone; no outcome information is used, so it cannot pick up mediators. (We prove this fact in Lemma 4.) Thus, along with no unobserved single-cause confounders, the substitute confounder provides unconfoundedness. With unconfoundedness in hand, we treat the substitute confounder as if it were observed covariates. While this theorem does not require overlap, identifying other causal quantities with the deconfounder requires overlap. We discuss identification of different causal quantities in Section 7 and lay out the assumptions required for each.

4 Practical Details of the Deconfounder

We next attend to some of the practical details of the deconfounder. The ingredients of the deconfounder are (1) a factor model of assigned causes, (2) a way to check that the factor model captures their population distribution, and (3) a way to estimate the conditional expectation E[Yi(a) | Ẑi,Ai=a] for performing causal inference. We discuss each ingredient below (Sections 4.1 and 4.2) and then describe the full deconfounder algorithm (Section 4.3).

4.1 Using the Assignment Model to Infer a Substitute Confounder

The first ingredient is a factor model of the assigned causes, as defined in Equation (4), which we call the assignment model. Many models fall into this category, such as mixture models, mixed-membership models, and deep generative models. Each of these models can be written as Equation (4); they each involve a per-datapoint latent variable Zi and a per-cause parameter θj. Fitting the factor model gives an estimate of the parameters θj,j=1,,m. When the fitted factor model captures the population distribution of the assigned causes then inferences about Zi can be used as substitute confounders in a downstream causal inference.

4.1.1 Example Factor Models

The deconfounder requires that the investigator find an adequate factor model of the assigned causes and then use the factor model to estimate the posterior p(zi | ai). In the simulations and studies of Section 6, we will explore several classes of factor models; we describe some of them here.

One of the most common factor models is principal component analysis (pca). pca is appropriate when the assigned causes are real-valued. In its probabilistic form (Tipping and Bishop 1999), both zi and the per-cause parameters θj are real-valued K-vectors. The model is(8) Zik∼N(0,λ2),k=1,…,K,(8) (9) Aij | Zi∼N(zi⊤θj,σ2),j=1,…,m.(9)

We can fit probabilistic pca with maximum likelihood (or Bayesian methods) and use standard conditional probability to calculate p(zi | ai). Exponential family extensions of pca are also factor models (Collins, Dasgupta, and Schapire 2002; Mohamed, Ghahramani, and Heller 2009) as are some deep generative models (Tran et al. 2017), which can be interpreted as a nonlinear probabilistic PCA.
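As a concrete illustration, here is a minimal numpy sketch of probabilistic pca, using the closed-form maximum-likelihood solution of Tipping and Bishop (1999) and the posterior mean E[zi | ai] that serves as a substitute confounder. The function names and the simulation in the usage note are our own illustrative choices, not code from the paper.

```python
import numpy as np

def fit_ppca(A, K):
    """Maximum-likelihood probabilistic PCA (Tipping & Bishop 1999).

    Returns loadings W (m x K), noise variance sigma2, and mean mu.
    """
    n, m = A.shape
    mu = A.mean(axis=0)
    Ac = A - mu
    # Eigendecomposition of the sample covariance (eigh returns ascending order).
    evals, evecs = np.linalg.eigh(Ac.T @ Ac / n)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    sigma2 = evals[K:].mean()  # average discarded variance estimates the noise
    W = evecs[:, :K] * np.sqrt(np.maximum(evals[:K] - sigma2, 0.0))
    return W, sigma2, mu

def posterior_mean_z(A, W, sigma2, mu):
    """Posterior mean E[z_i | a_i] = M^{-1} W^T (a_i - mu), M = W^T W + sigma2 I."""
    K = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(K)
    return (A - mu) @ W @ np.linalg.inv(M)
```

On data simulated from a one-factor model, the inferred ẑi track the true latent variable up to sign and scale, which is all the downstream causal inference needs.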

When the causes are counts, Poisson factorization (pf) is an appropriate factor model (Cemgil 2009; Schmidt, Winther, and Hansen 2009; Gopalan, Hofman, and Blei 2015). pf is a probabilistic form of nonnegative matrix factorization (Lee and Seung 1999, 2001), where zi and θj are positive K-vectors,(10) Zik∼Gamma(α0,α1),k=1,…,K,(10) (11) Aij | Zi∼Poisson(zi⊤θj),j=1,…,m.(11) pf can be fit to large datasets with variational methods (Gopalan, Hofman, and Blei 2015).
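A point-estimate sketch of pf can be written with the classic multiplicative updates of Lee and Seung (2001), which maximize the Poisson likelihood of A ≈ ZΘ (equivalently, minimize the generalized KL divergence). The paper fits pf with variational methods, so this is only a simplified stand-in with illustrative names.

```python
import numpy as np

def fit_poisson_factorization(A, K, n_iter=200, eps=1e-10):
    """Point-estimate Poisson factorization via multiplicative updates.

    Maximizing the Poisson likelihood of A ~ Poisson(Z @ Theta) is
    equivalent to KL-divergence NMF (Lee & Seung 2001); each update
    keeps Z and Theta nonnegative and monotonically improves the fit.
    """
    rng = np.random.default_rng(0)
    n, m = A.shape
    Z = rng.gamma(1.0, 1.0, size=(n, K))
    Theta = rng.gamma(1.0, 1.0, size=(K, m))
    for _ in range(n_iter):
        Lam = Z @ Theta + eps
        Z *= (A / Lam) @ Theta.T / (Theta.sum(axis=1) + eps)
        Lam = Z @ Theta + eps
        Theta *= Z.T @ (A / Lam) / (Z.sum(axis=0)[:, None] + eps)
    return Z, Theta

def poisson_nll(A, Z, Theta, eps=1e-10):
    """Negative Poisson log-likelihood, up to the constant log(A!)."""
    Lam = Z @ Theta + eps
    return float((Lam - A * np.log(Lam)).sum())
```

A fitted model should achieve a much lower negative log-likelihood than a constant-rate baseline on count data with latent structure.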

A final example of a factor model is the deep exponential family (def) (Ranganath et al. 2015). A def is a probabilistic deep neural network. It uses exponential families to generalize classical models like the sigmoid belief network (Neal 1990) and deep Gaussian models (Rezende, Mohamed, and Wierstra 2014). For example, a two-layer def models each observation as(12) Z2,il∼ExpFam2(α),l=1,…,L,(12) (13) Z1,ik | Z2,i∼ExpFam1(g1(z2,i⊤θ1,k)),k=1,…,K,(13) (14) Aij | Z1,i∼ExpFam0(g0(z1,i⊤θ0,j)),j=1,…,m.(14)

ExpFam is an exponential family distribution, θ* are parameters, and g*(·) are link functions. Each layer of the def has the same functional form as a generalized linear model (McCullagh and Nelder 1989). The def inherits the flexibility of deep neural networks, but uses exponential families to capture different types of layered representations and data. For example, if the assigned causes are counts then ExpFam0 can be Poisson; if they are reals then it can be Gaussian. Approximate inference in defs can be performed with variational methods (Ranganath, Gerrish, and Blei 2014).

4.1.2 Predictive Checks for the Assignment Model

The deconfounder requires that its factor model captures the population distribution of the assigned causes. To assess the fidelity of the chosen model, we use predictive checks. A predictive check compares the observed assignments with assignments drawn from the model’s predictive distribution. If the model is good, then there is little difference.

First hold out a randomly selected subset of assigned causes for each individual. The held-out assignments are written ai,held; the observed assignments are written ai,obs.

Next fit the factor model to the remaining assignment data D={ai,obs}i=1n. This results in a fitted assignment model p(z,θ | a). For each individual i, calculate the local posterior distribution of p(zi | ai,obs).

Here is the predictive check. First sample held-out causes from their predictive distribution,(15) p(ai,heldrep | ai,obs)=∫p(ai,held | zi) p(zi | ai,obs) dzi.(15)

This distribution integrates out the local posterior p(zi | ai,obs). (An approximate posterior also suffices; we discuss why in Section 5.)

Then compare replicated data to held-out data. We compare using the expected log probability(16) t(ai,held)=EZ[logp(ai,held | Z) | ai,obs],(16) which relates to the marginal log-likelihood. In the nomenclature of posterior predictive checks, this is the “discrepancy function” that we use; one can use others.

Finally calculate the predictive score,(17) predictive score=p(t(ai,heldrep)<t(ai,held)).(17)

Here the randomness stems from ai,heldrep coming from the predictive distribution in Equation (15), and we approximate the predictive score with Monte Carlo.
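The steps above (hold out, infer the posterior, replicate, compare) can be sketched on a toy one-factor Gaussian model whose posterior p(zi | ai,obs) is available in closed form. This is only a hedged illustration of Equations (15)–(17); all names and the toy model are our own choices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior(a_obs):
    """Conjugate posterior p(z | a_obs) for the toy model z ~ N(0,1), a_j | z ~ N(z,1)."""
    v = 1.0 / (len(a_obs) + 1.0)
    return v * a_obs.sum(), v  # posterior mean and variance

def expected_logp(a_held, mu, v, S=100):
    """Discrepancy t(a_held) = E_Z[log p(a_held | Z) | a_obs], by Monte Carlo."""
    zs = rng.normal(mu, np.sqrt(v), size=S)
    sq = (a_held[None, :] - zs[:, None]) ** 2
    logps = -0.5 * np.log(2 * np.pi) * a_held.size - 0.5 * sq.sum(axis=1)
    return float(logps.mean())

def predictive_score(a_obs, a_held, n_rep=500):
    """Monte Carlo estimate of p(t(a_held_rep) < t(a_held)); ~0.5 is ideal."""
    mu, v = posterior(a_obs)
    t_obs = expected_logp(a_held, mu, v)
    t_rep = np.empty(n_rep)
    for r in range(n_rep):
        z = rng.normal(mu, np.sqrt(v))                # z ~ p(z | a_obs)
        a_rep = rng.normal(z, 1.0, size=a_held.size)  # replicate held-out causes
        t_rep[r] = expected_logp(a_rep, mu, v, S=50)
    return float((t_rep < t_obs).mean())
```

When the data truly come from the model, the scores are spread around 0.5; a badly mismatched model would instead push the scores toward 0.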

How to interpret the predictive score? A good model will produce values of the held-out causes that give similar log-likelihoods to their real values—the predictive score will not be extreme. A mismatched model will produce an extremely small predictive score, often where the replicated data has much higher log-likelihood than the real data. An ideal predictive score is around 0.5. We consider predictive scores larger than 0.1 to be satisfactory: with such a score, we do not have enough evidence to conclude significant mismatch of the assignment model. Note that the threshold of 0.1 is a subjective design choice; we find that assignment models that pass this threshold often yield satisfactory causal estimates in practice. Figure 2 illustrates a predictive check of a good assignment model. Section 6 shows predictive checks in action.

Fig. 2 An example of a predictive check for an assignment model. The vertical dashed line shows t(ai,held). The blue curve shows the kde of t(ai,heldrep). The predictive score is the area under the blue curve to the left of the vertical dashed line. The predictive score of this assignment model is larger than 0.1; we consider it satisfactory.

These predictive checks blend ideas around posterior predictive checks (ppcs) (Rubin 1984), ppcs with realized discrepancies (Gelman, Meng, and Stern 1996), ppcs with held-out data (Gelfand, Dey, and Chang 1992; Ranganath and Blei 2019), and stage-wise checking of hierarchical models (Dey et al. 1998; Bayarri and Castellanos 2007). They also relate to Bayesian causal model criticism (Tran et al. 2016b). More broadly, the process of iterative model building—cycling between finding, fitting, and checking a model of the assignments—relates to the applied practice of Bayesian data analysis (Gelman et al. 2013; Blei 2014).

4.2 The Outcome Model

We described how to fit and check a factor model of multiple assigned causes. We now fold in the observed outcomes and use the fitted factor model to correct for unobserved confounders.

Suppose p(zi | ai,D) concentrates around a point ẑi. Then we can use ẑi as a confounder. Follow Section 2.1 to calculate the iterated expectation on the left side of Equation (2). However, replace the observed confounders with the substitute confounder; the goal is to calculate E[EY[Yi(a) | Ai=a,Zi]]. First, approximate the outside expectation with Monte Carlo,(18) E[EY[Yi(a) | Ai=a,Zi]]≈(1/n)∑i=1n EY[Yi(Ai) | Ai=a,Zi=ẑi].(18)

This approximation uses the substitute confounder ẑi, integrating over its population distribution. It uses the model to infer the substitute confounder from each data point and then integrates the distribution of that inferred variable induced by the population distribution of data.

Turn now to the inner expectation of Equation (18). We fit a function to estimate this quantity,(19) E[Yi(Ai) | Ai=a,Zi=z]=f(a,z).(19)

The function f(a,z) is called the outcome model and can be fit from the augmented observed data {ai,ẑi,yi(ai)}. For example, we can minimize their discrepancy via some loss function ℓ: f̂=arg minf ∑i=1n ℓ(yi(ai), f(ai,ẑi)).

Like the factor model, we can check the outcome model—it is fit to observed data and should be predictive of held-out observed data (Tran et al. 2016b).

One outcome model we consider is a simple linear function,(20) f(a,z)=β⊤a+γ⊤z+β0.(20)
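For instance, the linear outcome model of Equation (20) can be fit by least squares over the augmented data. This short numpy sketch, with illustrative names of our own choosing, returns the fitted (β, γ, β0).

```python
import numpy as np

def fit_outcome_model(a, z_hat, y):
    """Fit f(a, z) = beta^T a + gamma^T z + beta_0 by least squares,
    i.e., minimize sum_i (y_i - f(a_i, z_hat_i))^2 over (beta, gamma, beta_0).
    """
    X = np.column_stack([a, z_hat, np.ones(len(y))])  # causes, confounder, intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    m = a.shape[1]
    return coef[:m], coef[m:-1], coef[-1]  # beta, gamma, beta_0
```

With the confounder included as a regressor, the coefficients on the causes recover the causal effects when the linear model is correct.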

Another outcome model we consider is where f(·) is linear in the assigned causes a and the “reconstructed assigned causes” â(z)=EM[A | z], an expectation from the fitted factor model. This class of functions is(21) f(a,z)=β⊤(a−â(z))+β0.(21)

This model relates to the generalized propensity score (Imbens 2000; Hirano and Imbens 2004), where it uses â(z) as a proxy for the propensity score. Note this substitution is used in Bayesian statistics (Laird and Louis 1982; Tierney and Kadane 1986; Geisser et al. 1990), and is justified when higher moments of the assignment are similar across individuals. In both Equations (20) and (21), the coefficient β represents the average causal effect of raising each cause by one unit.

Note we are not restricted to linear models. Other outcome models like random forests (Wager and Athey 2018) and Bayesian additive regression trees (Hill 2011) all apply here. Moreover, devising an outcome model is just one approach to approximating the inner expectation of Equation (18). Another approach is again to use Monte Carlo. There are several possibilities. In one, group the confounder ẑi into bins and approximate the expectation within each bin. In another, bin by the propensity score p(ai | ẑi) and approximate the inner expectation within each propensity-score bin (Rosenbaum and Rubin 1983; Lunceford and Davidian 2004). A third possibility—if the assigned causes are discrete and the number of causes is small—is to use the propensity score with inverse propensity weighting (Horvitz and Thompson 1952; Rosenbaum and Rubin 1983; Heckman et al. 1998; Dehejia and Wahba 2002).

4.3 The Full Algorithm and an Example

We described each component of the deconfounder. Algorithm 1 gives the full algorithm, a procedure for estimating Equation (18). The steps are: (1) find, fit, and check a factor model to the dataset of assigned causes; (2) estimate ẑi for each datapoint; (3) find and fit an outcome model; (4) use the outcome model and estimated ẑi to do causal inference.

Algorithm 1: The deconfounder

Input: a dataset of assigned causes and outcomes {(ai,yi)},  i=1,…,n

Output: the average potential outcome E[Y(a)] for any causes a

repeat

choose an assignment model from the class in Equation (4)

fit the model to the assigned causes {ai},  i=1,…,n

check the fitted model M̂

until the assignment check is satisfactory

foreach datapoint i do

calculate ẑi=EM̂[Zi | ai].

end

repeat

choose an outcome model from Equation (19)

fit the outcome model to the augmented dataset {(ai,yi,ẑi)},  i=1,…,n

check the fitted outcome model

until the outcome check is satisfactory

estimate the average causal effect E[Y(a)]−E[Y(a′)] by Equation (18)
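To make Algorithm 1 concrete, here is a minimal end-to-end sketch on simulated data, with plain PCA standing in for the probabilistic factor model and a reconstructed-causes outcome model in the spirit of Equation (21). All sizes, coefficients, and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20000, 30

# --- Simulated world: one unobserved confounder z drives all m causes
# and the outcome. (Illustrative sizes and coefficients.)
z = rng.normal(size=n)
theta = rng.normal(size=m)
A = np.outer(z, theta) + 0.5 * rng.normal(size=(n, m))
beta = 0.1 * rng.normal(size=m)              # true causal effects
y = A @ beta + 5.0 * z + rng.normal(size=n)  # z confounds every cause

# --- Step 1: fit a one-factor model to the causes alone (no outcome).
# The leading principal direction estimates the confounder's loading;
# the substitute confounder is each row's score along it.
evals, evecs = np.linalg.eigh(A.T @ A / n)   # ascending eigenvalues
u = evecs[:, -1]                             # leading factor direction
z_hat = A @ u                                # substitute confounder (up to scale)

# --- Step 2: reconstruct the causes from the substitute confounder and
# regress the outcome on the residuals a - a_hat(z), as in Equation (21).
A_hat = np.outer(z_hat, u)                   # reconstructed causes
beta_dec = np.linalg.lstsq(A - A_hat, y, rcond=None)[0]

# --- Baseline: a naive regression that ignores the confounder.
beta_naive = np.linalg.lstsq(A, y, rcond=None)[0]

err_dec = np.linalg.norm(beta_dec - beta)
err_naive = np.linalg.norm(beta_naive - beta)
```

On such simulations the deconfounded estimate is typically much closer to the truth than the naive one. Note that the minimum-norm least-squares solution leaves the component of the causal effect along the factor direction unidentified; this is the kind of identification caveat discussed in Section 7.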

4.3.1 Example

Consider a causal inference problem in genome-wide association studies (gwas) (Stephens and Balding 2009; Visscher et al. 2017): how do human genes causally affect height? Here we give a brief account of how to use the deconfounder, omitting many of the details. We analyze gwas problems extensively in Section 6.2. We discuss the connections of the deconfounder to existing GWAS methods in Appendix A in the supplementary materials.

Consider a dataset of n = 5000 individuals; for each individual, we measure height and genotype, specifically the alleles at m=100,000 locations, called single-nucleotide polymorphisms (snps). Each snp is represented by a count of 0, 1, or 2; it encodes how many of the individual’s two nucleotides differ from the most common pair of nucleotides at the location. Table 2 illustrates a snippet of the data (10 individuals).

Table 2 How do SNPs causally affect height?

We simulate such a dataset of genotypes and height. We generate each individual’s genotypes by simulating heterogeneous mixing of populations (Pritchard et al. 2000). We then generate the height from a linear model of the snps (i.e., the assigned causes) and some simulated confounders. (The confounders are only used to simulate data; when running the deconfounder, the confounders are unobserved.) In this simulated data, the coefficients of the SNPs are the true causal effects; we denote them β*=(β1*,,βm*). See Section 6.2 for more details of the simulation.

The goal is to infer how the snps causally affect human height, even in the presence of unobserved confounders. The m-dimensional snp vector ai=(ai1,ai2,,aim) is the vector of assigned causes for individual i; the height yi is the outcome. We want to estimate the potential outcome: what would the (average) height be if we set a person’s snp to be a=(a1,a2,,am)? Mathematically, this is the average potential outcome function: E[Yi(a)], where the vector of assigned causes a takes values in {0,1,2}m.

We apply the deconfounder: model the assigned causes, infer a substitute confounder, and perform causal inference. To infer a substitute confounder, we fit a factor model of the assigned causes. Here we fit a 50-factor pf model, as in Equation (10). This fit results in estimates of nonnegative factors θ̂j for each assigned cause and nonnegative weights ẑi for each individual (both K-vectors).

If the predictive check greenlights this fit, then we take the posterior predictive mean of the assigned causes as the reconstructed assignments, âj(zi)=ẑi⊤θ̂j. For brevity, we do not report the predictive check here. (The model passes.) We demonstrate predictive checks for gwas in the empirical studies of Section 6.2.

Using the reconstructed assigned causes, we estimate the average potential outcome function. Here we fit a linear outcome model of the height yi against both the assigned causes ai and the reconstructed assignments â(zi),(22) yi∼N(β0+β⊤(ai−â(zi)),σ2).(22)

This regression is high dimensional (m > n); for regularization, we use an L2-penalty on β (equivalently, normal priors). Fitting the outcome model gives an estimate of regression coefficients {β̂0,β̂}. Because we use a linear outcome model, the regression coefficients β̂ estimate the true causal effect β*.

We evaluate the causal estimates obtained with and without the deconfounder. We focus on the root mean squared error (rmse) of β̂ relative to β*. (“Causal estimation without the deconfounder” means fitting a linear model of the height yi against the assigned causes ai.) The rmse is 49.6×10−2 without the deconfounder and 41.2×10−2 with the deconfounder. The deconfounder produces closer-to-truth causal estimates.

5 A Conversation With the Reader

In this section, we answer some questions a reader might have.

Why do I need multiple causes? The deconfounder uses latent variables to capture dependence among the assigned causes. The theory in Section 7 says that a latent variable which captures this dependence will contain all valid multi-cause confounders. But estimating this latent variable requires evidence for the dependence, and evidence for dependence cannot exist with just one assigned cause. Thus, the deconfounder requires multiple causes.

Is the deconfounder a free lunch? The deconfounder is not a free lunch—it trades confounding bias for estimation variance. To see this, take an information point of view: the deconfounder uses a portion of information in the data to estimate a substitute confounder; then it uses the rest to estimate causal effects. By contrast, classical causal inference uses all the information to estimate causal effects, but it must assume unconfoundedness.

Suppose unconfoundedness is satisfied, that is, no unobserved confounders. Then both classical causal inference and the deconfounder provide unbiased causal estimates, though the deconfounder will be less confident; it has higher variance. Now suppose only “no unobserved single-cause confounders” is satisfied. The deconfounder still provides unbiased causal estimates, but classical causal inference is biased.

Why does the deconfounder have two stages? Algorithm 1 first fits a factor model to the assigned causes and then fits the potential outcome function. This is a two-stage procedure. Why? Can we fit these two models jointly?

One reason is convenience. Good models of assigned causes may be known in the research literature, such as for genetic studies. Moreover, separately fitting the assignment model allows the investigator to fit models to any available data of assigned causes, including datasets where the outcome is not measured.

Another reason for two stages is to ensure that Zi does not contain mediators, variables along the causal path between the assigned causes and the outcome. Intuitively, excluding the outcome ensures that the substitute confounders are “pretreatment” variables; we cannot identify a mediator by looking only at the assigned causes. More formally, excluding the outcome ensures that the model satisfies p(zi | ai,yi(ai))=p(zi | ai); this equality cannot hold if Zi contains a mediator.

How does the deconfounder relate to the generalized propensity score? What about instrumental variables? The deconfounder relates to both. The deconfounder can be interpreted as a generalized propensity score approach, except where the propensity score model involves latent variables. If we treat the substitute confounder Zi as observed covariates, then the factor model P(Ai | Zi) is precisely the propensity score of the causes Ai. With this view, the innovation of the deconfounder is in Zi being latent. Moreover, it is the multiplicity of the causes Ai1,,Aim that makes a latent Zi feasible; we can construct Zi by finding a random variable that renders all the causes conditionally independent.

The deconfounder can also be interpreted as a way of constructing instruments using latent factor models. Think of a factor model of the causes with linearly separable noise: Aij =a.s. f(Zi)+ϵij. Given the substitute confounder, consider the residual of the causes ϵij. For example, with probabilistic pca the residual is ϵij=Aij−Zi⊤θj∼N(0,σ2).

Assuming no unobserved single-cause confounders, the variable ϵij is an instrumental variable for the jth cause Aij: (1) The residual ϵij correlates with the cause Aij. (2) The residual ϵij affects the outcome only through the cause Aij; this fact is true because the substitute confounder Zi is constructed without using any outcome information. (3) The residual ϵij cannot be correlated with a confounder; this is true because Zi⊥ϵij by construction from the factor model, where P(Zi) and P(Aij | Zi) are specified separately.

However, the deconfounder differs from classical instrumental variables approaches because it uses latent variable models to construct instruments, rather than requiring that instruments be observed. The latent variable construction is feasible because the multiplicity of the causes allows us to construct Zi and ϵij from the conditional independence requirement.

Does the factor model of the assigned causes need to be the true assignment model? Which factor model should I choose if multiple factor models return good predictive scores? Finding a good factor model is not the same as finding the “true” model of the assigned causes. We do not assume the inferred variable Zi reflects a real-world unobserved variable.

Rather, the deconfounder requires the factor model to capture the population distribution of the assigned causes and, more particularly, their dependence structure. This requirement is why predictive checking is important. If the deconfounder captures the population distribution—if the predictive check returns high predictive scores—then we can use the inferred local variables Zi as substitute confounders.

Moreover, the deconfounder can rely on approximate inference methods to infer the substitute confounder. The predictive check evaluates whether Zi provides a good predictive distribution, regardless of how it was inferred. Given the assumptions of the deconfounder, as long as the model and (approximate) inference method together give a good predictive distribution—one close to the population distribution of the assigned causes—then the downstream causal inference is valid. We use approximate inference for most of the factor models we study in Section 6.

Suppose multiple factor models give similarly good predictive scores in the predictive check. In this case, we recommend choosing the factor model with the lowest capacity. Factor models with similar predictive scores often result in causal estimates with similarly little bias. But the variance of these estimates can differ. Factor models with high capacity can compromise overlap and lead to high-variance estimates; factor models with low capacities tend to produce lower variance causal estimates. The empirical study in Section 6.1 demonstrates this phenomenon.

Should I condition on known confounders and covariates? Suppose we also observe known confounders and other covariates Xi. The deconfounder maintains its theoretical properties when we condition on observed covariates Xi as well as infer a substitute confounder Zi. In particular, if Xi is “pretreatment”—it does not include any mediators—then the causal estimate will be unbiased (Imai and Van Dyk 2004) (also see Theorem 6). Moreover, to satisfy no unobserved single-cause confounders (Section 3.2), we must condition on single-cause confounders.

That said, we do not need to condition on observed confounders that affect more than one of the causes; it suffices to condition only on the substitute confounder Zi. And there is a tradeoff: conditioning on covariates Xi maintains unbiasedness but hurts efficiency. If the true causal effect size is small, then large confidence or credible intervals will deem these small effects insignificant—inefficient causal estimates can bury the real causal effects. The empirical study in Section 6.1 explores this phenomenon.

How can I assess the uncertainty of the deconfounder? The uncertainty in the deconfounder comes from two sources, the factor model and the outcome model. The deconfounder first fits (and checks) the factor model; it gives a substitute confounder Zi∼p(zi | ai). It then uses the mean of the substitute confounder ẑi=EM̂[Zi | ai] to fit an outcome model p(yi | ai,ẑi) and compute the potential outcome estimate E[Yi(a)].

To assess the uncertainty of the deconfounder, we consider the uncertainty from both sources. We first draw s samples {zi(1),…,zi(s)} of the substitute confounder: zi(ℓ)∼iid p(zi | ai), ℓ=1,…,s. For each sample zi(ℓ), we fit an outcome model and compute a point estimate of the potential outcome. (If the outcome model is probabilistic, we compute the posterior distribution of its parameters; this leads to a posterior of the potential outcome.) We aggregate the estimates of the potential outcome (or its distributions) from the s samples {zi(1),…,zi(s)}; the aggregated estimate is a collection of point estimates of the potential outcome (or a mixture of its posterior distributions). The variance of this aggregated estimate describes the uncertainty of the deconfounder; it reflects how the finite data informs the estimation of the potential outcome. In a two-cause smoking study, Section 6.1 illustrates this strategy for calculating the uncertainty of the deconfounder.
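This sampling strategy can be sketched on a toy one-factor model whose posterior p(zi | ai) is Gaussian in closed form. Treating the factor loading as known is a simplification (in practice the factor model is fitted and checked), and all names and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, s = 2000, 10, 50

# Toy setup: causes from a one-factor Gaussian model with a known
# loading; the outcome depends on the causes and the confounder z.
theta = np.ones(m)
z = rng.normal(size=n)
A = np.outer(z, theta) + rng.normal(size=(n, m))
y = A @ (0.2 * np.ones(m)) + 2.0 * z + rng.normal(size=n)

# Closed-form posterior: z_i | a_i ~ N(theta^T a_i / (theta^T theta + 1), v).
v = 1.0 / (theta @ theta + 1.0)
post_mean = A @ theta * v

# Draw s posterior samples of the substitute confounder; refit the
# outcome model for each sample and collect the causal coefficients.
estimates = []
for _ in range(s):
    z_l = post_mean + np.sqrt(v) * rng.normal(size=n)   # z^(l) ~ p(z | a)
    X = np.column_stack([A, z_l, np.ones(n)])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    estimates.append(coef[:m])                          # cause coefficients

estimates = np.array(estimates)
beta_mean = estimates.mean(axis=0)   # aggregated point estimate
beta_sd = estimates.std(axis=0)      # spread = deconfounder uncertainty
```

The spread of the per-sample estimates reflects how posterior uncertainty in the substitute confounder propagates into the causal estimates.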

6 Empirical Studies

We study the deconfounder in three empirical studies. Two studies involve simulations of realistic scenarios; these help assess how well the deconfounder performs relative to ground truth. The third study is a real-world analysis. All three studies demonstrate the benefits of the deconfounder. They show how predictive checks reveal potential issues with downstream causal inference and how the deconfounder can provide closer-to-truth causal estimates.

In Section 6.1, we study semi-synthetic data about smoking; the causes are a real dataset about smoking and the effect (medical expenses) is simulated. In Section 6.2, we study semi-synthetic data about genetics. Finally, in Section 6.3, we study real data about actors and movie revenue; there is no simulation.

Each stage of the deconfounder requires computation: to fit the factor model, to check the factor model, to calculate the substitute confounder, and to fit the outcome model. In all these stages, we use black box variational inference (bbvi) (Ranganath, Gerrish, and Blei 2014; Kucukelbir et al. 2017). We use its RStan implementation (Carpenter et al. 2017) in Section 6.1 and its Edward implementation (Tran et al. 2016a, 2017) in Sections 6.2 and 6.3. (This was a choice; other methods of computation can also be used.)

6.1 Two Causes: How Smoking Affects Medical Expenses

We first study the deconfounder with semi-synthetic data about smoking. The 1987 National Medical Expenditures Survey (NMES) collected data about smoking habits and medical expenses in a representative sample of the U.S. population (US Department of Health and Human Services Public Health Service 1987; Imai and Van Dyk 2004). The dataset contains 9708 people and 8 variables about each. For each person, we focus on the current marital status (amar), the cumulative exposure to smoking (aexp), and the last age of smoking (aage). We standardize all variables.

6.1.1 A True Outcome Model and Causal Inference Problem

We use the assigned causes from the survey to simulate a dataset of medical expenses, which we will consider as the outcome variable. In this simulation, the true model is linear,(23) yi=βmar amar,i+βexp aexp,i+βage aage,i+εi,(23) where εi∼N(0,1). We generate the true causal coefficients from(24) βmar∼N(0,1), βexp∼N(0,1), βage∼N(0,1),(24) and from these coefficients we generate the outcome for each individual. The result is a dataset of 9708 tuples (amar,i,aexp,i,aage,i,yi). It is semi-synthetic: the assigned causes are from the real world, but we know the true outcome model. Note that the last smoking age is a multi-cause confounder—it affects both marital status and exposure and is one of the causes of the expenses.

We are interested in estimating the causal effects of marital status and smoking exposure on medical expenses. But suppose we do not observe age; it is an unobserved confounder. We can use the deconfounder to solve the problem.

6.1.2 Modeling the Assigned Causes

We begin by finding a good factor model of the assigned causes (amar,i,aexp,i). Because there are two observed assigned causes, we consider models with a single scalar latent variable for overlap considerations. (See Section 3.) We consider two factor models.

The first is a linear factor model,(25) zline,i∼N(0,σ2),(25) (26) amar,i=ηmar(1) zline,i+ηmar(0)+εi,mar,(26) (27) aexp,i=ηexp(1) zline,i+ηexp(0)+εi,exp,(27) where all errors are standard normal. We use variational inference to approximate posterior estimates of the substitute confounders zline,i. Then we use the predictive check to evaluate the model: following Section 4.1, we hold out a subset of the assigned causes and use the expected log probability as the test statistic. The resulting predictive score is 0.03, which signals a model mismatch. See Figure 3(a).

Fig. 3 Predictive checks for the substitute confounder z obtained from a linear factor model (a) and a quadratic factor model (b). The blue line is the kde of the test-statistic based on the predictive distribution. The dashed vertical line shows the value of the test-statistic on the observed dataset. The figure shows that the linear model mismatches the data—the observed statistic falls in a low probability region of the kde. The quadratic factor model is a better fit to the data.

We next consider a quadratic factor model,(28) zquad,i∼N(0,σ2),(28) (29) amar,i=ηmar(1) zquad,i+ηmar(2) zquad,i2+ηmar(0)+εi,mar,(29) (30) aexp,i=ηexp(1) zquad,i+ηexp(2) zquad,i2+ηexp(0)+εi,exp,(30) where all errors are standard normal. (In Appendix C in the supplementary materials, we prove that the average causal effect is identifiable with this quadratic factor model and a linear outcome model.) We again use variational inference and a predictive check. The resulting predictive score is 0.12; see Figure 3(b). This value gives the green light. We use the model’s posterior estimates ẑi∼pquad(z | ai) to form a substitute confounder in a causal inference.

6.1.3 Deconfounded Causal Inference

Using a factor model to estimate substitute confounders, we proceed with causal inference. We set the outcome model of E[Y(Amar,Aexp) | A,Z] to be linear in amar and aexp. In one form, the linear model conditions on ẑ directly. In another, it conditions on the reconstructed causes; for example, for the quadratic model and marital status,(31) âmar,i(ẑi)=Equad[Amar | Z=ẑi].(31)

See Equation (21).

We use predictive checks to evaluate the outcome models. Conditioning on ẑ gives a predictive score of 0.05; conditioning on a(ẑ) gives a predictive score of 0.18. The model with reconstructed causes is better.

If the outcome model is good and if the substitute confounder captures the true confounders, then the estimated coefficients for marital status and exposure will be close to the true βmar and βexp of Equation (23). We emphasize that Equation (23) is the true mechanism of the simulated world, which the deconfounder does not have access to. The linear model we posit for E[Y(Amar,Aexp) | A,Z] is a functional form for the expectation we are trying to estimate.

6.1.4 Performance

We compare all combinations of factor model (linear, quadratic) and outcome-expectation model (conditional on ẑi or â(ẑi)). Table 3 gives the results, reporting the total bias and variance of the estimated causal coefficients βmar and βexp. We compute the variance by drawing posterior samples of the substitute confounder and the resulting posterior samples of the causal coefficients.

Table 3 Total bias and variance of the estimated causal coefficients βexp and βmar.

Table 3 also reports the estimates if we had observed the age confounder (oracle), and the estimates if we neglect causal inference altogether and fit a regression to the confounded data. Neglecting causal inference gives biased causal estimates; observing the confounder corrects the problem.

How does the deconfounder fare? Using the deconfounder with a linear factor model yields biased causal estimates, but we predicted this peril with a predictive check. Using the deconfounder with the quadratic assignment model, which passed its predictive check, produces less biased causal estimates. (The estimate with one-dimensional zquad was still biased, but the outcome check revealed this issue.)

We also use this simulation study to illustrate a few questions discussed in Section 5:

  • What if multiple factor models pass the check? We fit one-dimensional, two-dimensional, and three-dimensional quadratic factor models to the causes. All three models pass the check. Table 3 shows that they yield estimates with similar bias. However, factor models with higher capacity in general lead to higher variance. The one-dimensional factor model is the smallest factor model that passes the check, and it achieves the best mean squared error.

  • Should we additionally condition on the observed covariates? Table 3 shows that using the deconfounder, along with covariates, preserves the unbiasedness of the causal estimates but inflates the variance. (The covariates include gender, race, seat belt usage, education level, and the age of starting to smoke.) This demonstrates how including covariates trades variance for the risk of missing a confounder.

This study provides three takeaway messages: (1) it is crucial to check both the assignment model and the outcome model; (2) unless a single-cause confounder believably exists, we do not need to accompany the deconfounder with other observed covariates; (3) use the deconfounder.

6.2 Many Causes: Genome-Wide Association Studies

Analyzing genome-wide association studies (GWAS) is an important problem in modern genetics (Stephens and Balding 2009; Visscher et al. 2017). The GWAS problem involves large datasets of human genotypes and a trait of interest; the goal is to determine how genetic variation is causally connected to the trait. GWAS is a problem of multiple causal inference: for each individual, the data contains a trait and hundreds of thousands of single-nucleotide polymorphisms (snps), measurements on various locations on the genome.

One benefit of GWAS is that biology guarantees that genes are (typically) set in advance; they are potential causes of the trait, and not the other way around. However, there are many confounders. In particular, any correlation between the SNPs could induce confounding. Suppose the value of SNP i is correlated with the value of SNP j, and SNP j is causal for the outcome. Then a naive analysis will find a connection between SNP i and the outcome.

There can be many sources of correlation; common sources include population structure, that is, how an individual's genetic code reflects their ancestral populations, and lifestyle variables. We study how to use the deconfounder to analyze GWAS data. (Many existing methods to analyze GWAS data can be seen as versions of the deconfounder; see Appendix A in the supplementary materials.)

6.2.1 Simulated GWAS Data and the Causal Inference Problem

We put the GWAS problem into our notation. The data are tuples $(a_i, y_i)$, where $y_i$ is a real-valued trait and $a_{ij} \in \{0, 1, 2\}$ is the value of SNP j in individual i. (The coding denotes “unphased data,” where $a_{ij}$ codes the number of minor alleles—deviations from the norm—at location j of the genome.) As usual, our goal is to estimate aspects of the distribution of $y_i(\mathbf{a})$, the trait of interest as a function of a specific genotype.

We generate synthetic GWAS data. Following Song, Hao, and Storey (2015), we simulate genotypes a1:n from an array of realistic models. These include models generated from real-world fits, models that simulate heterogeneous mixing of populations, and models that simulate a smooth spatial mixing of populations. For each model, we produce multiple datasets of genotypes.

With the individuals in hand, we next generate their traits. Still following Song, Hao, and Storey (2015), we generate the outcome (i.e., the trait) from a linear model,
(32) $y_i = \sum_j \beta_j a_{ij} + \lambda_{c_i} + \varepsilon_i$.

To introduce further confounding effects, we group the individuals by their SNPs; the ith individual is in group $c_i$. (Appendix N in the supplementary materials describes how individuals are grouped.) Each group is associated with a per-group intercept term $\lambda_c$ and a per-group error variance $\sigma_c^2$, where the noise $\varepsilon_i \sim N(0, \sigma_{c_i}^2)$. In this study, the group indicator of each individual is an unobserved confounder.

In Equation (32), SNP j is associated with a true causal coefficient $\beta_j$. We draw this coefficient from $N(0, 0.5^2)$ and truncate so that the majority of the coefficients are set to zero (i.e., no causal effect). Such truncation mimics the sparse causal effects that are found in the real world. Further, we study both low and high SNR settings. In low SNR settings, the SNPs contribute only a small portion (e.g., 10%) of the variance of the trait; in high SNR settings, they contribute most of it. Appendix N in the supplementary materials details the full configurations of the simulation.

In a separate set of studies, we generate binary outcomes. They come from a generalized linear model,
(33) $y_i \sim \mathrm{Bernoulli}\left(\frac{1}{1 + \exp\left(\sum_j \beta_j a_{ij} + \lambda_{c_i} + \varepsilon_i\right)}\right)$.

We will study the deconfounder for both binary and real-valued outcomes.
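To make the simulation concrete, here is a minimal Python sketch of this generative process: the linear model in Equation (32) and its binary (Bernoulli) variant. The group assignments, intercepts, noise scales, and sparsity level are toy stand-ins, not the configurations described in Appendix N.

```python
# Toy sketch of the GWAS simulation; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m, n_groups = 200, 50, 3

a = rng.integers(0, 3, size=(n, m))           # unphased SNPs in {0, 1, 2}
beta = rng.normal(0.0, 0.5, size=m)
beta[rng.random(m) < 0.8] = 0.0               # truncate: sparse causal effects

c = rng.integers(0, n_groups, size=n)         # unobserved group confounder
lam = rng.normal(0.0, 1.0, size=n_groups)     # per-group intercepts
sigma = np.array([0.5, 1.0, 1.5])             # per-group noise scales

eps = rng.normal(0.0, sigma[c])
y_real = a @ beta + lam[c] + eps              # real-valued trait, Equation (32)

# Binary trait through the same linear predictor (Bernoulli variant).
p = 1.0 / (1.0 + np.exp(a @ beta + lam[c] + rng.normal(0.0, sigma[c])))
y_bin = rng.binomial(1, p)
```

Because the group indicator c enters both nothing about the SNPs here and the trait, it acts as the unobserved confounder described above.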

For each true assignment model of ai, we simulate 100 datasets of genotypes ai, causal coefficients βj, and outcomes yi (real and binary). For each, the causal inference problem is to infer the causal coefficients βj from tuples (ai,yi). The unobserved confounding lies in the correlation structure of the SNPs and the unobserved groups. We correct it with the deconfounder.

6.2.2 Deconfounding GWAS

We apply the deconfounder with five assignment models discussed in Section 2.2: probabilistic principal component analysis (ppca), Poisson factorization (pf), Gaussian mixture models (gmms), the three-layer deep exponential family (def), and logistic factor analysis (lfa); none of these models is the true assignment model. (We use 50 latent dimensions so that most pass the predictive check; for the def we use the structure [100,30,15].) We fit each model to the observed SNPs and check them with the per-individual predictive checks from Section 4.1.
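To illustrate the flavor of these checks (the precise per-individual check is defined in Section 4.1), the sketch below computes a predictive score for an assumed toy Gaussian factor model: the fraction of replicated datasets whose per-individual statistic falls below the observed one. A well-fitting model yields scores near 0.5; scores near zero signal misfit. The model, statistic, and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 100, 20, 3
z = rng.normal(size=(n, k))
theta = rng.normal(size=(k, m))
a_obs = z @ theta + rng.normal(size=(n, m))      # "observed" causes

mean = z @ theta                                 # fitted reconstruction

def stat(x):
    """Per-individual (negative squared-error) statistic."""
    return -0.5 * np.sum((x - mean) ** 2, axis=1)

t_obs = stat(a_obs)
n_rep = 200
below = np.zeros(n)
for _ in range(n_rep):
    a_rep = mean + rng.normal(size=(n, m))       # replicated causes
    below += stat(a_rep) < t_obs
predictive_score = (below / n_rep).mean()        # near 0.5 here: model fits
```

Since the fitted model is the truth in this toy setup, the score lands near 0.5; a misspecified model would push it toward zero.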

With the fitted assignment model, we estimate the causal effects of the SNPs. For real-valued traits, we use a linear model conditional on the snps and the reconstructed causes a(ẑ); see Equation (21). Each assignment model gives a different form of a(ẑ). For the binary traits, we use a logistic regression, again conditional on the SNPs and reconstructed causes.
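The pipeline can be sketched on toy data as follows. A truncated SVD stands in for the probabilistic factor models, and the substitute confounder enters a least-squares outcome model in place of the paper's probabilistic reconstruction in Equation (21); all names and numbers are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 2000, 30, 3
z_true = rng.normal(size=(n, k))                  # unobserved confounders
w = rng.normal(size=(k, m))
a = z_true @ w + 0.5 * rng.normal(size=(n, m))    # correlated causes

gamma = np.zeros(m)
gamma[0] = 1.0                                    # single truly causal SNP
y = a @ gamma + 2.0 * z_true[:, 0] + 0.1 * rng.normal(size=n)

# 1) "Assignment model": truncated SVD of the centered causes.
a_c = a - a.mean(axis=0)
u, s, vt = np.linalg.svd(a_c, full_matrices=False)
z_hat = u[:, :k] * s[:k]                          # substitute confounder

# 2) Outcome model for the first cause, conditional on z_hat.
X_deconf = np.column_stack([np.ones(n), a[:, 0], z_hat])
est_deconf = np.linalg.lstsq(X_deconf, y, rcond=None)[0][1]

# Without the substitute confounder the estimate is typically confounded.
X_naive = np.column_stack([np.ones(n), a[:, 0]])
est_naive = np.linalg.lstsq(X_naive, y, rcond=None)[0][1]
```

Conditioning on z_hat approximately blocks the unobserved z_true, so est_deconf lands near the true coefficient 1.0.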

6.2.3 Performance

We study the deconfounder for GWAS. Tables 6–15 in the supplementary materials present the full results across the 11 different configurations and both high and low signal-to-noise ratio (snr) settings. Each table is attached to a true assignment model and reports results across different factor models of the SNPs. For each factor model, the tables report the results of the predictive check and the root mean squared error (rmse) of the estimated causal coefficients (for real-valued and binary-valued outcomes). Tables 6–15 in the supplementary materials also report the error if we had observed the confounder and if we neglect causal inference by fitting a regression to the confounded data.

On both real and binary outcomes, the deconfounder gives good causal estimates with ppca, pf, lfa, linear mixed models (lmms), and defs: they produce lower rmses than blindly fitting regressions to the confounded data. (The linear mixed model does not explicitly posit an assignment model so we omit the predictive check. It can be interpreted as the deconfounder though; see Appendix A in the supplementary materials.) Notably, the deconfounder often outperforms the regression where we include the (unobserved) confounder as a covariate under the low snr setting; see Tables 11–14 in the supplementary materials.

In general, predictive checks of the factor models reveal downstream issues with causal inference: better factor models of the assigned causes, as checked with the predictive checks, give closer-to-truth causal estimates. For example, the gmm does not perform well as a factor model of the assignments; it struggles with fitting high-dimensional data and can amplify the causal effects (see, e.g., Table 15 in the supplementary materials). But checking the gmm signals this issue beforehand; the gmm consistently yields close-to-zero predictive scores in predictive checks.

Among the assignment models, the three-layer def almost always produces the best causal estimates. Inspired by deep neural networks, the def has layered latent variables; see Section 4.1. The def model of SNPs uses Gamma distributions on the latent variables (to induce sparsity) and a bank of Poisson distributions to model the observations.

The deconfounder is most challenged when the assigned SNPs are generated from a spatial model; see Tables 10 and 15 in the supplementary materials. The spatial model produces spatially correlated individuals; its parameter τ controls the spatial dispersion. (Consider each individual to sit in a unit square; as $\tau \to 0$, the individuals are placed closer to the corners of the unit square, while when τ = 1 they are distributed uniformly.) The six factor models—ppca, pf, lfa, gmm, lmm, and def—all produce closer-to-truth causal estimates than when ignoring confounding effects. But they are farther from the truth than the estimates that use the (unobserved) confounder. Again, the predictive check hints at this issue. When the true distribution of SNPs is a spatial model, the predictive scores are generally more extreme (i.e., closer to zero).

6.2.4 Partially Observed Causes

Finally, we study the situation where some assigned causes are unobserved, that is, where some of the SNPs are not measured. Recall that the deconfounder assumes that all single-cause confounders are observed. This assumption may be plausible when we measure all assigned causes but it may well be compromised when we only observe a subset—if a confounder affects multiple causes but only one of those causes is observed then the confounder becomes a single-cause confounder.

Using the simulated GWAS data, we randomly mask a percentage of the causes. We then use the deconfounder to estimate the causal effects of the remaining causes. To simplify the presentation, we focus on the def factor model. Figure 4 shows the ratio of the rmse between the deconfounder and “no control”; a ratio closer to one indicates a more biased causal estimate. Across simulations, the rmse ratio increases toward one as the percentage of observed causes decreases. With fewer observed causes, it becomes more likely that the “no unobserved single-cause confounders” assumption is compromised.

Fig. 4 The rmse ratio between the deconfounder with def and “no control” across simulations when only a subset of causes is observed. (Lower ratios mean more correction.) As the percentage of observed causes decreases, the “no unobserved single-cause confounders” assumption is compromised; the deconfounder can no longer correct for all latent confounders.

6.2.5 Summary

These studies provide three take-away messages: (1) the deconfounder can produce closer-to-truth causal estimates, especially when we observe many assigned causes; (2) predictive checks reveal downstream issues with causal inference, and better factor models give better causal estimates; (3) defs can be a handy class of factor models in the deconfounder.

6.3 Case Study: How Do Actors Boost Movie Earnings?

We now return to the example from Section 1: How much does an actor boost (or hurt) a movie’s revenue? We study the deconfounder with the TMDB 5000 Movie Dataset. It contains 901 actors (who appeared in at least five movies) and the revenue for the 2828 movies they appeared in. The movies span 18 genres and 58 languages. (More than 60% of the movies are in English.) We focus on the cast and the log of the revenue. Note that this is a real-world observational dataset; we no longer have ground-truth causal effects.

The idea here is that actors are potential causes of movie earnings: some actors result in greater revenue. But confounders abound. Consider the genre of a movie; it will affect both who is in the cast and its revenue. For example, an action movie tends to cast action actors, and action movies tend to earn more than family movies. And genre is just one possible confounder: movies in a series, directors, writers, language, and release season are all possible confounders.

We are interested in estimating the causal effects of individual actors on the revenue. The data are tuples $(a_i, y_i)$, where $a_{ij} \in \{0, 1\}$ indicates whether actor j appears in movie i, and $y_i$ is the revenue. Table 1 shows a snippet of the highest-earning movies in this dataset. The goal is to estimate the distribution of $Y_i(\mathbf{a})$, the (potential) revenue as a function of a movie cast.

6.3.1 Deconfounded Causal Inference

We apply the deconfounder. We explore four assignment models: probabilistic principal component analysis (ppca), Poisson factorization (pf), Gaussian mixture models (gmms), and deep exponential families (defs). (Each has 50 latent dimensions; the def has structure [50,20,5].) We fit each model to the observed movie casts and check the models with a predictive check on held-out data; see Section 4.1.

The gmm fails its check, yielding a predictive score <0.01. The other models adequately capture patterns of actors: the checks return predictive scores of 0.12 (ppca), 0.14 (pf), and 0.15 (def). These numbers give a green light to estimate how each actor affects movie earnings.

With a fitted and checked assignment model, we estimate the causal effects of individual actors with a log-normal regression, conditional on the observed casts and “reconstructed casts,” Equation (21).

6.3.2 Results: Predicting the Revenue of Uncommon Movies

We consider test sets of uncommon movies, where we simulate an “intervention” on the types of movies that are made. This changes the distribution of casts to be different from those in the training set.

For such data, a good causal model will provide better predictions than a purely predictive model. The reason is that predictions from a causal model will work equally well under interventions as for observational data. In contrast, a noncausal model can produce incorrect predictions if we intervene on the causes (Peters, Bühlmann, and Meinshausen 2016). This idea of invariance has also been discussed in Haavelmo (1944), Aldrich (1989), Lanes (1988), Pearl (2009), Schölkopf et al. (2012), and Dawid and Didelez (2010) under the terms “autonomy,” “modularity,” and “stability.”

In one test set, we hold out 10% of non-English-language movies. (Most of the movies are in English.) Table 17 in the supplementary materials compares different models in terms of the average predictive log likelihood. The deconfounder predicts better than both the purely predictive approach (no control) and a classical approach, where we condition on the observed (pretreatment) covariates.

In another test set, we hold out 10% of movies from uncommon genres, that is, those that are not comedies, action, or dramas. Table 18 in the supplementary materials shows similar patterns of performance. The deconfounder predicts better than purely predictive models and than those that control for available confounders.

For comparison, we finally analyze a typical test set, one drawn randomly from the data. Here we expect a purely predictive method to perform well; this is the type of prediction it is designed for. Table 16 in the supplementary materials shows the average predictive log-likelihood of the deconfounder and the purely predictive method. The deconfounder predicts slightly worse than the purely predictive method.

6.3.3 Exploratory Analysis of Actors and Movies

We show how to use the deconfounder to explore the data, understanding the causal value of actors and movies.

First we examine how the coefficients of individual actors differ between a noncausal model and a deconfounded model. (In this section, we study the deconfounder with pf as the assignment model.) We explore actors with $n_j \beta_j$, their estimated coefficients scaled by the number of movies $n_j$ they appeared in. This quantity represents how much of the total log revenue is “explained” by actor j.
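The scaled score can be computed directly; here is a toy sketch with made-up actor labels, coefficients, and movie counts (none of these numbers come from the TMDB fit).

```python
import numpy as np

actors = ["A", "B", "C", "D"]                 # hypothetical actors
beta = np.array([0.8, 0.2, -0.1, 0.5])        # made-up estimated coefficients
n_movies = np.array([5, 30, 12, 8])           # made-up movie counts n_j

score = n_movies * beta                       # total log revenue "explained"
ranking = [actors[i] for i in np.argsort(-score)]
```

Note that the scaling rewards prolific actors with modest coefficients as much as rare actors with large ones; here actor "B" tops the ranking despite the smallest nonzero coefficient.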

Consider the top 25 actors in both the corrected and uncorrected models. In the uncorrected model, the top actors are movie stars such as Tom Cruise, Tom Hanks, and Will Smith. Some actors, like Arnold Schwarzenegger, Robert De Niro, and Brad Pitt, appear in the top-25 uncorrected coefficients but not in the top-25 corrected coefficients. In their place, the top 25 causal actors include actors who do not appear in as many blockbusters, such as Owen Wilson, Nick Cage, Cate Blanchett, and Antonio Banderas.

Also consider the actors whose estimated contribution improves the most from the noncausal to the causal model. The top five “most improved” actors are Stanley Tucci, Willem Dafoe, Susan Sarandon, Ben Affleck, and Christopher Walken. These (excellent) actors often appear in smaller movies.

Next we look at how the deconfounder changes the causal estimates of movie casts. We can calculate the movie casts whose causal estimates are decreased most by the deconfounder. The “causal estimate of a cast” is the predicted revenue without including the term that involves the confounder; this is the portion of the predicted log revenue that is attributed to the cast.

At the top of this list are blockbuster series. The top 25 include all of the X-Men movies, all of the Avengers movies, and all of the Ocean’s movies. Though unmeasured in the data, being part of a series is a confounder. It affects both the casting and the revenue of the movie: sequels must contain recurring characters, and they are only made when the producers expect to profit. In capturing the correlations among casts, the deconfounder corrects for this phenomenon.

7 Theory

We develop theoretical results around the deconfounder. (All proofs are in the Appendix.)

We first justify the use of factor models by connecting them to the unconfoundedness assumption. We show that factor models, together with “no unobserved single-cause confounders,” imply unconfoundedness. We next establish theoretical properties of the substitute confounder: it captures all multiple-cause confounders and it does not capture any mediators. These results imply that if the factor model captures the distribution of the assigned causes then the substitute confounder renders the assignment ignorable. Moreover, such a factor model always exists.

We then discuss identification results around the deconfounder. Under the stable unit treatment value assumption (sutva) and “no unobserved single-cause confounders,” we prove that the deconfounder identifies the average causal effects and the conditional potential outcomes under different conditions.

7.1 Factor Models and the Substitute Confounder

To study the deconfounder, we first connect unconfoundedness to factor models. Recall the definitions of unconfoundedness and factor model.

Unconfoundedness assumes that the assigned causes are conditionally independent of the potential outcomes (Rosenbaum and Rubin 1983; Imbens 2000):

Definition 1

(Weak unconfoundedness (Imbens 2000)). The assigned causes are weakly unconfounded given $Z_i$ if
(34) $(A_{i1}, \ldots, A_{im}) \perp Y_i(\mathbf{a}) \mid Z_i$
for all $(a_1, \ldots, a_m) \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_m$ and $i = 1, \ldots, n$.

Roughly, the assigned causes are weakly unconfounded given Zi if all confounders are captured by Zi. More technically, the assigned causes are weakly unconfounded if all confounders are measurable with respect to the σ-algebra generated by Zi.

A factor model of assigned causes describes each assigned cause of an individual with a latent variable specific to this individual and a parameter specific to this cause:

Definition 2

(Factor model of assigned causes). Consider the assigned causes $A_{1:n}$, a set of latent variables $Z_{1:n}$, and a set of parameters $\theta_{1:m}$. A factor model of the assigned causes is a latent-variable model,
(35) $p(z_{1:n}, a_{1:n}\,;\,\theta_{1:m}) = p(z_{1:n}) \prod_{i=1}^{n} \prod_{j=1}^{m} p(a_{ij} \mid z_i, \theta_j)$.

The distribution of assigned causes is the corresponding marginal,
(36) $p(a_{1:n}) = \int p(z_{1:n}, a_{1:n}\,;\,\theta_{1:m})\, dz_{1:n}$.

In a factor model, the latent variable $Z_i$ of individual i renders its assigned causes $A_{ij}$, $j = 1, \ldots, m$, conditionally independent. Each cause is accompanied by an unknown parameter $\theta_j$. As we mentioned in Section 4.1, many common models from Bayesian statistics and machine learning can be written as factor models. In the deconfounder, we fit factor models to construct substitute confounders, where we infer $Z_{1:n}$ as a function of $a_{1:n}$ and check its fidelity against the distribution of the causes $p(a_{1:n})$ using a predictive check. When a fitted factor model passes the check, it captures $p(a_{1:n})$ well. In other words, factor models in the deconfounder satisfy Equations (35) and (36) with $p(z_{1:n} \mid a_{1:n}) = \delta_{f_\theta(a_{1:n})}$ for some function $f_\theta(\cdot)$.
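For concreteness, here is a minimal sketch of the joint density in Equation (35) for an assumed toy Gaussian factor model, with a standard-normal prior on each $z_i$ and $p(a_{ij} \mid z_i, \theta_j) = N(\theta_j^\top z_i, 1)$; the Gaussian choices are illustrative, not part of the definition.

```python
import numpy as np

def log_joint(z, a, theta):
    """log p(z_{1:n}) + sum_{ij} log p(a_ij | z_i, theta_j), as in
    Equation (35), with standard-normal priors and unit-variance
    Gaussian likelihoods (a toy choice)."""
    lp_z = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2 * np.pi)
    resid = a - z @ theta                    # mean theta_j^T z_i per (i, j)
    lp_a = -0.5 * np.sum(resid ** 2) - 0.5 * a.size * np.log(2 * np.pi)
    return lp_z + lp_a

rng = np.random.default_rng(3)
z = rng.normal(size=(10, 2))                 # one latent z_i per individual
theta = rng.normal(size=(2, 4))              # one parameter theta_j per cause
a = z @ theta + rng.normal(size=(10, 4))     # assigned causes
ll = log_joint(z, a, theta)
```

The double product over individuals and causes in Equation (35) becomes the double sum over the residual matrix.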

To connect unconfoundedness to factor models, consider an intermediate construct, the “Kallenberg construction.” The Kallenberg construction is inspired by the idea of randomization variables, Uniform[0,1] variables from which we can construct a random variable with an arbitrary distribution (Kallenberg 1997). The Kallenberg construction of assigned causes will bridge the conditional independence statement in Equation (34) with the factor models of the deconfounder.

Definition 3

(Kallenberg construction of assigned causes). Consider a random variable $Z_i$ taking values in $\mathcal{Z}$. The distribution of assigned causes $(A_{i1}, \ldots, A_{im})$ admits a Kallenberg construction if there exist (deterministic) measurable functions $f_j : \mathcal{Z} \times [0,1] \to \mathcal{A}_j$ and random variables $U_{ij} \in [0,1]$ ($j = 1, \ldots, m$) such that
(37) $A_{ij} \overset{\mathrm{a.s.}}{=} f_j(Z_i, U_{ij})$;
the variables $U_{ij}$ must marginally follow Uniform[0,1] and jointly satisfy
(38) $(U_{i1}, \ldots, U_{im}) \perp (Z_i, Y_i(a_1, \ldots, a_m))$
for all $(a_1, \ldots, a_m) \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_m$.
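A concrete instance of such a construction: when $A_j$ given $Z = z$ is discrete, the inverse CDF supplies a measurable $f_j$ such that $f_j(z, U)$ is distributed as $A_j \mid Z = z$ for $U \sim$ Uniform[0,1]. The Bernoulli-sigmoid model below is an assumed toy choice, not from the paper.

```python
import math, random

def f_j(z, u):
    """Inverse-CDF map for the toy model A_j | Z = z ~ Bernoulli(sigmoid(z)):
    returns 1 exactly when u falls in an interval of length sigmoid(z)."""
    p1 = 1.0 / (1.0 + math.exp(-z))    # P(A_j = 1 | Z = z)
    return 1 if u > 1.0 - p1 else 0

random.seed(0)
z = 0.3
draws = [f_j(z, random.random()) for _ in range(100_000)]
freq = sum(draws) / len(draws)         # should approach sigmoid(0.3)
```

Because u is uniform and independent of everything else, the empirical frequency of ones converges to sigmoid(0.3), matching the target conditional distribution.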

Using these definitions, the first lemma relates unconfoundedness to the Kallenberg construction.

Lemma 1

(Kallenberg construction $\Leftrightarrow$ weak unconfoundedness). The assigned causes are weakly unconfounded given a random variable $Z_i$ if and only if the distribution of the assigned causes $(A_{i1}, \ldots, A_{im})$ admits a Kallenberg construction from $Z_i$.

What Lemma 1 says is that if the distribution of the assigned causes has a Kallenberg construction from a random variable Zi then Zi is a valid substitute confounder: it renders the causes unconfounded. Moreover, a valid substitute confounder must always come from a Kallenberg construction.

We next relate the Kallenberg construction to factor models. We show that factor models admit a Kallenberg construction. This fact suggests the deconfounder: if we fit a factor model to capture the distribution of assigned causes then we can use the fitted factor model to construct a substitute confounder. This step relies on a key assumption of the deconfounder, “no unobserved single-cause confounders.”

Definition 4

(No unobserved single-cause confounders). Denote Xi as the observed covariates. There are no unobserved single-cause confounders for the assigned causes Ai1,,Aim if, for j=1,,m,

  1. There exists some random variable $V_{ij}$ such that
(39) $A_{ij} \perp Y_i(\mathbf{a}) \mid X_i, V_{ij}$,
(40) $A_{ij} \perp A_{i,-j} \mid V_{ij}$,

where $A_{i,-j} = \{A_{i1}, \ldots, A_{im}\} \setminus A_{ij}$ is the complete set of m causes excluding the jth cause;
  2. There exists no proper sub-$\sigma$-algebra of $\sigma(V_{ij})$ that satisfies Equation (40).

At a high level, $V_{ij}$ refers to the multiple-cause confounders that affect the jth cause $A_{ij}$. Equation (39) then ensures that the observed covariates $X_i$ and the multiple-cause confounders $V_{ij}$ satisfy unconfoundedness. In other words, $X_i$ must contain all single-cause confounders. Equation (40) ensures that $V_{ij}$ accounts for the dependence between $A_{ij}$ and $A_{i,-j}$. It guarantees that $V_{ij}$ can be recovered by constructing a random variable $Z_i$ that renders all the causes conditionally independent.

This “no unobserved single-cause confounders” assumption differs from the classical weak unconfoundedness assumption (Definition 1) by only requiring marginal independence between each individual cause $A_{ij}$ and the potential outcome $Y_i(\mathbf{a})$. In contrast, weak unconfoundedness requires $(A_{i1}, \ldots, A_{im}) \perp Y_i(\mathbf{a}) \mid X_i$, that is, joint independence between the causes $(A_{i1}, \ldots, A_{im})$ and the potential outcome function $Y_i(\mathbf{a})$. Moreover, it involves multiple-cause confounders $V_{ij}$. We remark that “no unobserved single-cause confounders” reduces to weak unconfoundedness when there is only one cause; both require $A_i \perp Y_i(a) \mid X_i$, where $A_i$ and a are one-dimensional.

Now we state the connection between the Kallenberg construction and factor models.

Lemma 2

(Factor models $\Rightarrow$ Kallenberg construction). Under weak regularity conditions and “no unobserved single-cause confounders,” every factor model of the assigned causes $p(z_{1:n}, a_{1:n}\,;\,\theta_{1:m})$ admits a Kallenberg construction from $Z_i$.

Lemmas 1 and 2 connect unconfoundedness to Kallenberg constructions and then Kallenberg constructions to factor models. The two lemmas together connect factor models to unconfoundedness. These connections enable the deconfounder: they explain how the distribution of assigned causes relates to the substitute confounder Z in a Kallenberg construction. They justify why we can take a set of assigned causes and do inference on Z via factor models.

Next we establish two properties of the substitute confounder. We assume the substitute confounder comes from a factor model that captures the population distribution of the causes.

The first property is that the substitute confounder must capture all multiple-cause confounders. It implies that the inferred substitute confounder, together with all single-cause confounders (if there is any), deconfounds causal inference.

Lemma 3. Any multiple-cause confounder $C_i$ must be measurable with respect to the $\sigma$-algebra generated by the substitute confounder $Z_i$.

A multiple-cause confounder is a confounder that confounds two or more causes. (Its technical definition stems from Definition 4 of VanderWeele and Shpitser (2013); see Appendix H in the supplementary materials.) Figure 1 gives the intuition with a graphical model and Appendix H in the supplementary materials gives a detailed proof.

Lemma 3 shows that the deconfounder captures unobserved confounders. But might the inferred substitute confounder pick up a mediator? If the substitute confounder also picks up a mediator then conditioning on it will yield conservative causal estimates (Baron and Kenny 1986; Imai, Keele, and Yamamoto 2010). The next proposition alleviates this concern.

Lemma 4. Any mediator is almost surely not measurable with respect to the $\sigma$-algebra generated by the substitute confounder $Z_i$ and the pretreatment observed covariates $X_i$.

Lemma 4 implies that the substitute confounder does not pick up mediators, variables along the path between causes and effects. This property greenlights treating the inferred substitute confounder as a pretreatment covariate.

Lemmas 3 and 4 qualify the substitute confounder to mimic confounders. We condition on the substitute confounder and proceed with causal inference.

These lemmas lead to justifications of the deconfounder algorithm. We first describe their implications on the substitute confounders and factor models.

Proposition 5

(Substitute confounders and factor models). Under weak regularity conditions,

  1. Under “no unobserved single-cause confounders,” the assigned causes are weakly unconfounded given the substitute confounder Zi and the pretreatment covariates Xi if the true distribution p(a1:n) can be written as a factor model that uses the substitute confounder, p(z1:n,a1:n ; θ1:m).

  2. There always exists a factor model that captures the distribution of assigned causes.

Proof

sketch. The first part follows from Lemmas 1 and 2. The second part follows from Reichenbach’s common cause principle (Reichenbach 1956; Sober 1976; Peters, Janzing, and Schölkopf 2017) and Sklar’s theorem (Sklar 1959): any multivariate joint distribution can be factorized into the product of univariate marginal distributions and a copula that describes the dependence structure between the variables. The full proof is in Appendix G in the supplementary materials. □

Proposition 5 justifies the use of factor models in the deconfounder. The first part of Proposition 5 suggests how to find a valid substitute confounder, one that renders the causes weakly unconfounded. Two conditions suffice: (1) the substitute confounder comes from a factor model; (2) the factor model captures the population distribution of the assigned causes. The assignment model in the deconfounder stems from this result: fit a factor model to the assigned causes, check that it captures their population distribution, and finally use the fitted factor model to infer a substitute confounder. The first part of the proposition says that the deconfounder does deconfound. The second part ensures that there is hope to find a deconfounding factor model: there always exists a factor model that captures the population distribution of the assigned causes.

7.2 Causal Identification of the Deconfounder

Building on the characterizations of the substitute confounder (Lemmas 1–4), we discuss a collection of causal identification results around the deconfounder. We prove that the deconfounder can identify three causal quantities under suitable conditions. These causal quantities include the average causal effect of all the causes, the average causal effect of subsets of the causes, and the conditional potential outcome.

Before stating the identification results, we first describe the notion of a consistent substitute confounder; we will rely on this notion for identification.

Definition 5

(Consistency of substitute confounders). The factor model $p(\theta, z, a)$ admits consistent estimates of the substitute confounder $Z_i$ if, for some function $f_\theta$,
(41) $p(z_i \mid a_i, \theta) = \delta_{f_\theta(a_i)}$.

Consistency of substitute confounders requires that we can estimate the substitute confounder $Z_i$ from the causes $A_i$ with certainty; it is a deterministic function of the causes. Nevertheless, the substitute confounder need not coincide with the true data-generating $Z_i$; nor does it need to coincide with the true unobserved confounder. We only need to estimate the substitute confounder $Z_i$ up to some deterministic bijective transformation (e.g., scaling or a linear transformation).

Many factor models admit consistent substitute confounder estimates when the number of causes is large. For example, probabilistic PCA and Poisson factorization lead to consistent $Z_i$ as $(n+m)\log(nm)/(nm) \to 0$, where n is the number of individuals and m is the number of causes (Chen, Li, and Zhang 2017). Many studies also involve many causes, for example, the genome-wide association studies (gwas) study in Section 6.2 and the movie-actor study in Section 6.3.

We now describe three identification results under the sutva, “no unobserved single-cause confounders,” and consistency of substitute confounders. We first study the average causal effect of all the causes.

Theorem 6

(Identification of the average causal effect of all the causes). Assume sutva, “no unobserved single-cause confounders,” and consistency of substitute confounders. Then, under the conditions described below, the deconfounder nonparametrically identifies the average causal effect of all the causes. The average causal effect of changing the causes from $\mathbf{a} = (a_1, \ldots, a_m)$ to $\mathbf{a}' = (a'_1, \ldots, a'_m)$ is
(42) $\mathbb{E}_Y[Y_i(\mathbf{a})] - \mathbb{E}_Y[Y_i(\mathbf{a}')] = \mathbb{E}_{Z,X}\big[\mathbb{E}_Y[Y_i \mid A_i = \mathbf{a}, Z_i, X_i]\big] - \mathbb{E}_{Z,X}\big[\mathbb{E}_Y[Y_i \mid A_i = \mathbf{a}', Z_i, X_i]\big]$.

This holds under the following two conditions: (1) the substitute confounder is a piece-wise constant function of the (continuous) causes: $\nabla_a f_\theta(a) = 0$ up to a set of Lebesgue measure zero; (2) the outcome is separable,
$\mathbb{E}[Y_i(\mathbf{a}) \mid Z_i = z, X_i = x] = f_1(a, x) + f_2(z)$,
$\mathbb{E}[Y_i \mid A_i = a, Z_i = z, X_i = x] = f_3(a, x) + f_4(z)$,
for all $(a, x, z) \in \mathcal{A} \times \mathcal{X} \times \mathcal{Z}$ and some continuously differentiable functions $f_1$, $f_2$, $f_3$, and $f_4$.

Proof

sketch. Theorem 6 relies on two results: (1) “No unobserved single-cause confounders” and Lemma 3 ensure (Zi, Xi) capture all confounders; (2) The pretreatment nature of Xi and Lemma 4 ensure (Zi, Xi) capture no mediators. These results assert unconfoundedness given the substitute confounder Zi and the observed covariates Xi. They greenlight us for causal inference given consistency of substitute confounder estimates. Theorem 6 then leverages two additional conditions to identify average causal effects without assuming overlap. The full proof is in Appendix K in the supplementary materials. □

Theorem 6 shows that the deconfounder can unbiasedly estimate the average causal effect of all the causes. It requires two conditions beyond “no unobserved single-cause confounders,” sutva, and consistency of substitute confounders. The first condition requires that the substitute confounder be a piece-wise constant function of the causes; it is satisfied when the substitute confounder is discrete and the causes are continuous. We remark that this piece-wise constant condition does not assume away all confounding. For example, it is satisfied when the substitute confounder (and hence the unobserved confounder) is a discretization of the causes. In this case, the substitute confounder still correlates with the causes while satisfying the piece-wise constant condition.

The second condition of Theorem 6 requires that the potential outcome be separable in the substitute confounder and the causes; the observed data also respect this separability. This condition is satisfied when the substitute confounder does not interact with the causes. For example, this condition is often satisfied in gwas studies: the effect of snps on an individual’s height does not depend on his/her ancestry (Veturi et al. 2019). A reader might ask: how can the outcome be separable in the substitute confounder $Z_i$ and the causes $A_i$ when $Z_i = f_\theta(A_i)$, as required by the consistency of substitute confounders? The reason is that $f_\theta$ is a non-differentiable piece-wise constant function by condition (1), while $f_1, f_2, f_3, f_4$ are differentiable, as required by condition (2). In this way, the conditional expectation $\mathbb{E}[Y_i(\mathbf{a}) \mid Z_i = z, X_i = x]$ can be separated into two components, one differentiable, $f_1(a, x)$, and one non-differentiable, $f_2(z)$. A similar argument also holds for $\mathbb{E}[Y_i \mid A_i = a, Z_i = z, X_i = x]$. It is this incongruence in differentiability that leads to identification.
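These two conditions can be illustrated numerically in an assumed toy setting: the substitute confounder is a piece-wise constant discretization of a continuous cause, the outcome is separable, and adjusting for the substitute confounder recovers the causal slope while the unadjusted regression absorbs the confounding.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
a = rng.uniform(-1, 1, size=n)                # continuous cause
z = np.floor(2 * a)                           # piece-wise constant f_theta(a)
y = 1.5 * a + 3.0 * z + rng.normal(size=n)    # separable: f1(a) + f2(z)

# Adjusted: regress y on (1, a, z); the coefficient on a targets 1.5.
X = np.column_stack([np.ones(n), a, z])
slope_adj = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Unadjusted: regress y on (1, a) alone; the slope absorbs the confounder.
X0 = np.column_stack([np.ones(n), a])
slope_naive = np.linalg.lstsq(X0, y, rcond=None)[0][1]
```

Here z plays the role of $f_\theta(a)$: it is constant on intervals, so $\nabla_a f_\theta(a) = 0$ almost everywhere, yet it remains strongly correlated with a, so ignoring it badly biases the slope.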

When the separability condition of Theorem 6 does not hold, we can still use the deconfounder to handle the unobserved multiple-cause confounders that do not interact with the causes. As long as the observed covariates include those that do interact with the causes, the deconfounder produces unbiased estimates of the average causal effect.

We next discuss the identification of the average causal effect for subsets of the causes.

Theorem 7

(Identification of the average causal effect of subsets of the causes). Assume SUTVA, “no unobserved single-cause confounders,” and consistency of substitute confounders. Then, under the condition described below, the deconfounder nonparametrically identifies the average causal effect of subsets of causes. The average causal effect of changing the first k (k < m) causes from a1:k = (a1, …, ak) to a′1:k = (a′1, …, a′k) is
(43) EA(k+1):m[EY[Yi(a1:k, Ai,(k+1):m)]] − EA(k+1):m[EY[Yi(a′1:k, Ai,(k+1):m)]] = EZ,X[EY[Yi | Zi, Xi, Ai,1:k = a1:k]] − EZ,X[EY[Yi | Zi, Xi, Ai,1:k = a′1:k]].

This holds with the following condition: The first k causes Ai1, …, Aik satisfy overlap, P((Ai1, …, Aik) ∈ A | Zi, Xi) > 0 for any set A such that P(A) > 0.¹⁴

Proof

sketch. Similar to Theorem 6, Theorem 7 uses Lemmas 3 and 4 to greenlight the use of a substitute confounder. It then relies on overlap to identify the average causal effect; we follow the classical argument that identifies the average treatment effect (Imbens and Rubin 2015). The full proof is in Appendix L in the supplementary materials. □

Theorem 7 shows that the deconfounder can unbiasedly estimate the average causal effect of subsets of the causes. It lets us answer “how would the movie revenue change, on average, if we place Meryl Streep and Sean Connery into a movie?” Beyond “no unobserved single-cause confounders,” SUTVA, and consistency of substitute confounders, Theorem 7 requires overlap. Overlap ensures that EY[Yi | Zi, Xi, Ai,1:k = a1:k] is estimable from the observed data for all possible values of (Zi, Xi, Ai,1:k). The overlap assumption about the causes in Theorem 7 replaces the separability assumption about the outcome model required by Theorem 6.
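As a sketch of how the right-hand side of Equation (43) could be estimated, the following plug-in routine (all names here are illustrative assumptions, not the paper's code) averages a fitted outcome model over the empirical distribution of (Zi, Xi) while fixing the first k causes at the two intervention values:

```python
import numpy as np

# Plug-in sketch: `outcome_1k` stands for a fitted regression of Yi on
# (Zi, Xi, Ai,1:k), and `z_hat` for substitute-confounder estimates.
def subset_effect(outcome_1k, z_hat, x, a1k, a1k_prime):
    n = z_hat.shape[0]
    tile = lambda v: np.tile(np.asarray(v, dtype=float), (n, 1))
    # Average over the empirical distribution of (Zi, Xi), holding the
    # first k causes fixed at the intervention values.
    return float(np.mean(outcome_1k(z_hat, x, tile(a1k))
                         - outcome_1k(z_hat, x, tile(a1k_prime))))

# Toy usage with a known outcome model E[Y | z, x, a1k] = z + x + sum(a1k):
rng = np.random.default_rng(0)
z_hat, x = rng.normal(size=(100, 1)), rng.normal(size=(100, 1))
linear = lambda z, x, a: (z + x + a.sum(axis=1, keepdims=True)).ravel()
print(subset_effect(linear, z_hat, x, [1.0], [0.0]))  # 1.0
```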

We note that the overlap condition and the consistency of substitute confounders are compatible. Though consistency requires P(Zi | Ai) = δfθ(Ai), it is still possible for subsets of the causes to satisfy overlap; the consistency condition only prevents the complete set of m causes from satisfying overlap. For example, consider a consistent estimate of the substitute confounder that is one-dimensional, Zi = α1Ai1 + ⋯ + αmAim. Any k ≤ m − 1 causes satisfy overlap, but the complete set of m causes does not.
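A quick simulation illustrates this compatibility, under the one-dimensional linear substitute confounder above with m = 3 (our own toy setup):

```python
import numpy as np

# Assumed setup: m = 3 causes, one-dimensional substitute confounder
# z = alpha1*a1 + alpha2*a2 + alpha3*a3.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])
a = rng.normal(size=(100_000, 3))
z = a @ alpha

# Condition on (approximately) one value of z. On this slice the first
# m - 1 = 2 causes still vary, so subsets of size <= m - 1 can overlap.
on_slice = np.abs(z - 1.0) < 0.05
print(a[on_slice, :2].std(axis=0).min() > 0.5)  # True

# But the complete set of m causes cannot overlap: given z and the first
# two causes, the third cause is pinned down deterministically.
implied = (1.0 - a[on_slice, :2] @ alpha[:2]) / alpha[2]
print(np.abs(implied - a[on_slice, 2]).max() < 0.02)  # True
```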

Finally, we discuss the identification of the conditional mean potential outcome.

Theorem 8

(Identification of the conditional mean potential outcome). Assume SUTVA, “no unobserved single-cause confounders,” and consistency of substitute confounders. Then, under the condition described below, the deconfounder nonparametrically identifies the mean potential outcome of an individual given its currently assigned causes. If an individual is assigned the causes a = (a1, …, am), then its mean potential outcome under a different assignment a′ = (a′1, …, a′m) is EY[Yi(a′) | Ai = a] = EZ,X[EY[Yi | Zi, Xi, Ai = a′]].

This holds with the following condition: The cause assignment of interest a′ leads to the same substitute confounder estimate as the observed assigned causes: P(Zi | Ai = a′) = P(Zi | Ai = a).

Proof

sketch. As with Theorems 6 and 7, Theorem 8 relies on the unconfoundedness given the substitute confounders Zi and the observed covariates Xi due to Lemmas 3 and 4. It then identifies the potential outcome by focusing on the data points with the same substitute confounder estimate. We note that this identification result does not require overlap. The full proof is in Appendix M in the supplementary materials. □

Given consistency of substitute confounders, Theorem 8 nonparametrically identifies the mean potential outcome Yi(a′) of an individual given its currently assigned causes Ai = a. The only requirement concerns the configurations of cause assignments we can query, a′; these configurations must lead to the same substitute confounder estimate as the currently assigned causes.

We illustrate this condition with actors causing movie revenue. For simplicity, assume the substitute confounder captures the genre of each movie. Start with one of the James Bond movies; it is a spy film. We can ask what its revenue would be if its cast were that of “The Bourne Trilogy” (also spy films). Alternatively, we can ask what would happen if its cast included some actors from “The Bourne Trilogy” and others from “North By Northwest”; both are spy films. However, we cannot ask what would happen if its cast were that of “The Shawshank Redemption” (which is not a spy film).
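This query condition is easy to operationalize. The sketch below (hypothetical names throughout; `genre` is a stand-in for the substitute-confounder map fθ) checks whether a counterfactual cast maps to the same substitute confounder before allowing the query:

```python
import numpy as np

# Sketch of the Theorem 8 query check: a counterfactual assignment
# a_query is answerable only if it yields the same substitute confounder
# as the observed assignment a_obs under the deterministic map f_theta.
def answerable(f_theta, a_obs, a_query, tol=1e-8):
    return bool(np.all(np.abs(f_theta(a_query) - f_theta(a_obs)) < tol))

# Toy f_theta: a "genre" that depends on the cast only through its size.
genre = lambda a: np.array([float(np.sum(a) >= 2)])
bond_cast = np.array([1, 1, 0, 0])
bourne_cast = np.array([0, 0, 1, 1])    # same "genre": query allowed
solo_cast = np.array([1, 0, 0, 0])      # different "genre": not allowed
print(answerable(genre, bond_cast, bourne_cast))  # True
print(answerable(genre, bond_cast, solo_cast))    # False
```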

Theorems 6–8 confirm the validity of the deconfounder by providing three sets of nonparametric identification results. When the assumptions in Theorems 6–8 may not hold, we recommend evaluating the uncertainty of the deconfounder estimate. Section 5 discusses how; Section 6.1 gives an example. The posterior distribution of the deconfounder estimate reflects how the (finite) observed data informs causal quantities of interest. When the causal quantity is non-identifiable, the posterior distribution of the deconfounder estimate will reflect this non-identifiability. For example, if the causal quantity is non-identifiable over R, the posterior distribution of the deconfounder estimate will be uniform over R (with noninformative priors).
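As a toy illustration of this last point (our own example, not from the paper), consider y ~ N(θ1 + θ2, 1): only the sum θ1 + θ2 is identified, and a grid computation shows that the marginal posterior of θ1 is flat under flat priors.

```python
import numpy as np

# y depends on theta1 and theta2 only through their sum, so theta1 alone
# is non-identified; its marginal posterior is flat under flat priors.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)

t1 = np.arange(-5.0, 5.5, 0.5)          # grid of theta1 values
t2 = np.arange(-60.0, 60.0, 0.025)      # wide theta2 grid standing in for R

def marginal_posterior(a):
    # Integrate the likelihood over theta2 (flat prior on both parameters).
    mu = a + t2
    ll = -0.5 * ((y[:, None] - mu[None, :]) ** 2).sum(axis=0)
    return np.sum(np.exp(ll - ll.max())) * 0.025

marg = np.array([marginal_posterior(a) for a in t1])
print(np.allclose(marg, marg[0]))  # True: the posterior is flat in theta1
```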

We finally remark that the identification results in Theorems 6–8 do not contradict the negative results of D’Amour (2019). D’Amour (2019) explores the nonparametric non-identification of a particular multi-causal quantity, the mean potential outcome E[Yi(a)]. In this article, Theorems 6–8 establish the nonparametric identification of different causal quantities, and D’Amour (2019) does not make the same assumptions as Theorems 6–8. More specifically, under consistency of substitute confounders and other suitable conditions, Theorem 6 shows that the average causal effect of all the causes, E[Yi(a)] − E[Yi(a′)], is nonparametrically identifiable; Theorem 7 shows that the average causal effect of subsets of the causes, EA(k+1):m[EY[Yi(a1:k, Ai,(k+1):m)]] − EA(k+1):m[EY[Yi(a′1:k, Ai,(k+1):m)]], is nonparametrically identifiable; and Theorem 8 shows that the conditional mean potential outcome E[Yi(a′) | Ai = a] is nonparametrically identifiable.

8 Discussion

Classical causal inference studies how a univariate cause affects an outcome. Here we studied multiple causal inference, where multiple causes contribute to the effect. Multiple causes might at first appear to be a curse, but we showed that they can be a blessing.

We developed the deconfounder: first fit a good factor model of assigned causes; then use the factor model to infer a substitute confounder; finally perform causal inference. We showed how a substitute confounder from a good factor model must capture all multi-cause confounders, and we demonstrated that whether a factor model is satisfactory is a checkable proposition.
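The three steps can be sketched in a few lines of numpy. The modeling choices below are illustrative stand-ins, not the paper's recommendations: PCA stands in for "a good factor model," rounded factor scores give a piece-wise constant substitute confounder (echoing Theorem 6), and a reconstruction check stands in for the predictive check.

```python
import numpy as np

def deconfounder(a, y, n_factors=1):
    n, m = a.shape
    a_c = a - a.mean(axis=0)

    # Step 1: fit a factor model of the assigned causes (PCA via SVD).
    _, _, vt = np.linalg.svd(a_c, full_matrices=False)
    scores = a_c @ vt[:n_factors].T

    # Step 2 (simplified check): the factor model should reconstruct the
    # causes better than a mean-only baseline.
    recon = scores @ vt[:n_factors]
    check_ok = np.mean((a_c - recon) ** 2) < np.mean(a_c ** 2)

    # Step 3: infer a substitute confounder (discretized factor scores)
    # and regress the outcome on the causes and the substitute confounder.
    z_hat = np.round(scores)
    X = np.column_stack([np.ones(n), a, z_hat])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta[1:1 + m], check_ok

# Toy usage: causes driven by a shared latent variable.
rng = np.random.default_rng(0)
w = np.array([1.0, 1.0, 1.0, 1.0])
z_true = 3.0 * rng.normal(size=500)
a = np.outer(z_true, w) + 0.1 * rng.normal(size=(500, 4))
y = a[:, 0] + rng.normal(size=500)
effects, ok = deconfounder(a, y)
print(effects.shape, ok)  # (4,) True
```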

There are many directions for future work.

  • We estimated the potential outcomes under configurations of the causes. Which potential outcomes can be reliably estimated? Can we trade off confounding bias and estimation variance?

  • We checked factor models for downstream causal unbiasedness. But model checking is an imprecise science. Can we develop rigorous model checking algorithms for causal inference?

  • We focused on estimation. Can we develop a testing counterpart? How can we identify significant causes while still preserving family-wise error rate or false discovery rate?

  • We analyzed univariate outcomes. Can we work with both multiple causes and multiple outcomes? Can dependence among outcomes further help causal inference?

Supplementary Materials

The supplementary materials contain further discussions of the deconfounder algorithm, detailed results of the empirical studies, and proofs of the theoretical results.

Acknowledgments

We have had many useful discussions about the previous versions of this article. We thank Edo Airoldi, Elias Barenboim, Léon Bottou, Alexander D’Amour, Barbara Engelhart, Andrew Gelman, David Heckerman, Jennifer Hill, Ferenc Huszár, George Hripcsak, Daniel Hsu, Guido Imbens, Thorsten Joachims, Fan Li, Lydia Liu, Jackson Loper, David Madigan, Joris Mooij, Suresh Naidu, Xinkun Nie, Elizabeth Ogburn, Georgia Papadogeorgou, Judea Pearl, Alex Peysakhovich, Rajesh Ranganath, Jason Roy, Cosma Shalizi, Dylan Small, Hal Stern, Amos Storkey, Wesley Tansey, Eric Tchetgen Tchetgen, Dustin Tran, Victor Veitch, Linbo Wang, Stefan Wager, Kilian Weinberger, Jeannette Wing, Linying Zhang, Qingyuan Zhao, and José Zubizarreta.

Additional information

Funding

This work is supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NIH 1U01MH115727-01, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, IBM, 2Sigma, Amazon, NVIDIA, and Simons Foundation.

Notes

1 We use the term assigned causes for the vector of what some might call the “assigned treatments.” Because some variables may not exhibit a causal effect, a more precise term would be “assigned potential causes” (but it is too cumbersome).

2 Here is the notation. Capital letters denote a random variable. For example, the random variable Ai is a randomly chosen vector of assigned causes from the population. The random variable Yi(Ai) is a randomly chosen potential outcome from the population, evaluated at its assigned causes. A lowercase letter is a realization. For example, ai is in the dataset—it is the vector of assigned causes of individual i. The left side of Equation (1) is an expectation with respect to the random variables; it conditions on the random vector of assigned causes to be equal to a certain realization Ai=a. The right side is an expectation over the same population of the potential outcome functions, but always evaluated at the realization a.

3 Here we describe the weak version of the unconfoundedness assumption, which requires individual potential outcomes Yi(a) be marginally independent of the causes Ai, that is, Ai ⫫ Yi(a) | Xi for all a. Imbens (2000) and Hirano and Imbens (2004) call this assumption weak unconfoundedness. In contrast, the strong version of unconfoundedness says Ai ⫫ (Yi(a))a∈A | Xi, which requires all possible potential outcomes (Yi(a))a∈A be jointly independent of the causes Ai.

4 We also assume stable unit treatment value assumption (SUTVA) (Rubin 1980, 1990) and overlap (Imai and Van Dyk 2004), roughly that any vector of assigned causes has positive probability. These three assumptions together identify the potential outcome function (Imbens 2000; Hirano and Imbens 2004; Imai and Van Dyk 2004).

5 Figure 1 uses a graphical model to represent and reason about conditional dependencies in the population distribution. It is not a causal graphical model or a structural equation model.

6 We also require the observed covariates Xi satisfy the overlap condition if they are single-cause confounders, that is, p(Aij ∈ A | Xi) > 0 for all sets A with positive measure, that is, p(A) > 0.

7 https://www.kaggle.com/tmdb.

8 This section illustrates how to use the deconfounder to explore data. It is about these methods and the particular dataset that we studied, not a comment about the ground-truth quality of the actors involved. The authors of this article are statisticians, not film critics.

9 Here “identify” means the causal quantity can be written as a function of the observed data. Moreover, the deconfounder can unbiasedly estimate it.

10 Together with Lemma 3, consistency of substitute confounders implies that the true unobserved multiple-cause confounders are also deterministic functions of the causes.

11 We assume the two conditions—“piece-wise constant” and “separable”—for the substitute confounder. However, it suffices to assume the same two conditions for the unobserved multiple-cause confounders. The former is easier to check; it also implies the latter because of Lemma 3.

12 For binary causes, we can analogously assume that there exist anew and a′new such that anew − a′new = a − a′ and they lead to the same substitute confounder estimate, f(anew) = f(a′new). Further, the outcome model is separable: E[Yi(a) − Yi(a′) | Zi=z, Xi=x] = f1(a − a′, x) + f2(z).

13 The expectation over Zi and Xi in Equation (42) is taken over P(Zi, Xi): EZi,Xi[EY[Yi | Ai=a, Zi, Xi]] = ∫∫ EY[Yi | Ai=a, Zi, Xi] P(Zi, Xi) dZi dXi.

14 In full notation, EA(k+1):m[EY[Yi(a1:k, Ai,(k+1):m)]] = EA(k+1):m[EY[Yi(a1, …, ak, Ai,k+1, …, Ai,m)]].

References

  • Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008), “Mixed Membership Stochastic Blockmodels,” Journal of Machine Learning Research, 9, 1981–2014.
  • Aldrich, J. (1989), “Autonomy,” Oxford Economic Papers, 41, 15–34. DOI: 10.1093/oxfordjournals.oep.a041889.
  • Astle, W., and Balding, D. J. (2009), “Population Structure and Cryptic Relatedness in Genetic Association Studies,” Statistical Science, 24, 451–471. DOI: 10.1214/09-STS307.
  • Baron, R. M., and Kenny, D. A. (1986), “The Moderator–Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations,” Journal of Personality and Social Psychology, 51, 1173. DOI: 10.1037/0022-3514.51.6.1173.
  • Bayarri, M., and Castellanos, M. (2007), “Bayesian Checking of the Second Levels of Hierarchical Models,” Statistical Science, 22, 322–343. DOI: 10.1214/07-STS235.
  • Blei, D. M. (2014), “Build, Compute, Critique, Repeat: Data Analysis With Latent Variable Models,” Annual Review of Statistics and Its Application, 1, 203–232. DOI: 10.1146/annurev-statistics-022513-115657.
  • Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003), “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 3, 993–1022.
  • Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017), “Stan: A Probabilistic Programming Language,” Journal of Statistical Software, 76, 1–32. DOI: 10.18637/jss.v076.i01.
  • Cemgil, A. T. (2009), “Bayesian Inference for Nonnegative Matrix Factorization Models,” Computational Intelligence and Neuroscience, 2009, 785152. DOI: 10.1155/2009/785152.
  • Chen, Y., Li, X., and Zhang, S. (2017), “Structured Latent Factor Analysis for Large-Scale Data: Identifiability, Estimability, and Their Implications,” arXiv no. 1712.08966. DOI: 10.1080/01621459.2019.1635485.
  • Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2017), “Double/Debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics Journal, 21, C1–C68. DOI: 10.1111/ectj.12097.
  • Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Foster, J. D., Nuyujukian, P., Ryu, S. I., and Shenoy, K. V. (2012), “Neural Population Dynamics During Reaching,” Nature, 487, 51. DOI: 10.1038/nature11129.
  • Cinelli, C., Kumor, D., Chen, B., Pearl, J., and Bareinboim, E. (2019), “Sensitivity Analysis of Linear Structural Causal Models,” in International Conference on Machine Learning, pp. 1252–1261.
  • Collins, M., Dasgupta, S., and Schapire, R. E. (2002), “A Generalization of Principal Components Analysis to the Exponential Family,” in Advances in Neural Information Processing Systems, pp. 617–624.
  • Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2009), “Dealing With Limited Overlap in Estimation of Average Treatment Effects,” Biometrika, 96, 187–199. DOI: 10.1093/biomet/asn055.
  • D’Amour, A. (2019), “On Multi-Cause Causal Inference With Unobserved Confounding: Counterexamples, Impossibility, and Alternatives,” arXiv no. 1902.10286.
  • D’Amour, A., Ding, P., Feller, A., Lei, L., and Sekhon, J. (2017), “Overlap in Observational Studies With High-Dimensional Covariates,” arXiv no. 1711.02582.
  • Dawid, A. P., and Didelez, V. (2010), “Identifying the Consequences of Dynamic Treatment Strategies: A Decision-Theoretic Overview,” Statistics Surveys, 4, 184–231. DOI: 10.1214/10-SS081.
  • Dehejia, R. H., and Wahba, S. (2002), “Propensity Score-Matching Methods for Nonexperimental Causal Studies,” Review of Economics and Statistics, 84, 151–161. DOI: 10.1162/003465302317331982.
  • Dey, D., Gelfand, A., Swartz, T., and Vlachos, P. (1998), “Simulation Based Model Checking for Hierarchical Models,” Test, 7, 325–346. DOI: 10.1007/BF02565116.
  • Erosheva, E. A. (2003), “Bayesian Estimation of the Grade of Membership Model,” Bayesian Statistics, 7, 501–510.
  • Feng, P., Zhou, X.-H., Zou, Q.-M., Fan, M.-Y., and Li, X.-S. (2012), “Generalized Propensity Score for Estimating the Average Treatment Effect of Multiple Treatments,” Statistics in Medicine, 31, 681–697. DOI: 10.1002/sim.4168.
  • Franks, A., D’Amour, A., and Feller, A. (2019), “Flexible Sensitivity Analysis for Observational Studies Without Observable Implications,” Journal of the American Statistical Association, 1–33. DOI: 10.1080/01621459.2019.1604369.
  • Frot, B., Nandy, P., and Maathuis, M. H. (2019), “Robust Causal Structure Learning With Some Hidden Variables,” Journal of the Royal Statistical Society, Series B, 81, 459–487. DOI: 10.1111/rssb.12315.
  • Geisser, S., Hodges, J., Press, S., and Zellner, A. (1990), “The Validity of Posterior Expansions Based on Laplace Method,” Bayesian and Likelihood Methods in Statistics and Econometrics, 7, 473.
  • Gelfand, A. E., Dey, D. K., and Chang, H. (1992), “Model Determination Using Predictive Distributions With Implementation via Sampling-Based Methods,” Technical Report, DTIC Document.
  • Gelman, A., Meng, X., and Stern, H. (1996), “Posterior Predictive Assessment of Model Fitness via Realized Discrepancies,” Statistica Sinica, 6, 733–807.
  • Gelman, A., Stern, H. S., Carlin, J. B., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013), Bayesian Data Analysis, Boca Raton, FL: Chapman and Hall/CRC.
  • Gilbert, P. B., Bosch, R. J., and Hudgens, M. G. (2003), “Sensitivity Analysis for the Assessment of Causal Vaccine Effects on Viral Load in HIV Vaccine Trials,” Biometrics, 59, 531–541. DOI: 10.1111/1541-0420.00063.
  • Gopalan, P., Hofman, J. M., and Blei, D. M. (2015), “Scalable Recommendation With Hierarchical Poisson Factorization,” in Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp. 326–335.
  • GTEx Consortium, Battle, A., Brown, C. D., Engelhardt, B. E., and Montgomery, S. M. (2017), “Genetic Effects on Gene Expression Across Human Tissues,” Nature, 550, 204–213.
  • Haavelmo, T. (1944), “The Probability Approach in Econometrics,” Econometrica: Journal of the Econometric Society, 12, iii–115. DOI: 10.2307/1906935.
  • Hao, W., Song, M., and Storey, J. D. (2015), “Probabilistic Models of Genetic Variation in Structured Populations Applied to Global Human Studies,” Bioinformatics, 32, 713–721. DOI: 10.1093/bioinformatics/btv641.
  • Heckerman, D. (2018), “Accounting for Hidden Common Causes When Inferring Cause and Effect From Observational Data,” arXiv no. 1801.00727.
  • Heckman, J., Ichimura, H., Smith, J., and Todd, P. (1998), “Characterizing Selection Bias Using Experimental Data,” Technical Report, National Bureau of Economic Research.
  • Hill, J. L. (2011), “Bayesian Nonparametric Modeling for Causal Inference,” Journal of Computational and Graphical Statistics, 20, 217–240. DOI: 10.1198/jcgs.2010.08162.
  • Hirano, K., and Imbens, G. W. (2004), “The Propensity Score With Continuous Treatments,” in Applied Bayesian Modeling and Causal Inference From Incomplete-Data Perspectives, eds. A. Gelman and X.-L. Meng, New York: Wiley, pp. 73–84.
  • Holland, P. (1986), “Statistics and Causal Inference,” Journal of the American Statistical Association, 81, 945–960. DOI: 10.1080/01621459.1986.10478354.
  • Horvitz, D. G., and Thompson, D. J. (1952), “A Generalization of Sampling Without Replacement From a Finite Universe,” Journal of the American Statistical Association, 47, 663–685. DOI: 10.1080/01621459.1952.10483446.
  • Imai, K., Keele, L., and Yamamoto, T. (2010), “Identification, Inference and Sensitivity Analysis for Causal Mediation Effects,” Statistical Science, 25, 51–71. DOI: 10.1214/10-STS321.
  • Imai, K., and Van Dyk, D. A. (2004), “Causal Inference With General Treatment Regimes: Generalizing the Propensity Score,” Journal of the American Statistical Association, 99, 854–866. DOI: 10.1198/016214504000001187.
  • Imbens, G. W. (2000), “The Role of the Propensity Score in Estimating Dose-Response Functions,” Biometrika, 87, 706–710. DOI: 10.1093/biomet/87.3.706.
  • Imbens, G. W., and Rubin, D. B. (2015), Causal Inference in Statistics, Social and Biomedical Sciences: An Introduction, New York: Cambridge University Press.
  • Janzing, D., and Schölkopf, B. (2018a), “Detecting Confounding in Multivariate Linear Models via Spectral Analysis,” Journal of Causal Inference, 6, 1–27. DOI: 10.1515/jci-2017-0013.
  • Janzing, D., and Schölkopf, B. (2018b), “Detecting Non-Causal Artifacts in Multivariate Linear Regression Models,” arXiv no. 1803.00810.
  • Kallenberg, O. (1997), Foundations of Modern Probability, Probability and Its Applications, New York: Springer.
  • Kaltenpoth, D., and Vreeken, J. (2019), “We Are Not Your Real Parents: Telling Causal From Confounded Using MDL,” arXiv no. 1901.06950.
  • Kang, H. M., Sul, J. H., Zaitlen, N. A., Kong, S.-y., Freimer, N. B., Sabatti, C., and Eskin, E. (2010), “Variance Component Model to Account for Sample Structure in Genome-Wide Association Studies,” Nature Genetics, 42, 348. DOI: 10.1038/ng.548.
  • Kingma, D. P., and Welling, M. (2013), “Auto-Encoding Variational Bayes,” arXiv no. 1312.6114.
  • Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017), “Automatic Differentiation Variational Inference,” The Journal of Machine Learning Research, 18, 430–474.
  • Laird, N. M., and Louis, T. A. (1982), “Approximate Posterior Distributions for Incomplete Data Problems,” Journal of the Royal Statistical Society, Series B, 44, 190–200. DOI: 10.1111/j.2517-6161.1982.tb01198.x.
  • Lanes, S. F. (1988), “The Logic of Causal Inference,” in Causal Inference, ed. K. J. Rothman, Boston: ERI, pp. 59–75.
  • Lechner, M. (2001), “Identification and Estimation of Causal Effects of Multiple Treatments Under the Conditional Independence Assumption,” in Econometric Evaluation of Labor Market Policies, eds. M. Lechner and F. Pfeiffer, Heidelberg: Springer, pp. 43–58.
  • Lee, B. K., Lessler, J., and Stuart, E. A. (2010), “Improving Propensity Score Weighting Using Machine Learning,” Statistics in Medicine, 29, 337–346. DOI: 10.1002/sim.3782.
  • Lee, D. D., and Seung, H. S. (1999), “Learning the Parts of Objects by Non-Negative Matrix Factorization,” Nature, 401, 788. DOI: 10.1038/44565.
  • Lee, D. D., and Seung, H. S. (2001), “Algorithms for Non-Negative Matrix Factorization,” in Advances in Neural Information Processing Systems, pp. 556–562.
  • Liu, F., and Chan, L. (2018), “Confounder Detection in High Dimensional Linear Models Using First Moments of Spectral Measures,” arXiv no. 1803.06852.
  • Lopez, M. J., and Gutman, R. (2017), “Estimation of Causal Effects With Multiple Treatments: A Review and New Ideas,” Statistical Science, 32, 432–454. DOI: 10.1214/17-STS612.
  • Louizos, C., Shalit, U., Mooij, J. M., Sontag, D., Zemel, R., and Welling, M. (2017), “Causal Effect Inference With Deep Latent-Variable Models,” in Advances in Neural Information Processing Systems, pp. 6449–6459.
  • Lunceford, J. K., and Davidian, M. (2004), “Stratification and Weighting via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study,” Statistics in Medicine, 23, 2937–2960. DOI: 10.1002/sim.1903.
  • McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R., and Burgette, L. F. (2013), “A Tutorial on Propensity Score Estimation for Multiple Treatments Using Generalized Boosted Models,” Statistics in Medicine, 32, 3388–3414. DOI: 10.1002/sim.5753.
  • McCaffrey, D. F., Ridgeway, G., and Morral, A. R. (2004), “Propensity Score Estimation With Boosted Regression for Evaluating Causal Effects in Observational Studies,” Psychological Methods, 9, 403. DOI: 10.1037/1082-989X.9.4.403.
  • McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), Monographs on Statistics and Applied Probability 37, Chapman and Hall/CRC.
  • Mckeigue, P., Krohn, J., Storkey, A. J., and Agakov, F. V. (2010), “Sparse Instrumental Variables (SPIV) for Genome-Wide Studies,” in Advances in Neural Information Processing Systems, pp. 28–36.
  • McLachlan, G. J., and Basford, K. E. (1988), Mixture Models: Inference and Applications to Clustering (Vol. 84), New York: Marcel Dekker.
  • Moghaddass, R., Rudin, C., and Madigan, D. (2016), “The Factorized Self-Controlled Case Series Method: An Approach for Estimating the Effects of Many Drugs on Many Outcomes,” Journal of Machine Learning Research, 17, 1–24.
  • Mohamed, S., Ghahramani, Z., and Heller, K. A. (2009), “Bayesian Exponential Family PCA,” in Advances in Neural Information Processing Systems, pp. 1089–1096.
  • Mohamed, S., and Lakshminarayanan, B. (2016), “Learning in Implicit Generative Models,” arXiv no. 1610.03483.
  • Mooij, J. M., Stegle, O., Janzing, D., Zhang, K., and Schölkopf, B. (2010), “Probabilistic Latent Variable Models for Distinguishing Between Cause and Effect,” in Advances in Neural Information Processing Systems, pp. 1687–1695.
  • Morgan, S., and Winship, C. (2015), Counterfactuals and Causal Inference (2nd ed.), New York: Cambridge University Press.
  • Neal, R. M. (1990), “Learning Stochastic Feedforward Networks,” Department of Computer Science, University of Toronto, 64, 1283.
  • Pearl, J. (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Francisco, CA: Morgan Kaufmann Publishers Inc.
  • Pearl, J. (2009), Causality (2nd ed.), New York: Cambridge University Press.
  • Peters, J., Bühlmann, P., and Meinshausen, N. (2016), “Causal Inference by Using Invariant Prediction: Identification and Confidence Intervals,” Journal of the Royal Statistical Society, Series B, 78, 947–1012. DOI: 10.1111/rssb.12167.
  • Peters, J., Janzing, D., and Schölkopf, B. (2017), Elements of Causal Inference: Foundations and Learning Algorithms, Cambridge, MA: MIT Press.
  • Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006), “Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies,” Nature Genetics, 38, 904. DOI: 10.1038/ng1847. [Crossref], [PubMed], [Web of Science ®][Google Scholar]
  • Pritchard, J. K., Stephens, M., Rosenberg, N. A., and Donnelly, P. (2000), “Association Mapping in Structured Populations,” The American Journal of Human Genetics, 67, 170181. DOI: 10.1086/302959. [Crossref], [PubMed], [Web of Science ®][Google Scholar]
  • Ranganath, R., and Blei, D. M. (2019), “Population Predictive Checks,” arXiv no. 1908.00882. [Google Scholar]
  • Ranganath, R., Gerrish, S., and Blei, D. (2014), “Black Box Variational Inference,” in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, eds. Samuel Kaski and Jukka Corander, Reykjavik, Iceland: PMLR, 33, pp. 814822. [Google Scholar]
  • Ranganath, R., and Perotte, A. (2018), “Multiple Causal Inference With Latent Confounding,” arXiv no. 1805.08273. [Google Scholar]
  • Ranganath, R., Tang, L., Charlin, L. and Blei, D. (2015), “Deep Exponential Families,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, eds. Lebanon, G. and S. V. N. Vishwanathan, San Diego, California, USA: PMLR, 38, pp. 762771. [Google Scholar]
  • Ranganath, R., Tran, D., and Blei, D. (2016), “Hierarchical Variational Models,” in International Conference on Machine Learning, pp. 324333. [Google Scholar]
  • Rassen, J. A., Solomon, D. H., Glynn, R. J., and Schneeweiss, S. (2011), “Simultaneously Assessing Intended and Unintended Treatment Effects of Multiple Treatment Options: A Pragmatic ‘Matrix Design’,” Pharmacoepidemiology and Drug Safety, 20, 675683. DOI: 10.1002/pds.2121. [Crossref], [PubMed], [Web of Science ®][Google Scholar]
  • Reichenbach, H. (1956), The Direction of Time, ed. M. Reichenbach, Berkeley, CA: University of California Press. [Crossref][Google Scholar]
  • Rezende, D. J., and Mohamed, S. (2015), “Variational Inference With Normalizing Flows,” arXiv no. 1505.05770. [Google Scholar]
  • Rezende, D. J., Mohamed, S., and Wierstra, D. (2014), “Stochastic Backpropagation and Variational Inference in Deep Latent Gaussian Models,” in International Conference on Machine Learning (Vol. 2). [Google Scholar]
  • Robins, J. M., Rotnitzky, A., and Scharfstein, D. O. (2000), “Sensitivity Analysis for Selection Bias and Unmeasured Confounding in Missing Data and Causal Inference Models,” in Statistical Models in Epidemiology, the Environment, and Clinical Trials, eds. M. E. Halloran and D. Berry, New York: Springer, pp. 194. [Crossref][Google Scholar]
  • Rosenbaum, P. R., and Rubin, D. B. (1983), “The Central Role of the Propensity Score in Observational Studies for Causal Effects,” Biometrika, 70, 41–55. DOI: 10.1093/biomet/70.1.41.
  • Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies,” Journal of Educational Psychology, 66, 688. DOI: 10.1037/h0037350.
  • Rubin, D. B. (1980), “Randomization Analysis of Experimental Data: The Fisher Randomization Test Comment,” Journal of the American Statistical Association, 75, 591–593.
  • Rubin, D. B. (1984), “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician,” The Annals of Statistics, 12, 1151–1172.
  • Rubin, D. B. (1990), “Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies,” Statistical Science, 5, 472–480.
  • Rubin, D. B. (2005), “Causal Inference Using Potential Outcomes: Design, Modeling, Decisions,” Journal of the American Statistical Association, 100, 322–331.
  • Schmidt, M. N., Winther, O., and Hansen, L. K. (2009), “Bayesian Non-Negative Matrix Factorization,” in International Conference on Independent Component Analysis and Signal Separation, Springer, pp. 540–547.
  • Schneeweiss, S., Rassen, J. A., Glynn, R. J., Avorn, J., Mogun, H., and Brookhart, M. A. (2009), “High-Dimensional Propensity Score Adjustment in Studies of Treatment Effects Using Health Care Claims Data,” Epidemiology, 20, 512. DOI: 10.1097/EDE.0b013e3181a663cc.
  • Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012), “On Causal and Anticausal Learning,” arXiv no. 1206.6471.
  • Shah, R. D., and Meinshausen, N. (2018), “RSVP-Graphs: Fast High-Dimensional Covariance Matrix Estimation Under Latent Confounding,” arXiv no. 1811.01076.
  • Sharma, A., Hofman, J. M., and Watts, D. J. (2016), “Split-Door Criterion for Causal Identification: Automatic Search for Natural Experiments,” arXiv no. 1611.09414.
  • Sklar, M. (1959), “Fonctions de Répartition à n Dimensions et Leurs Marges,” Publications de l’Institut de Statistique de l’Université de Paris, 8, 229–231.
  • Sober, E. (1976), Simplicity.
  • Song, M., Hao, W., and Storey, J. D. (2015), “Testing for Genetic Associations in Arbitrarily Structured Populations,” Nature Genetics, 47, 550–554. DOI: 10.1038/ng.3244.
  • Stephens, M., and Balding, D. J. (2009), “Bayesian Statistical Methods for Genetic Association Studies,” Nature Reviews Genetics, 10, 681. DOI: 10.1038/nrg2615.
  • Tierney, L., and Kadane, J. B. (1986), “Accurate Approximations for Posterior Moments and Marginal Densities,” Journal of the American Statistical Association, 81, 82–86. DOI: 10.1080/01621459.1986.10478240.
  • Tipping, M. E., and Bishop, C. M. (1999), “Probabilistic Principal Component Analysis,” Journal of the Royal Statistical Society, Series B, 61, 611–622. DOI: 10.1111/1467-9868.00196.
  • Tran, D., and Blei, D. M. (2017), “Implicit Causal Models for Genome-Wide Association Studies,” arXiv no. 1710.10742.
  • Tran, D., Hoffman, M. D., Saurous, R. A., Brevdo, E., Murphy, K., and Blei, D. M. (2017), “Deep Probabilistic Programming,” in International Conference on Learning Representations.
  • Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. (2016a), “Edward: A Library for Probabilistic Modeling, Inference, and Criticism,” arXiv no. 1610.09787.
  • Tran, D., Ruiz, F. J., Athey, S., and Blei, D. M. (2016b), “Model Criticism for Bayesian Causal Inference,” arXiv no. 1610.09037.
  • US Department of Health and Human Services Public Health Service (1987), National Medical Expenditure Survey Series (NMES), Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2006-03-30. DOI: 10.3886/ICPSR06371.v1.
  • VanderWeele, T. J., and Shpitser, I. (2013), “On the Definition of a Confounder,” The Annals of Statistics, 41, 196–220. DOI: 10.1214/12-AOS1058.
  • Veturi, Y., de los Campos, G., Yi, N., Huang, W., Vazquez, A. I., and Kühnel, B. (2019), “Modeling Heterogeneity in the Genetic Architecture of Ethnically Diverse Groups Using Random Effect Interaction Models,” Genetics, 211, 1395–1407. DOI: 10.1534/genetics.119.301909.
  • Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., and Yang, J. (2017), “10 Years of GWAS Discovery: Biology, Function, and Translation,” The American Journal of Human Genetics, 101, 5–22. DOI: 10.1016/j.ajhg.2017.06.005.
  • Wager, S., and Athey, S. (2018), “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests,” Journal of the American Statistical Association, 113, 1228–1242. DOI: 10.1080/01621459.2017.1319839.
  • Yu, J., Pressoir, G., Briggs, W. H., Bi, I. V., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., and Kresovich, S. (2006), “A Unified Mixed-Model Method for Association Mapping That Accounts for Multiple Levels of Relatedness,” Nature Genetics, 38, 203. DOI: 10.1038/ng1702.
  • Zanutto, E., Lu, B., and Hornik, R. (2005), “Using Propensity Score Subclassification for Multiple Treatment Doses to Evaluate a National Antidrug Media Campaign,” Journal of Educational and Behavioral Statistics, 30, 59–73. DOI: 10.3102/10769986030001059.
  • Zhang, K., and Hyvärinen, A. (2009), “On the Identifiability of the Post-Nonlinear Causal Model,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, pp. 647–655.