Causal bounds for outcome-dependent sampling in observational studies

Outcome-dependent sampling designs are common in many different scientific fields including ecology, economics, and medicine. As with all observational studies, such designs often suffer from unmeasured confounding, which generally precludes the nonparametric identification of causal effects. Nonparametric bounds can provide a way to narrow the range of possible values for a nonidentifiable causal effect without making additional assumptions. The nonparametric bounds literature has almost exclusively focused on settings with random sampling and applications of the linear programming approach. We derive novel bounds for the causal risk difference in six settings with outcome-dependent sampling and unmeasured confounding. Our derivations of the bounds illustrate two general approaches that can be applied in other settings where the bounding problem cannot be directly stated as a system of linear constraints. We illustrate our derived bounds in a real data example involving the effect of vitamin D concentration on mortality.


Introduction
The aim of empirical research is often to estimate the causal effect of a particular exposure on a particular outcome. In observational (i.e. not randomized) studies, this is typically complicated by the fact that there are confounders (i.e. common causes) of the exposure and the outcome. Often, these confounders are at least partly unmeasured, in which case the causal effect of interest is generally nonidentifiable as any observable association could be due to the uncontrolled common causes. One way to narrow the possible range of a nonidentifiable causal effect is to derive bounds, i.e. a range of values that are guaranteed to contain the true causal effect given the true distribution of the observed data. [e.g. Robins, 1989, Balke and Pearl, 1997, Zhang and Rubin, 2003, Cai et al., 2008, Sjölander, 2009. Generally, the width of these bounds depends on the amount of information provided by the study design; the more information the tighter the bounds. In particular, Balke and Pearl [1997] showed that the bounds for the causal risk difference may be substantially improved upon if an instrumental variable (IV) is available.
In their work, Balke and Pearl [1994] defined an algorithm for deriving valid and tight bounds when the causal effect of interest and the constraints implied by the causal model (in the sense of Pearl [2009]) can be stated as a linear optimization problem. Much of the subsequent work on bounds has used this algorithm.
The literature on bounds for causal effects has focused on simple random sampling.
However, in many studies the probability of selection into the study depends on the outcome of interest. Such outcome-dependent sampling often has advantages over simple random sampling, such as increased statistical power when the outcome is rare, but it also complicates the analysis and interpretation of results [Didelez et al., 2010]. The case-control design, and its variations, is perhaps the most common outcome-dependent design. Ideally, in this design, the sampling only depends on the outcome, and not on other variables in the study; we refer to this and all similar studies as uncounfounded outcome-dependent sampling. However, this ideal is difficult to achieve in reality.
More often, selection will be confounded with the outcome and exposure. A test-negative design is an example of a confounded outcome-dependent sampling design. In a test-negative study, subjects are selected from the group of patients seeking health care for a particular set of symptoms that are indicative of the disease of interest. Subjects are included in the study if they are tested for disease status based on their symptoms, and the exposure of interest is then retrospectively ascertained. This design, and variations of it, have been discussed extensively in the literature [Sullivan et al., 2014, Sullivan and Cowling, 2015, Sullivan et al., 2016]. An important difference between the case-control design and the testnegative design is that, in the latter, potentially unobserved characteristics, such as the subject's lifestyle or health-seeking behavior, may influence whether the subject is tested and therefore selected into the study. These characteristics may also be confounders for the exposure (e.g. vaccination status) and the outcome of the study; we refer to this and all similar studies as confounded outcome-dependent sampling.
In addition to being confounded with the outcome and exposure, sampling may directly depend on the exposure. We refer to this as confounded exposure-and outcome-dependent sampling. Although this is a setting that one hopes to avoid when sampling is outcomedependent, it may arise in test-negative studies when receiving the exposure reduces one's threshold for seeking medical treatment. This may also be the case in studies where the exposure is itself a medical condition and having such a condition leads to increased medical monitoring and therefore an increased probability of being included in the study independent of the outcome.
For each of these three sampling scenarios, we consider the bounds when there is and is not an available IV. In the test-negative design, which is a confounded sampling, outcome-, and possibly exposure-dependent setting, a possible IV is randomization of particular health care centers to provide the exposure, for example a vaccine. Individuals at those health care centers would then be free to decide if they wished to use the vaccine, regardless of their randomized encouragement. For the unconfounded outcome-dependent sampling designs, a possible IV may be a genetic allele, as in so-called Mendelian randomization studies [Bowden and Vansteelandt, 2011].
In this paper we derive bounds for the causal risk difference under six versions of outcomedependent sampling, which cover if not all, most of the plausible outcome-dependent sampling settings. As conditioning on sampling selection, as well as the lack of confounding of the selection variable, do not result in a clearly linear problem, we cannot use the method of Balke and Pearl [1994] without modification; see Balke [1995] for more detail. Therefore, in addition to providing novel bounds in six relevant scenarios, we also provide an approach to derive valid and informative bounds in some nonlinear scenarios and for modifying some seemingly nonlinear settings so that they are linear programming problems, which to our knowledge have not previously been described in the bounds literature. In addition, we numerically compare the bounds derived in each setting to each other and random sampling via simulations and in a real data example. This describes, in terms possibly easier to convey to practitioners, the information loss associated with outcome-dependent sample selection.
The paper is structured as follows. In Section 2 we define the causal target parameter, provide basic definitions, assumptions and pertinent previously derived bounds. In Section 3 we present the bounds for the target parameter, and in Section 4 we carry out a simulation study to assess the performance of the bounds. In Section 5 we provide a real data example, before providing a summary of our results and a discussion of future work in Section 6.

Notation
Let X and Y be the exposure and outcome of interest, and Y (x) be the potential (or counterfactual) outcome for a given subject, if the exposure were set to level x [Rubin, 1974, Pearl, 2009. Let Z be a valid and available IV for X and Y ; we refer readers to Glymour et al. [2012] for details on testing for validity of a candidate IV. We assume X, Y and Z are all fully observed and binary. Let S be an indicator of being selected into the study; S = 1 for 'selected' and S = 0 for 'not selected'. Let U represent the set of unobserved confounders for X and Y , and sometimes S; there are no restrictions on the distribution of U . Thus, the observed data distribution is given by p{Z, X, Y |S = 1}, when an IV is available, and by p{X, Y |S = 1} when an IV is not available; p{·} denotes the probability mass function.
Because U is unmeasured, no counterfactual probabilities of the type p{Y (x) = y} are identifiable in any of our settings of interest. Our target parameter is the most common estimand used in nonparametric causal bounds, the causal risk difference θ = p{Y (x = 1) = 1} − p{Y (x = 0) = 1}.
For convenience of notation, we define several probability abbreviations. Let

Settings
The causal diagrams (as defined by Pearl [2009]

Available information
To provide tight and valid bounds on θ when the sampling is confounded, we find it is necessary to have knowledge of quantities that are not estimable from the observed data distributions, p{Z, X, Y |S = 1} or p{X, Y |S = 1}, alone. In particular, we assume that the sampling prevalence p{S = 1} is known. In finite populations, p{S = 1} is simply the sample size divided with the population size. For example, in test-negative designs, p{S = 1} could be considered the proportion of subjects who were tested for the disease status, among the total number of eligible subjects that were initially included into the study. When one considers a super-population, p{S = 1} could be thought of as the true data generating mechanism of sampling, which remains constant regardless of the infinite population, e.g.
flipping a coin for each new case or control that is encountered.
When there is an available IV (Figures 1d, 1f and 1h) we additionally assume that the IV prevalence p{Z = 1} is known. When the IV is the randomization of health care centers to provide the vaccine, as may be the case in a test-negative study, p{Z = 1}, is determined by design. When the IV is a genetic allele, p{Z = 1} is often known from previous research.
We use the observed data distribution together with p{S = 1} and p{Z = 1} to calculate When the sampling is outcome-dependent but unconfounded, X is independent of S given Y (Figures 1c and 1d). Therefore, if p{Y } were known, one could recover the unconditional distributions p{X, Y } and p{Z, X, Y } from the observed conditional data distributions. Thus, previously developed bounds for θ under random sampling could be applied.
However, knowing the outcome prevalence is not sufficient to recover the unconditional distributions p{X, Y } and p{Z, X, Y } when the outcome-dependent sampling is confounded . Therefore, to allow for equitable comparison of the derived bounds, we consider the same external information in all scenarios, i.e. p{S = 1}, without an IV and p{S = 1} and p{Z = 1} with an IV. Robins [1989] derived bounds for θ under random sampling. Without an available IV, the bounds by Robins [1989] are given by

Previous bounds
(1) Robins [1989] also derived bounds in the setting when an IV is available. However, Balke and Pearl [1997] showed that those bounds are valid but not tight, using the linear programming method of bounds derivation developed in Balke [1995] and presented in Balke and Pearl [1994]. Briefly, this method uses the fact that if all of the observed data probabilities can be written as linear combinations of underlying counterfactual probabilities, then maximizing and minimizing the causal risk difference, θ, under the linear constraints defined by these relations is a linear programming problem. XXXXXX fix this Using fundamental concepts in the field of linear programming, one can prove that the global extrema obtained from the linear programming formulation yield bounds that are tight [Dantzig, 1963]. "Tight" here means that all values inside the bounds are logically compatible with the observed data distribution, under the assumptions (e.g. causal diagram) that were used to derive the bounds. The bounds derived by [Balke and Pearl, 1997] with an available IV are well known and have been cited numerous times in the causal inference literature [e.g. Greenland, 2000, Frangakis and Rubin, 2002, Hernán and Robins, 2006; we reproduce these bounds in Equations (1) and (2)
It can be shown (see the supplementary material) that the absent arrow from U to S in Figure 1c implies the constraint, A(y, 0) = A(y, 1) for all y, and that the absent arrow from U to S in Figure 1d implies the constraint B(y, z, 0) = B(y, z, 1) for all (y, z), provided that sampling was not deterministic, meaning that among the cases or controls we samples randomly with probability between 0 and 1, i.e. 0 < p{S = 1|Y = y} < 1, ∀y. Although non-deterministic sampling is required for constraints (6) and (7) to hold, the bounds derived from these constraints are valid provided that the ratios A(y, 1) and B(y, z, 1) are not undefined, i.e. provided that the probabilities p 0y.1 in the denominators of A(y, 1) and p 1y1.z in the denominators of B(y, z, 1) are not zero; detail provided below and in the supplementary material.
To derive bounds for θ in the setting of Figure 1c that take the constraint in (6) into account we proceed as follows. We partition the term p 01 + p 10 in (1) to p 01 + p 10 = (p 01.1 + p 10.1 )r + (p 01.0 + p 10.0 )(1 − r).
We can then combine (9) and (10) with (1) to produce the bounds for θ. However, these bounds are only valid and informative if A(y, 1) is defined; the ratio A(y, 1) is undefined if p 0y.1 = 0. A solution to this problem is to reverse the coding for the exposure, i.e. to define a new exposure as x * = 1 − x for which A(y, 1) are inverted. Replacing x everywhere in (2) and (3) with x * we can obtain the lower bound, l * , and upper bound, u * , for θ * = p{Y (x * = 1) = 1} − p{Y (x * = 0) = 1}. Since θ * = −θ, these translate into bounds for θ as This simple solution works for the bounds if the inverse ratios, A −1 (y, 1), are not undefined. However, this may not be the case. It is possible that p 1y.1 = p 0y.1 = 0 for some y, so that both A(y, 1) and A −1 (y, 1), are undefined. For such scenarios, we suggest using the bounds in (11), for confounded sampling.
We now turn to Figure 1d. Arguing as we did for Figure 1c (see the supplementary material for details) from the random sampling IV bounds of Balke and Pearl [1997] and combining with (7) we arrive at valid bounds for θ in the setting of Figure 1d. These bounds are given in (4) and (5). Just as above, when B(y, z, 1) is undefined for any y, z pair, we can instead bound −θ by defining exposure as x * = 1 − x, provided that B −1 (y, z, 1) is not undefined. When B −1 (y, z, 1) and B(y, z, 1) are both undefined for a given y and all z, then we suggest using the bounds given in (12) and (13), for confounded sampling in addition to any remaining defined terms from (4) and (5). Result 3: The bounds for θ given in (11) are valid and tight in the setting of Figure 1e, and the bounds given in (12) and (13) are valid and tight in the settings of Figure 1f, provided that p xy1 ∀x, y (for Figure 1e) or that p xy1.z ∀x, y, z (for Figure 1f) are known.

Confounded outcome-dependent sampling
For Figure 1f the resulting bounds are, and θ ≤ min To prove Result 3, we show (see the supplementary material), that given the probabilities p xy1 (for Figure 1e) or p xy1.z (for Figure 1f), the problem of bounding θ can be expressed as a linear programming problem. Solving the linear programming problems yields the bounds in equations (11), and (12) and (13).

Confounded exposure-and outcome-dependent sampling
Confounded exposure-and outcome-dependent sampling is illustrated in Figures 1g and 1h, without and with an available IV, respectively.

Result 4:
The bounds for θ given in (11), and in (14) and (15) are valid and tight in the settings of Figure 1g and Figure 1h, respectively, provided that p xy1 ∀x, y (for Figure 1g) or p xy1.z ∀x, y, z (for Figure 1h) are known.
and θ ≤ min Just as above, it can be shown that, given the probabilities p xy1 (for Figure 1g) or p xy1.z (for Figure 1h) are known, the problem of bounding θ can be expressed as a linear programming problem. Solving the linear programming problem yields the bounds in (14) and (15). For the setting of Figure 1g the resulting bounds are the same as for Figure 1h given in (11).

Comparison and refinement of the bounds
One would expect that the bounds in (2) and (3) (corresponding to Figure 1c) are at least as wide as the bounds in (1), and that the bounds in (11) (corresponding to Figure 1e) are at least as wide as the bounds in (2) and (3). Using the relations in (8), (9) and (10) An important implication of the existence of such a scenario is that the bounds in (4) and (5) are not tight. This is because, if they were, they would never be wider than any other bounds for θ in the setting of Figure 1d based on the same information. In particular, they would never be wider than the bounds in (12) and (13), since we assume the same available information and the causal diagram in Figure 1d is a special case of the causal diagram in Figure 1f, which implies that the bounds in (12) and (13) are also valid for Figure 1d.
This argument, however, suggests a simple way to improve the bounds in (4) and (5), namely to replace them with the bounds in (12) and (13) whenever these are tighter. Formally, let l d and u d be the lower and upper bounds in (4) and (5), and let l f and u f be the lower and upper bounds in (12) and (13). We thus define new bounds for θ under the causal diagram in Figure 1d as We use these bounds for Figure 1f in the remainder of the paper.

Simulation
To compare the derived bounds we carried out a simulation study. We generated probability distributions p{U, Z, X, Y, S} under the causal diagram in Figure 1h from the model where expit(x) = e x /(1 + e x ) and where e is Euler's number. In this model, σ U and σ X determine the degree of exposure dependence and confounding of the sampling; e.g. if σ U = 0, then the sampling is unconfounded, and if σ X = 0, then the sampling does not depend on the exposure. We first generated 100,000 distributions p{U, Z, X, Y, S} from the model in (17), with σ U = σ X = 0. For each distribution we computed our proposed bounds corresponding to the assumed causal diagrams in Figures 1c-1h. For comparison we also computed the previously derived bounds for random sampling, without [Robins, 1989] and with an IV [Balke and Pearl, 1997]. We emphasize that these bounds use the distributions p{X, Y } and p{Z, X, Y }, respectively, which do not condition on S = 1. Thus, these bounds are not applicable when there is non-random (e.g. outcome-dependent) sampling. We next repeated the simulation for σ U = 0, 1, 2..., 10, holding σ X fixed at 0 (top row of the true causal risk difference is outside the bounds; right column of Figure 3). We observe that the median width is virtually constant across σ U and σ X for all eight bounds. When σ U > 0, the bounds under Figures 1c and 1d are not generally valid, since these bounds assume unconfounded sampling. Thus, when σ U > 0 we would expect that these bounds are sometimes violated. Similarly, when σ X > 0, the bounds under Figures 1c, 1d and 1f are not generally valid, since these bounds assume sampling that is not directly dependent on the exposure. Thus, when σ X > 0 we would expect that these bounds are sometimes violated. We observe that, although violations do indeed occur when σ U > 0, they are very rare; at σ U = 10 the bounds under Figures 1c and 1d are violated for less than 1% of all simulated distributions. When σ X > 0, violations appear to be relatively common for the bounds under Figures 1d and 1f; at σ X = 10 these bounds are violated for almost 10% of all simulated distributions. However, the bounds under Figure 1c appear to be more robust, with a the risk of violation less than 1% at σ X = 10.

Real Data Example
To illustrate the performance of the bounds we use data from a real cohort study on Vitamin D and mortality, described by Martinussen et al. [2019]. To allow for public availability, the data were slightly mutilated before inclusion in the R package ivtools, whence we obtained the data. The exposure (X) is vitamin D level at baseline, measured as serum 25-OH-D (nmol/L). As vitamin D levels below 30 nmol/L indicate vitamin D deficiency [Martinussen et al., 2019], we used this level as a cutoff for defining a binary exposure. The outcome (Y ) is death during follow-up. The data also contain an IV, a binary indicator of whether the subject has mutations in the filaggrin gene. Martinussen et al. [2019] used this IV to estimate the causal effect of vitamin D on mortality, assuming a parametric structural Cox model. In contrast, the bounds that we have derived do not make any parametric model assumptions.
The data constitute a random sample, which makes it possible to compute the bounds corresponding to Figures 1a and 1b; these are given by (-0.74, 0.26) and (-0.71, 0.15), respectively. Thus, for these data, the presence of an IV reduces somewhat the range of possible values for θ.
To illustrate the impact of outcome-dependent sampling, we generated a selection variable (S) randomly for each subject from a Bernoulli distribution with probability p{S = 1|Y = y} = expit(α + βy). Sampling is therefore dependent on the outcome, but is unconfounded and does not depend on the exposure. We set β to 0.5, to simulate sampling that is moderately outcome-dependent. We used three values of α, corresponding to the marginal selection probabilities p{S = 1} = (0.1, 0.5, 0.9).
For each level of sampling, we computed the bounds under the assumed DAGs in Figures 1c-1h; Table 1 shows the results. As in the simulation, we observe that an available IV generally makes the bounds narrower, and that stronger deviation from the ideal randomized trial with random sampling generally makes the bounds wider. As expected, we also observe that all bounds become narrower as the probability of selection increases.

Discussion
We provide valid and informative nonparametric bounds for the causal risk difference under six DAGs that cover a large number of settings for which, at least to our knowledge, the bounds had not yet been derived. This provides a nearly complete set of bounds for outcomedependent sampling settings.
Although the Balke-Pearl method is the basis for our bounds, modifications and/or extensions to the method described in Balke [1995] were required in all settings, as none of the settings clearly define linear programming problems. In addition, our procedure requires modification of the algorithm to allow for the sampling variable S = 1 to define the observed data distribution, which was not considered in Balke [1995], or, to our knowledge, in subsequent literature.
As the unconfounded outcome-dependent sampling setting is not a linear optimization problem under our assumed external information, the derivation for the bounds in this setting is novel. Our approach also provides a possible route for the derivation of bounds in other settings where the constraints are nonlinear due the lack of unmeasured confounding or lack of complete observation, or both, for one or more variables.
In outcome-dependent sampling settings, the most common estimand is the odds ratio. This is because when sampling is uncounfounded (Figure 1c and 1d), the observed odds ratio is collapsible over S and is therefore equal to the population (i.e. marginal over S) odds ratio [Didelez et al., 2010]. However, in settings with unmeasured confounding of X and Y , the observed odds ratio does not have a causal interpretation. In addition, when there is confounding of the sampling, the observed odds ratio is no longer collapsible over S and thus, not equal to the population odds ratio. Therefore, the odds ratio is not a better target than any other causal estimand for our purposes, but is, in our opinion, less interpretable than other estimands.
A setting we did not consider is uncounfounded exposure-and outcome-dependent sampling. Although this is a possible scenario, we don't see this as a likely scenario because a practitioner would have to intentionally randomly sample conditional on both the outcome and exposure. We also do not discuss scenarios where the exposure is randomized, but the sampling is outcome-dependent and/or confounded with outcome. Although this is an interesting set of problems, this is beyond the scope of this paper and an area of future research for the authors.