Poorly Measured Confounders are More Useful on the Left than on the Right

Researchers frequently test identifying assumptions in regression-based research designs (which include instrumental variables or difference-in-differences models) by adding additional control variables on the right hand side of the regression. If such additions do not affect the coefficient of interest (much), a study is presumed to be reliable. We caution that such invariance may result from the fact that the observed variables used in such robustness checks are often poor measures of the potential underlying confounders. In this case, a more powerful test of the identifying assumption is to put the candidate variable on the left hand side of the regression. We provide derivations for the estimators and test statistics involved, as well as power calculations, which can help applied researchers interpret their findings. We illustrate these results in the context of various strategies which have been suggested to identify the returns to schooling.


Introduction
The identification of causal effects depends on explicit or implicit assumptions which typically form the core of a debate about the quality and credibility of a particular research design. In regression based strategies, this is the claim that variation in the regressor of interest is as good as random after conditioning on a sufficient set of control variables. In instrumental variables models it involves the assumption that the instrument is as good as randomly assigned. In panel or differences-in-differences designs it is the parallel trends assumption, possibly after suitable conditioning. The credibility of a design can be enhanced when researchers can show explicitly that potentially remaining sources of selection bias have been eliminated. This is often done through some form of balancing tests or robustness checks.
The research designs mentioned above can all be thought of as variants of regression strategies. If the researcher has access to a variable for a potentially remaining confounder, tests for the identifying assumption take two canonical forms. The variable can be added as a control on the right hand side of the regression. The identifying assumption is confirmed if the estimated causal effect of interest is insensitive to this variable addition; we call this the coefficient comparison test. Alternatively, the variable can be placed on the left hand side of the regression in place of the outcome variable. A zero coefficient on the causal variable of interest then confirms the identifying assumption. This is the balancing test which is typically carried out using baseline characteristics or pre-treatment outcomes in a randomized trial or in a regression discontinuity design.
Researchers often rely on one or the other of these tests. The main point of our paper is to show that the balancing test, using the proxy for the candidate confounder on the left hand side of the regression, is generally more powerful. This is particularly the case when the available variable is a noisy measure of the true underlying confounder. The attenuation due to measurement error often implies that adding the candidate variable on the right hand side as a regressor does little to eliminate any omitted variables bias. The same measurement error does comparatively less damage when putting this variable on the left hand side. Regression strategies work well in finding small but relevant amounts of variation in noisy dependent variables.
These two testing strategies are intimately related through the omitted variables bias formula. The omitted variables bias formula shows that the coefficient comparison test involves two regression parameters, the coefficient from the balancing test and the coefficient from the added regressor in the outcome equation. If the researcher has a strong prior that the added regressor ought to matter for the outcome under study, then the balancing test will provide the remaining information necessary to assess the research design.
This maintained assumption is the ultimate source of the superior power of the balancing test. However, we show that quantitatively meaningful differences emerge particularly when there is a substantial amount of measurement error in the added regressor. We derive the relevant parameters in the presence of measurement error in Section 3.
Of course, sometimes researchers may be more agnostic about whether the added regressor matters for the outcome. In case it does not matter, rejecting balance for this variable is of no consequence for this particular research design.
In this view, only the coefficient comparison test is really relevant while the balancing test provides no additional information. However, this strikes us as a narrow view and not one shared by many in the experimental community, where balancing tests are commonly used. Lack of balance is seen as an indictment of the randomization in an experiment irrespective of whether the variable in question affects the outcome. Lack of balance with respect to one or more observed covariates raises the possibility that there may also be lack of balance for other unobservables, and would lead a prudent researcher to reassess the credibility of their research design. The same should be true for quasi-experimental research based on observational data.
A second point we are making is that the two strategies, coefficient comparison and balancing, both lead to explicit statistical tests. The balancing test is a simple t-test used routinely by researchers. When adding a covariate on the right hand side, comparing the coefficient of interest across the two regressions can be done using a generalized Hausman test. In practice, we have not seen this test carried out in applied papers, where researchers typically just eyeball the results. We provide the relevant test statistics and discuss how they behave under measurement error in Section 4. We also show how the coefficient comparison test is simple to implement for varying identification strategies. We demonstrate the superior power of the balancing test under a variety of scenarios in Section 5.
The principles underlying the points we are making are not new but the consequences do not seem to be fully appreciated in much applied work. Griliches (1977) is a classic reference for the issues arising when regression controls are measured with error. A subsequent literature, for example Rosenbaum and Rubin (1983) and Imbens (2003), has considered omitted variables bias in non-linear models without measurement error. More closely related is Battistin and Chesher (2014), as it discusses identification in the presence of a mismeasured covariate in non-linear models. Like in the literature following Rosenbaum and Rubin (1983), they discuss identification given assumptions about a missing parameter, namely the degree of measurement error in the covariate. We follow Griliches (1977) in framing our discussion around the omitted variables bias arising in linear regressions, the general framework used most widely in empirical studies. Unlike this literature, we are less interested in point identification in the presence of missing information. We go beyond the analysis in all of these papers in our explicit discussion of testing, which forms the core of our study. Altonji, Elder and Taber (2005) discuss an alternative but closely related approach to the problem. As we noted above, applied researchers often argue that relative stability of regression coefficients when adding additional controls provides evidence for credible identification. Implicit in this argument is the idea that other confounders not controlled for are similar to the controls just added to the regression. The paper by Altonji, Elder and Taber (2005) formalizes this argument. In practice, adding controls will typically move the coefficient of interest somewhat even if it is not by much. Altonji et al. (2013) and Oster (forthcoming) extend the original Altonji, Elder and Taber work by providing more precise conditions for bounds and point identification in this case. 
The approach in these papers relies on an assumption about how the omitted variables bias due to the observed regressor is related to any remaining omitted variables bias due to unobserved confounders.
The remaining unobserved confounders in this previous work can be thought of as the source of measurement error in the covariate which is added to the regression in our analysis. For example, in our empirical example below, we use mother's education as a measure for family background but this variable may only capture a small part of all the relevant family background information, a lot of which may be orthogonal to mother's education. In fact, we show that our formulation and Oster's (forthcoming) are isomorphic. This means that our framework is a useful starting point for researchers who are willing to make the type of assumptions in Altonji, Elder and Taber (2005) and follow-up papers as well.
Another related strand of work is by Belloni, Chernozhukov and Hansen (2014a, b), who tackle the opposite problem from Altonji, Elder and Taber (2005), namely choosing the best controls when the researcher has a potentially bigger set of candidate controls available than is necessary. This large dimensional set may come from nonlinearities and interactions among regressors. Belloni, Chernozhukov and Hansen (2014b) use Lasso to select regressors which are highly correlated with either the treatment or the outcome conditional on other covariates. They then estimate an outcome equation including as controls all the regressors selected in this preliminary step. In a sense, this is more closely related to our setup than the Altonji, Elder and Taber approach as Belloni, Chernozhukov and Hansen (2014b) also postulate that identification can be achieved when using a subset of the available covariates as controls.
Their variable selection problem is related to the two testing strategies we discuss in this paper. However, like Altonji et al. (2013) and Oster (forthcoming), their ultimate interest is in point identification and inference for the treatment effects parameter, not in testing whether a particular specification is subject to remaining confounders. Their setup is also not specifically geared towards dealing with control variables which are subject to error, which is our focus.
An older literature by Hausman (1978), Hausman and Taylor (1980), and Holly (1982) (see also the summary in MacKinnon, 1992, section II.9) considers the relative power of the Hausman test compared to alternatives, in particular an F-test for the added covariates in the outcome equation when potentially multiple covariates are added. This comparison effectively maintains that there is a lack of balance, and instead tests whether the added regressors matter for explaining the outcome. While this is a different exercise from ours, this literature highlights the potential power of the Hausman test when it succinctly transforms a test with multiple restrictions (like the F-test for the added covariates) into a test with a single restriction (the coefficient comparison test). We discuss how to extend our framework to multiple added controls in Section 5.3. Our basic findings largely carry over to this setting, but we also reach the conclusion that the Hausman test has a role to play when the goal is to summarize a large number of restrictions.
Griliches (1977) uses estimates of the returns to schooling as an example for the methodological points he makes. Such estimates have formed a staple of labor economics ever since. We use Griliches' data from the National Longitudinal Survey of Young Men to illustrate our power results in Section 6. In addition to Griliches (1977), this data set has been used in a well-known study by Card (1995). It is well suited for our purposes because the data contain various test score measures which can be used as controls in a regression strategy (as investigated by Griliches, 1977), a candidate instrument for college attendance (investigated by Card, 1995), as well as a myriad of other useful variables on individual and family background. The empirical results support and illustrate our theoretical claims.

A Simple Framework
Consider the following simple framework, starting with a population regression equation

y_i = α_s + β_s s_i + e_i^s,   (1)

where y_i is an outcome like log wages, s_i is the causal variable of interest, like years of schooling, and e_i^s is the regression residual. The researcher proposes this short regression model to be causal. This might be the case because the data come from a randomized experiment, so the simple bivariate regression is all we need. More likely, the researcher has a particular research design applied to observational data. For example, in the case of a regression strategy controlling for confounders, y_i and s_i would be residuals from regressions of the original outcome and treatment variables on the chosen controls. In the case of panel data or differences-in-differences designs the controls are sets of fixed effects. In the case of instrumental variables, s_i would be the predicted value from a first stage regression. In practice, (1) encompasses a wide variety of empirical approaches, and should be thought of as a short-hand for these.
Now consider the possibility that the population regression parameter β_s from (1) may not actually capture a causal effect. There may be a candidate confounder x_i, so that the causal effect of s_i on y_i would only be obtained conditional on x_i, as in the long regression

y_i = α + β s_i + γ x_i + e_i,   (2)

and the researcher would like to probe whether this is a concern. For example, in the returns to schooling context, x_i might be some remaining part of an individual's earnings capacity which is also related to schooling, like ability or family background.
Researchers who find themselves in a situation where they start with a proposed causal model (1) and a measure for a candidate confounder x_i typically do one of two things: They either regress x_i on s_i and check whether s_i is significant, or they include x_i on the right hand side of the original regression as in (2), and check whether the estimate of β changes materially when x_i is added to the regression of interest. The first strategy constitutes a test for "balance," a standard check for successful randomization in an experiment. In principle, the second strategy has the advantage that it goes beyond testing whether (1) qualifies as a causal regression. An appreciable change in β suggests that the original estimate β_s is biased. The results obtained with x_i as an additional control should be closer to the causal effect we seek to uncover. In particular, if x_i were the only relevant confounder and if we measure it without error, the β parameter from the controlled regression is the causal effect of interest. In practice, there is usually little reason to believe that these two conditions are met, and hence a difference between β and β_s again only indicates a flawed research design.
The relationship between these two strategies is easy to see. Write the regression of x_i on s_i, which we will call the balancing regression, as

x_i = π + δ s_i + u_i.   (3)

The change in the coefficient β from adding x_i to the regression (1) is given by the omitted variables bias formula

β_s − β = γ δ.   (4)

The change in the coefficient of interest from adding x_i thus consists of two components: the coefficient γ on x_i in the outcome equation (2) and the coefficient δ from the balancing regression.
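The omitted variables bias formula holds exactly as an algebraic identity of least squares, in sample as well as in the population. A minimal numerical sketch can confirm it; the data-generating process and parameter values below are our own illustrations, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative DGP: schooling s, confounder x related to s, outcome y.
s = rng.normal(0, 2, n)
x = 0.5 * s + rng.normal(0, 1, n)            # balancing regression: delta = 0.5
y = 1.0 * s + 0.8 * x + rng.normal(0, 3, n)  # long regression: beta = 1.0, gamma = 0.8

def ols(y, regs):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + list(regs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_s = ols(y, [s])[1]            # short regression slope
beta, gamma = ols(y, [s, x])[1:]   # long regression slopes
delta = ols(x, [s])[1]             # balancing regression slope

# Omitted variables bias formula: beta_s - beta = gamma * delta
print(beta_s - beta, gamma * delta)  # both approximately 0.8 * 0.5 = 0.4
```

The two printed quantities agree to machine precision, since (4) holds exactly for the fitted coefficients.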
Here we consider the relationship between these two approaches: the balancing test of the null hypothesis

H_0: δ = 0   (5)

compared to the inspection of the coefficient movement β_s − β. The latter strategy of comparing β_s and β is often done informally, but it can be formalized as a statistical test of the null hypothesis

H_0: β_s − β = 0,   (6)

which we will call the coefficient comparison (CC) test. From (4) it is clear that (6) amounts to

H_0: γ δ = 0.   (7)

This highlights that the two approaches formally test the same hypothesis under the maintained assumption γ ≠ 0. We may often have a strong sense that γ ≠ 0; i.e. we are dealing with a variable x_i which we believe affects the outcome, but we are unsure whether it is related to the regressor of interest s_i. In this case, both tests would seem equally suitable.[4] Nevertheless, in other cases γ may be zero, or we may be unsure. In this case, the coefficient comparison test seems to dominate because it directly addresses the question we are after, namely whether the coefficient of interest β is affected by the inclusion of x_i in the regression.[5]
Here we make the point that the balancing test adds valuable information particularly when the true confounder is measured with error. In general, x_i may not be easy to measure. If the available measure for x_i contains classical measurement error, the estimator of γ in (2) will be attenuated, and the comparison β_s − β will be too small (in absolute value) as a result. The estimator of δ from the balancing regression is still consistent in the presence of measurement error; this regression simply loses precision because the mismeasured variable is on the left hand side. Under the maintained assumption that 0 < γ < ∞, the balancing test is more powerful than the coefficient comparison test.
In order to make these statements precise, we collect results for the relevant population parameters for the case of classical measurement error in the following section, before moving on to the test statistics.

[4] One might argue that researchers should only carry out the long regression and not the short regression if they know that γ ≠ 0: if δ ≠ 0, not including x in the regression will lead to omitted variables bias; if δ = 0, both β̂_s and β̂ are consistent but β̂_s is less efficient than β̂. As we emphasized in the Introduction, however, the focus of this paper is on testing whether the treatment is plausibly randomly assigned in a (quasi-)experimental design. In the analysis of a randomized controlled trial, for example, researchers may include covariates when estimating the treatment effect, but that does not come before a formal test of covariate balance.
[5] Equations (4) and (7) highlight that a regressor ought to be included in the long regression when both γ ≠ 0 and δ ≠ 0. This differs from the selection rule chosen by Belloni, Chernozhukov and Hansen (2014b), who include a regressor when either γ ≠ 0 or δ ≠ 0 is true.

3 Population Parameters in the Presence of Measurement Error
The candidate variable x_i is not observed. Instead, the researcher works with the mismeasured variable

x_i^m = x_i + m_i.   (8)

Here we assume the measurement error m_i is classical, i.e. E(m_i) = 0, Cov(x_i, m_i) = 0. In Section 5 below we also investigate the impact of nonclassical errors. As a result of the measurement error, the researcher compares the regressions

y_i = α_s + β_s s_i + e_i^s
y_i = α_m + β_m s_i + γ_m x_i^m + e_i^m.   (9)

Notice that the short regression does not involve the mismeasured x_i^m, so that β_s = β + γδ as before. However, the population regression coefficients β_m and γ_m are now different from β and γ from equation (2), and they are related in the following way:

β_m = β + [(1 − λ)/(1 − R²)] γδ
γ_m = [(λ − R²)/(1 − R²)] γ,   (10)

where R² is the population R² of the regression of s_i on x_i^m (equivalently, of x_i^m on s_i) and λ = σ_x²/(σ_x² + σ_m²). λ measures the amount of measurement error present as the fraction of the variance in the observed x_i^m which is due to the signal in the true x_i. It is also the attenuation factor in a simple bivariate regression on x_i^m. In the multivariate model (9), an alternative way to parameterize the amount of measurement error is

θ = σ_m² / (σ_u² + σ_m²),

where σ²_(·) denotes the variance of the random variable in the subscript, and 1 − θ is the multivariate attenuation factor. Recall that u_i is the residual from the balancing regression (3).
With the mismeasured x_i^m, the balancing regression becomes

x_i^m = π + δ s_i + (u_i + m_i),   (11)

which implies that the population slope is unchanged, δ_m = δ, while the residual variance rises to σ_u² + σ_m². As a result, θ is an alternative way to parameterize the degree of measurement error in x_i compared to λ and R². The θ parameterization uses only the variation in x_i^m which is orthogonal to s_i. This is the part of the variation in x_i^m relevant to the estimate of γ_m in regression (9), which also has s_i as a regressor. θ turns out to be a useful parameter in many of the derivations that follow.
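A small simulation sketch makes the attenuation results concrete: γ_m is attenuated by the factor 1 − θ, δ_m is unaffected, and β_m moves only part of the way from β_s to β. All parameter values below are our own illustrations, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Illustrative parameter values (ours, not from the paper):
delta, gamma, beta = 0.5, 0.8, 1.0
s = rng.normal(0, 2, n)
u = rng.normal(0, 1, n)
x = delta * s + u                       # true confounder
m = rng.normal(0, 2, n)                 # classical measurement error
xm = x + m                              # observed proxy

theta = m.var() / (u.var() + m.var())   # sigma_m^2 / (sigma_u^2 + sigma_m^2)
y = beta * s + gamma * x + rng.normal(0, 3, n)

def slopes(y, regs):
    """OLS slope coefficients (intercept included but dropped)."""
    X = np.column_stack([np.ones(len(y))] + list(regs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

beta_m, gamma_m = slopes(y, [s, xm])
delta_m, = slopes(xm, [s])

print(gamma_m, (1 - theta) * gamma)          # gamma_m attenuated by 1 - theta
print(delta_m, delta)                        # balancing coefficient unaffected
print(beta_m, beta + theta * gamma * delta)  # beta_m between beta and beta_s
```

With these values θ is roughly 0.8, so γ_m recovers only about a fifth of γ, while δ_m remains centered on δ.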
The population coefficient β_m differs from β, but less so than β_s. In fact, β_m lies between β_s and β, as can be seen from (10). The parameter γ_m is attenuated compared to γ; the attenuation is bigger than in the case of a bivariate regression of y_i on x_i^m without the regressor s_i if x_i^m and s_i are correlated (R² > 0).
These results highlight a number of issues. The gap β_s − β_m is too small compared to the desired β_s − β, directly affecting the coefficient comparison test. This is a consequence of the fact that γ_m is biased towards zero, which, ceteris paribus, makes the assessment of the hypothesis γ = 0 more difficult. Finally, the balancing regression (11) with the mismeasured x_i^m involves measurement error in the dependent variable, which has no effect on the population parameter δ_m = δ, but the estimator δ̂_m is less efficient than δ̂.
The results here are also useful for thinking about the identification of β and γ in the presence of measurement error. Rearranging (10) yields

β = β_m − [(1 − λ)/(λ − R²)] γ_m δ_m
γ = [(1 − R²)/(λ − R²)] γ_m.   (12)

Since R², γ_m, and δ_m can be estimated from the data, these expressions only involve the unknown parameter λ. If we are willing to make an assumption about the measurement error, we are able to point identify β. Even if λ is not known precisely, (12) can be used to bound β for a range of plausible reliabilities.
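If one is willing to assume a value for the reliability λ, the correction implied by (12) can be computed directly. The sketch below uses our own illustrative DGP; to keep it self-contained, λ is computed from the simulated true x_i, whereas in practice λ would be an assumed value:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Same style of illustrative DGP as before (our values, not the paper's):
delta, gamma, beta = 0.5, 0.8, 1.0
s = rng.normal(0, 2, n)
x = delta * s + rng.normal(0, 1, n)
xm = x + rng.normal(0, 2, n)            # proxy with classical error
y = beta * s + gamma * x + rng.normal(0, 3, n)

def slopes(y, regs):
    X = np.column_stack([np.ones(len(y))] + list(regs))
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

beta_m, gamma_m = slopes(y, [s, xm])
delta_m, = slopes(xm, [s])

r2 = np.corrcoef(s, xm)[0, 1] ** 2      # R^2 of the bivariate xm-on-s regression
lam = x.var() / xm.var()                # reliability; unobservable in practice

# Correction from rearranging the attenuation formulas:
beta_corrected = beta_m - (1 - lam) / (lam - r2) * gamma_m * delta_m
print(beta_corrected)                   # close to the true beta = 1.0
```

Varying the assumed λ over a plausible range then traces out the bounds on β mentioned in the text.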
Alternatively, (10) can be used to derive the value of λ for which β = 0. These calculations are similar in spirit to the ones suggested by Oster (forthcoming) in a setting that is closely related.

Inference
In this section, we consider how conventional standard errors and test statistics for the quantities of interest are affected in the homoskedastic case. We present the theoretical power functions for the two alternative test statistics; derivations are in Appendix A, which also shows that our results carry over to robust standard errors. We extend the power results to the heteroskedastic case and non-classical measurement error in simulations. Our basic conclusions are the same in all these different scenarios.
Start with the standard error of the estimator δ̂_m from the balancing regression:

se(δ̂_m) = σ̂_{u+m} / (σ̂_s √n),

where σ̂_{u+m} is the estimated standard deviation of the residual in (11), σ̂_s is the sample standard deviation of s_i, and we use se(·) to denote the estimated standard error of a given estimator. Let SE(·) denote the asymptotic standard error of an estimator, i.e., SE(·) ≡ (1/√n) plim{√n se(·)}. In the case of δ̂_m,

SE(δ̂_m) = (1/√n) √(σ_u² + σ_m²) / σ_s.

Comparing this to its counterpart in the case with no measurement error,

SE(δ̂_m) = SE(δ̂) / √(1 − θ).

Since 0 < θ < 1, the standard error is inflated compared to the case with no measurement error.
A test based on the t-statistic

t_{δ_m} = δ̂_m / se(δ̂_m)

remains consistent because m_i is correctly accounted for in the residual of the balancing regression (11), but the t-statistic is asymptotically smaller than in the error free case. As n → ∞, the comparison of the scaled t-statistics when δ > 0 (without loss of generality, we assume δ is either zero or positive) is

plim (1/√n) t_{δ_m} = √(1 − θ) plim (1/√n) t_δ < plim (1/√n) t_δ.

This means the null hypothesis (5) is rejected less often. The test is less powerful than in the error free case; the power loss is captured by the term √(1 − θ).
We next turn to γ̂_m, the estimator for the coefficient on the mismeasured x_i^m in (9). The parameter γ is of interest since it determines the coefficient movement β_s − β = γδ in conjunction with the result from the balancing regression. Let x̃_i^m be the residual from the population regression of x_i^m on s_i. For ease of exposition, we impose conditional homoskedasticity of e_i^m given s_i and x_i^m here and leave the more general case to Appendix A.2.3. The standard error for γ̂_m in the limit is

SE(γ̂_m) = √(1 − θ) (1/√n) √(σ_e²/σ_u² + θγ²).

SE(γ̂_m) involves two terms: the first term is an attenuated version of SE(γ̂) from the corresponding regression with the correctly measured x_i, while the second term depends on the value of γ. The parameters in the two terms are not directly related, so SE(γ̂_m) ≷ SE(γ̂). Measurement error does not necessarily inflate the standard error here.
The two terms have a simple, intuitive interpretation. Measurement error attenuates the parameter γ_m towards zero; the attenuation factor is 1 − θ.
The standard error is attenuated in the same direction; this is reflected in the √(1 − θ) factor, which multiplies the remainder of the standard error calculation. The second influence from measurement error comes from the term θγ², which results from the fact that the residual variance Var(e_i^m) is larger when there is measurement error. The increase in the variance is related to the true γ, which enters the residual.
The t-statistic for testing whether γ_m = 0 is

t_{γ_m} = γ̂_m / se(γ̂_m),

and it follows that when γ > 0

plim (1/√n) t_{γ_m} = √(1 − θ) γ σ_u / √(σ_e² + θγ²σ_u²) < plim (1/√n) t_γ.   (14)

As in the case of δ̂_m from the balancing regression, the t-statistic for γ̂_m is smaller than t_γ for the error free case. But in contrast to the balancing test statistic t_{δ_m}, measurement error reduces t_{γ_m} relatively more, namely due to the term θγ² in the denominator, in addition to the attenuation factor √(1 − θ).
This is due to the fact that measurement error in a regressor both attenuates the relevant coefficient towards zero and introduces additional variance into the residual. Interestingly, though, θγ² captures the additional residual variance while the factor √(1 − θ) now captures the attenuation of γ_m; in the balancing test statistic, √(1 − θ) accounted for the residual variance instead. The upshot from this discussion is that classical measurement error makes the assessment of whether γ = 0 more difficult compared to the assessment of whether δ = 0.
As we will see, this is the source of the greater power of the balancing test statistic.
Finally, consider the quantity β_s − β_m, which enters the coefficient comparison test. To form a test statistic for this quantity we need the expression for the asymptotic variance of β̂_s − β̂_m, which we derive through an application of the delta method to the omitted variables bias formula

β_s − β_m = γ_m δ_m.

Specifically, we can relate Var(β̂_s − β̂_m) to the asymptotic variances of δ̂_m and γ̂_m and their asymptotic covariance:

Var(β̂_s − β̂_m) = γ_m² Var(δ̂_m) + δ_m² Var(γ̂_m) + 2 γ_m δ_m Cov(δ̂_m, γ̂_m).

Using Var(δ̂_m) and Var(γ̂_m), which we derived above, and the fact that Cov(δ̂_m, γ̂_m) = 0 under homoskedasticity, it is easy to see that, like Var(γ̂_m), Var(β̂_s − β̂_m) has both an attenuation factor as well as an additional positive term compared to the case where θ = 0, i.e. Var(β̂_s − β̂). Measurement error may therefore raise or lower the sampling variance for the coefficient comparison test.
Before we proceed to discuss the power of the coefficient comparison test, we note that the covariance term in

Var(β̂_s − β̂_m) = Var(β̂_s) + Var(β̂_m) − 2 Cov(β̂_s, β̂_m)

reduces the sampling variance of β̂_s − β̂_m. In fact, this covariance term is positive, and it is generally sizable compared to Var(β̂_s) and Var(β̂_m) since the regression residuals e_i^s and e_i^m are highly correlated. Because 2 Cov(β̂_s, β̂_m) gets subtracted, looking at the standard errors of β̂_s and β̂_m alone can potentially mislead the researcher into concluding that the two coefficients are not significantly different from each other when in fact they are.
The coefficient comparison test itself can be formulated as a t-test as well, since we are interested in the movement of a single parameter. Define

t_cc = (β̂_s − β̂_m) / se(β̂_s − β̂_m),

where se(β̂_s − β̂_m) is a consistent standard error estimator. Since β_s − β_m = γ_m δ_m, the delta method variance above (with the covariance term equal to zero) implies that asymptotically the coefficient comparison statistic combines the two simple t-statistics:

1/t_cc² = 1/t_{δ_m}² + 1/t_{γ_m}².   (15)

Under the alternative hypothesis (δ ≠ 0) and the maintained assumption γ ≠ 0, the limits for the other two test statistics can be written as

plim (1/√n) t_{δ_m} = √(1 − θ) δ σ_s / σ_u
plim (1/√n) t_{γ_m} = √(1 − θ) γ σ_u / √(σ_e² + θγ²σ_u²).

Hence, using (14), it is apparent that under these conditions the three tests are asymptotically related in the following way:

plim (1/√n) t_cc < plim (1/√n) t_{δ_m} and plim (1/√n) t_cc < plim (1/√n) t_{γ_m} < plim (1/√n) t_γ.

These results highlight a number of things. First of all, under the maintained hypothesis γ ≠ 0, the balancing test alone is more powerful. This is not surprising at all, since the balancing test only involves estimating the parameter δ while the coefficient comparison test involves estimating both δ and γ. Imposing γ ≠ 0 in the coefficient comparison test is akin to t_{γ_m} → ∞, and this would restore the equivalence of the balancing and coefficient comparison tests. Note that the power advantage from imposing γ ≠ 0 exists regardless of the presence of measurement error.
The second insight is that measurement error affects the coefficient comparison test in two ways. The test statistic is subject to both the attenuation factor √(1 − θ) and the term θδ²γ² in the variance, which is inherited from the t-statistic for γ_m. Importantly, however, all these terms interact in the coefficient comparison test. In our numerical exercises below, it turns out that the way in which measurement error attenuates γ_m compared to γ is a major source of the power disadvantage of the coefficient comparison test. Our simulations demonstrate that the differences in power between the coefficient comparison and balancing tests can be substantial when there is considerable measurement error in x_i^m. Before we turn to these results, we briefly note how the coefficient comparison test can be implemented in practice.

Implementing the Coefficient Comparison Test
The balancing test is a straightforward t-test, which regression software calculates routinely. We noted that the coefficient comparison test is a generalized Hausman test. As far as we can tell, the Stata suest or reg3 commands do not work for the type of IV regressions we might be interested in here. An alternative, which also works for IV, is to take the regressions (1) and (2) and stack them:

y_i = α_s + β_s s_i + e_i^s   (first copy of the data)
y_i = α + β s_i + γ x_i^m + e_i   (second copy of the data).

Testing β_s − β = 0 is akin to a Chow test across the two specifications (1) and (2). Of course, the data here are not two subsamples but rather duplicates of the original data set. To take account of this, and to allow for the correlation in the residuals across duplicates, it is crucial to cluster standard errors on the observation identifier i.
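The stacking approach can be carried out in any regression package. The sketch below (with an illustrative data-generating process of our own) implements it in plain numpy, computing the cluster-robust sandwich covariance by hand:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Illustrative DGP with delta = 0.3, so balance fails and the CC test
# should reject (values are ours, not the paper's).
s = rng.normal(0, 2, n)
x = 0.3 * s + rng.normal(0, 1, n)
xm = x + rng.normal(0, 1, n)                 # mismeasured confounder
y = 1.0 * s + 0.8 * x + rng.normal(0, 3, n)

# Stack two copies of the data: copy 0 is the short regression, copy 1
# adds the proxy; all parameters are interacted with the copy dummy d.
d = np.repeat([0.0, 1.0], n)
S, Y, XM = np.tile(s, 2), np.tile(y, 2), np.tile(xm, 2)
X = np.column_stack([np.ones(2 * n), d, S, d * S, d * XM])
cluster = np.tile(np.arange(n), 2)           # each observation appears twice

b = np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - X @ b

# Cluster-robust (by observation identifier) sandwich variance:
XtX_inv = np.linalg.inv(X.T @ X)
k = X.shape[1]
meat = np.zeros((k, k))
for g in range(n):
    idx = cluster == g
    Xg, eg = X[idx], e[idx]
    meat += Xg.T @ np.outer(eg, eg) @ Xg
V = XtX_inv @ meat @ XtX_inv

# The coefficient on d*s equals beta_m - beta_s; its t-statistic is the CC test.
t_cc = b[3] / np.sqrt(V[3, 3])
print(t_cc)
```

The coefficient on the interaction d·s_i recovers β̂_m − β̂_s exactly, and clustering on the observation identifier accounts for the duplicated rows, mirroring the recipe in the text.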

Asymptotic and Monte Carlo Results with Classical Measurement Error
The ability of a test to reject when the null hypothesis is false is described by the power function of the test. The power functions here are functions of d, the values the parameter δ might take on under the alternative hypothesis. Because the joint distribution of the coefficient and standard error estimators is difficult to characterize, especially in the case of the coefficient comparison test, we abstract away from the sampling variation in estimating the standard errors in the theoretical derivations of this section. The resulting t-statistic for the null hypothesis that the coefficient δ is zero in the balancing test is

t_δm = δ̂^m / √((σ²_u + σ²_m)/(n σ²_s)).

Similarly, we use

t_cc = (β̂_s − β̂_m) / sd(β̂_s − β̂_m),

with the standard deviation treated as known, in the derivation of the power function for the coefficient comparison test. As shown in Appendix A, the power function for a 5% critical value of the balancing test is

P_δm(d) = Φ(−1.96 + d√n σ_s/√(σ²_u + σ²_m)) + Φ(−1.96 − d√n σ_s/√(σ²_u + σ²_m)),   (16)

where Φ(·) is the standard normal cumulative distribution function. 9 The power function for the coefficient comparison test is

P_cc(d) = Φ(−1.96 + t̄_cc(d)) + Φ(−1.96 − t̄_cc(d)),

where t̄_cc(d) is the mean of t_cc under the alternative δ = d. Note that the power function for the balancing test does not involve the parameter γ. Using our results above, for 0 < γ < ∞ it can be written as

t̄_cc(d) = t̄_δm(d) t̄_γm / √(t̄_δm(d)² + t̄_γm²),

where t̄_δm(d) and t̄_γm are the corresponding means of the t_δm and t_γm statistics. In practice, this result may or may not be important. In addition, when the standard error is estimated, the powers of the two tests may differ from the theoretical results above. Therefore, we carry out a number of Monte Carlo simulations to assess the performance of the two tests. Table 1 lists the baseline parameter values for these simulations.

Before going on to simulations of more complicated cases, we contrast the theoretical power functions in Figure 1, based on asymptotic approximations, to simulated rejection rates of the same tests in Monte Carlo samples. Figure 2 shows the power functions for the two tests without measurement error (θ = 0) and with measurement error (θ = 0.85), as well as their simulated counterparts. 11 Without measurement error, the theoretical power functions are closely aligned with the empirical rejection rates (black lines). Adding measurement error, this is also true for the balancing test (solid red and blue/thick lines) but not for the coefficient comparison test (broken red and blue/thick lines): the simulated rejection rates fall below the theoretical power function under the alternative, and the test has too little power. This highlights another advantage of the balancing test: it is a standard t-test, for which no such problem arises. We note that this is a small sample problem, which goes away when we increase the sample size (in unreported simulations). We suspect that the problem is related to the way in which the coefficient comparison test effectively combines the simple t_δm and t_γm test statistics in a non-linear fashion, as can be seen in equation (15), and to the fact that t_γm is sometimes close to 0 in small samples despite the fact that we fix γ substantially above 0.

9 The power function for the balancing test in equation (16) is written using the normal distribution, but we actually calculate it using the t-distribution with n − 2 degrees of freedom. This is consistent with how Stata version 14 performs the balancing test following the command reg x s or reg x s, r, even though this distribution choice makes little difference given our sample size (n = 100).

10 While we highlight the consequences of measurement error throughout the paper, we should note that formally any particular value of θ can be mimicked by an appropriate combination of values for γ and σ²_u. This is an immediate consequence of the fact that the classical measurement error model is underidentified by one parameter. In that sense, "measurement error" is simply a label for a certain set of parameter values. It is always difficult to choose empirically relevant values for simulations, and we take comfort from the fact that the results emerging from this section are also reflected in the empirical example in Section 6.
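The benchmark comparison can be reproduced in a small simulation. The sketch below is not the paper's code: β, γ, and σ_s are illustrative choices (only n = 100, σ²_u = 3, σ²_e = 30, and θ follow the text), and the comparison test uses a delta-method standard error for β̂_s − β̂_m = γ̂_m δ̂_m with the covariance term dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, y):
    """OLS coefficients with classical (homoskedastic) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return b, np.sqrt(np.diag(sigma2 * XtX_inv))

def rejection_rates(d, theta=0.0, n=100, reps=2000, beta=1.0, gamma=1.0):
    """Simulated 5% rejection rates of the balancing test and the
    coefficient comparison test at alternative value delta = d."""
    sig_u2, sig_e2, sig_s = 3.0, 30.0, 1.0   # sigma_s is an assumed value
    sig_m2 = theta / (1.0 - theta) * sig_u2  # theta = sig_m^2/(sig_u^2 + sig_m^2)
    crit, rej_bal, rej_cc = 1.96, 0, 0
    for _ in range(reps):
        s = rng.normal(0.0, sig_s, n)
        x = d * s + rng.normal(0.0, np.sqrt(sig_u2), n)
        y = beta * s + gamma * x + rng.normal(0.0, np.sqrt(sig_e2), n)
        xm = x + rng.normal(0.0, np.sqrt(sig_m2), n)  # classical measurement error
        c = np.ones(n)
        # balancing test: t-test on s in the regression of x^m on s
        b_bal, se_bal = ols(np.column_stack([c, s]), xm)
        rej_bal += abs(b_bal[1] / se_bal[1]) > crit
        # coefficient comparison test: beta_s - beta_m = gamma_m * delta_m,
        # with a delta-method variance (covariance term dropped)
        b_s, _ = ols(np.column_stack([c, s]), y)
        b_m, se_m = ols(np.column_stack([c, s, xm]), y)
        var_cc = (b_m[2] * se_bal[1]) ** 2 + (b_bal[1] * se_m[2]) ** 2
        rej_cc += abs((b_s[1] - b_m[1]) / np.sqrt(var_cc)) > crit
    return rej_bal / reps, rej_cc / reps

# with severe measurement error the balancing test retains far more power
power_bal, power_cc = rejection_rates(d=1.0, theta=0.85)
```

With θ = 0.85 the attenuation of γ̂_m drags down t_γm and, through it, the comparison test, while the balancing test only pays through a larger residual variance.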

Monte Carlo Results beyond the Benchmark Model
The homoskedastic case with classical measurement error might be highly stylized and may not correspond well to the situations typically encountered in empirical practice. We therefore explore some other scenarios using simulations in this section. Figure 3 shows the original theoretical power functions for the case with no measurement error from Figure 1. It adds empirical rejection rates from simulations in which the errors u_i and e_i are heteroskedastic, with variances that depend on s_i. We set the baseline variances σ²_0u and σ²_0e so that the unconditional variances σ²_u = 3 and σ²_e = 30 match the variances in Figure 1. The test statistics used in the simulations employ robust standard errors. We plot the rejection rates for data with no measurement error and for the more severe measurement error scenario given by θ = 0.85. As can be seen in Figure 3, both the balancing and the coefficient comparison tests lose some power with heteroskedastic residuals and a robust covariance matrix compared to the conventional, homoskedastic baseline (black/thin lines). Otherwise, the main findings look very similar to those in Figure 1. Heteroskedasticity does not seem to alter the basic conclusions appreciably.
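The exact variance functions are not reproduced above, so the sketch below assumes, purely for illustration, error variances that scale with 1 + s²_i while matching the unconditional variances σ²_u = 3 and σ²_e = 30; the HC1 robust covariance matrix is standard.

```python
import numpy as np

rng = np.random.default_rng(1)

def ols_robust(X, y):
    """OLS with HC1 heteroskedasticity-robust standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u = y - X @ b
    n, k = X.shape
    meat = (X * u[:, None] ** 2).T @ X          # sum of x_i x_i' * u_i^2
    V = n / (n - k) * XtX_inv @ meat @ XtX_inv  # HC1 small-sample scaling
    return b, np.sqrt(np.diag(V))

# assumed heteroskedasticity: variances proportional to (1 + s_i^2),
# rescaled so the unconditional variances stay at 3 and 30
n, sig_s2 = 100, 1.0
s = rng.normal(0.0, np.sqrt(sig_s2), n)
scale = (1 + s ** 2) / (1 + sig_s2)             # E[scale] = 1
u = rng.normal(0.0, 1.0, n) * np.sqrt(3.0 * scale)
e = rng.normal(0.0, 1.0, n) * np.sqrt(30.0 * scale)
d, beta, gamma = 0.5, 1.0, 1.0                  # illustrative coefficients
x = d * s + u
y = beta * s + gamma * x + e

# balancing test with a robust standard error
b, se = ols_robust(np.column_stack([np.ones(n), s]), x)
t_balance = b[1] / se[1]
```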
We generate measurement error as

x^m_i = x_i + κ x_i + μ_i,

where κ is a parameter and Cov(x_i, μ_i) = 0, so that κx_i captures the part of the error related to x_i and μ_i the unrelated part. When −1 < κ < 0, the error is mean reverting, i.e. the κx_i part of the error reduces the variance of x^m_i compared to x_i.
The case of mean reverting measurement error captures a variety of ideas, including the one that we may observe only part of a particular confounder made up of multiple components. Imagine we would like to include in our regression a variable x_i = w_1i + w_2i, where w_1i and w_2i are two orthogonal variables. We observe x^m_i = w_1i. For example, x_i may be family background, w_1i is mother's education and other parts of family background correlated with it, and w_2i are all relevant parts of family background which are uncorrelated with mother's education. As long as selection bias due to w_1i and w_2i is the same, this amounts to the mean reverting measurement error formulation above. Note that λ = Var(x_i)/Var(x^m_i) > 1 in this case, so the mismeasured x^m_i has a lower variance than the true x_i. This scenario is also isomorphic to the model studied by Oster (forthcoming). See Appendix B for details.
Notice that x^m_i can now be written as

x^m_i = (1 + κ)(δ_0 + δ s_i + u_i) + μ_i,

so this parameterization directly affects the coefficient in the balancing regression, which becomes (1 + κ)δ and is hence smaller than δ for a negative κ. As a result, the balancing test will reject less often. At the same time, a negative κ offsets and possibly reverses the attenuation bias on γ. This should bring the power functions of the balancing and coefficient comparison tests closer together.
For the simulations we set κ = −0.5, so the error is mean reverting. We also fix σ²_μ in the simulations. However, it is important to note that the nature of the measurement error will change as we change the value of d under the alternative hypothesis: x_i depends on δ, and the correlated part of the measurement error depends in turn on x_i. We show results for two cases, σ²_μ = 0.75 and σ²_μ = 2.25. Under the null, these two parameter values correspond to λ = 2 and λ = 1, respectively. The case λ = 2 corresponds to the Oster (forthcoming) model just described with Var(w_1i) = Var(w_2i).
These models exhibit relatively large amounts of mean reversion. Figure 4 demonstrates that the balancing test again dominates. The gap is small for the σ²_μ = 0.75 case but grows with σ²_μ, the classical portion of the measurement error. This finding is not surprising, as mean reverting measurement error does less damage in terms of biasing the estimate of γ.
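The mean reverting parameterization can be checked directly in a large simulated sample: with x^m_i = (1 + κ)x_i + μ_i, the slope in the balancing regression of x^m_i on s_i converges to (1 + κ)δ. In the sketch below the values of d, σ_s and the sample size are illustrative; κ = −0.5, σ²_u = 3 and σ²_μ = 0.75 follow the text.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, kappa, sig2_mu = 200_000, 0.3, -0.5, 0.75
s = rng.normal(0.0, 1.0, n)
x = d * s + rng.normal(0.0, np.sqrt(3.0), n)               # sigma_u^2 = 3
xm = (1 + kappa) * x + rng.normal(0.0, np.sqrt(sig2_mu), n)  # mean reverting error

# OLS slope of the balancing regression of x^m on s
delta_m = np.cov(xm, s)[0, 1] / np.var(s)
# in large samples delta_m is close to (1 + kappa) * d = 0.15
```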
A particular case of mean reverting measurement error is the one where x_i is a dummy variable, so we provide some simulation results for this case. The balancing equation is then a binary choice model, and hence inherently non-linear. While we assume that the researcher continues to estimate (3) as a linear probability model, we generate x_i from a latent index,

x_i = 1[δ_0 + δ s_i + u_i > 0],

so that Pr(x_i = 1 | s_i) = Φ((δ_0 + δ s_i)/σ_u), where Φ(·) is the normal distribution function as before. Measurement error takes the form of misclassification, and we assume the misclassification rate to be symmetric:

Pr(x^m_i = 1 − x_i | x_i) = τ.

Compared to the baseline parameters in Table 1, we set σ²_s = 0.25 and τ = 0.1 in our simulations. The model remains the same in all other respects. We use robust standard errors in estimating (9) and (11).
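A large-sample sketch of this DGP, assuming (as an illustration) δ_0 = 0 and a standard normal latent error; σ²_s = 0.25 and τ = 0.1 follow the text. With symmetric misclassification, E[x^m_i | x_i] = τ + (1 − 2τ)x_i, so the slope of the linear balancing regression is attenuated by exactly 1 − 2τ:

```python
import numpy as np

rng = np.random.default_rng(3)

n, d, tau = 200_000, 1.0, 0.1
s = rng.normal(0.0, np.sqrt(0.25), n)                  # sigma_s^2 = 0.25
x = (d * s + rng.normal(0.0, 1.0, n) > 0).astype(float)  # latent-index binary x_i
xm = np.where(rng.random(n) < tau, 1 - x, x)           # flip with probability tau

def lpm_slope(dep, reg):
    """OLS slope from a one-regressor linear probability model."""
    return np.cov(dep, reg)[0, 1] / np.var(reg)

attenuation = lpm_slope(xm, s) / lpm_slope(x, s)       # close to 1 - 2*tau = 0.8
```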
Various issues arise from the nonlinear nature of (20). One is the fact that plim δ̂ from estimating (11) linearly is not going to equal the δ we used in the probit equation (20) to generate x_i. The relationship between plim δ̂ and δ is concave. In Figure 5, we plot rejection rates against values of δ, although the quantity plim δ̂ is probably more comparable to what we put on the x-axis in the previous figures that summarize the simulation results from linear models. We note that results look qualitatively very similar when we plot rejection rates against the empirical averages of δ̂ from estimating (11) as a linear probability model.
Another issue is that measurement error in x_i will now lead to a biased estimate of δ in estimating (11). This is true even if we were to use a probit and estimate a model like (20). The bias takes the form of attenuation, just as in the case of a binary regressor with measurement error (see Hausman, Abrevaya and Scott-Morton, 1998). This is the corollary of our result that mean reverting measurement error also reduces the power of the balancing test. Of course, we know from the relationship (15) between the test statistics that the coefficient comparison test will suffer from the same power loss.
The black/thin lines in Figure 5 reveal a sizable power advantage for the balancing test even without any misclassification. This result is in stark contrast to the linear models we have analyzed, where a large power loss for the coefficient comparison test only resulted once we introduced measurement error. In fact, it is possible to think of the binary nature of x_i itself as a form of mismeasurement. Equation (20) defines Pr(x_i = 1) through a latent index, but the outcome regression (2) uses a coarse version of this variable in the form of the binary x_i.
In our parameterization, the coefficient comparison test never reaches a rejection rate of 1, and its power function levels off at a far lower level. As d increases, the power of the balancing test goes to 1. In the linear model, the rejection rate of t_γ is independent of d. Because of the nonlinear nature of (20) this is no longer true here, and the average value of t_γ across repeated samples actually falls for higher values of d. Drawing on (15), the power of the coefficient comparison test will equal the power of t_γ when t_δ → ∞. This is not a specific feature of the binary case but is generally true for the relationship between the three test statistics. In the binary case, however, it implies that the power of the coefficient comparison test may decline with d. 12 Adding measurement error to the binary regressor x_i makes things worse, as is visible from the red/thick lines in Figure 5. The power loss of the balancing test is comparatively minor for the relatively low misclassification rate of τ = 0.1 we are using. Much of the loss for the balancing test results from the binary nature of the x_i variable in the first place. The coefficient comparison test is affected by misclassification error to a much higher degree because t_γ is affected, the Hausman, Abrevaya and Scott-Morton (1998) result notwithstanding.

Multiple Controls
So far we have concentrated on the case of a single added regressor x_i. Often in empirical practice we may want to add a set of additional covariates at once.

12 The reason for the decline of t_γ with d in our parameterization is as follows: the standard error of γ̂ depends on the residual variance of the long regression, which is independent of d, and on the variance of the residual from regressing x_i on s_i (because s_i is partialled out in the long regression). When d = 0, this latter residual is just equal to x_i itself, which is binary. But s_i is continuous, so as d increases, partialling out s_i transforms the binary x_i into a continuous variable, which has less variance than in the d = 0 case. As the effective variance of this regressor falls, the standard error of γ̂ goes up and t_γ goes down.
It is straightforward to extend our framework to that setting. In this section, we describe this multivariate extension, and provide some simulation results.
Some interesting new issues arise in this analysis.
Suppose there are k added regressors, i.e. x_i is a k × 1 vector, and

y_i = β s_i + γ′x_i + e_i,
x_i = δ_0 + δ s_i + u_i,

where γ, δ_0, δ and u_i are k × 1 vector analogs of their scalar counterparts in Section 2. Lee and Lemieux (2010) suggest a balancing test for multiple covariates in the context of evaluating regression discontinuity designs. Let x^(j) denote the n × 1 vector of all the observations on the j-th x-variable. We can stack all the x-variables on the left-hand-side of the regression to obtain the system

x^(j) = δ_0j ι + δ_j s + u^(j),   j = 1, ..., k,

where ι is an n × 1 vector of ones, s = [s_1, s_2, ..., s_n]′, and u^(j) is the vector of residuals corresponding to covariate x^(j). We can then perform an F-test for the joint significance of the δ coefficients. This left-hand-side (LHS) balancing test is similar to the way we implemented the coefficient comparison test above in Section 4.1.
The drawback of the LHS test is that stacking equations is non-standard and requires some extra programming. It therefore seems appealing to consider the alternative of regressing s on the covariates,

s_i = π_0 + π′x_i + v_i,

and testing whether the coefficient vector π is significantly different from zero. While putting the balancing variables on the RHS might at first glance seem unusual, it turns out that the LHS and RHS tests deliver very similar results. In the case of a single covariate x_i (i.e. k = 1) the LHS and the RHS tests using a conventional covariance matrix for homoskedastic residuals are numerically identical. 13 This is no longer true with multiple covariates (k > 1). However, the scaled F-statistics of the two tests have the same probability limit in the special case where the LHS regression has a spherical error structure Var(u_i) = σ²I_k and the RHS regression is homoskedastic, as we show in Appendix C.
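Both tests can be sketched with a few lines of linear algebra. The code below is a minimal illustration, not the paper's implementation: it uses classical (homoskedastic) covariance matrices, three covariates that all load on s_i, and illustrative parameter values.

```python
import numpy as np

rng = np.random.default_rng(4)

n, k, d = 100, 3, 0.8
s = rng.normal(0.0, 1.0, n)
# k covariates, each unbalanced with respect to s in this illustration
X = d * s[:, None] + rng.normal(0.0, np.sqrt(3.0), (n, k))

def wald_F(Z, y, R):
    """OLS of y on Z and an F-statistic for R @ b = 0 (classical covariance)."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    b = ZtZ_inv @ Z.T @ y
    u = y - Z @ b
    sigma2 = u @ u / (len(y) - Z.shape[1])
    V = sigma2 * ZtZ_inv
    W = (R @ b) @ np.linalg.solve(R @ V @ R.T, R @ b)
    return W / R.shape[0]

# LHS test: stack the k balancing regressions x^(j) = delta_0j + delta_j s
Z = np.kron(np.eye(k), np.column_stack([np.ones(n), s]))
y_stack = X.T.reshape(-1)                              # [x^(1)', ..., x^(k)']'
R_lhs = np.kron(np.eye(k), np.array([[0.0, 1.0]]))     # picks out the k slopes
F_lhs = wald_F(Z, y_stack, R_lhs)

# RHS test: regress s on the covariates and test pi = 0 jointly
Z2 = np.column_stack([np.ones(n), X])
R_rhs = np.hstack([np.zeros((k, 1)), np.eye(k)])
F_rhs = wald_F(Z2, s, R_rhs)
```

With a spherical error structure across the stacked equations, the two F-statistics are close, which is the asymptotic equivalence described in Appendix C.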
How do the balancing tests with multiple covariates perform in practice? Figure 6 shows simulations using a similar design as described in Table 1. Without measurement error, the coefficient comparison test has an advantage in combining the k separate hypotheses into one single test statistic, which is generated from the estimates of only two parameters, the long and short β's. The balancing tests, on the other hand, have to rely on the estimation of k parameters. 14 In this case, the rejection rates for the coefficient comparison test (black/thin broken lines) therefore lie above the ones for both balancing tests (black/thin solid and dash-dot lines), as can be seen in the left-hand panels. In the presence of measurement error, however, the balancing tests are again more powerful than the coefficient comparison test, as can be seen from the juxtaposition of the thicker red lines.

13 The F-test in this case amounts to the overall F-test for the significance of the regression. This, in turn, is a function of the R² of the regression. Since only two variables x_i and s_i are involved, this is the square of the correlation coefficient between the two. But the correlation coefficient is not directional, so the forward and reverse regressions have to deliver the same F-statistic. (When there are other covariates present in the regression, replace the R² and the correlation coefficient with their partial equivalents in this argument.)
This power advantage of the balancing tests is greater when only one covariate is unbalanced. Both tests are less powerful in this case, but the power loss for the coefficient comparison test is now much more pronounced. This is particularly noticeable in the case with measurement error in the covariates (red/thick lines) but the balancing tests outperform the coefficient comparison test even without measurement error in this case. Empirically relevant cases may often lie in between these extremes. Researchers may be faced with a set of potential controls to investigate, some of which may be unbalanced with the treatment while others are not. Figure 6 demonstrates that the balancing tests will frequently be the most powerful tools in such a situation, but the coefficient comparison test also has a role to play in the multivariate case.
The simulations reveal a number of further insights. With measurement error, the small sample issue of the coefficient comparison test, which we highlighted in Figure 2, arises again. On top of this, we found in unreported simulations that both the LHS and RHS balancing tests with robust standard errors (clustered standard errors across equations for the LHS test and heteroskedasticity-robust standard errors for the RHS test) have a size distortion under the null hypothesis and reject too often. This is the standard small sample distortion of these covariance matrices discussed in the literature (MacKinnon and White, 1985; Chesher and Jewitt, 1987; Angrist and Pischke, 2009, chapter 8). We find that the bias tends to get worse when more covariates are added. Applied researchers may be most interested in the testing strategies discussed here when k is large (so that a series of single variable balancing tests is unattractive), and will want to rely on a robust covariance matrix. An upward size distortion may be less of an issue for a conservative researcher in a balancing test (where it means the researcher will falsely decide not to go ahead with a research design where the covariates are actually balanced) than in a test for the presence of non-zero treatment effects (where the same bias leads to false discoveries). Nevertheless, we suspect that most applied researchers would prefer a test with a correct size under the null and a steep power function, so research on remedies for the bias problem in multivariate tests is particularly important. 15 While we find few differences between the power of the LHS and RHS tests in our simulations, we know from the theoretical analysis in Appendix C that the test statistics will differ asymptotically when the third and fourth moments of the underlying data deviate from the normally distributed case.
It is therefore interesting to probe how the two tests perform in an example with real data.
We therefore pooled data from the 2010–2014 American Community Surveys (ACS). Our data set consists of white and African American individuals aged 21 to 64 with non-missing annual earnings. This data set has 5,644,865 observations. We generated a binary treatment s_i. Figure 7 shows the results for the two balancing tests. The rejection rates are virtually indistinguishable. We find no evidence that the performance of the two tests differs in this setting. 16

15 We find in unreported simulations that the classic small sample corrections HC2 and HC3 by MacKinnon and White (1985) still have size distortions under the null. There is currently an active literature on how to better deal with this small sample bias of the robust or clustered covariance estimator. For example, Young (2016) suggests an adjustment of the degrees of freedom of hypothesis tests, but this adjustment is only implemented for one coefficient at a time, so it does not work for testing multiple linear restrictions at once. Cattaneo, Jansson and Newey (2017) present an adjustment of the entire covariance matrix but only consider the case of heteroskedasticity and do not allow for clustering. As a result, neither of these can currently be applied to our LHS balancing test. Another alternative is to rely on a series of single coefficient tests and adjust the resulting test statistics for multiple testing. Akin to the size distortion of robust test statistics, without adjustment such multiple testing will reject too often under the null, as first noted by Bonferroni (1935). There is a sizable literature in statistics and theoretical econometrics on this topic, with modern approaches based either on the influential work by Westfall and Young (1993) or by Benjamini and Hochberg (1995). Examples of empirical applications in economics are Kling, Liebman and Katz (2007), Anderson (2008), and Duflo, Dupas and Kremer (2017). But these examples remain rare, and no clear choices among the multitude of theoretical alternatives have yet emerged among applied researchers.

16 This does not mean that the LHS and RHS test statistics are identical in any given sample. Particularly under the null, we sometimes find sizable disparities in p-values.
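One of the alternatives mentioned above, adjusting a series of single coefficient balancing tests for multiple testing, can be illustrated with a classical step-down rule. The sketch below uses Holm's sequential refinement of the Bonferroni bound rather than the resampling-based approaches cited in the text:

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Holm step-down adjustment: test the smallest p-value against
    alpha/m, the next against alpha/(m-1), and so on, stopping at the
    first failure. Controls the familywise error rate like Bonferroni,
    but with more power."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if pvals[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject

# p-values from, say, k = 3 single coefficient balancing tests
flags = holm_reject([0.001, 0.2, 0.04])   # only the first test rejects
```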
The upshot is that it is in principle straightforward to extend the balancing test to multiple covariates. An interesting finding is that a RHS test offers a computationally simple alternative that closely mimics the performance of the more standard LHS balancing test. Yet, at this point implementation issues related to the small sample bias of robust covariance estimators also hamper our ability to confidently carry out balancing tests for multiple covariates.
Moreover, sometimes we are interested in the robustness of the original results when the number of added regressors is very large. An example would be a differences-in-differences analysis in a state-year panel, where the researcher is interested in checking whether the results are robust to the inclusion of state specific trends. The balancing test does not seem to be the right framework to deal with this situation. The coefficient comparison test has an important role to play in this scenario.

Empirical Analysis
We illustrate the theoretical results in the context of estimating the returns to schooling, using data from the National Longitudinal Survey of Young Men (NLS). This is a panel study of about 5,000 male respondents interviewed from 1966 to 1981. The data set has featured in many prominent analyses of the returns to education, including Griliches (1977) and Card (1995). We start in Table 2 by presenting simple OLS regressions controlling for experience, race, and past and present residence. The estimated return to schooling is 0.075. This estimate may not reflect the causal effect of education on income because important confounders, such as ability or family background, are not controlled for.
In columns (2) to (5) we include variables which might proxy for the respondent's family background. In column (2) we include mother's education, in column (3) whether the household had a library card when the respondent was 14, and in column (4) we add body height measured in inches. Each of these variables is correlated with earnings, and the coefficient on education moves moderately when these controls are included. Mother's education captures an important component of a respondent's family background. The library card measure has been used by researchers to proxy for important parental attitudes (e.g. Farber and Gibbons, 1996). Body height is a variable determined by parents' genes and by nutrition and disease environment during childhood.
It is unlikely to be a particularly powerful control variable, but it is predetermined and correlated with family background, self-esteem, and ability (e.g. Persico, Postlewaite and Silverman, 2004; Case and Paxson, 2008). The return to education falls by 0.1 to 0.2 log points when these controls are added. In column (5) we enter all three variables simultaneously. The coefficients on the controls are somewhat attenuated, and the return to education falls slightly further to 0.071.
It might be tempting to conclude from the relatively small change in the estimated returns to schooling that this estimate should be given a causal interpretation. We provide a variety of evidence that this conclusion is unlikely to be a sound one. Below the estimates in columns (2) to (5), we display the p-values from the coefficient comparison test, comparing each of the estimated returns to education to the one from column (1). Although the coefficient movements are small, the tests all reject at the 5% level. The results in columns (6) to (8), where we regress maternal education, the library card, and body height on education, further magnify the concern.
The education coefficient is positive and strongly significant in all three regressions, with t-values ranging from 4.4 to 13.1, and both the LHS and RHS joint balancing tests reject the hypothesis that all three controls are balanced with a p-value of virtually zero. The magnitudes of the coefficients are substantively important. It is difficult to think of these results as causal effects: the respondent's education should not affect predetermined proxies of family background. Instead, these estimates reflect selection bias. Individuals with more education have significantly better educated mothers, were more likely to grow up in a household with a library card, and experienced more body growth when young. Measurement error leads to attenuation bias when these variables are used on the right-hand side which renders them fairly useless as controls. The measurement error matters less for the estimates in columns (6) to (8), and these are informative about the role of selection. Comparing the p-values at the bottom of the table to the corresponding ones for the coefficient comparison test in columns (2) to (4) demonstrates the superior power of the balancing test.
The following tables have the same general layout. In Table 3 we change the baseline specification by including the respondent's score on the Knowledge of the World of Work test (KWW), a variable used by Griliches (1977) as a proxy for ability. The sample size is reduced because we exclude observations with missing IQ scores, for consistency with a subsequent table. Estimated returns without the KWW score in this restricted sample (unreported) are very similar to those in Table 2. Adding the KWW score reduces the coefficient on education by almost 20%, from 0.075 to 0.061. Adding our additional controls, maternal education, the library card, and body height, to this new specification does very little to the estimated returns to education. The coefficient comparison test indicates that none of the small changes in the returns to education are statistically significant. Controlling for the KWW score has largely knocked out the library card effect but done little to the coefficients on maternal education and body height. The relatively small and insignificant coefficient movements in columns (2) to (5) suggest that the specification controlling for the KWW score might solve the ability bias problem.
Columns (6) to (8), however, show that the three covariates are still mostly unbalanced with respect to education even when the KWW score is in the regression. This raises the possibility that the estimated returns in columns (1) to (5) might remain biased by selection. The estimated coefficients on education for the three controls are on the order of half their values from Table 2, and the body height measure is now only significant at the 10% level. But the relationship between mother's and own education is still sizable, so this measure continues to indicate the possibility of important selection. Balance in library card ownership is rejected despite the fact that a comparison of the coefficients in columns (1) and (3) indicates no role for this variable at all. A joint balancing test with all three controls strongly rejects the hypothesis that they are balanced. The results in this table illustrate the message of our paper in a powerful fashion.
While the KWW score might be a potent control, it is likely also measured with substantial error, for example, due to testing noise. Griliches (1977) proposes to instrument this measure with an independent IQ test score variable, which is also contained in the NLS data, to eliminate at least some of the consequences of this measurement error. In Table 4, we take the specification one step further by instrumenting the KWW score with IQ. The coefficient on the KWW score almost triples, in line with the idea that an individual test score is a very noisy measure of ability. The education coefficient now falls to only about half its previous value from 0.061 to 0.034. This might be due to positive omitted variable bias present in the previous regressions which is eliminated by IQ-instrumented KWW (although there may be other possible explanations for the change as well, like measurement error in schooling).
Both the coefficient comparison tests and the balancing tests (individual and joint) indicate no evidence of selection any more. This is due to a combination of lower point estimates and larger standard errors. We note that the joint LHS and RHS balancing tests produce somewhat different test statistics in this case, although both p-values are well above conventional rejection levels.

The contrast between Tables 3 and 4 highlights the usefulness of the balancing test: it warns about the Table 3 results, while the coefficient comparison test delivers insignificant differences in either case.
Finding an instrumental variable for education is an alternative to control strategies, such as using test scores. In Table 5 we follow Card's (1995) analysis and instrument education using distance to the nearest college, while dropping the KWW score. We use the same sample as in Table 2, which differs from Card's sample. 17 Our IV estimates of the return to education are slightly higher than in Table 2 but lower than in Card (1995) at around 8%. The IV returns estimates are relatively noisy, with a t-statistic of about 2. Columns 1-5 of Table 5 show that the IV estimate on education, while bouncing around a bit, does not change significantly when maternal education, the library card, or body height is included. In particular, if these three controls are included at the same time in column (5), the point estimate is indistinguishable from the unconditional estimate in column (1) both substantively and statistically.
IV regressions with pre-determined variables on the left hand side can be thought of as a test for random assignment of the instruments. In this case the selection regressions in columns (6)-(8) are imprecise, just like the IV returns estimates, and as a result less informative. The coefficients in the regressions for mother's education and body height have the wrong sign but confidence intervals cover anything ranging from zero selection to large positive amounts.
Only the library card measure is large, positive, with a p-value of around 0.06, possibly indicative of some remaining potential for selection even in the IV regressions. However, with p-values of around 0.29, both the LHS and RHS joint balancing tests fail to reject the null hypothesis that all three controls are balanced. In other words, the college distance instrument passes the balancing test, but the data do not speak clearly in this particular case.

Conclusion
Using predetermined characteristics as dependent variables offers a useful specification check for a variety of identification strategies popular in empirical economics. We argue that this is the case even for variables which might be poorly measured and are of little value as control variables. Such variables should be available in many data sets, and we encourage researchers to perform such balancing tests more frequently. We show that this is generally a more powerful strategy than adding the same variables on the right hand side of the regression as controls and looking for movement in the coefficient of interest.

17 We instrument experience and experience squared by age and age squared, and we restrict Card's sample to non-missing values in maternal education, the library card, and body height.
We have illustrated our theoretical results with an application to the returns to education. Taking our assessment from this exercise at face value, a reader might conclude that the results in Table 4, returns around 3.5%, can safely be regarded as causal estimates. Of course, this is not the conclusion reached in the literature, where much higher IV estimates like those in Table 5 are generally preferred (see e.g. Card, 2001 or Angrist and Pischke, 2015, chapter 6). This serves as a reminder that the discussion here is focused on sharpening one particular tool in the kit of applied economists. Successfully passing the balancing test should be a necessary condition for a successful research design, but it is not sufficient.
The balancing test and other statistics we discuss here are useful for gauging selection bias due to observed confounders, even when they are potentially measured poorly. They do not address any other issues which may also haunt a successful empirical investigation of causal effects. One possible issue is measurement error in the variable of interest, which is also exacerbated as more potent controls are added. Griliches (1977) shows that a modest amount of measurement error in schooling may be responsible for the patterns of returns we have displayed in Tables 2 to 4. Another issue, also discussed by Griliches, is that controls like test scores might themselves be influenced by schooling, which would make them bad controls. For all these reasons, other approaches like IV estimates of the returns may be preferable.

[Figure note: Comparison of baseline rejection rates (from Figure 1) with simulated rejection rates based on mean reverting measurement error and robust standard errors.]

A.1 The Balancing Test

The desired balancing regression is

x_i = δ_0 + δ s_i + u_i.

Effectively, we run the balancing regression

x^m_i = δ_0 + δ s_i + (u_i + m_i).

As mentioned in Section 5.1, in the theoretical derivation of the power functions we abstract away from the sampling variation in estimating the standard errors by treating σ_u, σ_m and σ_s as known constants. In this case, the asymptotic variance of δ̂^m can be directly calculated, and the resulting test statistic for the null hypothesis that the balancing coefficient δ is zero is

t_δm = δ̂^m / √((σ²_u + σ²_m)/(n σ²_s)).

53
Appendix (For Online Publication Only) The rejection probability when δ = d and when using critical value C is when n is large. This is the power function of the balancing test

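Under the approximation above, this power function is easy to evaluate numerically. The following sketch implements it using only the standard normal CDF; the function name, the default two-sided critical value $C = 1.96$, and any parameter values are our own illustrative choices, not part of the paper.

```python
import math

def balancing_power(d, n, sigma_s, sigma_u, sigma_m, C=1.96):
    """Power of the two-sided balancing test of delta = 0, evaluated at
    delta = d, treating sigma_u, sigma_m, sigma_s as known constants."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    # Noncentrality of the t statistic at delta = d
    shift = d * math.sqrt(n) * sigma_s / math.sqrt(sigma_u**2 + sigma_m**2)
    return Phi(-C + shift) + Phi(-C - shift)
```

At $d = 0$ this returns the size of the test (about 0.05 for $C = 1.96$), and the formula makes visible that power falls as the measurement error variance $\sigma_m^2$ grows, while it rises with $n$ and $\sigma_s$.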
A.2 The Coefficient Comparison Test
The short and long regressions are
$$y_i = \alpha_s + \beta_s s_i + e_i^s$$
and
$$y_i = \alpha + \beta s_i + \gamma x_i + e_i.$$
Adding measurement error in $x_i$, so that we observe $x_i^m = x_i + m_i$, we have
$$y_i = \alpha_m + \beta_m s_i + \gamma_m x_i^m + e_i^m.$$
Treat $s_i$, $u_i$, $e_i$, and $m_i$ as the underlying random variables which determine $x_i$, $y_i$, $e_i^s$ and $e_i^m$. We normalize $s_i$ to a mean-zero variable. For the derivations in the remainder of this section, we make the following assumptions:

Assumption A1: $s_i$, $u_i$, $e_i$ and $m_i$ are mutually independent;

Assumption A2: $E[u_i^3] = 0$.

Note that Assumptions A1 and A2 are satisfied in the DGPs we adopt for the Monte Carlo simulations underlying Figure 2, that is, when $(s_i, u_i, e_i, m_i)$ follow a joint normal distribution with the first two moments specified according to (A1).

A.2.1 Population Parameters
In this subsection, we derive expressions for the population regression coefficients $\beta_m$ and $\gamma_m$ in terms of the model parameters, as discussed in Section 3.
Applying regression anatomy to the multiple regression (9), we have
$$\gamma_m = \frac{Cov(y_i, u_i + m_i)}{Var(u_i + m_i)}, \tag{A2}$$
where $u_i + m_i$ is the residual from the population regression of $x_i^m$ on $s_i$. Using $\theta = \sigma_m^2/(\sigma_u^2 + \sigma_m^2)$ as defined above, equation (A2) becomes
$$\gamma_m = \gamma(1 - \theta). \tag{A3}$$
By the omitted variable bias formula, we have $\beta_s = \beta_m + \delta\gamma_m$ and $\beta_s = \beta + \delta\gamma$, and therefore
$$\beta_m = \beta + \delta\gamma - \delta\gamma(1 - \theta) = \beta + \delta\gamma\theta. \tag{A4}$$

As mentioned in the main text, an alternative representation of $\theta$ is
$$\theta = \frac{1 - \lambda}{1 - R^2}, \tag{A5}$$
where $\lambda = \frac{Var(x_i)}{Var(x_i^m)}$ is the reliability of $x_i^m$, and $R^2$ is the population $R^2$ of the regression of $x_i^m$ on $s_i$. To see why (A5) holds, notice that
$$1 - \lambda = \frac{\sigma_m^2}{Var(x_i^m)} \quad\text{and}\quad 1 - R^2 = \frac{\sigma_u^2 + \sigma_m^2}{Var(x_i^m)},$$
from which equation (A5) mechanically follows.

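The equivalence in (A5) is easy to verify numerically. In the sketch below (function names are ours), the first function computes $\theta = \sigma_m^2/(\sigma_u^2+\sigma_m^2)$ directly from the variance components, and the second recovers it from the reliability $\lambda$ and the population $R^2$ of the regression of $x_i^m$ on $s_i$.

```python
def theta_direct(sigma_u2, sigma_m2):
    # theta = sigma_m^2 / (sigma_u^2 + sigma_m^2): share of the residual
    # variance of x^m (after partialling out s) due to measurement error
    return sigma_m2 / (sigma_u2 + sigma_m2)

def theta_from_reliability(sigma_s2, sigma_u2, sigma_m2, delta):
    # x = delta_0 + delta*s + u, so Var(x) = delta^2 Var(s) + sigma_u^2,
    # and x^m = x + m, so Var(x^m) = Var(x) + sigma_m^2
    var_x = delta**2 * sigma_s2 + sigma_u2
    var_xm = var_x + sigma_m2
    lam = var_x / var_xm                      # reliability of x^m
    r2 = delta**2 * sigma_s2 / var_xm         # population R^2 of x^m on s
    return (1 - lam) / (1 - r2)               # equation (A5)
```

Both routes give the same $\theta$ for any admissible parameter values, since $1-\lambda$ and $1-R^2$ share the denominator $Var(x_i^m)$.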
A.2.2 Asymptotic Variance in the Coefficient Comparison Test under Homoskedasticity
For the coefficient comparison test of $\beta_s - \beta_m = 0$, the test statistic is
$$t = \frac{\hat{\beta}_s - \hat{\beta}_m}{\sqrt{Var(\hat{\beta}_s - \hat{\beta}_m)}},$$
which is asymptotically standard normal. As mentioned in Section 4, we rely on the delta method equation (13) to derive $Var(\hat{\beta}_s - \hat{\beta}_m)$. We have already shown that
$$Var(\hat{\delta}_m) = \frac{\sigma_u^2 + \sigma_m^2}{n\sigma_s^2}, \tag{A6}$$
and we derive $Var(\hat{\gamma}_m)$ and $Cov(\hat{\delta}_m, \hat{\gamma}_m)$ in the remainder of this subsection.
For simplicity of exposition, we make an additional assumption:

Assumption A3: $Var(e_i^m \mid s_i, x_i^m)$ is constant.

Like Assumptions A1 and A2, Assumption A3 is also satisfied in the DGPs underlying Figure 2. In the subsection below, we also derive the general expression for $Var(\hat{\beta}_s - \hat{\beta}_m)$ when Assumption A3 is relaxed.

In order to derive $Var(\hat{\gamma}_m)$, first note that under Assumption A3
$$Var(\hat{\gamma}_m) = \frac{Var(e_i^m)}{n\,Var(u_i + m_i)}, \tag{A7}$$
where, as mentioned above, $u_i + m_i$ is the residual from the population regression of $x_i^m$ on $s_i$. Since $Var(u_i + m_i) = \sigma_u^2 + \sigma_m^2$, the missing piece in equation (A7) is $Var(e_i^m)$. Plugging (A3) and (A4) into (9), we get
$$y_i = \alpha_m + (\beta + \delta\gamma\theta)s_i + \gamma(1-\theta)x_i^m + e_i^m;$$
matching residuals yields
$$e_i^m = e_i + \gamma\theta u_i - \gamma(1-\theta)m_i,$$
so that $Var(e_i^m) = \sigma_e^2 + \gamma^2\theta\sigma_u^2$ and
$$Var(\hat{\gamma}_m) = \frac{\sigma_e^2 + \gamma^2\theta\sigma_u^2}{n(\sigma_u^2 + \sigma_m^2)}. \tag{A8}$$

As for $Cov(\hat{\delta}_m, \hat{\gamma}_m)$, first note that
$$\hat{\delta}_m = \frac{\frac{1}{n}\sum_i (s_i - \bar{s})(x_i^m - \bar{x}^m)}{\frac{1}{n}\sum_i (s_i - \bar{s})^2} \tag{A9}$$
and
$$\hat{\gamma}_m = \frac{\frac{1}{n}\sum_i \tilde{x}_i^m y_i}{\frac{1}{n}\sum_i (\tilde{x}_i^m)^2}, \tag{A10}$$
where $\bar{s}$ and $\bar{x}^m$ are the sample averages of $s_i$ and $x_i^m$, with $\tilde{x}_i^m = x_i^m - \hat{\delta}_0 - \hat{\delta}_m s_i$ being the residual from regressing $x_i^m$ on $s_i$. By Assumption A1, along with the fact that $\hat{\delta}_0 \xrightarrow{p} \delta_0$ and $\hat{\delta}_m \xrightarrow{p} \delta$, the asymptotic joint distribution of the numerators in equations (A9) and (A10) is bivariate normal.

By Assumptions A1 and A2, the covariance between these numerators converges to zero. Since the denominators of equations (A9) and (A10) converge in probability to $\sigma_s^2$ and $\sigma_u^2 + \sigma_m^2$, respectively,
$$Cov(\hat{\delta}_m, \hat{\gamma}_m) = 0 \tag{A11}$$
asymptotically. Recall that $\beta_s - \beta_m = \delta\gamma_m = \delta\gamma(1 - \theta)$, so the power function of the coefficient comparison test is
$$\Phi\left(-C + \frac{\delta\gamma(1-\theta)}{\sqrt{Var(\hat{\beta}_s - \hat{\beta}_m)}}\right) + \Phi\left(-C - \frac{\delta\gamma(1-\theta)}{\sqrt{Var(\hat{\beta}_s - \hat{\beta}_m)}}\right),$$
where, combining the variance of $\hat{\delta}_m$ with (A8) and the zero covariance term via the delta method,
$$Var(\hat{\beta}_s - \hat{\beta}_m) = \gamma^2(1-\theta)^2\,\frac{\sigma_u^2 + \sigma_m^2}{n\sigma_s^2} + \delta^2\,\frac{\sigma_e^2 + \gamma^2\theta\sigma_u^2}{n(\sigma_u^2 + \sigma_m^2)}. \tag{A12}$$

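Putting these pieces together, the power function of the coefficient comparison test under homoskedasticity can be sketched as follows. The function name, the default critical value $C = 1.96$, and the treatment of all $\sigma$'s as known constants are our own choices; the variance combines $Var(\hat{\delta}_m)$, $Var(\hat{\gamma}_m)$ and a zero covariance term via the delta method.

```python
import math

def cc_test_power(n, delta, gamma, sigma_s, sigma_u, sigma_m, sigma_e, C=1.96):
    """Power of the two-sided coefficient comparison test of
    beta_s - beta_m = 0, under homoskedasticity (Assumption A3)."""
    theta = sigma_m**2 / (sigma_u**2 + sigma_m**2)
    gamma_m = gamma * (1 - theta)                          # attenuated coefficient
    var_delta = (sigma_u**2 + sigma_m**2) / (n * sigma_s**2)
    var_em = sigma_e**2 + gamma**2 * theta * sigma_u**2    # Var(e^m)
    var_gamma = var_em / (n * (sigma_u**2 + sigma_m**2))
    # Delta-method variance of beta_s_hat - beta_m_hat = delta_hat * gamma_hat,
    # with the covariance term equal to zero
    var_diff = gamma_m**2 * var_delta + delta**2 * var_gamma
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    shift = delta * gamma_m / math.sqrt(var_diff)          # beta_s - beta_m = delta*gamma_m
    return Phi(-C + shift) + Phi(-C - shift)
```

At $\delta = 0$ this again returns the size of the test. For $\delta \neq 0$ its noncentrality is the balancing test's noncentrality scaled down by the extra $Var(\hat{\gamma}_m)$ term, consistent with the power advantage of the balancing test discussed in the text.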
A.2.3 Relaxing Assumption A3
In this subsection, we provide the expression for $Var(\hat{\beta}_s - \hat{\beta}_m)$ while relaxing the conditional homoskedasticity of $e_i^m$, i.e. Assumption A3. Our derivation of this asymptotic variance expression still relies on equation (13). Since equations (A6) and (A11) are not affected by Assumption A3, we will only need the general expression for $Var(\hat{\gamma}_m)$.
The general expression is
$$Var(\hat{\gamma}_m) = \frac{Var(e_i^m)(\sigma_u^2 + \sigma_m^2) + \overbrace{\gamma^2\theta^2(\kappa_u - 3\sigma_u^4) + \gamma^2(1-\theta)^2(\kappa_m - 3\sigma_m^4)}^{(a)}}{n(\sigma_u^2 + \sigma_m^2)^2},$$
where $\kappa_u = E[u_i^4]$ and $\kappa_m = E[m_i^4]$. Compared to its expression under homoskedasticity (A8), we have an extra term (a) that accounts for the excess kurtosis of the $u$ and $m$ distributions. It follows that
$$Var(\hat{\beta}_s - \hat{\beta}_m) = \gamma^2(1-\theta)^2\,\frac{\sigma_u^2 + \sigma_m^2}{n\sigma_s^2} + \delta^2\,Var(\hat{\gamma}_m).$$
Note that when $u_i$ and $m_i$ are normal, $\kappa_u - 3\sigma_u^4 = 0$ and $\kappa_m - 3\sigma_m^4 = 0$, and the variance expression above simplifies to that of equation (A12). Since $Var(\hat{\beta}_s - \hat{\beta}_m)$ increases in $\kappa_u$ and $\kappa_m$, and the balancing test is unaffected by the heteroskedasticity of $e^m$, the power advantage of the balancing test is larger when $u_i$ and $m_i$ have thicker tails than a normal distribution.

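As a numerical check on the role of the kurtosis terms, the sketch below evaluates a candidate closed form of the general $Var(\hat{\gamma}_m)$ consistent with the properties stated above (our own reconstruction and function names; $\kappa_u$, $\kappa_m$ denote $E[u_i^4]$, $E[m_i^4]$): with normal inputs ($\kappa = 3\sigma^4$) it collapses to the homoskedastic expression (A8), and fatter tails inflate it.

```python
def var_gamma_general(n, gamma, sigma_u, sigma_m, sigma_e, kappa_u, kappa_m):
    """General Var(gamma_hat) without Assumption A3 (our reconstruction).
    The excess-kurtosis term vanishes when u and m are normal."""
    S = sigma_u**2 + sigma_m**2
    theta = sigma_m**2 / S
    var_em = sigma_e**2 + gamma**2 * theta * sigma_u**2    # Var(e^m)
    excess = gamma**2 * (theta**2 * (kappa_u - 3 * sigma_u**4)
                         + (1 - theta)**2 * (kappa_m - 3 * sigma_m**4))
    return (var_em * S + excess) / (n * S**2)

def var_gamma_homoskedastic(n, gamma, sigma_u, sigma_m, sigma_e):
    """Equation (A8): Var(gamma_hat) under Assumption A3."""
    S = sigma_u**2 + sigma_m**2
    theta = sigma_m**2 / S
    return (sigma_e**2 + gamma**2 * theta * sigma_u**2) / (n * S)
```

The comparison makes the qualitative claim in the text concrete: only the coefficient comparison test loads on $\kappa_u$ and $\kappa_m$, so thick tails in $u$ and $m$ widen the balancing test's power advantage.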
B Comparison with Oster (forthcoming)
The Oster (forthcoming) formulation of the causal regression involves an observed covariate $w_{1i}$ and an unobserved covariate $w_{2i}$, which is uncorrelated with $w_{1i}$. To map this into our setup, think of the true $x_i$ as capturing both $w_{1i}$ and $w_{2i}$, i.e. $x_i = \rho w_{1i} + w_{2i}$. Furthermore, there is equal selection, i.e.