On the Use of the Lasso for Instrumental Variables Estimation with Some Invalid Instruments

ABSTRACT We investigate the behavior of the Lasso for selecting invalid instruments in linear instrumental variables models for estimating causal effects of exposures on outcomes, as proposed recently by Kang et al. Invalid instruments are such that they fail the exclusion restriction and enter the model as explanatory variables. We show that for this setup, the Lasso may not consistently select the invalid instruments if these are relatively strong. We propose a median estimator that is consistent when less than 50% of the instruments are invalid, and its consistency does not depend on the relative strength of the instruments, or their correlation structure. We show that this estimator can be used for adaptive Lasso estimation, with the resulting estimator having oracle properties. The methods are applied to a Mendelian randomization study to estimate the causal effect of body mass index (BMI) on diastolic blood pressure, using data on individuals from the UK Biobank, with 96 single nucleotide polymorphisms as potential instruments for BMI. Supplementary materials for this article are available online.


Introduction
Instrumental variables estimation is a procedure for the identification and estimation of causal effects of exposures on outcomes where the observed relationships are confounded by nonrandom selection of exposure. This problem is likely to occur in observational studies, but also in randomized clinical trials if there is selective participant noncompliance. An instrumental variable (IV) can be used to solve the problem of nonignorable selection. To do this, an IV needs to be associated with the exposure, but only associated with the outcome indirectly through its association with the exposure. The former condition is referred to as the "relevance" and the latter as the "exclusion" condition. Examples of instrumental variables are quarter-of-birth for educational achievement to determine its effect on wages, see Angrist and Krueger (1991), randomization of patients to treatment as an instrument for actual treatment when there is noncompliance, see, for example, Greenland (2000), and Mendelian randomization studies use IVs based on genetic information, see, for example, Lawlor et al. (2008). For recent reviews and further examples see, for example, Clarke and Windmeijer (2012), Imbens (2014), Burgess, Small, and Thompson (2017), and Kang et al. (2016).
Whether instruments are relevant can be tested from the observed association between exposure and instruments. The effects on the standard linear IV estimator of "weak instruments," that is, the case where instruments are only weakly associated with the exposure of interest, have been derived for the linear model using weak instrument asymptotics by Staiger and Stock (1997). This has led to the derivation of critical values for the simple F-test statistic for testing the null of weak instruments by Stock and Yogo (2005). Another strand of the literature focuses on instrument selection in potentially high-dimensional settings, see, for example, Belloni et al. (2012), Belloni et al. (2014), Chernozhukov et al. (2015), and Lin et al. (2015), where the focus is on identifying important covariate effects and selecting optimal instruments from a (large) set of a priori valid instruments, where optimality is with respect to the variance of the IV estimator.
In this article, we consider violations of the exclusion condition of the instruments, following closely the setup by Kang et al. (2016) for the linear IV model where some of the available instruments can be invalid in the sense that they can have a direct effect on the outcomes or are associated with unobserved confounders. Kang et al. (2016) proposed a Lasso-type procedure to identify and select the set of invalid instruments. Liao (2013) and Cheng and Liao (2015) also considered shrinkage estimation for identification of invalid instruments, but in their setup there is a subset of instruments that is known to be valid and that contains sufficient information for identification and estimation of the causal effects. In contrast, Kang et al. (2016) did not assume any prior knowledge about which instruments are potentially valid or invalid. This is a similar setup as in Andrews (1999) who proposed a selection procedure using information criteria based on the so-called J-test of overidentifying restrictions, as developed by Sargan (1958) and Hansen (1982). The Andrews (1999) setup is more general than that of Kang et al. (2016) and requires a large number of model evaluations, which has a negative impact on the performance of the selection procedure.
This article assesses the performance of the Kang et al. (2016) Lasso-type selection and estimation procedure in their setting of a fixed number of potential instruments. If the set of invalid instruments were known, the oracle two-stage least squares (2SLS) estimator would be the estimator of choice in their setting. As the focus is estimation of and inference on the causal effect parameter, denoted by β, and as the standard Lasso approach does not have oracle properties, see, for example, Zou (2006), we show how the adaptive Lasso procedure by Zou (2006) can be used to obtain an estimator with oracle properties. To do so, we propose an initial consistent estimator of the parameters that is consistent also when the irrepresentable condition for consistent Lasso selection of Zhao and Yu (2006) and Zou (2006) fails. The oracle property in this setup is when an estimator for β has the same limiting distribution as the oracle 2SLS estimator.
Applying the irrepresentable condition to this IV setup, we derive conditions under which the Lasso method does not consistently select the invalid instruments. As is well known from Zhao and Yu (2006), Zou (2006), Meinshausen and Bühlmann (2006), and Wainwright (2009), certain correlation structures of the variables prevent consistent selection. New in our results are the conditions on the strength of the invalid instruments relative to that of the valid ones that result in violations of the irrepresentable condition, where the strength of an instrument is its standardized effect on the exposure. From this we can show that consistent selection of the invalid instruments may not be possible if these are relatively strong, even when less than 50% of the instruments are invalid, which is a sufficient condition for the identification of the parameters.
We show that under the condition that less than 50% of the instruments are invalid, a simple median-type estimator is a consistent estimator for the parameters in the model, independent of the strength of the invalid instruments relative to that of the valid instruments, or their correlation structure. It can therefore be considered for use in the adaptive Lasso procedure as proposed by Zou (2006). With $n$ the sample size, we show that the median estimator converges at the $\sqrt{n}$ rate, but with an asymptotic bias, as the limiting distribution is that of an order statistic. It does, however, satisfy the conditions for the adaptive Lasso procedure to enjoy oracle properties.
Because of this oracle property, and as in practice instrument strength is very likely to vary by instruments and invalid instruments could be relatively strong, it will be important to consider our adaptive Lasso approach for assessing instrument validity and estimating causal effects. In Mendelian randomization studies it is clear from the results of genome-wide association studies that genetic markers have differential impacts on exposures, and one cannot rule out ex ante that invalid instruments with a direct effect are also stronger predictors for the exposure. (Bowden et al. (2015) and Kolesar et al. (2015) allowed for all instruments to be invalid and showed that the causal effect can be consistently estimated if the number of instruments increases with the sample size under the assumption of uncorrelatedness of the instrument strength and their direct effects on the outcome variable.) The next section, Section 2, introduces the model and the Lasso estimator as proposed by Kang et al. (2016). In Section 3, we derive the irrepresentable condition for this particular Lasso selection problem and present the result on the relationship between the relative strengths of the instruments and consistent selection. Section 4 presents the median estimator, establishes its consistency, and shows that its asymptotic properties are such that the adaptive Lasso estimator enjoys oracle properties. Section 5 presents some Monte Carlo simulation results. In Section 5.2, we link the Andrews (1999) method to the Lasso selection problem and show how the test of overidentifying restrictions can be used as a stopping rule. Section 5.3 investigates how close the behavior of the adaptive Lasso estimator is to that of the oracle 2SLS estimator in the Monte Carlo simulations, by comparing the performances of the Wald tests on the causal parameter under the null for different sample sizes.
Further analyses and simulation results investigating the effects of varying the information content by varying the strength of the instruments and the size of the direct effects of the invalid instruments on the outcome are presented in Section B in the supplementary materials. In Section 6, the methods are applied to a Mendelian randomization study to estimate the causal effect of body mass index (BMI) on diastolic blood pressure using data on individuals from the UK Biobank, with 96 single nucleotide polymorphisms as potential instruments for BMI. Section 7 concludes.
The following notation is used in the remainder of the article. For a full column rank matrix $X$ with $n$ rows, $M_X = I_n - P_X$, where $P_X = X(X'X)^{-1}X'$ is the projection onto the column space of $X$, and $I_n$ is the $n$-dimensional identity matrix. A $k$-vector of ones is denoted as $\iota_k$. The $l_p$-norm is denoted by $\|\cdot\|_p$, and the $l_0$-norm, $\|\cdot\|_0$, denotes the number of nonzero components of a vector. We use $\|\cdot\|_\infty$ to denote the maximal element of a vector.

Model and Lasso Estimator
We follow Kang et al. (2016; KZCS from now on), who considered the following potential outcomes model. For $i = 1, \ldots, n$, let $Y_i^{(d,z)}$ be the potential outcome if individual $i$ were to have exposure $d$ and instrument values $z$. The observed outcome for individual $i$ is denoted by the scalar $Y_i$, the treatment by the scalar $D_i$, and the vector of $L$ potential instruments by $Z_{i.}$. The instruments may not all be valid and can have a direct or indirect effect. For two possible values of the exposure $d^*$, $d$ and instruments $z^*$, $z$, assume the following potential outcomes model
$$Y_i^{(d^*,z^*)} - Y_i^{(d,z)} = (z^* - z)'\phi + (d^* - d)\beta, \tag{1}$$
$$E\big[Y_i^{(0,0)} \mid Z_{i.}\big] = Z_{i.}'\psi, \tag{2}$$
where $\phi$ measures the direct effect of $z$ on $Y$, and $\psi$ represents the presence of unmeasured confounders that affect both the instruments and the outcome.
We have a random sample $\{Y_i, D_i, Z_{i.}'\}_{i=1}^n$. Combining (1) and (2), the observed data model for the random sample is given by
$$Y_i = D_i\beta + Z_{i.}'\alpha + \varepsilon_i, \tag{3}$$
where $\alpha = \phi + \psi$ and $\varepsilon_i = Y_i^{(0,0)} - E[Y_i^{(0,0)} \mid Z_{i.}]$, and hence $E[\varepsilon_i \mid Z_{i.}] = 0$. For ease of exposition, we further assume that $E[\varepsilon_i^2 \mid Z_{i.}] = \sigma_\varepsilon^2$. The KZCS definition of a valid instrument is then linked to the exclusion restriction and given as follows: Instrument $j$, $j \in \{1, \ldots, L\}$, is valid if $\alpha_j = 0$ and it is invalid if $\alpha_j \neq 0$. As in the KZCS setting, we are interested in the identification and estimation of the scalar treatment effect $\beta$ in large samples with a fixed number $L$ of potential instruments.
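Let $y$ and $d$ be the $n$-vectors of $n$ observations on $\{Y_i\}$ and $\{D_i\}$, respectively, and let $Z$ be the $n \times L$ matrix of potential instruments. As an intercept is implicitly present in the model, $y$, $d$, and the columns of $Z$ have all been taken in deviation from their sample means. Following the notation of Zou (2006), let $Z_A$ be the set of invalid instruments, $A = \{j : \alpha_j \neq 0\}$, and $\alpha_A$ the associated coefficient vector. The oracle instrumental variables or two-stage least squares (2SLS) estimator is obtained when the set $Z_A$ is known. Let $R_A = [d \;\; Z_A]$; the oracle 2SLS estimator is then given by
$$\hat\theta_{or} = (\hat\beta_{or}, \hat\alpha_A')' = (R_A' P_Z R_A)^{-1} R_A' P_Z y.$$
Let $\hat d = P_Z d$, with individual elements $\hat D_i$; then $\hat\theta_{or}$ is the OLS estimator in the model
$$Y_i = \hat D_i\beta + Z_{A,i.}'\alpha_A + \xi_i,$$
where $\xi_i$ is defined implicitly, and hence the oracle 2SLS estimator for $\beta$ is given by
$$\hat\beta_{or} = (\hat d' M_{Z_A} \hat d)^{-1}\hat d' M_{Z_A} y.$$
Under standard assumptions, as defined below,
$$\sqrt{n}(\hat\beta_{or} - \beta) \xrightarrow{d} N(0, \sigma^2_{\beta_{or}}), \tag{6}$$
where
$$\sigma^2_{\beta_{or}} = \sigma^2_\varepsilon\big(\operatorname{plim}\, n^{-1}\hat d' M_{Z_A}\hat d\big)^{-1}. \tag{7}$$
The vector $\hat d$ is the linear projection of $d$ on $Z$. If we define $\gamma = (E[Z_{i.}Z_{i.}'])^{-1}E[Z_{i.}D_i]$, then
$$D_i = Z_{i.}'\gamma + v_i, \tag{8}$$
with $E[Z_{i.}v_i] = 0$. Further, let $\Gamma = (E[Z_{i.}Z_{i.}'])^{-1}E[Z_{i.}Y_i] = \gamma\beta + \alpha$ and, for $\gamma_j \neq 0$, define
$$\pi_j \equiv \Gamma_j/\gamma_j = \beta + \alpha_j/\gamma_j$$
for $j = 1, \ldots, L$. Theorem 1 in KZCS states the conditions under which, given knowledge of $\gamma$ and $\Gamma$, a unique solution exists for values of $\beta$ and $\alpha_j$. A necessary and sufficient condition to identify $\beta$ and the $\alpha_j$ is that the valid instruments form the largest group, where instruments form a group if they have the same value of $\pi$. Corollary 1 in KZCS then states a sufficient condition for identification. Let $s = \|\alpha\|_0$ be the number of invalid instruments. A sufficient condition is that $s < L/2$, as then clearly the largest group is formed by the valid instruments. In model (3), some elements of $\alpha$ are assumed to be zero, but it is not known ex ante which ones they are and the selection problem therefore consists of correctly identifying those instruments with nonzero $\alpha$. KZCS proposed to estimate the parameters $\alpha$ and $\beta$ by using $l_1$ penalization on $\alpha$ and to minimize
$$(\hat\alpha^{(n)}, \hat\beta^{(n)}) = \arg\min_{\alpha,\beta}\ \tfrac12\|P_Z(y - d\beta - Z\alpha)\|_2^2 + \lambda_n\|\alpha\|_1,$$
where $\|\alpha\|_1 = \sum_j |\alpha_j|$. This method is closely related to the Lasso, and the regularization parameter $\lambda_n$ determines the sparsity of the vector $\hat\alpha^{(n)}$. From (5), a fast two-step algorithm is proposed as follows.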
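For a given $\lambda_n$ solve
$$\hat\alpha^{(n)} = \arg\min_{\alpha}\ \tfrac12\|M_{\hat d}P_Z y - M_{\hat d}Z\alpha\|_2^2 + \lambda_n\|\alpha\|_1 \tag{11}$$
and obtain $\hat\beta^{(n)}$ by
$$\hat\beta^{(n)} = \frac{\hat d'(y - Z\hat\alpha^{(n)})}{\hat d'\hat d},$$
where $\hat d = P_Z d$ and $M_{\hat d} = I_n - P_{\hat d}$. To find $\hat\alpha^{(n)}$ in (11), the Lasso modification of the LARS algorithm of Efron et al. (2004) can be used, and KZCS developed an R routine for this purpose, called sisVIVE (some invalid and some valid IV estimator), where the regularization parameter $\lambda_n$ is obtained by cross-validation.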
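The two-step computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses a plain coordinate-descent Lasso solver in place of the LARS algorithm, takes $\lambda_n$ as given rather than choosing it by cross-validation, and the names `lasso_cd` and `two_step_lasso` are hypothetical (they are not part of sisVIVE).

```python
import numpy as np


def soft_threshold(x, t):
    """Soft-thresholding operator used in each coordinate update."""
    return np.sign(x) * max(abs(x) - t, 0.0)


def lasso_cd(X, y, lam, n_iter=1000, tol=1e-10):
    """Minimize 0.5 * ||y - X a||_2^2 + lam * ||a||_1 by coordinate descent."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(y, dtype=float).copy()  # residual at a = 0
    a = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ resid + col_sq[j] * a[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            change = new - a[j]
            if change != 0.0:
                resid -= X[:, j] * change
                a[j] = new
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return a


def two_step_lasso(y, d, Z, lam):
    """Step 1: Lasso of M_dhat P_Z y on M_dhat Z. Step 2: back out beta."""
    d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]  # P_Z d
    y_hat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # P_Z y
    dd = d_hat @ d_hat
    y_t = y_hat - d_hat * (d_hat @ y_hat) / dd        # M_dhat P_Z y
    Z_t = Z - np.outer(d_hat, d_hat @ Z) / dd         # M_dhat Z
    alpha = lasso_cd(Z_t, y_t, lam)
    beta = d_hat @ (y - Z @ alpha) / dd
    return alpha, beta
```

With a very large penalty every element of $\alpha$ is shrunk to zero and the second step returns the naive 2SLS estimator that treats all instruments as valid; smaller penalties move instruments into the selected (invalid) set.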
For the random variables and iid sample $\{Y_i, D_i, Z_{i.}'\}_{i=1}^n$, and models (3) and (8), we assume throughout that the following conditions hold: with $Q$ a finite and full-rank matrix.
The elements of are finite.
The setting is thus a relatively straightforward one with fixed parameters $\beta$, $\alpha$, and $\gamma$, and fixed number $L \ll n$ of potential instruments. This is the setting under which the oracle 2SLS estimator has the limiting distribution (6), and is a setting of interest in many applications. To identify in this simple setting an ex ante unknown subset of invalid instruments using the Lasso is challenging, as highlighted in the next section where we investigate the irrepresentable condition for this setting.
For the case of many weak instruments, even the oracle 2SLS estimator would not be the estimator of choice, due to its poor asymptotic performance, and the median estimator may not be consistent. Oracle estimators with better asymptotic properties in this setting are the limited information maximum likelihood (LIML) estimator, see Bekker (1994) and Hansen, Hausman and Newey (2008), or the continuous updating estimator (CUE), see Newey and Windmeijer (2009). Selection of invalid instruments in this setting is outside the scope of this article.

Irrepresentable Condition
Let $\tilde y = M_{\hat d}P_Z y$ and $\tilde Z = M_{\hat d}Z$. As $\|\tilde y - \tilde Z\alpha\|_2^2 = \tilde y'\tilde y - 2y'\tilde Z\alpha + \alpha'\tilde Z'\tilde Z\alpha$, it follows that the Lasso estimator $\hat\alpha^{(n)}$ as defined in (11) can equivalently be obtained as
$$\hat\alpha^{(n)} = \arg\min_{\alpha}\ \tfrac12\|y - \tilde Z\alpha\|_2^2 + \lambda_n\|\alpha\|_1. \tag{13}$$
This minimization problem looks very much like a standard Lasso approach with $\tilde Z$ as explanatory variables. However, an important difference is that $\tilde Z$ does not have full rank, but its rank is equal to $L - 1$. This is related to the standard Lasso case where we have an overcomplete dictionary implying that the OLS solution is not feasible. Intuitively, we cannot set $\lambda_n = 0$ in (13) as we have to shrink at least one element of $\alpha$ to zero to identify the parameter $\beta$. All just-identified models with $L - 1$ instruments included as invalid result in a residual correlation of 0, and hence setting $\lambda_n = 0$ does not lead to a unique 2SLS estimator.
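The rank deficiency noted above can be checked numerically. The sketch below assumes simulated data with arbitrary first-stage coefficients (none of the values are taken from the article) and forms the transformed instrument matrix $M_{\hat d} Z$; because $Z\hat\gamma = \hat d$ is annihilated by $M_{\hat d}$, one dimension is lost.

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 500, 6
Z = rng.normal(size=(n, L))
Z -= Z.mean(axis=0)                       # deviations from sample means
d = Z @ np.linspace(0.2, 0.7, L) + rng.normal(size=n)
d -= d.mean()

gamma_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
d_hat = Z @ gamma_hat                     # P_Z d
Z_tilde = Z - np.outer(d_hat, d_hat @ Z) / (d_hat @ d_hat)  # M_dhat Z

rank_Z = np.linalg.matrix_rank(Z)
rank_Z_tilde = np.linalg.matrix_rank(Z_tilde)
# gamma_hat spans the null space of Z_tilde, so this norm is numerically zero.
null_direction_norm = np.linalg.norm(Z_tilde @ gamma_hat)
print(rank_Z, rank_Z_tilde, null_direction_norm)
```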
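We follow Zhao and Yu (2006) and Zou (2006), who developed the irrepresentable conditions for consistent Lasso variable selection. As before, let $A = \{j : \alpha_j \neq 0\}$ and assume wlog that $A = \{1, 2, \ldots, s\}$, $s < L$. (We will use subscripts $A$ and 1 interchangeably from here onward, and subscript 2 for associations with the set $A^c = \{j : \alpha_j = 0\}$.) Let
$$C = \operatorname{plim}(n^{-1}\tilde Z'\tilde Z) = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}, \tag{14}$$
where $\tilde Z = M_{\hat d}Z$ is the transformed instrument matrix of the Lasso problem (13) and $C_{11}$ is an $s \times s$ matrix. Further, define $s(\alpha_1)$ as the $s$-vector of signs of the elements of $\alpha_1$. Then
$$\|C_{21}C_{11}^{-1}s(\alpha_1)\|_\infty < 1 \tag{15}$$
is an (almost) necessary and sufficient condition for consistent Lasso variable selection. While (15) refers to the formulation of the weak irrepresentable condition of Zhao and Yu (2006), they showed that in this setting of a random design with fixed $L$ and constant parameters $\alpha$, their strong and weak irrepresentable conditions are equivalent to (15) almost surely (Zhao and Yu 2006, p. 2544).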
If (15) is satisfied, and if $\lambda_n$ satisfies $\lambda_n/n \to 0$ and $\lambda_n/\sqrt{n} \to \infty$, then $\lim_{n\to\infty}P(\hat A_n = A) = 1$, where $\hat A_n = \{j : \hat\alpha_j^{(n)} \neq 0\}$, see Theorem 1 in Zhao and Yu (2006). Necessity means that consistent model selection implies the irrepresentable condition. As Zou (2006) showed, if $\lim_{n\to\infty}P(\hat A_n = A) = 1$ and under the same conditions $\lambda_n/n \to 0$ and $\lambda_n/\sqrt{n} \to \infty$, then the following condition must hold:
$$\|C_{21}C_{11}^{-1}s(\alpha_1)\|_\infty \le 1. \tag{16}$$
While in the standard linear model setup $\lambda_n/n \to 0$ guarantees estimation consistency, see Lemma 1 in Zou (2006), this is not the case in the IV setup here because of the rank deficiency of $\tilde Z$.
Choosing λ n = 0 in the standard setup would simply result in consistent OLS estimation of a model that includes all variables, which is not possible here as discussed above. Therefore, if the necessary irrepresentable condition (16) does not hold, consistent Lasso selection is not possible and even λ n /n → 0 does not guarantee estimation consistency in this rank deficient IV case. We now analyze under what conditions the irrepresentable condition does or does not hold in the IV setup, focusing particularly on the relative strengths γ 1 and γ 2 of the invalid and valid instruments.
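Partition $Q = \operatorname{plim}(n^{-1}Z'Z)$ and $\gamma$ commensurate with the partitioning of $C$ as
$$Q = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix}, \qquad \gamma = \begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix}, \tag{17}$$
where the instruments have been standardized such that the diagonal elements of $Q$ are equal to 1. In contrast to $C$, $Q$ is not rank deficient. Then for the Lasso specification (13), we have the following result.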
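Proposition 1. Consider the observational models (3) and (8) under Assumptions 1, 3, and 4. Let $C = \operatorname{plim}(n^{-1}\tilde Z'\tilde Z)$; $Q = \operatorname{plim}(n^{-1}Z'Z)$; and $C_{11}$, $C_{21}$, $Q_{11}$, $Q_{21}$, $Q_{22}$, $\gamma_1$, and $\gamma_2$ as specified in (14) and (17). Then $C_{21}C_{11}^{-1}$ is given by
$$C_{21}C_{11}^{-1} = Q_{21}Q_{11}^{-1} - \frac{\tilde Q_{22}\,\gamma_2\,(\gamma_1 + Q_{11}^{-1}Q_{12}\gamma_2)'}{\gamma_2'\tilde Q_{22}\gamma_2},$$
where $\tilde Q_{22} = Q_{22} - Q_{21}Q_{11}^{-1}Q_{12}$.
Proposition 1 shows that consistent selection of the instruments is not only affected by the correlation structure of the instruments, but also by the values of $\gamma_1$ and $\gamma_2$. The next Proposition derives conditions on $\gamma_1$ and $\gamma_2$ under which the necessary condition for consistent variable selection (16) does not hold.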
From Proposition 1, we can investigate consistent selection for various cases of interest. Related to the Monte Carlo simulations in KZCS and in Section 5, Corollary 1 considers the case with $\gamma_1 = \tilde\gamma_1\iota_s$ and $\gamma_2 = \tilde\gamma_2\iota_{L-s}$, where $\tilde\gamma_1$ and $\tilde\gamma_2$ are scalars.
For equal strength instruments, γ 1 = γ 2 , the result of Corollary 1 shows that the necessary condition (16)
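The role of relative instrument strength can be illustrated numerically. The sketch below is an illustration rather than the article's derivation: it assumes uncorrelated standardized instruments ($Q = I_L$), positive direct effects so that the sign vector is $\iota_s$, and it uses $C = Q - Q\gamma\gamma'Q/(\gamma'Q\gamma)$, which follows from $\tilde Z = M_{\hat d}Z$. The function name `irrepresentable_vector` is hypothetical.

```python
import numpy as np


def irrepresentable_vector(Q, gamma, s, sign_alpha1):
    """Return C21 C11^{-1} s(alpha_1), with C = Q - Q g g' Q / (g' Q g)."""
    q = Q @ gamma
    C = Q - np.outer(q, q) / (gamma @ q)
    return C[s:, :s] @ np.linalg.solve(C[:s, :s], sign_alpha1)


L, s = 10, 3
Q = np.eye(L)
signs = np.ones(s)

equal = np.r_[np.full(s, 0.2), np.full(L - s, 0.2)]           # equal strength
strong_invalid = np.r_[np.full(s, 0.6), np.full(L - s, 0.2)]  # invalid 3x stronger

max_equal = np.abs(irrepresentable_vector(Q, equal, s, signs)).max()
max_strong = np.abs(irrepresentable_vector(Q, strong_invalid, s, signs)).max()
print(max_equal, max_strong)
```

In this uncorrelated design the maximal element equals $(\tilde\gamma_1/\tilde\gamma_2)\,s/(L-s)$, so it is $3/7$ with equal strength and $9/7 > 1$ when the invalid instruments are three times as strong, in which case the necessary condition fails.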

A Consistent Estimator when s < L/2 and Adaptive Lasso
As the results above highlight, the Lasso path may not include the correct model, leading to an inconsistent estimator of $\beta$. This is the case even if less than 50% of the instruments are invalid because of differential instrument strength and/or correlation patterns of the instruments. Indeed, we find in the simulation exercise of Section 5.1 that the Lasso selects the valid instruments as invalid if these are relatively weak, $\|\gamma_2\|_1 < \|\gamma_1\|_1$, for a design with $s(\alpha_1) = s(\gamma_1)$. In this section, we present an estimation method that consistently selects the invalid instruments when less than 50% of the potential instruments are invalid. This is the same condition as that for the Lasso selection problem to satisfy the irrepresentable condition for equal strength uncorrelated instruments, but the proposed estimator below is consistent when the instruments have differential strength and/or have a general correlation structure. We consider the adaptive Lasso approach of Zou (2006) using an initial consistent estimator of the parameters. In the standard linear case, the OLS estimator in the model with all explanatory variables included is consistent. As explained in Section 3, in the instrumental variables model this option is not available. We build on the result of Han (2008), who shows that the median of the $L$ IV estimates of $\beta$ using one instrument at a time is a consistent estimator of $\beta$ in a model with invalid instruments, but where the instruments cannot have direct effects on the outcome, unless the instruments are uncorrelated. Let $\hat\Gamma = (Z'Z)^{-1}Z'y$; $\hat\gamma = (Z'Z)^{-1}Z'd$ and let $\hat\pi$ be the $L$-vector with $j$th element
$$\hat\pi_j = \hat\Gamma_j/\hat\gamma_j. \tag{19}$$
Under the standard assumptions, Theorem 1 shows that the median of the $\hat\pi_j$, denoted $\hat\beta_m$, is a consistent estimator for $\beta$ when $s < L/2$, without any further restrictions on the relative strengths or correlations of the instruments. Theorem 1 also shows that $\sqrt{n}(\hat\beta_m - \beta)$ converges in distribution to that of an order statistic.
From these results it follows that the consistent estimator $\hat\alpha_m = \hat\Gamma - \hat\gamma\hat\beta_m$ can be used for the adaptive Lasso approach of Zou (2006), resulting in oracle properties of the resulting estimator of $\beta$.
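Theorem 1. Under model specifications (3) and (8) with Assumptions 1-4, let $\hat\pi$ be the $L$-vector with elements as defined in (19). If $s < L/2$, then the estimator $\hat\beta_m$ defined as
$$\hat\beta_m = \operatorname{median}(\hat\pi)$$
is a consistent estimator for $\beta$, $\operatorname{plim}(\hat\beta_m) = \beta$. Let $\hat\pi_2$ be the $L - s$ vector with elements $\hat\pi_j$, $j = s + 1, \ldots, L$. The limiting distribution of $\sqrt{n}(\hat\beta_m - \beta)$ is that of an order statistic of the limiting normal distribution of $\sqrt{n}(\hat\pi_2 - \beta\iota_{L-s})$.
Proof. See Section A.2 in the supplementary materials.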
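A minimal Python sketch of the median estimator in Theorem 1 is given below. The function name `median_estimator` is hypothetical; the computation follows (19): regress $y$ and $d$ on all instruments, take the elementwise ratio of the coefficients, and take the median. The implied initial estimate of $\alpha$ is returned as well, since it is the input for the adaptive Lasso.

```python
import numpy as np


def median_estimator(y, d, Z):
    """Median of pi_j = Gamma_j / gamma_j, plus the implied estimate of alpha."""
    Gamma_hat = np.linalg.lstsq(Z, y, rcond=None)[0]  # (Z'Z)^{-1} Z'y
    gamma_hat = np.linalg.lstsq(Z, d, rcond=None)[0]  # (Z'Z)^{-1} Z'd
    pi_hat = Gamma_hat / gamma_hat
    beta_m = np.median(pi_hat)
    alpha_m = Gamma_hat - gamma_hat * beta_m
    return beta_m, alpha_m, pi_hat
```

As a sanity check on the logic, when the outcome equation has no error term the ratios for the valid instruments equal $\beta$ exactly, so with $s < L/2$ the median recovers $\beta$ regardless of how strong the invalid instruments are.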
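Given the consistent estimator $\hat\beta_m$, we obtain a consistent estimator for $\alpha$ as
$$\hat\alpha_m = \hat\Gamma - \hat\gamma\hat\beta_m,$$
which can then be used for the adaptive Lasso specification of (13) as proposed by Zou (2006). The adaptive Lasso estimator for $\alpha$ is defined as
$$\hat\alpha_{ad}^{(n)} = \arg\min_{\alpha}\ \tfrac12\|y - \tilde Z\alpha\|_2^2 + \lambda_n\sum_{l=1}^{L}\frac{|\alpha_l|}{|\hat\alpha_{m,l}|^{\upsilon}},$$
and, for given values of $\upsilon$, can be estimated straightforwardly using the LARS algorithm, see Zou (2006). The resulting adaptive Lasso estimator for $\beta$ is obtained as
$$\hat\beta_{ad}^{(n)} = \frac{\hat d'(y - Z\hat\alpha_{ad}^{(n)})}{\hat d'\hat d}.$$
As the result for the limiting distribution of the median estimator shows, $\hat\beta_m$, although converging at the $\sqrt{n}$ rate, has an asymptotic bias. This clearly also results in an asymptotic bias of $\hat\alpha_m$. As $\sqrt{n}(\hat\alpha_m - \alpha) = O_p(1)$, Theorem 2 together with Remark 1 in Zou (2006) states the following properties of the adaptive Lasso estimator $\hat\alpha_{ad}^{(n)}$, where $\hat A_{ad,n} = \{j : \hat\alpha_{ad,j}^{(n)} \neq 0\}$.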
From the results of Proposition 3, it follows that the limiting distribution of $\hat\beta_{ad}^{(n)}$ is that of the oracle 2SLS estimator of $\beta$, as stated in the next Corollary.
Corollary 2. Under the conditions of Proposition 3, the limiting distribution of the adaptive Lasso estimator $\hat\beta_{ad}^{(n)}$ is given by
$$\sqrt{n}(\hat\beta_{ad}^{(n)} - \beta) \xrightarrow{d} N(0, \sigma^2_{\beta_{or}}),$$
with $\sigma^2_{\beta_{or}}$ as defined in (7).
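The adaptive Lasso step of this section can be computed with any ordinary Lasso solver by rescaling the columns of the design matrix. The sketch below assumes a simple coordinate-descent solver instead of LARS, and the names `lasso_cd` and `adaptive_lasso` are hypothetical. In the IV setting, `X` would be the transformed instrument matrix $M_{\hat d}Z$ and `alpha_init` the median-based estimate of $\alpha$.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=1000, tol=1e-10):
    """Minimize 0.5 * ||y - X a||_2^2 + lam * ||a||_1 by coordinate descent."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(y, dtype=float).copy()
    a = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(X.shape[1]):
            if col_sq[j] == 0.0:          # infinitely penalized coefficient
                continue
            rho = X[:, j] @ resid + col_sq[j] * a[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            change = new - a[j]
            if change != 0.0:
                resid -= X[:, j] * change
                a[j] = new
                max_change = max(max_change, abs(change))
        if max_change < tol:
            break
    return a


def adaptive_lasso(X, y, alpha_init, lam, upsilon=1.0):
    """Minimize 0.5*||y - X a||^2 + lam * sum_l |a_l| / |alpha_init_l|**upsilon."""
    scale = np.abs(np.asarray(alpha_init, dtype=float)) ** upsilon
    a_scaled = lasso_cd(np.asarray(X, dtype=float) * scale, y, lam)
    return a_scaled * scale               # map back to the original scale
```

Coefficients with a small initial estimate receive a large penalty and are set to zero more readily, which is the mechanism behind the oracle property discussed above.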

Relative Strength of Instruments
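We start by presenting some estimation results from a Monte Carlo exercise which is similar to that in KZCS. The data are generated from
$$Y_i = D_i\beta + Z_{i.}'\alpha + \varepsilon_i, \qquad D_i = Z_{i.}'\gamma + v_i,$$
where
$$\begin{pmatrix}\varepsilon_i \\ v_i\end{pmatrix} \sim N\left(\begin{pmatrix}0 \\ 0\end{pmatrix}, \begin{bmatrix}1 & \rho \\ \rho & 1\end{bmatrix}\right), \qquad Z_{i.} \sim N(0, I_L),$$
and we set $\beta = 0$; $L = 10$; $\rho = 0.25$; $s = 3$, and the first $s$ elements of $\alpha$ are equal to $a = 0.2$. Further, $\gamma_1 = \tilde\gamma_1\iota_s$ and $\gamma_2 = \tilde\gamma_2\iota_{L-s}$. Note that none of the estimation results presented here and below depend on the value of $\beta$. Table 1 presents estimation results for estimators of $\beta$ in terms of bias, standard deviation, root mean squared error (rmse), and median absolute deviation (mad) for 1000 replications for sample sizes of $n = 500$, $n = 2000$, and $n = 10{,}000$ for an equal strength design, with $\tilde\gamma_1 = \tilde\gamma_2 = 0.2$. The information content for IV estimation can be summarized by the concentration parameter, see Rothenberg (1984).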
For the oracle estimation of $\beta$ by 2SLS, the concentration parameter is given by $\mu_n^2 = \gamma_2'Z_2'M_{Z_1}Z_2\gamma_2/\sigma_v^2$. For this data-generating process with independent instruments, the concentration parameter is therefore approximately $n(L - s)(0.2^2)$ and hence equal to 140, 560, and 2800 for the three sample sizes. $\mu_n^2$ can be seen as a population Wald statistic for testing $H_0: \gamma_2 = 0$. The corresponding population F-statistics are equal to $n(0.2^2)$, or 20, 80, and 400 for the sample sizes 500, 2000, and 10,000, respectively.
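The arithmetic behind these values can be reproduced in a few lines. The sketch assumes $\sigma_v^2 = 1$, which is what the reported numbers imply, and the function names are illustrative.

```python
def concentration_parameter(n, L, s, gamma_valid, sigma2_v=1.0):
    """Approximate mu_n^2 = n (L - s) gamma^2 / sigma_v^2 for independent instruments."""
    return n * (L - s) * gamma_valid ** 2 / sigma2_v


def population_f(n, L, s, gamma_valid, sigma2_v=1.0):
    """Population F-statistic: concentration parameter per valid instrument."""
    return concentration_parameter(n, L, s, gamma_valid, sigma2_v) / (L - s)


for n in (500, 2000, 10_000):
    mu2 = concentration_parameter(n, L=10, s=3, gamma_valid=0.2)
    f_stat = population_f(n, L=10, s=3, gamma_valid=0.2)
    print(n, round(mu2), round(f_stat))
```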
A summary measure of the information content for Lasso selection is the (squared) signal-to-noise ratio (SNR), denoted by η 2 . It is defined as see, for example, Bühlmann and van der Geer (2011, p. 25).
Analogously to the concentration parameter, $n\eta^2$ can be interpreted as a population Wald statistic for testing $H_0: \alpha_1 = 0$. We analyze the effects of varying $\mu_n^2$ and $\eta^2$ more extensively in Section B.2 in the supplementary materials, where we derive that, for this design, resulting in $\eta^2 = 0.084$ for the parameter values considered in Table 1. The "2SLS" results are for the naive 2SLS estimator of $\beta$ that treats all instruments as valid. The probability limit of this estimator is given by
$$\operatorname{plim}\hat\beta_{naive} = \beta + \frac{\gamma'Q\alpha}{\gamma'Q\gamma} = \beta + \frac{\gamma_1'Q_{11}\alpha_1 + \gamma_2'Q_{21}\alpha_1}{\gamma_1'Q_{11}\gamma_1 + 2\gamma_2'Q_{21}\gamma_1 + \gamma_2'Q_{22}\gamma_2}. \tag{23}$$
Therefore, in the design specified here, we have $\operatorname{plim}(\hat\beta_{naive}) = s/L = 0.3$. The "2SLS or" is the oracle 2SLS estimator that correctly includes the three invalid instruments in the model as explanatory variables. For the Lasso estimates, the value for $\lambda_n$ has been obtained by 10-fold cross-validation, using the one-standard error rule, as in KZCS. This estimator is denoted "Lasso cvse" and is the one produced by the sisVIVE routine. We also present results for the cross-validated estimator that does not use the one-standard error rule, denoted "Lasso cv." For the Lasso estimation procedure, we standardize throughout such that the diagonal elements of $Z'Z/n$ are equal to 1.
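The probability limit in (23) and the reported signal-to-noise ratio can be checked numerically for the equal strength design. The sketch assumes $Q = I_L$ and $\sigma_\varepsilon^2 = 1$. The expression used for $\eta^2$, namely $\alpha'C\alpha/\sigma_\varepsilon^2$ with $C = Q - Q\gamma\gamma'Q/(\gamma'Q\gamma)$, is an assumption made here because it reproduces the reported value of 0.084; it is not quoted from the article. Function names are illustrative.

```python
import numpy as np


def naive_2sls_plim(beta, gamma, alpha, Q):
    """Probability limit of 2SLS that treats all instruments as valid."""
    return beta + (gamma @ Q @ alpha) / (gamma @ Q @ gamma)


def snr(gamma, alpha, Q, sigma2_eps=1.0):
    """Assumed SNR: alpha' C alpha / sigma_eps^2, C = Q - Q g g' Q / (g' Q g)."""
    q = Q @ gamma
    C = Q - np.outer(q, q) / (gamma @ q)
    return alpha @ C @ alpha / sigma2_eps


L, s = 10, 3
Q = np.eye(L)
gamma = np.full(L, 0.2)                          # equal strength design
alpha = np.r_[np.full(s, 0.2), np.zeros(L - s)]  # first s instruments invalid

plim_naive = naive_2sls_plim(0.0, gamma, alpha, Q)
eta2 = snr(gamma, alpha, Q)
print(plim_naive, eta2)
```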
We further present results for the so-called post-Lasso estimator, see, for example, Belloni et al. (2012), which is called the LARS-OLS hybrid by Efron et al. (2004). This is here simply the 2SLS estimator in the model that includes $Z_{\hat A_n}$, the set of instruments with nonzero estimated Lasso coefficients. Clearly, when $\hat A_n = A$, the post-Lasso 2SLS estimator is equal to the oracle 2SLS estimator. The post-Lasso 2SLS estimator is expected to have a smaller bias as it avoids the bias in the Lasso estimate of $\beta$ due to the shrinkage of the Lasso estimate of $\alpha$ toward 0, see also Hastie, Tibshirani, and Friedman (2009, p. 91). This shrinkage bias effect on $\hat\beta^{(n)}$ for models where $A \subseteq \hat A_n$ is in the direction of the bias of $\hat\beta_{naive}$, where $\alpha$ is assumed to be 0. (In an OLS setting, Belloni and Chernozhukov (2013) showed that the post-Lasso estimator can perform at least as well as Lasso in terms of rate of convergence, but is less biased even if the Lasso-based model selection misses some components of the true model.) Further entries in Table 1 are the average number of instruments selected as invalid, that is, the average number of instruments in $\hat A_n = \{j : \hat\alpha_j^{(n)} \neq 0\}$, together with the minimum and maximum number of selected instruments, and the proportion of times the instruments selected as invalid include all three invalid instruments.
The results in Table 1 reveal some interesting patterns. First of all, the Lasso cv estimator outperforms the Lasso cvse estimator in terms of bias, rmse, and mad for all sample sizes, but this is reversed for the post-Lasso estimators, that is, the post-Lasso cvse outperforms the post-Lasso cv. The Lasso cv estimator selects on average around 6.5 instruments as invalid, which is virtually independent of the sample size. The Lasso cvse estimator selects on average around 3.8 instruments as invalid for $n = 2000$ and $n = 10{,}000$, but fewer, 3.16, for $n = 500$. Although the three invalid instruments are always jointly selected as invalid for the larger sample sizes, the Lasso cvse is substantially biased, the biases being larger than twice the standard deviations. The post-Lasso cvse estimator performs best, but is still outperformed by the oracle 2SLS estimator at $n = 10{,}000$. Although the post-Lasso cvse estimator has a larger standard deviation than the Lasso cvse estimator, it has a smaller bias, rmse, and mad for all sample sizes.
We focus below on the performance of the median and adaptive Lasso estimators for a design with invalid instruments that are stronger than the valid ones, but for comparison we present results for these estimators for this equal strength instruments design in Section B.1 in the supplementary materials, which also includes a more detailed analysis of the differences in performances of the Lasso and post-Lasso estimators in this design. Table 2 presents estimation results for the same Monte Carlo design as in Table 1, but now with stronger invalid than valid instruments, with $\tilde\gamma_2 = 0.2$ and $\tilde\gamma_1 = 3\tilde\gamma_2$. At these relative values, the necessary condition (16) is not satisfied and the Lasso selection will here select the valid instruments as invalid. Note that the behavior of the oracle 2SLS estimator is the same as in Table 1. In this case, $\beta + a/\tilde\gamma_1 = 0 + 0.2/0.6 = 0.33$, which is the parameter value estimated by the invalid instruments. From (22), it follows that the SNR is smaller here, with $\eta^2 = 0.0247$. The estimation results for the adaptive Lasso are based on setting $\upsilon = 1$. The resulting estimators are denoted as "ALasso." As $L$ is even here, the median is defined as $\hat\beta_m = (\hat\pi_{[5]} + \hat\pi_{[6]})/2$, where $\hat\pi_{[j]}$ is the $j$th-order statistic. The results in Table 2 confirm that, for large sample sizes, the Lasso selects the valid instruments as invalid because of the relative strength of the invalid instruments. The post-ALasso cvse estimator does not perform well for $n = 500$, but does for the sample sizes of $n = 2000$ and $n = 10{,}000$, with results for the latter very similar to the oracle 2SLS results. The post-ALasso cv estimator performs better at $n = 500$, as it selects more instruments as invalid with a larger proportion correctly selecting all invalid instruments, although it is outperformed there by the simple median estimator $\hat\beta_m$.
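The population quantities for this stronger-invalid-instrument design can be verified directly. The sketch below assumes $Q = I_L$ and $\sigma_\varepsilon^2 = 1$, and it uses $\alpha'\alpha - (\gamma'\alpha)^2/(\gamma'\gamma)$ as the SNR expression; that expression is an assumption made here because it reproduces the reported 0.0247, not a formula quoted from the article.

```python
import numpy as np

L, s, beta, a = 10, 3, 0.0, 0.2
gamma = np.r_[np.full(s, 0.6), np.full(L - s, 0.2)]  # invalid instruments 3x stronger
alpha = np.r_[np.full(s, a), np.zeros(L - s)]

pi = beta + alpha / gamma        # population ratios: 1/3 for invalid, 0 for valid
pi_sorted = np.sort(pi)
# For even L, np.median averages the two middle order statistics.
beta_median = np.median(pi)
middle_average = (pi_sorted[4] + pi_sorted[5]) / 2

# Assumed SNR expression for Q = I and sigma_eps^2 = 1.
eta2 = alpha @ alpha - (gamma @ alpha) ** 2 / (gamma @ gamma)
print(pi[:s], beta_median, eta2)
```

Because seven of the ten ratios equal $\beta$, both middle order statistics equal $\beta$ and the median is unaffected by the three invalid instruments, even though these are the strongest ones.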

Alternative Stopping Rule
The results for the Lasso estimator in Table 1 show that the 10-fold cross-validation method tends to select too many valid instruments as invalid over and above the invalid ones, and that the ad hoc one-standard error rule does improve the selection. The fact that the cross-validation method selects too many variables is well known, see, for example, Bühlmann and van der Geer (2011), who argued that use of the cross-validation method is appropriate for prediction purposes, but that the penalty parameter needs to be larger for variable selection, as achieved by the one-standard error rule. Selecting valid instruments as invalid in addition to correctly selecting the invalid instruments clearly does not lead to an asymptotic bias, but results in a less efficient estimator as compared to the oracle estimator.
We propose a stopping rule for the LARS/Lasso algorithm based on the approach of Andrews (1999) for moment selection, which is particularly well-suited for the IV selection problem. We can use this approach because the number of instruments $L \ll n$. This stopping rule is computationally less expensive than cross-validation.
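Consider again the oracle model
$$y = d\beta + Z_A\alpha_A + \varepsilon = R_A\theta_A + \varepsilon.$$
Let $g_n(\theta_A) = n^{-1}Z'(y - R_A\theta_A)$, and $W_n$ a $k_z \times k_z$ weight matrix, then the oracle generalized method of moments (GMM) estimator is defined as
$$\hat\theta_{A,gmm} = \arg\min_{\theta_A}\ g_n(\theta_A)'W_n^{-1}g_n(\theta_A),$$
see Hansen (1982). 2SLS is a one-step GMM estimator, setting $W_n = n^{-1}Z'Z$. Given the moment conditions $E[Z_{i.}\varepsilon_i] = 0$, 2SLS is efficient under conditional homoscedasticity, $E(\varepsilon_i^2 \mid Z_{i.}) = \sigma_\varepsilon^2$. Under general forms of conditional heteroscedasticity, an efficient two-step oracle GMM estimator is obtained by setting
$$W_n = W_n(\hat\theta_{A,1}) = n^{-1}\sum_{i=1}^{n}\big(Y_i - R_{A,i.}'\hat\theta_{A,1}\big)^2 Z_{i.}Z_{i.}',$$
where $\hat\theta_{A,1}$ is an initial consistent estimator, with a natural choice the 2SLS estimator. Then, under the null that the moment conditions are correct, $E[Z_{i.}\varepsilon_i] = 0$, the Hansen (1982) J-test statistic and its limiting distribution are given by
$$J_n(\hat\theta_A) = n\,g_n(\hat\theta_A)'W_n^{-1}(\hat\theta_{A,1})\,g_n(\hat\theta_A) \xrightarrow{d} \chi^2_{L - \dim(R_A)}.$$
For any set $A^+$, such that $A \subset A^+$, we have that
$$J_n(\hat\theta_{A^+}) \xrightarrow{d} \chi^2_{L - \dim(R_{A^+})},$$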

whereas for any set
Note that the J-test is a robust score, or Lagrange multiplier, test for testing $H_0: \alpha_C = 0$ in the just identified specification
$$y = d\beta + Z_B\alpha_B + Z_C\alpha_C + \varepsilon,$$
where $Z_B$ is a $k_B$ set of instruments included in the model and $Z_C$ is any selection of $L - k_B - 1$ instruments from the $L - k_B$ set of instruments not in $Z_B$, see, for example, Davidson and MacKinnon (1993, p. 235). This makes clear the link between the J-test and testing for additional invalid instruments of the form as specified in model (3).
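A two-step GMM version of the J-statistic for a candidate set of invalid instruments can be sketched as follows. This is an illustration under the standard definitions (2SLS as the first step, a heteroscedasticity-robust weight matrix in the second step); the function name `hansen_j` and its interface are hypothetical.

```python
import numpy as np


def hansen_j(y, d, Z, invalid):
    """J-statistic and degrees of freedom when columns `invalid` of Z are regressors."""
    n = len(y)
    R = np.column_stack([d, Z[:, invalid]])
    ZR, Zy = Z.T @ R, Z.T @ y
    # Step 1: 2SLS, that is, GMM with W_n = Z'Z / n.
    W1_inv = np.linalg.inv(Z.T @ Z / n)
    theta1 = np.linalg.solve(ZR.T @ W1_inv @ ZR, ZR.T @ W1_inv @ Zy)
    # Step 2: robust weight matrix built from the 2SLS residuals.
    e1 = y - R @ theta1
    W2_inv = np.linalg.inv((Z * e1[:, None] ** 2).T @ Z / n)
    theta2 = np.linalg.solve(ZR.T @ W2_inv @ ZR, ZR.T @ W2_inv @ Zy)
    g = Z.T @ (y - R @ theta2) / n            # sample moments at the GMM estimate
    j_stat = n * g @ W2_inv @ g
    df = Z.shape[1] - R.shape[1]              # L - dim(R)
    return j_stat, df
```

In a just-identified specification, with $L - 1$ instruments included as regressors, the sample moments are exactly zero, so the statistic is zero and carries no information, which matches the discussion of the rank deficiency in Section 3.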
We can now combine the LARS/Lasso algorithm with the Hansen J-test, which is a directed downward testing procedure in the terminology of Andrews (1999). Compute $J_n(\hat\theta_{\hat A_n^{[j]}})$ at every LARS/Lasso step $j = 0, 1, 2, \ldots$, where $\hat A_n^{[0]} = \emptyset$ and $|\hat A_n^{[1]}|_0 = 1$, and compare it to a corresponding critical value $\zeta_{n,L-k}$ of the $\chi^2_{L-k}$ distribution, where $k$ is the number of parameters in the model at that step. We then select the model with the largest degrees of freedom $L - k$ for which $J_n(\hat\theta_{\hat A_n^{[j]}})$ is smaller than the critical value. If two models of the same dimension pass the test, which can happen with a Lasso step, the model with the smallest value of the J-test gets selected. (If there is no empirical evidence at all for any invalid instruments, that is, if $J_n(\hat\theta_{\hat A_n^{[0]}})$ is smaller than its corresponding critical value, then the model with all instruments as valid gets selected.) Clearly, this approach is a post-Lasso approach, where the LARS/Lasso algorithm is used purely for selection of the invalid instruments. For consistent model selection, the critical values $\zeta_{n,L-k}$ need to satisfy $\zeta_{n,L-k} \to \infty$ for $n \to \infty$, and $\zeta_{n,L-k} = o(n)$, see Andrews (1999). As the oracle model is on the adaptive LARS/Lasso path in large samples, this approach leads to consistent selection, $\lim_{n\to\infty} P(\hat A^{ad}_{n,ah} = A) = 1$, the subscript $ah$ standing for Andrews/Hansen. As Guo et al. (2018, Theorem 2) showed, consistent selection implies that the limiting distribution of the post-ALasso ah estimator, that is, the 2SLS estimator $\hat\beta_{\hat A^{ad}_{n,ah}}$, is the same as that of the oracle 2SLS estimator. This approach also leads to consistent selection along the Lasso path when the irrepresentable condition (15) holds, resulting in oracle properties of the resulting post-Lasso ah estimator.
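The selection rule just described can be sketched in a few lines. The code below is an illustration only: it assumes the J statistics along the LARS/Lasso path have already been computed, the function names and the example path are hypothetical, and the chi-square quantile is obtained from a standard-library series and bisection routine written for this sketch rather than from a statistics package.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return min(1.0, math.exp(a * math.log(z) - z - math.lgamma(a)) * total)


def chi2_quantile(q, df):
    """Quantile of the chi-square distribution by bisection on the CDF."""
    lo, hi = 0.0, float(df) + 10.0
    while chi2_cdf(hi, df) < q:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def andrews_hansen_select(path, L, p_n):
    """path: list of (invalid_set, J) pairs, one per LARS/Lasso step.

    Returns the set with the largest degrees of freedom L - k whose J statistic
    is below the 1 - p_n chi-square quantile; ties in dimension are broken by
    the smallest J. Returns None if no model passes (a choice of this sketch).
    """
    passing = []
    for invalid, J in path:
        df = L - (len(invalid) + 1)        # k = 1 + number of invalid instruments
        if df >= 1 and J < chi2_quantile(1.0 - p_n, df):
            passing.append((-df, J, frozenset(invalid)))
    if not passing:
        return None
    return min(passing, key=lambda t: (t[0], t[1]))[2]


# Hypothetical path with L = 10 instruments and n = 2000
path = [(set(), 250.0), ({3}, 80.0), ({3, 7}, 5.2), ({1, 3, 7}, 3.0)]
selected = andrews_hansen_select(path, L=10, p_n=0.1 / math.log(2000))
print(sorted(selected))
```

In the example, the models with zero and one invalid instrument are rejected, and the first model that passes, with instruments 3 and 7 treated as invalid, is selected even though a larger model also passes.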
Let $\zeta_{n,L-k} = \chi^2_{L-k}(p_n)$ be the $1 - p_n$ quantile of the $\chi^2_{L-k}$ distribution. Here, $p_n$ is the p-value of the test. This combination of the Andrews/Hansen method with the LARS/Lasso steps therefore results in having to choose a p-value $p_n$ instead of a penalty parameter $\lambda_n$. Keeping $n$ fixed, choosing a large value for $p_n$ leads to selecting a larger set as invalid instruments as compared to choosing a smaller value for $p_n$. Finite sample inference will not be straightforward, as this method is essentially a sequential approach where the model at step $j$ is only considered when the model at step $j - 1$ is rejected. Using the consistent selection properties, we will investigate the behavior of the Wald test in the next section and find in our simulation designs that this method performs quite well and similar to the ALasso cvse method in the unequal instrument strength design, and also performs well using the post-Lasso ah estimator for the equal strength design.

Table 3 presents the estimation results using this stopping rule as a selection device for the Lasso estimator for the design with equal strength instruments and the adaptive Lasso estimator for the unequal instrument strength design, as in Tables 1 and 2. We denote the resulting 2SLS estimators as "post-(A)Lasso ah." The p-values here are chosen as $p_n = 0.1/\ln(n)$, following Belloni et al. (2012), and are equal to 0.0161, 0.0132, and 0.0109 for $n$ equal to 500, 2000, and 10,000, respectively. For the equal strength design, the ah approach selects too few invalid instruments for $n = 500$, resulting in an upward bias, with bias, std dev, rmse, and mad very similar to those of the post-Lasso cvse estimator in Table 1. For $n = 2000$ and $n = 10{,}000$, this post-Lasso procedure performs well with properties very similar to those of the oracle 2SLS estimator, and with smaller bias, rmse, and mad than the post-Lasso cvse method.
For the unequal strength design, for n = 10,000 the results are virtually identical to those of the oracle and post-ALasso cvse estimators, whereas the post-ALasso ah estimator performs better in terms of bias, std dev, rmse, and mad than the post-ALasso cvse estimator when n = 2000. Again, when n = 500, the method does not select the invalid instruments.
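The p-values quoted above for the three simulation sample sizes follow directly from the rule $p_n = 0.1/\ln(n)$. A small check, with a hypothetical helper name:

```python
import math


def p_n(n):
    """Significance level 0.1 / ln(n) for the Hansen test-based stopping rule."""
    return 0.1 / math.log(n)


levels = {n: round(p_n(n), 4) for n in (500, 2000, 10_000)}
print(levels)   # {500: 0.0161, 2000: 0.0132, 10000: 0.0109}
```

Because $p_n$ falls as $n$ grows, the critical value rises with the sample size, in line with the requirement that it diverges for consistent selection.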

Inference
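From the limiting distribution result (21), a simple approach to estimating the asymptotic variance of the post-ALasso 2SLS estimator for $\beta$ is by calculating the standard 2SLS variance estimator. Let $\hat d = P_Z d$ with $P_Z = Z(Z'Z)^{-1}Z'$, and let $M_{\hat A} = I_n - Z_{\hat A_{ad,n}}(Z_{\hat A_{ad,n}}'Z_{\hat A_{ad,n}})^{-1}Z_{\hat A_{ad,n}}'$. The post-ALasso 2SLS estimator and its estimated variance are given by

$$\hat\beta^{(n)}_{ad,post} = (\hat d' M_{\hat A}\hat d)^{-1}\hat d' M_{\hat A}\,y, \qquad \widehat{Var}(\hat\beta^{(n)}_{ad,post}) = \hat\sigma^2_\varepsilon\,(\hat d' M_{\hat A}\hat d)^{-1},$$

where $\hat\sigma^2_\varepsilon = \hat\varepsilon'\hat\varepsilon/n$, $\hat\varepsilon = y - d\hat\beta^{(n)}_{ad,post} - Z_{\hat A_{ad,n}}\hat\alpha^{(n)}_{\hat A_{ad,n},post}$. Under the conditions of Proposition 3, the standard assumptions and conditional homoscedasticity, $n\,\widehat{Var}(\hat\beta^{(n)}_{ad,post}) \xrightarrow{p} \sigma^2_{\hat\beta_{or}}$. A standard robust version, robust to general forms of heteroscedasticity, is given by

$$\widehat{Var}_r(\hat\beta^{(n)}_{ad,post}) = (\hat d' M_{\hat A}\hat d)^{-1}\,\hat d' M_{\hat A} H M_{\hat A}\hat d\,(\hat d' M_{\hat A}\hat d)^{-1},$$

where $H$ is an $n \times n$ diagonal matrix with diagonal elements $H_{ii} = \hat\varepsilon_i^2$, for $i = 1, \ldots, n$. The robust Wald test for the null $H_0: \beta = \beta_0$ is then given by

$$W_{\beta,r} = \frac{(\hat\beta^{(n)}_{ad,post} - \beta_0)^2}{\widehat{Var}_r(\hat\beta^{(n)}_{ad,post})},$$

which is compared with the critical values of the $\chi^2_1$ distribution.

From the results for the post-ALasso cvse and post-ALasso ah estimators for the unequal strength instruments design as presented in Tables 2 and 3, respectively, one would expect this approach to work well for the large sample case, $n = 10{,}000$, as there the estimation results are very close to those of the oracle 2SLS estimator. The robust Wald test for the null $H_0: \beta = 0$, the true value of $\beta$, at the 10% level for $n = 10{,}000$ has a rejection frequency of 9.3% and 9.2% for the post-ALasso cvse and post-ALasso ah estimators, respectively, very close to that of the robust Wald test based on the oracle 2SLS estimator, which has a rejection frequency of 9.0%.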
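A minimal sketch of the post-selection 2SLS estimator, its standard and heteroscedasticity-robust variance estimates, and the robust Wald statistic follows. It is not the authors' implementation. The function name, the simulated design and the use of NumPy are assumptions of this example, and the selected set is simply taken as given. The code works with the full regressor matrix $[d, Z_{\hat A}]$ and reads off the element for $\beta$, which is the usual matrix form of 2SLS with included exogenous regressors.

```python
import numpy as np


def post_selection_2sls(y, d, Z, invalid, beta0=0.0):
    """2SLS with the selected instruments as controls. Returns beta_hat, its
    standard and robust variance estimates, and the robust Wald statistic."""
    n = len(y)
    idx = np.asarray(invalid, dtype=int)
    R = np.column_stack([d, Z[:, idx]])
    R_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ R)    # projection of R on Z
    A_inv = np.linalg.inv(R_hat.T @ R_hat)
    theta = A_inv @ (R_hat.T @ y)
    e = y - R @ theta                                # residuals use R, not R_hat
    var_std = (e @ e / n) * A_inv[0, 0]
    meat = (R_hat * e[:, None] ** 2).T @ R_hat       # R_hat' H R_hat
    var_rob = (A_inv @ meat @ A_inv)[0, 0]
    wald = (theta[0] - beta0) ** 2 / var_rob
    return theta[0], var_std, var_rob, wald


# Hypothetical design: beta = 0, first of five instruments invalid
rng = np.random.default_rng(1)
n, L = 2000, 5
Z = rng.standard_normal((n, L))
v = rng.standard_normal(n)
eps = 0.5 * v + rng.standard_normal(n)
d = Z @ np.full(L, 0.5) + v
y = 1.0 * Z[:, 0] + eps

beta_hat, var_std, var_rob, wald_true = post_selection_2sls(y, d, Z, [0], beta0=0.0)
_, _, _, wald_false = post_selection_2sls(y, d, Z, [0], beta0=1.0)
print(beta_hat, var_std ** 0.5, var_rob ** 0.5, wald_true, wald_false)
```

The Wald statistic for the true value $\beta_0 = 0$ is compared with a $\chi^2_1$ critical value, for example 2.706 at the 10% level, while the statistic for the false value $\beta_0 = 1$ is far in the rejection region.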
For the equal strength instruments design, we perform the same analysis for the post-Lasso estimators. Figures 1(a)-1(c) show the performance of the robust Wald test $W_{\beta,r}$, measured by its rejection frequency at the 10% level, as a function of the sample size in steps of 500, $n = 500, 1000, \ldots, 5000$. Figures 1(a) and 1(b) show the results for the post-Lasso and post-ALasso estimators, respectively, for the equal strength instruments design. Figure 1(c) shows the results for the post-ALasso estimators for the unequal strength instruments design. Figure 1(a) clearly shows that the Lasso cv and Lasso cvse procedures do not result in consistent selection and the resulting post-Lasso estimators do not have oracle properties. The Wald test rejection frequencies remain constant for increasing sample size and larger than those of the oracle estimator. In contrast, the post-Lasso ah estimator behaves very similarly to the oracle estimator in this design from $n = 1500$ onward. Figure 1(b) shows that both the post-ALasso cvse and post-ALasso ah estimators behave like the oracle estimator, again from $n = 1500$ onward in this design. The results in Figure 1(c) show that for the unequal instruments strength design considered here, the performances of the post-adaptive Lasso estimators are far from that of the oracle estimator in small samples, as expected from the results in Tables 2 and 3. The post-ALasso ah estimator behaves like the oracle estimator here from $n = 4000$ onward, with the post-ALasso cvse estimator behaving similarly, but having a larger rejection frequency for all sample sizes considered here that are less than $n = 5000$.
The results in Tables 1-3 and Figures 1(a)-1(c) show clearly that the information content in the data, given the parameter values chosen here, is insufficient at n = 500 for the (adaptive) Lasso procedures to correctly select the invalid instruments and hence the resulting estimators have poor properties, far removed from those of the oracle estimator. At these levels of information, the ALasso cv estimator is actually the preferred estimator as it counteracts the selection of too few invalid instruments of the ALasso cvse and ALasso ah estimators. We further explore how the performances of the estimators depend on the information content of the data-generating process in Section B.2 in the supplementary materials.

The Effect of BMI on Diastolic Blood Pressure Using Genetic Markers as Instruments
We use data on 105,276 individuals from the UK Biobank and investigate the effect of BMI on diastolic blood pressure (DBP). See Sudlow et al. (2015) for further information on the UK Biobank. We use 96 single nucleotide polymorphisms (SNPs) as instruments for BMI as identified in independent GWAS studies, see Locke et al. (2015). With Mendelian randomization studies, the SNPs used as potential instruments can be invalid for various reasons, such as linkage disequilibrium, population stratification, and horizontal pleiotropy, see, for example, von Hinke et al. (2016) or Davey Smith and Hemani (2014). For example, an SNP has pleiotropic effects if it not only affects the exposure but also has a direct effect on the outcome. While we guard against population stratification by considering only white European origin individuals in our data, the use of the Lasso methods can be extremely useful here to identify the SNPs with direct effects on the outcome and to estimate the causal effect of BMI on diastolic blood pressure taking account of this.
Because of skewness, we log-transformed both BMI and DBP. The linear model specification includes age, age$^2$, and sex, together with 15 principal components of the genetic relatedness matrix as additional explanatory variables. Table 4 presents the estimation results for the causal effect parameter, which is here the percentage change in DBP due to a 1% change in BMI. As p-value for the Hansen test-based procedures we take again $0.1/\ln(n) = 0.0086$.
The OLS estimate of the causal parameter is equal to 0.206 (s.e. 0.003), whereas the 2SLS estimate treating all 96 instruments as valid is much smaller at 0.087 (s.e. 0.016), with a 95% confidence interval of [0.056, 0.118]. The J-test, however, rejects the null that all the instruments are valid. The Lasso cv estimator identifies a large number of instruments, 56, as invalid. The Lasso cv estimate is equal to 0.126 and the post-Lasso cv estimate is equal to 0.145. The Lasso cvse procedure identifies 20 instruments as invalid and the Lasso cvse estimate is equal to 0.111. The post-Lasso cvse estimate is larger and equal to 0.142, which is in line with our findings above that, due to shrinkage, the Lasso estimator is biased toward the 2SLS estimator that treats all instruments as valid. The post-Lasso ah procedure selects a subset of 12 instruments as invalid, and the post-Lasso ah parameter estimate is equal to 0.122. The median estimate $\hat\beta_m$ is equal to 0.148. Using this estimate for the adaptive Lasso results in the cv method selecting 54 instruments as invalid and the cvse method selecting 17 instruments as invalid. The adaptive Lasso ah method selects a subset of 11 instruments as invalid. The post-ALasso cv, post-ALasso cvse, and post-ALasso ah estimates are equal to 0.161, 0.151, and 0.163, respectively, with the 95% confidence intervals of the post-ALasso cvse and post-ALasso ah estimators given by [0.113, 0.189] and [0.127, 0.198], respectively. These results indicate that the OLS estimator is less confounded than suggested by the 2SLS estimation results using all 96 instruments as valid instruments.
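As a worked check of the interval arithmetic, the 95% confidence interval reported for the 2SLS estimate that treats all instruments as valid can be reproduced from the estimate and its standard error. The helper name is hypothetical, and the interval is assumed to be the usual normal-approximation interval.

```python
from statistics import NormalDist


def wald_ci(estimate, se, level=0.95):
    """Normal-approximation confidence interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se


lo, hi = wald_ci(0.087, 0.016)
print(f"[{lo:.3f}, {hi:.3f}]")   # [0.056, 0.118]
```

The OLS estimate of 0.206 lies well outside this interval, while the intervals reported for the post-ALasso estimators lie much closer to it.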
The strongest potential instrument is the FTO SNP. For all Lasso estimators in Table 4, it is selected as an invalid instrument. The estimated value of $\pi_{FTO}$ is $-0.009$, that is, negative, which is contrary to the direction of the estimated causal effect.
The F-test statistic for $H_0: \gamma_2 = 0$ for the model resulting from the ALasso ah procedure is equal to 18.21, with the associated estimate of the concentration parameter equal to 1547.81. The F-test result indicates that the 2SLS estimator may suffer from some many-weak-instruments bias, see Stock and Yogo (2005). However, the LIML (limited information maximum likelihood) estimator in this model is very similar to the 2SLS estimator and is equal to 0.159 (s.e. 0.019), indicating that there is no many-weak-instruments problem here, see Davies et al. (2015).
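The two reported numbers can be related by simple arithmetic. The check below rests on an assumption that is not stated in the text: that the concentration parameter estimate equals the F statistic multiplied by the number of instruments treated as valid, here 96 minus the 11 selected as invalid by the ALasso ah procedure.

```python
n_instruments = 96
n_selected_invalid = 11                               # ALasso ah selection
k_valid = n_instruments - n_selected_invalid          # 85 (assumed count)

concentration_estimate = 1547.81
implied_F = concentration_estimate / k_valid
print(k_valid, round(implied_F, 2))                   # 85 18.21
```

Under that assumption the implied F statistic matches the reported value of 18.21 to two decimals.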

Conclusions
Instrumental variables estimation is a well-established procedure for the identification and estimation of causal effects of exposures on outcomes where the observed relationships are confounded by nonrandom selection of exposure. The main identifying assumption is that the instruments satisfy the exclusion restriction, that is, they only affect the outcomes through their relationship with the exposure. In an important contribution, Kang et al. (2016) showed that the Lasso method for variable selection can be used to select invalid instruments in linear IV models, even though there is no prior knowledge about which instruments are valid.
We have shown here that, even under the sufficient condition for identification that less than 50% of the instruments are invalid, the Lasso selection may select the valid instruments as invalid if the invalid instruments are relatively strong, that is, the case where an invalid instrument explains more of the exposure variance than a valid instrument. Consistent selection of invalid instruments also depends on the correlation structure of the instruments.
We show that a median estimator is consistent when less than 50% of the instruments are invalid, and its consistency does not depend on the relative strength of the instruments or their correlation structure. This initial consistent estimator can be used for the adaptive Lasso estimator of Zou (2006) and we show that it performs well for larger sample sizes/information settings in our simulations. This adaptive Lasso estimator has the same limiting distribution as the oracle 2SLS estimator, and solves the inconsistency problem of the Lasso method when the relative strength of the invalid instruments is such that the Lasso method selects the valid instruments as invalid.
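The idea behind the median estimator can be illustrated with population quantities. The sketch below assumes the estimator is the median of the instrument-specific ratios of reduced-form to first-stage coefficients, $\Gamma_j/\gamma_j = \beta + \alpha_j/\gamma_j$. All numbers are hypothetical, and population values are used in place of estimates to keep the example deterministic.

```python
from statistics import median


def median_estimator(Gamma, gamma):
    """Median of the instrument-specific ratios Gamma_j / gamma_j."""
    return median(G / g for G, g in zip(Gamma, gamma))


beta = 0.3
gamma = [2.0, 2.0, 0.5, 0.5, 0.5]     # first-stage effects; invalid ones are stronger
alpha = [1.0, 1.0, 0.0, 0.0, 0.0]     # direct effects; two of five instruments invalid
Gamma = [g * beta + a for g, a in zip(gamma, alpha)]   # reduced-form effects

beta_m = median_estimator(Gamma, gamma)
print(beta_m)
```

With fewer than half of the instruments invalid, the valid instruments all share the ratio $\beta$, so the median equals $\beta$ regardless of how strong the invalid instruments are.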

Supplementary Materials
The document contains the proofs of Proposition 1 and Theorem 1 in Section A, and further simulation results and discussions in Section B.
The Stata module "SIVREG" implements the post-ALasso ah method. Further details and documentation are provided in Farbmacher (2017).