Permutation testing in high-dimensional linear models: an empirical investigation

Permutation testing in linear models, where the number of nuisance coefficients is smaller than the sample size, is a well-studied topic. The common approach of such tests is to permute residuals after regressing on the nuisance covariates. Permutation-based tests are valuable in particular because they can be highly robust to violations of the standard linear model, such as non-normality and heteroscedasticity. Moreover, in some cases they can be combined with existing, powerful permutation-based multiple testing methods. Here, we propose permutation tests for models where the number of nuisance coefficients exceeds the sample size. The performance of the novel tests is investigated with simulations. In a wide range of simulation scenarios our proposed permutation methods provided appropriate type I error rate control, unlike some competing tests, while having good power.


Introduction
We consider the problem of testing hypotheses about coefficients in linear models, where the outcome may be non-Gaussian and heteroscedastic, and the number of nuisance coefficients exceeds the sample size. By the nuisance coefficients we mean the coefficients which are not tested by the particular test at hand, but still need to be dealt with since they lead to confounding effects. In recent decades, the literature on permutation methods has strongly expanded (Tusher et al., 2001;Meinshausen et al., 2011;Hemerik and Goeman, 2018a;Ganong and Jäger, 2018;Berrett et al., 2018;He et al., 2019;Albajes-Eizagirre et al., 2019;Hemerik et al., 2019;Rao et al., 2019). While the permutation test dates far back (Fisher, 1936), most of the permutation tests in the presence of nuisance were published in the last four decades. To our knowledge, the existing methods are limited to low-dimensional nuisance. For the high-dimensional case, an approach similar to a permutation test is proposed in Dezeure et al. (2017).
Permutation tests for low-dimensional linear models are valuable for two main reasons. First, they are robust to violations of certain standard assumptions, such as normality and homoscedasticity. Under non-normality, they often have higher power than a classical parametric test (Anderson and Legendre, 1999;Kowalski, 1972). Under heteroscedasticity, they often provide better type I error control than parametric tests. Second, when the outcome is multidimensional, a permutation-based test can be combined with existing permutationbased multiple testing methods, which tend to be relatively powerful, since they take into account the dependence structure of the outcomes (Meinshausen, 2006;Meinshausen et al., 2011;Hemerik and Goeman, 2018a;Hemerik et al., 2019). For example, under strong positive dependence among p-values, the Bonferroni-Holm multiple testing method (Holm, 1979) is greatly improved by a permutation method (Westfall and Young, 1993). Permutation methods are particularly popular in omics research and neuroscience (Winkler et al., 2014). This is partly due to the fact that permutation-based multiple testing methods take into account the dependence among, for example, genes or voxels Hemerik and Goeman, 2018a).
Exact permutation tests, i.e.. tests with level exactly equal to the nominal value α, are only available if the outcome is Gaussian Langsrud, 2005), if all nuisance parameters are known (Anderson and Robinson, 2001) or if the nuisance variables are discrete (so that we can permute within blocks). Otherwise, permutation tests are only asymptotically exact.
For the low-dimensional general linear model, with identity link but not necessarily Gaussian or homoscedastic residuals, several different permutation tests have been proposed. The main approach that these methods have in common, is to permute residuals after regressing on the nuisance covariates. There are different ways to permute the residuals. Instead of permutation in the strict sense, the residuals may also be randomly sign-flipped, under the assumption that the residuals are symmetric. For overviews of the available methods, see Anderson and Legendre (1999), Anderson and Robinson (2001), Winkler et al. (2016) and in particular Winkler et al. (2014).
Among the existing permutation methods, the ones that often perform best with respect to type I error control and power, are the procedures commonly referred to as the Freedman-Lane method (Freedman and Lane, 1983) and the Smith method. The latter procedure is named after a reviewer (O'Gorman, 2005;Winkler et al., 2014). Compared to all other methods, the Freedman-Lane approach is most commonly used. It tends to provide excellent type I error control, even if the number of nuisance covariates approaches the sample size.
Because the existing permutation tests require estimating the nuisance coefficients using maximum likelihood, these methods cannot be used when the number of covariates exceeds the sample size. It is of interest to somehow extend these methods to high-dimensional linear models. A seemingly natural way to do this, is to replace the least squares estimation by some type of regularized estimation, e.g. ridge estimation.
However, extending the existing methodology to the high-dimensional setting is not straightforward. For example, the Freedman-Lane method requires re-estimating the nuisance each time the data have been permuted. This can quickly become computationally challenging in the high-dimensional setting. Moreover, it is not obvious what test statistic should be used within the permutation test. For example, the usual F -and Wald statistics are not available in the high-dimensional setting. As in the low-dimensional case, a suitable choice of the test statistic is essential for the validity of the test.
In recent years, important theoretical advances have been made in the field of inference in high-dimensional linear models. Although there is a vast literature on regularization methods such as ridge regression, most existing results on testing in such models are quite recent (for a partial overview, see Dezeure et al., 2015). Several of these tests have proven asymptotic properties. In particular, the method in Zhang and Zhang (2014) has been shown to be asymptotically optimal under certain assumptions (Van de Geer et al., 2014). The method in Bühlmann et al. (2013) tends to be more conservative, but requires less assumptions to provide asymptotic type I error control. Dezeure et al. (2017) propose a bootstrap approach, which is related to the method in Zhang and Zhang (2014) .
The known theoretical properties of these methods are asymptotic and rely on complex assumptions and sparsity. The test by Zhang and Zhang (2014) can be rather anti-conservative in settings where a substantial fraction of the coefficients are non-zero. Moreover, these methods are not based on permutations. Hence they do not generally have the above-mentioned advantages, such as robustness against certain violations of the standard linear model. An exception is the bootstrap method in Dezeure et al. (2017), which tends to be more robust to such violations.
We propose two novel tests, which, to our knowledge, are the first permutation tests in the presence of high-dimensional nuisance. One is an extension of the low-dimensional method in Freedman and Lane (1983) and the other is somewhat related to a method by Kennedy (Kennedy, 1995;Kennedy and Cade, 1996). The method based on Freedman and Lane (1983) requires performing ridge regression in every permutation step, but we do this in a computationally efficient way.
Using simulations we show that our methods provide appropriate type I error rate control in a wide range of situations. In particular, we illustrate empirically that our tests have the above-mentioned robustness properties. The methods in this paper have been implemented in the R package phd, available on CRAN.
This paper is built up as follows. In Section 2 we discuss exact permutation tests, which are available when the nuisance parameters are known. We discuss permutation testing in settings with low-dimensional, unknown nuisance in Section 3. This section contains some novel remarks which will be used in Section 4. There, we propose permutation tests for highdimensional settings. We assess the performance of our methods with simulations in Section 5. An analysis of real data is in Section 6.
2 Known nuisance 2.1 Notation throughout the paper We consider the general linear model where X is a n × d matrix of covariates of interest, Z an n × q matrix of nuisance covariates and ǫ an n-vector of i.i.d. errors with mean 0 and non-zero variance, which are independent of the covariates. Here the rows of X, Z and Y are i.i.d.. The matrix Z is assumed to have full rank with probability 1. We will often focus on the case that d = 1. Let the variables Y ∈ R, X ∈ R d and Z ∈ R q be such that their joint distribution coincides with that of the rows of (Y , X, Z).
The parameter β ∈ R d is of interest and γ ∈ R q is a nuisance parameter. We want to test the null hypothesis H 0 : β = 0 ∈ R d . Here 0 might be replaced by another constant: the extension is straightforward.
Let w be a positive integer, which will denote the number of random permutations or other transformations. In this paper, all permutation p-values are of the form or, in case of a two-sided test where both small and large values of T 1 are evidence against where T 2 , ..., T w ∈ R are statistics computed after random permutation and T 1 is a statistic based on the original, unpermuted data. The methods in this paper only differ with respect to how T 1 , ..., T w are computed.

Exact permutation tests
When the nuisance parameter γ is known, then E Z (Y ), the expected value of Y given Z under H 0 , is known. Hence, under H 0 , the errors ǫ coincide with the observed residuals e = Y − E Z Y . However, the distributional shape of the errors is not generally known, so that there exists no exact test of H 0 in general. When the distribution of the errors is known to be invariant under a group of permutations or other transformations, then we do obtain an exact test. More precisely, suppose that under for all g ∈ G, where G is a group of transformations g : R n → R n . See e.g. Hemerik and Goeman (2018b) for the definition of a group in the algebraic sense. Examples of such G are given below. The exact test is then obtained as follows. Let T be a function from the sample space of (X, Z, Y ) to R, large (absolute) values of which are evidence against H 0 . For example, T could be an F -statistic. For every 1 ≤ j ≤ w, define T j = T (X, Z, Y * j ), where with g j ∈ G a random transformation of the vector of residuals e. Here, we draw g 2 , ..., g w uniformly from G, with replacement. We take g 1 to be the identity, so that T 1 corresponds to the original data. We draw with replacement for convenience, although drawing without replacement is also allowed (Hemerik and Goeman, 2018b). The p-value is then (1) or (2), as required by the context. See Algorithm 1. Writing α ∈ (0, 1) for the desired confidence level, we reject H 0 when the p-value is at most α. Under H 0 , the resulting rejection probability is at most α. Under mild assumptions such as continuous residuals, if the transformations are drawn without replacement and α is a multiple of 2w −1 , then the rejection probability is exactly α (Hemerik and Goeman, 2018b).
In practice, G is usually taken to be a group of permutation maps or a group of signflipping transformations (Winkler et al., 2014). In case we use permutation, G consists of all maps g : R n → R n of the form (e 1 , ..., e n ) → (e π(1) , ..., e π(n) ),
To obtain an exact test, we require (3) to hold. In case G contains permutations, (3) is satisfied if e 1 , ..., e n are independent and identically distributed. Note that we then need no assumption on the shape of the errors ǫ. This means that the test is still exact if the errors ǫ are not normal, i.e., the test is robust to non-normality. If sign-flipping is used, it suffices to assume that the errors ǫ are independent and symmetric around 0. The errors can have different distributions, however. In particular they are allowed to have very different variances, i.e., the test is robust to heteroscedasticity.

Unknown, low-dimensional nuisance
Here we discuss existing permutation methods that can be used when the nuisance parameter γ is unknown and has dimension smaller than n. For this setting an appreciable number of permutation methods have been proposed (Winkler et al., 2014), most of which are asymptotically exact under mild assumptions (Anderson and Robinson, 2001). We will focus on two methods, which inspire the methods in Section 4. Section 3.1 also contains some novel remarks, which will be important in Section 4.1.
The existing permutation methods all provide p-values according to formulas (1) and (2), but differ with respect to how permutation is used to obtain T 1 , ..., T w . Although we will often write 'permutation', sign-flipping can also be used, as explained in Section 2.2. All methods in this paper consist of the following steps.
1. Compute a test statistic T 1 based on the original data.
2. Compute a test statistic T 2 in a similar way, but after randomly permuting certain residuals. Repeat to obtain T 3 , ..., T w .
Most of the existing permutation methods use residualization of Y or X with respect to the nuisance Z. The residual forming matrix is All tests in Section 3 involve the residuals RY ∈ R n . When d = 1 we will sometimes consider RX ∈ R n , which is assumed to be nonzero with probability 1. In Section 3 we assume Z contains a column of 1's. This implies that the entries of RX and RY sum up to 0.
Note that if we use permutation, we can write the transformed residuals g(RY ) as P RY , where P is an n × n matrix with exactly one 1 in every row and column and elsewhere 0's. In case of sign-flipping, P is instead an n × n diagonal matrix with diagonal elements in {1, −1} (Winkler et al., 2014). We write P 1 , ..., P w to distinguish the w random permutation matrices. Here P 1 is the identity matrix and P 2 , ..., P w are random.

The Freedman-Lane method
The Freedman-Lane permutation method (Freedman and Lane, 1983) is known to provide excellent type I error control, with both its level and power staying very close to the parametric F -test, under the Gaussian model. The test statistic T 1 is based on the unpermuted model Y = Xβ+Zγ+ǫ. The other statistics are obtained after randomly transforming the residuals. That is, for 2 ≤ j ≤ w the statistic T j is based on the model (P where the same test statistic, say T , is used as for computing T 1 . Thus where T is a suitable test statistic, the choice of which we now discuss. In Section 2 we saw that when the nuisance parameters are exactly known, the permutation test controls the type I error rate regardless of the choice of T . Here, however, it is usually important to use an asymptotically pivotal statistic, i.e., a statistic whose asymptotic null distribution does not depend on any unknowns under H 0 (Kennedy and Cade, 1996, p.926-927, Winkler et al., 2014, p.382, Hall and Titterington, 1989, Hall and Wilson, 1991. A pivotal statistic T will always involve estimation of the nuisance parameters. Thus, after every permutation, the nuisance parameters need to be estimated anew. Note that T should for example not be taken to be an estimate of β, since that is not a pivotal statistic. Instead one can use the F -statistic or Wald statistic. These are equivalent: the resulting p-value (1) will be the same.
In case X is one-dimensional, the F -statistic is also equivalent to the square of the partial correlation (Fisher, 1924;Agresti, 2015), which is used in Anderson and Robinson (2001). The partial correlation is the sample Pearson correlation of RY and RX, Here we used that the sample means of RY and RX are 0. If we use the partial correlation in the Freedman-Lane permutation test, this means that we take T (X, Z, Y ) = ρ RY , RX , so that (4) and (5) become where R(P j R + H) could be simplified to RP j R, since RH = 0. The numerator in (6) is The Freedman-Lane test with T defined by (9) remains unchanged if in (9) we replace i (RX) 2 i by 1 or by the constant i X 2 i . Indeed, T 1 , ..., T w will just be multiplied by the same constant. Thus, with respect to the permutation test, the statistic (6) If X has been centered around 0, then this equals where µ x denotes the n-vector with entries equal to the sample mean of X. This is the sample correlation of RY and X and is called the semi-partial correlation. If we take T to be the semi-partial correlation, then (4) and (5) become T 1 = ρ RY , X and where R(P j R+H) could be simplified to RP j R. Note that we could simply leave the constant i (X i − µ x ) 2 out without changing the result of the permutation test. Although for centered X the statistics (6) and (11) are equivalent, their counterparts in the high-dimensional setting are not, as will be discussed in Section 4.1.

The Kennedy method
The Kennedy method (Kennedy, 1995;Kennedy and Cade, 1996;Winkler et al., 2014) residualizes both Y and X with respect to Z. The outcome residuals are then permuted. Apart from the initial residualization step, the Kennedy method performs no nuisance estimation. The test statistics are where 2 ≤ j ≤ w. In the usual formulation of the Kennedy method, X is assumed to be one-dimensional, but generalizations are conceivable. The Kennedy method is similar to the Still-White procedure (Gail et al., 1988;Levin and Robbins, 1983;Still and White, 1981), which only residualizes Y .
In the simulations of Winkler et al. (2014), the Freedman-Lane method provided better type I error control than the Kennedy method. This is explained theoretically in Anderson and Robinson (2001). Heuristically, the reason is that the Freedman-Lane method repeats the residualization after every permutation, thus mimicking the computation of the original statistic T 1 . In Anderson and Robinson (2001) it is shown that the permutation tests by Freedman andLane (1983), Ter Braak (1992), Kennedy (1995) and Manly (1997) are asymptotically equivalent to each other and are asymptotically exact.

High-dimensional nuisance
When the nuisance parameter γ has dimension q ≥ n, the discussed permutation methods cannot be used. Here, these approaches are adapted to obtain tests which can account for high-dimensional nuisance. We consider the case that X is one-dimensional, i.e., d = 1, although generalizations where X is high-dimensional are conceivable. We assume that the entries of Y , X and Z have expected value 0. Consequently, the intercept is 0.
All existing tests rely on residualization steps, where Y or X is regressed on Z. A natural way to adapt this step to the high-dimensional setting, is to instead estimate the residuals using some type of elastic net regularization. We will consider ridge regression. For minimizing prediction error, ridge regression is often preferrable to Lasso, principal components regression, variable subset selection and partial least squares (Hastie et al., 2009;Frank and Friedman, 1993).
Compared to the methods in Section 3, using ridge regression comes down to replacing the projectionsŶ = HY andX = HX by ridge estimatesH λ Y andH λ X X, with λ, λ X > 0. Here, for λ ′ > 0,H which satisfiesH and similarly for X. The values λ, λ X are the regularization parameters, whose selection will be discussed. Using ridge regression, the residuals becomeR λ Y andR λ X X, wherẽ R λ = (I −H λ ) andR λ X = (I −H λ X ). The last two rows of Table 1 outline the permutation schemes that we will consider in Sections 4.1 and 4.2. The first two rows summarize the existing methods that have been discussed in Section 3. This table is analogous to Table 2 in Winkler et al. (2014) and allows easy comparison of the new methods with the existing methods discussed in Winkler et al. (2014).
Although Table 1 outlines the permutation schemes that we will use, several crucial specifics remain to be filled in. For example, several choices of the regularization parameters λ and λ X can be considered. Moreover, the computational challenge of performing nuisance estimation in every step will need to be addressed. Finally and importantly, we must determine what test statistics are suitable to use within our permutation tests.

Freedman-Lane HD
As discussed in Section 3.1, the low-dimensional Freedman-Lane method is known to provide excellent type I error control and power. Here we will provide an extension to the case of

Method
Model after permutation Freedman-Lane high-dimensional nuisance. We will refer to this test as Freedman-Lane HD. The permutation scheme that we use is analogous to that of Freedman-Lane and is shown in the third row of Table 1. As in the Freedman-Lane method, after every permutation, we will require nuisance estimation to compute T j . We will choose ridge regression to do this. Note however that when many permutations are used, performing a ridge regression after every permutation can be a large computational burden. We will therefore compute λ only once, for the unpermuted model. We take λ to be the value that gives the minimal mean cross-validated error; see Section 5.1 for more details. After each permutation, we then use the same parameter λ in the ridge regression. Thus, after the j-th permutation, to compute the new ridge residuals, we will only need to pre-multiply the transformed outcome (P jRλ +H λ )Y byR λ . We only need to computeR λ once. Owing to this approach, essentially we need to perform ridge regression only once.
An important consideration is the test statistic T used within the permutation test. The usual F -statistic and Wald statistic are only defined when the nuisance is low-dimensional. Extending these definitions to the high-dimensional setting with q ≥ n is problematic. For example, a Wald-type statistic would require an unbiased estimate of β and a variance estimate. The partial correlation (6), however, is more naturally generalized to the q ≥ n setting: we can replace the residuals RY and RX by the ridge residualsR λ Y andR λ X X. Similarly we can generalize the semi-partial correlation (11), by replacing RY byR λ Y . This gives the following test statistics, which generalize the partial correlation (6) and the semi-partial correlation (11) respectively: Here, µ 1 , µ 2 and µ x are n-vectors whose entries are the sample means ofR λ Y ,R λ X X and X respectively. In Section 3.1 we reasoned that if X has been centered, (6) and (11) are equivalent with respect to the permutation test. This does not apply to (16) and (17). In simulations, using the statistic (17) tended to result in somewhat higher power than using the statistic (16). In Section 5 we consider both methods.
In case the generalization of the partial correlation is used, the test statistics T 1 , ..., T w on which Freedman-Lane HD is based are Here µ j is an n-vector whose entries are the sample mean ofR λ (P jRλ + H λ )Y . For the version based on the generalization of the semi-partial correlation, the statistics are As usual, T 1 is just T j with P j = I n . The pseudo-code for the version based on semi-partial correlations is in Algorithm 2.
If q < n, as λ ↓ 0, the test converges to the test for λ = 0, which is the classical Freedman-Lane method. In the wide range of simulation settings considered in Section 5, the Freedman-Lane HD method stayed on the conservative side, in the sense that the size was less than α. This may due to the fact that if λ > 0 and 2 ≤ j < k ≤ w, the correlation between Y and Y * j is strictly larger than the correlation between Y * j and Y * k , where Y * j := (P jRλ +H λ )Y . This inequality is proved in the Supplementary Material.
As discussed, to perform the test, λ and henceR λ need to be computed only once. Thus, like the low-dimensional Freedman-Lane procedure, the test requires nuisance estimation after every permutation, but this is not a large computational burden. The method is often computationally feasible even when many millions of permutations are used; see Section 5. It is also worth mentioning that there exist approximate methods for reducing the number of permutations while still allowing for very small, accurate p-values (Knijnenburg et al., 2009;Winkler et al., 2016).

Algorithm 2 Freedman-Lane HD (version based on semi-partial correlations)
1: ComputeH λ = Z(Z ′ Z + λI q ) −1 Z ′ and the residual forming matrixR λ = I −H λ . Here λ is taken to give the minimal mean cross-validated error (see main text). 2: Let T 1 = ρ R λ Y , X , the sample Pearson correlation of the Y -residuals with X. 3: for 2 ≤ j ≤ w do

Double residualization
Here we propose a test that we refer to as the Double Residualization method. The method is somewhat related to the Kennedy procedure discussed in Section 3.2, but not analogous. The Kennedy method residualizes both Y and X and proceeds to permute the Y -residuals. Here we replace the least squares regression by ridge regression. Moreover, unlike Kennedy's permutation scheme, we keepH λ Y in the model; see Table 1. The test statistic that we use within the permutation test is the sample correlation. Thus, the test is based on the statistics where 2 ≤ j ≤ w. The difference between (22) and (19) is that (19) contains an additional R λ . The pseudo-code for the Double Residualization method is in Algorithm 3. We take λ and λ X to be the values that give the minimal mean cross-validated error; see Section 5.1 for more details. For fixed q, as n → ∞, the Double Residualization method becomes equivalent to the Kennedy method and the Freedman-Lane method if the penalty is o P (n 1/2 ), as shown in the Supplementary Material. The case that q > n is investigated in Section 5.
Let T j = ρ (P jRλ +H λ )Y ,R λ X X , where the random matrix P j encodes random permutation or sign-flipping. 5: end for 6: The two-sided p-value p equals (2). 7: return p

Simulations
We used simulations to gain additional insight into the performance of the new tests, as well as existing tests. The simulations were performed with R version 3.6.0 on a server with 40 cores and 1TB RAM. In Section 5.2 we consider scenarios where the outcome Y follows a standard Gaussian high-dimensional linear model. In Section 5.3 we consider non-standard settings with non-normality and heteroscedasticity. We consider simulated datasets where the covariates have equal variances. It is well-known that when the data are not standardized, this can affect the accuracy of the model obtained with ridge regression (Bühlmann et al., 2014, p.257).

Simulation settings and tests
We considered the model in Section 2.1, where the variable of interest was one-dimensional, i.e., β ∈ R. In every simulation, the covariates had mean 0 and variance 1. They were sampled from a multivariate normal distribution with homogenous correlation ρ ′ , unless stated otherwise. The errors ǫ had variance 1. The intercept was γ 1 = 0, i.e., Y had mean 0. The tested hypothesis was H 0 : β = 0. The sample size in the reported simulations was n = 30. We obtained comparable results for other sample sizes. The estimated probabilities in the tables are based on 10 4 repeated simulations, unless stated otherwise.
As the set G of transformations we took the group of n! permutations, unless stated otherwise. The penalty λ was chosen to give the minimal mean error, based on 10-fold cross validation. The penalty λ X was chosen analogously. To compute the penalties, we used the cv.glmnet() function in the R package glmnet. We used [10 −5 , 10 5 ] as the range of candidate values for the penalty. The penalty obtained with cv.glmnet() is scaled by a factor n, so we multiplied this penalty by n to obtain λ. We included an intercept in the ridge regressions, but excluding the intercept gave very similar results.
All tests used were two-sided. The tests corresponding to the columns of the tables in this section are the following.
"FLH1" is the Freedman-Lane HD test defined in Section 4.1, with test statistics T 1 , ..., T w based on the generalized partial correlation as in (19). "FLH2" is the same, except that T 1 , .., T w are based on the generalized semi -partial correlation as in (21). "DR" is the Double Residualization method of Section 4.2. Each of these tests used w = 2 · 10 4 permutations.
"BM" is a high-dimensional test based on ridge projections, proposed in Bühlmann et al. (2013). This test is based on a bias-corrected estimate |β corr | of |β| ∈ R and an asymptotic upper bound of its distribution. We used the implementation in the R package hdi (Dezeure et al., 2015).
"ZZ" is a high-dimensional test based on Lasso projections, proposed in Zhang and Zhang (2014). This method constructs a different bias-corrected estimateb of β, which has an asymptotically known normal distribution under certain asumptions, such as sparsity. For this test we also used the hdi package. We could not include this test, in the simulations with a very high number of nuisance parameters, since it is computationally very time-consuming when q is large, as also noted in Dezeure et al. (2015). We expect the test to have good power in these settings.
"BO" is the bootstrap approach in Dezeure et al. (2017), which is also implemented in the hdi package. We set the number of bootstrap samples per test to 1000 and considered the robust version of the method. We used the shortcut, which avoids repeated tuning of the penalty. Still, the method was very slow, so that we used 10 3 instead of 10 4 repeated simulations of this method per setting. Also, we did not include the test in the simulations with very large q.

Gaussian, homoscedastic outcome
We first consider some settings with a moderately large number of nuisance coefficients, q = 60. We first simulated an anti-sparse setting with γ 2 = ... = γ 60 = 0.05. We took ρ ′ = 0.5. The estimated level and power of the tests described above, for different p-value cut-offs α, are shown in Table 2. The tests rejected H 0 if the p-value was smaller than the α. The level of a test should be at most α. Table 2 shows that the test by Zhang and Zhang (2014) was rather anti-conservative. Especially for small α, its level was many times larger than α. This is partly due to the antisparsity. Indeed, the test by Zhang and Zhang (2014) only has proven asymptotic properties under a sparsity assumption. The bootstrap approach of Dezeure et al. (2017) was much less liberal, but still seemed to be somewhat anti-conservative for small α. Of the other tests, Freedman-Lane HD 2 often had the most power. The Double Residualization method had relatively low power when α was small, e.g. 0.001. Table 2: Anti-sparse setting with ρ ′ = 0.5, n = 30, q = 60. Power is shown for β = 1.5. We also considered a setting with very high correlation ρ ′ = 0.9, see Table 3. We took γ 2 = γ 3 = 1 and γ 4 = .... = γ 60 = 0. The first 4 methods provided appropriate type I error control. For small cut-offs α, the method by Zhang and Zhang (2014) was relatively powerful, but also seemed to be somewhat anti-conservative. This method seems more suitable for settings where q is many times larger than n. Among our permutation methods, Freedman-Lane HD 2 had the best power, while incurring few type I errors. The method by Bühlmann et al. (2013) was relatively conservative. We also performed simulations with a very large number of nuisance variables (q = 1000). We first took γ 2 = γ 3 = 1, γ 4 = ... = γ 10 = 0.2, γ 11 = ... = γ 1000 = 0. See Table 4 for simulations with ρ ′ = 0.5 and Table 5 for simulations with ρ ′ = 0.9. All permutation methods provided appropriate type I error control. Double Residualization had relatively high power for large cut-offs α, but not for small cut-offs. The method by Bühlmann et al. (2013) had relatively good power for ρ ′ = 0.5 but low power for ρ ′ = 0.9.

Method
We also performed simulations where γ was very anti-sparse, e.g. with γ 2 = 1, γ 3 = ... = γ 800 = 0.002 and ρ ′ = 0.9. We also considered negative coefficients and we varied the orders of magnitude of the coefficients and the errors ǫ and the sample size. We also considered settings where there were multiple independent clusters of correlated covariates. Also in these settings, the type I error rate was controlled.

Violations of the Gaussian model
As discussed in for example Section 2.2, permutation tests can be robust to violations of the standard linear model, such as non-normality and heteroscedasticity. The power of parametric methods is often substantially decreased when the residuals have heavy tails. The power of the permutation tests is more robust to such deviations from normality. This is illustrated in Table 6. Here, the data distribution was the same as in the setting corresponding to Table 3, except that the errors ǫ were not standard normally distributed, but had very heavy (cubed exponential) tails, scaled such that the errors had standard deviation 1. Note in Table 6 that the permutation and bootstrap methods still had roughly the same power as at Table 3, while the power of the tests by Bühlmann et al. (2013) and Zhang and Zhang (2014) was strongly reduced compared to Table 3. As a second type of violation of the standard linear model, we considered heteroscedasticity. We simulated errors ǫ i which were normally distributed, but with standard deviation proportional to the absolute value covariate of interest, |X i |. We again took γ 2 = γ 3 = 1, γ 4 = ... = γ 60 = 0. We took ρ ′ = 0 for illustration, since in that case the test by  Bühlmann et al. (2013) had higher power than the permutation methods in this specific setting, but was anti-conservative for small α.
In the simulations underlying Table 7, we did not use sign-flipping, which is known to be robust to heteroscedasticity (see Section 2.2). Surprisingly, our tests nevertheless provided appropriate type I control. We also performed these simulations with sign-flipping instead of permutation (results not shown), which further reduced the level of our tests, but also somewhat reduced the power.
We conclude from the simulations of Section 5 that our tests are rather robust to several types of model misspecification. The method from Zhang and Zhang (2014) was often relatively powerful, but was quite anti-conservative in several scenarios. The bootstrap approach of Dezeure et al. (2017) was also anti-conservative in some scenarios, but much less so. The method from Bühlmann et al. (2013) tended to be relatively conservative.

Data analysis
We analyze a dataset about riboflavin (vitamin B2) production with B. subtilis. This dataset is called riboflavin and is publicly available . It contains normalized measurements of expression rates of 4088 genes from n = 71 samples. We use these as input variables. Further, for each sample the dataset contains the logarithm of the riboflavin production rate, which is our one-dimensional outcome of interest. We (further) standardized the expression levels by subtracting the means and dividing by the standard deviations. We also shifted the outcome values to have mean zero. For every 1 ≤ i ≤ 4088, we tested the hypothesis H i that the outcome was independent of the expression level of gene i, conditional on the other expression levels. We used the same tests as considered in the simulations. This time we used w = 2 · 10 5 permutations per test.
The results of the analysis are summarized in Table 8. The columns correspond to the same methods as considered in Section 5. For every method, the fraction of rejections is shown for different p-value cut-offs α. The fraction of rejections is the number of rejected hypotheses divided by 4088, the total number of hypotheses. The hypotheses that were rejected, were those with p-values smaller than or equal to the cut-off α.
With most methods we obtain many p-values smaller than 0.05. This is not the case for the test by Bühlmann et al. (2013), which is known to be relatively conservative. After Bonferroni's multiple testing correction, we reject no hypotheses with any method, suggesting there is no strong signal in the data. Van de Geer et al. (2014) also obtained such a result with this dataset.

Discussion
We have proposed novel permutation methods for testing in linear models, where the number of nuisance variables may be much larger than the sample size. Advantages of permutation approaches include robustness to certain violations of the standard linear model and compatibility with powerful permutation-based multiple testing methods. We have proposed two novel permutation approaches, Freedman-Lane HD and Double Residualization. Within these approaches some variations are possible, with respect to how the regularization parameters are chosen and which test statistics are used. Our methods provided excellent type I error rate control in a wide range of simulation settings. In particular we considered settings with anti-sparsity, high correlations among the covariates, clustered covariates, fat-tailedness of the outcome variable and heteroscedasticity. The simulation study was limited to settings with multivariate normal covariates. Future research may address more scenarios.
We compared our methods to the parametric tests in Bühlmann et al. (2013) and Zhang and Zhang (2014) and to the bootstrap approach in Dezeure et al. (2017). Our tests tended to have higher power than the method by Bühlmann et al. (2013). The test by Zhang and Zhang (2014) had relatively good power, but was rather anti-conservative in several scenarios, for example under anti-sparsity and heteroscedasticity. The bootstrap approach of Dezeure et al. (2017) was also anti-conservative in some scenarios, but less so. Our permutation tests tended to be less powerful than that method, but provided appropriate type I error control in all scenarios. Moreover, our permutation tests were computationally much faster.
sinceH λ −H 2 λ is positive definite. We then also have Note that since P jRλ Y is a random permutation ofR λ Y . Similarly we have and cov(Y * k , Y * j ) = var(H λ Y ).