Variable screening in multivariate linear regression with high-dimensional covariates

We propose two variable selection methods for multivariate linear regression with high-dimensional covariates. The first method uses a multiple correlation coefficient to rapidly reduce the dimension of the relevant predictors to a moderate or low level. The second method extends the univariate forward regression of Wang [(2009). Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516] in a unified way such that variable selection and model estimation can be achieved simultaneously. We establish the sure screening property for both methods. Simulations and a real data application are presented to demonstrate the finite sample performance of the proposed methods in comparison with naive methods.


Introduction
High-dimensional multivariate regression has been widely applied in bioinformatics, chemometrics, and medical image analysis, where many of the response variables are highly correlated (Cai et al., 2013; Ferte et al., 2013; Jia et al., 2017; Peng et al., 2010; Smith & Fahrmeir, 2007). For instance, in genetic studies we are interested in the association between correlated phenotypes (involved in biological pathways) and genotypes, as genetic effects and their possible interactions are recognized as important components of the genetic architecture of each complex phenotype (Yi, 2010). For this kind of problem, the number of covariates or explanatory variables is much larger than the number of observations or samples, and traditional subset selection and stepwise procedures become infeasible when confronted with high dimensionality (Breiman, 1995).
For variable selection under multivariate regression models, one simple approach is to apply a variable selection method to the univariate regression of each response separately. Such an approach may produce sub-optimal results since it does not utilize the joint information among the responses (Breiman & Friedman, 1997; Kim et al., 2009). To improve the estimation, various attempts have been made. One approach is to use dimension reduction techniques such as reduced rank regression (Chen & Huang, 2012; He et al., 2018; Zhao et al., 2017) and sliced inverse regression (Setodji & Cook, 2004; N. Zhang et al., 2019). Another approach is to use a block-structured regularization method to select a subset of covariates as predictors for all outcome variables (Obozinski et al., 2011; Peng et al., 2010; Turlach et al., 2005). The latter approach assumes that a covariate affects either all or none of the responses; this assumption may be too strong when each response variable is affected by a different set of predictors. Rothman et al. (2010) proposed a penalized framework to estimate the multivariate regression coefficient matrix and the error covariance matrix simultaneously under the $\ell_1$ penalty. Lee and Liu (2012) further improved Rothman et al.'s (2010) work by using a weighted $\ell_1$ regularization. Cai et al. (2013) proposed to first estimate the regression coefficients in a column-wise fashion with the Dantzig selector and then estimate the precision matrix by solving a constrained $\ell_1$ minimization problem.
In the high-dimensional setting, most of the aforementioned multivariate regression methods use regularization to estimate the regression coefficient matrix (Obozinski et al., 2011; Peng et al., 2010; Turlach et al., 2005). However, a well-chosen penalty requires an efficient exploration of the correlation structure of the responses, and simultaneously estimating the covariance matrix and selecting variables via joint optimization is reported to be numerically unstable in high-dimensional cases (Deshpande et al., 2019; Pecanka et al., 2019; Ren et al., 2019).
In this study, we propose two methods in parallel for variable screening and variable selection, namely multiple correlation coefficient (MCC) screening (Section 3) and unified forward regression (UFR) (Section 4). The first method is for dimension reduction: it filters out covariates that have weak correlation with the response variables and reduces the feature space to a moderate or low dimension that covers the set of relevant predictors almost surely. The second method is for variable selection: it uses an extended forward regression (FR) (H. Wang, 2009) to identify all relevant predictors consistently under mild conditions. MCC screens in all relevant predictors, whereas UFR achieves variable selection and model estimation simultaneously. We illustrate the finite sample performance of the proposed methods in comparison with naive methods by simulation (Section 5) and a real data application (Section 6). We conclude the paper in Section 7 and defer the technical proofs to the Appendix.

Notation and assumptions
Let $y = (y_1, y_2, \ldots, y_q)^\top$ denote the $q$-dimensional response vector of interest and $x = (x_1, x_2, \ldots, x_p)^\top$ the $p$-dimensional vector of covariates or predictors. Denote the covariance matrices of $y$ and $x$ by $\Sigma_y$ and $\Sigma_x = (\sigma_{ij})$, respectively. Without loss of generality, assume that $E(x_k) = 0$ and $\mathrm{var}(x_k) = 1$ for $k = 1, \ldots, p$ and that $E(y_j) = 0$ for $j = 1, \ldots, q$. In practice, these can be achieved by standardization and centering.
Consider the multivariate linear regression model
$$ y = B^\top x + \varepsilon, $$
where $B$ is a $p \times q$ matrix of coefficients and $\varepsilon$ is a $q$-dimensional random error vector independent of $x$. For $j = 1, \ldots, q$ and $k = 1, \ldots, p$, denote by $\beta_j$ the $j$th column of $B$ and by $\beta_{(k)}$ the $k$th row of $B$. If $\beta_{(k)} \neq 0$, then $x_k$ is referred to as a relevant predictor. Let $\mathcal{F} = \{1, \ldots, p\}$ denote the full model of predictors and $\mathcal{S} = \{k : \beta_{(k)} \neq 0\}$ the true model, with complement $\mathcal{S}^c$. Denote the cardinalities of $\mathcal{F}$ and $\mathcal{S}$ by $|\mathcal{F}| = p$ and $|\mathcal{S}| = p_0$, respectively. Throughout, let $\|\cdot\|$ denote the Euclidean norm of a vector.
Assume that $x$ is high dimensional, with $p$ much larger than the sample size $n$ (in the sense of Cai & Lv, 2007). Assume that the response vector is associated with only a small portion of the predictors, i.e., $p_0/p$ is small and $p_0 = O(n)$ (Fan & Lv, 2008). This sparsity principle is frequently adopted and deemed useful in analysis.

Multiple correlation coefficient
We first propose to use a multiple correlation coefficient (MCC) to identify $\mathcal{S}$. The multiple correlation coefficient between $y$ and $x_k$ is defined as $\rho_k = \max_{\alpha \in \mathbb{R}^q} \mathrm{corr}(\alpha^\top y, x_k)$, and its square can be further expressed as
$$ \rho_k^2 = \gamma_k^\top \Sigma_y \gamma_k, $$
where $\gamma_k = \Sigma_y^{-1} E(y x_k)$ (Anderson, 2003, Section 12.2). Given the standardized samples, we estimate $\rho_k^2$ by the plug-in estimator $\hat\rho_k^2 = \hat\gamma_k^\top \hat\Sigma_y \hat\gamma_k$, where $\hat\gamma_k = \hat\Sigma_y^{-1} n^{-1} \sum_{i=1}^n y_i x_{ik}$ and $\hat\Sigma_y$ is the sample covariance matrix of $y$. Note that the computation of $\hat\rho_k^2$ is simple and fast through matrix algebra and does not involve any iteration. We then estimate $\mathcal{S}$ by $\hat{\mathcal{S}}_{\mathrm{MCC}} = \{k : \hat\rho_k^2 \geq \tau\}$, where $\tau$ is a threshold which determines the size of the estimated model. Here we adopt the threshold of Fan and Lv (2008), $\tau = \hat\rho_{(p - d_n + 1)}^2$, where $\hat\rho_{(1)}^2 \leq \cdots \leq \hat\rho_{(p)}^2$ are the order statistics of $\{\hat\rho_k^2\}$ and $d_n = \lceil n/\log(n) \rceil$ ($\lceil \cdot \rceil$ is the ceiling function), so that the $d_n$ predictors with the largest values of $\hat\rho_k^2$ are retained. The naive correlation coefficient (NCC) method of Fan and Lv (2008) estimates $\mathcal{S}$ by $\hat{\mathcal{S}}_{\mathrm{NCC}} = \cup_{j=1}^q \{k : \hat\rho_{k,j}^2 \geq \tau_j\}$, where $\hat\rho_{k,j}$ is the sample correlation coefficient between $y_j$ and $x_k$ and $\tau_j$ is determined in the same way as in MCC with respect to the $j$th response.
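To make the computation concrete, the following is a minimal sketch of MCC screening in Python/NumPy under the plug-in estimators above; the function name mcc_screen and the data layout are illustrative, not taken from the paper.

```python
import numpy as np

def mcc_screen(X, Y):
    """Rank predictors by the estimated squared multiple correlation.

    X : (n, p) matrix of standardized covariates (mean 0, variance 1).
    Y : (n, q) matrix of centred responses.
    Returns the indices of the d_n = ceil(n / log n) top-ranked predictors
    and all estimated rho_k^2 values.
    """
    n, p = X.shape
    Sigma_y = Y.T @ Y / n              # sample covariance matrix of y (q x q)
    C = Y.T @ X / n                    # (q, p); column k estimates E(y x_k)
    # rho_k^2 = E(y x_k)' Sigma_y^{-1} E(y x_k), since var(x_k) = 1
    rho2 = np.einsum('jk,jk->k', C, np.linalg.solve(Sigma_y, C))
    d_n = int(np.ceil(n / np.log(n)))
    keep = np.argsort(rho2)[::-1][:d_n]  # hard threshold of Fan and Lv (2008)
    return keep, rho2
```

As the sketch shows, one pass of matrix algebra yields all $p$ statistics at once, which is what makes the screening step fast.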
We now show that the MCC-based screening procedure has the sure screening property (i.e., the probability of selecting all true relevant predictors tends to one) and reduces the dimensionality of predictors below the sample size.
We state some assumptions first.
Assumption 3.1: Let $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote the smallest and largest eigenvalues of a positive definite matrix $A$, respectively. Assume that there exist two positive constants $\tau_{\min} < \tau_{\max}$ such that $\tau_{\min} \leq \lambda_{\min}(\Sigma_x) \leq \lambda_{\max}(\Sigma_x) \leq \tau_{\max}$.

Assumption 3.2: Assume that (i) for $j = 1, \ldots, q$, $\|\beta_j\| \leq C_B$ for some positive constant $C_B$, and that (ii) $\beta_{\min} = \min_{k \in \mathcal{S}} \min_j |\beta_{kj}| \geq \nu_B n^{-\xi_{\min}}$ for some positive constants $\xi_{\min}$ and $\nu_B$.
Assumption 3.1 requires the covariance matrix $\Sigma_x$ to be well behaved. Assumption 3.2 requires that the smallest nonzero regression coefficient not converge to zero too fast; otherwise, it cannot be consistently identified (see Fan & Peng, 2004 for more discussion). Assumption 3.3 ensures an exponential convergence rate for arbitrary-order moments of $x$ and $\varepsilon$ (Cai et al., 2011), which is superior to the polynomial-type counterpart (Ravikumar et al., 2010). Theorem 3.1 reveals that, for a properly chosen threshold $\tau$, the probability that MCC detects all relevant predictors tends to one.

Unified forward regression
In this section, we propose a unified forward regression (UFR) for variable selection. It extends Wang's (2009) forward regression method to the multivariate response case.
Let $\mathcal{M} = \{k_1, \ldots, k_t\}$ denote a generic subset of $\mathcal{F}$ with $|\mathcal{M}| = t$. Denote $x_{(\mathcal{M})} = (x_{k_1}, \ldots, x_{k_t})^\top$ and denote $\mathbf{X}_{(\mathcal{M})} = (x_{1(\mathcal{M})}, \ldots, x_{n(\mathcal{M})})^\top$ as the submatrix of $\mathbf{X}$ corresponding to $\mathcal{M}$. We first describe a naive forward regression (NFR) method that combines the variables selected by repeatedly applying Wang's (2009) forward regression to the univariate regression of each response. The procedure is summarized as follows. Initially, set $\mathcal{S}_{(j)}^{(0)} = \emptyset$ for $j = 1, \ldots, q$. Perform forward regression with respect to the $j$th response by iterating the following two steps for $\ell = 1, \ldots, n$:

(1) Evaluation. For every $k \in \mathcal{F} \setminus \mathcal{S}_{(j)}^{(\ell-1)}$, regress the $j$th response on the covariates indexed by $\mathcal{S}_{(j)}^{(\ell-1)} \cup \{k\}$ and compute the residual sum of squares $\mathrm{RSS}_{(j)}^{(\ell-1)}(k)$.

(2) Screening. Let $a_\ell = \arg\min_{k \in \mathcal{F} \setminus \mathcal{S}_{(j)}^{(\ell-1)}} \mathrm{RSS}_{(j)}^{(\ell-1)}(k)$ and update $\mathcal{S}_{(j)}^{(\ell)} = \mathcal{S}_{(j)}^{(\ell-1)} \cup \{a_\ell\}$.

The solution path of NFR is obtained by combining the per-response paths, $\{\cup_{j=1}^q \mathcal{S}_{(j)}^{(\ell)} : 1 \leq \ell \leq n\}$.
Next, we propose the unified forward regression to select predictors by applying a modified forward regression algorithm that makes use of all response variables simultaneously. The procedure modifies the previous one as follows. Initially, set $\mathcal{S}^{(0)} = \emptyset$. Perform the modified forward regression by iterating the following two steps for $\ell = 1, \ldots, n$:

(1) Evaluation. For every $k \in \mathcal{F} \setminus \mathcal{S}^{(\ell-1)}$, compute the residual sum of squares accumulated over all responses, $\mathrm{RSS}^{(\ell-1)}(k) = \mathrm{tr}\{\mathbf{Y}^\top (I_n - H_{(\mathcal{M}_k)}) \mathbf{Y}\}$, where $\mathcal{M}_k = \mathcal{S}^{(\ell-1)} \cup \{k\}$ and $H_{(\mathcal{M})} = \mathbf{X}_{(\mathcal{M})} (\mathbf{X}_{(\mathcal{M})}^\top \mathbf{X}_{(\mathcal{M})})^{-1} \mathbf{X}_{(\mathcal{M})}^\top$ is the projection matrix onto the columns of $\mathbf{X}_{(\mathcal{M})}$.

(2) Screening. Let $a_\ell = \arg\min_{k \in \mathcal{F} \setminus \mathcal{S}^{(\ell-1)}} \mathrm{RSS}^{(\ell-1)}(k)$ and update $\mathcal{S}^{(\ell)} = \mathcal{S}^{(\ell-1)} \cup \{a_\ell\}$.

The solution path of UFR is obtained by $\{\mathcal{S}^{(\ell)} : 1 \leq \ell \leq n\}$.
Notice that both NFR and UFR terminate automatically after $n$ iterations. The UFR algorithm makes use of all response variables simultaneously through the trace operator, and it incurs nearly one-$q$th of the computational cost of NFR.
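A minimal sketch of the UFR iteration follows, using the trace-based residual criterion described above; it refits each candidate model by least squares for clarity rather than speed, and the names trace_rss and ufr_path are illustrative, not the paper's implementation.

```python
import numpy as np

def trace_rss(X_M, Y):
    """tr{Y'(I - H_M)Y}: residual sum of squares summed over all q responses."""
    coef, *_ = np.linalg.lstsq(X_M, Y, rcond=None)
    resid = Y - X_M @ coef
    return np.trace(resid.T @ resid)

def ufr_path(X, Y, n_steps=None):
    """Greedy forward selection using all responses at once via the trace RSS."""
    n, p = X.shape
    n_steps = min(n, p) if n_steps is None else n_steps
    S, path = [], []
    for _ in range(n_steps):
        candidates = [k for k in range(p) if k not in S]
        # Screening step: add the predictor that minimizes the trace RSS.
        a = min(candidates, key=lambda k: trace_rss(X[:, S + [k]], Y))
        S.append(a)
        path.append(list(S))
    return path
```

The one-$q$th cost saving arises because a single pass over the candidates serves all $q$ responses, whereas NFR repeats the pass once per response.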
We show that the proposed UFR method also possesses the sure screening property. A few more assumptions are added to facilitate the development of the theory.
Assumption 4.1: Assume that (i) $x$ follows an elliptically contoured distribution whose density admits the form $f(x) \propto g\{(x - \mu)^\top \Sigma_x^{-1} (x - \mu)\}$ with $\mu = E(x)$ and $g(\cdot) > 0$, denoted by $EC(\mu, \Sigma_x, g)$, and that (ii) the distribution of $\varepsilon$ is normal.
Usually, the normality assumption on $x$ is imposed to facilitate theory development (Fan & Lv, 2008; H. Wang, 2009). In Assumption 4.1, we relax it to the elliptically contoured distribution and show in the Appendix that it suffices to obtain Lemma 1 of H. Wang (2009). Assumption 4.1, together with Assumption 3.1, ensures the sparse Riesz condition (C. Zhang & Huang, 2008) used to derive some key inequalities in the proof of Theorem 4.1. Assumption 4.2 is commonly imposed in the literature on ultra-high dimensional inference (Fan & Lv, 2008; H. Wang, 2009); it allows the dimension of the covariates to diverge to infinity at an exponential rate (Fan & Lv, 2008). Assumption 4.3 implies that all responses are associated with the same covariates (Turlach et al., 2005). It warrants the row-wise selection of $B$ by UFR, in contrast to the element-wise selection by NFR, which enables UFR to reach the sure screening property in fewer steps than NFR.
It can be shown that $P\{\mathcal{S} \subset \hat{\mathcal{S}}_{\mathrm{NFR}}\} \to 1$ after $qK\nu n^{2\xi_0 + 4\xi_{\min}}$ steps, i.e., the NFR selects all relevant predictors with high probability in the multivariate regression setting, while the following theorem shows that the UFR can do the job in much fewer (one-$q$th as many) steps. To select the best subset of variables from a solution path, we adopt the BIC criterion (Liang et al., 2012; H. Wang, 2009), extended here with the trace residual sum of squares:
$$ \mathrm{BIC}(\mathcal{M}) = \log\{\mathrm{RSS}(\mathcal{M})/n\} + n^{-1}|\mathcal{M}|\{\log(n) + 2\log(p)\}, \tag{4} $$
where $\mathrm{RSS}(\mathcal{M}) = \mathrm{tr}\{\mathbf{Y}^\top(I_n - H_{(\mathcal{M})})\mathbf{Y}\}$. We then choose the subset of variables from the solution path which minimizes $\mathrm{BIC}(\mathcal{M})$; the selection consistency was shown by H. Wang (2009) and Sofer et al. (2014).

Note that the UFR method is consistent if $p = O(n^\alpha)$ for some $\alpha > 0$, while the MCC method works with $\log(p) = O(n^\alpha)$. In this sense, the MCC method can handle higher-dimensional variable screening than the UFR method. Moreover, the UFR method is computationally more expensive than the MCC method, as the former involves $n - 1$ more matrix inversion operations than the latter. On the other hand, when $p$ and $n$ are of the same order, both MCC and UFR perform well in terms of coverage probability, and UFR performs better in yielding a parsimonious model with high specificity in terms of model size and correct fit (defined later), as seen in the simulation.
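As an illustration, the sketch below scores each model along a solution path with a BIC of the form displayed in (4) and returns the minimizer; the trace-RSS form follows the reconstruction above and is an assumed multivariate analogue of H. Wang's (2009) criterion.

```python
import numpy as np

def bic_trace(X, Y, M):
    """BIC(M) = log{RSS(M)/n} + |M| (log n + 2 log p) / n with the trace RSS."""
    n, p = X.shape
    X_M = X[:, list(M)]
    coef, *_ = np.linalg.lstsq(X_M, Y, rcond=None)
    resid = Y - X_M @ coef
    rss = np.trace(resid.T @ resid)
    return np.log(rss / n) + len(M) * (np.log(n) + 2 * np.log(p)) / n

def select_by_bic(X, Y, path):
    """Return the subset along the solution path that minimizes BIC."""
    return min(path, key=lambda M: bic_trace(X, Y, M))
```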

Simulation
We conduct numerical studies to investigate the finite sample performance of the proposed methods, i.e., MCC and UFR, in comparison with the naive correlation coefficient (NCC) method and the naive forward regression (NFR).

Models
Consider five models for generating the $p$-dimensional covariates $x$, listed in Table 1; they are adopted from Examples 1 and 2 of Fan and Lv (2008), Example 1 of Tibshirani (1996), and Examples 4 and 5 of H. Wang (2009), respectively.
For Models 1 to 3, $x$ follows a multivariate normal distribution with zero mean vector and covariance matrix $\Sigma_x$ of identity, autoregressive, and compound symmetry structure, respectively. In Model 4, $x$ is generated as a linear combination of independent standard normal variables $z_r$ and $w_r$ (H. Wang, 2009). Note that Model 4 is a challenging case, as the correlation between the relevant predictors and the response variables is much smaller than that between the irrelevant predictors and the response variables. The details of $x$ are provided in Table 1. In Model 5, we generate independent components of both $x$ and $\varepsilon$ as $e - 1$, where $e$ is an exponential random variable with parameter one. This model is used to examine the robustness of the proposed methods against departure from the normality assumption. Consider the number of predictors $p$ to be 1000, 5000, and 10,000, respectively, all much larger than the sample sizes considered in the five models. Recall that $p_0$ is the number of relevant predictors. Denote the first $p_0$ rows of $B$ by $B_0$. We generate independent entries of $B_0$ from the distributions given in the last column of Table 1, where $N(4\log(n)/\sqrt{n}, 1)$ denotes a normal random variable with mean $4\log(n)/\sqrt{n}$ and variance 1, $\Gamma(2, 1)$ denotes a gamma random variable with shape parameter 2 and scale parameter 1, and $\exp(9)$ denotes an exponential random variable with parameter 9. They are all independent of $x$. Set the remaining entries (the last $p - p_0$ rows) of $B$ to zero.
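For concreteness, a sketch of generating the covariates for Models 1 to 3 follows; the correlation parameter rho = 0.5 is an assumed value for illustration, as the actual values are given in Table 1 (not reproduced here).

```python
import numpy as np

def make_sigma(p, structure, rho=0.5):
    """Covariance of x: identity, AR(1) with sigma_ij = rho^|i-j|, or
    compound symmetry with off-diagonal correlation rho."""
    if structure == 'identity':
        return np.eye(p)
    if structure == 'ar1':
        idx = np.arange(p)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    if structure == 'cs':
        return (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    raise ValueError(f'unknown structure: {structure}')

rng = np.random.default_rng(0)
n, p = 75, 1000
L = np.linalg.cholesky(make_sigma(p, 'ar1'))
X = rng.standard_normal((n, p)) @ L.T   # rows are draws from N(0, Sigma_x)
```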
For the multivariate response case, the signal-to-noise ratio $R^2$ is defined analogously to the univariate case. We chose the values of $\sigma^2$ such that the signal-to-noise ratios are 30%, 60%, and 90%, respectively. Throughout, the number of replications $N$ is set to 1000.
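A sketch of calibrating $\sigma^2$ to a target signal-to-noise ratio is given below, under the assumed definition $R^2 = \mathrm{tr}(B^\top \Sigma_x B)/\mathrm{tr}(\Sigma_y)$ with $\Sigma_y = B^\top \Sigma_x B + \sigma^2 I_q$; this particular definition is our assumption and may differ from the paper's.

```python
import numpy as np

def sigma2_for_r2(B, Sigma_x, r2):
    """Solve r2 = signal / (signal + q * sigma^2) for sigma^2, where
    signal = tr(B' Sigma_x B) and q is the number of responses."""
    q = B.shape[1]
    signal = np.trace(B.T @ Sigma_x @ B)
    return signal * (1 - r2) / (q * r2)
```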

Evaluation criteria
For MCC screening, we use Fan and Lv's (2008) hard-threshold method to retain the relevant predictors. For both NFR and UFR, we use the BIC criterion (4) to determine the relevant predictors. We adopt the five measures described in Table 2 to evaluate the finite sample performance of the proposed methods: the model size (MS) is the number of selected predictors; the coverage probability (CP) measures how likely all the relevant predictors are to be identified; the percentage of correctly fitted models (CF) measures the capability of identifying the true model exactly; the correct zero (CZ) rate characterizes the capability of producing sparse solutions; and the incorrect zero (IZ) rate characterizes under-fitting. Ideally, a method should have MS close to $p_0$, CP, CF, and CZ all close to 100%, and IZ close to zero.
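The five measures for a single replication can be computed as in the sketch below (names are illustrative); CP and CF are averaged over the $N$ replications to obtain the reported percentages.

```python
def evaluate(selected, true_set, p):
    """Return (MS, CP, CF, CZ, IZ) for one replication.

    selected : indices chosen by a method.
    true_set : indices of the p0 relevant predictors.
    p        : total number of predictors.
    """
    sel, true = set(selected), set(true_set)
    ms = len(sel)                                 # model size
    cp = true <= sel                              # every relevant predictor selected?
    cf = sel == true                              # exactly the true model?
    cz = (p - len(sel | true)) / (p - len(true))  # irrelevant predictors set to zero
    iz = len(true - sel) / len(true)              # relevant predictors missed
    return ms, cp, cf, cz, iz
```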

Results
Tables 3-7 report the finite sample performance of the competing methods in terms of the measures given in Table 2 under various numbers of covariates $p$ and signal strengths $R^2$.

Table 4. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 2 in Table 1 with $(n, q, p_0) = (75, 5, 3)$.

We summarize the findings as follows. (i) The MCC method is uniformly superior to the NCC method, with larger coverage probability (CP) and better estimation of sparsity (larger CZ and smaller IZ), as expected. (ii) As we adopted the fixed threshold procedure for MCC and NCC, these two methods produce conservatively large coverage of predictors at the cost of large model size. For the same reason, their percentage of incorrect zeros is larger than that of the two regression-based methods (UFR and NFR), so the resulting percentages of correctly fitted models for MCC and NCC are zero. (iii) When comparing UFR with NFR, UFR demonstrates its superiority uniformly in all five measures across all five models (including Model 5 with the non-normal distribution). This corroborates the advantage of UFR in utilizing the correlation among responses. When comparing UFR with PWL, the two methods perform comparably when the signal strength is as small as 30%; when the signal strength is as large as 60% or 90%, UFR generally outperforms PWL in all five measures.
(iv) The UFR method performs worse than the MCC method when the signal strength is low, consistent with the findings of H. Wang (2009) and Y. Li et al. (2017). However, as the signal strength increases (e.g., from 30% to 90%), the coverage probability (CP) and the percentage of correctly fitted models (CF) increase significantly (e.g., from 61.9% to 98.3% and from 28.8% to 58.8%, respectively, with $p = 5000$) and the percentage of incorrect zeros (IZ) drops quickly (e.g., from 53.7% to 2.35% with $p = 5000$) for both NFR and UFR, as seen in Table 3. (v) To examine the impact of the sample size, Table 8 reports the performance of the proposed methods under Model 1 with the number of covariates fixed at $p = 5000$ and the sample size $n$ varying over 100, 200, and 400. It is seen that the measures of model size (MS), coverage probability (CP), percentage of correct fit (CF), and percentage of incorrect zeros (IZ) are sensitive to the sample size, and the improvement in performance with larger samples is significant. For instance, when the sample size increases from $n = 100$ to $n = 200$ with signal strength $R^2 = 60\%$, the CP increases from 52.2% to 80.4% on average and the percentage of incorrect zeros drops from 47.8% to 19.7% on average.

Table 7. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various numbers of covariates ($p$) and signal-to-noise ratios ($R^2$) for Model 5 in Table 1 with $(n, q, p_0) = (200, 6, 10)$.

Table 8. Five measures of the performance of variable selection defined in Table 2 obtained by the four competing methods under various sample sizes ($n$) and signal-to-noise ratios ($R^2$) for Model 1 in Table 1 with $(p, q, p_0) = (5000, 4, 8)$.
In conclusion, the MCC method performs better when the dimension of the covariates is ultra-high ($\log(p) = O(n^\alpha)$) relative to the sample size, whereas the UFR method outperforms the MCC method when the dimension of the covariates is of polynomial order ($p = O(n^\alpha)$).

Real data application
We apply the proposed methods to a real data set regarding bone mineral density (BMD) (Reppe et al., 2010). The data were collected from 84 postmenopausal Caucasian women aged 50 to 86. For each subject, there are two responses, namely the body mass index and the total hip z-score (a measure of the strength of the bone in the hip), and 8649 gene expression levels in trans-iliacal bone biopsies serve as covariates. Low bone mineral density is usually related to fragile bones, osteoporosis, and progressive reduction of bone strength, which leads to increasing susceptibility to bone fractures (Cooper, 1997; Reppe et al., 2010). The goal of the study is to identify the genes that are related to BMD. Table 9 reports the genes identified by the five competing methods. The MCC method identified 19 genes, which include all 13 genes identified by NFR except gene TNK2. The PWL method identified 12 genes, all identified by NFR except PAIP1. The UFR method found 10 significant genes, all contained in the set identified by NFR.

Conclusion
We propose two methods for variable screening in high-dimensional multivariate linear regression. The MCC method has the advantage of computational ease and provides fast variable screening that yields an accurate subset with dimension below the sample size. The proposed UFR method discovers all relevant predictors consistently at nearly the same computational cost as the univariate forward regression, although its performance is sensitive to the dimensionality and the signal strength. While our theory assumes a Gaussian distribution for the response variables, the numerical study shows the robustness of the proposed methods against non-normality. It is of interest to investigate the problem under more general, non-homogeneous sparsity assumptions and under nonlinear models.

Disclosure statement
No potential conflict of interest was reported by the author(s).

A.2 A lemma used to prove Theorem 4.1
The proof is similar to that for Lemma 1 of H. Wang (2009). Here, we relax the normality assumption of x to the elliptically contoured distribution.
In what follows, we study the two terms in (A13) separately.