A sparse approach for high-dimensional data with heavy-tailed noise

Abstract High-dimensional data have commonly emerged in diverse fields, such as economics, finance, genetics, medicine, and machine learning. In this paper, we consider the sparse quantile regression problem for high-dimensional data with heavy-tailed noise, especially when the number of regressors is much larger than the sample size. We bring the spirit of Lp-norm support vector regression into quantile regression and propose a robust Lp-norm support vector quantile regression for high-dimensional data with heavy-tailed noise. The proposed method achieves robustness against heavy-tailed noise through its use of the pinball loss function. Furthermore, Lp-norm support vector quantile regression ensures that the most representative variables are selected automatically by means of a sparsity parameter. We use a simulation study, in which the number of explanatory variables greatly exceeds the sample size, to test the variable selection performance of Lp-norm support vector quantile regression. The simulation study confirms that Lp-norm support vector quantile regression is not only robust against heavy-tailed noise but also selects representative variables. We further apply the proposed method to the variable selection problem of index construction, which again confirms the robustness and sparseness of Lp-norm support vector quantile regression.


Introduction
The development of regression technology faces several challenges from modern data. The first challenge comes from the dimensionality of the data. High-dimensional data, where the number of explanatory variables (K) greatly exceeds the sample size (N), arise in many different fields. For example, large panels of home-price data are high-dimensional (Huang et al., 2020); to capture cross-sectional effects, the house price in one city depends on the prices in several other cities, most likely its geographic neighbors (Fan et al., 2011). Another example of high-dimensional data comes from finance (Wang et al., 2020; Zhou et al., 2020): portfolio allocation with a few thousand stocks involves over one million explanatory variables (Fan & Lv, 2010). High-dimensional data have also emerged in other fields, such as genetics (Algamal & Lee, 2019), medicine (Dondelinger et al., 2020), and machine learning (Ye et al., 2017a, 2017b). In gene expression studies, for instance, one can collect far fewer observations than the total number of genes assayed (Clarke et al., 2008). A high-dimensional data set needs a sparse technique that can select the representative variables and discard the redundant ones. The second challenge comes from heavy-tailed noise, which is common in practice (Fan et al., 2017; Hsu & Sabato, 2016; Zhou et al., 2018). Modern data with heavy-tailed noise require robust techniques. The main purpose of this paper is to propose a new sparse regression method for modern high-dimensional data with heavy-tailed noise.
Although ordinary least squares (OLS) regression (Dempster et al., 1977; Greene, 1981; Hutcheson, 2011; Rzhetsky & Nei, 1992) is one of the most commonly used methods for estimating conditional mean functions, because its estimators have the smallest variance among the class of linear unbiased estimators (Berk & Hwang, 1989), OLS estimation exhibits some drawbacks when used for modern high-dimensional data with heavy-tailed noise. High-dimensional data require a sparse technique for selecting the representative variables and discarding the redundant ones; OLS estimation may suffer from the presence of redundant variables since it uses all variables without discrimination. Heavy-tailed noise requires a robust regression technique; the sum of squared residuals makes OLS regression sensitive to noise in heavy-tailed situations (Koenker & Bassett, 1978).
In the field of statistics, the least absolute shrinkage and selection operator (Lasso) (Bertsimas et al., 2016; Kim et al., 2019; Liang & Jacobucci, 2020; Tibshirani, 1996; Wang et al., 2007; Zou, 2006) is a very popular sparse method for high-dimensional data since it shrinks some coefficients of the regression estimators toward 0. From the regression estimators, the contribution of each explanatory variable to the final decision function can be judged, so that the representative explanatory variables are selected and the redundant variables are discarded. However, Lasso lacks robustness against heavy-tailed noise. Koenker and Bassett (1978) proposed quantile regression (QR), which effectively deals with regression problems involving heavy-tailed noise. QR is robust against noise in heavy-tailed situations since the quantile estimators, as a class of empirical 'location' measures for the dependent variable, are based on the pinball loss rather than the least squares loss (Newey & Powell, 1987). Although QR is robust against heavy-tailed noise (He et al., 2020), it does not address the variable selection problem of high-dimensional data. Li and Zhu (2008) brought the spirit of the Lasso approach into quantile regression and proposed L1-norm regularized quantile regression. Thereafter, L1-penalized quantile regression (Belloni & Chernozhukov, 2011; Peng & Wang, 2015; Yu et al., 2017) and generalized L1-penalized quantile regression were proposed. In addition, Li et al. (2010) studied regularization in quantile regression from a Bayesian perspective. The L1-norm regularization term in these quantile regression methods has variable selection ability since some coefficients of the estimator are driven toward zero. Therefore, these L1-norm regularized quantile regression methods conduct estimation and variable selection simultaneously.
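To make the robustness argument concrete, the pinball loss underlying quantile regression can be written in a couple of lines of Python (a minimal illustration, not the authors' implementation):

```python
import numpy as np

def pinball_loss(residuals, tau):
    """Pinball (check) loss at quantile level tau in (0, 1): positive
    residuals are weighted by tau, negative ones by (1 - tau)."""
    r = np.asarray(residuals, dtype=float)
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

# Unlike the squared loss, the penalty grows only linearly in |r|,
# which is what makes quantile estimators robust to heavy tails.
losses = pinball_loss([2.0, -2.0], tau=0.9)  # array([1.8, 0.2])
```

An outlier with residual 100 contributes 100 times the loss of a residual of 1, not 10,000 times as under squared loss, so extreme noise cannot dominate the fit.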
In the field of machine learning, support vector regression (SVR) (Drucker et al., 1996; Smola & Schölkopf, 2004), built in the framework of statistical learning theory (Vapnik-Chervonenkis theory), is an effective method for addressing the K ≫ N regression problem. Sparse SVR, such as L1-norm support vector regression (L1-SVR) (Peng & Xu, 2013; Ye et al., 2017a, 2017b) and Lp-norm support vector regression (Lp-SVR) (0 < p < 1) (Ye et al., 2015, 2017a, 2017b; Zhang et al., 2013), has been proven to be an effective variable selection tool for high-dimensional data. Lp-SVR is much sparser than L1-SVR since the Lp-norm regularization term shrinks some coefficients of the estimator toward 0, and some coefficients are shrunk to exactly 0, so that redundant variables are discarded while representative variables remain. Moreover, Lp-norm support vector regression can realize estimation and variable selection simultaneously. Takeuchi et al. (2006) introduced the spirit of QR into support vector regression and proposed nonparametric quantile regression, which minimizes the pinball loss plus a regularization term. Thereafter, support vector censored quantile regression (Shim & Hwang, 2009), semiparametric support vector quantile regression (Shim et al., 2012), and support vector quantile regression (Anand et al., 2020) were proposed. By using the quantile parameter, support vector quantile regression is robust against heavy-tailed noise. In addition, the L2-norm regularization term in support vector quantile regression effectively solves the K ≫ N estimation problem. However, support vector quantile regression may suffer from the presence of redundant variables in high-dimensional data since the estimator under the L2-norm regularization term lacks sparseness.
Thus, support vector quantile regression methods retain the advantages of quantile regression and support vector regression, but they may use all variables without discrimination when estimating models for high-dimensional data.
To solve the sparse regression problem of high-dimensional data with heavy-tailed noise, we propose Lp-norm support vector quantile regression (Lp-SVQR). Because we use the quantile parameter in the pinball loss function, Lp-SVQR is robust to heavy-tailed noise. The Lp-norm regularization term in Lp-SVQR solves the K ≫ N estimation problem. Moreover, by using the sparsity parameter p, Lp-SVQR automatically conducts variable selection and improves the regression results at the same time. We adopt a convergent successive linear algorithm (SLA) to obtain an approximate local solution of Lp-SVQR. Compared with L1-norm regularized quantile regression (L1-QR) (Li & Zhu, 2008) and ε-support vector quantile regression (ε-SVQR) (Anand et al., 2020), the simulation results show that Lp-SVQR selects fewer variables yet attains smaller estimation errors, which means that Lp-SVQR not only selects fewer, more representative variables but also regresses effectively. To further test the sparseness of Lp-SVQR, we discuss the variable selection problem in index construction; a real-world variable selection analysis of an innovation and entrepreneurship index also demonstrates the sparseness of Lp-SVQR. The contributions of this paper are summarized as follows:
1. During the high-dimensional regression process of Lp-SVQR, useful variables are retained and irrelevant variables are discarded.
2. By using the quantile parameter in the pinball loss function, Lp-SVQR is robust against heavy-tailed noise.
3. Simulation results indicate that Lp-SVQR outperforms the two competing methods in both sparseness and robustness.
The remainder of this paper is organized as follows. In Section 2, we propose Lp-SVQR and present the properties of the solution path. The simulation study is presented in Section 3. Section 4 provides a real-world variable selection analysis. Section 5 concludes the paper.

A high-dimensional sparse regression model
In this section, we introduce the spirit of Lp-norm support vector regression into quantile regression, propose Lp-norm support vector quantile regression (Lp-SVQR), and then derive an efficient algorithm that computes the exact solution path for the parameter β. Consider the following linear regression model:

y = β0 + βᵀx + e,  (1)

where β0 and β = (β1, ..., βK)ᵀ are the unknown parameters that need to be estimated and e is the random error.

Lp-norm support vector quantile regression
The estimator of the linear regression model (1) is defined as the solution to the proposed Lp-SVQR optimization problem:

min over (β0, β, ξ, ξ*):  ‖β‖_p^p + λ(τ eᵀξ + (1 − τ) eᵀξ*)
s.t.  y_i − β0 − βᵀx_i ≤ ξ_i,  β0 + βᵀx_i − y_i ≤ ξ_i*,  ξ_i ≥ 0,  ξ_i* ≥ 0,  i = 1, ..., N,  (2)

where τ (0 < τ < 1) is the quantile level, λ (λ ≥ 0) is a tuning parameter balancing the quantile loss against the penalty, ξ and ξ* are slack variables, and e is a vector of ones with appropriate dimension. We penalize the model's complexity by using the Lp-norm regularization term ‖β‖_p^p. Generally, ‖β‖_p^p yields a much sparser estimator than the L1-norm regularization term. Furthermore, Lp-SVQR has an adaptive property since the optimal value of p is chosen automatically from the data. Therefore, through the parameters λ and p, Lp-SVQR achieves a sparser estimator that selects the representative variables and discards the redundant ones.
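As an unconstrained sketch of problem (2): at the optimum, the slack variables equal the positive and negative parts of the residuals, so the objective reduces to the Lp penalty plus the λ-weighted pinball loss. The function and variable names below are illustrative:

```python
import numpy as np

def lp_svqr_objective(beta0, beta, X, y, tau, lam, p):
    """Value of the Lp-SVQR objective: ||beta||_p^p plus lam times the
    pinball loss, the unconstrained equivalent of the slack-variable
    formulation (an illustrative sketch, not the authors' code)."""
    r = y - beta0 - X @ beta                              # residuals
    pinball = np.where(r >= 0, tau * r, (tau - 1.0) * r).sum()
    penalty = np.sum(np.abs(beta) ** p)                   # 0 < p < 1
    return penalty + lam * pinball
```

Any candidate (β0, β) can be scored this way, e.g. during cross-validation, without forming the slack variables explicitly.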
Regression problem (2) is not differentiable because of the Lp-norm regularization term. To make it smooth, we introduce the upper-bound variables t = (t1, ..., tK)ᵀ with |βj| ≤ tj, j = 1, ..., K, and replace ‖β‖_p^p by Σj tj^p, so that problem (2) can be rewritten as problem (3). It should be noted that problem (3) is differentiable and can be solved using a successive linear algorithm (SLA) (Mangasarian, 2007). The SLA starts with a random initial point (β̂0^0, β̂^0, t^0, ξ^0, ξ*^0), and the k-th iteration, k = 1, 2, ..., obtains (β̂0^k, β̂^k, t^k, ξ^k, ξ*^k) by solving problem (4), the linear program obtained by replacing each concave term tj^p in (3) with its first-order Taylor expansion around tj^{k−1}. The proof of convergence of the SLA is omitted here since it can be easily obtained by adapting the proof in Ye et al. (2015).
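The key step of the SLA is the linearization of the concave penalty: each tj^p is replaced by its tangent at the previous iterate, which turns every subproblem into a weighted-L1 linear program. The per-coefficient weights are simply the gradient of t^p (a sketch; the eps guard against division by zero is our addition):

```python
import numpy as np

def sla_weights(t_prev, p, eps=1e-8):
    """Gradient of t**p at the previous iterate t_prev: the SLA replaces
    the concave term t^p by its tangent line, so each subproblem is a
    weighted-L1 linear program with these per-coefficient weights."""
    return p * (np.abs(t_prev) + eps) ** (p - 1.0)

# Small coefficients from the previous iterate receive large weights and
# are pushed to exactly zero; large coefficients are penalized less.
w = sla_weights(np.array([1e-6, 0.5, 2.0]), p=0.5)
```

This reweighting is what drives redundant coefficients to exactly zero across iterations, producing the sparsity that a flat L1 penalty cannot achieve.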

Properties of solution path
Similar to Li and Zhu (2008), we shed light on the properties of the whole solution path β̂. Problem (4) can be rewritten as an equivalent optimization problem (5), where s is the regularization parameter, which plays the same role as λ. From the KKT conditions of (5), we can obtain 0 ≤ αi ≤ τ and 0 ≤ αi* ≤ 1 − τ, where αi and αi* are Lagrangian multipliers; that is, αi ∈ [0, τ] and αi* ∈ [0, 1 − τ]. As in Li and Zhu (2008), the samples in the training set can be classified into three subsets according to the sign of the residual ri = yi − β̂0 − β̂ᵀxi: the elbow E = {i : ri = 0}, the left of the elbow L = {i : ri < 0}, and the right of the elbow R = {i : ri > 0}. We further discuss the solution path β̂ as a function of s. As s increases, we define the following events: either a data point hits the elbow, that is, a residual yi − β̂0 − β̂ᵀxi changes from nonzero to zero, or a coefficient β̂j changes from nonzero to zero; alternatively, a data point leaves the elbow for its left or right side, that is, a residual yi − β̂0 − β̂ᵀxi changes from zero to nonzero, or a coefficient β̂j changes from zero to nonzero.
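As a reading aid, the bounds on the multipliers follow from the subdifferential of the pinball loss; a standard derivation in our notation, with τ the quantile level and r the residual, is:

```latex
\rho_\tau(r) =
\begin{cases}
\tau r, & r \ge 0,\\
(\tau - 1)\, r, & r < 0,
\end{cases}
\qquad
\partial \rho_\tau(r) =
\begin{cases}
\{\tau\}, & r > 0,\\
[\,\tau - 1,\ \tau\,], & r = 0,\\
\{\tau - 1\}, & r < 0.
\end{cases}
```

Writing the subgradient at sample i as αi − αi* with αi ∈ [0, τ] and αi* ∈ [0, 1 − τ] recovers the stated ranges: points to the right of the elbow have αi = τ, points to the left have αi* = 1 − τ, and points on the elbow may take any value in between.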
If an event occurs, we need to update E, L, R, and M accordingly, where M denotes the active set of variables with nonzero coefficients β̂j. Moreover, the nonzero β̂j satisfy a system of linear equations, since the number of observations in the elbow is equal to the number of variables in the active set, that is, |E| = |M|.

Degrees of freedom
Degrees of freedom (df), measuring the effective dimensionality of the fitted model, play an important role in model assessment and variable selection. Suppose that y follows a distribution y ~ (μ(x), σ²), where μ is the true mean and σ² is the variance. The degrees of freedom of a fitted model f̂(x) are defined as

df(f̂) = (1/σ²) Σ_{i=1}^N cov(f̂(xi), yi).

Stein (1981) showed that the degrees of freedom of a fitted model f̂(x) can be calculated as

df(f̂) = E[ Σ_{i=1}^N ∂f̂(xi)/∂yi ].

There exists a set of regularization parameters 0 = s0 < s1 < s2 < ... < sL = ∞ such that, in the interior of any interval (sl, sl+1), the sets E, L, R, and M are constant with respect to s; these sets change only at each event. Similar to Li and Zhu (2008), we first list the useful lemmas below, and then we apply Stein's theory (Stein, 1981) to derive an expression for the degrees of freedom.
Lemma 1. For any fixed s > 0, there exists a set N_s ⊂ R^N that is a finite collection of hyperplanes in R^N. Lemma 2. For any fixed s > 0, β̂0(y) and β̂(y) are continuous functions of y, where β̂0(y) and β̂(y) are the fitted intercept and coefficient vector, respectively, when the response vector is y. Lemma 3. For any fixed s > 0 and any y ∈ R^N \ N_s, the sets E, L, and R are locally constant with respect to y. Theorem 1. For any fixed s > 0 and any y ∈ R^N \ N_s, we have the divergence formula Σ_{i=1}^N ∂f̂(xi)/∂yi = |E|. The proofs of these results are omitted here since they can be easily obtained from the proofs by Li and Zhu (2008).
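Stein's identity can be checked numerically: the divergence Σi ∂f̂(xi)/∂yi of any fitting procedure can be approximated by finite differences. The sketch below (helper names are ours; `fit` stands for a generic fitting function) verifies this on a linear smoother, where the divergence equals the trace of the hat matrix:

```python
import numpy as np

def divergence(fit, y, eps=1e-4):
    """Finite-difference estimate of sum_i d f_hat(x_i)/d y_i, the
    quantity in Stein's formula; fit(y) returns the fitted values."""
    y = np.asarray(y, dtype=float)
    base = fit(y)
    total = 0.0
    for i in range(len(y)):
        y_pert = y.copy()
        y_pert[i] += eps          # perturb only the i-th response
        total += (fit(y_pert) - base)[i] / eps
    return total

# For a linear smoother y_hat = H y, the divergence equals trace(H),
# the classical degrees of freedom of the smoother.
H = np.array([[0.5, 0.25], [0.25, 0.5]])
df = divergence(lambda y: H @ y, [1.0, 2.0])
```

For the quantile solution path, Theorem 1 says this divergence is simply the number of points sitting exactly on the elbow.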
An important application of degrees of freedom is the selection of the regularization parameter s. Two commonly used criteria in the quantile regression literature are the Schwarz information criterion (SIC) (Schwarz, 1978) and the generalized approximate cross-validation criterion (GACV) (Yuan, 2006). Here, we instead choose to minimize the normalized mean squared error (NMSE) and the sum of absolute estimation errors (SAEE) to select the optimal s. They are defined as

NMSE(s) = Σ_{i=1}^N (yi − ŷi)² / Σ_{i=1}^N (yi − ȳ)²,  SAEE(s) = Σ_{j=1}^K |β̂j(s) − βj|,

where ȳ is the average value of y1, ..., yN.
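For concreteness, the two selection criteria can be implemented in a few lines (a sketch; function names are ours):

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: squared error relative to the
    total variation of y, so the trivial mean predictor scores 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def saee(beta_hat, beta_true):
    """Sum of absolute estimation errors over all K coefficients."""
    return np.sum(np.abs(np.asarray(beta_hat) - np.asarray(beta_true)))
```

Note that SAEE requires the true coefficients β and is therefore only available in simulation; NMSE can be computed on held-out data in practice.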

Simulation study
In this section, we conduct a simulation study to evaluate the variable selection performance of Lp-SVQR by comparing its results with those of L1-norm regularized quantile regression (L1-QR) (Li & Zhu, 2008) and ε-support vector quantile regression (ε-SVQR) (Anand et al., 2020).

Simulation setup
In our simulation study, the parameter s is searched over the set {2^−10, 2^−9, ..., 2^9, 2^10}, and the parameter p is chosen from 0 to 1 with a fixed step size of 0.1. The optimal values of the parameters are obtained by the grid-search method.
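The grid search described above can be sketched as follows, where `score_fn` stands for any validation error (for example, the 10-fold cross-validated NMSE) and is an assumed placeholder:

```python
import itertools

# Candidate grids as described: s in {2^-10, ..., 2^10}, p in {0.0, ..., 1.0}.
s_grid = [2.0 ** k for k in range(-10, 11)]
p_grid = [round(0.1 * k, 1) for k in range(11)]

def grid_search(score_fn):
    """Exhaustive search over (s, p); score_fn(s, p) returns a
    validation error whose implementation is assumed here."""
    return min(itertools.product(s_grid, p_grid),
               key=lambda sp: score_fn(*sp))
```

With 21 values of s and 11 values of p, each configuration requires 231 model fits per cross-validation fold, which is why the SLA's linear subproblems matter for practical running time.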
Let n be the number of samples, ŷi be the predicted value of yi, and ȳ = (1/n) Σ_{i=1}^n yi be the average value of y1, ..., yn. Similar to the model setup in Peng and Wang (2015), we generate (x̃1, x̃2, ..., x̃K)ᵀ from N(0, Σ), where Σ is the covariance matrix with elements σij = 0.5^{|i−j|}, 1 ≤ i, j ≤ K. Then, we set x1 = Φ(x̃1) and xk = x̃k for k = 2, 3, ..., K, where Φ(·) is the cumulative distribution function of the standard normal distribution. We denote by β̂ = (β̂1, β̂2, ..., β̂K)ᵀ the estimate of β. We use the following criteria to evaluate the variable selection ability of linear Lp-SVQR.
P1: the proportion of simulation runs in which all nonzero coefficients are selected.
P2: the proportion of simulation runs in which x1 is selected.

AEE: the absolute estimation error, defined as |β̂i − βi|, i = 1, 2, ..., K.
NMSE: the normalized mean squared error, defined as NMSE = Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)².
R²: the coefficient of determination, defined as R² = 1 − Σ_{i=1}^n (yi − ŷi)² / Σ_{i=1}^n (yi − ȳ)².
We consider two simulation cases with five different values of τ: 0.1, 0.3, 0.5, 0.7, and 0.9. We set K = 500 and N = 100 and generate the first regression model as follows, where the random error e ~ N(0, 1) is independent of the covariates.
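The data-generating process above can be sketched as follows; the sparse coefficient vector (first 10 entries equal to 1) and the seed are illustrative assumptions, since the exact model specification is not reproduced here:

```python
import numpy as np
from math import erf

def simulate(N=100, K=500, rho=0.5, seed=0):
    """Design as described: rows ~ N(0, Sigma) with Sigma_ij = rho**|i-j|,
    then x1 passed through the standard normal CDF. The sparse truth
    'beta' below is an assumed example, not the paper's exact model."""
    rng = np.random.default_rng(seed)
    idx = np.arange(K)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) covariance
    X = rng.multivariate_normal(np.zeros(K), Sigma, size=N)
    X[:, 0] = [0.5 * (1.0 + erf(v / np.sqrt(2.0))) for v in X[:, 0]]  # Phi(x1)
    beta = np.zeros(K)
    beta[:10] = 1.0                          # assumed sparse coefficients
    y = X @ beta + rng.standard_normal(N)    # e ~ N(0, 1); a heavy-tailed
    return X, y, beta                        # draw would replace this line
```

Because K = 500 while N = 100, any method fitted to (X, y) faces the K ≫ N setting the paper targets.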
Both cases therefore pose K ≫ N estimation problems. The optimal parameter selection is based on 10-fold cross-validation. For each case, a total of 100 simulation iterations are conducted to estimate the parameter β. We minimize the NMSE and SAEE to select the optimal parameters.

Sparseness and robustness analysis
Tables 1 and 2 list the variable selection results of Lp-SVQR, ε-SVQR, and L1-QR. Lp-SVQR selects fewer features than ε-SVQR and L1-QR, mainly because the Lp-norm yields sparser estimates than the L1-norm and L2-norm. Moreover, Lp-SVQR selects sparser sets of variables than ε-SVQR and L1-QR while attaining a smaller estimation error, a larger R², and a smaller NMSE. It is clear that Lp-SVQR is more robust to heavy-tailed noise than ε-SVQR and L1-QR, mainly because Lp-SVQR adopts the quantile loss together with the sparsity parameter p. In terms of running times, the training of ε-SVQR is significantly faster than that of L1-QR and Lp-SVQR. Figures 1-3 show the absolute estimation error for each case when τ is 0.3, 0.5, and 0.7, respectively. We can see that the regression estimates β̂ obtained from Lp-SVQR are close to the real values of the regression coefficients β. In Figures 1-3, the red lines, representing the absolute estimation error of Lp-SVQR in the different cases, fluctuate only slightly around 0, which indicates that Lp-SVQR selects very few useful variables and captures the statistical information in the test data sets; Lp-SVQR thus realizes variable selection and regression simultaneously owing to its inherent variable selection property. The blue lines, representing the absolute estimation error of L1-QR, fluctuate considerably, especially when βi = 1 (i ∈ M), which shows that L1-QR lacks the ability to select the useful variables. The absolute estimation errors of ε-SVQR follow a trend similar to those of L1-QR, indicating that ε-SVQR lacks sparseness. Therefore, the variable selection results of Lp-SVQR are significantly better than those of ε-SVQR and L1-QR.

Solution path
The solution path treats β̂ as a function of s and characterizes how β̂ changes as s changes. We first fix the parameters τ and p at their optimal values from the experiments. Then, we investigate the influence of the regularization parameter s on the absolute estimation error. Figure 4 shows the sum of absolute estimation errors (SAEE) as a function of s. For type A in Figure 4, when s changes from 2^−10 to 2^−1, the SAEE remains unchanged; when s changes from 2^−1 to 2^2, the SAEE decreases sharply, reaching a minimum value of 0.01 at s = 2^2; as s increases from 2^2 to 2^4, the SAEE grows; and when s exceeds 2^4, the SAEE reaches a maximum value of 10. In the type A case, we fix the parameters τ, p, and s to 0.5, 0.3, and 2^2, respectively; the fitted coefficients are shown in Figure 5, and the corresponding values of βi − β̂i are shown in Figure 6. From the type A panel of Figure 5, we find that most fitted coefficients are zero, only 11 fitted coefficients are nonzero, and these fluctuate slightly around the real values of βi. From the type A panel of Figure 6, we see that most absolute estimation errors are zero and only 10 are nonzero, with a maximum value of 0.07.
The type B solution path likewise treats β̂ as a function of s. For type B in Figure 4, we observe that the SAEE has a trend similar to that of type A and reaches a minimum value of 0.18 at s = 2^3. The type B panel of Figure 5 shows the estimation results β̂ for the type B simulation with the parameters τ, p, and s fixed to 0.5, 0.5, and 2^3, respectively. We observe that most fitted coefficients are zero, only the first 10 fitted coefficients are nonzero, and these fluctuate slightly around the real value of 1. In the type B panel of Figure 6, we see that only βi − β̂i (i ∈ {2, 3, ..., 11}) are nonzero and the others are zero; the absolute estimation error reaches a maximum value of 0.06.

Real data analysis
To further test the sparseness of Lp-SVQR, we discuss the variable selection problem in index construction. Selecting the representative variables from the vast number of potential candidates is a crucial step in index construction, since representative variables usually provide the most important information for management decisions. Lasso (Tibshirani, 1996) is a very popular method for this variable selection problem since it shrinks some coefficients of the estimators toward 0. However, Lasso has some drawbacks. Fan and Li (2001) found that Lasso applies the same tuning parameter to all regression coefficients, so it suffers an appreciable bias. In addition, Lasso may suffer from the influence of heavy-tailed noise since its least squares loss function is sensitive to such noise.
Lp-SVQR can overcome these drawbacks of Lasso. In addition, by using the sparsity parameter p, the Lp-norm regularization term in Lp-SVQR is much sparser than the L1-norm regularization term in Lasso. Thus, Lp-SVQR is an effective method for the variable selection problem of index construction. Here, we discuss the variable selection problem of an innovation and entrepreneurship index, which summarizes information about the state level of innovation and entrepreneurship.
We collected 25 variables with data ranging from the first quarter of 2014 to the second quarter of 2017, all obtained from the Hangzhou Bureau of Statistics (http://tjj.zj.gov.cn/col/col1525563/index.html). The sample size is 14, which is smaller than the number of explanatory variables. Gross domestic product (GDP) is used as a monitor for supervising the selection process. The experimental results indicate that ε-SVQR and L1-QR select all 25 variables, whereas Lp-SVQR selects only 13 variables for τ = 0.1, 0.3, 0.5, 0.7, and 0.9. Table 3 shows the regression results in terms of the NMSE for Lp-SVQR, ε-SVQR, and L1-QR. Although Lp-SVQR selects fewer variables than ε-SVQR and L1-QR, it obtains regression results comparable to those of the other methods. These results show that Lp-SVQR is a much sparser method than ε-SVQR and L1-QR. The variable selection results of Lp-SVQR are shown in Table 4. We find that the loan balance of nonfinancial institutions, loans of small enterprises, number of talents, and number of new employments, as important input factors of innovation and entrepreneurship, are selected. The number of new industrial products, profit ratio of business, and number of new invention patents, as output factors of innovation and entrepreneurship, are selected. The number of new enterprises, number of college student employments and entrepreneurships, and number of new trademark registrations, representing the vitality of innovation and entrepreneurship, are selected. The amount of innovative vocabulary searches, amount of entrepreneurship vocabulary searches, and satisfaction with innovation and entrepreneurship policies, representing the environment of innovation and entrepreneurship, are selected.

Conclusion
Our work focused on the sparse regression problem of high-dimensional data with heavy-tailed noise, especially when the dimensionality of the regressors is larger than the sample size. We proposed Lp-norm support vector quantile regression, which can be viewed as an extension of the L1-norm regularized quantile regression method of Li and Zhu (2008), to solve this problem. Lp-norm support vector quantile regression is robust against heavy-tailed noise owing to its use of the pinball loss function. The Lp-norm regularization term and the supervised selection process of Lp-SVQR ensure that the representative variables are selected and the redundant variables are discarded automatically. The variable selection performance of Lp-SVQR in the simulation analysis and the real data analysis in the K ≫ N setting confirmed its robustness and sparseness. One important remaining problem is the application of the proposed high-dimensional sparse method in real-world settings where high-dimensional data sets are available, such as economics (Queirós et al., 2019), finance (Bhat et al., 2020), and other fields (Medase & Abdul-Basit, 2020). By adjusting the sparsity parameter p, the proposed method ensures that the representative variables are selected; by using the quantile parameter, it is robust against heavy-tailed noise. How to apply the proposed method to real-world high-dimensional variable selection problems is therefore our future work. For example, multi-factor quantitative stock selection is a typical high-dimensional research problem, and it would be interesting to apply the proposed method to select the principal factors of stock returns.

Funding
This work was supported by Philosophy and Social Sciences Leading Talent Training Project of Zhejiang Province (No. 21YJRC07-1YB).