Ridge Regression Under Dense Factor Augmented Models

Abstract This article establishes a comprehensive theory of the optimality, robustness, and cross-validation selection consistency of ridge regression under factor-augmented models with possibly dense idiosyncratic information. Using spectral analysis for random matrices, we show that ridge regression is asymptotically efficient in capturing both factor and idiosyncratic information, minimizing the limiting predictive loss among the entire class of spectral regularized estimators under large-dimensional factor models and a mixed-effects hypothesis. We derive an asymptotically optimal ridge penalty in closed form and prove that a bias-corrected k-fold cross-validation procedure can adaptively select the best ridge penalty in large samples. We extend the theory to autoregressive models with many exogenous variables and establish a consistent cross-validation procedure using what we call the double ridge regression method. Our results allow for nonparametric distributions of possibly heavy-tailed martingale difference errors and idiosyncratic random coefficients, and adapt to the cross-sectional and temporal dependence structures of the large-dimensional predictors. We demonstrate the performance of our ridge estimators in simulated examples as well as an economic dataset. All the proofs are available in the supplementary materials, which also include more technical discussions and remarks, extra simulation results, and useful lemmas that may be of independent interest.


Introduction
Improving the bias-variance tradeoff is crucial in many high-dimensional forecasting applications. The classical least-squares regression can suffer from substantial predictive variance when the number of predictors is large compared with the sample size. One useful variance reduction strategy is summarizing massive raw features into a few leading principal components and then predicting the response with these low-dimensional predictors. When the large-dimensional covariates obey a factor structure, the principal component estimator can provide consistent forecasts of the regression function if the factors suffice to predict the target variable. The principal component regression (PCR) often shows better performance than the LASSO regression and ensemble algorithms such as bagging for macroeconomic forecasting; see, for example, Stock and Watson (2002), De Mol, Giannone, and Reichlin (2008), Stock and Watson (2012), Castle, Clements, and Hendry (2013), and Carrasco and Rossi (2016). The PCR is also widely used when studying DNA microarrays (Alter, Brown, and Botstein 2000), medical shapes (Cootes et al. 1995), climate (Preisendorfer 1998), robust synthetic control (Agarwal et al. 2021), and many other fields (see Chapter 6 of James et al. 2021). While the PCR model is dense at the variable level (see, e.g., the discussions in Ng 2013), it relies on only low-dimensional principal components in forecasting. However, when many idiosyncratic components are useful, even though each is negligible, the PCR is not optimal and leaves room for improvement in many empirical applications.
Consider a toy example with p = 100 covariates x_i = (x_{i,1}, ..., x_{i,p})' ∈ R^p, where we denote the transpose of any matrix or vector A by A', and n = 100 observations from a factor model given by

x_i = λ_1 f_{i,1} + λ_2 f_{i,2} + e_i, i = 1, ..., n. (1.1)

The latent factors f_{i,1} ∈ R and f_{i,2} ∈ R, the entries of the idiosyncratic variables e_i = (e_{i,1}, ..., e_{i,p})', and the entries of the loading coefficients λ_1 = (λ_{1,1}, ..., λ_{p,1})' and λ_2 = (λ_{1,2}, ..., λ_{p,2})' are all generated independently from the standard normal distribution. One may think of the data vector x_i as a set of economic variables at a given time. We then generate the target variables y_i from a factor-augmented predictive regression model given by

y_i = μ_i + ε_i with μ_i = 0.6 f_{i,1} + 0.2 f_{i,2} + e_i'β, (1.2)

where the regression errors ε_i are generated from the standard normal distribution independently of the regression means μ_i. We generate the direction of the coefficient vector β = (β_1, ..., β_p)' uniformly over the unit sphere in R^p, and thus each entry β_i is stochastically negligible for large p. Then we scale the total idiosyncratic signal ||β||² = Σ_{i=1}^p β_i² through the grid {0, 0.1, ..., 1}. The PCR model is the special case with ||β||² = 0. The idiosyncratic information, at the aggregate level, becomes more important as ||β||² increases.
We plot the median, over 5000 replications, of the training predictive loss n^{-1} Σ_{i=1}^n (μ_i − μ̂_i)² as a function of the idiosyncratic signal length ||β||² on the left-hand side of Figure 1, where μ_i and μ̂_i denote the population and estimated means of the target variable y_i, respectively. We consider both the oracle PCR estimator using the true number of factors r = 2, and the feasible PCR estimator using a number of PCs selected by 5-fold cross-validation. We compare these PCR estimators with the ridge regression estimator, hereinafter abbreviated as RR estimator, which uses all the raw variables x_i and does not require selecting the number of factors. We select the ridge penalty using 5-fold cross-validation. The median predictive loss of the PCR estimators grows (almost) linearly in the idiosyncratic signal length ||β||², but that of the RR estimator grows significantly more slowly. The cross-validated RR estimator is only slightly worse than the oracle PCR estimator but comparable to the cross-validated PCR estimator for small ||β||², whereas the advantage of the RR estimator emerges quickly as ||β||² grows.
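The toy comparison above can be reproduced in a few lines. The sketch below is our own illustration, not the authors' code: it runs a single replication (rather than the median over 5000), uses an arbitrary seed, and replaces the 5-fold cross-validation by a coarse penalty grid evaluated at the best training predictive loss, which suffices to show the qualitative pattern in the left panel of Figure 1.

```python
import numpy as np

def predictive_losses(b2, n=100, p=100, seed=0):
    """One draw of the Figure-1 experiment: training predictive loss of
    oracle PCR (r = 2) versus ridge regression on the same data."""
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((n, 2))             # latent factors f_i
    Lam = rng.standard_normal((p, 2))           # loadings lambda_1, lambda_2
    E = rng.standard_normal((n, p))             # idiosyncratic components e_i
    X = F @ Lam.T + E                           # factor model (1.1)
    beta = rng.standard_normal(p)
    beta *= np.sqrt(b2) / np.linalg.norm(beta)  # total signal ||beta||^2 = b2
    mu = F @ np.array([0.6, 0.2]) + E @ beta    # regression means in (1.2)
    y = mu + rng.standard_normal(n)

    Xc, yc = X - X.mean(0), y - y.mean()
    # Oracle PCR: project the target on the top r = 2 principal components.
    U = np.linalg.svd(Xc, full_matrices=False)[0][:, :2]
    loss_pcr = np.mean((mu - (y.mean() + U @ (U.T @ yc))) ** 2)
    # Ridge regression over a coarse penalty grid (best training loss).
    loss_rr = min(
        np.mean((mu - (y.mean() + Xc @ np.linalg.solve(
            Xc.T @ Xc + n * g * np.eye(p), Xc.T @ yc))) ** 2)
        for g in (0.01, 0.1, 1.0, 10.0))
    return loss_pcr, loss_rr
```

For a sizable idiosyncratic signal, say ||β||² = 0.8, a typical draw shows the ridge loss below the oracle PCR loss, echoing the divergence of the two curves as ||β||² grows.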
To illustrate the sensitivity of RR to the idiosyncratic information, we plot the median of the sample variance of the estimated means (which equals the sum of squared estimated coefficients on all standardized components) over the replications on the right-hand side of Figure 1. To emphasize the changes, we plot only the differences from the values at ||β||² = 0. The variance of the RR estimates grows quickly with ||β||², while that of the oracle PCR coefficients hardly changes. The cross-validation selection improves the sensitivity of PCR to ||β||² only slightly, because a few extra components are actually asymptotically negligible in our dense models.
Figure 2 compares the training and testing predictive losses for ridge regression, averaged over 1000 replications, as functions of the effective degrees of freedom (see Hastie, Tibshirani, and Friedman 2009, p. 64) implied by the ridge penalty. We define the testing predictive loss as the expected squared forecast error of the out-of-sample regression mean μ_oos = f_oos'θ + e_oos'β, with new variables f_oos and e_oos, using the estimated regression function and the new covariates x_oos, where θ = (0.6, 0.2)' and β are the population coefficients in (1.2). Again, the optimal ridge estimator (dashed line) for testing predictive loss matches the performance of PCR when ||β||² = 0 but outperforms PCR when ||β||² > 0. More importantly, these plots illustrate an important relation between training and testing predictive losses for performance evaluation: they share (almost) the same optimal ridge penalty. We formally establish this universal property in Section 2 using a general notion of predictive loss, and generalize the results to time series data in Section 3.
Our toy example illustrates that omitting idiosyncratic information can lead to a nontrivial loss in predictive efficiency, and that ridge regression is more robust than PCR because it captures both factor and idiosyncratic information. The contribution of this article is a comprehensive theory of the optimality, robustness, and cross-validation consistency of ridge regression under a general factor-augmented regression model given by

X = FΛ' + EΓ', (1.3)
Y = Fθ + Eβ + ε, (1.4)

where the left-hand sides of the equations represent a large sample of observations in high dimensions: X = [x_1 − x̄, ..., x_n − x̄]' ∈ R^{n×p} is the demeaned design matrix of predictors x_i ∈ R^p with large n and p, and Y = (y_1 − ȳ, ..., y_n − ȳ)' is the demeaned vector of univariate targets y_i. Throughout, ā = (1/n) Σ_{i=1}^n a_i denotes the sample mean of any variable a. The low-rank matrix FΛ' collects the factor information, F and E are the demeaned matrices of latent factor scores and idiosyncratic components, and ε is a demeaned vector of uncorrelated (but possibly dependent) regression errors ε_i. One may also interpret (1.3) and (1.4) as an error-in-variables regression model (6.1), but we postpone the discussion to Section 6.
Many datasets support the low-rank perturbation form (1.3); see the survey by Johnstone and Paul (2018) for many examples in electrocardiogram tracing, microarrays, satellite images, medical shapes, climate, signal detection, and finance. We assume that the low-rank p × r coefficient matrix Λ = [λ_1, ..., λ_r] has a divergent strength a_p = tr(Λ'Λ) → ∞ at an arbitrary rate, to separate the spectrum of the factor component FΛ' from the stochastically bounded spectrum of the idiosyncratic component EΓ' in (1.3). Therefore, we can analyze the spectral regularization using the standard eigenbasis without suffering from PCA inconsistency issues (Paul 2007; Johnstone and Lu 2009; Onatski 2012). Following Cai, Han, and Pan (2020), we allow the unobservable idiosyncratic components to enter the factor model (1.3) subject to a possibly unknown rotation Γ ∈ R^{p×p} for generality. We discuss the possibility of extending our results to a_p = O(1), especially when PCA becomes inconsistent, as future work in Appendix A.9 in the supplementary materials.
Our regression model (1.4) is the factor-augmented model proposed by Kneip and Sarda (2011). Substituting EΓ' = X − FΛ' under model (1.3) and using the identity Γ'Γ = ΓΓ' = I_p, we can reparameterize the regression model (1.4) as follows:

Y = Fθ̃ + Xβ̃ + ε, with θ̃ = θ − Λ'β̃ and β̃ = Γβ, (1.5)

where the entries of θ̃ and β̃ are called common and specific effects, respectively. One might also interpret the last equation as a misspecified model where we observe only the design matrix X but not the latent factor matrix F. Throughout the article, we refer to these two equivalent representations (1.4) and (1.5) of factor-augmented regression models interchangeably. Note that our model augments the purely variable-based model (with θ̃ = θ − Λ'β̃ = (0, ..., 0)') in, for example, Castle, Clements, and Hendry (2013) and Fan, Ke, and Wang (2020) by adding extra factor information.
Unlike the aforementioned papers, we study the cases where the coefficient vector β is possibly dense. Even if the true regression vector β on the idiosyncratic components is sparse, the regression vector β̃ on the variables may contain few (or even no) zero entries due to a nontrivial rotation Γ. In general, we are interested in models that are possibly dense at both the variable and principal component levels. We consider every idiosyncratic component to be asymptotically negligible but allow them to have nontrivial aggregate explanatory power; see Jolliffe (1982), Dobriban and Wager (2018), Giannone, Lenza, and Primiceri (2021), Hastie et al. (2022), and many references therein for examples of applications.
Based on spectral analysis of large-dimensional random matrices, this article shows that ridge regression is optimal in capturing factor and idiosyncratic information simultaneously among the entire class of spectral regularized estimators under the mixed-effects hypothesis. Ridge regression has a long tradition in statistics, tracing back at least to Tikhonov (1963a, 1963b) and Hoerl and Kennard (1970); see Liu and Dobriban (2020) for an excellent survey. To our knowledge, we are the first to study the ridge regression model with factor augmentation. Our analysis of large-dimensional covariance matrices under factor models requires a different approach from that in Dicker (2016) and Dobriban and Wager (2018). In particular, our results extend to more general data generating processes of covariates by analyzing the limiting predictive loss instead of the predictive risk (i.e., the expected predictive loss); see Appendix A.1 in the supplementary materials for comparisons with Dobriban and Wager (2018).
We prove that the k-fold cross-validation selects the asymptotically optimal ridge penalty, as illustrated in Figure 2, subject to a straightforward bias correction. The bias-corrected selection is uniformly consistent for all positive ridge penalties bounded away from zero, including those diverging to infinity. We extend the scope of the k-fold cross-validation to non-iid data and provide theoretical guarantees for dealing with martingale difference regression errors. Furthermore, we propose a new method called "double" ridge regression and an asymptotically correct k-fold cross-validation method in the presence of nuisance variables, in particular autoregressors, under mild conditions.
The rest of the article is organized as follows. We develop the asymptotic theories for independent errors in Section 2 and then generalize the results to martingale difference errors and autoregressive models in Section 3. More simulated and real-life examples are discussed in Sections 4 and 5, respectively. We conclude the article in Section 6. All the proofs of the theorems are postponed to the supplementary materials. The supplementary materials also provide more technical discussions and remarks about the conditions and auxiliary results, together with additional simulation results and some important lemmas that may be of independent interest.

Main Results
First, consider the factor model (1.3) in the Introduction. We are interested in the asymptotic regime where the number of predictors is comparable to the sample size.

Assumption 1 (Large-Dimensional Asymptotics). The number of covariates p = p(n) and the sample size n are comparable in such a way that p/n → c ∈ (0, ∞) as n → ∞.
The concentration ratio p/n, possibly larger than 1, plays a central role in our asymptotic theory. Unless specified otherwise, all asymptotic results hold as n, p → ∞ simultaneously in the probability space, enlarged if necessary, with the largest sigma-algebra.
Without loss of generality, we treat both the in-sample factor score matrix F and the loading matrix Λ in (1.3) as fixed parameters. When factors are random, our approach is equivalent to conditioning on a particular realization of F. The number of factors r is finite but not necessarily known.
Assumption 2 (Factor Model). The following conditions hold: (a) F'F/n = I_r; (b) Λ'Λ/a_p → diag(σ_11, ..., σ_rr) as p → ∞ with σ_11 ≥ ... ≥ σ_rr > 0, where a_p = tr(Λ'Λ) → ∞; and (c) λ_max(S_e) = O_P(1), where S_e := n^{-1} E'E is the (unobservable) sample covariance matrix of the idiosyncratic components.
Conditions (a)-(c) are the weaker factor-model conditions in Bai and Ng (2021), allowing an arbitrarily slow rate for the total factor strength a_p. When all entries of the loading matrix Λ = {λ(i, j)}, say, are bounded away from infinity, we readily have a_p = Σ_{i=1}^p Σ_{j=1}^r λ(i, j)² = O(p). We allow the loading matrix to have many vanishing entries, and one may use a_p ∝ p^α for any α ∈ (0, 1]; see, for example, Bai (2003), Hallin and Liska (2007), and Agarwal et al. (2021) for the linear case α = 1 and Onatski (2009) for the case α > 2/3. Via the factor decomposition (1.3) of X = FΛ' + EΓ', these conditions ensure that the singular values of the factor part FΛ' are much larger than (hence, separated from) those of the idiosyncratic part EΓ'; see, for example, Cai, Han, and Pan (2020) for discussions and comparisons with the generalized spiked model in Johnstone (2001).
Remarkably, our condition (b) relaxes some identification conditions in the previous papers. We allow ties in the proportions of total loading {σ_ii : i = 1, ..., r} and require only the subspace consistency of PCA in the spirit of Jung and Marron (2009). When there are ties, all bases of the span of the corresponding latent factors satisfy our conditions, and therefore we are not able to identify the true ones in the data generating process. Nevertheless, we can estimate the subspace as a whole consistently and unify the further factor analysis for cross-validation and autoregressive models, as train-test splits and residual-based estimation can make latent factors indistinguishable in subsamples or after projections. One limitation of condition (b) is that all eigenvalues of Λ'Λ must diverge at the same rate a_p, but we can easily relax this requirement and allow different rates if necessary.
Condition (c) holds for a separable random matrix E = B^{1/2} ς A^{1/2} in (1.3), where A ∈ R^{p×p} and B ∈ R^{n×n} are positive-definite matrices with bounded spectral norms and ς = {ς_ij} ∈ R^{n×p} is a random matrix of iid standardized entries with a bounded fourth moment; see, for example, Paul and Silverstein (2009) and Chapter 5 of Bai and Silverstein (2010). Condition (c) also holds for high-dimensional linear time series satisfying the conditions in Remark 8 of Onatski and Wang (2021). See Appendix A.2 of the supplementary materials for more detailed discussions.
For out-of-sample analysis, we assume that the data obeying (1.3) and (1.4) are from the following data generating process (DGP):

y_i = φ_0 + f_i'θ + e_i'β + ε_i, (2.1)

where φ_0 = φ_0(n) is an unspecified intercept possibly indexed by n, and the regression errors ε_i satisfy the following condition.

Assumption 3 (Independent Errors). The regression errors {ε_i} are mutually independent with zero mean Eε_i = 0 and a common variance σ² = Eε_i² bounded away from zero and infinity, and they are independent of {f_i, e_i}. They are uniformly integrable in second moment such that sup_{i∈Z} E|ε_i|^{2+2ι} is bounded away from infinity for some ι > 0.
This assumption is common in the statistical learning literature. We will relax it to handle uncorrelated but dependent errors in Section 3. See also Remark 2 for the relaxation of the homoscedasticity condition. We consider strongly exogenous predictors {x_i} to allow for a standard factor estimation using the entire sample of raw predictors as in, for example, Stock and Watson (2012).
Throughout the article, we consider a mixed-effects model where the factor coefficients θ are arbitrary fixed effects and the entries of the idiosyncratic coefficient vector β are stochastic. In particular, we make the following assumption throughout the article for β = (β_1, ..., β_p)'.

Assumption 4 (Random Effects). The idiosyncratic coefficient vector β = τ p^{-1/2} b, where b = (b_1, ..., b_p)' is independent of the predictors {x_i} and the regression errors {ε_i}, and has uncorrelated entries such that Eb_i = 0 and Eb_i² = 1 for all i, and max_{1≤i≤p} E|b_i|^{2+2ι} is bounded away from infinity for some ι > 0. The total idiosyncratic signal length τ² = τ_n² is unknown, possibly dependent on {x_i}, and τ² = O_P(1). When τ² = 0, the augmented model (2.2) reduces to the principal component regression model.
This condition is based on the random-effects hypothesis in, for example, Dobriban and Wager (2018), but here we relax the dependence structure between the entries β_i and allow them to be non-identically distributed. See Appendix A.3 in the supplementary materials for more examples and comparisons. We emphasize that this assumption is needed only for modeling the idiosyncratic coefficients β, and we allow for arbitrary fixed effects θ on the latent factors in our factor-augmented model (2.2).
We split our main results into two sections: the first shows the efficiency of ridge regression among the general class of spectral regularized estimators; the second establishes the uniform consistency of the bias-corrected k-fold cross-validation for selecting the best ridge penalty.

On the Optimality of Ridge Regression
In this section, we show that ridge regression is asymptotically optimal among the spectral regularized estimators in capturing idiosyncratic information, and that it is more robust than principal component regression under factor augmentation.
Consider the singular value decomposition X = Σ_{i=1}^K (nλ_i)^{1/2} u_i v_i', where λ_i denotes the ith largest eigenvalue (counting multiplicities) of the sample covariance matrix X'X/n and K ≤ min{n − 1, p} denotes the rank of X. Note that {u_i} are all orthogonal to the n-dimensional unit vector proportional to the all-ones vector, denoted by ι_n := n^{-1/2}(1, ..., 1)' ∈ R^n, due to data centering. To reduce the estimation variance of the regression means μ = Fθ + Eβ in (1.4), we shrink the projections of the (uncentered) target vector y:

μ̂(δ̂) := (m̂_δ̂(x_1), ..., m̂_δ̂(x_n))' = ȳ ι_n ι_n' y + Σ_{i=1}^K δ̂(λ_i) u_i u_i' Y, (2.3)

where m̂_δ̂ is the estimated regression function given by

m̂_δ̂(x) := ȳ + (x − x̄)' β̂(δ̂), (2.4)

and β̂(δ̂) is a penalized least-squares estimator (Tikhonov 1963a, 1963b) given by

β̂(δ̂) := (X'X + nW(δ̂))^+ X'Y, (2.5)

with A^+ denoting the Moore-Penrose inverse of any matrix A and the weight matrix W(δ̂) := Σ_{i=1}^K v_i v_i' λ_i (1 − δ̂(λ_i))/δ̂(λ_i). In particular, choosing the constant spectral weight λ_i(1 − δ̂(λ_i))/δ̂(λ_i) ≡ γ, that is, δ̂(x) = x/(x + γ) for some constant γ > 0 in (2.5), yields the ridge (Hoerl and Kennard 1970) estimator β̂(γ) = (X'X + nγ I_p)^{-1} X'Y. We allow for infinite penalties for dimension reduction purposes: one should interpret the penalty weight as infinite if and only if δ̂(x) = 0, in which case we must set β̂ orthogonal to v_i in estimation. For more discussion we refer to Hastie, Tibshirani, and Friedman (2009, chap. 3). Henceforth we call (2.3) a spectral regularized estimator of the regression means.
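The spectral form of the ridge estimator can be checked numerically. The sketch below (our own illustration with arbitrary dimensions, seed, and penalty; data are not centered, which does not affect the algebraic identity) verifies that shrinking the projections u_i'Y by δ(λ_i) = λ_i/(λ_i + γ), with λ_i the eigenvalues of X'X/n, reproduces the closed-form ridge fit X(X'X + nγI_p)^{-1}X'Y.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, gam = 50, 80, 0.7
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Closed-form ridge fit of the regression means.
mu_ridge = X @ np.linalg.solve(X.T @ X + n * gam * np.eye(p), X.T @ Y)

# Spectral form: shrink the projections of Y on the left singular
# vectors u_i by delta(lambda_i) = lambda_i / (lambda_i + gamma),
# where lambda_i are the eigenvalues of X'X / n.
U, s, _ = np.linalg.svd(X, full_matrices=False)
lam = s ** 2 / n
mu_spec = U @ ((lam / (lam + gam)) * (U.T @ Y))

assert np.allclose(mu_ridge, mu_spec)  # the two fits coincide
```

The identity holds because X'X + nγI_p shares the right singular basis of X, so each singular direction is rescaled by nλ_i/(nλ_i + nγ) = λ_i/(λ_i + γ).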
We define the predictive loss as the reducible mean squared forecast error (see James et al. 2021 for the decomposition into reducible and irreducible errors), conditional on a realization of the training set, given by

L(δ̂) := E[(m̂_δ̂(x_oos) − μ_oos)² | S_n],

where x_oos = Λf_oos + Γe_oos and μ_oos = φ_0 + f_oos'θ + e_oos'β are the out-of-sample (OOS) random covariates and regression mean from the DGPs (2.1) and (2.2) with the same factor model parameters and regression coefficients. The sigma-algebra S_n is generated by the in-sample variables {f_i, e_i, ε_i : i ≤ n} and the random coefficients β.
We use the following paradigm to evaluate the spectral regularized estimators.
Assumption 5 (Evaluation Paradigm). The new variables f_oos and e_oos are independent of both the in-sample regression errors {ε_i : i ≤ n} and the random coefficients β, given the in-sample random variables {f_i, e_i : i ≤ n}. The conditional covariance matrices of a_oos centered around the sample mean, given by Σ_a := E[(a_oos − ā)(a_oos − ā)' | S_n], have spectral norms bounded in n for both a ∈ {f, e}.
This paradigm is flexible enough to include both the training and testing predictive losses used in Figure 2:
• If one draws the pair of new variables f_oos and e_oos uniformly over the sample {f_i, e_i : 1 ≤ i ≤ n}, then we obtain the training predictive loss, with the sample covariance matrices Σ_a = (1/n) Σ_{i=1}^n (a_i − ā)(a_i − ā)' for a ∈ {f, e}.
• A more typical implementation is to generate f_oos and e_oos, independent of S_n, from some population distribution or a random draw on a test set in order to assess the testing predictive loss. Then Σ_a = Σ_a^oos + (μ_a − ā)(μ_a − ā)', where μ_a and Σ_a^oos denote the mean and covariance matrix of a_oos for a ∈ {f, e}.
Our estimated regression function (2.4) implies estimators θ̂(δ̂) and β̂(δ̂) of θ and β, respectively, via a decomposition of the fitted coefficients. The following theorem relates L(δ̂) to the estimation errors of the linear coefficients and gives further stochastic approximations based on spectral analysis.
Theorem 1. Under Assumptions 1-5, for any sequence of candidate shrinkage functions δ̂ = δ̂_n : (0, ∞) → [0, 1] that may or may not depend on the covariates in the training set,

L(δ̂) = ||θ̂(δ̂) − θ||²_{Σ_f} + ||β̂(δ̂) − β||²_{Σ_e} + o_P(1), (2.8)

where ||v||²_Σ := v'Σv denotes the weighted norm of any vector v. Furthermore, the two error terms admit the stochastic approximations (2.9) and (2.10), in which F_p is an empirical spectral distribution depending only on the covariates, provided that there are no ties among {σ_ii : i = 1, ..., r} and the penalty function is bounded away from 0. The probability mass F_p(0) > 0 when the data matrix X is column-rank deficient with p > n; more generally, the result remains true up to a rotation of the factor basis when ties are present.

The estimation errors of the fixed and random effects are asymptotically orthogonal in (2.8) because their interaction vanishes through aggregation of the uncorrelated mean-zero random coefficients and regression errors. Via the approximation (2.8), one can interpret different predictive loss functions using different weight matrices: (a) the training errors weight the coefficient estimation errors by the sample covariance matrices; (b) the testing errors weight the coefficient estimation errors by the population covariance matrices (centered around the sample means); and (c) the squared L2 estimation error uses identity weight matrices Σ_a = I for a ∈ {f, e}. The asymptotic limit in (2.9) shows the bias of the shrinkage estimator of the fixed effects θ, and that in (2.10) shows the bias-variance tradeoff for the random-effects estimation.
Remark 1 (PCR). Plugging in the shrinkage function δ̂(x) = 1[x > λ_{r+1}] for the oracle PCR estimator yields a limit increasing linearly in ||β||², as depicted in Figure 1, where the linearity follows from the facts that ||β||² = τ² + o_P(1) and that the random slope satisfies F_p(λ_{r+1}) = p^{-1} tr(Σ_e) + o_P(1). In fact, this linear pattern generalizes to any PCR estimator using a finite number r̃ ≥ r of principal components, as adding a few idiosyncratic components does not change its asymptotic behavior in dense models.
Note that the candidate shrinkage function δ̂ is arbitrary on the positive line as long as it is "honest" in the sense that it depends only on the covariates, not on the target variable, before selection. Our limit theorem shows that shrinking the projections of the target vector onto the first r principal components can only increase the predictive loss asymptotically via the squared bias (2.9). We should leave the leading principal components asymptotically unshrunk by choosing a shrinkage function such that 1 − δ̂(λ_i) →_P 0 for all i = 1, ..., r, so that the fixed-effects estimator is consistent in the sense that ||θ̂(δ̂) − θ||² →_P 0. We will discuss how to choose the optimal (ridge) estimator satisfying this condition later on. For now, we focus on this subset of shrinkage functions and study the limiting predictive loss (2.11) in the following corollary, where we integrate the limiting error over the entire domain of F_p for notational convenience, although it is asymptotically equivalent to integrating over only the subinterval [0, λ_{r+1}].
If, in addition, λ_max(Σ_e) is bounded by some absolute constant with probability one and F_p converges vaguely to a nonrandom limit F, then Corollary 1 remains true with F_p replaced by the nonrandom limit F for every continuous shrinkage function δ. In particular, we have the following special cases.
Corollary 2. Let the signal length τ² and the error variance σ² be constants, and generate the new variables e_oos independently of the training data. Suppose that all the entries of e_i and e_oos are from an iid array with mean zero and variance σ_e². Then Corollary 1 implies (2.12) for every continuous shrinkage function δ such that lim_{x→∞} δ(x) = 1, where F is the Marčenko and Pastur (1967) distribution. For generality, we use the stochastic approximation (2.11) in accordance with the random nature of the predictive loss. We shall show that the optimal (ridge) estimator is asymptotically invariant to the random measure F_p (or its limit F, if any), and therefore the optimality of ridge regression does not rely on the spectral convergence of large-dimensional covariance matrices.
First, consider the degenerate scenario in which the PCR model is correct with τ² = 0, or more generally τ² →_P 0. One may still use a ridge estimator to minimize the limiting error (2.11) with the shrinkage function δ(x) = x/(x + γ) such that the ridge penalty γ = γ_n diverges at a slower rate than the smallest spiked eigenvalue λ_r →_P ∞. To our knowledge, this finding first appeared formally in De Mol, Giannone, and Reichlin (2008), who studied the PCR model with a_p diverging at a linear rate in p. Such a ridge estimator is a smoothed approximation of the PCR estimator, performing an asymptotic selection of principal components in the sense that λ_i/(λ_i + γ) →_P 0 for all the non-spiked eigenvalues {λ_i : i ≥ r + 1}, which are stochastically bounded, as γ →_P ∞. In practice, one may select such a ridge penalty coefficient by cross-validation, and we postpone the details to the next section. The optimal predictive loss vanishes in probability, as we can estimate the fixed effects consistently and the random effects are negligible.
A more interesting case is the augmented model with an idiosyncratic signal length τ² > 0 bounded away from zero (stochastically). Then minimizing the asymptotic limit in (2.11) is equivalent to minimizing the integrand pointwise: for each x ∈ (0, ∞), we choose δ(x) as the minimizer of a convex quadratic function ℓ_x(δ). By solving the first-order condition ℓ_x'(δ) = 0, it is elementary to show that the optimal shrinkage function is given by

δ*(x) = x / (x + γ*), with γ* = (p/n) · (σ²/τ²), (2.13)

which corresponds to a ridge estimator with the penalty coefficient γ*, that is, with W(δ*) = γ* I_p in (2.5). One can interpret γ* as the noise-to-signal ratio σ²/τ² scaled by the aspect ratio p/n. Note that the optimal predictive loss does not vanish in this case, as the high-dimensional random effects β cannot be estimated consistently. Since the ridge coefficient γ* is now (stochastically) bounded, the optimal shrinkage function (2.13) naturally satisfies the condition that 1 − δ*(λ_i) →_P 0 as the spiked eigenvalues λ_i →_P ∞ for 1 ≤ i ≤ r, leading to a consistent estimation of the fixed effects θ.
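As a quick numerical companion (our own illustration, with arbitrary example inputs), the optimal penalty and shrinkage just derived are one-liners: the penalty is the noise-to-signal ratio scaled by the aspect ratio, and large (spiked) eigenvalues are left essentially unshrunk.

```python
def oracle_ridge_penalty(p, n, sigma2, tau2):
    """Optimal ridge penalty gamma* = (p / n) * sigma^2 / tau^2,
    i.e., the noise-to-signal ratio scaled by the aspect ratio."""
    return (p / n) * sigma2 / tau2

def optimal_shrinkage(x, gamma_star):
    """Optimal shrinkage function delta*(x) = x / (x + gamma*)."""
    return x / (x + gamma_star)

# With p = n and a unit noise-to-signal ratio, gamma* = 1, so an
# eigenvalue lambda is shrunk by lambda / (lambda + 1); spiked (large)
# eigenvalues are barely shrunk, preserving the fixed-effects estimate.
gamma_star = oracle_ridge_penalty(p=100, n=100, sigma2=1.0, tau2=1.0)
```

Doubling the idiosyncratic signal τ² halves γ*, so a stronger dense signal calls for lighter shrinkage, consistent with the bias-variance tradeoff of the random effects.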
Remarkably, the optimal shrinkage function (2.13) is the same for all our predictive loss functions sharing the approximate form (2.11). We summarize the optimality of ridge regression in the following corollary.

Corollary 3 (Optimality of ridge regression). Consider any predictive loss L admitting the asymptotic approximation in Theorem 1. For an arbitrarily small ζ > 0 and any candidate shrinkage function δ̂ = δ̂_n from Theorem 1, the optimal ridge estimator (2.13) satisfies L(δ*) ≤ L(δ̂) + ζ with probability approaching one, where the spiked eigenvalues may be separated using the weighted average Σ_{i=1}^{r_max+1} w_i log λ_i, with arbitrary fixed weights w_i ∈ (0, 1) such that Σ_{i=1}^{r_max+1} w_i = 1 for any pre-specified bound r_max ≥ r. However, in general, only the fixed effects are consistently estimated, in the sense that ||θ̂(δ*) − θ||² →_P 0, and the optimal ridge estimator trades off the estimation bias and variance of the random effects.
Remark 2. Corollary 3 generalizes to heteroscedastic errors with different variances σ_i² = var(ε_i) under stronger conditions on the covariates {x_i}, in which case one can relax expression (2.13) by replacing σ² with the average variance σ̄² = (1/n) Σ_{i=1}^n σ_i². Due to space limits, we postpone the details to Appendix A.8 in the supplementary materials.

Uniform Consistency of Bias-Corrected k-Fold Cross-Validation
This section establishes the uniform consistency of the k-fold cross-validation (CV) algorithm for selecting the best ridge estimators, subject to a straightforward bias correction. Our asymptotic limits of the cross-validated mean squared error are uniform over all positive ridge penalties bounded away from zero, including those diverging to infinity. Throughout, we use a fixed number of folds k ≥ 2. Generally speaking, there are both computational and statistical advantages to using k-fold CV rather than leave-one-out CV; see, for example, Section 5.1.3 of James et al. (2021). On the other hand, using a small number of folds may introduce a bias in the selection if the optimal parameter depends on the sample size of the training set.
To fix ideas, the procedure starts by partitioning the entire index set {1, ..., n} into k mutually exclusive subsets I_1, ..., I_k. For generality, we treat the partition {I_j : j = 1, ..., k} as fixed for any given n. When the folds are randomly created, our setup is equivalent to conditioning on a particular realization. One simple choice is to divide the index set into blocks, using consecutive indexes in each fold. In general, it is unnecessary to use blocks, and we may assign indexes randomly to the folds. Following common practice, we generate folds of approximately equal size such that n_j/n → 1/k, where n_j := |I_j| denotes the cardinality of the index subset I_j.
Let ỹ_i = y_i − ȳ and x̃_i = x_i − x̄ be the demeaned observations, where ȳ and x̄ are the sample means using all the data. For any fold I_j treated as a validation set, we fit a ridge estimator β̂_{−j}(γ) on the remaining k − 1 folds by minimizing the L2-penalized error

(1/n_{−j}) Σ_{i ∉ I_j} (ỹ_i − x̃_i'β)² + γ ||β||², (2.14)

where n_{−j} = n − n_j denotes the number of training observations and γ > 0 is a candidate ridge penalty coefficient. Recall the demeaned design matrix X = [x̃_1, ..., x̃_n]' and the demeaned target vector Y = (ỹ_1, ..., ỹ_n)'. Let D_{−j} be the n × n diagonal matrix that selects the observations in the training set I_{−j} := {1, ..., n} \ I_j. Solving the quadratic optimization problem (2.14) yields the closed-form solution

β̂_{−j}(γ) = (X'D_{−j}X + n_{−j} γ I_p)^{-1} X'D_{−j}Y. (2.15)

The mean squared error on the observations in the held-out fold I_j is given by

CV_j(γ) := (1/n_j) Σ_{i ∈ I_j} (ỹ_i − x̃_i'β̂_{−j}(γ))².

We repeat the procedure for j = 1, ..., k and average these errors to compute the k-fold cross-validation MSE, CV^(k)(γ) := (1/k) Σ_{j=1}^k CV_j(γ). Then we choose the value of γ over a sufficiently fine grid that minimizes CV^(k)(γ). Note that with ridge regression we need neither to estimate the factor scores nor to determine the number of factors.
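A minimal implementation of this k-fold procedure (a numpy-only sketch with consecutive-index block folds, not the authors' code) follows the steps above: demean with full-sample means, fit the closed-form training-fold ridge estimator, and average the held-out squared errors.

```python
import numpy as np

def kfold_cv_ridge(X, y, gammas, k=5):
    """k-fold cross-validation MSE curve CV(gamma) for ridge regression,
    demeaning with full-sample means and using block folds I_1, ..., I_k."""
    n, p = X.shape
    xt, yt = X - X.mean(0), y - y.mean()
    folds = np.array_split(np.arange(n), k)      # consecutive-index folds
    cv = np.zeros(len(gammas))
    for idx in folds:
        train = np.setdiff1d(np.arange(n), idx)  # training set I_{-j}
        Xtr, ytr = xt[train], yt[train]
        G = Xtr.T @ Xtr
        for a, gam in enumerate(gammas):
            # Closed-form training-fold ridge estimator, as in (2.15).
            bhat = np.linalg.solve(G + len(train) * gam * np.eye(p),
                                   Xtr.T @ ytr)
            cv[a] += np.mean((yt[idx] - xt[idx] @ bhat) ** 2)
    return cv / k                                # average over the k folds
```

A grid search such as `gammas[np.argmin(kfold_cv_ridge(X, y, gammas))]` then yields the regular CV penalty, before the bias correction discussed below Theorem 2.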
To apply factor analysis to the training subsamples rather than the full sample, we assume one additional regularity condition.

Assumption 6. Let $\Sigma_{f,-j}$ denote the training covariance matrix of the latent factor scores. For every $j = 1, \ldots, k$, the eigenvalues of the $r \times r$ matrix $\Sigma^{1/2} \Sigma_{f,-j} \Sigma^{1/2}$ converge in probability to distinct positive limits.

The condition allows us to reparameterize the factor scores and loadings on the training set so as to maintain orthogonal structures similar to those in Assumption 2. More specifically, consider the spectral decomposition $\Sigma^{1/2} \Sigma_{f,-j} \Sigma^{1/2} = Q_{-j} \Lambda_{-j} Q_{-j}^\top$, where $\Lambda_{-j}$ is a diagonal matrix and $Q_{-j}$ is an orthogonal matrix. Define the demeaned matrix of the idiosyncratic information $\tilde E := [e_1 - \bar e, \ldots, e_n - \bar e]^\top$. The training design matrix then obeys an approximate factor model whose rotated factors and loadings satisfy the orthogonality conditions $F_{-j}^\top F_{-j}/n_{-j} = I_r$ and $\Lambda_{-j}^\top \Lambda_{-j}/a_p \to \operatorname{diag}(\sigma_{11,-j}, \ldots, \sigma_{rr,-j})$.
Assumption 6 holds under the stronger condition that $\Sigma_{f,-j} \xrightarrow{P} I_r$ for all $j = 1, \ldots, k$, in which case the limiting eigenvalue $\sigma_{ii,-j} = \sigma_{ii}$. If the $f_i$ are iid random vectors with an identity covariance matrix, the condition holds almost surely by the strong law of large numbers. When $\{f_i\}$ is a (multivariate) time series and the folds are blocks, that is, each time index subset $I_j$ corresponds to a continuous time period, the condition follows from ergodicity for any stationary ergodic process $\{f_i : i \in \mathbb Z\}$ with zero mean and identity covariance matrix, where $\mathbb Z$ denotes the set of integers; see Blum and Eisenberg (1975) for more discussion. Assumption 6 then holds with probability approaching 1. Furthermore, when the folds are randomly generated, the results remain true if the projection process $\{w^\top f_i : i \in \mathbb Z\}$ is stationary and strong mixing for every unit vector $w \in \mathbb R^r$, by combining the theorem in Blum and Hanson (1960) with the Cramér-Wold device. See Bradley (1986) and Bradley (2005) for various types of sufficient conditions for strong mixing.
Theorem 2. Suppose that Assumptions 1-4 and 6 hold. Then, uniformly for all positive ridge penalties $\gamma$ bounded away from 0, the cross-validated MSE $\mathrm{CV}^{(k)}(\gamma)$ converges to a deterministic limit characterized by an (improper) empirical distribution $F^{(k)}_p$ depending only on the covariates $\{x_i\}$ and by the effective training sample size $n_{\mathrm{eff}} = \frac{1}{k}\sum_{j=1}^{k} n_{-j}$, provided that $n_j/n$ converges to the same limit for all $j$; here $W_\gamma \in \mathbb R^{r \times r}$ is a random weight matrix that is stochastically bounded and positive definite with probability tending to 1 when $r > 0$. The closed-form expressions of $F^{(k)}_p$ and $W_\gamma$ are given in the supplementary materials.

Let $\hat\gamma_{cv}$ denote the optimal ridge penalty selected by the regular k-fold CV. When $\tau^2 \xrightarrow{P} 0$, the k-fold CV chooses some divergent penalty $\hat\gamma_{cv} \xrightarrow{P} \infty$ to shrink the random-effects estimator toward zero as much as possible, but with $\hat\gamma_{cv} = o_P(a_p)$ (unless $\|\theta\|^2 = 0$) so that the fixed-effects estimation errors vanish as well. Note that any divergent penalty sequence proportional to $\hat\gamma_{cv}$ is also asymptotically optimal. When $\tau^2$ is bounded away from 0, by the same arguments above Corollary 3, one should again minimize the limiting integrand, whose minimizer is proportional to the optimal ridge penalty in Corollary 3. To conclude, the bias-corrected estimator of the optimal ridge penalty, $\hat\gamma = \hat\gamma_{cv}(1 - 1/k)$, satisfies the asymptotic optimality criterion in Corollary 3.

Extension of Main Results
This section discusses two extensions of the main results for handling dependent data. Since our applications are often (although not necessarily) based on time series data, we index the observations by $t = 1, \ldots, n$ instead of $i$ to emphasize the concept of time.
Our first extension is to relax the independence Assumption 3 on regression errors and only require them to be martingale differences.
Assumption 7 (MD Errors). The regression errors $\varepsilon_t = \sigma \eta_t$, where $\{\eta_t, \mathcal F_{nt} : t \in \mathbb Z\}$ is a martingale difference sequence for each $n$ such that $E_{t-1}\eta_t = 0$ and $E_0 \eta_t^2 = 1$, where $E_j[\cdot]$ denotes the conditional expectation given the sigma-algebra $\mathcal F_{nj}$ generated by $\{\eta_s : s \le j\}$, $\{f_t\}$, and $\{e_t\}$. The variance $\sigma^2 = \sigma_n^2 = O_P(1)$ is bounded away from zero almost surely, and may or may not depend on the covariates $\{x_t\}$. For some $\iota > 0$, a uniform $(2+2\iota)$-moment condition on $\{\eta_t\}$ holds.

This MD assumption is required for the theory of k-fold cross-validation with uncorrelated (but dependent) errors using the usual random train-test splits. Note that the conditional variance $E_{t-1}\eta_t^2$ can be stochastic, meaning that we allow for conditionally heteroscedastic errors with stochastic volatilities such as GARCH-type error processes (see, e.g., Davis and Mikosch 2009 for an excellent survey). The last part of the assumption, which allows for heavy-tailed distributions, is needed for the uniform integrability of $\{\eta_t^2\}$. In general the $\{\eta_t\}$ need not even be identically distributed, but if they are, the last part simplifies to $E|\eta_t|^{2+2\iota} = O(1)$.

Theorem 3. All theorems and corollaries in Section 2 remain true when the IID Assumption 3 is replaced with the MD Assumption 7, provided that the temporal dependence conditions on $\{\eta_t\}$ in Appendix A.5 in the supplementary materials hold.
The most striking finding is that, at least for ridge regression, we can in fact split the time series arbitrarily without worrying about the so-called "time leakage" issues in machine learning. In other words, the time ordering between the training and validation data is unimportant, and the usual random train-test splits are allowed.
Our second extension is the factor-augmented autoregressive model (3.1), where $\varepsilon_t$ are martingale differences as in Assumption 7. The first equation is the same as (2.1), while the second equation adds the lagged target variables to (2.2). In forecasting applications, $x_t$ are often lagged variables that are observable before time $t$. Let $z_t = (y_{t-1}, \ldots, y_{t-q})^\top$ collect the lagged target values, and let their coefficient vector be $\phi = (\phi_1, \ldots, \phi_q)^\top$. We can then rewrite the model in the general form (3.2) with regressors $(z_t^\top, x_t^\top)^\top$ and coefficients $(\phi^\top, \beta^\top)^\top$. For simplicity, we assume that the order of autoregression $q$ is finite and known. Unless specified otherwise, we assume that the autoregressive model satisfies the standard stability condition (see, e.g., Chapter 6 of Hayashi 2000, p. 373):

Assumption 8. All the roots of the $q$th-degree polynomial equation $1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_q z^q = 0$ are greater than 1 in absolute value, when $q > 0$. The target variable has a finite second moment, that is, $E[y_t^2]$ is bounded uniformly over $t$.
We extend the ridge regression to allow for the nuisance variables $z_t$ by minimizing the penalized mean squared error over the parameters $(\alpha, \phi^\top, \beta^\top)^\top$:
$$\min_{\alpha, \phi, \beta}\ \sum_{t=1}^{n} \big(y_t - \alpha - z_t^\top \phi - x_t^\top \beta\big)^2 + \gamma \|\beta\|^2,$$
where $\gamma > 0$ is some candidate ridge penalty that may depend on $\{x_t\}$. When $q = 0$, this reduces to the standard ridge problem. Let $\tilde Z = [z_1 - \bar z, \ldots, z_n - \bar z]^\top$ be the demeaned design matrix of autoregressors with $\bar z = n^{-1}\sum_{t=1}^{n} z_t$, and define the projection matrix onto its column space by $P_Z = \tilde Z(\tilde Z^\top \tilde Z)^{-1}\tilde Z^\top$. Solving the quadratic optimization problem yields a three-step estimator (3.4), which partials the autoregressors out of the target and the covariates, applies ridge regression to the partialled data, and then recovers the autoregressive coefficients $\hat\phi$. We are interested in minimizing the predictive loss
$$E\big[(\hat\mu_{n+h} - \mu_{n+h})^2 \mid \mathcal S_n\big] \tag{3.5}$$
for some given forecast horizon $h \ge 1$, where the sigma-algebra $\mathcal S_n$ is generated by the past variables $\{f_t, e_t, \varepsilon_t : t \le n\}$ and the random coefficients $\beta$. An inevitable challenge here is that the target variables in the training and test samples must be dependent due to the autoregression (3.2). Nevertheless, we can generate new covariates in the future according to a paradigm similar to that of Section 2:

Assumption 9 (Testing Paradigm for AR Models). The testing paradigm in Assumption 5 holds with $f_{oos} = f_{n+h}$ and $e_{oos} = e_{n+h}$ for all horizons $h \in \{1, 2, \ldots\}$.

Theorem 4. Under the conditions of Theorem 3, Assumptions 8-9, and the conditions in Appendix A.6 in the supplementary materials, the optimal ridge penalty remains the same as in Corollary 3 for the testing predictive loss (3.5) at all forecast horizons $h \ge 1$.
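To make the partialling idea concrete, here is a minimal Frisch-Waugh-style sketch of a ridge fit with nuisance autoregressors. It is one plausible reading of the three-step estimator, not necessarily the paper's exact formula (3.4), and all names are illustrative.

```python
import numpy as np

def partialled_ridge(y, Z, X, gamma):
    """Ridge on covariates X after projecting the autoregressors Z out
    (a Frisch-Waugh-style sketch, not necessarily the paper's exact (3.4))."""
    n, p = X.shape
    yc = y - y.mean()
    Zc = Z - Z.mean(axis=0)
    Xc = X - X.mean(axis=0)
    P_Z = Zc @ np.linalg.solve(Zc.T @ Zc, Zc.T)   # projection onto col(Z)
    M_Z = np.eye(n) - P_Z                         # annihilator of col(Z)
    # ridge coefficient on the partialled covariates
    beta = np.linalg.solve(Xc.T @ M_Z @ Xc + gamma * np.eye(p),
                           Xc.T @ M_Z @ yc)
    # recover autoregressive coefficients by least squares on the residuals
    phi = np.linalg.solve(Zc.T @ Zc, Zc.T @ (yc - Xc @ beta))
    return phi, beta
```

As a sanity check, when $\gamma \to \infty$ the ridge coefficient vanishes and $\hat\phi$ reduces to the least-squares fit of $y$ on $Z$ alone.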
Nevertheless, the classic results for the standard k-fold cross-validation do not extend to autoregressive models with large-dimensional covariates. There are some recent findings in Bergmeir, Hyndman, and Koo (2018) for simple autoregressive models with no exogenous variables, but their results do not apply in the presence of large-dimensional covariates. Hence, we suggest removing the autoregressive components before the cross-validation using the following algorithm:

Step 1 Let $\hat\phi$ be some preliminary estimator of the autoregressive coefficients $\phi$, and construct the autoregressive residuals $\hat r_t = (y_t - \bar y) - (z_t - \bar z)^\top \hat\phi$. Apply the k-fold cross-validation to select the optimal ridge penalty by regressing the autoregressive residuals $\hat r_t$ on the demeaned covariates $x_t - \bar x$ using the standard ridge estimator.
Step 2 Calculate the bias-corrected ridge penalty $\hat\gamma = \hat\gamma_{cv}\big(1 - \frac{1}{k}\big)$, where $\hat\gamma_{cv}$ is the optimal ridge penalty selected in Step 1.
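The two steps above can be sketched as follows, assuming a least-squares preliminary fit of $\phi$ in Step 1 and block folds; the function name and penalty grid are illustrative.

```python
import numpy as np

def bias_corrected_cv_penalty(y, Z, X, gammas, k=5):
    """Step 1: CV ridge on the AR residuals; Step 2: multiply by (1 - 1/k)."""
    n, p = X.shape
    yc = y - y.mean()
    Zc = Z - Z.mean(axis=0)
    Xc = X - X.mean(axis=0)
    phi, *_ = np.linalg.lstsq(Zc, yc, rcond=None)   # preliminary AR estimate
    r = yc - Zc @ phi                               # autoregressive residuals
    folds = np.array_split(np.arange(n), k)
    cv = []
    for gamma in gammas:
        err = 0.0
        for idx in folds:
            tr = np.setdiff1d(np.arange(n), idx)
            beta = np.linalg.solve(Xc[tr].T @ Xc[tr] + gamma * np.eye(p),
                                   Xc[tr].T @ r[tr])
            err += np.mean((r[idx] - Xc[idx] @ beta) ** 2)
        cv.append(err / k)
    gamma_cv = gammas[int(np.argmin(cv))]
    return gamma_cv * (1 - 1 / k)                   # Step 2 bias correction
```

With $k = 5$ the correction simply rescales the selected penalty by $4/5$.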
Theorem 5. Under the conditions of Theorem 4, Theorem 2 remains true provided that the preliminary estimator $\hat\phi$ is consistent for $\phi$.
In the rest of this section, we discuss how to obtain a consistent estimator $\hat\phi$ under our mixed-effects models. We show in Appendix A.7 of the supplementary materials that the naive least-squares estimator using the autoregressors alone (i.e., neglecting the covariates) is consistent if the latent factors are asymptotically uncorrelated over time. To avoid this restrictive condition, we suggest instead using the estimator $\hat\phi$ from a preliminary ridge regression (3.4), according to the following theorem. We call this approach "double" ridge regression because we apply ridge regression twice, once in Step 1 and once more in Step 3, using different penalties.
Theorem 6 (Double ridge regression). Suppose that the conditions of Theorem 4 hold. The ridge estimator $\hat\phi$ defined in (3.4) is consistent for $\phi$ if $\gamma = o_P(a_p)$ and $\gamma$ is bounded away from 0 with probability approaching 1, where $\gamma = \gamma_n$ may depend on the covariates $\{x_t\}$. When $\sum_{i=1}^{r} \theta_i^2 = 0$, the result remains true even if $\gamma = \gamma_n$ diverges at the same rate as $a_p$ or faster. In particular, the following choices all satisfy the consistency condition:
1. Any constant ridge penalty $\gamma > 0$ not depending on $n$;
2. Any diverging ridge penalty of the form $\gamma = \exp\big(\sum_{i=1}^{r_{\max}+1} w_i \log \lambda_i\big)$, where $r_{\max}$ is some prespecified upper bound on the true number of factors $r$ and $w_i \ge 0$ are arbitrary fixed constant weights such that $\sum_{i=1}^{r_{\max}+1} w_i = 1$ and $w_{r_{\max}+1} > 0$.
An interesting implication of this theorem is that, even with a possibly inconsistent estimator of $\phi$ in Step 1 (e.g., the least-squares estimator omitting covariates), the final estimator $\hat\phi$ in Step 3 recovers consistency as long as the ridge penalty from Step 2 is not too large in the sense of Theorem 6. Unless specified otherwise, we use the ridge penalty $\gamma = \lambda_{r_{\max}+1}$ for our preliminary ridge regression in the numerical analysis.
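The eigenvalue-based penalties in Theorem 6 can be computed as below. We take $\lambda_i$ to be the ordered eigenvalues of the demeaned Gram matrix $\tilde X^\top \tilde X$; this reading of the notation is our assumption, since the line defining $\lambda_i$ is not shown here.

```python
import numpy as np

def preliminary_penalty(X, r_max):
    """Default choice gamma = lambda_{r_max+1}: the (r_max+1)-th largest
    eigenvalue of the demeaned Gram matrix (our reading of lambda_i)."""
    Xc = X - X.mean(axis=0)
    eig = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]   # descending order
    return eig[r_max]            # 0-based index r_max = (r_max+1)-th largest

def log_linear_penalty(X, weights):
    """General form in Theorem 6: gamma = exp(sum_i w_i * log(lambda_i))."""
    Xc = X - X.mean(axis=0)
    eig = np.sort(np.linalg.eigvalsh(Xc.T @ Xc))[::-1]
    w = np.asarray(weights)
    return float(np.exp(np.sum(w * np.log(eig[:len(w)]))))
```

Putting all weight on position $r_{\max}+1$ recovers the default choice, which is how the two functions relate.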

Monte Carlo Simulations
In this section, we compare the predictive performance of the ridge estimators and that of their competitors under factor-augmented models using Monte Carlo experiments. We fix the number of predictors $p = 120$ but vary the sample size $n \in \{60, 120, 240\}$, yielding concentration ratios $p/n \in \{2, 1, 0.5\}$. These dimensions are calibrated from our empirical analysis, where $n \approx p = 120$. The results for a fixed sample size $n = 120$ but a much larger number of predictors $p \in \{480, 840, 1200\}$ are available in Appendix C.3 in the supplementary materials.
We calibrate the data-generating process of the latent factors $\{f_t\}$ from a vector autoregressive (VAR) model with four lags, and the factor loading matrices from a multivariate Gaussian distribution, using the estimated scores and loadings of $r = 7$ factors from the FRED-MD database in Section 5. The details of the calibration procedure are available in Appendix B.1 of the supplementary materials. We then generate exogenous predictors from the factor model, where the entries of $\varsigma_t = (\varsigma_{t,1}, \ldots, \varsigma_{t,p})^\top$ are independently generated from the standard normal distribution, $\Sigma$ is a diagonal covariance matrix in the style of Bai and Silverstein (1998) (with 20% of its eigenvalues equal to $\lambda$, 40% equal to $3\lambda$, and 40% equal to $10\lambda$, normalized to match the calibrated total idiosyncratic variance), and the rotation matrix of the idiosyncratic components is generated uniformly from the class of orthogonal matrices. The factor scores $f_t$ are unobservable to the statistician, who always approximates them by $\sqrt n$ times the leading eigenvectors of $\tilde X \tilde X^\top$ when needed. We then generate the target variable from the factor-augmented autoregressive model (3.1) of order $q = 3$, where we set a zero intercept $\phi_0 = 0$ without loss of generality, and the autoregressive coefficients $\phi = (0.29, 0.07, 0.11)^\top$ are calibrated from the monthly growth of the U.S. industrial production index. In the supplementary materials we repeat the analysis for the case with no autoregressor (i.e., $q = 0$), and the conclusions remain qualitatively the same. The entries of the factor coefficients $\theta = (\theta_1, \ldots, \theta_7)^\top$ are generated from the uniform distribution on $(0, 1)$. All these coefficients are unknown to the statistician, who always demeans the predictors in each sample and estimates all the coefficients. We consider two data-generating processes for the regression errors:
1. independent errors $\varepsilon_t$ from the standardized Student $t_5$ distribution;
2. GARCH(1,1) errors.

Without loss of generality, we maintain a long-run variance $\sigma^2 = 1$ in both cases. The errors are independent of the covariates. The results are similar for the two types of errors. To save space, we report only the results for GARCH errors in the main document and postpone the results for independent $t_5$ errors to Appendix C in the supplementary materials.
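For illustration, GARCH(1,1) errors with a unit long-run variance can be simulated as below; the parameter values $\alpha = 0.05$, $\beta = 0.9$ are ours for illustration, not the paper's calibrated values, and the intercept is pinned down by the unit-variance normalization.

```python
import numpy as np

def garch_errors(n, alpha=0.05, beta=0.9, seed=None):
    """Gaussian GARCH(1,1): h_t = omega + alpha*eps_{t-1}^2 + beta*h_{t-1},
    with omega = 1 - alpha - beta so the long-run variance is exactly 1."""
    rng = np.random.default_rng(seed)
    omega = 1.0 - alpha - beta       # unit long-run variance normalization
    eps = np.empty(n)
    h = 1.0                          # start at the stationary variance
    for t in range(n):
        eps[t] = np.sqrt(h) * rng.standard_normal()
        h = omega + alpha * eps[t] ** 2 + beta * h
    return eps

e = garch_errors(200_000, seed=0)
```

The sample variance of a long simulated path should be close to 1, which is a quick check of the normalization.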
Table 1 reports the median, over 5000 replications, of the testing errors $(\hat\mu_{n+h} - \mu_{n+h})^2$ at horizon $h = 1$ (outside brackets) and the training errors $\frac{1}{n}\sum_{t=1}^{n}(\hat\mu_t - \mu_t)^2$ (inside brackets) for the following seven estimators:
1. PCR: Principal component regression using the true number of factors $r = 7$;
2. KS2011: Factor-adjusted LASSO estimator proposed by Kneip and Sarda (2011), with the $L_1$ penalty parameter selected by k-fold cross-validation;
3. PCR-CV: PCR using the number of factors selected by k-fold cross-validation;
4. SRR: Single ridge regression with the ridge penalty parameter selected by the bias-corrected k-fold cross-validation procedure proposed in the last section, using the least-squares estimator of the autoregressive coefficients in Step 1;
5. FaSRR: Factor-adjusted ridge regression that shrinks the eigenvalues as SRR does, except for the largest $r$ eigenvalues;
6. DRR: Double ridge regression with the ridge penalty parameter selected by the bias-corrected k-fold cross-validation procedure, using the preliminary ridge regression of the autoregressive coefficients in Step 1 given in Theorem 6;
7. FaDRR: Factor-adjusted ridge regression that shrinks the eigenvalues as DRR does, except for the largest $r$ eigenvalues.

The estimators SRR and DRR are what we propose in the last section, except that they use different estimators of the autoregressive coefficients. The SRR estimator is slightly worse than DRR overall due to a larger estimation error in the autoregressive residuals used by the cross-validation algorithm. The other estimators PCR, KS2011, FaSRR, and FaDRR are oracle in the sense that they use the true number of factors $r = 7$. They can be feasible in practice when the number of factors can be consistently selected; see, for example, Bai and Ng (2002). In contrast, the SRR and DRR estimators do not require estimating the number of factors but use all raw covariates. More implementation details of these algorithms are available in Appendix B.2 of the supplementary materials.
When the idiosyncratic signal $\tau^2 = 0$, the PCR and factor-adjusted ridge estimators usually perform the best. Although the SRR and DRR estimators suffer from more fixed-effects estimation bias due to the shrinkage, they still show good performance and remain robust against the absence of idiosyncratic signals. The LASSO estimator KS2011 is the worst and suffers from variable selection errors in dense models, especially when $p > n$.
On the other hand, when $\tau^2 > 0$, our ridge estimators SRR and DRR consistently outperform the sparse learning methods PCR, KS2011, and PCR-CV in terms of both testing and training errors, in accordance with our asymptotic theory. The advantages of the ridge estimators become more significant as the signal size $\tau^2$ grows.
It is noteworthy that direct factor adjustments cannot improve the ridge regression when $\tau^2 > 0$: FaSRR and FaDRR violate the full-rank condition (Appendix A.6) on the residual factor matrix by removing all the in-sample factor information asymptotically. As a result, the residual ridge regression fails to shrink the divergent loading matrix (via Lemma A7 in the supplementary materials), leading to an inflation of the fixed-effects estimation errors.
To conclude, the simulation results confirm that our double ridge regression with bias-corrected cross-validation is a robust forecasting method for factor-augmented models with potential idiosyncratic information and nuisance variables. It consistently delivers the best performance and nontrivial improvements in dense models with large $\tau^2$, while remaining robust against sparse idiosyncratic information with small $\tau^2$.

A Real-life Example
One empirical application of dense learning is forecasting the monthly growth rate of the U.S. industrial production index using a large set of macroeconomic variables from the FRED-MD database, publicly available at https://research.stlouisfed.org/econ/mccracken/fred-databases/.
This monthly database has predictive content similar to that in Stock and Watson (2002), and it is regularly updated through the Federal Reserve Economic Data. The industrial production index is an important indicator of macroeconomic activity. We calculate its monthly growth rate in percentage, where $\mathrm{IP}_t$ denotes the U.S. industrial production index for month $t$. Our sample spans from January 1959 to August 2021, and all the data are available on a monthly basis. We transform the nonstationary economic variables into stationary forms and remove data outliers using the MATLAB code provided on the website mentioned above; see also McCracken and Ng (2016) for details.
Each month, we forecast the next-month growth using $q = 0, 1, \ldots, 6$ autoregressor(s) and (the principal components of) all the other covariates in the database. We estimate the factors and regression coefficients in rolling windows of sample size $n = 120$, which equals a time span of ten years. In each window we use the lagged variables with no missing values for training and forecasting. The covariates are normalized within each window, and we transform the latest observation accordingly. Table 2 compares the out-of-sample mean squared forecast errors of the rolling-window predictions using the following six methods: purely autoregressive (AR), principal component regression (PCR), factor-adjusted LASSO regression (KS2011), principal component regression with the number of components selected by cross-validation (PCR-CV), the "single" ridge regression (SRR), and the "double" ridge regression (DRR) in Theorem 6. We report only the results using 5-fold cross-validation after bias correction, which are slightly better than those before correction. The implementation details are available in Appendix B.3 of the supplementary materials. The mean squared errors are normalized by those of the moving-average forecasts (i.e., the autoregressive estimator with $q = 0$) for comparison.
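The rolling-window scheme with within-window standardization can be sketched as follows. This is a simplified sketch using plain ridge with a fixed penalty; the paper's procedure additionally selects the penalty by bias-corrected cross-validation within each window.

```python
import numpy as np

def rolling_ridge_forecasts(y, X, gamma, window=120):
    """One-step-ahead ridge forecasts in rolling windows of fixed length;
    covariates are standardized with each window's own mean and std, and
    the latest observation is transformed with the same statistics."""
    n, p = X.shape
    preds = []
    for t in range(window, n):
        Xw, yw = X[t - window:t], y[t - window:t]
        mu, sd = Xw.mean(axis=0), Xw.std(axis=0)
        sd[sd == 0] = 1.0                      # guard constant columns
        Xs = (Xw - mu) / sd
        beta = np.linalg.solve(Xs.T @ Xs + gamma * np.eye(p),
                               Xs.T @ (yw - yw.mean()))
        x_new = (X[t] - mu) / sd               # transform latest observation
        preds.append(yw.mean() + x_new @ beta)
    return np.array(preds)
```

With a sample of length $n$ this produces $n - 120$ forecasts, one per rolling window.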
Our double ridge estimator DRR systematically improves upon the PCR estimators across the autoregressive orders $q \ge 1$, while the factor-adjusted LASSO estimator KS2011 consistently underperforms PCR. The cross-validation procedure hardly improves the PCR predictions. These findings suggest that the underlying model tends to be dense rather than sparse in terms of idiosyncratic information. The DRR estimator consistently improves the prediction errors relative to SRR, which is consistent with our theory that DRR is more robust against complex covariate structures; see Theorem 6 and the discussion above it.

Concluding Remarks
This article studies high-dimensional ridge regression methods under factor models. Our theory of the efficiency and robustness of ridge regression in capturing both factor and idiosyncratic information simultaneously helps explain its excellent performance in many complicated applications. Our results apply to large-dimensional datasets exhibiting spiked eigenstructure, serial dependence, conditional heteroscedasticity, and heavy tails simultaneously. In particular, we extend the theory of k-fold cross-validation beyond the iid setting and provide theoretical support for its application to more complex data.
One can compare the factor-augmented models (1.3)-(1.4) with the following synthetic control models studied by Agarwal et al. (2021) (see also Abadie, Diamond, and Hainmueller 2010; Amjad, Shah, and Shen 2018; Arkhangelsky et al. 2021), in which $X$ is the corrupted covariate matrix, the latent factor matrix $A$ is the true covariate matrix of low rank $r$ with a strong signal $\|A\|_F \gg \|H\|_F$, $\varepsilon$ contains response noises, and $\phi$ are model misspecification errors. Rather than treating the misspecification error $\phi$ as deterministic and "inevitable" (see the interpretation below their Theorem 3.1 on p. 1735), here we model $\phi = H\Gamma\beta$ as a projection of the noise matrix $H$ through the random effects $\beta$ and an arbitrary rotation $\Gamma$. Therefore, the RR can outperform PCR by optimizing the bias-variance tradeoff on the misspecification errors $\phi$. An exciting future research topic is to allow for weak factors in (6.1) such that $\|A\|_F \asymp \|H\|_F$.

One may wonder whether our findings generalize beyond the random matrix regime. In higher dimensions with $\kappa_n := \max\{p/n, 1\} = p/n \to \infty$, our results remain true upon generalizing Assumption 2 as follows: $a_p/\kappa_n \to \infty$ (which is trivial if $a_p$ is linear in $p$) and $\lambda_{\max}(\Sigma_E) = O_P(\kappa_n)$. To see this, first apply our results to the rescaled covariates $x_i \mapsto x_i/\sqrt{\kappa_n}$ with corresponding coefficients $\beta \mapsto \beta\sqrt{\kappa_n}$, and then transform back to the original scale. In lower dimensions with $p/n \to 0$ (but $p \to \infty$ sufficiently fast) and $\tau^2 > 0$, the optimal penalty $\gamma^* \to 0$ in (2.13), meaning that the optimal ridge estimator should approximate the least-squares estimator and outperform PCR. Otherwise, the optimal RR asymptotically reduces to the PCR when $\tau^2 = 0$; see Figure A.2 in Appendix A.9 for an illustration. A non-asymptotic comparison between PCR and RR is possible using concentration inequalities from martingale limit theory, but we leave this as future work.
Finally, our analysis assumes that the factor strength $a_p \to \infty$, so the standard PCA is (subspace) consistent. Is the RR still favorable otherwise? This is an interesting future research topic. Using the simulation example in the Introduction, we can illustrate that the ridge regression still consistently outperforms the PCR when the standard PCA becomes inconsistent. Due to space limits, we leave the detailed illustrations to Appendix A.9 in the supplementary materials, where we also compare the regular and bias-corrected cross-validation and discuss a possible extension to "sparse + dense" idiosyncratic information.

Figure 1. Median predictive loss (left) and change in median sample variance of estimated means (right) as functions of the idiosyncratic signal $\|\beta\|^2$.

Figure 2. Average training and testing predictive losses for RR and Oracle PCR.
$r$ is finite, and $\tilde E = [e_1 - \bar e, \ldots, e_n - \bar e]^\top$ is a large-dimensional random matrix of idiosyncratic components $e_i \in \mathbb R^p$. The matrix $\Gamma$ is a rotation matrix, possibly unknown. The fixed effects $\theta = (\theta_1, \ldots, \theta_r)^\top$ on the factors and the random effects $\beta = (\beta_1, \ldots, \beta_p)^\top$ on the idiosyncratic components are both unknown. (a) The total factor loading $a_p := \mathrm{tr}(\Lambda^\top\Lambda) \to \infty$, and $a_p = O(p)$. (b) $\Lambda^\top\Lambda/a_p$ converges to some $r \times r$ diagonal matrix with diagonal elements $\sigma_{ii} \ge \sigma_{jj} > 0$ for $i < j$. Without loss of generality, $F^\top F/n = I_r$ and $\Gamma = I_p$, where $I_d$ denotes the $d \times d$ identity matrix. (c) $\lambda_{\max}(\Sigma_E) = O_P(1)$. A leading example is the Marchenko-Pastur law with a point mass $F(0) = \max\{0, 1 - 1/c\}$ at the origin and the positive density function $F'(x) = \frac{1}{2\pi c \sigma_e^2 x}\sqrt{(b - x)(x - a)}$ on $[a, b]$ (and zero density elsewhere), with $a = \sigma_e^2(1 - \sqrt c)^2$ and $b = \sigma_e^2(1 + \sqrt c)^2$. More examples satisfying (2.12) with various limits $F$ are available in Appendix A.4 in the supplementary materials.

The closed-form expressions of $F^{(k)}_p$ and $W_\gamma$ are available in Section E.3 of the supplementary materials. The first term $\sigma^2$ is irreducible but irrelevant to the choice of $\gamma$. The second term $\big(\frac{\gamma}{a_p + \gamma}\big)^2 \theta^\top W_\gamma \theta$ represents the cross-validated estimation error of the fixed effects, of the same (stochastic) order as $\big(\frac{\gamma}{a_p + \gamma}\big)^2 \|\theta\|^2$. The rest represents the cross-validated error for the random effects, where $n_{\mathrm{eff}} \approx \frac{k-1}{k} n$ due to the train-test split.

Table 1. The testing errors (and training errors) over 5000 replications for GARCH errors. The smaller values in each row are highlighted in bold. The results for independent $t_5$ errors are similar and available in the supplementary materials.

Table 2. Out-of-sample relative mean squared forecast errors for the growth rate of the U.S. industrial production index. The bold entry marks the smallest error in each column.