Estimation of variance components, heritability and the ridge penalty in high-dimensional generalized linear models

For high-dimensional linear regression models, we review and compare several estimators of the variances $\tau^2$ and $\sigma^2$ of the random slopes and errors, respectively. These variances relate directly to the ridge regression penalty $\lambda$ and to the heritability index $h^2$, often used in genetics. Direct and indirect estimators of the latter, based either on cross-validation (CV) or on maximum marginal likelihood (MML), are discussed as well. The comparisons include several cases of the covariate matrix $\mathbf{X}_{n \times p}$, with $p \gg n$, such as multi-collinear covariates and data-derived ones. In addition, we study robustness against departures from the model, such as sparse instead of dense effects and non-Gaussian errors. An example on weight gain data with genomic covariates confirms the good performance of MML compared to CV. Several extensions are presented. First, to the high-dimensional linear mixed effects model, with REML as an alternative to MML. Second, to the conjugate Bayesian setting, which proves to be a good alternative. Third, and most prominently, to generalized linear models, for which we derive a computationally efficient MML estimator by rewriting the marginal likelihood as an $n$-dimensional integral. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of the resulting MML estimator of $\lambda$ as compared to CV. Software is provided to enable reproduction of all results presented here.


Introduction
Estimation of hyper-parameters is an essential part of fitting high-dimensional Gaussian random effect regression models, also known as ridge regression. These models are widely applied in genomics and genetics applications, where often the number of variables p is much larger than the number of samples n, i.e. p ≫ n.
We initially focus on the linear model. The goal is to estimate the error variance σ² and the random effects variance τ², or functions thereof: in particular the ridge penalty parameter, λ = σ²/τ², or the heritability index, h² = pτ²/(pτ² + σ²). Here, the ridge penalty is used in classical ridge regression to shrink the regression coefficients towards zero [1], whereas heritability measures the fraction of variation between individuals within a population that is due to their genotypes [2]. The estimators of σ² and τ² can be used to estimate λ or h², but also for statistical testing [3]. We review several estimators, based on maximum marginal likelihood (MML), moment equations, (generalized) cross-validation, dimension reduction, or degrees-of-freedom adjustment. Some of these estimators are classical, while others have been introduced only recently.
We systematically review and compare the estimators in a broad variety of high-dimensional settings. For estimation of λ in low-dimensional settings, we refer to [4,5,6]. We address the effect of multi-collinearity and robustness against model misspecifications, such as sparsity and non-Gaussian errors. The comparisons are extended to the linear mixed effects model, with q ≪ n fixed effects added to the model, and to Bayesian linear regression. The linear model part is concluded by a genomics data application to weight gain prediction after kidney transplantation.
The observed good performance of MML in the linear model setting was a stimulus to consider MML for high-dimensional generalized linear models (GLM). MML is more involved here than in the linear model, because of the non-conjugacy of the likelihood and prior. Therefore, approximations are required, such as Laplace ones. While these have been addressed by others [7,8], we derive an estimator which is computationally efficient for p ≫ n settings. For Poisson and Binomial ridge regression, we demonstrate the superior accuracy of MML estimation of λ as compared to cross-validation.
Our software enables reproduction of all results. In addition, it allows comparisons for one's own high-dimensional data matrix by simulating the response conditional on this matrix, as we do for two cancer genomics examples. Computational shortcuts and considerations are discussed throughout the paper, and detailed at the end, including computing times.
3. Joint estimation: estimate σ² and τ² jointly.

Below, we discuss several methods for each of these categories. They have several matrices and matrix computations in common, which we therefore introduce first.

Notation and matrix computations
Throughout the paper, we will use the following notation:

β̂ = β̂_λ = C_λ y = (X^T X + λI_{p×p})^{-1} X^T y, i.e. the linear ridge estimator, and
H = H_λ = XC_λ = X(X^T X + λI_{p×p})^{-1} X^T, i.e. the hat matrix. (2)

Many of the estimators below require calculations on potentially very large matrices. The following two well-known equalities can greatly alleviate the computational burden.
First, C = C_λ, and hence also β̂ and H, can be efficiently computed by using the singular value decomposition (SVD). Decompose X = U_{n×n} D_{n×n} (V_{p×n})^T by SVD, and denote Λ_q = λI_q. Then

β̂ = V(D² + Λ_n)^{-1} D U^T y and H = U D²(D² + Λ_n)^{-1} U^T. (3)

The latter requires inversion of an n × n (here: diagonal) matrix only. Second, the following efficient trace computation for matrix products applies to tr(H) = tr(XC_λ):

tr(AB) = Σ_i Σ_j A_{ij} B_{ji}. (4)

Estimating λ by cross-validation

A benchmark method that is used extensively to estimate λ = σ²/τ² is cross-validation. Here, we use K-fold CV, as implemented in the popular R-package glmnet [13]. Let f(i) denote the set of samples left out for testing at the same fold as sample i. Then, CV-based estimation of λ pertains to minimizing the cross-validated prediction error:

CV(λ) = n^{-1} Σ_{i=1}^n (y_i − X_i β̂_λ^{−f(i)})², (5)

where β̂_λ^{−f(i)} denotes the estimate of β based on training samples {1, …, n} \ f(i) and penalty λ. Note that for leave-one-out cross-validation (n-fold CV) the analytical solution of (5) is the PRESS statistic [14].
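The SVD shortcut and the trace identity described above can be checked with a minimal numpy sketch (our own illustration, not the paper's R software; dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 100, 2.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Direct ridge estimator (2): inverts a p x p matrix.
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# SVD route: X = U D V^T (thin SVD); only n-dimensional quantities are inverted.
U, d, Vt = np.linalg.svd(X, full_matrices=False)   # d has length min(n, p) = n
beta_svd = Vt.T @ (d / (d**2 + lam) * (U.T @ y))   # V (D^2 + lam I)^{-1} D U^T y

# Trace of the hat matrix via the singular values: tr(H) = sum d_i^2 / (d_i^2 + lam)
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
print(np.allclose(beta_direct, beta_svd))                     # True
print(np.isclose(np.trace(H), np.sum(d**2 / (d**2 + lam))))   # True
```

Both routes give the same estimator; the SVD route avoids all p × p operations.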

Estimating λ by Generalized Cross Validation
Generalized Cross Validation (GCV) is a rotation-invariant form of the PRESS statistic. It is more robust than the latter to (near-diagonal) hat matrices H_λ [9]. For the linear model, the criterion is [15]:

GCV(λ) = n^{-1} ||(I_n − H_λ)y||² / (1 − tr(H_λ)/n)², (6)

where the trace of H_λ can be computed efficiently by (4). Then, λ̂_gcv = arg min_λ GCV(λ).
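Since H_λ shares its eigenvectors across all λ, the GCV criterion can be evaluated over a whole grid of λ values from a single n × n eigendecomposition. A minimal numpy sketch (our own illustration; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 200
X = rng.standard_normal((n, p))
y = X @ rng.normal(0, 0.1, p) + rng.standard_normal(n)

K = X @ X.T                                  # n x n; all we need when p >> n
eigval, Q = np.linalg.eigh(K)
yt = Q.T @ y                                 # rotated response

def gcv(lam):
    h = eigval / (eigval + lam)              # eigenvalues of the hat matrix H_lam
    resid2 = np.sum(((1.0 - h) * yt) ** 2)   # ||(I - H_lam) y||^2
    return resid2 / n / (1.0 - np.sum(h) / n) ** 2

grid = 10 ** np.linspace(-2, 4, 121)
lam_gcv = grid[np.argmin([gcv(l) for l in grid])]
print(lam_gcv)
```

Each additional λ value then costs only O(n) operations.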
Then, heritability is estimated by maximizing the corresponding likelihood over h² (8), with ℓ_i and ỹ_i the ith elements of the eigenvalue vector L and of the rotated response ỹ = Q^T y, respectively. The authors provide rigorous consistency results for their estimator, as well as theoretical confidence bounds, also for mixed models and sparse settings.

Basic estimate
A basic, and often used, estimate of σ² is given by [16]:

σ̂² = ||y − Hy||² / ν, (9)

the residual mean squared error. Here, the residual effective degrees of freedom [16] equals ν = n − tr(2H − HH^T), with H as in (2). We also considered (9) with ν = n − tr(H), as in [17], which rendered similar, slightly inferior results.
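A minimal numpy sketch of this estimator (our own illustration; the oracle penalty λ = σ²/τ² is plugged in only to make the demonstration concrete):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 500
tau2, sigma2 = 0.0025, 2.0
X = rng.standard_normal((n, p))
y = X @ rng.normal(0, np.sqrt(tau2), p) + rng.normal(0, np.sqrt(sigma2), n)

lam = sigma2 / tau2                        # oracle penalty, for illustration only
K = X @ X.T
H = K @ np.linalg.inv(K + lam * np.eye(n))  # hat matrix via the n x n route

# residual effective degrees of freedom: nu = n - tr(2H - H H^T)
nu = n - np.trace(2 * H - H @ H.T)
sigma2_basic = np.sum((y - H @ y) ** 2) / nu
print(nu, sigma2_basic)
```

In practice λ is of course unknown, which is exactly why joint estimation of σ² and τ² is attractive.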

PCR-based estimate
The estimator of σ² may also be based on Principal Component Regression (PCR). PCR is based on the eigendecomposition X^T X = Q̃D²Q̃^T. Denoting Z = XQ̃ and α = Q̃^T β, we have y = Zα + ε. Then, Z is reduced from p columns to r ≤ min(n, p) principal components, a crucial step [12]. Using the reduced model, σ² is estimated by the residual mean squared error [12]:

σ̂²_PCR = ||y − Z_r α̂_r||² / (n − r). (10)
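A minimal numpy sketch of the PCR-based estimate (our own illustration; here r is fixed arbitrarily, whereas [12] treats the choice of r as the crucial step):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 50, 400, 20       # r: number of retained principal components
X = rng.standard_normal((n, p))
y = X @ rng.normal(0, 0.05, p) + rng.normal(0, 1.0, n)

# principal component scores via the SVD of X (equivalent to the eigen-
# decomposition of X^T X): Z_r = X Q_r = U_r D_r
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :r] * d[:r]                 # first r PC score columns
alpha_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
sigma2_pcr = np.sum((y - Z @ alpha_hat) ** 2) / (n - r)
print(sigma2_pcr)
```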

Joint estimation of σ² and τ²

MML
An Empirical Bayes estimate of σ² and τ² is obtained by maximizing the marginal likelihood (MML), also referred to as model evidence in machine learning [18]. This corresponds to:

(σ̂², τ̂²) = arg max_{σ², τ²} P(y; σ², τ²). (11)

Since y = Xβ + ε, P(y) is simply derived from the convolution of Gaussian random variables:

y ~ N_n(0, Σ), with Σ = τ² XX^T + σ² I_n. (12)

This is easily maximized over σ² and τ². Note that after computing XX^T, (12) requires operations on n × n matrices only.
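Because Σ = τ²XX^T + σ²I_n shares its eigenvectors for all (σ², τ²), the marginal likelihood can be maximized after a single n × n eigendecomposition. A minimal Python sketch (our own illustration, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, p = 50, 500
tau2_true, sigma2_true = 0.01, 1.0
X = rng.standard_normal((n, p))
y = X @ rng.normal(0, np.sqrt(tau2_true), p) + rng.normal(0, np.sqrt(sigma2_true), n)

K = X @ X.T                       # computed once; all later work is n x n
e, Q = np.linalg.eigh(K)          # K = Q diag(e) Q^T
yt = Q.T @ y

def neg_log_marglik(par):
    log_s2, log_t2 = par          # optimize on the log scale for positivity
    v = np.exp(log_t2) * e + np.exp(log_s2)   # eigenvalues of Sigma
    return 0.5 * (np.sum(np.log(v)) + np.sum(yt**2 / v))

fit = minimize(neg_log_marglik, x0=[0.0, -4.0], method="Nelder-Mead")
sigma2_mml, tau2_mml = np.exp(fit.x)
print(sigma2_mml, tau2_mml)
```

Evaluating the likelihood for any candidate (σ², τ²) then costs only O(n) operations.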

Method of Moments (MoM)
An alternative to MML is to match the empirical second moments of y to their theoretical counterparts. From (12) we observe that the covariances depend on τ² only. Hence, we obtain an estimator of τ² by equating the sum of the y_i y_k to that of the theoretical covariances, with Σ as in (12) and Σ^X = XX^T:

Σ_{i≠k} y_i y_k = τ² Σ_{i≠k} Σ^X_{ik}.

Then, an estimator for σ² is obtained by substituting τ̂² and equating the sum of the y_i² to the sum of theoretical variances, Σ_ii = E[y_i²]:

Σ_i y_i² = τ̂² Σ_i Σ^X_{ii} + nσ².

These equations also hold for non-Gaussian error terms, which could be an advantage over MML.
Moreover, no optimization over σ 2 and τ 2 is required, so MoM is computationally very attractive.
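A minimal numpy sketch of these moment equations (our own illustration; we center the columns of X, in line with the standardization used throughout the paper, since otherwise the off-diagonal sum in the denominator is near-degenerate for independent X):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 1000
tau2, sigma2 = 0.01, 1.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # standardized (centered) variables
y = X @ rng.normal(0, np.sqrt(tau2), p) + rng.normal(0, np.sqrt(sigma2), n)

K = X @ X.T                              # Sigma^X in the text
off = ~np.eye(n, dtype=bool)

# equate the sum of empirical covariances y_i y_k (i != k) to tau^2 * sum K_ik
tau2_mom = np.sum(np.outer(y, y)[off]) / np.sum(K[off])
# equate the sum of y_i^2 to tau^2 * sum K_ii + n * sigma^2, plugging in tau2_mom
sigma2_mom = (np.sum(y ** 2) - tau2_mom * np.trace(K)) / n
print(tau2_mom, sigma2_mom)
```

Both estimates follow in closed form, which is why MoM is computationally so cheap.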

Comparisons
For the linear random effects model (ridge regression) we study the following settings:
• β and ε generated from model (1), independent X
• β or ε generated from non-Gaussian distributions, independent X
• β and ε from model (1), multicollinear X
• β and ε from model (1), data-based X.
As is common for real data, the variables, i.e. the rows of X, were always standardized.

Departures from a normal effect size distribution
We study the robustness of the methods against (sparse) non-Gaussian effect size distributions or error distributions. In sparse settings, many variables do not have an effect. To mimic this, we simulated the β's from a mixture distribution with a 'spike' and a Gaussian 'slab':

β_j ~ p_0 δ_0 + (1 − p_0) N(0, τ_0²).

Here, we set p_0 = 0.9, τ_0² = 0.1, which implies τ² = V(β_j) = (1 − p_0)τ_0² = 0.01, as in the Gaussian β_j setting. Moreover, we also considered Laplace and uniform effect size distributions, where again the parameters are chosen such that E(β_j) = 0 and τ² = V(β_j) = 0.01. Apart from β, all other quantities are simulated as in (14). Results are displayed for σ² = 10, τ² = 0.01, n = 100, p = 1000 in Figure 1(c) for the Laplace (= lasso) effect size distribution and in Supplementary Figure 3 for the spike-and-slab and uniform effect size distributions.
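The spike-and-slab draw and its variance matching can be sketched as follows (illustrative Python, our own code):

```python
import numpy as np

rng = np.random.default_rng(7)
p, p0, tau2_0 = 1000, 0.9, 0.1

# spike-and-slab draw: beta_j = 0 with probability p0, else N(0, tau2_0)
slab = rng.random(p) >= p0
beta = np.where(slab, rng.normal(0, np.sqrt(tau2_0), p), 0.0)

# the marginal variance matches the dense Gaussian setting:
# V(beta_j) = (1 - p0) * tau2_0 = 0.01
print((1 - p0) * tau2_0)   # approximately 0.01 (up to floating point)
print(beta.var())          # empirical variance, close to 0.01
```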
Moreover, we considered heavy-tailed errors by sampling the ε_i from a scaled heavy-tailed distribution, where the scaling is chosen such that σ² = V(ε_i) = 10, as in the Gaussian error setting. Apart from ε, all other quantities are simulated as in (14). Results are displayed in Supplementary Figure 3(c).

Simulated X
Next, the design matrix X is sampled with block-wise correlation. We replace the sampling of X in simulation model (14) by:

X_i ~ N_p(0, Ξ), (16)

where Ξ is a unit-variance covariance matrix with blocks of size p* × p*, with correlations ρ on the off-diagonals within a block. Figure 2(a) shows the results for ρ = 0.5, p* = 10, n = 100, p = 1000.
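This block-correlated design can be sketched as follows (illustrative Python; scipy's block_diag builds Ξ, and a Cholesky factor induces the correlation):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(8)
n, p, p_star, rho = 100, 1000, 10, 0.5

# unit-variance covariance matrix with p* x p* equicorrelated blocks
block = np.full((p_star, p_star), rho)
np.fill_diagonal(block, 1.0)
Xi = block_diag(*([block] * (p // p_star)))

# sample the rows of X from N(0, Xi) via a Cholesky factor of Xi
L = np.linalg.cholesky(Xi)
X = rng.standard_normal((n, p)) @ L.T
print(X.shape)           # (100, 1000)
```

Variables within a block are correlated (ρ = 0.5), variables in different blocks are independent.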

Real data X
Finally, we consider the estimation of τ² and σ² in a high- and a medium-dimensional setting where X are real data, with likely collinear columns. The first data set (TCGA KIRC) contains gene expression profiles (n = 71, p = 18,391); the second (TCPA OV) contains protein expression profiles (n = 408, p = 224); details are given in the Supplementary Material. The results underline the difficulty of separately estimating σ² and τ² in high-dimensional settings, because of their delicate interplay.

MML vs GCV and CV
For the estimation of λ, MML seems slightly superior to GCV and CV. GCV shows more estimates that deviate towards too small values of λ. A Gaussian prior may also serve as a stand-in for the lasso prior with scale parameter 1/λ₁ [19]. The results indicate that MML with a Gaussian prior could be useful to find the lasso penalty, or serve as a fast initial estimate, by simply setting the lasso penalty λ₁ = √2/τ̂, which follows from equating the variance of the lasso prior, 2/λ₁², to τ̂².

Data example
We re-analyse the weight gain data recently discussed in [17]. Details on the data are presented there; we provide a summary. The data consist of expression profiles of n = 26 individuals with kidney transplants, where profiles consist of 28,869 genes as measured by Affymetrix Human Gene 1.0 ST arrays. The data are available in the EMBL-EBI ArrayExpress database (www.ebi.ac.uk/arrayexpress) under accession number E-GEOD-33070. It is known that kidney transplantation may lead to weight gain, and the study [20] investigates whether gene expression can be used to predict this. Such a prediction can be used to decide upon additional measures to prevent excessive weight gain. We reproduced the analysis by [17] as much as possible, including their prior selection of 1,000 genes. Details on minor discrepancies, and an alternative analysis that accounts for the gene selection, are discussed in the Supplementary Material. These did not affect the comparison qualitatively.
In [17], the authors illustrate their focused ridge (fridge) method and compare it with conventional ridge. In short, fridge estimates sample-specific ridge penalties, based on minimizing a per sample mean squared error (MSE) criterion on the level of the linear predictor X i β. Since β is not known, it is replaced by an initial ridge estimate,β λ . Their sample specific penalty then depends on X i , and also on bothλ andσ 2 . The authors use GCV (6) to obtain λ, and a slight variation of (9) to estimate σ 2 . They show that fridge improves upon GCV-based ridge estimation. We wish to investigate whether i) MML estimation of λ = σ 2 /τ 2 also improves the performance of GCV-based ridge regression; and ii) whether MML estimation further boosts the performance of the fridge estimator. Here, predictive performance is measured by the mean squared prediction error (MSPE) using leave-one-out cross-validation (loocv).
The estimates of MML differ markedly from those of GCV: (λ̂_MML, σ̂²_MML) = (0.77, 0.59), while (λ̂_GCV, σ̂²_GCV) = (20.92, 8.08). Using λ̂_MML instead of λ̂_GCV for the estimation of β substantially reduced the mean squared prediction error: MSPE_MML = 14.40, while MSPE_GCV = 16.38, a relative decrease of 12.1%. Using λ̂_GCV, as in [17], fridge also reduced the MSPE, but to a lesser extent: MSPE_fridge = 15.80, a relative decrease of 3.5% with respect to MSPE_GCV. Application of fridge using λ̂_MML did not further decrease MSPE_MML, nor did it increase it. Possibly, the already fairly small value of λ̂_MML left little room for improvement.

Extension 1: Mixed effects model
A natural extension of the high-dimensional random effects model (1) is the mixed effects model:

y = X_f α + X_r β + ε,

where we assume that the n × m design matrix for the fixed effects, X_f, is of low rank, so m ≪ n, as opposed to the random effects design matrix X_r. Restricted maximum likelihood (REML) deals with the fixed effects by contrasting them out. For the error contrast vector y − X_f α̂_OLS = A^T y, the restricted likelihood is that of A^T y ~ N(0, A^T Σ_r A), (18) with Σ_r = τ² X_r X_r^T + σ² I_n. In addition to maximizing (18) as a function of (σ², τ²), we attempted solving the set of two estimating equations suggested by [22], but this rendered unstable results, inferior to maximizing (18) directly.
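The contrast construction and the restricted likelihood (18) can be sketched as follows (illustrative Python; the optimization over (σ², τ²) is omitted for brevity, and all names are ours):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(9)
n, m, p = 100, 10, 1000
Xf = rng.standard_normal((n, m))           # fixed-effects design, m << n
Xr = rng.standard_normal((n, p))           # random-effects design
y = Xf @ rng.normal(0, 0.3, m) + Xr @ rng.normal(0, 0.1, p) + rng.standard_normal(n)

# error contrasts: the columns of A span the orthogonal complement of col(Xf),
# so A^T y is free of the fixed effects
A = null_space(Xf.T)                       # n x (n - m), with A^T Xf = 0
z = A.T @ y

def neg_restricted_loglik(par, Kr=Xr @ Xr.T):
    sigma2, tau2 = np.exp(par)             # log scale for positivity
    S = A.T @ (tau2 * Kr + sigma2 * np.eye(n)) @ A   # covariance of A^T y
    sign, logdet = np.linalg.slogdet(S)
    return 0.5 * (logdet + z @ np.linalg.solve(S, z))

print(neg_restricted_loglik(np.array([0.0, -4.0])))
```

Minimizing neg_restricted_loglik (e.g. with scipy.optimize.minimize) yields the REML estimates.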
Alternatively, MML may be used, but it has to be adjusted to also estimate the fixed effects in the model. This implies replacing 0 in the Gaussian likelihood (11) by X_f α, and optimizing (11) with respect to 2 + m parameters, where m is the number of fixed effects. The mixed model simulation setting uses n = 100, p = 1000, m = 10, p_0 = 0.9, τ_0² = 0.1 (implying variance τ² = (1 − p_0)τ_0² = 0.01 for generating random effects) and p_{0,f} = 0.5, τ_{0,f}² = 0.20 (implying variance τ_f² = 0.1 for generating fixed effects). Note that we focused on a fairly sparse setting for the random effects, and on a larger prior variance for the fixed effects than for the random effects, which enables a stronger impact of the small number of fixed effects. Figure 4 shows the results of REML, MML and CV (by glmnet, using penalty factor 0 for the fixed effects) for the estimation of τ², σ², λ and h². From Figure 4 we observe that REML indeed improves upon MML in terms of bias, however at the cost of increased variability. For the estimation of λ, CV is fairly competitive with REML and MML, although it renders markedly more over-penalization.

Extension 2: Bayesian linear regression
So far, we focused on classical methods. Bayesian methods may be a good alternative. We consider the conjugate Bayesian linear regression model detailed in the Supplementary Material, for which the marginal likelihood is still analytical [23], and so is the posterior mode estimate of σ². Figure 5 shows the results in comparison to MML, i.e. maximization of (12), for the random effects case with multi-collinear X, as in Section 3.3.1. Results for other settings were in essence very similar.

Extension 3: Generalized linear models

We now turn to ridge regression for generalized linear models (GLM). As before, we a priori assume i.i.d. β_j ~ N(0, τ²), here equivalent to an L₂ penalty λ = 1/τ² when estimating β by penalized likelihood. In [7] an iterative algorithm to estimate λ is derived, which alternates estimation of β with maximization of the marginal likelihood w.r.t. λ, requiring the computation of the trace of a Hessian, a p × p matrix. Here, the estimation of β itself is much slower than in the linear case, because it is not analytical and requires an iterative weighted least squares approximation.
Below we show how to substantially alleviate the computational burden in the p ≫ n setting by re-parameterizing the marginal likelihood implying computations in R n instead of R p .
However, β^X = Xβ can be computed by noting that the objective can be written as a penalized log-likelihood h_{Y,λ}(β^X) in the n-dimensional linear predictor only (23). In other words, this is the penalized log-likelihood when regressing Y on the identity design matrix I_n, using the L₂ smoothing penalty (β^X)^T Σ_λ^{-1} β^X = λ(β^X)^T (XX^T)^{-1} β^X. The latter fits conveniently into the set-up of [8], as implemented in the R-package mgcv. This also facilitates MML estimation of λ by maximizing ML(λ), with h_{Y,λ}(β^X) as in (23). If the columns of X are standardized (common in high-dimensional studies), XX^T has rank n − 1 instead of n, implying that (XX^T)^{-1} does not exist and should be replaced by a pseudo-inverse (XX^T)^+, such as the Moore-Penrose inverse.
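The computational gain of the n-dimensional reparameterization can be illustrated for Poisson ridge regression. The sketch below (our own Python illustration, not the mgcv route used in the paper) parameterizes β = X^T γ, so that the penalized likelihood is optimized over n instead of p parameters; at the optimum the two fits coincide, because the penalized solution lies in the row space of X:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
n, p, lam = 20, 100, 5.0
X = rng.standard_normal((n, p))
y = rng.poisson(np.exp(X @ rng.normal(0, 0.05, p)))
K = X @ X.T

def obj_p(beta):      # penalized Poisson negative log-likelihood, p parameters
    eta = X @ beta
    return np.sum(np.exp(eta) - y * eta) + 0.5 * lam * beta @ beta

def obj_n(gamma):     # same objective after beta = X^T gamma; n parameters
    eta = K @ gamma
    return np.sum(np.exp(eta) - y * eta) + 0.5 * lam * gamma @ (K @ gamma)

fit_p = minimize(obj_p, np.zeros(p), method="L-BFGS-B")
fit_n = minimize(obj_n, np.zeros(n), method="L-BFGS-B")
eta_p, eta_n = X @ fit_p.x, K @ fit_n.x
print(np.max(np.abs(eta_p - eta_n)))     # small: the two fits agree
```

For p ≫ n the n-dimensional problem is dramatically cheaper per iteration.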
In a full Bayesian linear model setting, dimension reduction is also discussed by [25], where Xβ is substituted by an n-dimensional factor-analytic representation, which requires an SVD of X. There, however, it is not used for hyper-parameter estimation by marginal likelihood, but rather for specifying (hierarchical) priors for the factors.

Results
R packages like glmnet [13] and penalized [26] estimate λ by cross-validation, and mgcv also allows, next to MML estimation, (generalized) CV estimation [8]. Figures 6(a) and 6(b) show the results for Poisson ridge regression, with Y_i ~ Pois(λ_i), λ_i = exp(X_i β), β generated as in (14), and X generated as in (14) and (16), which denote the independent X and multi-collinear X settings, respectively. The MML-based estimates of λ are clearly the most accurate. The GCV-based estimates are inferior to the MML-based ones, but much better than those of the two CV-based methods, which may be due to the different regression estimators used (Laplace approximation versus iterative weighted least squares). We should stress that CV does not target the estimation of λ as such, but merely the minimization of prediction error. Nevertheless, the difference is remarkably larger than in the corresponding linear case (see Figures 1 and 2).
The Supplementary Material shows the results for Binomial ridge regression. While the differences in performance are less dramatic than for the Poisson setting, MML still renders much better estimates of λ than CV-based approaches.

Computational aspects and software
All methods and simulations presented here are implemented in a few wrapper R scripts: one for the linear random effects model (which includes the conjugate Bayes estimator), one for the linear mixed effects model, and one for Poisson and Binomial ridge regression. Parallel computations are supported. The scripts allow exact reproduction of the results in this manuscript as well as comparisons for other simulation or user-specific real data X cases. In addition, a script is supplied to produce the box-plots as in this manuscript.
Computing times of the various methods largely depend on n and p, and much less so on the exact simulation setting. These are displayed for n = 100, 500 and p = 10³, 10⁴, 10⁵ in Table 6, based on computations with one CPU of an Intel Xeon E5-2660 v3 @ 2.60GHz server. For Poisson ridge regression, we only report the computing times of MML and GCV because, as reported in Figure 6, the performance of the CV-based methods was very inferior. From Table 6 we conclude that MML is also computationally very attractive. Its efficiency is explained by the fact that, unlike many other methods, it does not require an SVD or other matrix decomposition of X. Moreover, the only computation that involves dimension p is the product XX^T.

Discussion
We compared several estimators in a large variety of high-dimensional settings. The results showed that plain maximum marginal likelihood works well in many settings. MML is generally superior to methods that aim to estimate σ² separately, (9) and (10). Apparently, the estimates of σ² and τ² are so intrinsically linked in the high-dimensional setting that separate estimation is suboptimal. The moment estimator (MoM) is generally not competitive with MML. It may, however, be useful in large systems with multiple hyper-parameters to estimate relative penalties, which are less sensitive to scaling issues than the global penalty parameter [27]. MoM may also be a useful initial estimator for more complex estimators that are based on optimization, such as MML.
Possibly somewhat surprising is the good performance of MML for estimating λ and h², as these are functions of σ² and τ². For the estimation of λ it is generally better than or competitive with (generalized) CV, an observation also made for the low-dimensional setting [8]. The inferior performance of the basic estimator of σ² (9) implies that alternative estimators of λ that use σ̂² as a plug-in are unlikely to perform well in high-dimensional settings. Such estimators, including the original one by Hoerl and Kennard [1], are compared by [4,6], who show that some do perform well in the low-dimensional setting. For Poisson ridge regression, similar estimators of λ are available [5], but these rely on an initial maximum likelihood estimator of β, and hence do not apply to the high-dimensional setting. For estimating h², note that HiLMM [10] aims to compute a confidence interval for h² as well. For that purpose their direct estimator (8) is likely more useful than MML on the pair (τ², σ²). We also used Esther [28], which precedes HiLMM by sure independence screening. It did not improve HiLMM in our (semi-)sparse settings, and requires manual steps. However, it likely improves HiLMM results in very sparse settings [28].
For mixed effect models with a small number of fixed effects, MML compares fairly well to REML, with a larger bias, but smaller variance. Probably the potential advantage of contrasting out the fixed effects is small when the number of random effects is large. REML may have a larger advantage in very sparse settings [29] or when the number of fixed effects is large with respect to n. Estimates from the conjugate Bayes model are very similar to those by MML. We show that estimating τ 2 along with σ 2 highly improves the σ 2 estimates presented by [24], where a fixed value of τ 2 is used. In the case of many variance components or multiple similar regression equations, Bayesian extensions that shrink the estimates by a common prior are appealing, in particular in combination with efficient posterior approximations such as variational Bayes [30].
Our model (1) implies a dense setting, but we have demonstrated that the MML and REML estimators of τ 2 and σ 2 are fairly robust against moderate sparsity, which corroborates the results by [29]. Nevertheless, true sparse models may be preferable when variable selection is desired, which depends on accurate estimation of β. On the other hand, post-hoc selection procedures can be rather competitive [31]. Moreover, the sparsity assumption is questionable for several applications.
E.g. in genetics, it has been suggested that many complex traits (such as height or cholesterol levels) are not merely polygenic, but rather "omnigenic" [32].
The extension of MML to high-dimensional GLM settings (22) is promising given its computational efficiency and performance for Poisson and Binomial regression. A special case of the latter, logistic regression, requires further research, because the Laplace approximations of the marginal likelihood are less accurate here [8]. Extension to survival is a promising avenue, because Cox regression is directly linked to Poisson regression [33]. Alternatively, parametric survival models may be pursued. To what extent the estimates of hyper-parameters impact predictions depends on the sensitivity of the likelihood to these parameters. For the linear setting, a re-analysis of the weight-gain data showed that predictions based onλ MML improved those based onλ CV .
The MML estimator can be extended to estimation of multiple variance components or penalty parameters, which was addressed by iterative likelihood minorization [34] and by parameter-based moment estimation [27]. The latter extends to non-Gaussian response such as survival or binary.
Further comparison of these methods with multi-parameter MML, both in terms of performance and computational efficiency, is left for future research. Finally, in particular in genetics applications, extensions of estimation of variance components by MML to non-independent individuals can be implemented by use of a well-structured between-individual covariance matrix Σ [3].
Although our simulations cover a fairly broad spectrum of settings, many other variations could be of interest. We therefore supply fully annotated R scripts (https://github.com/markvdwiel/Hyperpar) that allow i) comparison of all algorithms discussed here, also for one's own real covariate set X; and ii) reproduction of all results presented here.

Acknowledgement
Gwenaël Leday was supported by the Medical Research Council, grant number MR/M004421.
We thank Jiming Jiang and Can Yang for their correspondence, input and software for the MM algorithm. In addition, we thank Kristoffer Hellton for providing the fridge software and data.

Contents
This Supplementary Material contains:
• details on the estimation of σ² and τ² with the conjugate Bayesian model;
• estimation results from this Bayesian model for a simulation by [24];
• details on the TCGA KIRC and TCPA OV gene and protein expression data.

Conjugate Bayesian linear regression

The conjugate Bayesian linear regression model is:

y = Xβ + ε, ε ~ N(0, σ²I_n), β | σ² ~ N(0, σ²ν^{-1} I_p), σ^{-2} ~ Gamma(a, b), (24)

where X is the design matrix, β is the vector of unknown regression parameters, ε is the vector of random errors, and the prior shape and rate hyper-parameters a and b are fixed (say a = 1 and b = 0.001) so as to induce a non-informative prior. In model (24) the ridge regularization parameter corresponds to ν = τ^{-2}. In the following we provide the marginal posterior distributions of β and σ^{-2}, as well as a reformulation of the marginal likelihood [23] that allows important computational savings.
The likelihood function and prior densities together yield the joint posterior π(β, σ^{-2} | y) ∝ π(y | β, σ^{-2}) π(β | σ^{-2}) π(σ^{-2}). The marginal posterior distribution of β is recognized to be a Student distribution with 2a + n degrees of freedom, centered at β*_ν = Σ*_ν X^T y, with Σ*_ν = (X^T X + νI_p)^{-1}. The marginal posterior distribution of σ^{-2} is a Gamma distribution with shape a* = a + 0.5n and rate b* = b + 0.5(y^T y − (β*_ν)^T (Σ*_ν)^{-1} β*_ν). The marginal likelihood of the model (26) involves the determinant |Σ*_ν|, which can be computationally demanding when the number of variables p is large and when we wish to evaluate the marginal likelihood for various values of ν. To tackle this problem it is helpful to consider the singular value decomposition U D V^T of X (where U and V are respectively n × q and p × q orthogonal matrices and D = diag(d_1, …, d_q) is the diagonal matrix of singular values d_1 > … > d_q, with q = min(n, p)) and to focus on the linear model y =_d N(Fθ, σ²I_n), where F = UD and θ = V^T β, instead of y =_d N(Xβ, σ²I_n). Using simple algebra it can be shown that E[θ|y] = θ*_ν = V^T β*_ν. Finally, the log-marginal likelihood reduces to

log π(y) = C + (1/2) Σ_{j=1}^q log(ν/(d_j² + ν)) − a* log b*, (31)

with constant C = a log b + log Γ(a*) − log Γ(a) − (n/2) log π. Expression (31) makes the evaluation of log π(y) very efficient for different values of ν. Furthermore, the function log π(y) is log-concave in ν, which facilitates the use of numerical methods over grid-search approaches. In practice, it might be good to employ a numerical algorithm on a sub-domain of R⁺ obtained from a rough grid-search.
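The determinant identity that makes the marginal likelihood cheap to evaluate over a grid of ν values can be checked numerically (illustrative Python, our own sketch):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, nu = 15, 60, 3.0
X = rng.standard_normal((n, p))
d = np.linalg.svd(X, compute_uv=False)    # q = min(n, p) singular values

# log|X^T X + nu I_p| = (p - q) log(nu) + sum_j log(d_j^2 + nu):
# once d is known, evaluating the determinant for any nu costs only O(q)
logdet_direct = np.linalg.slogdet(X.T @ X + nu * np.eye(p))[1]
logdet_svd = (p - len(d)) * np.log(nu) + np.sum(np.log(d**2 + nu))
print(np.isclose(logdet_direct, logdet_svd))    # True
```

A single SVD thus replaces one p × p determinant per candidate ν.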
As in [24], the results show that the conjugate Bayes model is indeed not robust against wrongly fixing τ². However, they also show that EB estimation of τ² strongly improves the σ² estimate. In fact, these estimates are very competitive with the ones for the scale-independent prior, β ~ N(0, τ²I_p), which was advocated by [24].

Kidney tumor gene expression

The TCGA KIRC data were downloaded from the harmonised GDC database (http://cancergenome.nih.gov/) using the R-package TCGAbiolinks [35]. The data contain RNAseq profiles of n = 71 kidney renal clear cell carcinomas. Only those genes were retained that had more than two counts per million in at least three samples, rendering p = 18,391 genes. The data were normalised using the trimmed mean of M-values method of the R-package edgeR [36]. Following common practice, all gene expression values were standardized.

Ovarian tumor protein expression
We use protein expression data from the cancer proteome atlas (TCPA; [37]), which holds 408 ovarian serous cystadenocarcinoma profiles measuring 224 proteins by reverse-phase protein arrays. These are available from the TCPA portal: https://www.tcpaportal.org/tcpa/. Data were normalized by median centering per sample. All protein expression values were standardized.

Weight gain data example: alternative analysis
In the main document we re-analysed the weight gain data recently discussed in [17]. The authors kindly provided the R-code corresponding to the results in [17], which includes their focused ridge (fridge) methodology. Below we report a small discrepancy between our analysis and theirs, and also present results of an alternative analysis.
The authors of [17] opted to present results on n = 25 (rather than n = 26) samples, for technical reasons (personal communication). Results differed very little, so we chose to use all n = 26 samples of the original study [20]. In addition, the R-code by [17] includes a prior selection of those 1,000 genes with the largest absolute marginal correlation to the response, i.e. weight gain. Such a prior selection can indeed enhance the performance of ridge-type predictors.
The analysis by [17], however, did not include the gene selection as part of the outer leave-one-out cross-validation loop for assessing predictive performance. This may lead to over-optimism of the prediction errors. We therefore repeated the analysis and comparison as presented in the main document, but with the gene selection as part of the outer CV-loop. Indeed, the error estimates increased substantially: MSPE_MML = 38.65 (was 14.40), MSPE_GCV = 44.20 (16.38) and MSPE_fridge = 41.65 (15.80). Nevertheless, the conclusion remains that the MML and fridge MSPEs are lower than those of GCV, with relative decreases of 12.5% and 5.8%, respectively.

Binomial ridge regression
Note that, unlike Poisson ridge regression, Binomial ridge regression is not implemented in penalized. Hence, we compare mgcv-based results with glmnet only. Figure 10 displays the results for estimating λ in the Binomial case with N = 5. Results did not qualitatively differ for N = 3 and N = 10. While MML is fairly well on target, the estimates from GCV are roughly 10 times too small, whereas the estimates from CV by glmnet are roughly 2-3 times too large, with outliers towards a 10-100 fold overestimation.