Smoothing parameter and model selection for general smooth models

This paper discusses a general framework for smoothing parameter estimation for models with regular likelihoods constructed in terms of unknown smooth functions of covariates. Gaussian random effects and parametric terms may also be present. By construction the method is numerically stable and convergent, and enables smoothing parameter uncertainty to be quantified. The latter enables us to fix a well known problem with AIC for such models. The smooth functions are represented by reduced rank spline like smoothers, with associated quadratic penalties measuring function smoothness. Model estimation is by penalized likelihood maximization, where the smoothing parameters controlling the extent of penalization are estimated by Laplace approximate marginal likelihood. The methods cover, for example, generalized additive models for non-exponential family responses (for example beta, ordered categorical, scaled t distribution, negative binomial and Tweedie distributions), generalized additive models for location scale and shape (for example two stage zero inflation models, and Gaussian location-scale models), Cox proportional hazards models and multivariate additive models. The framework reduces the implementation of new model classes to the coding of some standard derivatives of the log likelihood.


Introduction
This article is about smoothing parameter estimation and model selection in statistical models with a smooth regular likelihood, where the likelihood depends on smooth functions of covariates and these smooth functions are the targets of inference. Simple Gaussian random effects and parametric dependencies may also be present. When the likelihood (or a quasi-likelihood) decomposes into a sum of independent terms each contributed by a response variable from a single parameter exponential family distribution, then such a model is a generalized additive model (GAM; Hastie and Tibshirani 1986, 1990). GAMs are widely used in practice (see, e.g., Ruppert, Wand, and Carroll 2003; Fahrmeir et al. 2013), with their popularity resting in part on the availability of statistically well founded smoothing parameter estimation methods that are numerically efficient and robust (Wood 2000, 2011) and perform the important task of estimating how smooth the component functions of a model should be. The purpose of this article is to provide a general method for smoothing parameter estimation when the model likelihood does not have the convenient exponential family (or quasi-likelihood) form. For the most part we have in mind regression models of some sort, but the proposed methods are not restricted to such models. The basic strategy is to optimize a smoothness selection criterion with respect to the smoothing parameters, with each trial value of the smoothing parameters requiring an inner optimization to find the corresponding model coefficients. Implicit differentiation is used to obtain derivatives of the coefficients with respect to the smoothing parameters. This basic strategy works well in the GAM setting, but is substantially more complex when the simplifications of a GLM type likelihood no longer apply. Our aim is to provide a general method that is as numerically efficient and robust as the GAM methods, such that (i) implementation of a model class requires only the coding of some standard derivatives of the log-likelihood for that class and (ii) much of the inferential machinery for working with such models can reuse GAM methods (e.g., interval estimation or p-value computations). An important consequence of our approach is that we are able to compute a simple correction to the conditional AIC for the models considered, which corrects for smoothing parameter estimation uncertainty and the consequent deficiencies in a conventionally computed conditional AIC (see Greven and Kneib 2010). This facilitates the part of model selection distinct from smoothing parameter estimation.
The article is structured as follows. Section 2 introduces the general modeling framework. Section 3 then covers smoothness selection methods for this framework, with Section 3.1 developing a general method, Section 3.2 illustrating its use for the special case of distributional regression, and Section 3.3 covering the simplified methods that can be used in the even more restricted case of models with a similar structure to generalized additive models. Section 4 then develops approximate distributional results accounting for smoothing parameter uncertainty, which are applied in Section 5 to propose a corrected AIC suitable for the general model class. The remaining sections present simulation results and examples, while extensive further background, and details for particular models, are given in the supplementary appendices (referred to as online "SA A," "SA B," etc., below).

The General Framework
Consider a model for an n-vector of data, y, constructed in terms of unknown parameters, θ, and some unknown functions, g_j, of covariates, x_j. Suppose that the log-likelihood for this model satisfies the Fisher regularity conditions, has four continuous derivatives, and can be written l(θ, g_1, g_2, ..., g_M) = log f(y | θ, g_1, g_2, ..., g_M). In contrast to the usual GAM case, the likelihood need not be based on a single parameter exponential family distribution, and we do not assume that the log-likelihood can be written in terms of a single additive linear predictor. Now let the g_j(x_j) be represented via basis expansions of modest rank (k_j),

g_j(x_j) = Σ_{i=1}^{k_j} β_{ji} b_{ji}(x_j),

where the β_{ji} are unknown coefficients and the b_{ji}(x) are known basis functions such as splines, usually chosen to have good approximation theoretic properties. With each g_j is associated a smoothing penalty, which is quadratic in the basis coefficients and measures the complexity of g_j. Writing all the basis coefficients and θ in one p-vector β, the jth smoothing penalty can be written as β^T S_j β, where S_j is a matrix of known coefficients, which generally has only a small nonzero block. The estimated model coefficients are then

β̂ = argmax_β l(β) − (1/2) Σ_j λ_j β^T S_j β,    (1)

given M smoothing parameters, λ_j, controlling the extent of penalization. A slight extension is that the smoothing penalties may be such that several λ_i β^T S_i β are associated with one g_j, for example when g_j is a nonisotropic function of several variables. Note also that the framework can incorporate Gaussian random effects, provided the corresponding precision matrices can be written as λ_i S_i (where the S_i are known). From a Bayesian viewpoint, β̂ is a posterior mode for β. The Bayesian approach views the smooth functions as intrinsic Gaussian random fields with prior f_λ given by N(0, S_λ^-), where S_λ^- is a Moore-Penrose (or other suitable) pseudoinverse of S_λ = Σ_j λ_j S_j. Then the posterior mode is β̂ from (1), and in the large sample limit, assuming a fixed smoothing parameter vector, λ, we have

β | y ∼ N(β̂, (I + S_λ)^{-1}),

where I is the expected negative Hessian of the log-likelihood (or its observed version) at β̂. An empirical Bayesian approach is appealing here as it gives well calibrated inference for the g_j (Wahba 1983; Silverman 1985; Nychka 1988; Marra and Wood 2012) in a GAM context. Appropriate summations of the elements of diag{(I + S_λ)^{-1} I} provide estimates of the "effective degrees of freedom" of the whole model, or of individual smooths.
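For the Gaussian identity-link special case the construction is easy to make concrete in R (a minimal sketch, not the paper's implementation: mgcv::smoothCon is used here only to produce a spline basis and its penalty, and (1) then reduces to penalized least squares):

```r
library(mgcv)
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n) * 0.3

## Basis expansion for g(x): model matrix X holds the b_ji evaluated at the
## data, and S is the penalty matrix, so the penalty is beta' S beta.
sm <- smoothCon(s(x, k = 20), data = data.frame(x = x))[[1]]
X <- sm$X
S <- sm$S[[1]]

## Gaussian identity-link case of (1): penalized maximum likelihood reduces
## to penalized least squares, here for a fixed smoothing parameter lambda.
lambda <- 1
beta.hat <- solve(crossprod(X) + lambda * S, crossprod(X, y))
f.hat <- X %*% beta.hat   ## the estimated smooth at the data points
```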
Under this Bayesian view, smoothing parameters can be estimated to maximize the log marginal likelihood, or a Laplace approximate version of this (e.g., Wood 2011). In practice optimization is with respect to ρ, where ρ_i = log λ_i. Marginal likelihood estimation of smoothing parameters in a Gaussian context goes back to Anderssen and Bloomfield (1974) and Wahba (1985), while Shun and McCullagh (1995) showed that Laplace approximation of more general likelihoods is theoretically well founded. That marginal likelihood is equivalent to REML (in the sense of Laird and Ware 1982) supports its use when the model contains Gaussian random effects. Theoretical work by Reiss and Ogden (2009) also suggests practical advantages at finite sample sizes, in that marginal likelihood is less prone to multiple local minima than GCV (or AIC). Supplementary Appendix B (SA B) also demonstrates how Laplace approximate marginal likelihood (LAML) estimation of smoothing parameters maintains statistical consistency of reduced rank spline estimates. The use of Laplace approximation and the demonstration of statistical consistency require the assumption that dim(β) = O(n^α) where α < 1/3.

Smoothness Selection Methods
This section describes the general smoothness selection method, and a simplified method for the special case in which the likelihood is a simple sum of terms for each observation of a univariate response, and there is a single GAM-like linear predictor. The nonlinear dependencies implied by employing a general smooth likelihood result in unwieldy expressions unless some care is taken to establish a compact notation. In the rest of this article, Greek subscripts denote partial differentiation with respect to the given variable, while Roman superscripts are indices associated with the derivatives: hence D^{ij}_{βθ} = ∂²D/∂β_i∂θ_j. Roman subscripts denote vector or array element indices. For matrices the first Roman sub- or superscript denotes rows, the second columns. Roman superscripts without a corresponding Greek subscript are labels, for example β^1 and β^2 denote two separate vectors β. For Hessian matrices only, D^{β_iθ_j} is element i, j of the inverse of the matrix with elements D^{ij}_{βθ}. If any Roman index appears in two or more multiplied terms, but the index is absent on the other side of the equation, then a summation over the product of the corresponding terms is indicated (the usual Einstein summation convention being somewhat unwieldy in this context). To aid readability, in this article summation indices are highlighted in bold. For example, the equation a_{ij} b_{ik} c_{il} + d_{jkl} = 0 is equivalent to Σ_i a_{ij} b_{ik} c_{il} + d_{jkl} = 0. An indexed expression not in an equation is treated like an equation with no indices on the other side (so a_{ij} b_j is interpreted as Σ_j a_{ij} b_j).

General Model Estimation
Consider the general case in which the log-likelihood depends on several smooth functions of predictor variables, each represented via a basis expansion and each with one or more associated penalties. The likelihood may also depend on some strictly parametric model components. The log-likelihood is assumed to satisfy the Fisher regularity conditions, and in addition we usually assume that it has four bounded continuous derivatives with respect to the parameters (with respect to g_j(x) for any relevant fixed x, in the case of a smooth g_j). Let the model coefficients be β (recalling that this includes the vector θ of parametric coefficients and nuisance parameters). The penalized log-likelihood is then

L(β) = l(β) − (1/2) Σ_j λ_j β^T S_j β,

and we assume that the model is well enough posed that this has a maximum at which the negative Hessian is positive definite (at least after dealing with any parameter redundancy issues that can be addressed by linear constraint). Let β̂ be the maximizer of L and let H be the negative Hessian, with elements −L^{ij}_{ββ}. The log LAML (see online SA C) is

V(ρ) = L(β̂) + (1/2){log|S_λ|_+ − log|H|} + (M_p/2) log(2π),    (2)

where S_λ = Σ_j λ_j S_j, |S_λ|_+ is the product of the positive eigenvalues of S_λ, and M_p is the number of zero eigenvalues of S_λ when all λ_j are strictly positive. The basic strategy is to optimize V with respect to ρ = log(λ) via Newton's method. This requires β̂ to be obtained for each trial ρ via an inner Newton iteration, and derivatives of β̂ must be obtained by implicit differentiation. The log determinant computations have the potential to be computationally unstable, and reparameterization is needed to deal with this. The full Newton method based on computationally exact derivatives has the substantial practical advantage that it can readily be detected when V is indefinite with respect to a particular ρ_i, since then ∂V/∂ρ_i ≈ ∂²V/∂ρ_i² ≈ 0. Such indefiniteness occurs when a smoothing parameter λ_i → ∞ or a variance component tends to zero, both of which are perfectly legitimate. Dropping a ρ_i from the Newton update when such indefiniteness is detected ensures that it takes a value which can be treated as "working infinity" without overflowing. Methods which use an approximate Hessian, or none, do not have this advantage.
The proposed general method consists of outer and inner iterations, as follows.
Outer algorithm for ρ
1. Obtain initial values for ρ = log(λ), to ensure that the effective degrees of freedom of each smooth lies away from its maximum or minimum possible values.
2. Find initial β̂ guesstimates (model specific).
3. Perform the initial reparameterizations required in Section 3.1.1 to facilitate stable computation of log|S_λ|_+.
4. Repeat the following standard Newton iteration until convergence is detected at Step (c):
(a) evaluate V and its gradient and Hessian with respect to ρ, using the inner algorithm below;
(b) compute the Newton step, perturbing the Hessian to definiteness if required, dropping any ρ_i for which V is judged indefinite, and halving the step until V does not decrease;
(c) test for convergence.
5. Reverse the Step 3 reparameterization.
The method for evaluating V and its gradient and Hessian with respect to ρ at Step 4(a) is as follows, where L^{β_kβ_j} denotes element k, j of the inverse of the matrix with elements L^{kj}_{ββ}.
Inner algorithm for β̂
1. Reparameterize to deal with any "type 3" penalty blocks as described in Section 3.1.1, so that computation of log|S_λ|_+ is stable, and evaluate the derivatives of log|S_λ|_+.
2. Use Newton's method to find β̂, regularizing the Hessian and applying step length control to ensure convergence even when the Hessian is indefinite and/or β̂ is not identifiable, as described in Section 3.1.2.
3. Test for identifiability of β̂ at convergence by examining the rank of H, as described in Section 3.1.2. Drop unidentifiable coefficients.
4. If coefficients were dropped, find the reduced β̂ by further steps of Newton's method (Section 3.1.2).
5. Compute the first derivatives of β̂ with respect to ρ by implicit differentiation (Section 3.1.3).
6. Compute the second derivatives of β̂ with respect to ρ (Section 3.1.3).
7. Evaluate V and its first and second derivatives with respect to ρ (Section 3.1.4).
The following subsections fill in the method details, but note that to implement a particular model in this class it is necessary to be able to compute: l, l^i_β and l^{ij}_{ββ}, given β; the first derivatives of the Hessian with respect to ρ, which require the third derivatives of l with respect to β contracted with dβ̂/dρ_k; and the trace terms involving the second derivatives of the Hessian with respect to ρ, which require the corresponding third and fourth derivative contractions given dβ̂/dρ_k and d²β̂/dρ_k dρ_l. The last of these is usually computable much more efficiently than if the fourth derivative arrays were computed explicitly.
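To make the nesting concrete, the skeleton of the method can be sketched in R (a schematic only, for a single penalty matrix S: loglik, grad.l and hess.l are assumed user-supplied log-likelihood derivative functions, the safeguards of Section 3.1.2 are omitted, and the outer Newton iteration on exact derivatives is replaced by a generic optimizer):

```r
inner <- function(rho, beta, grad.l, hess.l, S) {
  ## Newton iteration for beta.hat at fixed rho = log(lambda)
  ## (the regularization and step control of Section 3.1.2 are omitted).
  lambda <- exp(rho)
  for (iter in 1:200) {
    g <- grad.l(beta) - lambda * S %*% beta   ## penalized gradient
    H <- -hess.l(beta) + lambda * S           ## penalized negative Hessian
    beta <- beta + solve(H, g)
    if (max(abs(g)) < 1e-8) break
  }
  beta
}

V <- function(rho, loglik, grad.l, hess.l, S, beta0) {
  ## LAML score (2), up to the constant involving M_p, for one penalty block.
  lambda <- exp(rho)
  b <- inner(rho, beta0, grad.l, hess.l, S)
  H <- -hess.l(b) + lambda * S
  ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
  pos <- ev > max(ev) * 1e-10
  logdetS <- sum(pos) * rho + sum(log(ev[pos]))  ## log|S_lambda|_+
  as.numeric(loglik(b) - lambda * t(b) %*% S %*% b / 2 +
             (logdetS - determinant(H)$modulus) / 2)
}

## The outer Newton iteration of the text is replaced here by, e.g.:
## optimize(V, c(-10, 10), maximum = TRUE, loglik = loglik, grad.l = grad.l,
##          hess.l = hess.l, S = S, beta0 = rep(0, ncol(S)))
```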

3.1.1. Derivatives and Stable Evaluation of log|S_λ|_+
This section covers the details for outer Step 3 and inner Step 1. Stable evaluation of the log determinant terms is the key to stable computation with the LAML. The online SA C explains the issue. Wood (2011) proposed a solution which involves orthogonal transformation of the whole parameter vector β, but in the general case the likelihood may depend on each smooth function separately, and such a transformation is therefore untenable. It is necessary to develop a reparameterization strategy which does not combine coefficients from different smooths. This is possible if we recognize that S_λ is block diagonal, with different blocks relating to different smooths. For example, if S̄_j denotes the nonzero sub-block of S_j, a typical structure is

S_λ = blockdiag(λ_1 S̄_1, λ_2 S̄_2, λ_3 S̄_3 + λ_4 S̄_4, 0).

That is, there are some blocks with single smoothing parameters, and others with a more complicated additive structure. There are usually also some zero blocks on the diagonal. The block structure means that the generalized determinant, its derivatives with respect to ρ_k = log λ_k and the matrix square root of S_λ can all be computed blockwise. So for the above example,

log|S_λ|_+ = rank(S̄_1) log(λ_1) + log|S̄_1|_+ + rank(S̄_2) log(λ_2) + log|S̄_2|_+ + log|λ_3 S̄_3 + λ_4 S̄_4|_+.

For any ρ_k relating to a single parameter block we have ∂log|S_λ|_+/∂ρ_k = rank(S̄_k), and zero second derivatives. For multi-λ blocks there will generally be first and second derivatives to compute. There are no second derivatives "between blocks." In general, there are three block types, each requiring different preprocessing, as listed below.
1. Single parameter diagonal blocks. A reparameterization can be used so that all nonzero elements are one, and the rank precomputed.
2. Single parameter dense blocks. An orthogonal reparameterization, based on the eigenvectors of the symmetric eigen-decomposition of the block, can be used to make these blocks look like the previous type (by similarity transform). Again the rank is precomputed.
3. Multi-λ blocks require the reparameterization method of Wood (2011), Appendix B, to be applied for each new ρ proposal, since the numerical problem that the reparameterization avoids is ρ dependent (see online SA C). Initially, before the smoothing parameter selection iteration, it is necessary to reparameterize to separate the parameters corresponding to the block into penalized and unpenalized subvectors. This initial reparameterization can be based on the eigenvectors of the symmetric eigen-decomposition of the "balanced" version of the block penalty matrix, Σ_j S̄_j/||S̄_j||_F, where || · ||_F is the Frobenius norm. The balanced penalty is used for maximal numerical stability, and is usable because formally the spaces of the penalized and unpenalized components do not change with the smoothing parameters.
The reparameterizations for each block type are applied to the model, usually to the model matrices X_j of the individual smooth terms. The reparameterization information must be stored so that we can return to the original parameterization at the end.
After the one-off initial reparameterization just described, Step 1 of the inner algorithm requires only that the reparameterization method of Wood (2011), Appendix B, be applied to the parameters corresponding to type 3 blocks, for each new set of smoothing parameters. A sketch of the resulting blockwise determinant computation follows.
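A minimal sketch of the blockwise log-determinant evaluation (R; the `blocks` structure is hypothetical, and in practice the type 3 computation follows the Wood (2011) Appendix B reparameterization rather than the raw eigen-decomposition used here):

```r
## Blockwise evaluation of log|S_lambda|_+. Each element of `blocks` holds
## the nonzero penalty sub-blocks S (a list of matrices) and their smoothing
## parameters lambda (a vector), for one diagonal block of S_lambda.
logdetS <- function(blocks) {
  ld <- 0
  for (b in blocks) {
    if (length(b$S) == 1) {
      ## Type 1/2 block: rank(S) * log(lambda) + log|S|_+; the rank is
      ## precomputed in practice, recomputed here for simplicity.
      ev <- eigen(b$S[[1]], symmetric = TRUE, only.values = TRUE)$values
      pos <- ev > max(ev) * 1e-10
      ld <- ld + sum(pos) * log(b$lambda) + sum(log(ev[pos]))
    } else {
      ## Type 3 block: form sum_j lambda_j S_j for this block only.
      Sb <- Reduce(`+`, Map(`*`, as.list(b$lambda), b$S))
      ev <- eigen(Sb, symmetric = TRUE, only.values = TRUE)$values
      pos <- ev > max(ev) * 1e-10
      ld <- ld + sum(log(ev[pos]))
    }
  }
  ld
}
```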

3.1.2. Newton Iteration for β̂
This section provides details for inner Steps 2-4. Newton iteration for β̂ requires the gradient vector, G, with elements L^i_β = l^i_β − Σ_j λ_j S_j^{ik} β_k, and the negative Hessian, H, with elements −L^{ij}_{ββ} (we will also use Î to denote the negative Hessian of the unpenalized log-likelihood, with elements −l^{ij}_{ββ}, so that H = Î + S_λ). In principle Newton iteration proceeds by repeatedly setting β to β + Δ, where Δ = H^{-1}G. In practice, Newton's method is only guaranteed to converge to a maximum of L provided (i) that the Hessian is perturbed to be positive definite if it is not, guaranteeing that the Newton direction is an ascent direction, (ii) that step reduction is used to ensure that the step taken actually increases L, and (iii) that the computation of the step is numerically stable (see Nocedal and Wright 2006).
The Hessian of L may be indefinite away from a maximum, but even near the maximum there are two basic impediments to stability and positive definiteness. First, some elements of β may be unidentifiable. This issue will be dealt with by dropping parameters at convergence, as described shortly. The second issue is that some smoothing parameters may legitimately become very large during fitting, resulting in very large λ_j S_j components, poor scaling, poor conditioning and, hence, computational singularity. However, given the initial and Step 1 reparameterizations, such large elements can be dealt with by diagonal preconditioning of H. That is, define the diagonal matrix D such that D_ii = |H_ii|^{-1/2}, and the preconditioned Hessian H' = DHD. Then H^{-1} = DH'^{-1}D, with the right-hand side resulting in much better scaled computation. In the work reported here the pivoted Cholesky decomposition of the perturbed Hessian, R^T R = H' + εI, is repeated with increasing ε, starting from zero, until positive definiteness is obtained. The Newton step is then computed as Δ = DR^{-1}R^{-T}DG. If the step to β + Δ fails to increase the likelihood then Δ is repeatedly halved until it does. Note that the perturbation of the Hessian does not change the converged state of the Newton algorithm (although varying the perturbation strength can change the algorithm convergence rate).
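The step computation can be sketched as follows (R; a minimal illustration of the preconditioning and perturbation just described, with the step halving left to the caller):

```r
## One preconditioned, perturbed-Cholesky Newton step: H is the penalized
## negative Hessian and G the penalized gradient at the current beta.
newton.step <- function(H, G) {
  D <- 1 / sqrt(abs(diag(H)))            ## diagonal preconditioner
  Hp <- outer(D, D) * H                  ## H' = D H D
  eps <- 0
  repeat {                               ## perturb until positive definite
    R <- try(chol(Hp + diag(eps, ncol(H))), silent = TRUE)
    if (!inherits(R, "try-error")) break
    eps <- if (eps == 0) 1e-10 else eps * 100
  }
  ## Delta = D R^{-1} R^{-T} D G; the caller then halves Delta until the
  ## penalized log-likelihood actually increases.
  D * backsolve(R, forwardsolve(t(R), D * G))
}
```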
At convergence H can at worst be positive semi-definite, but it is necessary to test for the possibility that some parameters are unidentifiable. The test should not depend on the particular values of the smoothing parameters. This can be achieved by constructing the balanced penalty S̄ = Σ_j S_j/||S_j||_F (|| · ||_F is the Frobenius norm, but another norm could equally well be used), and then forming the pivoted Cholesky decomposition R^T R = P{H/||H||_F + S̄/||S̄||_F}P^T, where P is the pivoting permutation matrix. The rank of R can then be estimated by making use of Cline et al. (1979). If this reveals rank deficiency of order q, then the coefficients corresponding to the matrix rows and columns pivoted to the last q positions should be dropped from the analysis. The balanced penalty is used to avoid dropping parameters simply because some smoothing parameters are very large. Given the nonlinear setting it is necessary to repeat the Newton iteration to convergence with the reduced parameter set, in order that the remaining parameters adjust to the omission of those dropped.
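A sketch of the identifiability check (R; here the built-in rank determination of R's pivoted Cholesky stands in for the Cline et al. (1979) condition estimator used in the actual method):

```r
## Which coefficients to drop at convergence: pivoted Cholesky of the
## scaled Hessian plus the balanced penalty.
which.drop <- function(H, S.list) {
  Sb <- Reduce(`+`, lapply(S.list, function(S) S / norm(S, "F")))
  A <- H / norm(H, "F") + Sb / norm(Sb, "F")
  R <- suppressWarnings(chol(A, pivot = TRUE))
  r <- attr(R, "rank")                   ## estimated rank
  piv <- attr(R, "pivot")
  ## coefficients pivoted to the last ncol(A) - r positions are dropped
  if (r < ncol(A)) piv[(r + 1):ncol(A)] else integer(0)
}
```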

3.1.3. Implicit Differentiation
This section provides the details for inner Steps 5-7. We obtain the derivatives of the identifiable elements of β̂ with respect to ρ. All computations here are in the reduced parameter space, if parameters were dropped. At the maximum penalized likelihood estimate we have

L^i_β = l^i_β − λ_k S_k^{ij} β̂_j = 0,

and differentiating with respect to ρ_k = log λ_k yields

dβ̂/dρ_k = −λ_k H^{-1} S_k β̂,

given which we can compute the first derivatives of the Hessian with respect to ρ. Differentiating again yields the second derivatives d²β̂/dρ_k dρ_l, and hence the terms involving the second derivatives of the Hessian. As mentioned in Section 3.1, it will generally be inefficient to form this last quantity explicitly, as it occurs only in the summations involved in computing the final trace in (3).
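The implicit differentiation identity is easy to verify numerically (an R sketch for penalized logistic regression with a single ridge penalty; this is an illustration only, not the paper's code, and the finite-difference comparison is purely a check):

```r
## Check d beta.hat / d rho = -lambda * H^{-1} S beta.hat numerically.
set.seed(3)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% c(1, -1, 0.5, 0, 0)))
S <- diag(p)

fit <- function(rho) {                  ## penalized MLE by Newton iteration
  lambda <- exp(rho); beta <- rep(0, p)
  for (i in 1:100) {
    mu <- plogis(X %*% beta)
    g <- t(X) %*% (y - mu) - lambda * S %*% beta
    H <- t(X) %*% (X * as.numeric(mu * (1 - mu))) + lambda * S
    beta <- beta + solve(H, g)
    if (max(abs(g)) < 1e-10) break
  }
  mu <- plogis(X %*% beta)              ## recompute H at beta.hat
  H <- t(X) %*% (X * as.numeric(mu * (1 - mu))) + lambda * S
  list(beta = beta, H = H)
}

f <- fit(0)
d.analytic <- -exp(0) * solve(f$H, S %*% f$beta)
d.fd <- (fit(1e-5)$beta - fit(-1e-5)$beta) / 2e-5
range(d.analytic - d.fd)                ## should be near zero
```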

3.1.4. The Remaining Derivatives
Recalling that H is the matrix with elements −L^{ij}_{ββ}, the gradient of V with respect to ρ is

∂V/∂ρ_k = −(λ_k/2) β̂^T S_k β̂ + (1/2) ∂log|S_λ|_+/∂ρ_k − (1/2) tr(H^{-1} ∂H/∂ρ_k),

where components involving L^j_β are zero by definition of β̂. The components not covered so far are those of the Hessian of V,

∂²V/∂ρ_k∂ρ_l = −δ_k^l (λ_k/2) β̂^T S_k β̂ − λ_k β̂^T S_k dβ̂/dρ_l + (1/2) ∂²log|S_λ|_+/∂ρ_k∂ρ_l + (1/2) tr(H^{-1} ∂H/∂ρ_l H^{-1} ∂H/∂ρ_k) − (1/2) tr(H^{-1} ∂²H/∂ρ_k∂ρ_l).    (3)

The final term in (3) is expensive if computed naively by explicitly forming each ∂²H/∂ρ_k∂ρ_l, but this is unnecessary, and the computation of tr(H^{-1} ∂²H/∂ρ_k∂ρ_l) can usually be performed efficiently as the final part of the model specification, keeping the total cost to O(Mnp²): see online SA G and Section 3.2 for illustrative examples. The Cox (1972) proportional hazards model provides a straightforward application of the general method, and the requisite computations are set out in online SA G in a manner that maintains O(Mnp²) computational cost. Another example is the multivariate additive model, in which the means of a multivariate Gaussian response are given by separate linear predictors, which may optionally share terms. This model is covered in online SA H and Section 8. Section 3.2 considers how another class of models falls into the general framework.

A Special Case: GAMLSS Models
The GAMLSS (or "distributional regression") models discussed by Rigby and Stasinopoulos (2005) (and also Yee and Wild 1996; Klein et al. 2014, 2015) fall within the scope of the general method. The idea is that we have independent univariate response observations, y_i, whose distributions depend on several unknown parameters, each of which is determined by its own linear predictor. The log-likelihood is a straightforward sum of contributions from each y_i (unlike the Cox model, e.g.), and the special structure can be exploited so that implementation of new models in this class requires only the supply of some derivatives of the log-likelihood terms with respect to the distribution parameters. Given the notational conventions established previously, the expressions facilitating this are rather compact (without such a notation they can easily become intractably complex).
Let the log-likelihood for the ith observation be l_i(y_i, η_i^1, ..., η_i^K), where the kth linear predictor is given by η^k = X^k β^k, with model matrix X^k. Finding β̂ by Newton's method then requires the chain rule results

l^j_{β^k} = l^i_{η^k} X^k_{ij} and l^{jl}_{β^kβ^m} = l^i_{η^kη^m} X^k_{ij} X^m_{il},

which are also sufficient for first-order implicit differentiation.
LAML optimization also requires the derivatives of the log-likelihood Hessian with respect to ρ, the k, m block of which has elements

∂l^{jl}_{β^kβ^m}/∂ρ_v = l^i_{η^kη^mη^q} (dη_i^q/dρ_v) X^k_{ij} X^m_{il}.

Notice how this is just an inner product X^T V X, where the diagonal matrix V is the sum over q of some diagonal matrices. At this stage the second derivatives of β̂ with respect to ρ can be computed, after which we require only the corresponding fourth-order terms needed for the final trace in (3). So to implement a new family for GAMLSS estimation requires mixed derivatives up to fourth order with respect to the parameters of the likelihood. In most cases what would be conveniently available is, for example, l^i_{μ^lμ^mμ^qμ^s} rather than l^i_{η^lη^mη^qη^s}, where μ^k is the kth parameter of the likelihood and is given by h_k(μ^k) = η^k, h_k being a link function.
To get from the μ derivatives to the η derivatives, the rules (A.1)-(A.4) from Appendix A are used. This is straightforward for any derivative that is not mixed. For mixed derivatives containing at least one first-order derivative, the transformation rule applying to the highest order derivative is applied first, followed by the transformations for the first-order derivatives. This leaves only the transformation of derivatives of the form l^i_{μ^jμ^jμ^kμ^k} as at all awkward, but explicit expressions are easily obtained for this purpose. Naive computation of the resulting trace terms would be expensive if the derivative arrays were formed explicitly (where P is the dimension of the combined β), but the cost can be reduced to O(nP²) using a trick most easily explained by switching to a matrix representation. For simplicity of presentation assume K = 2, and define the matrix B to be the inverse of the penalized Hessian; the required traces can then be accumulated from products of the blocks of B with the model matrices, following the one-off formation of B[X^1, 0; 0, X^2]^T. See online SA I, where a zero inflated Poisson model provides an example of the details. Figure 2 shows estimates for such a two-predictor zero inflated Poisson model, in which f_1 is an adaptive P-spline and f_2 a cubic regression spline, while SA F.2 provides another application. Package mgcv also includes multinomial logistic regression implemented this way, and further examples are under development. An interesting possibility with any model which has multiple linear predictors is that one or more of those predictors should depend on some of the same terms, and online SA H shows how this can be handled.
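For example, a two-predictor Gaussian location-scale model of this class can be fitted in mgcv with one formula per linear predictor (a minimal runnable sketch; gaulss() is the package's Gaussian location-scale family, whose second linear predictor controls the standard deviation through a log-type link):

```r
library(mgcv)
set.seed(4)
n <- 400
x <- runif(n); z <- runif(n)
y <- rnorm(n, mean = sin(2 * pi * x), sd = exp(z - 1))

## One formula per linear predictor: first the mean, then the scale.
b <- gam(list(y ~ s(x), ~ s(z)), family = gaulss())
summary(b)
plot(b, pages = 1)   ## estimated smooths for both linear predictors
```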

A More Special Case: Extended Generalized Additive Models
For models with a single linear predictor, in which the log-likelihood is a sum of contributions per y_i, it is possible to perform fitting by iteratively reweighted least squares, enabling profitable reuse of some components of standard GAM fitting methods, including the exploitation of very stable orthogonal methods for solving least squares problems. Specifically, consider observations y_i, and let the corresponding log-likelihood be of the form

l = Σ_i l_i(y_i, μ_i, θ, φ),

where the terms in the summation may also be written as l_i for short, and μ_i is often E(y_i), but may also be a latent variable (as in the ordered categorical model of SA K). Given h, a known link function, h(μ_i) = η_i, where η = Xβ + o, X is a model matrix, β is a parameter vector and o is an offset (often simply 0). θ is a parameter vector containing the extra parameters of the likelihood, such as the p parameter of a Tweedie density (see online SA J), or the cut points of an ordered categorical model (see online SA K). Notice that in this case θ is not treated as part of β, since θ cannot always be estimated by straightforward iterative regression. Instead θ will be estimated alongside the smoothing parameters. φ is a scale parameter, often fixed at one. Let l̃_i = max_{μ_i} l_i(y_i, μ_i, θ, φ) denote the saturated log-likelihood. Define the deviance corresponding to y_i as D_i = 2(l̃_i − l_i)φ, where φ is the scale parameter on which D_i does not depend. Working in terms of the deviance is convenient in a regression setting, where deviance residuals are a preferred method for model checking and the proportion of deviance explained is a natural substitute for the r² statistic as a measure of goodness of fit (but see the final comment in online SA I).
In general the estimates of β will depend on some log smoothing parameters ρ_j = log λ_j, and it is notationally expedient to consider these to be part of the vector θ, although it is to be understood that l does not actually depend on these elements of θ. Given θ, estimation of β is by minimization of the penalized deviance D(β, θ) = Σ_i D_i(β, θ) + Σ_j λ_j β^T S_j β, with respect to β. This can be achieved by penalized iteratively reweighted least squares (PIRLS), which consists of iterative minimization of Σ_i w_i(z_i − X_i β)² + Σ_j λ_j β^T S_j β, where the pseudodata and weights are given by

w_i = (1/2) ∂²D_i/∂η_i² and z_i = η_i − o_i − (∂D_i/∂η_i)/(2w_i).

Note that if w_i = 0 (or w_i is too close to 0), the penalized least squares estimate can be computed using only w_i z_i = w_i(η_i − o_i) − (1/2) ∂D_i/∂η_i, which is then well defined and finite when z_i is not. Estimation of θ, and possibly φ, is by LAML. Writing W as the diagonal matrix of w_i values, the log LAML is given by the obvious analogue of (2), in which the negative Hessian of the log-likelihood is X^T W X/φ, and W is evaluated at the β̂ implied by θ. To compute the derivatives of V with respect to θ, the derivatives of β̂ with respect to θ are required. Note that V is a full Laplace approximation, rather than the "approximate" Laplace approximation used to justify PQL (Breslow and Clayton 1993), so that PQL's well known problems with binary and low count data are much reduced. In particular: (i) most PQL implementations estimate φ when fitting the working linear mixed model, even in the binomial and Poisson cases, where it is fixed at 1. For binary and low count data this can give very poor results. (ii) PQL uses the expected Hessian rather than the Hessian, and these only coincide in the canonical link case. (iii) PQL is justified by an assumption that the iterative fitting weights only vary slowly with the smoothing parameters, an assumption that is not needed here. The parameters θ and φ can be estimated by maximizing V using Newton's method, or a quasi-Newton method. Notice that V depends directly on the elements of θ via D, l̃ and S_λ, but also indirectly via the dependence of μ̂ and W on β̂ and hence on θ. Hence, each trial θ, φ requires a PIRLS iteration to find the corresponding β̂, followed by implicit differentiation to find the derivatives of β̂ with respect to θ. Once these are obtained, the chain rule can be applied to find the derivatives of V with respect to θ and φ.
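The PIRLS iteration is easy to sketch for a concrete case (R; a minimal illustration for the Poisson log-link model with zero offset and a single fixed penalty, using the weight and pseudodata definitions just given, and without the stability safeguards of the actual method):

```r
## PIRLS for a Poisson log-link model. Deviance derivatives w.r.t. eta:
## dD_i/deta_i = 2(mu_i - y_i), d2D_i/deta_i^2 = 2 mu_i, so that
## w_i = mu_i and z_i = eta_i + (y_i - mu_i)/mu_i.
pirls <- function(y, X, S, lambda) {
  eta <- log(pmax(y, 0.5))              ## crude starting linear predictor
  for (iter in 1:200) {
    mu <- exp(eta)
    w <- mu
    z <- eta + (y - mu) / mu            ## pseudodata
    beta <- solve(t(X) %*% (X * w) + lambda * S, t(X) %*% (w * z))
    eta.new <- as.numeric(X %*% beta)
    if (max(abs(eta.new - eta)) < 1e-8 * max(abs(eta))) break
    eta <- eta.new
  }
  beta
}
```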
As illustrated in SA C, there is scope for serious numerical instability in the evaluation of the determinant terms in V, but for this case we can reuse the stabilization strategy from Wood (2011), namely, for each trial θ and φ:
1. Use the orthogonal reparameterization from Appendix B of Wood (2011) to ensure that log|S_λ|_+ can be computed in a stable manner.
2. Estimate β̂ by PIRLS, using the stable least squares method for negatively weighted problems from Section 3.3 of Wood (2011), setting structurally unidentifiable coefficients to zero.
3. Using implicit differentiation, obtain the derivatives of V required for a Newton update.
Step 3 is substantially more complicated than in Wood (2011), and is covered in Appendix A.

3.3.1. Extended GAM New Model Implementation
The general formulation above assumes that various standard information is available for each distribution and link. What is needed depends on whether quasi-Newton or full Newton is used to find θ̂. In summary, what is needed for each distribution is:
1. For finding β̂: the derivatives of D_i with respect to μ_i up to third order (along with the usual link function derivatives).
2. For Newton optimization of V with respect to θ and φ: the mixed derivatives of D_i with respect to μ_i, θ and φ up to fourth order overall; in addition, first and second derivatives of l̃ with respect to its arguments are needed.
All of these quantities can be obtained automatically using a computer algebra package. E(∂²D_i/∂μ_i²) is also useful for further inference; if it is not readily computed then we can substitute the observed ∂²D_i/∂μ_i². We have implemented beta, negative binomial, scaled t models for heavy tailed data, simple zero inflated Poisson, ordered categorical and Tweedie additive models in this way. The first three were essentially automatic: the derivatives were computed by a symbolic algebra package and coded from the results. Some care is required in doing this, to avoid excessive cancellation error, underflow or overflow in the computations. Overly naive coding of derivatives can often lead to numerical problems: the online SA I on the zero inflated Poisson provides an example of the sort of issues that can be encountered. The ordered categorical and Tweedie models are slightly more complicated, and details are therefore provided in the online SA J and K (including further examples of the need to avoid cancellation error).
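All of the families just listed are supplied with mgcv, so that fitting reduces to a standard gam call (a runnable sketch using the package's gamSim data simulator; the commented alternatives show the corresponding family functions):

```r
library(mgcv)
set.seed(5)
dat <- gamSim(1, n = 400, dist = "poisson", scale = 0.2)  ## simulated counts
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), family = nb(), data = dat)
summary(b)
## Other extended families follow the same pattern, e.g.:
## family = betar()       beta regression, response in (0,1)
## family = scat()        scaled t, heavy tailed continuous response
## family = ziP()         simple zero inflated Poisson counts
## family = ocat(R = 4)   ordered categorical response coded 1..4
## family = tw()          Tweedie response
```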

Smoothing Parameter Uncertainty
Conventionally in a GAM context smoothing parameters have been treated as fixed when computing interval estimates for functions, or for other inferential tasks. In reality smoothing parameters must be estimated, and the uncertainty associated with this has generally been ignored except in fully Bayesian simulation approaches. Kass and Steffey (1989) proposed a simple first-order correction for this sort of uncertainty in the context of iid Gaussian random effects in a one way ANOVA type design. Some extra work is required to understand how their method works when applied to smooths. It turns out that the estimation methods described above provide the quantities required to correct for smoothing parameter uncertainty.
Assume we have several smooth model components, let ρ_i = log λ_i and S_λ = Σ_j λ_j S_j. Writing β̂_ρ for β̂, to emphasize the dependence of β̂ on the smoothing parameters, we use the Bayesian large sample approximation (see SB.4), which is exact in the Gaussian case,

β | y, ρ ∼ N(β̂_ρ, V_β), where V_β = (Î + S_λ)^{-1},    (5)

along with the large sample approximation

ρ | y ∼ N(ρ̂, V_ρ),    (6)

where V_ρ is the inverse of the Hessian of the negative log marginal likelihood with respect to ρ. Since the approximation (6) applies in the interior of the parameter space, it is necessary to substitute a Moore-Penrose pseudoinverse of the Hessian if a smoothing parameter is effectively infinite, or otherwise to regularize the inversion (which is equivalent to placing a Gaussian prior on ρ). Conventionally (5) is used with ρ̂ plugged in and the uncertainty in ρ neglected. To improve on this, note that if (5) and (6) are correct, while z ∼ N(0, I) and independently ρ* ∼ N(ρ̂, V_ρ), then β = β̂_{ρ*} + R_{ρ*}^T z, where R_ρ^T R_ρ = V_β(ρ), follows the marginal distribution of β | y. This provides a way of simulating from β | y, but it is computationally expensive, as β̂_{ρ*} and R_{ρ*} must be computed afresh for each sample. (The conventional approximation would simply set ρ* = ρ̂.) Alternatively, consider the first-order Taylor expansion

β = β̂_{ρ̂} + J(ρ* − ρ̂) + R_{ρ̂}^T z + Σ_k (∂R_ρ^T/∂ρ_k)|_{ρ̂} z (ρ*_k − ρ̂_k) + r,

where r is a lower order remainder term and J = dβ̂_ρ/dρ|_{ρ̂}. Dropping r, the expectation of the right-hand side is β̂_{ρ̂}. Denoting the elements of R_ρ by R_{ij}, tedious but routine calculation shows that the three remaining random terms are uncorrelated, with covariance matrix

V'_β = V_β + J V_ρ J^T + V'', where V''_{ij} = Σ_{k,l,m} (∂R_{ki}/∂ρ_l) V_ρ^{lm} (∂R_{kj}/∂ρ_m),    (7)

which is computable at O(Mp³) cost (see online SA D). Dropping V'' we have the Kass and Steffey (1989) approximation V'_β ≈ V_β + J V_ρ J^T. (A first-order Taylor expansion of β̂ about ρ̂ yields a similar correction for the frequentist covariance matrix of β̂: V*_β = (Î + S_λ)^{-1} Î (Î + S_λ)^{-1} + J V_ρ J^T, where Î is the negative Hessian of the log-likelihood.)
The online SA D shows that, in a Demmler-Reinsch-like parameterization, the contribution of V'' to the posterior standard deviation σ_{β_i} of any penalized parameter β_i is bounded, so that the J(ρ − ρ̂) correction is dominant for components that are strongly nonzero. This offers some justification for using the Kass and Steffey (1989) approximation, but not in a model selection context, where near zero model components are those of most interest: hence, in what follows we will use (7) without dropping V''.
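In mgcv the corrected covariance matrix V'_β is available directly (a runnable sketch; vcov.gam with unconditional = TRUE returns the smoothing-parameter-uncertainty corrected matrix for REML or ML fits):

```r
library(mgcv)
set.seed(6)
dat <- gamSim(1, n = 300, scale = 2)   ## simulated additive model data
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

Vb  <- vcov(b)                         ## conditional on rho.hat: (I + S_lambda)^{-1}
Vb2 <- vcov(b, unconditional = TRUE)   ## corrected for uncertainty in rho
range(diag(Vb2) - diag(Vb))            ## the correction inflates the variances
```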

An Information Criterion for Smooth Model Selection
When viewing smoothing from a Bayesian perspective, the smooths have improper priors (or alternatively vague priors of convenience) corresponding to the null space of the smoothing penalties. This invalidates model selection via marginal likelihood comparison. An alternative is a frequentist AIC (Akaike 1973), based on the conditional likelihood of the model coefficients, rather than the marginal likelihood. In the exponential family GAM context, Hastie and Tibshirani (1990) proposed a widely used version of this conditional AIC in which the effective degrees of freedom of the model, τ_0, is used in place of the number of model parameters (in the general setting τ_0 = tr{V_β Î} is equivalent to the Hastie and Tibshirani (1990) proposal). But Greven and Kneib (2010) showed that this is overly likely to select complex models, especially when the model contains random effects: the difficulty arises because τ_0 neglects the fact that the smoothing parameters have been estimated and are, therefore, uncertain (a marginal AIC based on the frequentist marginal likelihood, in which unpenalized effects are not integrated out, is equally problematic, partly because of underestimation of variance components and consequent bias toward simple models). A heuristic alternative is to use τ_1 = tr(2ÎV_β − ÎV_βÎV_β) as the effective degrees of freedom, motivated by considering the number of unpenalized parameters required to optimally approximate a bias corrected version of the model, but the resulting AIC is too conservative (see Section 6, e.g.). Greven and Kneib (2010) show how to exactly compute a corrected AIC for the Gaussian additive model case, based on defining the effective degrees of freedom as Σ_i ∂ŷ_i/∂y_i (as proposed by Liang et al. 2008). Yu and Yau (2012) and Säfken et al. (2014) considered extensions to generalized linear mixed models. The novel contribution of this section is to use the results of the previous section to avoid the problematic neglect of smoothing parameter uncertainty in the conditional AIC computation, in a manner that is easily computed and applicable to the general model class considered in this article. The derivation of AIC (see, e.g., Davison 2003, sec. 4.7) with the MLE replaced by the penalized MLE is identical up to the point at which the AIC score is represented as

−2 l(β̂) + 2 tr(I_d Σ_{β̂}),

where Σ_{β̂} denotes the covariance matrix of β̂, β_d is the coefficient vector minimizing the KL divergence and I_d is the corresponding expected negative Hessian of the log-likelihood. In an unpenalized setting Σ_{β̂} ≈ I_d^{-1}, so that the trace term is simply the number of model parameters. Penalization means that the expected inverse covariance matrix of β̂ is no longer well approximated by Î, and there are then two ways of proceeding.
The first is to view β as a frequentist random effect, with predicted values β̂. In that case the covariance matrix for the predictions, β̂, corresponds to the posterior covariance matrix obtained when taking the Bayesian view of the smoothing process, so we have the conventional estimate τ = tr{V_β Î} if we neglect smoothing parameter uncertainty, or τ = tr{V'_β Î} accounting for it using (7).
The frequentist random effects formulation is not a completely natural way to view smooths, since we do not usually expect the smooth components of a model to be resampled from the prior with each replication of the data. However, in the smoothing context V_β has the interpretation of being the frequentist covariance matrix for β̂ plus an estimate of the prior expectation of the squared smoothing bias (matrix), which offers some justification for using the same τ estimate as in the strict random effects case. To see this consider the decomposition

β̂ − β_d = {β̂ − E(β̂)} + Δ_β, where Δ_β = E(β̂) − β_d

is the smoothing bias in β̂. The covariance matrix of the first term on the right-hand side, above, can be replaced by the standard frequentist estimate V_β̂ = (Î + S_λ)^{-1} Î (Î + S_λ)^{-1}. Now expand the penalized log-likelihood around β_d. Differentiating with respect to β and equating to zero, we obtain the approximation

β̂ ≈ β_d + (Î + S_λ)^{-1}{dl/dβ|_{β_d} − S_λ β_d}.

E(dl/dβ|_{β_d}) = 0 by definition of β_d, so taking expectations of both sides we have E(β̂) ≈ (I_d + S_λ)^{-1} I_d β_d. Hence, estimating I_d by Î, we have Δ̃_β = {(Î + S_λ)^{-1} Î − I}β_d, where I denotes the identity matrix. Considering the expected value of Δ̃_β Δ̃_β^T according to the prior mean and variance assumptions of the model, we have the following.
Lemma 1. Let the setup be as above and let E_π denote expectation assuming the prior mean and covariance for β. Treating Î as fixed, then

V_β̂ + E_π(Δ̃_β Δ̃_β^T) = V_β.

For proof see online SA D. This offers some justification for again using τ = tr{V_β Î}, or τ = tr{V'_β Î} accounting for ρ uncertainty. So both the frequentist random effects perspective and the prior expected smoothing bias approach result in

AIC = −2 l(β̂) + 2 tr{V'_β Î}.    (10)

This is the conventional Hastie and Tibshirani (1990) conditional AIC with an additive correction 2 tr{Î(V'_β − V_β)}, accounting for smoothing parameter uncertainty. The correction is readily computed for any model considered here, provided only that the derivatives of β̂ and V_β can be computed: the methods of Section 3 provide these. Section 6 provides an illustration of the efficacy of (10).
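Recent versions of mgcv use this corrected criterion in AIC.gam for REML/ML fits, so a random effect comparison in the spirit of Section 6 can be sketched as follows (s(fac, bs = "re") is the package's Gaussian random effect term; the data are simulated here, not those of the paper):

```r
library(mgcv)
set.seed(7)
dat <- gamSim(1, n = 500, scale = 2)
dat$fac <- factor(sample(1:40, 500, replace = TRUE))
dat$y <- dat$y + rnorm(40)[dat$fac] * 0.3    ## add a 40 level random effect

b0 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")
b1 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(fac, bs = "re"),
          data = dat, method = "REML")
AIC(b0, b1)   ## conditional AIC with the smoothing uncertainty correction
```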

Simulation Results
The improvement resulting from using the corrected AIC of Section 5 can be illustrated by simulation. Simulations were conducted for additive models whose true linear predictor η is a sum of smooth functions f_j of independent U(0, 1) covariates (the f_j are shown in the online SA E). Two model comparisons were considered. In the first, a 40 level Gaussian random effect was added to η, with the random effect standard deviation being varied from 0 (no effect) to 1. AIC was then used to select between models with or without the random effect included, but where all smooth terms were modeled using penalized regression splines.
In the second case models with and without f_0 were compared, with the true model being based on c f_0 in place of f_0, where the effect strength c was varied from 0 (no effect) to 1. Model selection was based on (i) conventional conditional generalized AIC using τ_0 from Section 5, (ii) the corrected AIC of Section 5, (iii) a version of AIC in which the degrees of freedom penalty is based on τ_1 from Section 5, (iv) AIC based on the marginal likelihood with the number of parameters given by the number of smoothing parameters and variance components plus the number of unpenalized coefficients in the model, and (v) the Greven and Kneib (2010) corrected AIC for the Gaussian response case. The marginal likelihood in case (iv) is a version in which unpenalized coefficients are not integrated out (to avoid the usual problems with fixed effect differences and REML, or improper priors and marginal likelihood).
Results are shown in the top row of Figure 3 for a sample size of 500 with Gaussian sampling error and standard deviation of 2. For the random effect comparison, conventional conditional AIC is heavily biased toward the more complex model, selecting it on over 70% of occasions. The ML based AIC is too conservative for an AIC criterion, with 3.5% selection of the larger model when it is not correct, as against the roughly 16% one might expect from AIC comparison of models differing in 1 parameter. The known underestimation of variance components estimated by this sort of marginal likelihood is partly to blame. The AIC based on τ_1 from Section 5 also lacks power, performing even less well than the ML based version. By contrast, the new corrected AIC performs well, and in this example is a slight improvement on Greven and Kneib (2010). For the smooth comparison the different calculations differ much less, although the alternatives are slightly less biased toward the more complex model than the conventional conditional generalized AIC, with the corrected Section 5 version showing the smallest bias. The lower row of Figure 3 shows equivalent power plots for the same Gaussian random effect and linear predictor η, but with Bernoulli, beta and Cox proportional hazard (partial) likelihoods (the first two using logit links).

[Figure 4 caption: Comparison of the new method with the gamlss (beta, nb, scat, zip) and BayesX (ocat) packages for one dimensional P-spline models. The two plots at lower right show comparisons of log10 computing times for the case with the smallest time advantage for the new method, beta regression. The remaining panels show boxplots of the replicate by replicate difference in MSE/Brier score, each standardized by the average MSE or Brier score for the particular simulation comparison. Each panel shows three boxplots, one for each noise to signal level. Positive values indicate that the new method is doing better than the alternative. Boxplots are shaded grey when the difference is significant (all three for nb correlated should be grey). In all cases where the difference is significant the new method is better than the alternative, except for the zero inflated Poisson with uncorrelated data, where the alternative method is better at all noise levels.]
The purpose of this article is to develop methods to allow the rich variety of smoothers illustrated in Figure 1 to be used in models beyond the exponential family, a task for which general methods were not previously available. However, for the special case of univariate P-splines (Eilers and Marx 1996; Marx and Eilers 1998) some comparison with existing methods is possible, in particular using the R package gamlss (Rigby and Stasinopoulos 2005, 2014) and the BayesX package (Fahrmeir and Lang 2001; Fahrmeir, Kneib, and Lang 2004; Brezger and Lang 2006; Umlauf et al. 2015; Belitz et al. 2015, www.bayesx.org). For this special case both packages implement models using essentially the same penalized likelihoods used by the new method, but they optimize localized marginal likelihood scores within the penalized likelihood optimization algorithm to estimate the smoothing parameters.
The comparison was performed using data simulated from models with the linear predictor given above (but without any random effect terms). Comparison of the new method with GAMLSS was only possible for the negative binomial, beta, scaled t and simple zero inflated Poisson families, and with BayesX only for the ordered categorical model (BayesX has a negative binomial family, but it is currently insufficiently stable for a sensible comparison to be made). Simulations with both uncorrelated and correlated covariates were considered. Three hundred replicates of sample size 400 were produced for each considered family at three levels of noise (see SA E for further details). Models were estimated using the correct link and additive structure, and using P-splines with basis dimensions of 10, 10, 15, and 8, chosen to avoid any possibility of forced oversmoothing while keeping down computational time.
Model performance for the negative binomial (nb), beta, scaled t (scat), and zero inflated Poisson (zip) families was compared via the MSE, n^{-1} Σ_{i=1}^n {η̂(x_i) − η_t(x_i)}², on the additive predictor scale. The Brier score, n^{-1} Σ_{i=1}^n Σ_{j=1}^R (p_{ij} − p̂_{ij})², was used to measure performance for the ordered categorical (ocat) family, where R is the number of categories, p_{ij} are the true category probabilities and p̂_{ij} their estimated values. In addition, the computational performance (CPU time) of the alternative methods was recorded. Figure 4 summarizes the results. In general, the new method provides a small improvement in statistical performance, which is slightly larger when covariates are correlated. The correlated covariate setting is the one in which local approximate smoothness selection methods would be expected to perform less well, relative to "whole model" criteria. In terms of speed and reliability the new method is an improvement, especially for correlated covariates, which tend to lead to reduced numerical stability, leading the alternative methods to fail in up to 4% of cases.

Example: Predicting Prostate Cancer
This section and the next provide example applications of the new methods, while the online SA F provides further examples in survival analysis and animal distribution modeling. Figure 5 shows representative protein mass spectra from serum taken from patients with a healthy prostate, relatively benign prostate enlargement and prostate cancer (see Adam et al. 2002). To avoid the need for intrusive biopsy there is substantial interest in developing noninvasive screening tests to distinguish cancer, healthy and more benign conditions. One possible model is an ordered categorical signal regression in which the mean of a logistically distributed latent variable z_i is given by

μ_i = ∫ f(D) ν_i(D) dD,

where f(D) is an unknown smooth function of mass D (in Daltons) and ν_i(D) is the ith spectrum. The probability of the patient lying in category 1, 2, or 3, corresponding to "healthy," "benign enlargement" and "cancer," is then given by the probability of z_i lying in the range (−∞, −1], (−1, θ] or (θ, ∞), respectively (see online SA K). Given the methods developed in this article, estimation of this model is routine, as is the exploration of whether an adaptive smooth should be used for f, given the irregularity of the spectra.

[Figure 5 caption: Three representative protein mass spectra (centered and normalized) from serum taken from patients with apparently healthy prostate, enlarged prostate, and prostate cancer. It would be useful to be able to predict disease status from the spectra. The red and blue spectra have been shifted upward for legibility.]

Figure 6 shows some results of model fitting. The estimated f(D) is based on a rank 100 thin plate regression spline, with effective degrees of freedom 29. An adaptive smooth gives almost identical results. The right panel shows a QQ-plot of ordered deviance residuals against simulated theoretical quantiles (Augustin, Sauleau, and Wood 2012); there is modest deviation in the lower tail. The middle panel shows boxplots of the probability of cancer according to the model for the three observed categories. Cancer and healthy are quite well separated, but cancer and benign enlargement less so. For cases with cancer, the model gave cancer a higher probability than normal prostate in 92% of cases, and a higher probability than either other category in 83% of cases. For healthy patients the model gave the normal category higher probability than cancer in 85% of cases and the highest probability in 77% of cases. These results are somewhat worse than those reported by Adam et al. (2002) for a relatively complex machine learning method which involved first preprocessing the spectra to identify peaks believed to be discriminating. On the other hand, the signal regression model here would allow the straightforward inclusion of further covariates, and does automatically supply uncertainty estimates.
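In mgcv this type of signal regression model can be set up using matrix arguments to a smooth term (a hypothetical sketch: status, DM and spec are stand-ins for the 1-3 ordered response and the n × m matrices of Dalton values and spectra, none of which are supplied here):

```r
library(mgcv)
## Hypothetical data: status[i] in 1:3 codes healthy / enlarged / cancer,
## spec[i,] is the ith spectrum nu_i(D), DM[i,] the matching mass grid.
## With matrix arguments, s(DM, by = spec) implements the summation
## sum_j f(DM[i,j]) * spec[i,j], i.e., the discretized integral of f(D) nu_i(D).
dat <- list(status = status, DM = DM, spec = spec)
b <- gam(status ~ s(DM, by = spec, k = 100), family = ocat(R = 3), data = dat)
plot(b)   ## the estimated coefficient function f(D)
```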

Multivariate Additive Modeling of Fuel Efficiency
Figure 7 shows part of a dataset on the fuel efficiency of 207 U.S. car models, along with their characteristics (Bache and Lichman 2013). Two efficiency measures were taken: miles per gallon (MPG) in city driving, and the same for highway driving. One possible model is a bivariate additive model, as detailed in the online SA H, where the two MPG measurements are modeled as bivariate Gaussian, with means given by separate linear predictors for the two components. A priori, it might be expected that city efficiency would be highly influenced by weight, and highway efficiency by air resistance and, hence, by frontal area or some other combination of height and width of the car.

The linear predictors for the two components were based on the additive fixed effects of factors "fuel type" (petrol or diesel), "style" of car (hatchback, sedan, etc.) and "drive" (all-, front- or rear-wheel). In addition, i.i.d. Gaussian random effects of the 22 car manufacturers were included, as well as smooth additive effects of car weight and horsepower. Additive and tensor product smooths of height and width were tried, as well as a smooth of the product of height and width, but there was no evidence to justify their inclusion: term selection penalties (Marra and Wood 2011) remove them, p-values indicate they are not significant and AIC suggests that they are better dropped.
The possibility of smooth interactions between weight and horsepower was also considered, using smooth main effects plus smooth interaction formulations of the form f_1(h) + f_2(w) + f_3(h, w). The smooth interaction term f_3 can readily be constructed in a way that excludes the main effects of w and h, by constructing its basis using the usual tensor product construction (e.g., Wood 2006), but based on marginal bases into which the constraints Σ_i f_1(h_i) = 0 and Σ_i f_2(w_i) = 0 have already been absorbed by linear reparameterization. The marginal smoothing penalties and, hence, the induced tensor product smoothing penalties are unaffected by the marginal constraint absorption. This construction is the obvious generalization of the construction of parametric interactions in linear models, and is simpler than the various schemes proposed in the literature.
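This construction is available in mgcv via ti() terms, and the bivariate model can be sketched as follows (a hypothetical sketch: mvn(d = 2) is the package's multivariate normal family, and mpg.dat, with columns city, hw, weight, hp, fuel, drive and maker, is a stand-in for the car dataset, which is not supplied here):

```r
library(mgcv)
## Smooth main effects plus a ti() interaction that excludes them, for both
## linear predictors of the bivariate Gaussian response (city, hw).
b <- gam(list(city ~ s(weight) + s(hp) + ti(weight, hp) + fuel + drive +
                     s(maker, bs = "re"),
              hw   ~ s(weight) + s(hp) + ti(weight, hp) + fuel + drive),
         family = mvn(d = 2), data = mpg.dat)
AIC(b)   ## compare with the fit dropping the ti() interaction terms
```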
The interactions again appear to add nothing useful to the model fit, and we end up with a model in which the important smooth effects are horsepower (hp) and weight, while the important fixed effects are fuel type and drive, with diesel giving lower fuel consumption than petrol and all wheel drive giving higher consumption than the two-wheel drives. These effects were important for both city and highway, whereas the random effect of manufacturer was only important for the city. Figure 8 shows the smooth and random effects for the city and highway linear predictors. Notice the surprising similarity between the effects, although the city smooth effects are generally slightly less pronounced than those for the highway. The overall r² for the model is 85%, with the city and highway error standard deviations estimated as 1.9 and 2.3 MPG, respectively. The estimated correlation coefficient is 0.88.

Discussion
This article has outlined a practical framework for smooth regression modeling with reduced rank smoothers, for likelihoods beyond the exponential family. The methods build seamlessly on the existing framework for generalized additive modeling, so that practical application of any of the models implemented as part of this work is immediately accessible to anyone familiar with GAMs via penalized regression splines.
The key novel components contributed here are (i) general, reliable and efficient smoothing parameter estimation methods based on maximized Laplace approximate marginal likelihood, (ii) a corrected AIC and distributional results incorporating smoothing parameter uncertainty to aid model selection and further inference, and (iii) demonstration of the framework's practical utility by provision of the details for some practically important models. The proposed methods should be widely applicable in situations in which effects are really smooth, and the methods scale well with the number of smooth model terms.
In situations in which some component functions are high rank random fields, then the INLA approach of Rue, Martino, and Chopin (2009) will be much more efficient; however, there are trade-offs between efficiency and stability in this case, since pivoting, used by our method to preserve stability, has instead to be employed to preserve sparsity in the INLA method (see online SA K).
The methods are implemented in R package mgcv from version 1.8 (see online SA M).

Supplementary Materials
The online supplementary materials contain additional appendices for the article.