
ABSTRACT

This article discusses a general framework for smoothing parameter estimation for models with regular likelihoods constructed in terms of unknown smooth functions of covariates. Gaussian random effects and parametric terms may also be present. By construction the method is numerically stable and convergent, and enables smoothing parameter uncertainty to be quantified. The latter enables us to fix a well-known problem with AIC for such models, thereby improving the range of model selection tools available. The smooth functions are represented by reduced rank spline-like smoothers, with associated quadratic penalties measuring function smoothness. Model estimation is by penalized likelihood maximization, where the smoothing parameters controlling the extent of penalization are estimated by Laplace approximate marginal likelihood. The methods cover, for example, generalized additive models for nonexponential family responses (e.g., beta, ordered categorical, scaled t, negative binomial, and Tweedie distributions), generalized additive models for location scale and shape (e.g., two stage zero inflation models, and Gaussian location-scale models), Cox proportional hazards models, and multivariate additive models. The framework reduces the implementation of new model classes to the coding of some standard derivatives of the log-likelihood. Supplementary materials for this article are available online.

1. Introduction

This article is about smoothing parameter estimation and model selection in statistical models with a smooth regular likelihood, where the likelihood depends on smooth functions of covariates and these smooth functions are the targets of inference. Simple Gaussian random effects and parametric dependencies may also be present. When the likelihood (or a quasi-likelihood) decomposes into a sum of independent terms each contributed by a response variable from a single parameter exponential family distribution, then such a model is a generalized additive model (GAM, Hastie and Tibshirani 1986, 1990). GAMs are widely used in practice (see, e.g., Ruppert, Wand, and Carroll 2003; Fahrmeir et al. 2013) with their popularity resting in part on the availability of statistically well founded smoothing parameter estimation methods that are numerically efficient and robust (Wood 2000, 2011) and perform the important task of estimating how smooth the component functions of a model should be.

The purpose of this article is to provide a general method for smoothing parameter estimation when the model likelihood does not have the convenient exponential family (or quasi-likelihood) form. For the most part we have in mind regression models of some sort, but the proposed methods are not limited to this setting. The simplest examples of the extension are generalized additive models where the response distribution is not in the single parameter exponential family: for example, where the response has a Tweedie, negative binomial, beta, scaled t, ordered categorical, or zero inflated distribution. Examples of models with a less GAM-like likelihood structure are Cox proportional hazards and Cox process models, scale-location models, such as the GAMLSS class of Rigby and Stasinopoulos (2005), and multivariate additive models (e.g., Yee and Wild 1996). Smooth function estimation for such models is not new: what is new here is the general approach to smoothing parameter estimation, and the wide variety of smooth model components that it admits.

The proposed method broadly follows the strategy of Wood (2011) that has proved successful for the GAM class. The smooth functions will be represented using reduced rank spline bases with associated smoothing penalties that are quadratic in the spline coefficients. There is now a substantial literature showing that the reduced rank approach is well-founded, and the basic issues are covered in an online Supplementary Appendix A (henceforth online “SA A”). More importantly, from an applied perspective, a wide range of spline and Gaussian process terms can be included as model components by adopting this approach (Figure 1). We propose to estimate smoothing parameters by Newton optimization of a Laplace approximate marginal likelihood criterion, with each Newton step requiring an inner Newton iteration to find maximum penalized likelihood estimates of the model coefficients. Implicit differentiation is used to obtain derivatives of the coefficients with respect to the smoothing parameters. This basic strategy works well in the GAM setting, but is substantially more complex when the simplifications of a GLM type likelihood no longer apply.

Figure 1. Examples of the rich variety of smooth model components that can be represented as reduced rank basis smoothers with quadratic penalties, and therefore can routinely be incorporated as components of a GAM. This article develops methods to allow their routine use in a much wider class of models. (a) One-dimensional smooths such as cubic, P-, and adaptive splines. (b) Isotropic smooths of several variables, such as thin plate splines and Duchon splines. (c) Nonisotropic tensor product splines used to model smooth interactions. (d) Gaussian Markov random fields for data on discrete geographies. (e) Finite area smoothers, such as soap film smoothers. (f) Splines on the sphere. Another important class are simple Gaussian random effects.

Our aim is to provide a general method that is as numerically efficient and robust as the GAM methods, such that (i) implementation of a model class requires only the coding of some standard derivatives of the log-likelihood for that class and (ii) much of the inferential machinery for working with such models can reuse GAM methods (e.g., interval estimation or p-value computations). An important consequence of our approach is that we are able to compute a simple correction to the conditional AIC for the models considered, which corrects for smoothing parameter estimation uncertainty and the consequent deficiencies in a conventionally computed conditional AIC (see Greven and Kneib 2010). This facilitates the part of model selection distinct from smoothing parameter estimation.

The article is structured as follows. Section 2 introduces the general modeling framework. Section 3 then covers smoothness selection methods for this framework, with Section 3.1 developing a general method, Section 3.2 illustrating its use for the special case of distributional regression, and Section 3.3 covering the simplified methods that can be used in the even more restricted case of models with a similar structure to generalized additive models. Section 4 then develops approximate distributional results accounting for smoothing parameter uncertainty which are applied in Section 5 to propose a corrected AIC suitable for the general model class. The remaining sections present simulation results and examples, while extensive further background, and details for particular models, are given in the supplementary appendices (referred to as online “SA A,” “SA B,” etc., below).

2. The General Framework

Consider a model for an n-vector of data, y, constructed in terms of unknown parameters, θ, and some unknown functions, $g_j$, of covariates, $x_j$. Suppose that the log-likelihood for this model satisfies the Fisher regularity conditions, has four continuous derivatives, and can be written $l(\theta, g_1, g_2, \ldots, g_M) = \log f(y|\theta, g_1, g_2, \ldots, g_M)$. In contrast to the usual GAM case, the likelihood need not be based on a single parameter exponential family distribution, and we do not assume that the log-likelihood can be written in terms of a single additive linear predictor. Now let the $g_j(x_j)$ be represented via basis expansions of modest rank ($k_j$),
\[
g_j(x) = \sum_{i=1}^{k_j} \beta_{ji} b_{ji}(x),
\]
where the $\beta_{ji}$ are unknown coefficients and the $b_{ji}(x)$ are known basis functions such as splines, usually chosen to have good approximation theoretic properties. With each $g_j$ is associated a smoothing penalty, which is quadratic in the basis coefficients and measures the complexity of $g_j$. Writing all the basis coefficients and θ in one p-vector β, the jth smoothing penalty can be written as $\beta^{\mathsf T} S_j \beta$, where $S_j$ is a matrix of known coefficients, which generally has only a small nonzero block. The estimated model coefficients are then
\[
\hat\beta = \underset{\beta}{\operatorname{arg\,max}}\; l(\beta) - \frac{1}{2}\sum_j^{M}\lambda_j\beta^{\mathsf T}S_j\beta, \qquad (1)
\]
given M smoothing parameters, $\lambda_j$, controlling the extent of penalization. A slight extension is that several penalties $\lambda_i\beta^{\mathsf T}S_i\beta$ may be associated with one $g_j$, for example when $g_j$ is a nonisotropic function of several variables. Note also that the framework can incorporate Gaussian random effects, provided the corresponding precision matrices can be written as $\sum_i\lambda_iS_i$ (where the $S_i$ are known).
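To fix ideas, the following R sketch evaluates a penalized log-likelihood of the form in (1) for a user-supplied log-likelihood and list of penalty matrices; the function and object names (pen_loglik, loglik, S_list) and the toy Poisson example are illustrative assumptions, not part of the framework itself.

```r
## Minimal sketch of the penalized log-likelihood in (1): l(beta) - (1/2) * sum_j lambda_j * beta' S_j beta.
## 'loglik' and the penalty list 'S_list' are hypothetical placeholders for a particular model.
pen_loglik <- function(beta, lambda, loglik, S_list) {
  pen <- 0
  for (j in seq_along(S_list))
    pen <- pen + lambda[j] * drop(t(beta) %*% S_list[[j]] %*% beta)
  loglik(beta) - pen / 2
}

## Toy illustration: Poisson log-likelihood with log link and a single ridge-type penalty.
set.seed(1)
X <- cbind(1, matrix(runif(40), 20, 2)); y <- rpois(20, 2)
loglik <- function(beta) sum(dpois(y, exp(X %*% beta), log = TRUE))
S_list <- list(diag(c(0, 1, 1)))   ## intercept left unpenalized
pen_loglik(c(0.5, 0.1, -0.1), lambda = 2, loglik, S_list)
```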

From a Bayesian viewpoint $\hat\beta$ is a posterior mode for β. The Bayesian approach views the smooth functions as intrinsic Gaussian random fields with prior $f_\lambda$ given by $N(0, S_\lambda^-)$, where $S_\lambda^-$ is a Moore–Penrose (or other suitable) pseudoinverse of $\sum_j\lambda_jS_j$. Then the posterior modes are $\hat\beta$ from (1), and in the large sample limit, assuming fixed smoothing parameter vector, λ, we have $\beta|y \sim N\{\hat\beta, (\mathcal I + S_\lambda)^{-1}\}$, where $\mathcal I$ is the expected negative Hessian of the log-likelihood (or its observed version) at $\hat\beta$. An empirical Bayesian approach is appealing here as it gives well calibrated inference for the $g_j$ (Wahba 1983; Silverman 1985; Nychka 1988; Marra and Wood 2012) in a GAM context. Appropriate summations of the elements of $\mathrm{diag}\{(\mathcal I + S_\lambda)^{-1}\mathcal I\}$ provide estimates of the "effective degrees of freedom" of the whole model, or of individual smooths.
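As a small illustration of the preceding paragraph, the R sketch below computes the approximate posterior covariance matrix and the effective degrees of freedom from an information matrix and total penalty matrix; both inputs are placeholders standing in for quantities that an actual fit would supply.

```r
## Sketch: approximate posterior covariance and effective degrees of freedom (EDF).
## 'I_hat' (negative Hessian of the log-likelihood) and 'S_lambda' are placeholder inputs.
set.seed(2)
I_hat    <- crossprod(matrix(rnorm(30), 10, 3))   ## stand-in 3 x 3 information matrix
S_lambda <- diag(c(0, 2, 2))                      ## stand-in total penalty; intercept unpenalized

Vb  <- solve(I_hat + S_lambda)   ## approximate posterior covariance of beta | y
edf <- diag(Vb %*% I_hat)        ## per-coefficient effective degrees of freedom
sum(edf)                         ## whole-model EDF
```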

Under this Bayesian view, smoothing parameters can be estimated to maximize the log marginal likelihood
\[
\mathcal V_r(\lambda) = \log\int f(y|\beta)f_\lambda(\beta)\,d\beta, \qquad (2)
\]
or a Laplace approximate version of this (e.g., Wood 2011). In practice optimization is with respect to ρ, where ρi = log λi. Marginal likelihood estimation of smoothing parameters in a Gaussian context goes back to Anderssen and Bloomfield (1974) and Wahba (1985), while Shun and McCullagh (1995) showed that Laplace approximation of more general likelihoods is theoretically well founded. That marginal likelihood is equivalent to REML (in the sense of Laird and Ware 1982) supports its use when the model contains Gaussian random effects. Theoretical work by Reiss and Ogden (2009) also suggests practical advantages at finite sample sizes, in that marginal likelihood is less prone to multiple local minima than GCV (or AIC). Supplementary Appendix B (SA B) also demonstrates how Laplace approximate marginal likelihood (LAML) estimation of smoothing parameters maintains statistical consistency of reduced rank spline estimates. The use of Laplace approximation and demonstration of statistical consistency requires the assumption that $\dim(\beta) = O(n^{\alpha})$ where α < 1/3.

3. Smoothness Selection Methods

This section describes the general smoothness selection method, and a simplified method for the special case in which the likelihood is a simple sum of terms for each observation of a univariate response, and there is a single GAM like linear predictor.

The nonlinear dependencies implied by employing a general smooth likelihood result in unwieldy expressions unless some care is taken to establish a compact notation. In the rest of this article, Greek subscripts denote partial differentiation with respect to the given variable, while Roman superscripts are indices associated with the derivatives. Hence, $D^{ij}_{\beta\theta} = \partial^2 D/\partial\beta_i\partial\theta_j$. Similarly $D^{ij}_{\hat\beta\theta} = \partial^2 D/\partial\beta_i\partial\theta_j|_{\hat\beta}$. Roman subscripts denote vector or array element indices. For matrices the first Roman sub- or superscript denotes rows, the second columns. Roman superscripts without a corresponding Greek subscript are labels, for example $\beta^1$ and $\beta^2$ denote two separate vectors β. For Hessian matrices only, $D^{\beta\theta}_{ij}$ is element i, j of the inverse of the matrix with elements $D^{ij}_{\beta\theta}$. If any Roman index appears in two or more multiplied terms, but the index is absent on the other side of the equation, then a summation over the product of the corresponding terms is indicated (the usual Einstein summation convention being somewhat unwieldy in this context). To aid readability, in this article summation indices will be highlighted in bold. For example, the equation $a_{\mathbf{i}j}b_{\mathbf{i}k}c_{\mathbf{i}l} + d_{jkl} = 0$ is equivalent to $\sum_i a_{ij}b_{ik}c_{il} + d_{jkl} = 0$. An indexed expression not in an equation is treated like an equation with no indices on the other side (so $a_{i\mathbf{j}}b_{\mathbf{j}}$ is interpreted as $\sum_j a_{ij}b_j$).

3.1. General Model Estimation

Consider the general case in which the log-likelihood depends on several smooth functions of predictor variables, each represented via a basis expansion and each with one or more associated penalties. The likelihood may also depend on some strictly parametric model components. The log-likelihood is assumed to satisfy the Fisher regularity conditions, and in addition we usually assume that it has four bounded continuous derivatives with respect to the parameters (with respect to $g_j(x)$ for any relevant fixed x in the case of a smooth, $g_j$). Let the model coefficients be β (recalling that this includes the vector θ of parametric coefficients and nuisance parameters). The penalized log-likelihood is then
\[
\mathcal L(\beta) = l(\beta) - \frac12\sum_j\lambda_j\beta^{\mathsf T}S_j\beta,
\]
and we assume that the model is well enough posed that this has a positive definite maximum (at least after dealing with any parameter redundancy issues that can be addressed by linear constraint). Let $\hat\beta$ be the maximizer of $\mathcal L$ and let $\mathcal H$ be the negative Hessian of $\mathcal L$, with elements $-\mathcal L^{ij}_{\hat\beta\hat\beta}$. The log LAML (see online SA C) is
\[
\mathcal V(\lambda) = \mathcal L(\hat\beta) + \frac12\log|S_\lambda|_+ - \frac12\log|\mathcal H| + \frac{M_p}{2}\log(2\pi),
\]
where $S_\lambda = \sum_j\lambda_jS_j$ and $|S_\lambda|_+$ is the product of the positive eigenvalues of $S_\lambda$; $M_p$ is the number of zero eigenvalues of $S_\lambda$ when all $\lambda_j$ are strictly positive. The basic strategy is to optimize $\mathcal V$ with respect to $\rho = \log(\lambda)$ via Newton's method. This requires $\hat\beta$ to be obtained for each trial ρ via an inner Newton iteration, and derivatives of $\hat\beta$ must be obtained by implicit differentiation. The log determinant computations have the potential to be computationally unstable, and reparameterization is needed to deal with this. The full Newton method based on computationally exact derivatives has the substantial practical advantage that it can readily be detected when $\mathcal V$ is indefinite with respect to a particular $\rho_i$, since then $\partial\mathcal V/\partial\rho_i \simeq \partial^2\mathcal V/\partial\rho_i^2 \simeq 0$. Such indefiniteness occurs when a smoothing parameter, $\lambda_i$, tends to infinity or a variance component tends to zero, both of which are perfectly legitimate. Dropping such a $\rho_i$ from the Newton update when indefiniteness is detected ensures that it takes a value which can be treated as a "working infinity" without overflowing. Methods which use an approximate Hessian, or none, do not have this advantage.
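For concreteness, evaluating V(λ) is only a few lines once β̂, the negative penalized Hessian and the positive eigenvalues of Sλ are available; the R sketch below assumes those inputs, and the closing call uses purely illustrative numbers.

```r
## Sketch: evaluating the log LAML, given the maximized penalized log-likelihood 'L_hat',
## the negative penalized Hessian 'H', the positive eigenvalues of S_lambda, and M_p.
laml <- function(L_hat, H, S_pos_eigen, Mp) {
  logdetS <- sum(log(S_pos_eigen))                                   ## log |S_lambda|_+
  logdetH <- as.numeric(determinant(H, logarithm = TRUE)$modulus)    ## log |H|
  L_hat + logdetS / 2 - logdetH / 2 + Mp / 2 * log(2 * pi)
}

## Toy illustration with placeholder inputs.
laml(L_hat = -12.3, H = diag(c(4, 5, 6)), S_pos_eigen = c(2, 2), Mp = 1)
```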

The proposed general method consists of outer and inner iterations, as follows.

Outer algorithm for ρ

1. Obtain initial values for $\rho = \log(\lambda)$, to ensure that the effective degrees of freedom of each smooth lies away from its maximum or minimum possible values.

2. Find initial $\hat\beta$ guesstimates (model specific).

3. Perform the initial reparameterizations required in Section 3.1.1 to facilitate stable computation of log |Sλ|+.

4. Repeat the following standard Newton iteration until convergence is detected at Step (c) (a compact numerical sketch of this loop is given after the listing).

(a) Find $\hat\beta$, $\mathcal V^i_\rho$ and $\mathcal V^{ij}_{\rho\rho}$ by the inner algorithm.

(b) Drop any $\mathcal V^i_\rho$, $\mathcal V^{ij}_{\rho\rho}$ and $\mathcal V^{ji}_{\rho\rho}$ for which $\mathcal V^i_\rho \simeq \mathcal V^{ii}_{\rho\rho} \simeq 0$. Let I denote the indices of the retained terms.

(c) Test for convergence, that is, all $\mathcal V^i_\rho \simeq 0$ and the Hessian (elements $-\mathcal V^{ji}_{\rho\rho}$) positive semidefinite.

(d) If necessary perturb the Hessian (elements $-\mathcal V^{ji}_{\rho\rho}$) to make it positive definite (guaranteeing that the Newton step will be an ascent direction).

(e) Define $\Delta_I$ as the subvector of Δ indexed by I, with elements $-\mathcal V^{\rho\rho}_{ij}\mathcal V^{j}_{\rho}$, and set Δj = 0 ∀ j ∉ I.

(f) While $\mathcal V(\rho+\Delta) < \mathcal V(\rho)$ set $\Delta \leftarrow \Delta/2$.

(g) Set $\rho \leftarrow \rho + \Delta$.

5. Reverse the Step 3 reparameterization.
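A minimal R sketch of the Step 4 loop follows, assuming a user-supplied function VgH(rho) that returns the LAML value, gradient and Hessian computed by the inner algorithm; the convergence test is simplified relative to the full method.

```r
## Sketch of the outer Newton iteration (Step 4). 'VgH(rho)' is assumed to return
## list(V = LAML value, g = gradient, H = Hessian) from the inner algorithm.
outer_newton <- function(rho, VgH, tol = 1e-7, maxit = 200) {
  for (iter in 1:maxit) {
    o <- VgH(rho)
    keep <- !(abs(o$g) < tol & abs(diag(o$H)) < tol)    ## (b) drop components indefinite in V
    if (!any(keep) || all(abs(o$g[keep]) < tol)) break  ## (c) simplified convergence test
    Hk  <- o$H[keep, keep, drop = FALSE]
    eps <- 0
    repeat {                                            ## (d) perturb -H until positive definite
      R <- try(chol(-Hk + eps * diag(nrow(Hk))), silent = TRUE)
      if (!inherits(R, "try-error")) break
      eps <- max(2 * eps, 1e-8)
    }
    delta <- numeric(length(rho))
    delta[keep] <- backsolve(R, forwardsolve(t(R), o$g[keep]))  ## (e) Newton step, ascent direction
    while (VgH(rho + delta)$V < o$V) delta <- delta / 2         ## (f) step halving
    rho <- rho + delta                                          ## (g)
  }
  rho
}
```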

The method for evaluating $\mathcal V$ and its gradient and Hessian with respect to ρ is as follows, where $\mathcal L^{\hat\beta\hat\beta}_{kj}$ denotes the inverse of the penalized Hessian with elements $\mathcal L^{kj}_{\hat\beta\hat\beta}$.

Inner algorithm for β

1. Reparameterize to deal with any "type 3" penalty blocks as described in Section 3.1.1, so that computation of log |Sλ|+ is stable, and evaluate the derivatives of log |Sλ|+.

2. Use Newton's method to find $\hat\beta$, regularizing the Hessian and applying step length control, to ensure convergence even when the Hessian is indefinite and/or $\hat\beta$ is not identifiable, as described in Section 3.1.2.

3. Test for identifiability of $\hat\beta$ at convergence by examining the rank of the Hessian as described in Section 3.1.2. Drop unidentifiable coefficients.

4. If coefficients were dropped, find the reduced $\hat\beta$ by further steps of Newton's method (Section 3.1.2).

5. Compute $d\hat\beta_i/d\rho_k = \mathcal L^{\hat\beta\hat\beta}_{ij}\lambda_kS^k_{jl}\hat\beta_l$ and hence $l^{ijl}_{\hat\beta\hat\beta\rho} = l^{ijk}_{\hat\beta\hat\beta\hat\beta}\,d\hat\beta_k/d\rho_l$ (Section 3.1.3).

6. Compute $d^2\hat\beta_i/d\rho_kd\rho_l = \mathcal L^{\hat\beta\hat\beta}_{ij}\{(-l^{jpl}_{\hat\beta\hat\beta\rho} + \lambda_lS^l_{jp})\,d\hat\beta_p/d\rho_k + \lambda_kS^k_{jp}\,d\hat\beta_p/d\rho_l\} + \delta^l_k\,d\hat\beta_i/d\rho_k$ (Section 3.1.3).

7. Compute $\mathcal L^{\hat\beta\hat\beta}_{kj}\,l^{jkpv}_{\hat\beta\hat\beta\rho\rho}$ (model specific, Section 3.1.3).

8. The derivatives of $\mathcal V$ can now be computed according to Section 3.1.4.

9. For each parameter dropped from $\hat\beta$ during fitting, zeroes must be inserted in $\hat\beta$ and $\partial\hat\beta/\partial\rho_j$, and in the corresponding rows and columns of $\mathcal L^{\hat\beta\hat\beta}_{kj}$. The Step 1 reparameterization is then reversed.

The following subsections fill in the method details, but note that to implement a particular model in this class it is necessary to be able to compute $l$, $l^i_\beta$ and $l^{ij}_{\beta\beta}$, given β, along with $l^{ijk}_{\hat\beta\hat\beta\rho}$ given $d\hat\beta/d\rho_k$, and $\mathcal L^{\hat\beta\hat\beta}_{kj}\,l^{jkpv}_{\hat\beta\hat\beta\rho\rho}$ given $d^2\hat\beta/d\rho_kd\rho_l$. The last of these is usually computable much more efficiently than if $l^{jkpv}_{\hat\beta\hat\beta\rho\rho}$ were computed explicitly.

3.1.1. Derivatives and Stable Evaluation of log |Sλ|+

This section covers the details for outer Step 3 and inner Step 1. Stable evaluation of the log determinant terms is the key to stable computation with the LAML. The online SA C explains the issue. Wood (2011) proposed a solution which involves orthogonal transformation of the whole parameter vector β, but in the general case the likelihood may depend on each smooth function separately and such a transformation is therefore untenable. It is necessary to develop a reparameterization strategy which does not combine coefficients from different smooths. This is possible if we recognize that Sλ is block diagonal, with different blocks relating to different smooths. For example, if $\bar S_j$ denotes the nonzero sub-block of $S_j$,
\[
S_\lambda = \begin{pmatrix}
\lambda_1\bar S_1 & \cdot & \cdot & \cdot\\
\cdot & \lambda_2\bar S_2 & \cdot & \cdot\\
\cdot & \cdot & \lambda_{\mathbf j}\bar S_{\mathbf j} & \cdot\\
\cdot & \cdot & \cdot & \ddots
\end{pmatrix}.
\]
That is, there are some blocks with single smoothing parameters, and others with a more complicated additive structure. There are usually also some zero blocks on the diagonal. The block structure means that the generalized determinant, its derivatives with respect to ρk = log λk and the matrix square root of Sλ can all be computed blockwise. So, for the above example,
\[
\log|S_\lambda|_+ = \mathrm{rank}(\bar S_1)\log(\lambda_1) + \log|\bar S_1|_+ + \mathrm{rank}(\bar S_2)\log(\lambda_2) + \log|\bar S_2|_+ + \log\big|\lambda_{\mathbf j}\bar S_{\mathbf j}\big|_+ + \cdots
\]
For any ρk relating to a single parameter block we have
\[
\frac{\partial\log|S_\lambda|_+}{\partial\rho_k} = \mathrm{rank}(\bar S_k),
\]
and zero second derivatives. For multi-λ blocks there will generally be first and second derivatives to compute. There are no second derivatives "between blocks."
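The single-λ case is particularly simple, as the following R fragment illustrates for a hypothetical list of single-parameter penalty blocks (multi-λ blocks need the extra treatment described in the list that follows).

```r
## Sketch: blockwise log|S_lambda|_+ and its rho derivatives for hypothetical single-lambda blocks.
## For such blocks d log|S_lambda|_+ / d rho_k = rank(S_k); multi-lambda blocks need more work.
logdet_Slambda <- function(lambda, S_blocks) {
  logdet  <- 0
  dlogdet <- numeric(length(S_blocks))
  for (k in seq_along(S_blocks)) {
    ev  <- eigen(S_blocks[[k]], symmetric = TRUE, only.values = TRUE)$values
    pos <- ev > max(ev) * 1e-10                 ## positive eigenvalues give the rank and |.|_+
    logdet <- logdet + sum(pos) * log(lambda[k]) + sum(log(ev[pos]))
    dlogdet[k] <- sum(pos)
  }
  list(logdet = logdet, dlogdet = dlogdet)
}
```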

In general, there are three block types, each requiring different preprocessing.

1. Single parameter diagonal blocks. A reparameterization can be used so that all nonzero elements are one, and the rank precomputed.

2. Single parameter dense blocks. An orthogonal reparameterization, based on the eigenvectors of the symmetric eigen-decomposition of the block, can be used to make these blocks look like the previous type (by similarity transform). Again the rank is precomputed.

3. Multi-λ blocks require the reparameterization method of Wood (2011, Appendix B) to be applied for each new ρ proposal, since the numerical problem that the reparameterization avoids is ρ dependent (see online SA C). Initially, before the smoothing parameter selection iteration, it is necessary to reparameterize to separate the parameters corresponding to the block into penalized and unpenalized subvectors. This initial reparameterization can be based on the eigenvectors of the symmetric eigen-decomposition of the "balanced" version of the block penalty matrix, $\sum_j\bar S_j/\|\bar S_j\|_F$, where ‖ · ‖F is the Frobenius norm. The balanced penalty is used for maximal numerical stability, and is usable because formally the spaces of the penalized and unpenalized components do not change with the smoothing parameters.

The reparameterizations from each block type are applied to the model, usually to the model matrices Xj of the individual smooth terms. The reparameterization information must be stored so that we can return to the original parameterization at the end.

After the one-off initial reparameterization just described, Step 1 of the inner algorithm requires only that the reparameterization method of Wood (2011, Appendix B) be applied to the parameters corresponding to type 3 blocks, for each new set of smoothing parameters.

3.1.2. Newton Iteration for β^

This section provides details for inner Steps 2–4. Newton iteration for $\hat\beta$ requires the gradient vector, $\mathcal G$, with elements $\mathcal L^i_\beta = l^i_\beta - \lambda_kS^k_{ij}\beta_j$, and the negative Hessian matrix $\mathcal H$ with elements $-\mathcal L^{ij}_{\beta\beta} = -l^{ij}_{\beta\beta} + \lambda_kS^k_{ij}$ (we will also use H to denote the Hessian of the negative unpenalized log-likelihood, with elements $-l^{ij}_{\beta\beta}$). In principle Newton iteration proceeds by repeatedly setting β to $\beta + \Delta$, where $\Delta = \mathcal H^{-1}\mathcal G$. In practice, Newton's method is only guaranteed to converge to a maximum of $\mathcal L$ provided (i) that the Hessian is perturbed to be positive definite if it is not, guaranteeing that the Newton direction is an ascent direction, (ii) that step reduction is used to ensure that the step taken actually increases $\mathcal L$, and (iii) that the computation of the step is numerically stable (see Nocedal and Wright 2006).

$\mathcal L$ may be indefinite away from a maximum, but even near the maximum there are two basic impediments to stability and positive definiteness. First, some elements of β may be unidentifiable. This issue will be dealt with by dropping parameters at convergence, as described shortly. The second issue is that some smoothing parameters may legitimately become very large during fitting, resulting in very large $\lambda_jS_j$ components, poor scaling, poor conditioning and, hence, computational singularity. However, given the initial and Step 1 reparameterizations, such large elements can be dealt with by diagonal preconditioning of $\mathcal H$. That is, define the diagonal matrix D such that $D_{ii} = |\mathcal H_{ii}|^{-1/2}$, and the preconditioned Hessian $\mathcal H' = D\mathcal HD$. Then $\mathcal H^{-1} = D\mathcal H'^{-1}D$, with the right-hand side resulting in much better scaled computation. In the work reported here the pivoted Cholesky decomposition $R^{\mathsf T}R = \mathcal H' + \epsilon I$ is repeated with increasing ε, starting from zero, until positive definiteness is obtained. The Newton step is then computed as $\Delta = DR^{-1}R^{-\mathsf T}D\mathcal G$. If the step to $\beta + \Delta$ fails to increase the penalized likelihood, then Δ is repeatedly halved until it does. Note that the perturbation of the Hessian does not change the converged state of the Newton algorithm (although varying the perturbation strength can change the algorithm's convergence rate).
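A compact R sketch of one such stabilized Newton step is given below, taking the penalized gradient and negative penalized Hessian as inputs; for brevity it uses an unpivoted Cholesky factorization rather than the pivoted version described above.

```r
## Sketch of one stabilized Newton step for beta (Section 3.1.2): diagonal preconditioning
## plus a progressively increased ridge until the Cholesky factorization succeeds.
## Inputs: penalized gradient G and negative penalized Hessian H at the current beta.
newton_step_beta <- function(G, H) {
  d  <- 1 / sqrt(abs(diag(H)))                 ## D_ii = |H_ii|^{-1/2}
  Hp <- d * t(d * H)                           ## preconditioned Hessian H' = D H D
  eps <- 0
  repeat {
    R <- try(chol(Hp + eps * diag(nrow(Hp))), silent = TRUE)
    if (!inherits(R, "try-error")) break
    eps <- max(2 * eps, 1e-8)
  }
  ## Delta = D R^{-1} R^{-T} D G, an ascent direction for the penalized log-likelihood
  d * backsolve(R, forwardsolve(t(R), d * G))
}
```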

At convergence $\mathcal H$ can at worst be positive semidefinite, but it is necessary to test for the possibility that some parameters are unidentifiable. The test should not depend on the particular values of the smoothing parameters. This can be achieved by constructing the balanced penalty $S = \sum_jS_j/\|S_j\|_F$ (‖ · ‖F is the Frobenius norm, but another norm could equally well be used), and then forming the pivoted Cholesky decomposition $P^{\mathsf T}P = H/\|H\|_F + S/\|S\|_F$, where H is the unpenalized negative Hessian. The rank of P can then be estimated by making use of Cline et al. (1979). If this reveals rank deficiency of order q, then the coefficients corresponding to the matrix rows and columns pivoted to the last q positions should be dropped from the analysis. The balanced penalty is used to avoid dropping parameters simply because some smoothing parameters are very large. Given the nonlinear setting, it is necessary to repeat the Newton iteration to convergence with the reduced parameter set, in order that the remaining parameters adjust to the omission of those dropped.

3.1.3. Implicit Differentiation

This section provides the details for inner Steps 5–7. We obtain the derivatives of the identifiable elements of $\hat\beta$ with respect to ρ. All computations here are in the reduced parameter space, if parameters were dropped. At the maximum penalized likelihood estimate we have $\mathcal L^i_{\hat\beta} = l^i_{\hat\beta} - \lambda_kS^k_{ij}\hat\beta_j = 0$, and differentiating with respect to ρk = log λk yields
\[
\mathcal L^{ik}_{\hat\beta\rho} = l^{ij}_{\hat\beta\hat\beta}\frac{d\hat\beta_j}{d\rho_k} - \lambda_kS^k_{ij}\hat\beta_j - \lambda_lS^l_{ij}\frac{d\hat\beta_j}{d\rho_k} = 0,
\quad\text{and rearranging,}\quad
\frac{d\hat\beta_i}{d\rho_k} = \mathcal L^{\hat\beta\hat\beta}_{ij}\lambda_kS^k_{jl}\hat\beta_l,
\]
given which we can compute $l^{ijl}_{\hat\beta\hat\beta\rho} = l^{ijk}_{\hat\beta\hat\beta\hat\beta}\,d\hat\beta_k/d\rho_l$ from the model specification. $-l^{ijl}_{\hat\beta\hat\beta\rho} + \delta^l_k\lambda_kS^k_{ij}$ are the elements of $\partial\mathcal H/\partial\rho_l$, required in the next section ($\delta^l_k$ is 1 for l = k and 0 otherwise). Then
\[
\frac{d^2\hat\beta_i}{d\rho_kd\rho_l} = \mathcal L^{\hat\beta\hat\beta}_{ij}\left\{\left(-l^{jpl}_{\hat\beta\hat\beta\rho} + \lambda_lS^l_{jp}\right)\frac{d\hat\beta_p}{d\rho_k} + \lambda_kS^k_{jp}\frac{d\hat\beta_p}{d\rho_l}\right\} + \delta^l_k\frac{d\hat\beta_i}{d\rho_k},
\]
which enables computations involving $\partial^2\mathcal H/\partial\rho_k\partial\rho_l$, with elements $-l^{ijkl}_{\hat\beta\hat\beta\rho\rho} + \delta^l_k\lambda_kS^k_{ij}$, and
\[
l^{ijkl}_{\hat\beta\hat\beta\rho\rho} = l^{ijrt}_{\hat\beta\hat\beta\hat\beta\hat\beta}\frac{d\hat\beta_r}{d\rho_k}\frac{d\hat\beta_t}{d\rho_l} + l^{ijr}_{\hat\beta\hat\beta\hat\beta}\frac{d^2\hat\beta_r}{d\rho_kd\rho_l}.
\]
As mentioned in Section 3.1, it will generally be inefficient to form this last quantity explicitly, as it occurs only in the summations involved in computing the final trace in (3).
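In matrix form the first of these results is a single linear solve against the penalized Hessian, as the following R sketch shows; the inputs (β̂, the negative penalized Hessian, the smoothing parameters and their penalty matrices) are assumed to come from a converged inner iteration.

```r
## Sketch: first-order implicit differentiation, d beta_hat / d rho_k = -Hp^{-1} lambda_k S_k beta_hat,
## where Hp is the (positive definite) negative penalized Hessian.
dbeta_drho <- function(beta_hat, Hp, lambda, S_list) {
  rhs <- sapply(seq_along(S_list),
                function(k) -lambda[k] * (S_list[[k]] %*% beta_hat))
  solve(Hp, rhs)     ## p x M matrix; column k is d beta_hat / d rho_k
}
```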

3.1.4. The Remaining Derivatives

Recalling that $\mathcal H$ is the matrix with elements $-\mathcal L^{ij}_{\beta\beta} = -l^{ij}_{\beta\beta} + \lambda_kS^k_{ij}$, we require (inner Step 8)
\[
\frac{\partial\mathcal V}{\partial\rho_k} = -\frac{\lambda_k}{2}\hat\beta^{\mathsf T}S_k\hat\beta + \frac12\frac{\partial\log|S_\lambda|_+}{\partial\rho_k} - \frac12\frac{\partial\log|\mathcal H|}{\partial\rho_k}
\]
and
\[
\frac{\partial^2\mathcal V}{\partial\rho_k\partial\rho_l} = -\delta^l_k\frac{\lambda_k}{2}\hat\beta^{\mathsf T}S_k\hat\beta - \lambda_k\frac{d\hat\beta{}^{\mathsf T}}{d\rho_l}S_k\hat\beta + \frac12\frac{\partial^2\log|S_\lambda|_+}{\partial\rho_k\partial\rho_l} - \frac12\frac{\partial^2\log|\mathcal H|}{\partial\rho_k\partial\rho_l},
\]
where components involving $\mathcal L^j_{\hat\beta}$ are zero by definition of $\hat\beta$, and have been omitted. The components not covered so far are
\[
\frac{\partial\log|\mathcal H|}{\partial\rho_k} = \mathrm{tr}\left(\mathcal H^{-1}\frac{\partial\mathcal H}{\partial\rho_k}\right)
\quad\text{and}\quad
\frac{\partial^2\log|\mathcal H|}{\partial\rho_k\partial\rho_l} = -\mathrm{tr}\left(\mathcal H^{-1}\frac{\partial\mathcal H}{\partial\rho_k}\mathcal H^{-1}\frac{\partial\mathcal H}{\partial\rho_l}\right) + \mathrm{tr}\left(\mathcal H^{-1}\frac{\partial^2\mathcal H}{\partial\rho_k\partial\rho_l}\right). \qquad (3)
\]
The final term above is expensive if computed naively by explicitly forming each $\partial^2\mathcal H/\partial\rho_k\partial\rho_l$, but this is unnecessary, and the computation of $\mathrm{tr}(\mathcal H^{-1}\partial^2\mathcal H/\partial\rho_k\partial\rho_l)$ can usually be performed efficiently as the final part of the model specification, keeping the total cost to $O(Mnp^2)$: see online SA G and Section 3.2 for illustrative examples.

The Cox (1972) proportional hazards model provides a straightforward application of the general method, and the requisite computations are set out in online SA G in a manner that maintains O(Mnp2) computational cost. Another example is the multivariate additive model, in which the means of a multivariate Gaussian response are given by separate linear predictors, which may optionally share terms. This model is covered in the online SA H and Section 8. Section 3.2 considers how another class of models falls into the general framework.

3.2. A Special Case: GAMLSS Models

The GAMLSS (or “distributional regression”) models discussed by Rigby and Stasinopoulos (2005) (and also Yee and Wild 1996; Klein et al. 2014, 2015) fall within the scope of the general method. The idea is that we have independent univariate response observations, yi, whose distributions depend on several unknown parameters, each of which is determined by its own linear predictor. The log-likelihood is a straightforward sum of contributions from each yi (unlike the Cox models, e.g.), and the special structure can be exploited so that implementation of new models in this class requires only the supply of some derivatives of the log-likelihood terms with respect to the distribution parameters. Given the notational conventions established previously, the expressions facilitating this are rather compact (without such a notation they can easily become intractably complex).

Let the log-likelihood for the ith observation be $l(y_i, \eta_{1i}, \eta_{2i}, \ldots)$, where the $\eta_k = X^k\beta^k$ are the K linear predictors. The Newton iteration for estimating $\beta = (\beta^{1\mathsf T}, \beta^{2\mathsf T}, \ldots)^{\mathsf T}$ requires $l^j_{\beta^l} = l^{i}_{\eta_l}X^l_{ij}$ and $l^{jk}_{\beta^l\beta^m} = l^{ii}_{\eta_l\eta_m}X^l_{ij}X^m_{ik}$, which are also sufficient for first-order implicit differentiation.
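Because the log-likelihood is a sum over observations, each gradient block and Hessian block is just a weighted cross-product of design matrices. The R sketch below assumes the family supplies the per-observation derivative arrays l1 and l2; all names are illustrative.

```r
## Sketch: gradient and Hessian blocks for a model with K linear predictors eta_k = X[[k]] %*% beta_k.
## 'l1[i, k]' holds dl_i/d eta_{ki} and 'l2[i, l, m]' holds d^2 l_i / d eta_{li} d eta_{mi} (family supplied).
gamlss_grad_hess <- function(X, l1, l2) {
  K <- length(X)
  g <- lapply(1:K, function(k) drop(crossprod(X[[k]], l1[, k])))          ## gradient block k: X_k' l1_k
  H <- vector("list", K)
  for (l in 1:K) {
    H[[l]] <- vector("list", K)
    for (m in 1:K) H[[l]][[m]] <- crossprod(X[[l]], l2[, l, m] * X[[m]])  ## block (l,m): X_l' diag(l2_lm) X_m
  }
  list(gradient = g, hessian_blocks = H)
}
```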

LAML optimization also requires
\[
l^{jkp}_{\hat\beta^l\hat\beta^m\rho} = l^{jkr}_{\hat\beta^l\hat\beta^m\hat\beta^q}\frac{d\hat\beta^q_r}{d\rho_p}
= l^{iii}_{\hat\eta_l\hat\eta_m\hat\eta_q}X^l_{ij}X^m_{ik}X^q_{ir}\frac{d\hat\beta^q_r}{d\rho_p}
= l^{iii}_{\hat\eta_l\hat\eta_m\hat\eta_q}X^l_{ij}X^m_{ik}\frac{d\hat\eta_{qi}}{d\rho_p}.
\]
Notice how this is just an inner product of the form $X^{\mathsf T}VX$, where the diagonal matrix V is the sum over q of some diagonal matrices. At this stage the second derivatives of $\hat\beta$ with respect to ρ can be computed, after which we require only
\[
l^{jkpv}_{\hat\beta^l\hat\beta^m\rho\rho} = l^{jkrt}_{\hat\beta^l\hat\beta^m\hat\beta^q\hat\beta^s}\frac{d\hat\beta^q_r}{d\rho_p}\frac{d\hat\beta^s_t}{d\rho_v} + l^{jkr}_{\hat\beta^l\hat\beta^m\hat\beta^q}\frac{d^2\hat\beta^q_r}{d\rho_pd\rho_v}
= l^{iiii}_{\hat\eta_l\hat\eta_m\hat\eta_q\hat\eta_s}X^l_{ij}X^m_{ik}\frac{d\hat\eta_{qi}}{d\rho_p}\frac{d\hat\eta_{si}}{d\rho_v} + l^{iii}_{\hat\eta_l\hat\eta_m\hat\eta_q}X^l_{ij}X^m_{ik}\frac{d^2\hat\eta_{qi}}{d\rho_pd\rho_v}.
\]
So to implement a new family for GAMLSS estimation requires mixed derivatives up to fourth order with respect to the parameters of the likelihood. In most cases what would be conveniently available is, for example, $l^{iiii}_{\hat\mu_l\hat\mu_m\hat\mu_q\hat\mu_s}$ rather than $l^{iiii}_{\hat\eta_l\hat\eta_m\hat\eta_q\hat\eta_s}$, where μk is the kth parameter of the likelihood and is given by hk(μk) = ηk, hk being a link function.

To get from the μ derivatives to the η derivatives, the rules (A.1)–(A.4) from Appendix A are used. This is straightforward for any derivative that is not mixed. For mixed derivatives containing at least one first-order derivative, the transformation rule applying to the highest order derivative is applied first, followed by the transformations for the first-order derivatives. This leaves only the transformation of $l^{iiii}_{\hat\mu_j\hat\mu_j\hat\mu_k\hat\mu_k}$ as at all awkward, but we have
\[
l^{iiii}_{\hat\eta_j\hat\eta_j\hat\eta_k\hat\eta_k} = \left(l^{iiii}_{\hat\mu_j\hat\mu_j\hat\mu_k\hat\mu_k}/h_{ij}'^2 - l^{iii}_{\hat\mu_j\hat\mu_k\hat\mu_k}h_{ij}''/h_{ij}'^3\right)/h_{ik}'^2 - \left(l^{iii}_{\hat\mu_j\hat\mu_j\hat\mu_k}/h_{ij}'^2 - l^{ii}_{\hat\mu_j\hat\mu_k}h_{ij}''/h_{ij}'^3\right)h_{ik}''/h_{ik}'^3,
\]
where $h_{ij}'$ denotes $h_j'(\mu_{ji})$, and so on.

The general method requires $\mathcal L^{\hat\beta\hat\beta}_{kj}\,l^{jkpv}_{\hat\beta\hat\beta\rho\rho}$ to be computed, which would have $O\{M(M+1)nP^2/2\}$ cost if the terms $l^{jkpv}_{\hat\beta\hat\beta\rho\rho}$ were computed explicitly for this purpose (where P is the dimension of the combined β). However, this can be reduced to $O(nP^2)$ using a trick most easily explained by switching to a matrix representation. For simplicity of presentation assume K = 2, and define the matrix B to be the inverse of the penalized Hessian, so that $B_{ij} = \mathcal L^{\hat\beta\hat\beta}_{ij}$. Defining
\[
v^{lm}_i = l^{iiii}_{\hat\eta_l\hat\eta_m\hat\eta_q\hat\eta_s}\frac{d\hat\eta_{qi}}{d\rho_p}\frac{d\hat\eta_{si}}{d\rho_v} + l^{iii}_{\hat\eta_l\hat\eta_m\hat\eta_q}\frac{d^2\hat\eta_{qi}}{d\rho_pd\rho_v}
\quad\text{and}\quad V^{lm} = \mathrm{diag}(v^{lm}_i),
\]
we have
\[
\mathcal L^{\hat\beta\hat\beta}_{kj}\,l^{jkpv}_{\hat\beta\hat\beta\rho\rho}
= \mathrm{tr}\left\{B\begin{pmatrix}X^{1\mathsf T}V^{11}X^1 & X^{1\mathsf T}V^{12}X^2\\ X^{2\mathsf T}V^{12}X^1 & X^{2\mathsf T}V^{22}X^2\end{pmatrix}\right\}
= \mathrm{tr}\left\{B\begin{pmatrix}X^{1\mathsf T} & 0\\ 0 & X^{2\mathsf T}\end{pmatrix}\begin{pmatrix}V^{11}X^1 & V^{12}X^2\\ V^{12}X^1 & V^{22}X^2\end{pmatrix}\right\}. \qquad (4)
\]
Hence, following the one-off formation of $B\left(\begin{smallmatrix}X^{1\mathsf T} & 0\\ 0 & X^{2\mathsf T}\end{smallmatrix}\right)$ (which need only have $O(nP^2)$ cost), each trace computation has $O(MnP)$ cost (since $\mathrm{tr}(C^{\mathsf T}D) = C_{ij}D_{ij}$).

See online SA I, where a zero inflated Poisson model provides an example of the details. Figure 2 shows estimates for the model $\mathrm{accel}_i \sim N(f_1(t_i), \sigma_i^2)$, where $\log\sigma_i = f_2(t_i)$, $f_1$ is an adaptive P-spline and $f_2$ a cubic regression spline, while SA F.2 provides another application. Package mgcv also includes multinomial logistic regression implemented this way and further examples are under development. An interesting possibility with any model which has multiple linear predictors is that one or more of those predictors should depend on some of the same terms, and online SA H shows how this can be handled.

Figure 2. A smooth Gaussian location scale model fit to the motorcycle data from Silverman (1985), using the methods developed in Section 3.2. The left plot shows the raw data as open circles and an adaptive p-spline smoother for the mean overlaid. The right plot shows the simultaneous estimate of the standard deviation in the acceleration measurements, with the absolute values of the residuals as circles. Dotted curves are approximate 95% confidence intervals. The effective degrees of freedom of the smooths are 12.5 and 7.3 respectively.

3.3. A More Special Case: Extended Generalized Additive Models

For models with a single linear predictor, in which the log-likelihood is a sum of contributions per $y_i$, it is possible to perform fitting by iterative weighted least squares, enabling profitable reuse of some components of standard GAM fitting methods, including the exploitation of very stable orthogonal methods for solving least squares problems. Specifically, consider observations $y_i$, and let the corresponding log-likelihood be of the form
\[
l = \sum_i l_i(y_i, \mu_i, \theta, \phi),
\]
where the terms in the summation may also be written as $l_i$ for short, and $\mu_i$ is often $E(y_i)$, but may also be a latent variable (as in the ordered categorical model of SA K). Given h, a known link function, $h(\mu_i) = \eta_i$, where $\eta = X\beta + o$, X is a model matrix, β is a parameter vector and o is an offset (often simply 0). θ is a parameter vector containing the extra parameters of the likelihood, such as the p parameter of a Tweedie density (see online SA J), or the cut points of an ordered categorical model (see online SA K). Notice that in this case θ is not treated as part of β, since θ cannot always be estimated by straightforward iterative regression. Instead θ will be estimated alongside the smoothing parameters. φ is a scale parameter, often fixed at one. Let $\tilde l_i = \max_{\mu_i} l_i(y_i, \mu_i, \theta, \phi)$ denote the saturated log-likelihood. Define the deviance corresponding to $y_i$ as $D_i = 2(\tilde l_i - l_i)\phi$, where φ is the scale parameter, on which $D_i$ does not depend. Working in terms of the deviance is convenient in a regression setting, where deviance residuals are a preferred method for model checking and the proportion deviance explained is a natural substitute for the r² statistic as a measure of goodness of fit (but see the final comment in online SA I).

In general the estimates of β will depend on some log smoothing parameters ρj = log λj, and it is notationally expedient to consider these to be part of the vector θ, although it is to be understood that l does not actually depend on these elements of θ. Given θ, estimation of β is by minimization of the penalized deviance
\[
D(\beta,\theta) = \sum_iD_i(\beta,\theta) + \sum_j\lambda_j\beta^{\mathsf T}S_j\beta,
\]
with respect to β. This can be achieved by penalized iteratively reweighted least squares (PIRLS), which consists of iterative minimization of
\[
\sum_iw_i(z_i - X_i\beta)^2 + \sum_j\lambda_j\beta^{\mathsf T}S_j\beta,
\]
where the pseudodata and weights are given by
\[
z_i = \eta_i - o_i - \frac{1}{2w_i}\frac{\partial D_i}{\partial\eta_i}, \qquad w_i = \frac12\frac{\partial^2D_i}{\partial\eta_i^2}.
\]
Note that if wi = 0 (or wi is too close to 0), the penalized least squares estimate can be computed using only wizi, which is then well defined and finite when zi is not.
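A bare-bones R sketch of the PIRLS iteration follows, with the deviance derivatives with respect to η assumed to be supplied by the family as functions dDeta(y, eta) and d2Deta(y, eta); the safeguard for wi near zero mentioned above is omitted.

```r
## Bare-bones PIRLS sketch for a single linear predictor model. 'dDeta' and 'd2Deta' are assumed
## family-supplied functions giving dD_i/deta_i and d^2 D_i/deta_i^2.
pirls <- function(y, X, S_lambda, dDeta, d2Deta, offset = 0, maxit = 50, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    eta <- drop(X %*% beta) + offset
    w   <- d2Deta(y, eta) / 2                       ## w_i = (1/2) d^2 D_i / d eta_i^2
    z   <- eta - offset - dDeta(y, eta) / (2 * w)   ## pseudodata z_i
    beta_new <- solve(crossprod(X, w * X) + S_lambda, crossprod(X, w * z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}
```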

Estimation of θ, and possibly φ, is by LAML. Writing W for the diagonal matrix of $w_i$ values, the log LAML is given by
\[
\mathcal V(\theta,\phi) = -\frac{D(\hat\beta,\theta)}{2\phi} + \tilde l(\theta,\phi) - \frac{\log|X^{\mathsf T}WX + S_\lambda| - \log|S_\lambda|_+}{2} + \frac{M_p}{2}\log(2\pi\phi),
\]
where W is evaluated at the $\hat\beta$ implied by θ. To compute the derivatives of $\mathcal V$ with respect to θ, the derivatives of $\hat\beta$ with respect to θ are required. Note that $\mathcal V$ is a full Laplace approximation, rather than the "approximate" Laplace approximation used to justify PQL (Breslow and Clayton 1993), so that PQL's well known problems with binary and low count data are much reduced. In particular: (i) most PQL implementations estimate φ when fitting the working linear mixed model, even in the binomial and Poisson cases, where it is fixed at 1. For binary and low count data this can give very poor results. (ii) PQL uses the expected Hessian rather than the Hessian, and these only coincide for the canonical link case. (iii) PQL is justified by an assumption that the iterative fitting weights only vary slowly with the smoothing parameters, an assumption that is not needed here.

The parameters θ and φ can be estimated by maximizing $\mathcal V$ using Newton's method, or a quasi-Newton method. Notice that $\mathcal V$ depends directly on the elements of θ via D, $\tilde l$ and $S_\lambda$, but also indirectly via the dependence of $\hat\mu$ and W on $\hat\beta$ and hence on θ. Hence, each trial (θ, φ) requires a PIRLS iteration to find the corresponding $\hat\beta$, followed by implicit differentiation to find the derivatives of $\hat\beta$ with respect to θ. Once these are obtained, the chain rule can be applied to find the derivatives of $\mathcal V$ with respect to θ and φ.

As illustrated in SA C, there is scope for serious numerical instability in the evaluation of the determinant terms in V, but for this case we can reuse the stabilization strategy from Wood (2011), namely for each trial θ and φ:

1. Use the orthogonal reparameterization from Appendix B of Wood (2011) to ensure that log |Sλ|+ can be computed in a stable manner.

2. Estimate $\hat\beta$ by PIRLS using the stable least squares method for negatively weighted problems from Section 3.3 of Wood (2011), setting structurally unidentifiable coefficients to zero.

3. Using implicit differentiation, obtain the derivatives of $\mathcal V$ required for a Newton update.

Step 3 is substantially more complicated than in Wood (2011), and is covered in Appendix A.

3.3.1. Extended GAM New Model Implementation

The general formulation above assumes that various standard information is available for each distribution and link. What is needed depends on whether quasi-Newton or full Newton is used to find $\hat\theta$. In summary, for each distribution we need:

1. For finding $\hat\beta$: $D^i_\mu$, $D^{ii}_{\mu\mu}$, h′, and h′′.

2. For $\hat\rho$ via quasi-Newton: h′′′, $D^{ij}_{\mu\theta}$, $D^{i}_{\theta}$, $D^{iii}_{\mu\mu\mu}$, and $D^{iij}_{\mu\mu\theta}$.

3. For $\hat\rho$ via full Newton: h′′′′, $D^{ij}_{\theta\theta}$, $D^{ijk}_{\mu\theta\theta}$, $D^{iiii}_{\mu\mu\mu\mu}$, $D^{iiij}_{\mu\mu\mu\theta}$, and $D^{iijk}_{\mu\mu\theta\theta}$.

In addition, first and second derivatives of $\tilde l$ with respect to its arguments are needed. All of these quantities can be obtained automatically using a computer algebra package. $E(D^{ii}_{\mu\mu})$ is also useful for further inference. If it is not readily computed then we can substitute $D^{ii}_{\mu\mu}$, but a complication of penalized modeling is that $D^{ii}_{\mu\mu}$ can fail to be positive definite at $\hat\beta$. When this happens $E(D^{ii}_{\mu\mu})$ can be estimated as the nearest positive definite matrix to $D^{ii}_{\mu\mu}$.
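In R, for example, the base symbolic differentiator D() can generate such derivatives directly; the sketch below does this for a negative binomial log-likelihood written as a function of μ with θ treated as known, purely as an illustration of the approach (it is not the code used by any particular package).

```r
## Sketch: generating the mu derivatives needed above with R's symbolic differentiator D().
## Negative binomial log-likelihood for one observation, as a function of mu (theta known).
li <- quote(lgamma(y + theta) - lgamma(theta) - lgamma(y + 1) +
            theta * log(theta) + y * log(mu) - (theta + y) * log(theta + mu))
d1 <- D(li, "mu")                 ## dl_i/dmu
d2 <- D(d1, "mu")                 ## d^2 l_i/dmu^2, and so on up to fourth order
d3 <- D(d2, "mu"); d4 <- D(d3, "mu")
eval(d2, list(y = 3, mu = 2.5, theta = 1.2))   ## numerical evaluation at example values
```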

We have implemented beta, negative binomial, scaled t models for heavy tailed data, simple zero inflated Poisson, ordered categorical and Tweedie additive models in this way. The first three were essentially automatic: the derivatives were computed by a symbolic algebra package and coded from the results. Some care is required in doing this, to avoid excessive cancellation error, underflow or overflow in the computations. Overly naive coding of derivatives can often lead to numerical problems: The online SA I on the zero inflated Poisson provides an example of the sort of issues that can be encountered. The ordered categorical and Tweedie models are slightly more complicated and details are therefore provided in the online SA J and K (including further examples of the need to avoid cancellation error).

4. Smoothing Parameter Uncertainty

Conventionally in a GAM context smoothing parameters have been treated as fixed when computing interval estimates for functions, or for other inferential tasks. In reality smoothing parameters must be estimated, and the uncertainty associated with this has generally been ignored except in fully Bayesian simulation approaches. Kass and Steffey (1989) proposed a simple first-order correction for this sort of uncertainty in the context of iid Gaussian random effects in a one way ANOVA type design. Some extra work is required to understand how their method works when applied to smooths. It turns out that the estimation methods described above provide the quantities required to correct for smoothing parameter uncertainty.

Assume we have several smooth model components, let ρi = log λi and $S_\lambda = \sum_j\lambda_jS_j$. Writing $\hat\beta_\rho$ for $\hat\beta$, to emphasize the dependence of $\hat\beta$ on the smoothing parameters, we use the Bayesian large sample approximation (see SB.4)
\[
\beta|y,\rho \sim N(\hat\beta_\rho, V_\beta), \quad\text{where } V_\beta = (\hat{\mathcal I} + S_\lambda)^{-1}, \qquad (5)
\]
which is exact in the Gaussian case, along with the large sample approximation
\[
\rho|y \sim N(\hat\rho, V_\rho), \qquad (6)
\]
where Vρ is the inverse of the Hessian of the negative log marginal likelihood with respect to ρ. Since the approximation (6) applies in the interior of the parameter space, it is necessary to substitute a Moore–Penrose pseudoinverse of the Hessian if a smoothing parameter is effectively infinite, or otherwise to regularize the inversion (which is equivalent to placing a Gaussian prior on ρ). Conventionally (5) is used with $\hat\rho$ plugged in and the uncertainty in ρ neglected. To improve on this, note that if (5) and (6) are correct, while $z \sim N(0, I)$ and independently $\rho^* \sim N(\hat\rho, V_\rho)$, then $\beta|y \overset{d}{=} \hat\beta_{\rho^*} + R_{\rho^*}^{\mathsf T}z$, where $R_{\rho^*}^{\mathsf T}R_{\rho^*} = V_\beta$ (and Vβ depends on ρ*). This provides a way of simulating from β|y, but it is computationally expensive, as $\hat\beta_{\rho^*}$ and $R_{\rho^*}$ must be computed afresh for each sample. (The conventional approximation would simply set $\rho^* = \hat\rho$.) Alternatively consider a first-order Taylor expansion
\[
\beta|y \overset{d}{=} \hat\beta_{\hat\rho} + J(\rho - \hat\rho) + R_{\hat\rho}^{\mathsf T}z + \sum_k\left.\frac{\partial R_\rho^{\mathsf T}z}{\partial\rho_k}\right|_{\hat\rho}(\rho_k - \hat\rho_k) + r,
\]
where r is a lower order remainder term and $J = d\hat\beta/d\rho|_{\hat\rho}$. Dropping r, the expectation of the right-hand side is $\hat\beta_{\hat\rho}$. Denoting the elements of Rρ by Rij, tedious but routine calculation shows that the three remaining random terms are uncorrelated, with covariance matrix
\[
V_\beta' = V_\beta + V' + V'', \quad\text{where } V' = JV_\rho J^{\mathsf T} \ \text{ and } \ V''_{jm} = \sum_{i}^{p}\sum_{k}^{M}\sum_{l}^{M}\frac{\partial R_{ij}}{\partial\rho_k}V_{\rho,kl}\frac{\partial R_{im}}{\partial\rho_l}, \qquad (7)
\]
which is computable at $O(Mp^3)$ cost (see online SA D). Dropping V′′ we have the Kass and Steffey (1989) approximation
\[
\beta|y \sim N(\hat\beta_{\hat\rho}, V_\beta^*), \quad\text{where } V_\beta^* = V_\beta + JV_\rho J^{\mathsf T}.
\]
(A first-order Taylor expansion of $\hat\beta$ about ρ yields a similar correction for the frequentist covariance matrix of $\hat\beta$: $V^*_{\hat\beta} = (\hat{\mathcal I}+S_\lambda)^{-1}\hat{\mathcal I}(\hat{\mathcal I}+S_\lambda)^{-1} + JV_\rho J^{\mathsf T}$, where $\hat{\mathcal I}$ is the negative Hessian of the log-likelihood.)
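Once the derivatives dβ̂/dρ from Section 3 are available, the correction without V′′ is a one-line matrix computation; the R sketch below assumes Vβ, J and Vρ as inputs.

```r
## Sketch: Kass-Steffey type correction V_beta* = V_beta + J V_rho J', where J = d beta_hat / d rho
## (one column per log smoothing parameter) and V_rho is the inverse negative Hessian of the LAML.
corrected_Vb <- function(Vb, J, Vrho) {
  Vb + J %*% Vrho %*% t(J)
}

## Corresponding corrected effective degrees of freedom, tr(I_hat %*% Vb_corrected):
corrected_edf <- function(I_hat, Vb_corr) sum(diag(I_hat %*% Vb_corr))
```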

The online SA D shows that in a Demmler–Reinsch-like parameterization, for any penalized parameter βi with posterior standard deviation $\sigma_{\beta_i}$,
\[
\frac{d\hat\beta_i/d\rho_j}{d(R^{\mathsf T}z)_i/d\rho_j} \approx \frac{\hat\beta_i}{z_i\sigma_{\beta_i}}.
\]
So the $J(\rho - \hat\rho)$ correction is dominant for components that are strongly nonzero. This offers some justification for using the Kass and Steffey (1989) approximation, but not in a model selection context, where near zero model components are those of most interest: hence, in what follows we will use (7) without dropping V′′.

5. An Information Criterion for Smooth Model Selection

When viewing smoothing from a Bayesian perspective, the smooths have improper priors (or alternatively vague priors of convenience) corresponding to the null space of the smoothing penalties. This invalidates model selection via marginal likelihood comparison. An alternative is a frequentist AIC (Akaike 1973), based on the conditional likelihood of the model coefficients, rather than the marginal likelihood. In the exponential family GAM context, Hastie and Tibshirani (1990, §6.8.3) proposed a widely used version of this conditional AIC in which the effective degrees of freedom of the model, τ0, is used in place of the number of model parameters (in the general setting $\tau_0 = \mathrm{tr}\{V_\beta\hat{\mathcal I}\}$ is equivalent to the Hastie and Tibshirani (1990) proposal). But Greven and Kneib (2010) showed that this is overly likely to select complex models, especially when the model contains random effects: the difficulty arises because τ0 neglects the fact that the smoothing parameters have been estimated and are, therefore, uncertain (a marginal AIC based on the frequentist marginal likelihood, in which unpenalized effects are not integrated out, is equally problematic, partly because of underestimation of variance components and consequent bias toward simple models). A heuristic alternative is to use $\tau_1 = \mathrm{tr}(2\hat{\mathcal I}V_\beta - \hat{\mathcal I}V_\beta\hat{\mathcal I}V_\beta)$ as the effective degrees of freedom, motivated by considering the number of unpenalized parameters required to optimally approximate a bias corrected version of the model, but the resulting AIC is too conservative (see Section 6, e.g.). Greven and Kneib (2010) show how to exactly compute an effective modified AIC for the Gaussian additive model case, based on defining the effective degrees of freedom as $\sum_i\partial\hat y_i/\partial y_i$ (as proposed by Liang et al. 2008). Yu and Yau (2012) and Säfken et al. (2014) considered extensions to generalized linear mixed models. The novel contribution of this section is to use the results of the previous section to avoid the problematic neglect of smoothing parameter uncertainty in the conditional AIC computation, in a manner that is easily computed and applicable to the general model class considered in this article.

The derivation of AIC (see, e.g., Davison 2003, sec. 4.7) with the MLE replaced by the penalized MLE is identical up to the point at which the AIC score is represented as
\[
\mathrm{AIC} = -2l(\hat\beta) + 2E\{(\hat\beta - \beta_d)^{\mathsf T}\mathcal I_d(\hat\beta - \beta_d)\} \qquad (8)
\]
\[
\phantom{\mathrm{AIC}} = -2l(\hat\beta) + 2\,\mathrm{tr}\left[E\{(\hat\beta - \beta_d)(\hat\beta - \beta_d)^{\mathsf T}\}\mathcal I_d\right], \qquad (9)
\]
where βd is the coefficient vector minimizing the KL divergence and $\mathcal I_d$ is the corresponding expected negative Hessian of the log-likelihood. In an unpenalized setting $E\{(\hat\beta - \beta_d)(\hat\beta - \beta_d)^{\mathsf T}\}$ is estimated as the observed inverse information matrix $\hat{\mathcal I}^{-1}$, and $\tau' = \mathrm{tr}[E\{(\hat\beta - \beta_d)(\hat\beta - \beta_d)^{\mathsf T}\}\mathcal I_d]$ is estimated as $\mathrm{tr}(\hat{\mathcal I}^{-1}\hat{\mathcal I}) = k$. Penalization means that the expected inverse covariance matrix of $\hat\beta$ is no longer well approximated by $\hat{\mathcal I}$, and there are then two ways of proceeding.

The first is to view β as a frequentist random effect, with predicted values $\hat\beta$. In that case the covariance matrix for the predictions, $\hat\beta$, corresponds to the posterior covariance matrix obtained when taking the Bayesian view of the smoothing process, so we have the conventional estimate $\tau = \mathrm{tr}\{V_\beta\hat{\mathcal I}\}$ if we neglect smoothing parameter uncertainty, or $\tau = \mathrm{tr}(V_\beta'\hat{\mathcal I})$ accounting for it using (7).

The frequentist random effects formulation is not a completely natural way to view smooths, since we do not usually expect the smooth components of a model to be resampled from the prior with each replication of the data. However, in the smoothing context Vβ has the interpretation of being the frequentist covariance matrix for $\hat\beta$ plus an estimate of the prior expectation of the squared smoothing bias (matrix), which offers some justification for using the same τ estimate as in the strict random effects case. To see this consider the decomposition
\[
E\{(\hat\beta - \beta_d)(\hat\beta - \beta_d)^{\mathsf T}\} = E\{(\hat\beta - E\hat\beta)(\hat\beta - E\hat\beta)^{\mathsf T}\} + \Delta_\beta\Delta_\beta^{\mathsf T},
\]
where $\Delta_\beta$ is the smoothing bias in $\hat\beta$. The first term on the right-hand side, above, can be replaced by the standard frequentist estimate $V_{\hat\beta} = (\hat{\mathcal I}+S_\lambda)^{-1}\hat{\mathcal I}(\hat{\mathcal I}+S_\lambda)^{-1}$. Now expand the penalized log-likelihood around βd:
\[
l_p(\beta') \simeq l(\beta_d) + l_\beta^{\mathsf T}(\beta' - \beta_d) - \frac12(\beta' - \beta_d)^{\mathsf T}\mathcal I_d(\beta' - \beta_d) - \frac12\beta'^{\mathsf T}S_\lambda\beta'.
\]
Differentiating with respect to β′ and equating to zero we obtain the approximation
\[
\hat\beta \simeq (\mathcal I_d + S_\lambda)^{-1}(\mathcal I_d\beta_d + l_\beta).
\]
$E(dl/d\beta|_{\beta_d}) = 0$ by definition of βd, so taking expectations of both sides we have $E(\hat\beta) \simeq (\mathcal I_d + S_\lambda)^{-1}\mathcal I_d\beta_d$. Hence, estimating $\mathcal I_d$ by $\hat{\mathcal I}$, we have $\tilde\Delta_\beta = \{(\hat{\mathcal I}+S_\lambda)^{-1}\hat{\mathcal I} - I\}\beta_d$. Considering the expected value of $\tilde\Delta_\beta\tilde\Delta_\beta^{\mathsf T}$ according to the prior mean and variance assumptions of the model, we have the following.

Lemma 1.

Let the setup be as above and let $E_\pi$ denote expectation assuming the prior mean and covariance for β. Treating $\hat{\mathcal I}$ as fixed, then $V_{\hat\beta} + E_\pi(\tilde\Delta_\beta\tilde\Delta_\beta^{\mathsf T}) = V_\beta$.

For proof see online SA D. This offers some justification for again using $\tau = \mathrm{tr}\{V_\beta\hat{\mathcal I}\}$, or $\tau = \mathrm{tr}(V_\beta'\hat{\mathcal I})$ accounting for ρ uncertainty. So both the frequentist random effects perspective and the prior expected smoothing bias approach result in
\[
\mathrm{AIC} = -2l(\hat\beta) + 2\,\mathrm{tr}(\hat{\mathcal I}V_\beta'). \qquad (10)
\]
This is the conventional Hastie and Tibshirani (1990) conditional AIC with an additive correction $2\,\mathrm{tr}\{\hat{\mathcal I}(V' + V'')\}$, accounting for smoothing parameter uncertainty. The correction is readily computed for any model considered here, provided only that the derivatives of $\hat\beta$ and Vβ can be computed: the methods of Section 3 provide these. Section 6 provides an illustration of the efficacy of (10).
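Given the corrected covariance matrix of Section 4, (10) is immediate to compute; the following R sketch assumes the log-likelihood at β̂, the information matrix and the corrected covariance matrix are available from a fit.

```r
## Sketch of the corrected conditional AIC in (10): -2 l(beta_hat) + 2 tr(I_hat V_beta'),
## where V_beta' includes the smoothing parameter uncertainty correction of Section 4.
caic <- function(loglik_hat, I_hat, Vb_corr) {
  -2 * loglik_hat + 2 * sum(diag(I_hat %*% Vb_corr))
}
```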

6. Simulation Results

The improvement resulting from using the corrected AIC of Section 5 can be illustrated by simulation. Simulations were conducted for additive models with true expected values given by η = f0(x0) + f1(x1) + f2(x2) + f3(x3), where the fj are shown in the online SA E, and the x covariates are all independent U(0, 1) deviates. Two model comparisons were considered. In the first a 40 level Gaussian random effect was added to η, with the random effect standard deviation being varied from 0 (no effect) to 1. AIC was then used to select between models with or without the random effect included, but where all smooth terms were modeled using penalized regression splines. In the second case models with and without f0 were compared, with the true model being based on cf0 in place of f0, where the effect strength c was varied from 0 (no effect) to 1. Model selection was based on (i) conventional conditional generalized AIC using τ0 from Section 5, (ii) the corrected AIC of Section 5, (iii) a version of AIC in which the degrees of freedom penalty is based on τ1 from Section 5, (iv) AIC based on the marginal likelihood with the number of parameters given by the number of smoothing parameters and variance components plus the number of unpenalized coefficients in the model, and (v) the Greven and Kneib (2010) corrected AIC for the Gaussian response case. The marginal likelihood in case (iv) is a version in which unpenalized coefficients are not integrated out (to avoid the usual problems with fixed effect differences and REML, or improper priors and marginal likelihood).

Results are shown in the top row of Figure 3 for a sample size of 500, with Gaussian sampling error of standard deviation 2. For the random effect comparison, conventional conditional AIC is heavily biased toward the more complex model, selecting it on over 70% of occasions. The ML based AIC is too conservative for an AIC criterion, selecting the larger model in only 3.5% of cases when it is not correct, as against the roughly 16% one might expect from AIC comparison of models differing by one parameter. The known underestimation of variance components estimated by this sort of marginal likelihood is partly to blame. The AIC based on τ1 from Section 5 also lacks power, performing even less well than the ML based version. By contrast, the new corrected AIC performs well, and in this example is a slight improvement on Greven and Kneib (2010). For the smooth comparison the different calculations differ much less, although the alternatives are slightly less biased toward the more complex model than the conventional conditional generalized AIC, with the corrected Section 5 version showing the smallest bias. The lower row of Figure 3 shows equivalent power plots for the same Gaussian random effect and linear predictor η, but with Bernoulli, beta and Cox proportional hazards (partial) likelihoods (the first two using logit links).

Figure 3. Simulation based illustration of the problems with previous AIC type model selection criteria and the relatively good performance of the Section 5 version. In all panels: (i) the solid curves are for conventional conditional AIC, (ii) the dotted curves are for the Section 5 version, (iii) the middle length dashed curves are for AIC based on the heuristic upper bound degrees of freedom, (iv) the dashed dot curves are for the marginal likelihood based AIC and (v) the long dashed curves are for the Greven and Kneib (2010) corrected AIC (top row only). (a) Observed probability of selecting the larger model as the effect strength of the differing term is increased from zero, for a 40 level random effect and Gaussian likelihood. (b) whole model effective degrees of freedom used in the alternative conditional AIC scores for the left hand panel as effect size increases. (c) Same as (a), but where the term differing between the two models was a smooth curve. (d) As (a) but for a Bernoulli likelihood. (e) As (a) for a beta likelihood. (f) As (a) for a Cox proportional hazards partial likelihood.

The purpose of this article is to develop methods to allow the rich variety of smoothers illustrated in Figure 1 to be used in models beyond the exponential family, a task for which general methods were not previously available. However, for the special case of univariate P-splines (Eilers and Marx 1996; Marx and Eilers 1998) some comparison with existing methods is possible, in particular using R package gamlss (Rigby and Stasinopoulos 2005, 2014) and the BayesX package (Fahrmeir and Lang 2001; Fahrmeir, Kneib, and Lang 2004; Brezger and Lang 2006; Umlauf et al. 2015; Belitz et al. 2015, www.bayesx.org). For this special case both packages implement models using essentially the same penalized likelihoods used by the new method, but they optimize localized marginal likelihood scores within the penalized likelihood optimization algorithm to estimate the smoothing parameters.

The comparison was performed using data simulated from models with the linear predictor given above (but without any random effect terms). Comparison of the new method with GAMLSS was only possible for the negative binomial, beta, scaled t and simple zero inflated Poisson families, and with BayesX was only possible for the ordered categorical model (BayesX has a negative binomial family, but it is currently insufficiently stable for a sensible comparison to be made). Simulations with both uncorrelated and correlated covariates were considered. Three hundred replicates of sample size 400 were produced for each considered family at three levels of noise (see SA E for further details). Models were estimated using the correct link and additive structure, and using P-splines with basis dimensions of 10, 10, 15, and 8, which were chosen to avoid any possibility of forced oversmoothing, while keeping down computational time.

Model performance for the negative binomial (nb), beta, scaled t (scat), and zero inflated Poisson (zip) families was compared via the MSE, $n^{-1}\sum_{i=1}^n\{\hat\eta(x_i) - \eta_t(x_i)\}^2$, on the additive predictor scale. The Brier score, $n^{-1}\sum_{i=1}^n\sum_{j=1}^R(p_{ij} - \hat p_{ij})^2$, was used to measure the performance for the ordered categorical (ocat) family, where R is the number of categories, $p_{ij}$ are the true category probabilities and $\hat p_{ij}$ their estimated values. In addition, the computational performance (CPU time) of the alternative methods was recorded. Figure 4 summarizes the results. In general, the new method provides a small improvement in statistical performance, which is slightly larger when covariates are correlated. The correlated covariate setting is the one in which local approximate smoothness selection methods would be expected to perform less well, relative to "whole model" criteria. In terms of speed and reliability the new method is an improvement, especially for correlated covariates, which tend to lead to reduced numerical stability, leading the alternative methods to fail in up to 4% of cases.

Figure 4. Results of simulation comparison with gamlss (beta, nb, scat, zip) and BayesX (ocat) packages for one dimensional P-spline models. The two plots at lower right show comparisons of log 10 computing times for the case with the smallest time advantage for the new method — Beta regression. The remaining panels show boxplots of replicate by replicate difference in MSE/Brier’s score each standardized by the average MSE or Brier’s score for the particular simulation comparison. Each panel shows three box plots, one for each noise to signal level. Positive values indicate that the new method is doing better than the alternative. Boxplots are shaded grey when the difference is significant at the 5% level (all three for nb correlated should be gray). In all cases where the difference is significant at 5% the new method is better than the alternative, except for the zero inflated Poisson with uncorrelated data, where the alternative method is better at all noise levels.

7. Example: Predicting Prostate Cancer

This section and the next provide example applications of the new methods, while the online SA F provides further examples in survival analysis and animal distribution modeling. Figure 5 shows representative protein mass spectra from serum taken from patients with a healthy prostate, relatively benign prostate enlargement and prostate cancer (see Adam et al. 2002). To avoid the need for intrusive biopsy there is substantial interest in developing noninvasive screening tests to distinguish cancer, healthy and more benign conditions. One possible model is an ordered categorical signal regression in which the mean of a logistically distributed latent variable z is given by
\[
\mu_i = \alpha + \int f(D)\nu_i(D)\,dD,
\]
where f(D) is an unknown smooth function of mass D (in Daltons) and $\nu_i(D)$ is the ith spectrum. The probability of the patient lying in category 1, 2, or 3, corresponding to "healthy," "benign enlargement" and "cancer," is then given by the probability of $z_i$ lying in the range (−∞, −1], (−1, θ] or (θ, ∞), respectively (see online SA K).
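If the spectra are stored as the rows of a matrix, a model of this form can be written down directly in mgcv-style syntax using a linear functional smooth term and the ordered categorical family; the call below is a hypothetical sketch with invented object names (dat, spec, Dmat, status), not a transcript of the code actually used.

```r
## Hypothetical mgcv-style specification of the ordered categorical signal regression model.
## Assumes 'dat' holds: status (disease category coded 1-3), spec (n x d matrix of spectra nu_i(D))
## and Dmat (matching n x d matrix of masses D); these names are invented for illustration.
library(mgcv)
fit <- gam(status ~ s(Dmat, by = spec, k = 100),  ## linear functional term: integral of f(D) nu_i(D) dD
           family = ocat(R = 3), data = dat)
plot(fit)       ## estimated coefficient function f(D) with confidence interval
summary(fit)
```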

Figure 5. Three representative protein mass spectra (centered and normalized) from serum taken from patients with apparently healthy prostate, enlarged prostate, and prostate cancer. It would be useful to be able to predict disease status from the spectra. The red and blue spectra have been shifted upward by 5 and 10 units, respectively.

Given the methods developed in this article, estimation of this model is routine, as is the exploration of whether an adaptive smooth should be used for f, given the irregularity of the spectra. Figure 6 shows some results of model fitting. The estimated f(D) is based on a rank 100 thin plate regression spline, and has 29 effective degrees of freedom. An adaptive smooth gives almost identical results. The right panel shows a QQ-plot of ordered deviance residuals against simulated theoretical quantiles (Augustin, Sauleau, and Wood 2012). There is modest deviation in the lower tail. The middle panel shows boxplots of the probability of cancer according to the model for the three observed categories. Cancer and healthy are quite well separated, but cancer and benign enlargement less so. For cases with cancer, the model gave cancer a higher probability than normal prostate in 92% of cases, and a higher probability than either other category in 83% of cases. For healthy patients the model gave the normal category a higher probability than cancer in 85% of cases and the highest probability in 77% of cases. These results are somewhat worse than those reported by Adam et al. (2002) for a relatively complex machine learning method, which involved first preprocessing the spectra to identify peaks believed to be discriminating. On the other hand, the signal regression model here would allow the straightforward inclusion of further covariates, and does automatically supply uncertainty estimates.

Figure 6. Results from the ordered categorical prostate model fit. (a) The estimated coefficient function f(D) with 95% confidence interval. (b) Boxplots of the model probability of cancer, for the 3 observed states (1, healthy, 2, enlarged and 3, cancer). (c) QQ-plot of ordered deviance residuals against simulated theoretical quantiles, indicating some mismatch in the lower tail.

8. Multivariate Additive Modeling of Fuel Efficiency

Figure 7 shows part of a dataset on the fuel efficiency of 207 U.S. car models, along with their characteristics (Bache and Lichman 2013). Two efficiency measures were taken: miles per gallon (MPG) in city driving, and the same for highway driving. One possible model might be a bivariate additive model, as detailed in the online SA H, where the two MPG measurements are modeled as bivariate Gaussian, with means given by separate linear predictors for the two components. A priori, it might be expected that city efficiency would be highly influenced by weight and highway efficiency by air resistance and, hence, by frontal area or some other combination of the height and width of the car.

Figure 7. Part of a dataset from the USA on fuel efficiency of cars.

The linear predictors for the two components were based on the additive fixed effects of factors "fuel type" (petrol or diesel), "style" of car (hatchback, sedan, etc.) and "drive" (all-, front- or rear-wheel). In addition, i.i.d. Gaussian random effects of the 22 car manufacturers were included, as well as smooth additive effects of car weight and horsepower. Additive and tensor product smooths of height and width were tried, as was a smooth of the product of height and width, but there was no evidence to justify their inclusion: term selection penalties (Marra and Wood 2011) remove them, p-values indicate that they are not significant, and AIC suggests that they are better dropped.
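A sketch of how such a bivariate model might be specified using mgcv's mvn family is given below; the variable names (city, hwy, weight, hp, fuel, style, drive, maker) are hypothetical stand-ins for the corresponding columns of the cars data.

    library(mgcv)
    ## Bivariate Gaussian additive model: one formula per linear predictor.
    ## s(maker, bs = "re") gives i.i.d. Gaussian manufacturer random effects,
    ## and select = TRUE adds the term selection penalties of Marra and Wood (2011).
    fit <- gam(list(city ~ fuel + style + drive + s(weight) + s(hp) + s(maker, bs = "re"),
                    hwy  ~ fuel + style + drive + s(weight) + s(hp) + s(maker, bs = "re")),
               family = mvn(d = 2), data = cars, select = TRUE, method = "REML")
    summary(fit)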

The possibility of smooth interactions between weight and horsepower was also considered, using smooth main effects plus smooth interaction formulations of the form f1(h) + f2(w) + f3(h, w). The smooth interaction term f3 can readily be constructed in a way that excludes the main effects of w and h, by constructing its basis using the usual tensor product construction (e.g., Wood 2006), but based on marginal bases into which the constraints ∑if1(hi) = 0 and ∑if2(wi) = 0 have already been absorbed by linear reparameterization. The marginal smoothing penalties and, hence, the induced tensor product smoothing penalties are unaffected by the marginal constraint absorption. This construction is the obvious generalization of the construction of parametric interactions in linear models, and is simpler than the various schemes proposed in the literature.
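In mgcv this decomposition is available directly via ti() terms, whose marginal bases have the sum-to-zero constraints absorbed before the tensor product is formed. A minimal formula fragment for one linear predictor, using the same hypothetical variable names as above, would be

    ## Main effects plus a pure interaction: ti(weight, hp) contains no weight or hp
    ## main effect, and its penalties are induced from the unchanged marginal penalties.
    city ~ fuel + style + drive + s(weight) + s(hp) + ti(weight, hp) + s(maker, bs = "re")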

The interactions again appear to add nothing useful to the model fit, and we end up with a model in which the important smooth effects are horsepower (hp) and weight, while the important fixed effects are fuel type and drive, with diesel giving lower fuel consumption than petrol and all-wheel drive giving higher consumption than the two-wheel drives. These effects were important for both city and highway driving, whereas the random effect of manufacturer was only important for the city. Figure 8 shows the smooth and random effects for the city and highway linear predictors. Notice the surprising similarity between the effects, although the city smooth effects are generally slightly less pronounced than those for the highway. The overall r² for the model is 85%, with the city and highway error standard deviations estimated as 1.9 and 2.3 MPG, respectively. The estimated correlation coefficient is 0.88.

Figure 8. Fitted smooth and random effects for final car fuel efficiency model. Panels (a)–(c) relate to the city fuel consumption, while (d)–(f) are for the highway. (c) and (f) are normal QQ-plots of the predicted random effects for manufacturer, which in the case of highway MPG are effectively zero.

9. Discussion

This article has outlined a practical framework for smooth regression modeling with reduced rank smoothers, for likelihoods beyond the exponential family. The methods build seamlessly on the existing framework for generalized additive modeling, so that practical application of any of the models implemented as part of this work is immediately accessible to anyone familiar with GAMs via penalized regression splines. The key novel components contributed here are (i) general, reliable, and efficient smoothing parameter estimation methods based on maximized Laplace approximate marginal likelihood, (ii) a corrected AIC and distributional results incorporating smoothing parameter uncertainty, to aid model selection and further inference, and (iii) demonstration of the framework's practical utility by provision of the details for some practically important models. The proposed methods should be widely applicable in situations in which the effects really are smooth, and they scale well with the number of smooth model terms. When some component functions are high rank random fields, the INLA approach of Rue, Martino, and Chopin (2009) will be much more efficient; however, there is a trade-off between efficiency and stability in that case, since pivoting, which our method uses to preserve stability, has instead to be employed to preserve sparsity in the INLA method (see online SA K).

The methods are implemented in R package mgcv from version 1.8 (see online SA M).

Acknowledgment

We thank the anonymous referees for a large number of very helpful comments that substantially improved the paper and Phil Reiss for spotting an embarrassing error in Supplementary Appendix A.

Supplementary Materials

The online supplementary materials contain additional appendices for the article.

Funding

SNW and NP were funded by EPSRC grant EP/K005251/1, and NP was also funded by EPSRC grant EP/I000917/1. BS was funded by the German Research Foundation (DFG) Research Training Group "Scaling Problems in Statistics" (RTG 1644). SNW is grateful to Carsten Dormann and his research group at the University of Freiburg, where the extended GAM part of this work was carried out.

    References

  • Adam, B.-L., Qu, Y., Davis, J. W., Ward, M. D., Clements, M. A., Cazares, L. H., Semmes, O. J., Schellhammer, P. F., Yasui, Y., Feng, Z., et al. (2002), "Serum Protein Fingerprinting Coupled With a Pattern-Matching Algorithm Distinguishes Prostate Cancer From Benign Prostate Hyperplasia and Healthy Men," Cancer Research, 62, 3609–3614.
  • Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," in International Symposium on Information Theory, eds. B. Petran and F. Csaaki, Budapest: Akadeemiai Kiadi, pp. 267–281.
  • Anderssen, R., and Bloomfield, P. (1974), "A Time Series Approach to Numerical Differentiation," Technometrics, 16, 69–75.
  • Augustin, N. H., Sauleau, E.-A., and Wood, S. N. (2012), "On Quantile Quantile Plots for Generalized Linear Models," Computational Statistics & Data Analysis, 56, 2404–2409.
  • Bache, K., and Lichman, M. (2013), "UCI Machine Learning Repository," available at http://archive.ics.uci.edu/ml/.
  • Belitz, C., Brezger, A., Kneib, T., Lang, S., and Umlauf, N. (2015), "BayesX: Software for Bayesian Inference in Structured Additive Regression Models," available at http://www.statistik.lmu.de/~bayesx/bayesx.html.
  • Breslow, N. E., and Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association, 88, 9–25.
  • Brezger, A., and Lang, S. (2006), "Generalized Structured Additive Regression Based on Bayesian P-Splines," Computational Statistics & Data Analysis, 50, 967–991.
  • Cline, A. K., Moler, C. B., Stewart, G. W., and Wilkinson, J. H. (1979), "An Estimate for the Condition Number of a Matrix," SIAM Journal on Numerical Analysis, 16, 368–375.
  • Cox, D. (1972), "Regression Models and Life Tables," Journal of the Royal Statistical Society, Series B, 34, 187–220.
  • Davison, A. C. (2003), Statistical Models, Cambridge, UK: Cambridge University Press.
  • Eilers, P. H. C., and Marx, B. D. (1996), "Flexible Smoothing With B-Splines and Penalties," Statistical Science, 11, 89–121.
  • Fahrmeir, L., Kneib, T., and Lang, S. (2004), "Penalized Structured Additive Regression for Space-Time Data: A Bayesian Perspective," Statistica Sinica, 14, 731–761.
  • Fahrmeir, L., Kneib, T., Lang, S., and Marx, B. (2013), Regression Models, New York: Springer.
  • Fahrmeir, L., and Lang, S. (2001), "Bayesian Inference for Generalized Additive Mixed Models Based on Markov Random Field Priors," Applied Statistics, 50, 201–220.
  • Greven, S., and Kneib, T. (2010), "On the Behaviour of Marginal and Conditional AIC in Linear Mixed Models," Biometrika, 97, 773–789.
  • Hastie, T., and Tibshirani, R. (1986), "Generalized Additive Models" (with discussion), Statistical Science, 1, 297–318.
  • Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.
  • Kass, R. E., and Steffey, D. (1989), "Approximate Bayesian Inference in Conditionally Independent Hierarchical Models (Parametric Empirical Bayes Models)," Journal of the American Statistical Association, 84, 717–726.
  • Klein, N., Kneib, T., Klasen, S., and Lang, S. (2014), "Bayesian Structured Additive Distributional Regression for Multivariate Responses," Journal of the Royal Statistical Society, Series C, 64, 569–591.
  • Klein, N., Kneib, T., Lang, S., and Sohn, A. (2015), "Bayesian Structured Additive Distributional Regression With an Application to Regional Income Inequality in Germany," Annals of Applied Statistics, 9, 1024–1052.
  • Laird, N. M., and Ware, J. H. (1982), "Random-Effects Models for Longitudinal Data," Biometrics, 38, 963–974.
  • Liang, H., Wu, H., and Zou, G. (2008), "A Note on Conditional AIC for Linear Mixed-Effects Models," Biometrika, 95, 773–778.
  • Marra, G., and Wood, S. N. (2011), "Practical Variable Selection for Generalized Additive Models," Computational Statistics & Data Analysis, 55, 2372–2387.
  • Marra, G., and Wood, S. N. (2012), "Coverage Properties of Confidence Intervals for Generalized Additive Model Components," Scandinavian Journal of Statistics, 39, 53–74.
  • Marx, B. D., and Eilers, P. H. (1998), "Direct Generalized Additive Modeling With Penalized Likelihood," Computational Statistics & Data Analysis, 28, 193–209.
  • Nocedal, J., and Wright, S. (2006), Numerical Optimization (2nd ed.), New York: Springer-Verlag.
  • Nychka, D. (1988), "Bayesian Confidence Intervals for Smoothing Splines," Journal of the American Statistical Association, 83, 1134–1143.
  • Reiss, P. T., and Ogden, T. R. (2009), "Smoothing Parameter Selection for a Class of Semiparametric Linear Models," Journal of the Royal Statistical Society, Series B, 71, 505–523.
  • Rigby, R., and Stasinopoulos, D. M. (2005), "Generalized Additive Models for Location, Scale and Shape," Journal of the Royal Statistical Society, Series C, 54, 507–554.
  • Rigby, R. A., and Stasinopoulos, D. M. (2014), "Automatic Smoothing Parameter Selection in GAMLSS With an Application to Centile Estimation," Statistical Methods in Medical Research, 23, 318–332.
  • Rue, H., Martino, S., and Chopin, N. (2009), "Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations," Journal of the Royal Statistical Society, Series B, 71, 319–392.
  • Ruppert, D., Wand, M. P., and Carroll, R. J. (2003), Semiparametric Regression, New York: Cambridge University Press.
  • Säfken, B., Kneib, T., van Waveren, C.-S., and Greven, S. (2014), "A Unifying Approach to the Estimation of the Conditional Akaike Information in Generalized Linear Mixed Models," Electronic Journal of Statistics, 8, 201–225.
  • Shun, Z., and McCullagh, P. (1995), "Laplace Approximation of High Dimensional Integrals," Journal of the Royal Statistical Society, Series B, 57, 749–760.
  • Silverman, B. W. (1985), "Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting," Journal of the Royal Statistical Society, Series B, 47, 1–53.
  • Umlauf, N., Adler, D., Kneib, T., Lang, S., and Zeileis, A. (2015), "Structured Additive Regression Models: An R Interface to BayesX," Journal of Statistical Software, 63, 1–46.
  • Wahba, G. (1983), "Bayesian Confidence Intervals for the Cross Validated Smoothing Spline," Journal of the Royal Statistical Society, Series B, 45, 133–150.
  • Wahba, G. (1985), "A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem," The Annals of Statistics, 13, 1378–1402.
  • Wood, S. N. (2000), "Modelling and Smoothing Parameter Estimation With Multiple Quadratic Penalties," Journal of the Royal Statistical Society, Series B, 62, 413–428.
  • Wood, S. N. (2006), "Low-Rank Scale-Invariant Tensor Product Smooths for Generalized Additive Mixed Models," Biometrics, 62, 1025–1036.
  • Wood, S. N. (2011), "Fast Stable Restricted Maximum Likelihood and Marginal Likelihood Estimation of Semiparametric Generalized Linear Models," Journal of the Royal Statistical Society, Series B, 73, 3–36.
  • Yee, T. W., and Wild, C. (1996), "Vector Generalized Additive Models," Journal of the Royal Statistical Society, Series B, 58, 481–493.
  • Yu, D., and Yau, K. K. (2012), "Conditional Akaike Information Criterion for Generalized Linear Mixed Models," Computational Statistics & Data Analysis, 56, 629–644.

Appendix A: Implicit Differentiation in the Extended GAM Case

Let $D^{\hat\beta_i\hat\beta_j}$ denote elements of the inverse of the Hessian matrix $(X^{\mathsf T}WX + S_\lambda)$, whose elements are $D_{\hat\beta_i\hat\beta_j}$, and note that $\hat\beta$ is the solution of $D_{\hat\beta_i} = 0$. Taking the total derivative of both sides of this with respect to $\theta_j$ we have
$$
D_{\hat\beta_i\hat\beta_k}\frac{d\hat\beta_k}{d\theta_j} + D_{\hat\beta_i\theta_j} = 0,
\quad\mbox{implying that}\quad
\frac{d\hat\beta_k}{d\theta_j} = -D^{\hat\beta_k\hat\beta_i}D_{\hat\beta_i\theta_j}.
$$
Differentiating once more yields
$$
\frac{d^2\hat\beta_i}{d\theta_j d\theta_k} = -D^{\hat\beta_i\hat\beta_l}\left(
D_{\hat\beta_l\hat\beta_p\hat\beta_q}\frac{d\hat\beta_q}{d\theta_j}\frac{d\hat\beta_p}{d\theta_k}
+ D_{\hat\beta_l\hat\beta_p\theta_j}\frac{d\hat\beta_p}{d\theta_k}
+ D_{\hat\beta_l\hat\beta_p\theta_k}\frac{d\hat\beta_p}{d\theta_j}
+ D_{\hat\beta_l\theta_j\theta_k}\right).
$$
The required partial derivatives are obtained from those generically available for the distribution and link used, and by differentiation of the penalty. Generically we can obtain derivatives of $D_i$ with respect to $\mu_i$ and $\theta$.
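As a numerical illustration of the first identity, the derivative of $\hat\beta$ with respect to a single log smoothing parameter can be obtained by one linear solve against the penalized Hessian. This is a sketch only: the quantities below are arbitrary stand-ins rather than anything from the examples above, the factor-of-two convention matches the cross partials given in the next paragraph, and base R is used throughout.

    ## Sketch: d beta_hat / d theta_j = -H^{-1} D_{beta theta_j},
    ## with H = X'WX + S_lambda the penalized Hessian.
    set.seed(1)
    n <- 50; p <- 6
    X <- matrix(rnorm(n * p), n, p)
    W <- diag(runif(n, 0.5, 2))                   # working weight matrix (stand-in)
    S_j <- crossprod(diff(diag(p)))               # a first-difference penalty matrix
    theta_j <- log(0.3)                           # log smoothing parameter
    beta_hat <- rnorm(p)                          # stand-in for the penalized MLE
    H <- t(X) %*% W %*% X + 2 * exp(theta_j) * S_j    # S_lambda term, same factor convention
    cross <- 2 * exp(theta_j) * S_j %*% beta_hat      # D_{beta_i theta_j} for theta_j in Lambda
    dbeta_dtheta_j <- -solve(H, cross)                # derivative of beta_hat w.r.t. theta_j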

The preceding expressions hold whether $\theta_j$ is a parameter of the likelihood or a log smoothing parameter. Suppose $\Lambda$ denotes the set of log smoothing parameters; then
$$
D_{\beta_i\theta_j} = \begin{cases}
2\exp(\theta_j)S^j_{ik}\beta_k & \theta_j \in \Lambda\\
D_{\beta_i\theta_j} & \mbox{otherwise,}
\end{cases}
$$
where $S^j$ here denotes the penalty matrix associated with $\theta_j$. Similarly
$$
D_{\beta_l\beta_p\theta_j} = \begin{cases}
2\exp(\theta_j)S^j_{lp} & \theta_j \in \Lambda\\
D_{\beta_l\beta_p\theta_j} & \mbox{otherwise,}
\end{cases}
$$
while
$$
D_{\beta_l\theta_j\theta_k} = \begin{cases}
2\exp(\theta_j)S^j_{lm}\beta_m & j = k;\ \theta_j,\theta_k \in \Lambda\\
D_{\beta_l\theta_j\theta_k} & \theta_j,\theta_k \notin \Lambda\\
0 & \mbox{otherwise.}
\end{cases}
$$
Derivatives with respect to $\eta$ are obtained by standard transformations
$$
D_{\eta_i} = D_{\mu_i}/h_i', \qquad (\mathrm{A.1})
$$
where $h_i' = h'(\mu_i)$ and more primes indicate higher derivatives. Furthermore,
$$
D_{\eta_i\eta_i} = D_{\mu_i\mu_i}/h_i'^2 - D_{\mu_i}h_i''/h_i'^3, \qquad (\mathrm{A.2})
$$
where the expectation of the second term on the right-hand side is zero at the true parameter values. Also
$$
D_{\eta_i\eta_i\eta_i} = D_{\mu_i\mu_i\mu_i}/h_i'^3 - 3D_{\mu_i\mu_i}h_i''/h_i'^4
+ D_{\mu_i}\left(3h_i''^2/h_i'^5 - h_i'''/h_i'^4\right), \qquad (\mathrm{A.3})
$$
and
$$
D_{\eta_i\eta_i\eta_i\eta_i} = D_{\mu_i\mu_i\mu_i\mu_i}/h_i'^4 - 6D_{\mu_i\mu_i\mu_i}h_i''/h_i'^5
+ D_{\mu_i\mu_i}\left(15h_i''^2/h_i'^6 - 4h_i'''/h_i'^5\right)
- D_{\mu_i}\left(15h_i''^3/h_i'^7 - 10h_i''h_i'''/h_i'^6 + h_i''''/h_i'^5\right). \qquad (\mathrm{A.4})
$$
Mixed partial derivatives with respect to $\eta/\mu$ and $\theta$ transform in the same way, the formula to use depending on the number of $\eta$ subscripts. The rules relating the derivatives with respect to $\eta$ to those with respect to $\beta$ are much easier:
$$
D_{\beta_i} = D_{\eta_k}X_{ki},\qquad
D_{\beta_i\beta_j} = D_{\eta_k\eta_k}X_{ki}X_{kj},\qquad
D_{\beta_i\beta_j\beta_k} = D_{\eta_l\eta_l\eta_l}X_{li}X_{lj}X_{lk}.
$$
Again, mixed partials follow the rule appropriate for the number of $\beta$ subscripts present. It is usually more efficient to compute using these definitions, rather than forming the arrays explicitly.
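The transformations (A.1) and (A.2) are simple enough to code directly; the following hypothetical helper (not part of any package) illustrates them for the first and second derivatives, with h1 and h2 denoting $h'(\mu)$ and $h''(\mu)$ evaluated elementwise.

    ## Convert derivatives of D w.r.t. mu into derivatives w.r.t. eta,
    ## following (A.1) and (A.2). All arguments are vectors over observations.
    D_eta_from_mu <- function(D_mu, D_mumu, h1, h2) {
      list(D_eta    = D_mu / h1,                          # (A.1)
           D_etaeta = D_mumu / h1^2 - D_mu * h2 / h1^3)   # (A.2)
    }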

The ingredients so far are sufficient to compute $\hat\beta$ and its derivatives with respect to $\theta$. We now need to consider the derivatives of $\mathcal{V}$ with respect to $\theta$. Considering $D$ first, the components relating to the penalties are straightforward. The deviance components are then
$$
\frac{dD}{d\theta_i} = D_{\hat\eta_j}\frac{d\hat\eta_j}{d\theta_i} + D_{\theta_i}
$$
and
$$
\frac{d^2 D}{d\theta_i d\theta_j} =
D_{\hat\eta_k\hat\eta_k}\frac{d\hat\eta_k}{d\theta_i}\frac{d\hat\eta_k}{d\theta_j}
+ D_{\hat\eta_k}\frac{d^2\hat\eta_k}{d\theta_i d\theta_j}
+ D_{\hat\eta_k\theta_j}\frac{d\hat\eta_k}{d\theta_i}
+ D_{\hat\eta_k\theta_i}\frac{d\hat\eta_k}{d\theta_j}
+ D_{\theta_i\theta_j},
$$
where the derivatives of $\hat\eta$ are simply $X$ multiplied by the derivatives of $\hat\beta$. The partial derivatives of $\tilde l$ are distribution specific. The derivatives of the determinant terms are obtainable using Wood (2011), once derivatives of $w_i$ with respect to $\theta$ have been obtained. These are
$$
\frac{dw_i}{d\theta_j} = \frac{1}{2}D_{\hat\eta_i\hat\eta_i\hat\eta_i}\frac{d\hat\eta_i}{d\theta_j}
+ \frac{1}{2}D_{\hat\eta_i\hat\eta_i\theta_j},
$$
$$
\frac{d^2 w_i}{d\theta_j d\theta_k} =
\frac{1}{2}D_{\hat\eta_i\hat\eta_i\hat\eta_i\hat\eta_i}\frac{d\hat\eta_i}{d\theta_j}\frac{d\hat\eta_i}{d\theta_k}
+ \frac{1}{2}D_{\hat\eta_i\hat\eta_i\hat\eta_i}\frac{d^2\hat\eta_i}{d\theta_j d\theta_k}
+ \frac{1}{2}D_{\hat\eta_i\hat\eta_i\hat\eta_i\theta_k}\frac{d\hat\eta_i}{d\theta_j}
+ \frac{1}{2}D_{\hat\eta_i\hat\eta_i\hat\eta_i\theta_j}\frac{d\hat\eta_i}{d\theta_k}
+ \frac{1}{2}D_{\hat\eta_i\hat\eta_i\theta_j\theta_k}.
$$