BAMLSS: Bayesian Additive Models for Location, Scale, and Shape (and Beyond)

ABSTRACT Bayesian analysis provides a convenient setting for the estimation of complex generalized additive regression models (GAMs). Since computational power has tremendously increased in the past decade, it is now possible to tackle complicated inferential problems, for example, with Markov chain Monte Carlo simulation, on virtually any modern computer. This is one of the reasons why Bayesian methods have become increasingly popular, leading to a number of highly specialized and optimized estimation engines and with attention shifting from conditional mean models to probabilistic distributional models capturing location, scale, shape (and other aspects) of the response distribution. To embed many different approaches suggested in literature and software, a unified modeling architecture for distributional GAMs is established that exploits distributions, estimation techniques (posterior mode or posterior mean), and model terms (fixed, random, smooth, spatial,…). It is shown that within this framework implementing algorithms for complex regression problems, as well as the integration of already existing software, is relatively straightforward. The usefulness is emphasized with two complex and computationally demanding application case studies: a large daily precipitation climatology, as well as a Cox model for continuous time with space-time interactions. Supplementary material for this article is available online.


Introduction
The generalized additive model for location, scale, and shape (GAMLSS, Rigby and Stasinopoulos 2005) relaxes the distributional assumptions of a response variable to allow for modeling the mean (location) as well as higher moments (scale and shape) in terms of covariates. This is especially useful in cases where, for example, the response does not follow the exponential family or particular interest lies on scale and shape parameters. Moreover, covariate effects can have flexible forms such as, for example, linear, nonlinear, spatial, or random effects. Hence, each parameter of the distribution is linked to an additive predictor in similar fashion as for the well-established generalized additive model (GAM, Hastie and Tibshirani 1990).
The terms of an additive predictor are most commonly represented by basis function approaches. This leads to a very generic model structure and can be further exploited because each term can be transformed into a mixed model representation (Ruppert, Wand, and Carrol 2003). In a fully Bayesian setting, this generality remains because priors on parameters can also be formalized in a general way, for example, by assigning normal priors to the regression coefficients of smooth terms (Fahrmeir, Kneib, and Lang 2004;Fahrmeir et al. 2013).
The fully Bayesian approach using Markov chain Monte Carlo (MCMC) simulation techniques is particularly attractive since the inferential framework provides valid credible intervals for estimators in situations where confidence intervals for corresponding maximum likelihood estimators based on asymptotic combination of single "bricks. " Due to many parallels to the GAMLSS class, the conceptional framework is called BAMLSS (Bayesian additive models for location, scale, and shape). However, it also encompasses many more general model terms beyond linear combinations in a design matrix with regression coefficients. The toolbox can be exploited in three ways: First, to quickly develop new models and algorithms. Second, to compare existing algorithms and samplers. Third, to easily integrate existing implementations. A proof of concept is given in the corresponding R package bamlss .
The remainder of the article is structured as follows. In Section 2, the models supported by this framework are briefly introduced. Section 3 presents the conceptional algorithm used to estimate numerous (complex) models along with the corresponding "Lego bricks" in Section 4. Section 5 describes computational strategies for implementation before Section 6 illustrates the concept using two complex and computationally demanding illustrations: a large climatology model for daily precipitation observations using censored heteroscedastic regression and a Cox model for continuous time with space-time interactions.

Model Structure
Based on data for i = 1, . . . , n observations, the models discussed in this article assume conditional independence of individual response observations given covariates. As in the classes of GAMLSS (Rigby and Stasinopoulos 2005) or distributional regression models (Klein et al. 2015b) all parameters of the response distribution can be modeled by explanatory variables such that y ∼ D (h 1 (θ 1 ) = η 1 , h 2 (θ 2 ) = η 2 , . . . , h K (θ K ) = η K ) , where D denotes a parametric distribution for the response variable y with K parameters θ k , k = 1, . . . , K, that are linked to additive predictors using known monotonic and twice differentiable functions h k (·). Note that the response may also be a qdimensional vector y = (y 1 , . . . , y q ) , for example, when D is a multivariate distribution (see, e.g., Klein et al. 2015a). The additive predictor for the kth parameter is given by based on j = 1, . . . , J k unspecified (possibly nonlinear) functions f jk (·), applied to each row of the generic data matrix X, encompassing all available covariate information. The corresponding parameters β k = (β 1k , . . . , β J k k ) are typically regression coefficients pertaining to model matrices X k = (X 1k , . . . , X J k k ) , whose structure only depend on the type of covariate(s) and prior assumptions about f jk (·). For notational convenience, the vector of function evaluations (across all observations i = 1, . . . , n) is also denoted by f jk = f jk (X jk ; β jk ) = ( f jk (x 1 ; β jk ), . . . , f jk (x n ; β jk )) , where X jk (n × m jk ) is the design matrix of the jth term for the kth parameter. In the case where it is derived through a basis function approach, it can be written as f jk = X jk β jk . While functions f jk (·) are usually based on a basis function approach, where η k then is a typical GAM-type or so-called structured additive predictor (STAR, Fahrmeir, Kneib, and Lang 2004), in this article we relax this assumption and let f jk (·) be an unspecified composition of covariate data and regression coefficients. A simple example for an f jk (·) that is nonlinear in the parameters β jk would be a Gompertz growth curve f jk = β 1 · exp(− exp(β 2 + X jk β 3 )).
Note that using basis functions the individual model components X jk β jk can be further decomposed into a mixed model representation given by f jk =X jkγ jk + U jkβ jk , whereγ jk represents the fixed effects parameters andβ jk ∼ N (0, τ 2 jk I) iid random effects. The matrix U jk is derived from a spectral decomposition of the penalty matrix K jk andX jk by finding a basis of the null space of K jk such thatX jk K jk = 0, that is, parametersγ jk are not penalized (see, e.g., Ruppert, Wand, and Carrol 2003;Wood 2004;Fahrmeir et al. 2013). Such transformations can be used to estimate functions f jk (·) using standard algorithms for random effects (see, e.g., Wood 2016).

Response and Posterior Distribution
The main building block of regression model algorithms is the probability density function d y (y|θ 1 , . . . , θ K ), or for computational reasons its logarithm. Estimation typically requires to evaluate the log-likelihood function a number of times, where the vector β = (β 1 , . . . , β K ) comprises all regression coefficients/parameters that should be estimated, X = (X 1 , . . . , X K ) are the respective covariate matrices whose ith row is denoted by x i and θ k are distribution parameter vectors of length n. Assigning prior distributions p jk (·) to the individual components yields the log-posterior log π (β, τ; y, X, α) ∝ (β; y, X) where is the vector of all assigned hyperparameters used within prior functions p jk (·) and similarly α is the set of all fixed prior specifications. More precisely, the rather general prior for the jth model term of the kth parameter is given by with prior densities (or combinations of densities) d β jk (·) and d τ jk (·) that depend on the type of covariate and prior assumptions about f jk (·). In this framework, τ jk are typically variances, for example, that account for the degree of smoothness of f jk (·) or the amount of correlation between observations. For example, using a spline representation of f jk (·) in combination with a normal prior for d β jk (·), the variances can be interpreted as the inverse smoothing parameters in a penalized regression context, that is, from a frequentist perspective (3) can be viewed as a penalized log-likelihood. In addition, the fixed prior specifications α jk = {α β jk , α τ jk } can further control the shape of d β jk (·) and d τ jk (·), incorporate prior beliefs about β jk , or for GAM-type models α jk usually holds the so-called penalty matrices, among others.

Model Fitting
Bayesian point estimates of β and τ are typically obtained by either one of: E1. Maximization of the log-posterior for posterior mode estimation. E2. Solving high-dimensional integrals, for example, for posterior mean or median estimation. For the possibly complex models within the BAMLSS framework, E1 and E2 are commonly solved by computer-intensive iterative algorithms, since analytical solutions are available only in a few special cases. In either case, the algorithms perform an updating of type where function U (·) is an updating function, for example, for generating one Newton-Raphson step in E1 or getting the next step in an MCMC simulation in E2, among others. The updating scheme can be partitioned into separate updating equations using leapfrog or zigzag iteration (see, e.g., Smyth 1996). Now let be a partitioned updating scheme with updating functions U k (·), that is, in each iteration updates for the kth parameter are computed while holding all other parameters fixed. Furthermore, this strategy can be applied for all terms within a parameter using updating function U jk (·) for an individual model term The partitioned updating system allows for having different functions U jk (·) for different model terms, for example, in problem E1 some updating functions could be based on iteratively weighted least squares (IWLS, Gamerman 1997) and some on ordinary Newton-Raphson steps (see, e.g., Section 6.2). In problem E2 using MCMC it is common to mix between several sampling methods depending on the type of model term or distribution parameter.
Using highly modular systems like (6) and (7), it is possible to develop a generic estimation algorithm for numerous possibly very complex models, as outlined in Algorithm A1. The algorithm initializes all model parameters and predictors. Then an outer iteration loops over all distributional parameters performing an inner iteration updating all model terms of the respective parameter, that is, the algorithm uses backfitting type updating schemes. In practice, for full Bayesian inference the algorithm is applied twice, that is, first computing estimates for E1 and then using these as starting values for solving E2.
Finding good starting values is especially important for complex model terms, for example, for multi-dimensional f jk (·) with multiple smoothing variances. Therefore, we propose to estimate τ jk using a goodness-of-fit criterion within the stepwise selection from Algorithm A2a, similar to Belitz and Lang (2008). In each updating step in A1 each τ jk = (τ 1 jk , . . . , τ L jk jk ) is optimized one after the other using adaptive search intervals. Hence, the optimization problem is reduced to a onedimensional search that is relatively fast and straightforward. The algorithm does not guarantee a global minimum, however, the solution is at least "close" and serves as the starting point for full MCMC. Optimization speed can be further increased if for a given search interval only a grid of possible values for each τ l jk is used.
The MCMC updating usually either accepts or rejects samples of the parameters and smoothing variances are sampled after β jk . In Algorithm A2b the general scheme is shown. Note again the general structure of sampling Algorithm A2b, that is, the proposal functions q jk (·) generate parameter samples β (t+1) jk , τ (t+1) jk using (possibly) different sampling schemes like derivative-based Metropolis-Hastings and slice sampling, see Section 4.2
new link function, B2 and B4. Then, the new building blocks are straightforward to combine with other previously available building blocks, moreover, most parts that are used for solving E1 can also be used for E2.
The remainder of this section is as follows. Details about commonly used prior densities in GAM-type models (building block B3) are provided in the next section. In Section 4.2, we derive the general parts that are needed for updating functions in Algorithm A2a and A2b, that is, building blocks B6a, B6b, B7a, and B7b. In Section 4.3 and 4.4, we briefly discuss model choice, Bayesian inference, and prediction.

Model Terms and Priors
In the following, we summarize commonly used specifications for priors p jk (·) used for estimating GAM-type models (building block B3). In addition, Table 1 of the supplemental material provides a more detailed overview of model terms and prior structures.

... Linear Effects
For simple linear effects f jk (·), a common choice for p jk (·) is to use a noninformative (constant) flat prior. One of the simplest informative priors is a normal prior given by where τ jk are assumed to be fixed with d τ jk (·) = 1 and α jk = {α β jk = {m}} with m as a prior mean for β jk . The matrix P jk (τ jk ) is a fixed prior precision matrix, for example, P jk = diag(τ jk ). In a lot of applications, a vague prior specification is used with m = 0 and a large precision (see, e.g., Fahrmeir et al. 2013).

... Nonlinear Effects
If the nonlinear functions f jk (·) in (1) are modeled using a basis function approach, the usual choice of prior p jk (·) is based on a multivariate normal kernel for β jk given by (8) Here, the precision matrix P jk (τ jk ) is derived from prespecified so-called penalty matrices α β jk = {K 1 jk , . . . , K L jk }, for example, for tensor product smooths the precision matrix is P jk (τ jk ) = L jk l=1 1 τ l jk K l jk . Note that P jk (τ jk ) is often not of full rank and therefore d β jk (·) is partially improper. The variances τ l jk account for the amount of smoothness (regularization) of the function and can be interpreted as the inverse smoothing parameters in the frequentist approach. A common choice for the prior for τ jk is based on inverse gamma distributions for each with fixed parameters α τ jk = {a jk , b jk }. Small values for a jk and b jk correspond to approximate flat priors for log(τ l jk ). Setting b jk = 0 and a jk = −1 or a jk = −1/2 · 1 yields flat priors for τ l jk and τ 0.5 l jk , respectively. However, the inverse gamma prior is very sensitive to the choice of a jk and b jk if τ l jk is close to zero. Therefore, Gelman (2006) proposed the half-Cauchy prior which has the desirable property that for τ l jk = 0 the density is a nonzero constant, whereas the density of the inverse gamma for τ l jk → 0 vanishes (see also Polson and Scott 2012). Another question is the actual choice of hyperparameters. A recent suggestion reducing this issue to the choice of a scale parameter that is directly related to the functions f jk (·) (and thus much better interpretable and accessible for the user) was given by Klein and Kneib (2016) for several different hyperpriors for τ l jk , such as resulting priors from half-Cauchy, half-normal or uniform priors for τ 0.5 l jk or resulting penalized complexity priors (Simpson et al. 2017), so-called scale-dependent priors.

... Multilevel Effects
In numerous applications, geographical information and spatial covariates are given at different resolutions. Whenever there is such a nested structure in the data, it is possible to model the complex (spatial) heterogeneity effects using a compound prior β jk =η jk (x;β jk ) + ε jk , where ε jk ∼ N (0,τ jk I) is a vector of iid Gaussian random effects andη jk (x;β jk ) represents a full predictor of nested covariates, for example, including a discrete regional spatial effect. This way, potential costly operations in updating Algorithm A2a and A2b can be avoided since the number of observations inη jk (x;β jk ) is equal to the number of coefficients in β jk , which is usually much smaller than the actual number of observations n. Moreover, the full conditionals (see also Section 4.2) forβ jk are Gaussian regardless of the response distribution and leads to highly efficient estimation algorithms, see Lang et al. (2014).

Model Fitting
The construction of suitable updating functions U jk (·) for solving E1 and E2 can be carried out in many ways. (Note again that the general algorithm A1 does not restrict to a specific iterative procedure.) In the following, we describe commonly used quantities that can be used for estimation of BAMLSS. Moreover, this section highlights the "Lego" properties obtained from the gradient-based updating in E1 and E2. More precisely, we first describe posterior mode updating as used within Algorithm A2a, before we introduce several MCMC sampling schemes that can be employed in the updating Algorithm A2b.

... Posterior Mode
The mode of the posterior distribution is the mode of the logposterior (3) given by Mode(β, τ| y, X, α) = arg max β,τ log π (β, τ; y, X, α) and is equivalent to the maximum likelihood estimator ML(β| y, X) = arg max β (β; y, X) when assigning flat (constant) priors to β jk for j = 1, . . . , J k , k = 1, . . . , K. For models involving shrinkage priors, for example, in GAM-type models given by (8), the posterior mode is equivalent to a penalized maximum likelihood estimator for fixed parameters τ jk and prior densities d τ jk (·) ∝ constant. Moreover, the structure of the log-posterior (3) usually prohibits estimation of τ jk through maximization and the estimatorτ jk is commonly derived by additionally minimizing information criteria such as the Akaike (AIC) or Bayesian information criterion (BIC). See also Algorithm A2a for an adaptive stepwise approach for estimation of τ jk (see also Rigby and Stasinopoulos 2005, Appendix A.2. for a more detailed discussion on smoothing parameter estimation). In Section 4.3, we briefly discuss details on the computation of information criteria with equivalent degrees of freedom.
For developing general updating functions, we begin with describing posterior mode estimation for the case of fixed parameters τ jk , because these updating functions form the basis of estimation algorithms for τ jk . Estimation of β = (β 1 , . . . , β K ) requires solving equations ∂(log π (β, τ; y, X, α))/∂β = 0. A particularly convenient updating function (5) for maximization of (3) is a Newton-Raphson type updating with the following score vector s(β) and Hessian matrix H ks (β) By chain rule, the part of the score vector involving the derivatives of the log-likelihood for the kth parameter can be further decomposed to including the derivatives of: the log-likelihood with respect to η k and θ k (building block B6a), the inverse link functions (B4), the predictor η k with respect to coefficients β k (B5). Similarly, the components of H ks including (β; y, X) (B7a) can be written as yielding a decomposition of building blocks B7b and B5. The second term on the right-hand side cancels out if all functions (2) can be written as a matrix product of a design matrix and coefficients, for example, when using a basis function approach. Within the first term, the second derivatives of the log-likelihood involving the predictors can be written as involving the second derivatives of the link functions.
Using a k-partitioned updating scheme as in (6), updating functions U k (·) are given by Assuming model terms (2) that can be written as a matrix product of a design matrix and coefficients, the Hessian matrix in (13) is given by with diagonal weight matrix W kk = −diag(∂ 2 (β; y, X)/ ∂η k ∂η k ) and matrices forming building block B8 Here, we want to emphasize that the influence of these prior derivatives matrices is usually controlled by τ jk , however, note once again that the τ jk are held fixed for the moment and usually estimation cannot be done with maximization of the logposterior (see also Section 4.3). Typically, using a linear basis function representation of functions f jk (·) and priors based on multivariate normal kernels (8) matrices G jk (τ jk ) are a simple product of smoothing variances and penalty matrices, for example, with only one smoothing variance building block B8 becomes G jk (τ jk ) = τ −1 jk K jk with corresponding penalty matrix K jk .
Similarly, the score vector is and derivatives u k = ∂ (β; y, X)/∂η k . Focusing on the jth row of (13) leads to single model term updating functions U jk (·) as presented in algorithm (7). The updates are based on an iteratively weighted least-square scheme given by with working observations z k = η (t ) k + W −1 kk u (t ) k (in the supplemental material Section 3, the detailed derivations are presented), that is, the algorithm only requires building blocks B6b, B7b, and B8. Hence, this leads to a backfitting algorithm and cycling through (15) for terms j = 1, . . . , J k and parameters k = 1, . . . , K approximates a single Newton-Raphson step in (10), since cross derivatives are not incorporated in the updating scheme. Note that this yields the ingredients of the RS-algorithm developed by Rigby and Stasinopoulos (2005), Appendix B.2. The updating scheme (15) can be further generalized to β (t+1) , that is, theoretically any updating function applied on the "partial residuals" z k − η (t+1) k,− j can be used. Note also that this is equivalent to updating function where matrix J kk (·) is the derivative matrix given in (11), involving building blocks B6a, B7a, and B8 (see also the supplemental material Section 3 the detailed derivations). For optimization, different strategies of the backfitting algorithm (15) can be applied. One alternative is a complete inner backfitting algorithm for each parameter k, that is, the backfitting procedure updates β jk , for j = 1, . . . , J k until convergence, afterward updates for parameters for the next k are calculated again by a complete inner backfitting algorithm, and so forth (see also Rigby and Stasinopoulos 2005).
Note that for numerical reasons, it is oftentimes better to replace the Hessian by the expected Fisher information with weights W kk = −diag(E(∂ 2 (β; y, X)/∂η k ∂η k )), see Klein, Kneib, and Lang (2015). Moreover, to achieve convergence, algorithms for posterior mode usually initialize the parameter vectors θ k . Then, after one complete inner backfitting iteration, the algorithm can proceed in a full zigzag fashion or again with inner iterations. For all updating schemes, it might also be appropriate to vary the updating step length of parameter updates (half-stepping), possibly in each iteration.
Clearly, the problem in deriving the expectation, and other quantities like the posterior median, relies on the computation of usually high-dimensional integrals, which can be rarely solved analytically and thus need to be approximated by numerical techniques. MCMC simulation is commonly used in such situations as it provides an extensible framework that can adapt to almost any type of problem. In the following, we summarize sampling techniques that are especially well-suited within the BAMLSS framework, that is, techniques that can be used for a highly modular and extensible system. In this context, we describe sampling functions for the updating scheme presented in (7), that is, the functions U jk (·) now generate the next step in a Markov chain.
Note that for some models, there exist full conditionals that can be derived in closed form from the log-posterior (3). However, we especially focus on situations where this is not generally the case. MCMC samples for the regression coefficients β jk can be derived by each of the following methods: r Derivative-based Metropolis-Hastings: Probably the most important algorithm, because of its generality and ease of implementation, is random-walk Metropolis. The sampler proceeds by drawing a candidate β jk from a symmetric jumping distribution q(β jk | β (t ) jk ) which is commonly a normal distribution N (β (t ) jk , jk ) centered at the current iterate. However, in complex settings this sampling scheme is usually not efficient and tuning the covariance matrix jk in an adaptive phase is difficult and does not necessarily result in iid behavior of the Markov chain. Therefore, a commonly used alternative for the covariance matrix of the jumping distribution is to use the local curvature information jk = −(∂ 2 π (ϑ; y, X)/∂β jk β jk ) −1 , or its expectation, computed at the posterior mode estimateβ jk , requiring building blocks B7a and B8. However, fixing jk during MCMC simulation might still lead to undesired behavior of the Markov chain especially when parameter samples move into regions with low probability mass of the posterior distribution. A solution with good mixing properties is to construct approximate full conditionals π (β jk |·) that are based on a second-order Taylor series expansion of the log-posterior centered at the last state (Gamerman 1997;Klein, Kneib, and Lang 2015). The resulting proposal density q(β jk |β (t ) jk ) is again normal (see supplements Section 4) with precision matrix and mean given by which is equivalent to the updating function given in (16) and can again be build using blocks B7a and B8. Hence, the mean is simply one Newton or Fisher scoring iteration toward the posterior mode at the current step. Again, assuming a basis function approach for f jk (·) the precision and mean are k,− j with weights W kk = −diag(∂ 2 (β; y, X)/∂η k ∂η k ), or the corresponding expectation, as in posterior mode updating using building blocks B7b and B8, with working observations z k = η (t ) k + W −1 kk u (t ) k (see also the supplemental material Section 4 for a detailed derivation). Note again, the computation of the mean is equivalent to a full Newton step as given in updating function (16), or Fisher scoring when using −E(∂ 2 (β; y, X)/∂η k ∂η k ), in each iteration of the MCMC sampler using iteratively weighted least squares (IWLS). If the computation of the weights is expensive, one simple strategy is to update W kk only after samples of all parameters of θ k are drawn.
r Slice sampling: Slice sampling (Neal 2003) is a gradientfree MCMC sampling scheme that produces samples with 100% acceptance rate. Therefore, and because of the simplicity of the algorithm, slice sampling is especially useful for automated general purpose MCMC implementations that allow for sampling from many distributions. The basic slice sampling algorithm samples univariate directly under the plot of the log-posterior (3). Updates for the ith parameter in β jk are generated by: 1. Sample h ∼ U (0, π (β (t ) i jk | ·)).

Sample β (t+1)
i jk ∼ U (S) from the horizontal slice S = {β i jk : h < π(β i jk | ·)}. The full conditional π (τ jk | ·) for smoothing variances is commonly constructed using priors for τ jk that lead to known distributions, that is, simple Gibbs sampling is possible. For example, this is the case when using a basis function approach and only one smoothing variance τ jk is assigned. Then, by using an inverse gamma prior (9) for τ jk in combination with the normal prior (8) for β jk the full-conditional π (τ jk | ·) is again an inverse gamma distribution withã jk = 1 2 rk(K jk ) + a jk ,b jk = 1 2 (β jk ) K jk β jk + b jk and matrix K jk is again a penalty matrix that depends on the type of model term. As mentioned in Section 4.1, other priors than the inverse gamma might be desirable. Then, Metropolis-Hastings steps also for the variances can be constructed, see Klein and Kneib (2016) for details. If a simple Gibbs sampling step cannot be derived, for example, for multidimensional tensor product splines, another strategy is to use slice sampling, since the number of smoothing variances is usually not very large, the computational burden does most of the times not exceed possible benefits.

... Diagnostics
Quantile residuals defined asr i = −1 (F (y i |θ i )) with the inverse cumulative distribution function (CDF) of a standard normal distribution −1 and F (·) the CDF of the modeled distribution D(·) with estimated parametersθ i = (θ i1 , . . . ,θ iK ) plugged in, should at least approximately be standard normally distributed if the correct model has been specified. Resulting residuals can be assessed by quantile-quantile plots or probability integral transforms which consider u i = F (y i |θ i ). If the estimated model is a good approximation to the true datagenerating process, the u i will then approximately follow a uniform distribution on [0, 1]. Graphically, histograms of the u i can be used for this purpose.

... Smoothing Variances with Posterior Mode
As noted in Section 4.2, depending on the structure of the priors (4) parameters τ jk cannot be estimated by maximization of the log-posterior (3), for example, this is the case for GAM-type models using normal priors (8) where τ jk represents smoothing variances.
Then, goodness-of-fit criteria like the Akaike information criterion (AIC), or the corrected AIC, as well as the Bayesian information criterion (BIC), among others, are commonly used for selecting the smoothing variances. Estimation of model complexity is based on the so-called equivalent degrees of freedom (EDF), calculated by edf jk (τ jk ) := trace[J kk (β jk )(J kk (β jk ) + G jk (τ jk )) −1 ], where J kk (·) is the derivative matrix given in (11) and matrix G jk (τ jk ) is the prior derivative matrix as given in (14). The total degrees of freedom used to fit the model are then estimated by k j edf jk (τ jk ). Note that the definition of EDF here is slightly more general and is usually defined as the trace of the smoother matrix (see, e.g., Hastie and Tibshirani 1990) and can be applied even for more complex likelihood structures.
Instead of global optimization of smoothing variances, a fast strategy is the adaptive stepwise selection approach presented in Algorithm A2a.

... Variable Selection with Posterior Mean
The deviance information criterion (DIC) can be used for model choice and variable selection in Bayesian inference. It is easily be computed from the MCMC output without requiring additional computational effort. If β (1) , . . . , β (T ) is an MCMC sample from the posterior for the complete parameter vector β, the DIC is given by where D(β) = −2 · (β; y, X) is the model deviance and pd = D(β) − D(β) is an effective parameter count.

Inference and Prediction
Under suitable regularity conditions inference for parameters β jk can be based on the asymptotic normality of the posterior distribution β jk | y a ∼ N (β jk , H(β jk ) −1 ), withβ jk as the posterior mode estimate. However, this approach is problematic since it does not take into account the uncertainty of estimated smoothing parameters. From a computational perspective, it can be difficult to derive the full Hessian information, because this might involve complex cross derivatives of the parameters and there are cases where standard numerical techniques cannot be applied (see Section 6.2).
Instead, applying fully Bayesian inference is relatively easy by direct computation of desired statistics from posterior samples. Computational costs are relatively low, since only samples for parameters β jk and τ 2 jk need to be saved (in practice about 2000-3000 are sufficient) from which inference of any combination of terms is straightforward, too.
The posterior predictive distribution is approximated similarly. Random samples for response observations given new covariate values x are computed by drawing samples from the response distribution y ∼ D(h 1 (θ

Strategies for Implementation
An implementation of the conceptional framework proposed in the previous sections is provided in the R package bamlss . In this section, we outline the strategies that have been guiding this implementation but technical and R-specific details are kept brief. Instead we focus on how the flexible conceptual framework with its "Lego bricks" can be turned into an extensible and modular computational framework that readily allows to construct estimation algorithms as well as interfaces to existing software packages such as JAGS (Plummer 2003) or BayesX (Belitz et al. 2017).
To provide a common toolbox that allows to play with the Lego bricks introduced in the previous sections, a general BAMLSS software system can be set up as shown in Figure 1. This proceeds in the following steps:   where the first two lines basically represent the standard model frame specifications (see Chambers and Hastie 1992).
The formula combines the classic Wilkinson and Rogers (1973) symbolic description-used in most standard R regression functions (Chambers and Hastie 1992)-with the infrastructure for smooth model terms like s(), te(), ti(), etc.-based on recommended R package mgcv-and handling multiple additive predictors-using the extended formula processing of Zeileis and Croissant (2010). Thus, a formula can be as simple as in a typical linear regression model with a response variable y and two regressors: y ∼ x1 + x2. But it can also encompass smooth terms in further covariates: y ∼ x1 + x2 + s(x3) + s(x4, x5). Or it may even have different additive predictors for different model parameters: list( y ∼ x1 + x2 + s(x3) + s(x4), sigma ∼ x1 + x2 + s(x3) ), for example, in a normal model with y ∼ N (μ = η μ , log(σ ) = η σ ).
Similarly to other flexible model fitting functions, users can specify their own family objects to plug in different Lego bricks for B1-B8. Family objects from the gamlss suite are readily supported.
Estimation is performed by an optimizer and/or sampler function, which can be provided by the user. The default optimizer function implements the IWLS backfitting algorithm (15) with automatic smoothing variance selection, see also Algorithm A2a. The default sampler function implements derivativebased MCMC using IWLS proposals, smoothing variances are sampled using slice sampling, see also Section 4. For writing new optimizer and sampler functions, only a simple general format of function arguments and return values must be adhered to. More technical details are deferred to the documentation manual of package bamlss.
To give a brief insight how to use bamlss, we illustrate the basic steps on a small textbook example using the wellknown simulated motorcycle accident data (Silverman 1985). The data contain measurements of the head acceleration (in g, variable accel) in a simulated motorcycle accident, recorded in milliseconds after impact (variable times). To estimate a Gaussian location-scale model with accel ∼ N (μ = f (times), log(σ ) = f (times)), we use the following model formula: R> f <list(accel˜s(times, k = 20), + sigma˜s(times, k = 20)) where s() is the smooth term constructor from mgcv. The model is then fitted by R> data("mcycle", package = "MASS") R> b <bamlss(f, data = mcycle, + family = "gaussian") The returned object is of class "bamlss" for which generic extractor functions like summary(), plot(), predict(), etc., are provided. For example, the posterior mean function based on MCMC samples for parameter μ can be computed by where argument FUN can be any function, for example, a function computing credible intervals from the empirical quantiles of the MCMC samples. In Figure 2, the estimated function for a Gaussian model with constant scale log(σ ) = β 0 is shown in the left panel. Although the mean function seems to modeled appropriately, the plot shows that the 95% credible bands are too wide in the beginning and the end of the experiment. The middle panel shows the estimated function of the heteroscedastic model from above. Here, the credible bands follow much better the distribution of the data driven by the estimated variance function shown in the right panel. Uncertainty is low in the beginning and goes up until about 35 milliseconds and slightly decreases afterward.

Censored Heteroscedastic Precipitation Climatology
Climatology models are one important component of the meteorological tool set. The accurate and complete knowledge of precipitation climatologies is especially relevant for problems involving agriculture, risk assessment, water management, tourism, etc. One particular challenge of such models is the prediction at high temporal and spatial resolutions, especially in areas without measurement. This is usually accounted for by simple averaging/smoothing at a coarse temporal scale (e.g., monthly aggregations) combined with a second step using spatial interpolation methods like kriging (Krige 1951). However, such approaches may not work well enough at a daily resolution where precipitation data are skewed and exhibit high density at zero observations. To address these issues, Stauffer et al. (2017) had recently suggested an additive regression model for daily precipitation observations based on a censored normal response distribution and various smooth spatio-temporal effects.
Following the model of Stauffer et al. (2017) for the province of Tyrol in Austria, we take their approach a step further and establish a daily precipitation climatology for all of Austria using a large and freely available homogenized data source. The data are taken from the HOMSTART project (http://www.zamg.ac.at/cms/de/forschung/klima/datensaetze/ homstart/) conducted at the Zentralanstalt für Meteorologie und Geodynamik (ZAMG) and funded by the Austrian Climate Research Programme (ACRP, Nemec et al. 2011Nemec et al. , 2013. Homogenization was successfully carried out for daily precipitation time series within 1948-2009 from a rather dense net of 57 meteorological stations (see the left panel of Figure 3). Umlauf et al. (2012) previously investigated the data based on a much simpler ordered probit model to answer the question whether it rains more frequently on weekends than during work days (it does not). Here, we reanalyze the data using a much more complex additive regression model with a normal response left-censored at zero. To make positive observations more "normal, " a commonly used square-root transformation has been applied prior to regression modeling (see the right panel of Figure 3).
Because precipitation in the Alps is driven by the season and local characteristics, for example, differing altitude form north to south, we use the following predictor for μ and σ : here function f 1 is an altitude effect, f 2 is the cyclic seasonal variation, f 3 is a spatially correlated effect of longitude and latitude coordinates, and f 4 is a spatially varying seasonal effect. Hence, the overall seasonal effect is constructed by the main effect f 2 and the interaction effect f 4 , where the deviations from the main effect are modeled to sum to zero for each day of the year, that is, this can be viewed as a functional ANOVA decomposition. For full Bayesian estimation with Algorithm A1, A2a, and A2b, we construct updating functions U jk (·) based on IWLS structures. Hence, as shown in Section 4 this only requires the following "Lego bricks" to be implemented (the detailed expressions are provided in the supplemental material Section 2): B1. The density function of a Gaussian distribution left censored at zero.
Since the HOMSTART dataset has over 1.2 million observations, the full storage of the resulting design matrices would lead to excessive demands concerning both computer storage as well as CPU power. To prevent computational problems associated with very large datasets like HOMSTART, we make use of the fact that the number of unique regressor observations is much smaller, for example, only 365 for the day-of-year effect. This is much smaller than the total number of observations of the dataset and duplicated rows in the corresponding design matrix can be avoided within the model fitting algorithms. Therefore, we implemented updating functions U jk (·) that support shrinkage of the design matrices based on unique covariate observations, using the highly efficient algorithm of Lang et al. (2014). This essentially employs a reduced form of the diagonal weight matrix W kk in the IWLS algorithm and computes the reduced partial residual vector from z k − η (t ) k,− j separately. For usage within bamlss, see also the documentation of estimation engines bfit() and GMCMC() and the corresponding updating functions bfit_iwls() and GMCMC_iwls().
With a total of 4000 iterations of the MCMC sampler, on a Linux system with 8 Intel i7-2600 3.40 GHz processors running the model takes approximately 17 hr. For each core, the first 2000 samples are withdrawn and only every 10th sample is saved.
The plots of the estimated effects are shown in Figure 4. The top row illustrates the spatial variation of the seasonal effect (solid lines) together with the mean effect (dashed lines) for parameters μ and σ . The estimates indicate that during June to August precipitation is highest in the mean effect for μ. However, there is some clear spatial variation, especially differences between the regions north and south of the Alps. This is highlighted by the red, gray, and blue lines and show that the southern stations have a clear annual peak while for the northern stations the semiannual pattern is more pronounced. Similarly, the seasonal effect for parameter σ has considerable variation between north and south. The uncertainty peak is shifting from the middle of summer to fall when going from north to south.
The second row of Figure 4 shows the resulting spatial trends. The spatial effect for parameter μ indicates that regions with positive effect accumulate in the north-west part of Austria. The overall importance of the spatial effect is somewhat smaller compared to the seasonal effects, which is highlighted by using the same range for y-axes in the first row and the color legends in the second row. The spatial effect for parameter σ shows that model uncertainty is the highest within the southern regions (especially the province of Carinthia) and in the most western province (Vorarlberg).
The bottom plot in Figure 4 is an example of the resulting precipitation climatology for January 10. The predicted average precipitation is quite low all over Austria, ranging from 0 to 1.1mm. The map indicates that more precipitation can be expected in the northern parts of the Alps, especially in the west (Vorarlberg) and in the center (Salzburg). The effect of elevation is also visible since the valleys exhibit less precipitation than the alpine regions, however, the effect is not as pronounced as, for example, the seasonal effect(s), most probably because the variation of elevation of the meteorological stations used in this dataset is relatively small.

Complex Space-Time Interactions in a Cox Model
This analysis is based on the article of Taylor (2017) and contributes to the developed model by inclusion of complex spacetime interactions using the BAMLSS framework.
The London Fire Brigade (LFB, http://www.londonfire.gov.uk/) is one of the largest in the world. Each year, the LFB is called thousands of times, in most cases due to dwelling fires. To prevent further damage or fatal casualties, a short arrival time is important, that is, the time it takes until a fire engine arrives at the scene after an emergency call has been received. The LFB's annual performance target is an average fire engine arrival time of 6 min at maximum. Clearly, this mostly depends on the distance between the site and the responsible fire station but it may also depend on the number of fire stations in the area because fire engines may already be in use at another nearby fire scenery. Therefore, Taylor (2017) analyzed the effect of fire station closures in 2014 using a parametric proportional hazards model to identify regions of possible concern about the number of available fire stations. To contribute to the topic, we apply an extended complex Cox model to the 2015 dwelling fire response time data and illustrate how the generic BAMLSS framework can be used to set up new estimation algorithms for this type of model.
The distribution of the 5838 dwelling fires in 2015 together with the fire stations is shown in Figure 5 (a precompiled version of the data is available in the bamlss package). The top left panel indicates that both, fire stations and fire events, are spread all over London with a higher density in the city center which is brought out more clearly by the heatmap in the bottom left panel. The panels on the right-hand side pertain to the arrival time and show that overall about 30% of these were greater than 6 min (bottom right) with most of these occurring at the borders of London (top right).
Taylor (2017) analyzed the response times within a survival context where the hazard of an event (fire engine arriving) at time t with a relative risk model of the form that is, a model for the instantaneous arrival rate conditional on the engine not having arrived before time t. Here, the hazard function is assumed to depend on a time-varying predictor η λ (t ) and a time-constant predictor η γ . In most survival models, the time-varying part η λ (t ) represents the so-called baseline hazard and is a univariate function of time t. Compared to Taylor (2017), we set up a similar model but with the extended timeconstant predictor where β 0 is an intercept and function f 1 is the effect of fire station intensity (fsintens, computed with a kernel density estimate of all fire stations in London). Thus, this variable is a proxy for the distance to the next fire station(s), especially suited for situations when the responsible fire station already send out all fire engines such that help needs to arrive from another station.
Function f 2 accounts for the effect that it is more difficult for a fire engine to arrive at the scene in rush hours, that is, the risk of waiting longer than 6 min is expected to depend on the time of the day, variable daytime. To treat the question of structured spatially driven hazards, a spatial effect f 3 of longitude and latitude coordinates is included in the model. Moreover, we also treat the daytime effect in a spatially correlated context, function f 4 . For example, we assume that rush hour peaks may have local hot spots that can be captured by this three-dimensional effect. Again, all functions f 1 , . . . , f 4 are assumed to be possibly nonlinear and are modeled using penalized splines. Moreover, we also relax the time-varying predictor to η λ (t ) = f 0 (t ) + J λ j=1 f j (t, x). Here, the baseline hazard is represented by f 0 (t ) and all functions f j (t, x) are time-varying possibly nonlinear functions of covariates. Hence, our model is a complex Cox-type additive model as introduced by Kneib and Fahrmeir (2007). To further investigate if there is a space-time varying effect, that is, if the shape of the baseline hazard is dependent on the location we use the following time-varying additive predictor where f 0 is the baseline hazard for variable arrivaltime, the waiting time until the first fire engine arrives after the received emergency call. Function f 1 (arrivaltime, lon, lat) is a space-time varying effect modeling the deviations from the baseline which can capture whether the risk of waiting longer than 6 min is driven by other factors that are not available in this analysis. Both functions are modeled using penalized splines. The probability that the engine will arrive on the scene after time t is described by the survival function , which is of prime interest in this analysis. For full Bayesian inference, the following "Lego bricks" need to be implemented for updating functions U jk (·) using algorithms A1, A2a, and A2b (the detailed expressions are provided in the supplemental material Section 3): B1. The log-likelihood function of the continuous time Cox model. B6a. For derivative-based estimation using Algorithm A2a and for MCMC simulation with Algorithm A2b, the score vectors and Hessian need to be computed. B7a. The elements of the Hessian w.r.t. β λ . Note that these cannot be computed by fragmenting with the chain rule to obtain building block B7b and IWLS updating functions, see the supplemental material Section 3 for details. B6b & B7b. However, constructing updating functions for the time-constant part η γ again yields an IWLS updating scheme based on building block B7b. As a result, applying the generic algorithm presented in Algorithm A1 to this type of problem, two specific difficulties need to be considered. First, the updating functions U jk (·) for the time-varying predictor η λ (t ) are different from the time-constant updating functions for η γ . Second, a specific hurdle of the continuous-time Cox model is the computation of the integrals, because these do not have a closed-form solution and need to be approximated numerically, for example, by the trapezoidal rule or Gaussian quadrature (Hofner 2008;Waldmann et al. 2017). Moreover, it is inefficient to compute the integrals anew for every updating step, since for the timeconstant part the integrals given in P do not change anymore.
To reduce computing time, we account for the idiosyncrasy of the Cox model and implement an optimizer function cox.mode() for posterior mode estimation as well as the sampler function cox.mcmc() for MCMC simulation (which are part of the corresponding bamlss family object cox_bamlss()). On a Linux system with 8 Intel i7-2600 3.40 GHz processors estimation takes approximately 1.2 days. Note that function cox.mode() also applies an automated procedure for smoothing variances selection using information criteria, see also Algorithm A2a.
The estimated effects are shown in Figure 6. The upper left panel shows that the average "risk" that a fire engine arrives increases steeply until the target time of 6 min. The space-time varying effect is relatively small compared to the overall size of the effect, especially until the 6 min target time it seems that the location does not have a great influence on the relative risk. Only for waiting times above ∼15 min, the space-time varying effect is more pronounced. The effect for fire station intensity is quite large and bounded, that is, there is a natural limit for the benefit from opening new fire stations in the area. The effect of the time of the day then indicates that in the morning hours around 4-5 am, as well as in the afternoon around 2-4 pm, the risk of waiting longer for the fire engine to arrive is only slightly increasing. In addition, the spatial deviations from the mean time of day effect are modest, similar in magnitude as the spatial varying baseline effects. The largest deviation seems to be at around 10 am. In Figure 7, the spatial varying effect is illustrated on nine time points. The maps indicate possible hot-spots of this effect, however, as mentioned above the overall effect size from −0.4 to 0.4 is not very large (see also Figure 6 bottom left) such that differences in risk probabilities are almost negligible. In contrast, the time-constant spatial effect clearly shows that the average risk of increased waiting times are higher in the city center and some smaller area in southern London. However, the estimated probabilities of waiting longer than 6 min around the center show moderate variation, while the borders of London indicate higher probabilities as well as in the western parts, most probably because of the lower fire station density in these areas. In summary, next to the baseline effect, the most important effects on the log risk are the fire station intensity and the time-constant spatial effect which have an absolute range of about 4 on the log-scale.
To conclude, the proposed model including complex model terms beyond "classical" structures, like space-time interactions in both the time-constant and the time-varying part, is a considerable extension of this type of model and can gain more insight into potential risk factors that are probably not obvious. The presented modular framework facilitates the development of such complex algorithms essentially, for example, Köhler et al. (2017) develop flexible joint models for longitudinal and time-to-event data using the modular BAMLSS framework.

Summary and Discussion
This article combines frequently used algorithms for the estimation of additive Bayesian models in a flexible framework for distributional regression, also termed Bayesian additive models for location, scale, and shape (BAMLSS), and beyond. We highlight the similarities between optimization and sampling concepts and coalesce these in a generic toolbox of modular "Lego bricks. " Two case studies illustrate how the framework can be leveraged to establish complex and difficult-to-estimate models based on the accompanying implementation in the R package bamlss . One drawback of our framework when using an MCMC engine with highly complex predictor structures in the distribution parameters and big datasets is a possibly long computing time. To overcome this problem, one may either resort to parallel computing facilities-if availableor approximative, faster methods like integrated nested Laplace approximations (Rue, Martino, and Chopin 2009). The latter however, is currently restricted to distributions with only one predictor and hence not usable for instance a heteroscedastic Gaussian model.
Interesting extensions in the future include efficient matrix transformations for additional effect types based on Gaussian processes (Paciorek 2007) and the broadening of the available model classes in bamlss such as state-space models (Carter and Kohn 1994).

Supplementary Materials
Algorithmic details: Detailed information about algorithmic derivations and "Lego bricks" used in the main manuscript are provided in the accompanying online supplementary material .pdf.
R package: R package bamlss, including the datasets for reproducing the illustrations, is available at http://CRAN.Rproject.org/package=bamlss. Code: The R code for reproducing the models is provided in the script models.R, for reproducing all figures in the script figures.R.