Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data

Abstract We develop scalable methods for fitting penalized regression spline based generalized additive models with of the order of 104 coefficients to up to 108 data. Computational feasibility rests on: (i) a new iteration scheme for estimation of model coefficients and smoothing parameters, avoiding poorly scaling matrix operations; (ii) parallelization of the iteration’s pivoted block Cholesky and basic matrix operations; (iii) the marginal discretization of model covariates to reduce memory footprint, with efficient scalable methods for computing required crossproducts directly from the discrete representation. Marginal discretization enables much finer discretization than joint discretization would permit. We were motivated by the need to model four decades worth of daily particulate data from the U.K. Black Smoke and Sulphur Dioxide Monitoring Network. Although reduced in size recently, over 2000 stations have at some time been part of the network, resulting in some 10 million measurements. Modeling at a daily scale is desirable for accurate trend estimation and mapping, and to provide daily exposure estimates for epidemiological cohort studies. Because of the dataset size, previous work has focused on modeling time or space averaged pollution levels, but this is unsatisfactory from a health perspective, since it is often acute exposure locally and on the time scale of days that is of most importance in driving adverse health outcomes. If computed by conventional means our black smoke model would require a half terabyte of storage just for the model matrix, whereas we are able to compute with it on a desktop workstation. The best previously available reduced memory footprint method would have required three orders of magnitude more computing time than our new method. Supplementary materials for this article are available online.


Introduction
This article proposes a method for estimating generalized additive models (a particular class of Gaussian latent process models) for much larger datasets and models than has hitherto been possible. For our application we achieve a three order of magnitude speed up relative to previous big data GAM methods (e.g., Wood, Goude, and Shaw 2015). Our new method rests on three innovations: (i) an efficient new fitting iteration, employing a minimal number of matrix operations all of which scale reasonably well, (ii) OpenMP based parallelization of these matrix operations, and (iii) a novel marginal covariate discretization scheme, enabling compact model representation and efficient computation of key matrix crossproducts. These three elements work together, and dropping any one of them leads to an increase in fitting time of an order of magnitude or more.
We are motivated by a practical problem in spatial epidemiology: the local estimation of short-term exposure to air pollution, based on monitoring network data. Specifically we focus on the United Kingdom Black Smoke (BS) monitoring network, which collected daily data on μg m −3 (micrograms per cubic meter) of BS particulates (largely from coal and Diesel combustion) from  Figure 1(a) shows the network in 1967, indicating the average log BS measurements in that year. The other panels in Figure 1 illustrate the temporal patterns in the data, and in the network size. In total the data comprise 9,451,232 daily measurements from 2862 monitoring sites. Because of the data volume, previous attempts to model spatiotemporal patterns in the BS data have focused on annual averages (e.g., Shaddick and Zidek 2014). This is not entirely satisfactory from an epidemiological perspective, since acute respiratory disease is usually sensitive to exposure to high levels of pollution over short time periods, and such exposure can be completely hidden in an annual average. Retrospective cohort studies, for example, really require estimates of exposure at the daily level, rather than annual averages, if they are to successfully uncover acute effects. This difference between acute and longterm exposure is also reflected in the health guidelines, with EU regulations currently stipulating that annual average exposure should not exceed 68μg m −3 while daily peak exposure should not exceed 213 μg m −3 . Given the data volume, an obvious option is not to model, but simply to estimate daily exposure directly from the raw measurement, but this is a poor option for several reasons. First, the network design is not random but shows a type of preferential sampling (Shaddick and Zidek 2014), so that a design based approach to exposure estimation will result in bias, which is only avoidable by taking a model-based approach. Second, the reduced number of stations later in the data make spatial predictions difficult without a model that is able to share information across years. Third, there are strong covariate effects.
We will end up using a model structure where y, doy and dow denote, year, day of year, and day of week; n and e denote location as kilometers north and east; h and r are height (elevation of station) and cube root transformed rainfall (unfortunately only available as monthly average); T 0 and T 1 are daily minimum and maximum temperature, whileT1 andT2 are daily mean temperature on and two days previously; α k(i) is a fixed effect for the site type k of the ith observation (type is one of R (rural), A (industrial), B (residential), C (commercial), D (city/town center), X (mixed), or M (missing)); b id(i) is a random effect for the idth station, while e i is a Gaussian error term following an AR process at each site. Using reduced rank spline basis expansions for the terms in (1) requires around 8000 model coefficients. So estimating the model as a penalized GLM in the manner of Wood (2011) would require half a terabyte of storage just for the model matrix and is clearly infeasible. Our original intention was to use the method of Wood, Goude, and Shaw (2015) (available in R package mgcv) or to follow Shaddick and Zidek (2014) in using the method of Rue, Martino, and Chopin (2009) (via the INLA package), however this proved not to be feasible. Even if the computational load had been acceptable in terms of execution time, our experiments with smaller models and datasets suggested that INLA would require more than the 128Gb of memory that we had available. The Wood, Goude, and Shaw (2015) method would have been possible in terms of memory footprint, but we estimated that fitting would have taken in excess of a month of computing time (12 core Xeon E5-2670 2.3 GHz CPU), even using an enhanced efficiency version of the method employing some of the ideas from the current article for REML smoothing parameter selection. Using just the published method would have required approximately five times as long.
After reviewing model representation in Section 2, we develop a practical fitting method in Sections 3 and 4, which reduces the fitting time for model (1) to under an hour. The novel developments that allow this are covered in Section 4 and appendix A. Sections 5 and 5.1 then discuss the black smoke modeling in more detail.

Model Class and Representation
We first review the class of generalized additive models (GAM) introduced by Tibshirani (1986, 1990) (see also Wahba 1990), relating a univariate response, y i to predictors x ji (which may be vector). A GAM has the structure μ i = E(y i ), EF denotes an exponential family distribution with known or unknown scale parameter φ, g is a known smooth monotonic link function, A(i, :) the ith row of any parametric model matrix, and θ the corresponding parameter vector. The f j are unknown smooth functions to be estimated (and must usually be subjected to sum-to-zero identifiability constraints). For estimation purposes we adopt the widely used approach of representing the unknown functions using reduced rank smoothing splines. Full smoothing splines arise from solving variational problems. For example, the cubic spline problem seeks f , from some reproducing kernel Hilbert (or appropriate a smoothing parameter). The result can be represented in terms of an explicit n-dimensional basis, while the spline penalty becomes a quadratic penalty on the basis coefficients. However, since, at latest, Wahba (1980) and Parker and Rice (1985), it has been recognized that an n-dimensional basis representation is computationally wasteful for negligible statistical gain and use of a k n dimensional basis is often preferable. Theoretical work by Gu and Kim (2002), Hall and Opsomer (2005), Li and Ruppert (2008), Kauermann, Krivobokova, andFahrmeir (2009), Claeskens, Krivobokova, andOpsomer (2009) and Wang, Shen, and Ruppert (2011) show that the reduced rank approach is well founded, with k needing to grow only rather slowly with sample size (e.g., k = O(n 1/5 ) for a cubic spline under REML smoothness estimation).
A rich variety of reduced rank model terms are available in addition to cubic splines. Examples are the P-splines of Eilers and Marx (1996); Marx and Eilers (1998); Ruppert, Wand, and Carroll (2003), and adaptive variants (e.g., Wood 2011), as well as the isotropic thin plate and other Duchon splines (Duchon 1977), for which rank reduction is conveniently performed by the eigen method of Wood (2003). Reduced rank tensor product splines (e.g., Eilers and Marx 2003;Wood 2006) are important for representing smooth interactions, splines on the sphere (Wahba 1981) and Gaussian process smoothers (Kammann and Wand 2003;Handcock, Meier, and Nychka 1994) are useful in some spatial applications. In all cases if f j = [ f j (x j1 ), f j (x j2 ), . . .] T we can write f j = X j β j where X j is an n × p j model matrix for the smooth, containing its basis functions evaluated at the observed x j values. β j is the corresponding coefficient vector. The smoothing penalty for f j can then be written β T j S j β j , where S j contains known coefficients. Since the individual f j in (2) are only estimable to within an intercept term, identifiability constraints need to be applied. As discussed in Wood, Scheipl, and Faraway (2013) the sum-to-zero constraints, i f j (x ji ) = 0 have the advantage of leading to narrow confidence intervals on the constrained f j , and it is easy to reparameterize to incorporate the constraints directly into X j and S j (which, respectively, lose a column, and a row and column in the process).
It is then straightforward to create a single n × p model matrix X = (A, X 1 , X 2 , . . .) with corresponding combined parameter vector β. Given some smoothing parameters λ a combined smoothing penalty could then be written as where S j is simply a zero padded version of S j and S λ = j λ j S j . Hence, we have an overparameterized GLM structure, g(μ) = Xβ. Given smoothing parameters it is estimated viâ This penalized likelihood approach (e.g., Green and Silverman 1994) can be viewed as a reasonable approach in its own right.
An alternative is to view penalization as the expression of a belief that "smooth is more probable than wiggly" and to express this using the (improper) prior where S − λ is a Moore-Penrose pseudoinverse (S λ being rank deficient because the penalties leave some space of functions unpenalized, and in any case do not penalize the fixed effects). In that caseβ is the MAP estimator of β, and it is clear that we can view the GAM as a Gaussian latent random field model (see Kimeldorf and Wahba 1970;Wahba 1983;Silverman 1985;Fahrmeir and Lang 2001;Ruppert, Wand, and Carroll 2003, etc.). The smoothing parameters, λ, can be estimated by generalized crossvalidation or similar (e.g., Craven and Wahba 1979), but Reiss and Ogden (2009) showed that a (restricted) marginal likelihood approach (e.g., Wood 2011) offers practical reliability advantages, in being less prone to multiple local optima and consequent undersmoothing.

The Fitting Iteration
The purpose of this article is to allow the rich existing modeling framework, described in Section 2, to be used with much larger models and datasets than has hitherto been possible, by providing substantially new scalable fitting methods. The new methods are based on the performance iteration (Gu 1992) or PQL (Breslow and Clayton 1993) approach to model fitting, modified to obtain reasonable scalability. Before introducing the modifications, we motivate the basic approach and provide an alternative justification for its use, suited to penalized regression.
It is readily shown that maximization of (3) by Fisher scoring is equivalent to the following penalized iteratively reweighted least squares (PIRLS) scheme. Initializeμ i = y i + δ i andη i = g(μ i ) where δ i is a small constant (often zero) chosen to ensure g(μ i ) exists. Then iterate the following to convergence 1. Form "pseudodata" 2. By penalized least squares, estimate β for the working model The key idea of performance iteration/PQL is to estimate λ and φ at each iteration from the working model. Consider using restricted marginal likelihood (REML) for this purpose. First suppose that we were to make the clearly false assumption and M is the dimension of the null space of S λ , then the twice negative log REML (e.g. Wood 2011) is Differentiating V with respect to φ and equating to zero, we find that the REML estimate of φ must satisfŷ where τ = tr{(X T WX/φ + S λ ) −1 X T WX/φ} is the "effective degrees of freedom" of the model. Soφ is simply the "Pearson estimator" of the scale parameter, which is a reasonable estimator without any REML justification, and without assuming normality of z (see, e.g., Wahba 1983;McCullagh and Nelder 1989;Hastie and Tibshirani 1990). Now let us eliminate the false assumption of normality of z, replacing it with central limit theorem justification. Consider the QR decomposition √ WX = QR, where Q has orthogonal columns and R is upper triangular (this decomposition is purely a theoretical device, nowhere in the new methods below do we actually need to compute a QR decomposition).
where the multivariate central limit theorem justifies e ∼ N(0, Iφ) as an n/p → ∞ approximation. The twice negative log restricted marginal likelihood for this model is For a given φ, V and V r differ only by an additive constant, and therefore result in identical inference about λ and β. Inference about φ would of course differ, since r carries information about φ, but if we plug the Pearson estimate (5) into V r then we obtain identical inference to that obtained by simply using V for φ and λ. This justifies use of (4) for λ, φ estimation.
Note that once the coefficients and smoothing parameters are estimated, further inference can be based on the large sample Bayesian result, which turns out to provide well calibrated frequentist inference (Wahba 1983;Silverman 1985;Nychka 1988;Marra and Wood 2012;Wood 2013).

A Practical Fitting Method
Implementation of the fitting iteration of Section 3 is limited by several practical considerations. 1. For the target datasets and models, it is impractical to explicitly form X whole. 2. The log determinant terms in V are potentially numerically unstable. Because having some λ j → ∞ is legitimate in GAM estimation, S λ can become so badly scaled that the computation of log determinants involves taking the logs of terms that are numerically zero. 3. For maximal efficiency it is not sensible to optimize V at each iteration step, when it will anyway be modified at the next step. 4. The update step for V should involve computations that scale well to multi-core computation. Wood, Goude, and Shaw (2015) addressed 1 by iteratively updating the QR factorization of X, and then applying the method of Wood (2011) to (6). This approach ignored 3, requires pivoted QR decomposition and addressed 2 by stabilizing reparameterizations involving p × p symmetric eigen decomposition: the QR and eigen decompositions do not scale well. For example the state of the art block pivoted QR decomposition of Quintana-Ortí, Sun, and Bischof (1998), only has around half the floating point operations as matrix-matrix computations. In consequence the Wood, Goude, and Shaw (2015) was computationally impractical for the black smoke model. See appendix C for a discussion of the issues around multicore computing.
Our proposal here addresses 3 by taking a single Newton step to update ρ = log(λ) at each cycle of the iteration (rather than fully optimizing V at each cycle). We propose to avoid the stabilizing reparameterization step by avoiding evaluation of the log determinants altogether (hence, addressing 2). This is based on the observation that the Newton step, , only involves the derivative of V, and the derivatives of the log determinants are less numerically problematic. Evaluation of V is usually required to ensure that the Newton step results in an improvement of V.
We cannot skip such a check, but we can substitute the alternative check that T ∇V (ρ + ) ≤ 0, that is, that V is nonincreasing in the direction of at the end of the Newton step (see, e.g., Wood, 2015, sec. 5.1.1).
Adopting this approach we find that the derivatives of V can be obtained using simple matrix operations and a pivoted Cholesky decomposition of X T WX, which can be accumulated blockwise, thereby dealing with 1. Lucas (2004) provides a block oriented pivoted Cholesky decomposition readily parallelized using openMP (OpenMP Architecture Review Board, 2008), which deals with point 4. The resulting method has the further advantage that, with some further work, it turns out to be possible to produce further substantial efficiency savings by discretization of the model covariates (see Section 4.5).

The Modified Fitting Iteration
Based on the above considerations, the proposed fitting iteration is as follows. Its convergence properties are discussed in Appendix B.
Step 1 does not require the explicit formation of the whole matrix X.
Step 3 reduces the β step taken if the Newton step was too long, in that it increased the penalized deviance at the ρ value at which it was computed.
Step 5 reduces the ρ step if it was so long that the REML score was increasing at the end of the step. When log φ is unknown it can be included as an extra element of ρ.
Step 6 consists of estimating theβ λ implied by the proposed ρ and the current W and z. Further more the marginal likelihood of the working penalized linear model is uses as a smoothing parameter estimation criterion, and the gradient vector of this criterion along with the first Newton step for optimizing it are also computed. The next sections detail how Step 6 is accomplished.

The REML Update
Now consider the calculation of the Newton step, , to improve (4). We have thatβ λ is the solution of (X T WX + φS λ )β λ = X T Wz. The actual computation proceeds by taking a Cholesky decomposition R T R = X T WX/φ + S λ using a parallel version of Lucas (2004). This is usually done with pivoting, in which case the rank of R is then estimated and unidentifiable parameters set to zero and dropped from subsequent computations. We then computeβ λ = R −1 R −T X T Wz/φ (by backward and forwards substitution). In what follows "pivoting" and "unpivoting" refer to the application of the Cholesky pivoting order and its reversal.
The Newton step is = −(d 2 V/dρdρ T ) −1 dV/dρ, where d 2 V/dρdρ T will have been perturbed if necessary to ensure definiteness (see Nocedal and Wright 2006). Recalling that d( and, defining δ j k = 1 if k = j and 0 otherwise, Implicit differentiation implies that This latter computation is most efficient ifβ λ is first unpivoted, S jβλ is formed and this is then repivoted: the block structure of S j (see next section) can then be be exploited. The next two sections cover computation of the derivatives of the log determinants.

Computing The Derivatives
Of log |S λ | + S λ has block diagonal structure that can be exploited. For example, denoting zero blocks by '. ' , That is there are some blocks with single smoothing parameters, and others with a more complicated additive structure. There are usually also some zero blocks on the diagonal. The block structure means that the generalized determinant and its derivatives w.r.t. ρ k = log λ k can be computed blockwise. Note in particular that, for the above example, log |S λ | + = rank(S 1 ) log(λ 1 ) + log |S 1 | + + rank(S 2 ) log(λ 2 ) For any ρ k relating to a single parameter block we have and zero second derivatives. For multi-λ blocks there will generally be first and second derivatives to compute. There are no second derivatives "between-blocks. " To facilitate computations some prefit reparameterization is undertaken, according to the type of block. 1. Single parameter diagonal blocks. These can be reparameterized so that all nonzero elements are one, and the rank precomputed. 2. Single parameter dense blocks. These can be reparameterized to look like the previous type, by similarity transform, again computing rank. 3. Multi-λ blocks are transformed so that j λ j S j has full rank in the new parameterization. Again a similarity transform is used. Typically the S j are of smaller dimension in the reparameterization and consequently an extra zero block is introduced on the diagonal of S j . The generalized determinant of type 3 blocks becomes an ordinary determinant of j λ j S j after reparameterization. Hence, its derivatives follow from the standard result d log |S| dρ = tr S −1 ∂S ∂ρ .

Computing The Derivatives of log |X T WX/φ + S λ |
The following computations build on the Cholesky decomposition of the previous sections 1. Form P = R −1 , and unpivot the rows of P. Then form PP T . These steps are O(p 3 ), but can be parallelized. 2. Form the matrices containing the nonzero rows of S k PP T (∀k). This step is cheap for all but type 3 blocks. 3. Compute the required derivatives using The trace computations in step 3 are very efficient, given the block structure of the S k , if we employ the following tricks. In general tr(AB) = k j A k j B jk . Now let A have nonzero rows only between k 1 and k 2 , while B has nonzero rows only between j 1 and j 2 .
Of course normally the initial zero rows would not actually be stored in which case we have

The Model Matrix: Efficient Storage and Computation
We are interested in computing with models in which it is impractical to store the whole model matrix, and in which computing the required matrix cross product may be prohibitively expensive. For this reason we discretize the model covariates so that the columns of the model matrix corresponding to a single smooth term can be stored in compact form. Specifically, suppose that the covariate for the jth term is discretized into m discrete values, then the model matrix columns for that term can be written as whereX j has only m rows and k is an index vector. StoringX j and k uses much less memory than storing X j directly. This idea is introduced in Lang et al. (2014) to obtain efficient storage and computation for large datasets. However, in that article they employ smooths of one covariate and only require terms of the form X T j WX j , but not X T j WX k . For smoothing parameter estimation we require these "off diagonal" product terms as well. In addition we require tensor product smooths of multiple covariates. Discretizing multiple covariates onto multidimensional grids requires either substantial storage or substantial approximation error, and in the tensor product context it makes sense to instead discretize each component marginal model matrix separately, constructing the full tensor product model matrix "on the fly. " Appendix A develops the identities and algorithms required to compute with X and its products when the submatrices of X corresponding to individual terms are stored compactly, and when tensor product terms are computed "on-the-fly" from compactly stored marginal model matrices. With the correct structuring each matrix inner product is a factor of p faster than it would be under direct computation, where p is the number of columns in the largest marginal model matrix involved in the product and for nontensor product smooths their only model matrix is their single model matrix. The crucial advance over Lang et al. (2014) is the ability to deal with tensor product smooths efficiently, and to compute the off diagonal crossproducts efficiently (between single smooths, tensor product smooths or a mixture of the two). Our method has the major advantage over alternative discretization approaches (e.g., Helwig and Ma 2016) of discretizing covariates independently (marginally), rather than discretizing jointly so that the unique combinations of discretized covariates are stored (or the basis functions evaluated at those unique combinations). The joint approach typically requires more storage, and/or coarser discretization than our fully marginal approach.
An obvious question is how fine a discretization is necessary? Suppose we discretize n observations of covariate x onto a regular grid of m values (just covering the x range). In the large m limit an upper bound on the resulting approximation error is 0.5m −1 max |g (x)| where g is the true function we are trying to recover. The sampling error on the estimate of g is at best O(n −1/2 ), implying that m = O(n 1/2 ) is more than adequate. For any finite sample analysis the approximation error bound can be evaluated to check the adequacy of m. Note however, that for the black smoke network data, many covariates are already discrete: for example, there are only a finite number of site locations, site labels and elevations, temperature is only recorded to within 0.1 • C, etc.

Black Smoke Model Development
Following the industrial revolution, problems associated with air pollution worsened in many countries. During the first half of the 20th century major pollution episodes occurred in London, notably in 1952 an episode of fog, in which levels of black smoke exceeded 4500 μg m −3 , was associated with 4000 excess deaths (Ministry of Health 1954). Other early episodes, which were caused by a combination of industrial pollution sources and adverse weather conditions, and resulted in large numbers of deaths among the surrounding populations, include those in the Meuse valley (Firket 1936) and the United States (Ciocco and Thompson 1961). Attempts to measure levels of air pollution in a regular and systematic way arose as a result of these episodes. In 1961 the world's first coordinated national air pollution monitoring network was established in the United Kingdom, to monitor black smoke and sulphur dioxide at around 1000 sites (Clifton 1964). Since then all European countries have established monitoring networks, some of them run at the national level, others by local authorities or municipalities, with the initial focus on black smoke (soot) and sulfur dioxide, initially largely from coal burning but shifting more recently to other pollutants. Monitoring has increased in the wake of national and international legislation and the issuing of air quality guidelines, but most monitoring networks share features of the U.K. BS network that challenge the interpretation of the data for epidemiological and policy purposes: (i) monitoring is expensive and so monitoring networks are typically sparse and change over time, (ii) concentrations may vary greatly over small distances, especially in urban areas and (iii) networks designed to monitor compliance with standards, may not give a good representation of levels over an area. Modeling offers the possibility to alleviate these problems, at least partially, and our approach to the U.K. black smoke data should be applicable to other monitoring networks.
In addition to the Black smoke data (Loader 2002), we obtained daily temperature and monthly rainfall data for the United Kingdom Hollis 2005b, 2005a) to use as covariates, alongside site elevation (of Terrain-50 2015). Given the volume of data, our initial exploratory model development concentrated first on modeling space without time, and then time without space. In this way we were able to develop candidate temporal decompositions (in terms of year, day of year, and day of week), and candidate models for covariates and space, which were then combined while allowing space and time effects to interact.
Our basic approach was first to decompose the black smoke signal into components dependent on different temporal scales: year (y) for the long term changes, day of year (doy) for the annual cycle and day of week (dow) for the working week related cycle. These are represented by f 1 − f 3 in model (1). These effects were all allowed to interact: for example, the weekly pattern could change with time of year, and over longer timescales. These interactions are f 4 − f 6 in model (1). We then allowed the effects of year, time of year and day of the week to vary spatially (terms f 8 − f 10 ), as well as allowing a "main effect" of space, f 7 . Elevation and rainfall effects f 11 and f 14 were also included alongside effects for site type and a site specific random effect. Residual analysis for a model including only these effects suggested strong temperature dependence, with an interaction of daily minimum and maximum temperatures ( f 12 ). Including this latter term still left a correlation with mean temperatures at lags of one and two days, resulting in f 13 .
Main effects of time were represented using cubic regression splines for y and cyclic cubic regression splines for doy and dow. Tensor product smooths (e.g., Wood 2006) were used for the interactions. In cases in which smooth main effects and interactions were present, then the interaction smooths were constructed to exclude the main effects, by the simple expedient of applying sum-to-zero constraints to the marginal bases of the tensor product smooth, prior to construction of the tensor product basis. Space time interaction terms follow Augustin et al. (2009), that is tensor product smoothers with isotropic smoothers used for the spatial marginal smooth and cubic splines for the temporal margin.
Due to the marked reduction in the size of the network in its last decade, and the uneven spatial coverage, some care is required in the specification of the two-dimensional spatial smoothers of n and e, to avoid extrapolation artifacts in later years. We chose to use Duchon splines (see Duchon 1977;Miller and Wood 2014), using first derivative penalties with Duchon's s parameter therefore set to 1/2. The use of first derivative penalties means that such smoothers are smoothing toward the constant function, which is a reasonable modeling assumption for black smoke data in sparsely observed regions. Duchon splines are the general class of splines introduced in Duchon (1977) of which the popular thin plate spline is a special case: see Miller and Wood (2014) for an accessible introduction. For comparison we also tried Gaussian process smoothers with a Matérn covariance function following Kammann and Wand (2003) and Handcock, Meier, and Nychka (1994), as well as thin plate splines, but in both cases basic model checking revealed artifacts in model predictions toward the end of the data. The online supplementary material includes an animation of predicted log black smoke, clearly illustrating such artifacts for the thin plate spline based model (the equivalent animation for the Duchon spline based model is also included).
Given our interest in using the model for prediction away from the stations, we aimed to keep the station specific random effects structure of the model as simple as possible, however it proved impossible to achieve an adequate fit without any random effects at all, and the model therefore includes a single random intercept term per station, reflecting the individual idiosyncrasies of station locations not captured by the available covariates.
Model adequacy was checked using standard residual plots, as well as autocorrelation function plots and semivariogram plots to check for unmodeled spatial and temporal correlation. Figures 2 and 3 show such plots for model 1, showing that the model does a reasonable job of capturing spatial and temporal correlation, in the data. Further plots are shown in the online supplementary material. To illustrate the importance of Figure . Aggregate ACF for model () residuals assuming independent residuals in gray, with the equivalent for the standardized residuals assuming AR residuals, overlaid in black. While not perfect, the AR model greatly reduces the unmodeled temporal autocorrelation.
the weather variables and site-specific random effects, models were fitted without these leading to AIC increases of 1.6 × 10 6 and 2.4 × 10 6 for models without weather variables and the random effect, respectively (the corresponding r 2 reductions were approximately 2% and 1%).
A concern with these data is that they show evidence of a type of preferential sampling (Shaddick and Zidek 2014): as the network was reduced over time, monitors in areas of low concentrations were more likely to be dropped than those in high pollution areas (note that this is different in nature to preferential sampling considered by Diggle, Menezes, and Su (2010), for example). If we had a perfect model without penalties (smoothing priors) then this preferential sampling might reduce efficiency but would not introduce bias. However, when using penalties there is a danger that the reduction of the network so reduces the coverage over some space-time regions that the model predictions for these regions are dominated by the influence of the penalty. If the network reduction is subject to preferential sampling, then it is possible that these space-time regions are systematically those in which pollution is actually lowest, and that the reliance on the penalty/prior then introduces systematic positive bias.
To investigate the potential for such effects, we fitted a reduced model (1) to the data from the year with the most complete spatial coverage, 1967, dropping all terms involving long-term effects of time. We also dropped the temperature and rainfall effects, to force the spatial effects to do as much of the explanatory work as possible. Using the actual network design (i.e., with stations added and dropped over time), we then simulated from a model in which the 1967 fitted model spatiotemporal pollution fields were repeated each year, but with a long-term decay matching the full dataset. Stationspecific random effects were added with standard deviations as estimated from our fit of (1) to the full dataset. Further details are given in the online supplementary material. So our simulated data comes from a "truth" that maintains a degree of spatiotemporal complexity driven by the most "spatially complete" year throughout the simulated dataset, and in which the sampling is given by the real network evolution and, therefore, preferentially drops stations from low pollution regions of the simulation. We then fitted the complete model (1) to the simulated data, and examined its ability to reconstruct the simulated "true" pollution field at each of the locations of stations present in 1967, throughout the whole modeling period (i.e., without any drop out). If our model is sensitive to the preferential sampling evident in the network evolution, then we should be able to detect a positive bias in the full model predictions, which would be likely to grow over time. In fact we can only detect a very small constant bias of about 0.006 on the log scale (corresponding to a 0.6% bias on the original scale). There is no evidence for a trend in the bias: the online supplementary material includes a plot illustrating this and a fuller discussion.

Results and Predictions
The model (1) has a conditional r 2 of 0.79 (i.e., treating the AR process as induced by a random field), and a marginal r 2 of 0.7 (i.e., ignoring the auto-regressive structure of the residuals). The online supplementary material includes an animation showing the evolution of the predicted spatial pollution field over time. Careful examination shows some artifacts in the fields, usually in coastal regions away from observation stations, but otherwise the results appear reasonable, predicting high pollution levels in the industrial centers especially in the first decade or so, generally showing cleaner air in wetter regions, and tending to show an annual cycle reflecting higher fossil fuel use in the winter.
In this section we illustrate the model results with two sets of plots examining how the chance of exceeding current daily recommended limits (213 μg m −3 ) has changed over time. Figure 4 shows the log of the number of days for which levels are predicted to exceed the daily limit, for a town center location, for several years in the 1960s. These figures are obtained by simply counting up predicted exceedance days by 5 km 2 .  An alternative is to compute the average posterior probability of the mean exceeding the recommended level using the predicted level and its standard deviation, based on (7). Figure 5 shows such a plot. Broadly both figures show the same pattern, with the situation improving rapidly in London in the wake of the U.K. Clean Air Acts, but taking much longer to improve in the cold northern industrial conurbations.

Discussion
Our development of scalable additive model fitting methods rests on three innovations: (i) the development of a fitting method which required only basic easily parallelized matrix computations and a pivoted Cholesky decompostion; (ii) the use of a scalable parallel block pivoted Cholesky algorithm; and (iii) an efficient approach to model matrix storage and computations with the model matrix, using discretized covariates. The approach allows much larger additive/latent Gaussian process models of much larger datasets than has hitherto been feasible, and is general enough for routine use (see R package mgcv). For the black smoke modeling, fitting is three orders of magnitude faster than we could have achieved otherwise.
The three method innovations are interlinked so that cleanly attributing elements of the speed up to each separately is not really possible. However, model fitting time increases from around 55 min to over 7.5 hr if we use a single core, instead of 12 (CPU turbo modes disabled to aid comparability). Using the new method, profiling reveals that the time spent on the matrix crossproduct is approximately equal to the time spent on the other method steps, for the black smoke model. From the operations counts in Appendix A the crossproduct is around a factor of 10 2 less floating point intensive using the new discrete methods relative to direct crossproduct formation, while the subleading order cost of basis function evaluation is up to 10 4 times less costly. Similarly the leading order costs of each smoothing parameter update can be compared. The Wood, Goude, and Shaw (2015) method requires approximately 40 times the floating point operations per smoothing parameter update, due to O(p 3 ) costs per smoothing parameter, coupled with a symmetric eigen decompositon and several QR steps. Hence, all three components of the new method are required to achieve the observed efficiency gains.
For discretization we chose to generalize the approach of Lang et al. (2014), rather than attempt to use the grid-based approach of Currie, Durban, and Eilers (2006). This is largely as a result of the very irregular nature of our "grids": for example, the approach here avoids having to compute anything that will then be given zero weight as a result of data being missing at a grid node. However, our smoothing parameter selection method should be directly applicable to models fit using the Currie, Durban, and Eilers (2006) approach (unlike, e.g., the approach of Wood (2011)).
The Black smoke model presented here is the first successful attempt to model these data on a daily basis over several decades, and offers a basis for estimating daily exposures for use in retrospective cohort studies, for example. While a major advance, we do not believe that this model is definitive. For example, the only meteorological variables available to us on a daily basis were temperature, and the fact that we are forced to use monthly rainfall data offers an obvious area for improvement. The model as it stands shows some artifacts in coastal areas that we are working to improve. Another obvious deficiency is the lack of any pollution source data. One might expect substantial improvements if fine scale data on coal and diesel use were available as predictors.
The method is implemented in the bam function of R package mgcv from version 1.8-9, and is invoked via bam arguments discrete and nthreads. The black smoke data are available from the first author's web page (http://www.maths. bris.ac.uk/∼sw15190/).

A. Methods for Discretized Covariates
This section describes the algorithms required to compute efficiently with marginally gridded covariates in detail. The idea is that we have a model matrix X = (X 0 : X 1 : · · · ). Each X j represents either a single smooth, or a tensor product smooth (e.g. Wood 2006). In the case of a single smooth X j (i, l) =X j (k j (i), l), whereX j is an m j × p j matrix evaluating the smooth at the corresponding gridded values. For a tensor product where M j k are marginal model matrices and Q j is a constraint matrix, usually imposing a sum to zero constraint over a representative subset of the data. denotes the Kronecker product (⊗) applied row-wise (i.e., one row at a time). In this case the marginal model matrices are stored in compact form: The following algorithms are most efficient if tensor product terms are always arranged so that the marginal model matrix with the most columns is last, but this can be achieved by automatic rearrangement.
Note that in principle covariates could be discretized jointly onto a multidimensional grid, so that we store the unique combinations of covariates, rather than storing the unique covariate values independently. With the joint scheme the cross product X T WX is easy to compute. IfX andW contain the unique model matrix rows and corresponding unique weights, respectively, whileN is the diagonal matrix containing the number of occurrences of each now ofX in X then X T WX =X TNWX . The problem is that the number of unique combinations of covariates, and hence number of rows ofX can be very large, unless very coarse discretisation is used. Hence the requirement for the methods of this appendix.
A variant of the scheme is required when the model contains terms of the form k f j (z ik )L ik = j { f j (vec (z) If z is n × m, then j is n × nm, and the index vectors must be of length nm, which is also the number of rows inX (the model matrix for f j (vec(z))). The regular case corresponds to j = I. Note that an L term can be treated as an extra single column tensor product marginal. A1, A2, A5 and A6, below, simply require j to be applied as the final step, while A3 and A4 require the extra work detailed.
The matrix products required in fitting require the following basic algorithms. A1 Extraction of a single column of a single term X j uses (9) at O(n) cost. A2 Extraction of a single column of a tensor product term X j .
Let p k denote the number of columns of M