Spatial statistical modelling of insurance risk: a spatial epidemiological approach to car insurance

ABSTRACT Spatial models, such as the Besag, York and Mollié (BYM) model, have long been used in epidemiology and disease mapping. A common research question in these fields is modelling the number of disease events per region; here the BYM model provides a holistic framework for both covariates and dependencies between regions. We use these tools to assess the relative insurance risk associated with the policyholder's geographical location. A Bayesian modelling approach is presented and an elastic net is used to reduce the large number of possible geographic covariates. The final inference is performed using the Integrated Nested Laplace Approximation. The model is applied to car insurance data from If P&C Insurance together with spatially referenced covariate data of high resolution, provided by Insightone. The entire analysis is performed using freely available R-packages. Including spatial dependence when modelling the number of claims significantly improves on the results obtained using ordinary generalised linear models. However, the support for adding a spatial component to the model for claims cost is weaker.


Introduction
An insurance contract is a policy that protects the policyholder against uncertain financial loss. Accurately predicting the risk associated with each policy allows the insurance company to differentiate its pricing between high and low risk policyholders.
The risk associated with a group of policies is described by the risk premium, R_p, defined as the total claims cost, C, divided by the total exposure, E, during a specified period of time, i.e. the total duration of active policies during that period (Ohlsson & Johansson 2010):

R_p = C / E.

The risk premium should cover the insurer's expected cost for the policies. Although direct modelling of the risk premium is possible, it is common to split the modelling into two parts, modelling severity, S, and claims frequency, F, separately (Ohlsson & Johansson 2010, p. 34):

R_p = (Y / E) · (C / Y) = F · S,

where Y is the number of claims. The main reason for this separation is that claim frequency is usually much more stable than claim severity, allowing the frequency to be estimated with greater accuracy. Inference is performed using the Integrated Nested Laplace Approximation (INLA; Rue et al. 2009, Lindgren & Rue 2015). INLA is able to fit models with spatial dependence structures to our 13,831 regions in minutes, and should be able to handle models with 10^5-10^6 geographic regions using a modern desktop computer (Rue et al. 2009). This allows us to handle datasets that are substantially larger and more complex than those previously studied.
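As a concrete illustration, the decomposition above can be checked numerically. The figures below are invented for the example and do not come from the paper's data:

```python
# Hedged numerical sketch of the risk-premium decomposition R_p = F * S.
# All numbers are illustrative placeholders.
claims_cost = 150000.0   # total claims cost C
n_claims = 60            # total number of claims Y
exposure = 1200.0        # total exposure E (policy-years)

frequency = n_claims / exposure         # F = Y / E
severity = claims_cost / n_claims       # S = C / Y
risk_premium = claims_cost / exposure   # R_p = C / E

# The decomposition holds exactly: R_p = F * S.
```

Because the decomposition is an identity, it holds for any aggregation period; the practical gain is that F and S can be modelled with different distributions.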

Data
The data used here was obtained from a commercial provider (Insightone). The data illustrates both the advantages and limitations of commercially available high resolution geographic data. The geographic data consists of area centroid coordinates and a neighbourhood structure; due to confidentiality Insightone does not provide exact area borders. The neighbourhood specification accounts for 'natural' obstacles, such as water, and conforms with the graph representation of neighbourhoods commonly used in CAR models (e.g. Ch. 6.1.2 in Blangiardo & Cameletti 2015). The division is depicted in Figure 1.
In the division each of the areas is indexed, and associated with each area is a generous amount of underlying data, such as gender distribution, age distribution and average income, giving roughly 140 possible geographic covariates. In addition to the external data, claims and insurance data for vehicle hull damage policies from the years 2011-2015 were provided. The choice of policies to study was motivated by the large amount of data and the relatively well-behaved distribution of claims cost arising from this type of policy. Since the area borders are unknown, each policy was associated with the closest area centroid using the address of that policyholder, s_i; this corresponds to grouping the policies using a Voronoi tessellation based on the area centroids (Okabe et al. 2000). The aggregate number of claims, Y_i, and claims cost, C_i, in each area were then computed.
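The nearest-centroid grouping can be sketched in a few lines. In the Python snippet below (all coordinates, claim counts and costs are invented for the example) each policy is assigned to the area with the closest centroid, which is equivalent to grouping by the Voronoi cell the address falls in:

```python
from math import dist

# Illustrative area centroids (area index -> coordinates).
centroids = {1: (0.0, 0.0), 2: (5.0, 0.0), 3: (0.0, 5.0)}
# Illustrative policies: (address s_i, number of claims, claims cost).
policies = [((0.4, 0.2), 1, 1200.0),
            ((4.8, 0.3), 0, 0.0),
            ((0.2, 4.9), 2, 3100.0)]

def nearest_area(point):
    """Assign a point to the area with the closest centroid; this implicitly
    uses the Voronoi tessellation generated by the centroids."""
    return min(centroids, key=lambda a: dist(point, centroids[a]))

# Aggregate claim counts Y_i and claims cost C_i per area.
Y = {a: 0 for a in centroids}
C = {a: 0.0 for a in centroids}
for s, y, c in policies:
    a = nearest_area(s)
    Y[a] += y
    C[a] += c
```

In the actual application the assignment would use the provider's centroid coordinates; the logic is otherwise the same.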

Model
In line with ordinary pricing models in insurance we assume that the geographic rating factors, γ_{G,i}, in each area can be modelled using a multiplicative model, see Equations (3) and (4). The geographic rating factors are then modelled using demographic and socio-economic variables (see Section 2) as well as the neighbouring geographic rating factors. Using the aggregate number of claims and claims cost, this is in principle the same approach as in the ordinary tariff model, applied at the area level. Essentially, this translates to assuming a GLM on the relative geographic rating factors γ_{G,i}, letting the areas act as policyholders.
The first step of the modelling is to determine a suitable GLM model for the aggregated claim frequency and severity. This model will then be expanded to GLMM, Besag, and BYM models; allowing for spatial dependence through correlated random effects.

Aggregated claim frequency and severity
A standard model for claims frequency is to assume that the number of claims for a single policy follows a Poisson distribution (Ohlsson & Johansson 2010, p. 18). By assuming that policies are independent, the number of claims on the aggregate level (i.e. areas) will also follow a Poisson distribution (Isham 2010). Thus, as in disease mapping (Waller & Carlin 2010), the number of events (claims), Y_i, in each area, i, is modelled using a Poisson distribution:

Y_i ~ Po(E_i exp(η_i^F)).

Here η_i^F is the linear predictor of the GLM for claims frequency and E_i is the exposure, i.e. the total duration of policies, in area i. The areas will likely be inhomogeneous with respect to demography and socio-economic status. While this is preferable for geographical analysis of the data, some of the variation in risk between areas can be explained by the ordinary tariff. This can be taken into account in the measure of exposure. Thus, the raw exposure, E_i, is replaced by a weighted exposure, E_i^*, when modelling claim counts in area i.
The weighting is computed according to the composition of policyholders in each area and their rating factors. This method of obtaining a weighted exposure, E_i^*, is analogous to the weighted population (a.k.a. population at risk) used in disease mapping, where the distribution of e.g. age and gender in each area is accounted for in E_i^*, as discussed by Blangiardo & Cameletti (2015, p. 179) and Papoila et al. (2014, p. 1).
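One plausible way to implement such a weighting, assuming each policy carries a duration and a relative rating factor from the ordinary tariff, is sketched below. The paper does not spell out the exact formula, so this is an illustrative interpretation with invented numbers:

```python
# Hedged sketch: a weighted exposure E*_i per area, assuming each policy
# contributes its duration scaled by its relative tariff rating factor,
# so areas dominated by high-risk tariff cells get proportionally more
# expected claims. All values are illustrative.
policies_by_area = {
    1: [(1.0, 1.20), (0.5, 0.90)],   # (duration in years, rating factor)
    2: [(1.0, 1.00)],
}

def weighted_exposure(policies):
    """E*_i = sum_j duration_j * rating_factor_j for policies j in area i."""
    return sum(d * g for d, g in policies)

E_star = {area: weighted_exposure(p) for area, p in policies_by_area.items()}
```

With this construction an area whose policyholders are all average tariff risks (rating factor 1) keeps its raw exposure unchanged.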
Although the number of claims is taken as the response variable, the expected frequency, F_i, can always be recovered as F_i = E[Y_i]/E_i^* = exp(η_i^F), using Equations (4) and (2). As in Gschlößl & Czado (2007) the aggregate claims cost, C_i, in each area is modelled conditional on the number of claims, Y_i. For claims cost the gamma distribution has become more or less the standard option in modelling (Ohlsson & Johansson 2010, p. 20). For the aggregate claims cost the use of a gamma distribution can be motivated by summation over the given number of claims. We have

C_i | Y_i ~ Γ(Y_i φ, φ exp(−η_i^C)),

with E[C_i | Y_i] = Y_i exp(η_i^C), where φ is a shape parameter and η_i^C is the linear predictor for claims severity. If one is interested in the severity rather than the total claims cost, the scaling properties of the gamma distribution yield

S_i = C_i / Y_i ~ Γ(Y_i φ, Y_i φ exp(−η_i^C)),

so that E[S_i | Y_i] = exp(η_i^C).

Figure 2: Map illustrating the duality between the neighbourhood graph and the spatial areas. All areas are represented by centroid coordinates, and vertices connected by an edge indicate neighbours, i.e. areas that share a common border. For the data used in this paper only the neighbourhood structure (graph) is known, since the area borders are confidential.
The GLMs for frequency and severity are now completed by assuming a linear model for η (suppressing the F,C-superscripts),

η_i = β_0 + β z_i,    (8)

where z_i is a (suitable) collection of underlying covariates in each area. Adding an individual-level random effect to Equation (8) gives a GLMM,

η_i = β_0 + β z_i + v_i,    (9)

where the added i.i.d. Gaussian noise, v_i, accounts for moderate over-dispersion in the observations (Harrison 2014).

Modelling the spatial dependence
To allow for spatial dependence we instead add a structured random effect to the linear predictor in Equation (8), resulting in a Besag model,

η_i = β_0 + β z_i + u_i,    (10)

where u_i is modelled using a CAR(1)-field (Besag & Kooperberg 1995). Given a neighbourhood graph (see Figure 2 for an example) the CAR(1) model assumes that the value in one area is related to the values in neighbouring areas (i.e. areas that share a common border) through a conditional Gaussian distribution:

u_i | u_{-i} ~ N( (1/|N_i|) Σ_{j ∈ N_i} u_j , 1/(ψ_1 |N_i|) ),

where N_i is the set of neighbours of area i and ψ_1 is a precision parameter. The CAR(1) model corresponds to a multivariate Gaussian model for u = (u_1, ..., u_n) with precision matrix Q = ψ_1 (D − W), where W is the adjacency matrix of the neighbourhood graph and D is the diagonal matrix of neighbour counts. The resulting precision matrix (inverse covariance matrix) is sparse, giving a Gaussian Markov Random Field (GMRF) and allowing us to use INLA for fast approximate Bayesian inference (Rue et al. 2009, Lindgren & Rue 2015). Combining the random effects in Equations (9) and (10) gives the BYM model (Besag et al. 1991),

η_i = β_0 + β z_i + u_i + v_i,

where u_i is a CAR(1)-field and v_i is i.i.d. Gaussian noise. As in the GLMM, the i.i.d. noise accounts for moderate over-dispersion in the observations, which could also be modelled using a Besag model with negative binomial observations.
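The structure of the CAR precision matrix can be illustrated on a toy graph. The sketch below builds Q = ψ_1(D − W) for an invented four-area neighbourhood graph; in a realistic application W would come from the provider's neighbourhood specification:

```python
import numpy as np

# Hedged sketch: the (intrinsic) CAR precision matrix Q = psi1 * (D - W)
# for a toy 4-area neighbourhood graph. Edges and psi1 are illustrative.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]   # pairs of neighbouring areas
n, psi1 = 4, 2.0

W = np.zeros((n, n))                       # adjacency matrix
for i, j in edges:
    W[i, j] = W[j, i] = 1.0
D = np.diag(W.sum(axis=1))                 # diagonal of neighbour counts

Q = psi1 * (D - W)   # sparse in realistic graphs: Q[i, j] != 0 only for neighbours

# Each row of Q sums to zero, reflecting that the intrinsic CAR field is
# improper: the overall level is not identified and is absorbed by the
# intercept (or a sum-to-zero constraint, as INLA imposes).
row_sums = Q.sum(axis=1)
```

The sparsity of Q is exactly what makes the GMRF computations in INLA fast: each likelihood evaluation works with the graph, not with a dense covariance matrix.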

Model and inference
The models for both claim frequency and severity can be summarised as a GLMM, using either a Poisson or gamma distribution for the observations and with the spatial dependence among areas captured using a BYM model. For the frequency, negative binomial observations are considered as an alternative to handle over-dispersion. Since the i.i.d. Gaussian noise in the BYM model accounts for moderate over-dispersion (Harrison 2014), negative binomial observations are only feasible in combination with a Besag model for spatial dependence (combining negative binomial observations with a BYM model leads to identifiability issues). If π(y_i|η_i) denotes the observation likelihood (Poisson, negative binomial, or gamma) for the ith observation given the linear predictor η_i, then the complete Bayesian hierarchical model is

y_i | η_i ~ π(y_i | η_i),    (14a)
η_i = β_0 + β z_i + u_i + v_i,
u | ψ_1 ~ CAR(1),    v_i | ψ_2 ~ N(0, 1/ψ_2) i.i.d.

The spatial effect (if any) is modelled through u, while the unstructured effect, v, models additional random variation unexplained by geographical location. The larger the effect of v is, compared to u, the less exchange of information between areas is allowed. The precisions (inverse variances) of u and v, i.e. the relative strengths of the structured and unstructured effects, are controlled by the two hyper-parameters ψ_1 and ψ_2, respectively. Two common choices of priors for ψ are either independent gamma distributions (Ch. 6 in Gschlößl & Czado 2007, Waller & Carlin 2010, Blangiardo & Cameletti 2015) or the recently introduced penalised complexity (PC) priors (Simpson et al. 2017). Here we use both the default uninformative priors in INLA, ψ_i ~ Γ(1, 5 · 10^{-4}) (Lindgren & Rue 2015), and the default INLA PC-priors. A final step before fitting the model is to determine which of the roughly 140 possible geographic covariates to include in z_i. A suitable set of candidate covariates was first selected using the elastic net for the regular GLM, Equation (8), as described in Section 3.4.
Then the model in Equation (14a) was fitted in R, using the INLA package (http://www.r-inla.org, Lindgren & Rue 2015), and the number of covariates was further reduced using the posteriors, p(β|y), and deviance information criterion (DIC) values provided by INLA; see Section 4.

Selecting covariates
Including all the possible covariates in the model is likely to result in an over-fitted model that is hard to interpret. Additionally, many of the covariates are highly correlated, with the potential of causing numerical and identifiability issues when fitting the model. A first step is therefore to extract a suitable subset of candidate covariates for use in the full model, Equation (14a). The idea, inspired by the two-step procedure used by Mercer et al. (2011), is to initially consider frequency and severity as pure GLMs, i.e. excluding all spatial and unstructured effects, and choose as few parameters as possible while retaining as much predictive quality as possible. Completing this in an algorithmic or at least structured fashion is desirable.
An effective algorithm for this purpose is the elastic net (Zou & Hastie 2005). A ready-to-use implementation of the algorithm is available in the R-package glmnet (Friedman et al. 2010). Consider a GLM, and let l(Y_i, β_0 + β z_i) denote the negative log-likelihood for observation i. The elastic net then solves the following convex optimisation problem over a grid of values of λ:

min_{β_0, β} (1/N) Σ_{i=1}^{N} l(Y_i, β_0 + β z_i) + λ [ (1 − α) ||β||_2^2 / 2 + α ||β||_1 ],    (15)

where ||·||_p denotes the L_p-norm. The elastic net extends the standard LASSO (Hastie et al. 2001, Ch. 3.4.3) by allowing a combination of L_1 and L_2 penalties in Equation (15), which lets the elastic net better handle correlated covariates. It is known that α = 0 shrinks the coefficients of correlated covariates towards each other, while α = 1 tends to pick one of them and discard the others (Friedman et al. 2010). The idea of the elastic net is to mix these properties, i.e. choosing 0 < α < 1. If predictors are correlated in groups, as is highly likely for our demographic and socio-economic variables, an α around 0.5 tends to select the groups in or out together. The tuning parameter λ controls shrinkage and thus the number of selected covariates. To obtain an optimal choice of λ, glmnet applies K-fold cross-validation.
Unfortunately the glmnet package does not support gamma distributed observations, preventing its direct use for the claims cost model. Our approach was to model claims cost as log-normally distributed for the initial variable selection, since glmnet supports normal observations. The idea is that the algorithm should still (roughly) give the same suggestions for the best covariates to include, even if another observational distribution is assumed. Note that the models which include spatial random effects (Besag and BYM) are expected to need fewer covariates than the GLM model; thus the elastic net is only used as a first step in the variable selection. Although elastic net penalties can be implemented for gamma distributed observations, the log-normal approach allows us to use standard software, increasing the reproducibility of the method and results.
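To make the penalty in Equation (15) concrete, the sketch below implements a minimal coordinate-descent elastic net for Gaussian observations (the same case as the log-normal work-around). It is an illustration of the optimisation problem, not a re-implementation of glmnet, and the toy data are invented:

```python
import numpy as np

# Hedged sketch: coordinate descent for the Gaussian elastic net,
#   min_b (1/2N)||y - Zb||^2 + lam * (alpha*||b||_1 + (1-alpha)*||b||_2^2 / 2).

def soft_threshold(x, t):
    """The L1 part of the penalty acts through soft-thresholding."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def elastic_net(Z, y, lam, alpha, n_iter=100):
    N, p = Z.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - Z @ b + Z[:, j] * b[j]          # partial residual
            rho = Z[:, j] @ r / N
            b[j] = soft_threshold(rho, lam * alpha) / (
                Z[:, j] @ Z[:, j] / N + lam * (1 - alpha))
    return b

# Orthogonal toy design: only the first covariate matters.
Z = np.array([[1., 1.], [-1., 1.], [1., -1.], [-1., -1.]])
y = 2.0 * Z[:, 0]
b = elastic_net(Z, y, lam=0.1, alpha=0.5)
# The L1 part sets the irrelevant coefficient exactly to zero, while the
# L2 part shrinks the relevant one slightly below its least-squares value of 2.
```

This shows, in miniature, why the elastic net is useful for covariate screening: irrelevant covariates are removed exactly rather than merely shrunk.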

Results
For both claims frequency and claims severity, models using four different linear predictors (see Table 1) are fitted to the available data. All of the models share the general form specified in Equation (14a), but differ in the combination of unstructured, v_i, and spatially structured, u_i, random effects included in the linear predictor, η. For the claims frequency, models using both Poisson and negative binomial observations as well as two types of priors are also considered (Table 1).
The available data consist of 13,831 areas; to allow for model evaluation, the areas are randomly split into a modelling set (90% of the areas) and a validation set (10% of the areas). The modelling set is used for covariate selection, parameter estimation, and prediction of the validation set.
Besag-Nbin: As Besag but using negative binomial observations; only used for claims frequency.
BYM-PC: As BYM but using penalised complexity (PC) priors instead of uninformative Γ-priors; only used for claims frequency.

Using elastic net to select covariates
The first step in the modelling is to apply the elastic net to the regular GLM model, Equation (8), with the aim of extracting a (small) set of candidate geographic covariates from the roughly 140 potential covariates. The cross-validation results from applying the elastic net algorithm with α = 0.5 and different penalties, λ, are displayed in Figure 3. The recommended approach (Hastie et al. 2001, Ch. 7.10) is to find the penalty, λ_min, that minimises the cross-validation error and then pick the largest λ such that the resulting error is within one standard error of the minimal error; i.e. pick the smallest model such that the error is not 'statistically different' from the minimal error.
For the frequency data this approach works well and results in the selection of 16 candidate covariates. For the severity data the large uncertainty in the cross-validation error (Figure 3(b)) causes the one standard error approach to select no covariates. The reasonable choice here is to simply accept the 31 candidate covariates obtained for λ min .
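The one-standard-error rule itself is straightforward to implement. The sketch below uses invented cross-validation errors; in practice these come from cv.glmnet:

```python
import numpy as np

# Hedged sketch of the one-standard-error rule: given cross-validation
# errors and their standard errors over a decreasing grid of penalties,
# pick the largest lambda (i.e. the smallest model) whose CV error is
# within one SE of the minimum. All numbers are illustrative.
lambdas = np.array([1.0, 0.5, 0.1, 0.05, 0.01])   # decreasing penalty
cv_err  = np.array([0.90, 0.70, 0.52, 0.50, 0.51])
cv_se   = np.array([0.05, 0.04, 0.03, 0.03, 0.04])

i_min = int(np.argmin(cv_err))
threshold = cv_err[i_min] + cv_se[i_min]
# Indices whose error is under the threshold; the smallest index has the
# largest (strongest) penalty, hence the smallest model.
candidates = [i for i in range(len(lambdas)) if cv_err[i] <= threshold]
i_1se = min(candidates)
lam_min, lam_1se = lambdas[i_min], lambdas[i_1se]
```

When the standard errors are very large, the threshold can exceed the error of the empty model, which is exactly the failure mode seen for the severity data.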
The number of covariates might be further reduced when fitting the model without penalisation term, since it is possible that not all covariates are statistically significant.

Parameter significance and model reduction
To assess which covariates are significant, the GLM model, Equation (8), is fitted using the INLA package. Approximate p-values for each β-coefficient are computed from the estimated posterior distributions and used to determine whether all candidate covariates are significant at the 95% level. If some covariates are insignificant, the covariate with the largest p-value is excluded and the model is re-fitted without it; this procedure is repeated until all remaining covariates are significant. The procedure is applied to the frequency model and the severity model separately, resulting in nine and eight covariates, respectively. Due to confidentiality it is not possible to fully disclose which variables are considered.

Note (Table 2): Lowest DIC, MSE and smallest number of covariates are marked in bold. All models except one (Besag-Nbin; negative binomial) were run using Poisson observations. For the BYM models we considered both uninformative Γ-priors and PC priors (BYM-PC). The four models with a reduced number of covariates all had the same six covariates.
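The elimination loop can be summarised as follows. The covariate names and p-values below are invented placeholders (the real covariates are confidential), and `fit` is a stub standing in for re-fitting the model with INLA:

```python
# Hedged sketch of backward elimination by p-value: repeatedly drop the
# covariate with the largest p-value until everything left is significant
# at the chosen level.

def backward_eliminate(covariates, fit, alpha=0.05):
    current = list(covariates)
    while current:
        pvals = fit(current)                 # {covariate: p-value}
        worst = max(current, key=lambda c: pvals[c])
        if pvals[worst] <= alpha:
            break                            # all remaining are significant
        current.remove(worst)                # drop worst and re-fit
    return current

# Stub "model fit": p-values are fixed and purely illustrative; in reality
# they would change with each re-fit.
fixed_p = {"income": 0.001, "age": 0.02, "density": 0.30, "gender": 0.08}
kept = backward_eliminate(fixed_p, lambda cs: {c: fixed_p[c] for c in cs})
```

Only one covariate is removed per iteration because, in a real fit, removing a covariate changes the p-values of all the others.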

Estimating frequency and severity models
Recapitulating, the number of claims, Y_i, in each area, i, is modelled using a Poisson, Equation (4), or negative binomial distribution; and the aggregate claims cost, C_i, given the number of claims, is modelled using a gamma distribution, Equation (6). In both cases the linear predictor, η_i, in the complete Bayesian model, Equation (14a), is specified according to Table 1. The covariates found in Section 4.1.1 are used in the linear predictor.
For both outcomes and all four linear predictors the models are fitted using INLA; thereafter the DIC value and the mean squared error (MSE) when predicting the validation set are computed. For models that include random effects (u_i, v_i, or both) the significance of the β-coefficients is checked, insignificant covariates are removed as outlined in Section 4.1.1, and the model is re-fitted. Results for the number of claims are given in Table 2 and for the claims cost in Table 3. In both tables the number of significant covariates is indicated in the dim(β) column.

Evaluation of the frequency models
For the frequency models, including the unstructured effect (GLMM and BYM) results in the lowest DIC values. However, for the out-of-sample validation (MSE), adding a spatial effect and accounting for over-dispersion (BYM, and Besag with negative binomial observations) gives the best results, supporting the existence of a spatial effect. For these two alternatives the MSE values are almost identical, and the BYM model is preferred due to its lower DIC value. The results are robust to the choice of prior, with virtually no difference in DIC or MSE values between the Γ-priors and the PC priors. Furthermore, it is optimal to reduce model complexity by removing some of the covariates in both the Besag and BYM models; in all four cases (Besag, Besag-Nbin, BYM, and BYM-PC) the same six covariates were significant. In all cases this leads to a (slightly) worse MSE, but a marginally better DIC. Overall, models with a spatial effect and over-dispersion produce the lowest values of DIC and MSE; given the similar DIC and MSE values when using nine and six covariates, the simpler model, i.e. BYM with six covariates and Γ-priors, will be considered the best fit for the data at hand. If the estimates are plotted according to their geographical position the spatial trend is clearly revealed (Figure 4). The spatial field, exp(u), is smooth in general, but there are some seemingly sharp edges. These are explained by 'natural' obstacles such as water: areas separated by such an obstacle are not considered to be neighbours in the graph specification discussed in Section 3.2.

Evaluation of the severity models
In contrast to the frequency model, the severity model improves only slightly when adding random effects (i.i.d., spatial, or both). In terms of DIC the BYM model is still best, but the gain is much smaller than for the frequency model. Considering the MSE for out-of-sample predictions, the GLM model now provides the best performance, making the model choice less obvious for the severity data. Since the GLM model provides near-optimal results with a simpler structure it will be preferred. Confirming the model choice, the estimated expected values of exp(u_i) in the BYM model were all in the interval [0.996, 1.002], indicating that the spatial effect adds very little to the model. To evaluate any residual spatial effect the expected values of e^{η_i} are obtained and prediction errors for the validation set are computed. The prediction errors seem to lack any spatial structure (Figure 5(b)), again demonstrating that the GLM model is sufficient for modelling the severity data.
Plotting the estimates of E[exp(z i β)] from the GLM model for severity, Figure 5(a), reveals a geographical pattern similar to the covariate effect for claims frequency, Figure 4(a). This indicates that the two models are picking up the same underlying phenomena.

Performance of the geographical pricing model
Another type of model validation and performance testing is to fit the model on data from all areas, but leaving out a validation set of policies before aggregating the data to area level. A randomly selected validation set consisting of 15% of the available policies is set aside, and the BYM model for frequency and the GLM model for severity are fitted to the remaining data. In Table 4 the estimated total frequency and severity are displayed, together with the observed values for the left-out policies. The resulting frequency estimate is very accurate, while the severity estimate is slightly high. These results are consistent with the difficulty of fitting anything more complex than a GLM to the severity data. For further evaluation of the models see Tufvesson (2017).

Discussion and conclusions
Statistical models for the number of claims, Y, and claims cost, C, that allow for spatial dependence through spatial autoregression were introduced. The models were presented in a fully Bayesian framework and related to their previous use in epidemiology and disease mapping (Waller & Carlin 2010, Blangiardo & Cameletti 2015). The entire analysis was performed using commonly available statistical software: the R-packages glmnet and INLA. The models were demonstrated on vehicle hull damage policies using a high resolution spatial subdivision of Stockholm into 13,831 areas. Tractable inference at this high spatial resolution was obtained using INLA (Rue et al. 2009) as an alternative to MCMC. To handle the many potential covariates an elastic net (Zou & Hastie 2005) was used to extract a small set of candidate covariates, and further covariate selection was performed using the marginal posteriors provided by INLA. Final model selection was performed using DIC and out-of-sample prediction errors.
The frequency model can be improved by adding both an unstructured, v, and a spatial effect, u, to the ordinary GLM model, yielding a BYM model (Besag et al. 1991). Since the hyper-parameters of the BYM model are estimated from data, the magnitude of the spatial smoothing is completely data driven, eliminating any need for ad hoc choices of smoothing parameters. The similarity between a BYM model with Poisson observations and a Besag model with negative binomial observations concurs with results from Harrison (2014), who noted that adding observation-level random effects to a Poisson model 'appeared to robustly estimate the correct parameters at all but the highest levels of over-dispersion'. Furthermore, the frequency model was robust to the choice of priors, likely due to the very large dataset. However, the PC priors have shown impressive behaviour for smaller datasets (Simpson et al. 2017) and are often easier to interpret (Sørbye et al. 2018).
For the severity model no significant improvement was identified when increasing model complexity. Thus the already popular GLM model was suggested for severity. The lack of a spatial effect for the claims cost was not completely unexpected: strong predictors, such as car brand and average income, are already accounted for in the weighted exposure, E_i^*. It might be that a spatial model for severity is more suitable for other policies, such as theft or third party liability. The predictive performance was encouraging for the frequency model, but unfortunately not as accurate for the severity model.
We acknowledge that the use of confidential covariates is problematic. However, the subset of covariates used in any given model will depend on data availability, legal restrictions, geographic regions, and modelled insurance policy. Thus, the main aim of this paper is to provide a statistical framework for how very large, spatially resolved datasets can be used to model claims frequency and severity.

Modelling alternatives
While the suggested modelling approach provides good results for the data studied, we would like to point out a few alternatives that might be relevant for other data. Firstly, the data used in this study allowed for the use of standard Poisson or negative binomial and gamma models for claims frequency and severity. Should the need arise, the INLA package provides several alternative models, including zero-inflated models for the counts; and log-normal, skew-normal and generalised extreme value (GEV) distributions, or quantile regression (Opitz et al. 2018), for severity.
INLA provides some limited ability to model severity as spatial extremes by using a latent Gaussian field for one of the GEV parameters; more general versions of this modelling idea (i.e. spatial dependence in several parameters) can be fitted using MCMC (e.g. Cooley et al. 2007). A review of spatial extreme modelling (Davison et al. 2012) notes that latent variable models, as those described above, provide good fits for marginal distributions but copula models or max-stable processes (Koch 2017) are better suited to model joint distributions. A good starting point for spatial extremes might be the models provided in the SpatialExtremes R-package.
The two-step procedure used for variable selection ignores the spatial correlation when making the initial variable selection. It should, however, be noted that this initial selection only serves as a first filtering of suitable covariates, and using a smaller λ in the elastic net allows for additional covariates to be considered in the spatial modelling step. An alternative, fully Bayesian approach would be to fit the model using a Metropolis-adjusted Langevin MCMC (Roberts & Stramer 2002) with horseshoe priors (Carvalho et al. 2009, Makalic & Schmidt 2016) for the variable selection (see e.g. Pirzamanbein et al. 2018 for an example on compositional data). However, we feel that the potential gain of this model is unlikely to outweigh the added computational time and implementation complexity compared to our suggested approach using widely available software.

Extending the model to a larger area
Clearly, the division used here is on a very fine scale, with Stockholm divided into 13,831 areas. This is, to the best of our knowledge, a spatial model for insurance risk on a much more detailed level than previously published. The fine spatial scale makes the MCMC inference proposed in Gschlößl & Czado (2007) very costly and probably infeasible. For both MCMC and INLA the computational cost associated with each likelihood evaluation grows as O(n^{3/2}) with the size, n, of the spatial subdivision (Rue & Held 2005, Ch. 2.3). The numerical optimisation in INLA is likely to benefit from larger datasets. However, the mixing of random walk MCMC algorithms (and thus the effective number of samples) scales as O(n^{-1/2}) (Roberts et al. 1997, Rosenthal 2011), implying that not only will the computational cost of each MCMC step increase, but longer chains will be needed to obtain the same numerical accuracy.
Using INLA for approximate Bayesian inference makes the problem computationally tractable and should allow models with 10^5-10^6 areas to be solved on modern desktop computers (Rue et al. 2009). Since the granularity of the division decreases outside of Stockholm, an ability to handle 10^5-10^6 areas should allow for a substantial geographical extension. Alternatively, the computational burden could be reduced by fitting the model on each metropolitan region separately.

Other types of insurance
The case study here considered car insurance data. However, the outlined approach is applicable to other products in non-life insurance. The introduction of a pure spatial dependence might be an alternative solution in situations where it is hard to find suitable explanatory variables.
In conclusion, accounting for spatial dependence when modelling insurance risk yields both better model fit and geographically resolved risk estimates, allowing for much better price differentiation between regions.