Isotonic recalibration under a low signal-to-noise ratio

ABSTRACT Insurance pricing systems should fulfill the auto-calibration property to ensure that there is no systematic cross-financing between different price cohorts. Often, regression models are not auto-calibrated. We propose to apply isotonic recalibration to a given regression model to restore auto-calibration. Our main result proves that under a low signal-to-noise ratio, this isotonic recalibration step leads to an explainable pricing system because the resulting isotonically recalibrated regression function has a low complexity.


1 Introduction
There are two seemingly unrelated problems in insurance pricing that we are going to tackle in this paper. First, an insurance pricing system should not have any systematic cross-financing between different price cohorts. Systematic cross-financing implicitly means that some parts of the portfolio are under-priced, and this is compensated by other parts of the portfolio that are over-priced. We can prevent systematic cross-financing between price cohorts by ensuring that the pricing system is auto-calibrated. We propose to apply isotonic recalibration, which turns any regression function into an auto-calibrated pricing system. The second problem that we tackle is the explainability of complex algorithmic models for insurance pricing. In a first step, one may use any complex regression model to design an insurance pricing system such as, e.g., a deep neural network. Such complex regression models typically lack explainability and rather act as black boxes. For this reason, there are several tools deployed to explain such complex solutions; we mention, for instance, SHAP by Lundberg-Lee [22]. Since algorithmic solutions do not generally fulfill the aforementioned auto-calibration property, we propose to apply isotonic recalibration to the algorithmic solution. If the signal-to-noise ratio is low in the data, then the isotonic recalibration step leads to a coarse partition of the covariate space and, as a consequence, to an explainable version of the algorithmic model used in the first place. Thus, explainability is a nice side result of applying isotonic recalibration in low signal-to-noise ratio problems, which is typically the case in insurance pricing settings. There are other methods for obtaining auto-calibration through a recalibration step; we mention Lindholm et al. [21] and Denuit et al.
[8]. These other methods often require tuning of hyperparameters, e.g., using cross-validation. Isotonic recalibration does not involve any hyperparameters as it solves a constrained regression problem (ensuring monotonicity). As such, isotonic recalibration is universal because it also does not depend on the specific choice of the loss function within the family of Bregman losses.
We formalize our proposal. Throughout, we assume that all considered random variables have finite means. Consider a response variable Y that is equipped with covariate information X ∈ X ⊆ R^q. The general goal is to determine the (true) regression function x ↦ E[Y | X = x] that describes the conditional mean of Y, given X. Typically, this true regression function is unknown, and it needs to be determined from i.i.d. data (y_i, x_i)_{i=1}^n, that is, from a sample of (Y, X). For this purpose, we try to select a regression function x ↦ µ(x) from a (pre-chosen) function class on X that approximates the conditional mean E[Y | X = ·] as well as possible. Often, it is not possible to capture all features of the regression function from data. In financial applications, a minimal important requirement for a well-selected regression function µ(·) is that it fulfills the auto-calibration property.
Definition 1.1 The regression function µ is auto-calibrated for (Y, X) if, P-a.s.,

µ(X) = E[Y | µ(X)].
Auto-calibration is an important property in actuarial and financial applications because it implies that, on average, the (price) cohorts µ(X) are self-financing for the corresponding claims Y, i.e., there is no systematic cross-financing within the portfolio, if the structure of this portfolio is described by the covariates X ∼ P and the price cohorts µ(X), respectively. In a Bernoulli context, an early version of auto-calibration (called well-calibrated) has been introduced by Schervish [28] to the community in statistics, and recently, it has been considered in detail by Gneiting-Resin [12]. In an actuarial and financial context, the importance of auto-calibration has been emphasized in Krüger-Ziegel [17], Denuit et al. [8], Wüthrich [30] and Lindholm et al. [21]. Many regression models do not satisfy the auto-calibration property. However, there is a simple and powerful method, which we call isotonic recalibration, to obtain an (in-sample) auto-calibrated regression function starting from any candidate function π : X → R.
We apply isotonic recalibration to the pseudo-sample (y_i, π(x_i))_{i=1}^n to obtain an isotonic regression function µ. Then,

µ(X′) = E[Y′ | µ(X′)],   (1.1)

where (Y′, X′) is distributed according to the empirical distribution P_n of (y_i, x_i)_{i=1}^n; see Section 2.1 for details. Isotonic regression determines an adaptive partition of the covariate space X, and µ is determined by averaging y-values over the partition elements. Clearly, other binning approaches can also be used on the pseudo-sample (y_i, π(x_i))_{i=1}^n to enforce (1.1), but we argue that isotonic regression is preferable since it avoids subjective choices of tuning parameters and leads to sensible regression functions under reasonable and verifiable assumptions. The only assumption for isotonic recalibration to be informative is that the function π gets the rankings of the conditional means right, that is, we would like to have π(x_i) ≤ π(x_j) whenever E[Y | X = x_i] ≤ E[Y | X = x_j]. Using isotonic regression for recalibration is not new in the literature. In the case of binary outcomes, it has already been proposed by Zadrozny-Elkan [32], Menon et al. [23] and recently by Tasche [29, Section 5.3]. The monotone single index models of Balabdaoui et al. [2] follow the same strategy as described above, but the focus of their work is different from ours. They specifically consider a linear regression model for the candidate function π, which is called the index. In the case of distributional regression, that is, when interest is in determining the whole conditional distribution of Y given covariate information X, Henzi et al. [13] have suggested to first estimate an index function π that determines the ordering of the conditional distributions w.r.t. first-order stochastic dominance, and then to estimate the conditional distributions using isotonic distributional regression; see Henzi et al. [14].
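As a small numerical illustration of this recalibration step, the following sketch uses synthetic data, a hypothetical candidate function pi, and scikit-learn's IsotonicRegression as the solver; it is not the paper's code.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(size=n)
y = x + rng.normal(scale=0.5, size=n)  # noisy responses
pi = 0.8 * x + 0.1                     # hypothetical candidate (index) function pi(x)

# isotonic recalibration: fit an increasing function of the index pi(x_i) to y_i
iso = IsotonicRegression(increasing=True)
mu = iso.fit_transform(pi, y)

# in-sample auto-calibration: wherever mu is constant, it equals the
# average of the corresponding y-values (unit case weights here)
for value in np.unique(mu):
    assert np.isclose(value, y[mu == value].mean())
```

The final loop verifies the empirical auto-calibration property (1.1) with unit weights: the fitted value on every constant block is exactly the mean of the y-values pooled into that block.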
As a new contribution, we show that the size of the partition of the isotonic recalibration may give insight into the information content of the recalibrated regression function µ. Furthermore, the partition of the isotonic recalibration allows one to explain connections between covariates and outcomes, in particular, when the signal-to-noise ratio is small, which typically is the case for insurance claims data. In order to come up with a candidate function π : X → R, one may consider any regression model such as, e.g., a generalized linear model, a regression tree, a tree boosting regression model or a deep neural network regression model. The aim is that π(·) provides us with the correct rankings of the conditional means E[Y | X = x_i], i = 1, …, n. The details are discussed in Section 3.
Organization. In Section 2, we formally introduce isotonic regression, which is a constrained optimization problem. This constrained optimization problem is usually solved with the pool adjacent violators (PAV) algorithm, which is described in Appendix A.1. Our main result is stated in Section 2.2. It relates the complexity of the isotonic recalibration solution to the signal-to-noise ratio in the data. Section 3 gives practical guidance on the use of isotonic recalibration, and in Section 4 we exemplify our results on a frequently used insurance data set. In this section, we also present graphical tools for interpreting the regression function. In Section 5, we conclude.

2 Isotonic regression

2.1 Definition and basic properties
For simplicity, we assume that the candidate function π : X → R does not lead to any ties in the values π(x_1), …, π(x_n), and that the indices i = 1, …, n are chosen such that they are aligned with the ranking, that is, π(x_1) < … < π(x_n). Remark 2.1 explains how to handle ties. The isotonic regression of z = (y_i, π(x_i))_{i=1}^n with positive case weights (w_i)_{i=1}^n is the solution µ ∈ R^n of the restricted minimization problem

µ = arg min_{m ∈ R^n} Σ_{i=1}^n w_i (y_i − m_i)²,  subject to m_1 ≤ … ≤ m_n.   (2.1)

We can rewrite the side constraints as Am ≥ 0 (component-wise), where A = (a_{i,j})_{i,j} ∈ R^{(n−1)×n} is the matrix with the elements a_{i,j} = 1_{i=j−1} − 1_{i=j}. We define y = (y_1, …, y_n)′ ∈ R^n and the (diagonal) case weight matrix W = diag(w_1, …, w_n). The above optimization problem then reads as

µ = arg min_{m ∈ R^n: Am ≥ 0} (y − m)′ W (y − m).   (2.2)

This shows that the isotonic regression is solved by a convex minimization with linear side constraints. It remains to verify that the auto-calibration property claimed in (1.1) holds.
Remark 2.1 If there are ties in the values π(x_1), …, π(x_n), for example, π(x_i) = π(x_j) for some i ≠ j, we replace y_i and y_j by their weighted average (w_i y_i + w_j y_j)/(w_i + w_j) and assign them the weights (w_i + w_j)/2. The procedure is analogous for more than two tied values. This corresponds to the second option of dealing with ties in Leeuw et al. [20, Section 2.1].
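The tie-handling rule of Remark 2.1 can be sketched in a few lines; pooling tied observations into a single observation carrying the total weight is equivalent, for the isotonic regression, to keeping both copies with half weights. A sketch with made-up numbers:

```python
def pool_ties(pi, y, w):
    # pool observations with tied index values pi(x_i): tied y-values are
    # replaced by their weighted average; we carry the total weight, which is
    # equivalent for the isotonic regression to Remark 2.1's half weights
    pooled = {}
    for p, yi, wi in zip(pi, y, w):
        sy, sw = pooled.get(p, (0.0, 0.0))
        pooled[p] = (sy + wi * yi, sw + wi)
    ps = sorted(pooled)
    return ps, [pooled[p][0] / pooled[p][1] for p in ps], [pooled[p][1] for p in ps]

ps, ys, ws = pool_ties([1.0, 1.0, 2.0], [3.0, 5.0, 7.0], [1.0, 3.0, 1.0])
# the tied pair (y=3 with w=1, y=5 with w=3) becomes y=4.5 with weight 4
```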
Remark 2.2 Barlow et al. [3, Theorem 1.10] show that the square loss function in (2.1) can be replaced by any Bregman loss function, L_φ(y, µ) = φ(y) − φ(µ) − φ′(µ)(y − µ), without changing the optimal solution µ. Here, φ is a strictly convex function with subgradient φ′. Bregman loss functions are the only consistent loss functions for the mean; see Savage [27] and Gneiting [11, Theorem 7]. If y and µ only take positive values, a Bregman loss function of relevance for this paper is the gamma deviance loss, which is equivalent to the QLIKE loss that arises by choosing φ(x) = − log(x); see Patton [25].
The solution of the minimization problem (2.2) can be given explicitly by the min-max formula

µ_i = max_{k ≤ i} min_{l ≥ i} ( Σ_{j=k}^{l} w_j y_j ) / ( Σ_{j=k}^{l} w_j ),  for i = 1, …, n.
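For small n, the min-max formula can be evaluated directly; the following brute-force sketch (O(n³), for illustration only) computes the weighted isotonic fit from the formula above:

```python
def minmax_iso(y, w):
    # mu_i = max over k <= i of ( min over l >= i of the weighted average
    # of y_k, ..., y_l ); brute force, for illustration only
    n = len(y)

    def avg(k, l):
        return sum(w[j] * y[j] for j in range(k, l + 1)) / sum(w[j] for j in range(k, l + 1))

    return [max(min(avg(k, l) for l in range(i, n)) for k in range(i + 1))
            for i in range(n)]

print(minmax_iso([1.0, 3.0, 2.0, 4.0], [1.0] * 4))  # [1.0, 2.5, 2.5, 4.0]
```

The middle pair (3, 2) violates monotonicity and is pooled to its average 2.5, while the already monotone boundary values are left untouched.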
While the min-max formula is theoretically appealing and useful, the related minimum lower sets (MLS) algorithm of Brunk et al. [6] is not efficient for computing the solution. The pool adjacent violators (PAV) algorithm, which is due to Ayer et al. [1], Miles [24] and Kruskal [18], allows for fast computation of the isotonic regression and provides us with the desired insights about the solution. In Appendix A.1, we describe the PAV algorithm in detail. The solution is obtained by suitably partitioning the index set I = {1, …, n} into (discrete) intervals

I_k = {i_{k−1} + 1, …, i_k},  for k = 1, …, K(z),   (2.3)

with z-dependent slicing points 0 = i_0 < i_1 < … < i_K = n, and with K(z) ∈ {1, …, n} denoting the number of discrete intervals I_k. The number K(z) of intervals and the slicing points i_k = i_k(z), k = 1, …, K(z), for the partition of I depend on the observations z. On each discrete interval I_k we then obtain the isotonic regression estimate

µ_i = ( Σ_{j ∈ I_k} w_j y_j ) / ( Σ_{j ∈ I_k} w_j ),  for instance i ∈ I_k,   (2.4)

see also (A.5). Thus, on each block I_k we have a constant estimate µ_{i_k}, and the isotonic property tells us that these estimates are strictly increasing over the block indices k = 1, …, K(z), because these blocks have been chosen to be maximal. We call K(z) the complexity number of the resulting isotonic regression. Figure 1 gives an example for n = 20 and rankings π(x_i) = i for i = 1, …, n. The resulting (non-parametric) isotonic regression function µ = µ(z), which is only uniquely determined at the observations (π(x_i))_{i=1}^n, can be interpolated by a step function. In Figure 1 this results in a step function having K(z) − 1 = 9 steps, that is, we have K(z) = 10 blocks, and the estimated regression function µ takes only K(z) = 10 different values. This motivates calling K(z) the complexity number of the resulting step function, see Figure 1. The partition of the indices I into the isotonic blocks I_k is obtained naturally by requiring monotonicity. This is different from the regression tree approach considered in Lindholm et al. [21]. In fact, this latter reference does not require monotonicity but aims at minimizing the "plain" square loss using, e.g., cross-validation for determining the optimal number of partitions. In our context, the complexity number K(z) is fully determined by requiring monotonicity and, in general, the results will differ.
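The PAV algorithm itself is short; the following is a minimal weighted sketch (not the paper's Appendix A.1 pseudo-code) that also returns the complexity number K(z) as the number of resulting blocks:

```python
def pav(y, w):
    # pool adjacent violators: scan left to right, merging adjacent blocks
    # while their weighted means violate (strict) monotonicity; each block
    # stores [weighted mean, total weight, number of observations]
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = [m for m, _, cnt in blocks for _ in range(cnt)]
    return fitted, len(blocks)  # len(blocks) = complexity number K(z)

mu, K = pav([1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0])
# the violating pair (3, 2) is pooled to 2.5: mu = [1.0, 2.5, 2.5, 4.0], K = 3
```

Merging on ">=" (rather than ">") pools blocks with equal means as well, so the surviving block values are strictly increasing, matching the maximality of the blocks described above.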
In insurance applications, the blocks I_k ⊂ I provide us with the (empirical) price cohorts µ_i = µ_{i_k}, for i ∈ I_k, and (2.4) leads to the (in-sample) auto-calibration property

µ(X′) = E[Y′ | µ(X′)],   (2.5)

where (Y′, X′) is distributed according to the weighted empirical distribution of (y_i, x_i)_{i=1}^n with weights (w_i)_{i=1}^n. Moreover, summing over the entire portfolio, we have the (global) balance property

Σ_{i=1}^n w_i µ_i = Σ_{i=1}^n w_i y_i,   (2.6)

that is, on average the overall (price) level is correctly specified if we price the insurance policies with covariates x_i by w_i µ_i, where the weights w_i > 0 now receive the interpretation of exposures.
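Both properties are easy to verify numerically; a sketch with synthetic positive claims, integer exposures, and scikit-learn's IsotonicRegression as the solver (again not the paper's code):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
n = 50
pi = np.sort(rng.uniform(size=n))             # ranks from a candidate model
y = pi + rng.gamma(shape=2.0, size=n)         # positive responses
w = rng.integers(1, 5, size=n).astype(float)  # exposures / case weights

mu = IsotonicRegression(increasing=True).fit_transform(pi, y, sample_weight=w)

# global balance property (2.6): total weighted price = total weighted claims
print(np.isclose(np.sum(w * mu), np.sum(w * y)))  # True
```

The balance property holds exactly (up to floating point) because every block value in (2.4) is the weighted mean of the y-values pooled into that block.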

2.2 Monotonicity of the expected complexity number
In this section, we prove that the expected complexity number E[K(z)] is an increasing function of the signal-to-noise ratio. For this, we assume a location-scale model for the responses Y_i, that is, we assume that

Y_i = µ_i + σ ε_i,  for i = 1, …, n,   (2.7)

with noise terms ε_i, location parameters µ_i ∈ R with µ_1 ≤ … ≤ µ_n, and scale parameter σ > 0. Here, µ_i takes the role of π(x_i) in the previous section. The parameters µ_1, …, µ_n are unknown, but it is known that they are labeled in increasing order. The signal-to-noise ratio is then described by the scale parameter σ, i.e., we have a low signal-to-noise ratio for high σ and vice versa. The explicit location-scale structure (2.7) allows us to analyze the complexity number point-wise in the sample points ω ∈ Ω of the probability space (Ω, F, P) as a function of σ > 0; this is similar to the re-parametrization trick of Kingma-Welling [16] that is frequently used to explore variational auto-encoders.
In this section, we write K(y) = K(z), because the ranking of the outcomes y is clear from the context (labeling), and we do not go via a ranking function π(·).
Theorem 2.3 Assume that the responses Y_i, i = 1, …, n, follow the location-scale model (2.7) with (unknown) ordered location parameters µ_1 < … < µ_n, and scale parameter σ > 0. Then, the expected complexity number σ ↦ E[K(Y)] is strictly decreasing in σ > 0.

Theorem 2.3 proves that, under a specific but highly relevant model, the complexity number K(Y) of the isotonic regression is decreasing on average with a decreasing signal-to-noise ratio. Implicitly, this means that noisier data, which carries less information, leads to a less granular regression function. Consequently, if the partition of the isotonic regression is used to obtain a partition of the covariate space X via the candidate function π, this partition will be less granular the more noise in Y cannot be explained by π(X); see also Section 3.3 for a further discussion.
To the best of our knowledge, our result is a new contribution to the literature on isotonic regression. While we focus on the finite sample case, a related result is the analysis of the complexity number of the isotonic regression function as a function of the sample size n; see Dimitriadis et al. [9, Lemma 3.2].
We are assuming strictly ordered location parameters in the formulation of Theorem 2.3. This assumption simplifies the proof in the case where we show that the expected complexity number E[K(Y)] is strictly decreasing in σ. With some additional notation, the theorem could be generalized to allow for ties between some (but not all) µ_i. Figure 2 gives an example of a location-scale model (2.7) with i.i.d. standard Gaussian noise and scale parameters σ = 2 (lhs) and σ = 20 (rhs); both panels consider the same sample point ω ∈ Ω in the noise terms ε(ω), see (2.8). On the right-hand side of Figure 2, we have complexity number K(y) = 13, and on the left-hand side K(y) = 46; the chosen sample size is n = 100.
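Theorem 2.3 is easy to probe by simulation; the following sketch (synthetic Gaussian location-scale data with arbitrarily chosen parameters, scikit-learn's isotonic fit) estimates E[K(Y)] for several values of σ:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def complexity(y):
    # complexity number K(y): number of distinct fitted values (= blocks)
    mu = IsotonicRegression(increasing=True).fit_transform(np.arange(len(y)), y)
    return len(np.unique(mu))

rng = np.random.default_rng(0)
n, reps = 100, 200
mu_true = np.linspace(0.0, 1.0, n)  # ordered location parameters mu_1 <= ... <= mu_n
for sigma in (0.02, 0.2, 2.0, 20.0):
    ks = [complexity(mu_true + sigma * rng.normal(size=n)) for _ in range(reps)]
    print(sigma, np.mean(ks))  # the estimated E[K(Y)] decreases as sigma grows
```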
3 Isotonic recalibration for prediction and interpretation

3.1 Prediction and estimation
In order to determine an auto-calibrated model for the true regression function x ↦ E[Y | X = x], we suggest a two-step estimation procedure. First, we choose a regression model and use the data (y_i, x_i)_{i=1}^n to obtain an estimate π̂ of a candidate function π that should satisfy

π(x) ≤ π(x′)  if and only if  E[Y | X = x] ≤ E[Y | X = x′],   (3.1)

for all x, x′ ∈ X. For example, in the case study in Section 4, a deep neural network model is chosen for π̂. For sensible results, it is important that the estimation method for π̂ does not overfit to the data.
In the second step, we apply isotonic regression to the pseudo-sample (y_i, π̂(x_i))_{i=1}^n to obtain an in-sample auto-calibrated regression function µ̂ defined on {π̂(x_i) : i = 1, …, n}. We call this second step isotonic recalibration. In order to obtain a prediction for a new covariate value x ∈ X, we compute π̂(x), find i such that π̂(x_i) < π̂(x) ≤ π̂(x_{i+1}), and interpolate by setting µ̂(x) = (µ̂(x_i) + µ̂(x_{i+1}))/2. This interpolation may be advantageous for prediction. For interpretation and analysis, however, we prefer a step function interpolation as this leads to a partition of the covariate space; see Section 3.3, below, and Figure 2. This two-step estimation approach can be interpreted as a generalization of the monotone single index models considered by Balabdaoui et al. [2]. They assume that the true regression function is of the form E[Y | X = x] = ψ(α′x), with an increasing function ψ. In contrast to our proposal, the regression model π is fixed to be a linear model α′x in their approach. They consider global least squares estimation jointly for (ψ, α), but find it computationally intensive. As an alternative, they suggest a two-step estimation procedure similar to our approach, but with a split of the data such that α and the isotonic regression are estimated on independent samples. They find that if the rate of convergence of the estimator for α is sufficiently fast, then the resulting estimator of the true regression function is consistent with a convergence rate of order n^{1/3}. In a distributional regression framework, Henzi et al.
[13] considered the described two-step estimation procedure with an isotonic distributional regression [14] instead of a classical least squares isotonic regression as described in Section 2.1. They show that, in both cases, with and without sample splitting, the procedure leads to consistent estimation of the conditional distribution of Y given X, as long as the index π can be estimated at a parametric rate. The two options, with and without sample splitting, do not result in relevant differences in predictive performance in the applications considered by Henzi et al. [13]. Assumption (3.1) can be checked by diagnostic plots using binning, similarly to the plots in Henzi et al. [13, Figure 2] in the distributional regression case. Predictive performance should be assessed on a test set of data disjoint from (y_i, x_i)_{i=1}^n, that is, on data that has not been used in the estimation procedure at all. Isotonic recalibration ensures auto-calibration in-sample, and under an i.i.d. assumption, auto-calibration will also hold approximately out-of-sample. Out-of-sample auto-calibration can be diagnosed with CORP (consistent, optimally binned, reproducible and PAV) mean reliability diagrams as suggested by Gneiting-Resin [12], and comparison of predictive performance can be done with the usual squared error loss function or deviance loss functions.
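The midpoint interpolation rule for prediction described above can be sketched with np.searchsorted; the boundary handling (clipping to the first or last pair of fitted values) is our own choice for the sketch, not specified in the text:

```python
import numpy as np

def predict_midpoint(pi_new, pi_train, mu_train):
    # find i with pi(x_i) < pi(x) <= pi(x_{i+1}) and return the midpoint
    # (mu(x_i) + mu(x_{i+1})) / 2; pi_train must be sorted increasingly,
    # out-of-range values are clipped to the boundary pair
    i = np.searchsorted(pi_train, pi_new, side="left")
    i = np.clip(i, 1, len(mu_train) - 1)
    return 0.5 * (mu_train[i - 1] + mu_train[i])

pi_train = np.array([0.1, 0.4, 0.7, 0.9])  # fitted ranks (hypothetical values)
mu_train = np.array([1.0, 1.0, 2.0, 3.0])  # recalibrated values at those ranks
print(predict_midpoint(np.array([0.5]), pi_train, mu_train))  # [1.5]
```

With side="left", a new value that exactly equals a training rank π̂(x_i) is assigned to the pair (x_{i−1}, x_i), matching the "≤" in the interpolation rule.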

3.2 Over-fitting at the boundary
There is a small issue with the isotonic recalibration, namely, it tends to over-fit at the lower and upper boundaries of the ranks π̂(x_1) < … < π̂(x_n). For instance, if y_n is the largest observation in the portfolio (which is not unlikely since the ranking π̂ is chosen in a response-data-driven way), then we estimate µ̂_{i_K} = y_n, where K = K((y_i, π̂(x_i))_{i=1}^n). Often, this over-fits to the (smallest and largest) observations, as such extreme values/estimates cannot be verified on out-of-sample data. For this reason, we visually analyze the largest and smallest values in the estimates µ̂, and we may manually merge, say, the smallest block I_1 with the second smallest one I_2 (with the resulting estimate (2.4) on the merged block). More rigorously, this pooling could be cross-validated on out-of-sample data, but we refrain from doing so. We come back to this in Figure 5, below, where we merge the two blocks with the biggest estimates.
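Merging two boundary blocks amounts to replacing their two weighted means by one pooled mean as in (2.4); a sketch with made-up block values:

```python
def merge_last_blocks(block_means, block_weights):
    # pool the two largest isotonic blocks to mitigate boundary over-fitting;
    # the merged estimate is the weighted mean of the two block means, which
    # lies between them, so monotonicity over the remaining blocks is kept
    *head_m, m1, m2 = block_means
    *head_w, w1, w2 = block_weights
    merged = (w1 * m1 + w2 * m2) / (w1 + w2)
    return head_m + [merged], head_w + [w1 + w2]

means, weights = merge_last_blocks([1.0, 2.0, 10.0], [5.0, 4.0, 1.0])
# (4 * 2 + 1 * 10) / 5 = 3.6, so means = [1.0, 3.6], weights = [5.0, 5.0]
```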

3.3 Interpretation
In (2.3) we have introduced the complexity number K((y_i, π̂(x_i))_{i=1}^n) that counts the number of different values in µ̂, obtained by the isotonic regression (2.2) in the isotonic recalibration step. This complexity number K((y_i, π̂(x_i))_{i=1}^n) allows one to assess the information content of the model or, in other words, how much signal is explainable from the data. Theorem 2.3 shows that the lower the signal-to-noise ratio, the lower the complexity number of the isotonic regression that we can expect. Clearly, in Theorem 2.3 we assume that the ranking of the observations is correct, which will only be approximately satisfied since π has to be estimated. In general, having large samples and flexible regression models for modeling π, it is reasonable to assume that the statement remains qualitatively valid. However, in complex (algorithmic) regression models, we need to ensure that we prevent in-sample overfitting; this is typically controlled by either using (independent) validation data or by performing a cross-validation analysis. Typical claims data in non-life insurance have a low signal-to-noise ratio. Regarding claims frequencies, this low signal-to-noise ratio is caused by the fact that claims are not very frequent events, e.g., in car insurance, annual claims frequencies range from 5% to 10%, that is, only one out of 10 (or 20) drivers suffers a claim within a calendar year. A low signal-to-noise ratio also applies to claim amounts, which are usually strongly driven by randomness, while the explanatory part from policyholder information is comparably limited. Therefore, we typically expect a low complexity number K((y_i, π̂(x_i))_{i=1}^n) both for claims frequency and claim amounts modeling. In case of a small to moderate complexity number K = K((y_i, π̂(x_i))_{i=1}^n), the regression function µ̂ becomes interpretable through the isotonic recalibration step. For this, we extend the auto-calibrated regression function µ̂ from the set {π̂(x_1), …, π̂(x_n)} to the entire covariate space X by defining the step function

µ̂(x) = µ̂_{i_k}  if π̂(x_{i_{k−1}}) < π̂(x) ≤ π̂(x_{i_k}),  for all x ∈ X,

with the conventions π̂(x_{i_0}) = −∞ and π̂(x_{i_K}) = ∞ for the boundary blocks, where 0 = i_0 < i_1 < … < i_K = n are the slicing points of the isotonic regression as defined in (2.3). Figure 1 illustrates this step function interpolation, which is different from an interpolation scheme that one would naturally use for prediction. We define a partition X_1, …, X_K of the original covariate space X by

X_k = {x ∈ X : µ̂(x) = µ̂_{i_k}},  for k = 1, …, K.   (3.2)

Figure 4 illustrates how this partition of X provides insights on the covariate-response relationships in the model. This procedure has some analogy to regression trees and boosting trees that rely on partitions of the covariate space X. In the case study in Section 4, we illustrate two further possibilities to use the partition defined in (3.2) for understanding covariate-response relationships. First, in Figure 7, the influence of individual covariates on the price cohorts is analyzed, and second, Figure 9 gives a summary view of the whole covariate space for a chosen price cohort.
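Operationally, mapping a covariate x to its partition element only requires the index value π̂(x) and the block boundaries; a sketch with hypothetical boundary values:

```python
import numpy as np

def partition_element(pi_x, upper_bounds):
    # upper_bounds[k-1] is the upper index boundary of block k, for
    # k = 1, ..., K-1; returns the label k such that pi(x) falls into X_k
    # (the last block is unbounded above); side="left" puts a value equal
    # to a boundary into the lower block, matching the "<=" upper bound
    return np.searchsorted(upper_bounds, pi_x, side="left") + 1

bounds = [0.3, 0.6]  # hypothetical boundaries, giving K = 3 blocks
print(partition_element(np.array([0.1, 0.45, 0.9]), bounds))  # [1 2 3]
```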

4.1 Isotonic recalibration vs. binary regression trees
We start by considering the two covariate components RiskClass and VehAge only. Since the resulting covariate space X = {(RiskClass, VehAge)} ⊂ R² is two-dimensional, we can graphically illustrate the differences between the isotonic recalibration approach and a binary regression tree (as a competing model) for interpretation. In Section 4.2, we consider all available covariates. We fit a deep feed-forward neural network (FFNN) regression model to these 683 claims. We choose a network architecture of depth 3 with (20, 15, 10) neurons in the three hidden layers, the hyperbolic tangent activation function in the hidden layers, and the log-link for the output layer. The input has dimension 2; this results in an FFNN architecture with a network parameter of dimension 546. For a more detailed discussion of FFNNs, we refer to [31, Chapter 7], in particular, to Listings 7.1-7.3 of that reference. We fit this model using the gamma deviance loss, see [31, Section 5.3.7] and Remark 2.2, use the nadam version of stochastic gradient descent, and exercise early stopping on a validation set comprising 20% of the entire data. Line (1a) of Table 1, called gamma FFNN, shows the performance of the fitted FFNN regression model. This is compared to the null model (empirical mean) on line (0) that does not consider any covariates. We observe a decrease in gamma deviance loss and in root mean squared error (RMSE), which justifies the use of a regression model; note that these are in-sample figures, but we use early stopping to prevent the network from in-sample overfitting. The difficulty here is that, with only 683 claims, we cannot provide a reasonable out-of-sample analysis. The last column of Table 1 shows that the gamma FFNN does not satisfy the global balance property (2.6). In the next step, we use the FFNN estimates as ranks π̂(x_i) for ordering the claims y_i and the covariates x_i, respectively. Then we apply the non-parametric isotonic recalibration step (2.2) to these ranks and claims. The Swedish motorcycle claims data is aggregated w.r.t.
the available covariate combinations, and the 683 positive claims come from 656 different covariate combinations x_i. This requires that we work with the weighted version of (2.2), where w_i ∈ N corresponds to the number of claims that have been observed for covariate x_i, and y_i corresponds to the average observed claim amount on x_i. We use the R package monotone [7], which provides a fast implementation of the PAV algorithm. The numerical results are presented on line (1b) of Table 1. There is a slight decrease in average loss through the isotonic recalibration. This is expected since the isotonic regression is optimizing the in-sample loss for any Bregman loss function, see Remark 2.2. The last column of Table 1 verifies that now the global balance property (2.6) holds. Figure 3 provides the resulting step function from the isotonic recalibration (in red color) of the ranking (π̂(x_i))_{i=1}^n given by the gamma FFNN; this is complemented with the observed amounts y_i (in blue color). The resulting complexity number is K = K((y_i, π̂(x_i))_{i=1}^n) = 18, i.e., in this example the conditional expected claim amounts can be represented by 18 different estimates µ̂_{i_k} ∈ R, k = 1, …, K = 18; the FFNN regression function uses 6 · 21 = 126 different values (ranks), which corresponds to the cardinality of the available covariate values (RiskClass, VehAge) ∈ X. The isotonic recalibration on the ranks π̂(x) = π̂(RiskClass, VehAge) of the FFNN leads to a partition X_1, …, X_18 of the covariate space as defined in (3.2). We compare this partition to the one that results from a binary split regression tree approach. We use 10-fold cross-validation to determine the optimal tree size. In this example, the optimal tree has only 3 splits, and they all concern the variable VehAge. The resulting losses of this optimal tree are shown on line (2) of Table 1, and we conclude that the regression tree approach is not fully competitive here. More interestingly, Figure 4 shows the resulting partitions of the covariate space X = {(RiskClass, VehAge)} from the two approaches. The plot on the right-hand side shows the three splits of the regression tree (all w.r.t. VehAge). From the isotonic recalibration approach on the left-hand side, we learn that a good regression model should have diagonal structures, emphasizing that the two covariates interact in a nontrivial way, which cannot be captured by the binary split regression tree in this case.

4.2 Consideration of all covariates
We now consider all available covariate components, see lines 2-7 of Listing 1. We first fit an FFNN to this data. This is done exactly as in the previous example, with the only difference that the input dimension changes from 2 to 6 when we consider all available information. We transform the (ordered) Area code into real values, and we also merge Area codes 5 to 7 because of the scarcity of claims for these Area codes; we call this new variable Zone. The FFNN then has a network parameter of dimension 626. The network is fitted with stochastic gradient descent that is early stopped based on a validation loss analysis. The results are presented on line (2a) of Table 2. We compare the fitted FFNN regression model to the null model (empirical mean) and a gamma generalized linear model (GLM). The gamma GLM is identical to model Gamma GLM1 in [31, Table 5.13]. We give some remarks on the results of Table 2. Firstly, the FFNN has the smallest gamma deviance loss and the smallest RMSE of the three models on lines (0)-(2a). Thus, the gamma FFNN adapts best to the data among the three model choices (we use early stopping in the FFNN fitting). Interestingly, the gamma GLM and the FFNN both fail to have the global balance property (2.6), see the last column of Table 2. Stochastic gradient descent fitted models with early stopping generally fail to satisfy the global balance property, whereas the gamma GLM fails to have the global balance property because we work with the log-link and not with the canonical link of the gamma GLM here.

Figure 5: Isotonically recalibrated regression models in the Swedish motorcycle example using all covariates: for the gamma GLM with complexity number K((y_i, π̂(x_i))_{i=1}^n) = 24 (lhs), for the gamma FFNN with complexity number K((y_i, π̂(x_i))_{i=1}^n) = 23 (middle), and over-fitting corrected (rhs).
In the next step, we use the FFNN predictions as ranks π̂(x_i) for ordering the responses and covariates, and we label the claims y_i such that π̂(x_1) < … < π̂(x_n). There are no ties in this data, and we obtain n = 656 pairwise different values. The results of the isotonic recalibration are presented in Figure 5 (middle). The complexity number is K = K((y_i, π̂(x_i))_{i=1}^n) = 23, thus, the entire regression problem is encoded in 23 different values µ̂_{i_k}, k = 1, …, K. In view of this plot, it seems that the largest value µ̂_{i_K} over-fits to the corresponding observation, as this estimate is determined by a single observation y_n that is bigger than the weighted block mean µ̂_{i_{K−1}} on the previous block I_{K−1}; compare Section 3.2. For this reason, we manually pool the last two blocks I_{K−1} and I_K. This provides us with a new estimate (2.4) on this merged block, and reduces the complexity number by 1 to K = 22. The resulting isotonic recalibration is shown in Figure 5 (rhs), and the empirical losses are provided on line (2b) of Table 2. Importantly, this isotonically recalibrated regression is in-sample auto-calibrated (2.5) and, hence, it fulfills the global balance property, which can be verified in the last column of Table 2. We perform the same isotonic recalibration on the ranks obtained from the gamma GLM in Table 2. We observe that the isotonic recalibration step leads to a major decrease in average loss in the gamma GLM, and it results in the complexity number K = 24, see also Figure 5 (lhs). We compare isotonic recalibration to a recent proposal of Lindholm et al. [21] that also achieves auto-calibration in-sample. Isotonic regression provides a partition of the index set I = {1, …, n} into disjoint blocks I_1, …, I_K on which the estimated regression function is constant. This can also be achieved by considering a binary regression tree algorithm applied to the (rank) covariates {π̂(x_i) : 1 ≤ i ≤ n} and corresponding responses y_i; see Section 2.3.2 of Lindholm et al. [21]. We call this latter approach the tree binning approach. There are two main differences between the tree binning approach and the isotonic recalibration approach. First, generally, the tree binning approach does not provide a regression function that has the same ranking as the first regression step providing π̂(x_i). Second, in the isotonic regression approach, the complexity number K((y_i, π̂(x_i))_{i=1}^n) is naturally given, i.e., the isotonic regression (2.2) automatically extracts the degree of information contained in the responses y, and generally, this degree of information is increasing for an increasing signal-to-noise ratio by Theorem 2.3. Conversely, in the tree binning approach, we need to determine the optimal number of bins (leaves), e.g., by k-fold cross-validation. The obtained number of bins depends on the hyperparameters of the minimal leaf size and of the number of folds in cross-validation, as well as on the random partition of the instances for cross-validation. We found that the number of bins is sensitive to these tuning choices, and hence, contrary to isotonic recalibration, the resulting partition is subject to potentially subjective choices and randomness.
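The contrast between the two approaches can be reproduced on synthetic data; the sketch below fits a depth-limited DecisionTreeRegressor to the ranks (the tree binning idea, with a fixed rather than cross-validated number of leaves), and, unlike the isotonic fit, the resulting step function need not be monotone in π̂(x):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 400
pi = np.sort(rng.uniform(size=n))        # first-step ranks pi(x_i)
y = pi + rng.normal(scale=0.3, size=n)   # synthetic responses

# tree binning: regress y on the rank pi(x); the leaves define the bins,
# and their number is a tuning choice (fixed here for illustration)
tree = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=10, random_state=0)
yhat = tree.fit(pi.reshape(-1, 1), y).predict(pi.reshape(-1, 1))

# number of bins and whether the step function happens to be monotone
print(tree.get_n_leaves(), bool(np.all(np.diff(yhat) >= 0)))
```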
For the results on the tree binning approach in Table 2 we have chosen k = 10 folds and a minimal leaf size of 10, and only the random partitioning of the pseudo-sample differs between the results in lines (2c)-(2d). A first random seed gives 4 bins and a second one 8 bins, and we observe a considerable difference between the two models with respect to the gamma deviance loss and the RMSE. Figure 6 shows the isotonic recalibration and the tree binning approach with 8 bins, corresponding to lines (2b) and (2d) of Table 2. From this plot, we conclude that the tree binning approach does not necessarily preserve the ranking induced by π(x_i), as the resulting step function (in blue color) is not monotonically increasing. We recommend isotonic recalibration to achieve auto-calibration, since it preserves the monotonicity of the regression model from the first estimation step, and there are no potentially influential tuning parameters. In Figure 7, we illustrate the resulting marginal plots obtained by projecting the estimated values µ of the isotonic recalibration to the corresponding covariate values, i.e., this is the marginal view of the resulting covariate space partition (3.2). For a low complexity number K((y_i, π(x_i))_{i=1}^n) this can be interpreted nicely. We see relevant differences in the distributions of the colors across the different covariate levels of OwnerAge, Zone, RiskClass and VehAge. This indicates that these variables are important for explaining claim sizes, with the reservation that this marginal view ignores potential interactions. For the variable Gender we cannot draw any conclusion, as the gender imbalance is too large. The interpretation of BonusClass is less obvious. In fact, from the gamma GLM we know that BonusClass is not significant, see [31, Table 5.13]. This is because the BonusClass is related to collision claims, whereas our data concern comprehensive insurance, which excludes collision claims. Figure 8 shows the marginal view of the isotonically recalibrated gamma FFNN (lhs) and the gamma GLM (rhs) for the covariate BonusClass. As mentioned, BonusClass is not significant in the gamma GLM, and it seems from the figure that, indeed, the color distribution across the different levels is rather similar for both models. Clearly, VehAge is the most important variable, showing that claims on new motorcycles are more expensive. There are substantial differences in the claim size distributions between the zones, with Zone 1, comprising the three largest cities of Sweden, typically having more big claims. RiskClass corresponds to the size of the motorcycle, which interacts with OwnerAge, VehAge and Zone; it is therefore more difficult to interpret, as we have relevant interactions between these variables. Figure 9 illustrates the partition (X_k)_{k=1,...,K} of the 6-dimensional covariate space X w.r.t. the isotonic recalibration (µ_{I_k})_{k=1,...,K} for two selected values of k. The lines connect all the covariate components in x that are observed within the data (x_i)_{1≤i≤n} for a given value µ_{I_k}, and the size of the black dots illustrates how often a certain covariate level is observed. E.g., the figure on the right-hand side belongs to the second largest claim prediction µ_{I_{K-1}} = 59,851.
For this expected response level, the OwnerAge is comparably small (around 25 years), all owners are male and mostly live in Zone 1 (the three biggest cities of Sweden), and they own motorcycles of a higher RiskClass with a small VehAge. Similar conclusions can be drawn for the other parts X_k of the covariate space X; thus, a low complexity number K((y_i, π(x_i))_{i=1}^n) enables us to explain the regression model.
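The marginal view underlying Figures 7-8 amounts to a cross-tabulation of the isotonic block index against each covariate's levels. A minimal sketch with pandas, on synthetic placeholders (the block labels would in practice come from the isotonic recalibration, and the covariate is only meant to resemble, e.g., VehAge):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic placeholders, illustrative only: a discretized covariate
# and the isotonic block index k of each observation.
n = 656
veh_age = rng.integers(0, 5, size=n)
block = veh_age + rng.integers(0, 3, size=n)

# Marginal view: for every covariate level, the distribution of the
# isotonic blocks (the "color distribution" across levels).
marginal = pd.crosstab(veh_age, block, normalize="index")
```

Levels whose block distributions differ strongly, as for VehAge above, indicate covariates that are important for explaining claim sizes, with the caveat that this marginal view ignores interactions.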

Conclusions
We have tackled two problems. First, we have enforced that the regression model fulfills the auto-calibration property by applying an isotonic recalibration to the ranks of a fitted (first) regression model. This isotonic recalibration does not involve any hyperparameters, but it solely assumes that the ranks from the first regression model are (approximately) correct. Isotonic regression has the property that the complexity of the resulting (non-parametric) regression function is small in low signal-to-noise ratio problems. Benefiting from this property, we have shown that this leads to explainable regression functions, because a low complexity is equivalent to a coarse partition of the covariate space. In insurance pricing problems this is particularly useful, as we typically face a low signal-to-noise ratio in insurance claims data. We can then fit a complex (algorithmic) model to that data in a first step, and in a subsequent step we propose to auto-calibrate the first regression function using isotonic recalibration, which also leads to a substantial simplification of the regression function.

Figure 3: Isotonic recalibration in the Swedish motorcycle example only using RiskClass and VehAge as covariates, resulting in the complexity number K((y_i, π(x_i))_{i=1}^n) = 18.

Figure 4: (lhs) Isotonic recalibration and (rhs) binary regression tree, both only using RiskClass and VehAge as covariates; the color scale is the same in both plots.

Figure 8: Marginal view of the isotonically recalibrated gamma FFNN model (lhs) and the isotonically recalibrated gamma GLM (rhs) for the covariate component BonusClass.

Table 1: Loss figures in the Swedish motorcycle example only considering RiskClass and VehAge as covariates. The line called 'average' compares the average claims estimate of the FFNN to the empirical mean, and we observe a slight positive bias in the FFNN prediction, i.e., 24,932 > 24,641.

Table 2: Losses in the Swedish motorcycle example based on all available covariates.