A nonparametric data mining approach for risk prediction in car insurance: a case study from the Montenegrin market

To predict risk in car insurance, we used nonparametric data mining techniques: clustering, support vector regression (SVR) and kernel logistic regression (KLR). The goal of these techniques is to classify risk and predict claim size from data, thus helping the insurer to assess risk and calculate actual premiums. On the case study data, we showed that these data mining techniques can predict claim sizes and their occurrence with better accuracy than the standard methods. This represents the basis for calculation of the net risk premium. The article also discusses the advantages of data mining methods compared to standard methods for risk assessment in car insurance, as well as the specificities of the results that stem from a small insurance market such as the Montenegrin one.

KEYWORDS: car insurance; net risk premium; data mining; clustering; support vector regression (SVR); kernel logistic regression (KLR)

Introduction
Aggregate claims for a homogeneous car insurance portfolio have long been estimated using purely algorithmic methods to calculate tariffs (Mayer, 2002). Such an approach does not allow uncertainties to be quantified. Uncertainties can only be determined if there is an underlying stochastic model on which the calculation algorithms can be based.
Generalised linear models (GLM) and other more flexible stochastic models are used in recent studies to predict insurance tariffs at a micro-level, i.e., at the level of individual claims (Bortoluzzo, Claro, Caetano, & Artes, 2011; David, 2015; Ohlsson & Johansson, 2010). A major limitation of these models is that their structure is restricted to a linear form, which can be too rigid for real applications. Modelling is also problematic if many explanatory variables are discrete and multi-valued, which is quite common for insurance data-sets. These drawbacks can be overcome by using nonparametric methods from modern statistical machine learning and data mining theory, such as support vector regression (SVR) or kernel logistic regression (KLR) (Christmann, 2005). The support vector machine (SVM) has been reported to have better predictive performance than other techniques, especially for data that exhibit nonlinearity (Tian, Shi, & Liu, 2012).
The majority of insurance companies keep data on the history of their operations in a data warehouse. These huge quantities of data hide very important information, which could contribute to easier decision-making and risk assessment in car insurance. Data mining is capable of extracting this information, and it can also justify the investments of insurance companies in data.
Standard methods for risk classification in car insurance are usually based on risk factors such as type of vehicle, age, region, etc. However, the main problem is the dispersion of data over a large number of classes, which leads to a small number of examples per class. A data-driven clustering approach to risk classification can provide the necessary massiveness and homogeneity within a class, as well as heterogeneity between different classes (Smith, Willis, & Brooks, 2000; Yeo, Smith, Willis, & Brooks, 2001).
This article presents the possibilities, advantages and disadvantages of risk assessment and prediction in car insurance, with application of data mining techniques such as clustering, SVR and KLR.
The second section reviews papers dealing with similar issues. The third section defines the concept of risk in car insurance and discusses standard methods for risk assessment and their shortcomings. The data mining techniques used in this article for risk prediction are explained in section four. In section five we present the capabilities of these techniques on data from the insurance company Sava Montenegro. We start with a complete data-set, which is clustered into homogeneous clusters, i.e., clusters with similar amounts of claims. Expected claim sizes for the identified clusters are predicted with SVR. For calculation of the net risk premium, the probability of claim occurrence is important in addition to the claim amount. It is estimated using KLR. In this section we also discuss the results obtained and the advantages and disadvantages of the applied data mining methods for risk prediction on a small insurance market such as the Montenegrin one. Conclusions and future research are discussed in the last section.

Related work
There are numerous papers dealing with risk assessment in car insurance.
The heuristic approach implies that insurance companies categorise policy owners into several groups depending on risk factors such as territory, age, sex, type of vehicle, etc., and also on the basis of historical policy data. Some scientific papers have already examined this approach. For example, Samson and Thomas (1987) selected four factors and divided each factor into three levels, which gives 81 ($3^4$) classes in total. For each of these classes they assessed the claim sizes using linear regression. As mentioned above, the main problem in this model is the dispersion of data over a large number of classes, which leads to a small number of examples per class. It is known that the main requirement for accurate prediction is sufficient volume and homogeneity of data within a class, as well as heterogeneity between the classes. As the number of factors considered increases, this problem becomes more pronounced, because the number of classes grows drastically. For this reason, an approach based on factors defined in advance is limiting. Yeo et al. (2001) clustered on 13 variables in their paper. With this method, they obtained a total of 30 classes containing between 1000 and 20,000 policy owners. The policy owners within a class had very similar amounts of claims, while the average claim sizes of different classes differed significantly. In other words, the conditions of massiveness and homogeneity were met in their classification. For comparison, they also used the heuristic method. They took only three factors, each divided into five classes, which created 125 ($5^3$) classes with at least 10,000 policies; these classes were much less discriminatory with respect to claim sizes. It turned out that the prediction results for claim sizes were much better on the clusters.
The clustering technique was also used by Williams and Huang (1997) to identify policy owners with high claim sizes. They combined clustering with decision trees: first they obtained the classes using clustering, and then they used decision trees to generate descriptions of those classes. Smith et al. (2000) suggested that the clusters should be used for prediction of claim sizes by data mining techniques. In order to predict the claim sizes, Chapados et al. (2002) used more sophisticated data mining techniques such as neural networks (NN).
Recent studies have suggested that a mixed discrete-continuous model may be appropriate for estimating claims and risk in insurance data (Christmann, 2004, 2005; Heller, Stasinopoulos, & Rigby, 2006; Parnitzke, 2008). According to Parnitzke (2008), the model explicitly specifies a logit-linear model for claim occurrence (i.e., claim probability) and a linear regression model for the mean claim size. GLMs and the more flexible Tweedie's compound Poisson models are often used to construct insurance tariffs (Bortoluzzo et al., 2011; Ohlsson & Johansson, 2010). However, even these more general models can still run into problems when modelling high-dimensional relationships, which are quite common in insurance data-sets. Namely, many explanatory variables are discrete: if there are eight discrete explanatory variables, each with eight different values, there are $8^8 \approx 16.8$ million possible interaction terms. In these circumstances, a more suitable approach is to use nonparametric methods from machine learning and data mining, such as SVR and KLR (Christmann, 2004, 2005) or tree-based gradient boosting methods (Guelman, 2012; Yang, Qian, & Zou, 2015).
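The class counts quoted in this section follow from simple multiplication of the number of levels per factor. The short sketch below reproduces them; the function name `full_cross_classes` is a hypothetical helper introduced only for illustration.

```python
def full_cross_classes(levels_per_factor):
    """Number of classes when every level of every factor is crossed."""
    total = 1
    for levels in levels_per_factor:
        total *= levels
    return total


samson_thomas = full_cross_classes([3] * 4)  # four factors, three levels each
yeo_heuristic = full_cross_classes([5] * 3)  # three factors, five classes each
eight_by_eight = full_cross_classes([8] * 8)  # eight factors, eight values each

print(samson_thomas, yeo_heuristic, eight_by_eight)  # prints: 81 125 16777216
```

The count grows exponentially with the number of factors, which is why a fixed portfolio is spread ever more thinly over the classes as factors are added.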
In recent years many papers have dealt with the application of data mining methods for loss cost estimation and risk analysis in insurance (Gepp, Wilson, Kumar, & Bhattacharya, 2012;Liu, Wang, & Lv, 2014;Paglia & Phelippe-Guinvarc'h, 2011).

Risk in car insurance and methods for its assessment
Risk assessment is very important for insurance companies. Determining the premium level on the basis of the assessed risk (net risk premium) enables an insurance company to avoid negative selection, i.e., losing good clients because of high premiums.
According to Mayer (2002), based on the actuarial equivalence principle, the net risk premium is equal to the expected claim level. According to Renshaw (1994) and Parnitzke (2008), the expected claim size is calculated as the product of the predicted claim size and the probability that at least one claim will occur, given in relation (1):

$$E(\mathrm{ClaimSize}_i) = \mathrm{PredictedClaimSize}_i \cdot P(\mathrm{ClaimOccur}_i) \tag{1}$$

In insurance practice, the probability of occurrence of at least one claim is evaluated with logistic regression (Parnitzke, 2008), given in relation (2):

$$Y_i = \ln\frac{P(\mathrm{ClaimOccur}_i)}{1 - P(\mathrm{ClaimOccur}_i)} = \alpha + \sum_{j} \beta_j X_{ij} \tag{2}$$

The claim size can be estimated with linear regression (Parnitzke, 2008), given in relation (3):

$$\mathrm{PredictedClaimSize}_i = \beta_0 + \sum_{j} \beta_j X_{ij} \tag{3}$$
Here $X_{ij}$ is the value of the variable representing risk factor (tariff criterion) $j$ for policy $i$, and $\alpha$, $\beta_0$ and $\beta_j$ are regression coefficients. However, the dependency between the risk factors and the claim size is usually not linear or even monotonic (Christmann, 2004). Classic GLM and the more flexible Tweedie's compound Poisson models have lower predictive performance for unknown claim sizes than nonparametric methods. Paglia and Phelippe-Guinvarc'h (2011) compared GLM with nonparametric tree boosting (TB) and NN. They concluded that the considered nonparametric methods have better predictive performance (GLM mean squared error [MSE] = 485,685; NN MSE = 473,112; TB MSE = 459,099). Generally, the SVM has been reported to have better predictive performance than other techniques, especially for data that exhibit nonlinearity (Tian et al., 2012).
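To make the interplay of relations (1) and (2) concrete, the following minimal sketch inverts the logit in relation (2) to obtain a claim probability and multiplies it by a predicted claim size as in relation (1). All coefficient values, the policy vector and the claim size are hypothetical and are not taken from the case study.

```python
import math


def claim_probability(alpha, betas, x):
    """Invert relation (2): P = 1 / (1 + exp(-(alpha + sum_j beta_j * x_j)))."""
    y = alpha + sum(b * xj for b, xj in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-y))


def net_risk_premium(predicted_claim_size, p_claim):
    """Relation (1): expected claim size = predicted size * claim probability."""
    return predicted_claim_size * p_claim


# Hypothetical coefficients and one hypothetical policy (two dummy risk factors).
alpha = -2.0
betas = [0.8, 0.5]
x = [1, 0]

p = claim_probability(alpha, betas, x)
premium = net_risk_premium(1500.0, p)  # 1500.0 is an assumed predicted claim size
print(round(p, 4), round(premium, 2))
```

In the approach described in this article, the predicted claim size would come from a nonparametric model rather than from a linear relation, but the final multiplication in relation (1) stays the same.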
According to Christmann (2005), because of the large number of possible values of risk factors, it is not possible, even for data-sets with several million customers, to estimate all the interaction terms simultaneously with these classic statistical methods, because the number of interaction terms grows too fast. A nonparametric approach based on a combination of KLR and SVR was able to detect an interesting interaction term and violations of a monotonicity assumption without the researcher having to model interaction terms or polynomial terms manually. These methods do not explicitly provide the intensity of the factor impacts, but the impact can be shown implicitly. Christmann (2005) presented the expected claim size stratified by the age of the main user, or by the gender and age of the main user. From these dependencies, the impact of individual factors and of the interaction terms on the net risk premium can be seen implicitly.
In standard methods of car insurance, policies are classified based on risk factors such as age, region, type of vehicle, etc. If we also take into consideration the bonus-malus classes, which are determined on the basis of the history of policies and claims, it is clear that this approach leads to a large number of tariff classes. This leads to dispersion of data, so that classes contain very little experience related to risk and claim sizes.
If the classes have a higher number of policies with similar levels of claim sizes, such classes could be better for prediction of claim sizes, i.e. of premium level. Data clustering can provide necessary massiveness and homogeneity within the class, as well as the heterogeneity between different classes.

Description of data mining techniques and tools
Data mining techniques used in this article are clustering, KLR and SVR. In the previous section we explained the advantage of these methods compared to standard methods.
Clustering finds groups of similar records within the data-set. In this article we used the k-means clustering method. It forms k clusters iteratively, using a function that evaluates the distance of each data point from the cluster mean values. Mean values are initially set for all clusters. For a specific data point, the distances from the cluster mean values are calculated and the data point is assigned to the cluster with the smallest distance. The mean value of that cluster is then recalculated, taking the newly added data point into account. The procedure is repeated iteratively for all data in the initial data-set.
Regression is used for predicting a continuous target (dependent) variable based on the predictor (independent) variables in a data-set. For that purpose, we used SVR and KLR.
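The assignment and update steps described above can be sketched in a few lines of plain Python. This is the batch (Lloyd) variant of k-means, in which all points are assigned before the means are recomputed; it is an illustrative assumption and not necessarily the variant implemented in the tool used in this article. The data values are hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Batch (Lloyd) k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial means: k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to the nearest cluster mean.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: recompute each mean from the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old mean if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical one-dimensional "average claim size" values with two obvious groups.
claims = [(1.0,), (1.2,), (0.8,), (10.0,), (10.5,), (9.5,)]
labels, centroids = kmeans(claims, k=2)
print(labels, sorted(centroids))
```

With real policy data each tuple would hold several scaled attributes, and the number of clusters k would be chosen by inspecting cluster homogeneity.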
Logistic regression is used for predicting a binomial (0, 1) or categorical (with a limited set of categories) target variable based on the predictor variables, where the data-set is classified into as many classes as the target variable has values.
In the case of a binomial target variable, logistic regression predicts a continuous variable p, i.e., the probability that the target variable value is 1 (success probability). In order to transform the regression into a linear form, the logit transformation ln(p/(1 − p)) is used. The logistic regression model is given in relation (4).
The coefficients are estimated with the maximum likelihood method, which maximises the likelihood of the observed data. The method calculates the coefficients iteratively. Once the coefficients are calculated, the probability p can be obtained according to relation (5).
KLR is a nonlinear form of logistic regression, which is obtained by mapping the data vectors with a kernel function. In this article we used KLR based on a fast dual algorithm (Keerthi, Duan, Shevade, & Poo, 2005). Ruping (2003) implemented this algorithm in the form of the programme MyKLR.
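For reference, the standard textbook form of the binomial logistic model that relations (4) and (5) refer to can be written as follows. The symbols $\beta_0, \beta_1, \dots, \beta_m$ (coefficients) and $x_1, \dots, x_m$ (predictor values) are assumptions introduced here for illustration and may differ from the authors' notation.

```latex
% Logit (log-odds) form: linear in the predictors
\ln\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{m} \beta_j x_j \tag{4}

% Solving (4) for p gives the logistic function
p = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j=1}^{m} \beta_j x_j\right)\right)} \tag{5}
```

Relation (5) follows from (4) by exponentiating both sides and solving for $p$, so the two are equivalent statements of the same model.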
The SVR was introduced by Vapnik (1995), and it is a regression technique that generally produces accurate nonlinear models.
The main concept employed by SVR is that data vectors which are not linearly related in the original space can be mapped to a higher or infinite dimensional space (feature space) where a linear relation is possible. In epsilon SVR (ε-SVR), the goal is to find a hyperplane in feature space that has at most ε deviation from the actually obtained targets $y_i$ for all the training data.
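The "at most ε deviation" criterion corresponds to the ε-insensitive loss: deviations inside the ε tube cost nothing, and only the excess beyond ε is penalised. A minimal sketch, with hypothetical target and prediction values:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the excess deviation outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def total_loss(targets, predictions, epsilon):
    """Sum of epsilon-insensitive losses over a set of examples."""
    return sum(epsilon_insensitive_loss(t, p, epsilon)
               for t, p in zip(targets, predictions))


# Hypothetical claim sizes and predictions, with a tube half-width of 5.
print(epsilon_insensitive_loss(100.0, 104.0, 5.0))  # inside the tube: 0.0
print(epsilon_insensitive_loss(100.0, 110.0, 5.0))  # 5 units outside: 5.0
```

Training points with zero loss do not influence the fitted function, which is why only a subset of the data (the support vectors) determines an SVR model.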
The geometrical margin corresponds to the shortest distance between the closest data points (support vectors) and the hyperplane, and SVR aims to find the hyperplane that maximises this distance. The margin maximisation process increases the generalisability of the support vector machine (SVM). The SVM aims to maximise accuracy on the training data while retaining sufficient space for the correct prediction of future data. An SVM needs to solve an optimisation problem to find the maximum margin hyperplane, which requires the calculation of the dot product in the feature space.
The mapping of the data vector to the feature space does not have to be defined explicitly. It is sufficient to design it to facilitate the calculation of the dot product in terms of the input space variables, i.e., the dot product derived from the feature space is represented by a kernel function (which meets Mercer's condition) in the input space. This procedure is known as the kernel trick, which allows calculations to be made in the input space instead of calculating the dot product in the feature space. The most frequently used kernel functions are as follows:

Thus, the problem of finding the maximum margin hyperplane is converted into the following dual quadratic optimisation problem:

where $\alpha_i$ are Lagrange multipliers, $n$ is the number of training examples and $C$ is a parameter, which is adjusted to trade off margin maximisation against regression error minimisation.
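For reference, standard textbook forms of commonly used kernel functions and of the ε-SVR dual problem are sketched below. The notation (kernel parameters $\gamma$, $r$, $d$ and the paired multipliers $\alpha_i$, $\alpha_i^*$) is an assumption made here and may differ from the authors' own formulation.

```latex
% Commonly used kernels
K(x_i, x_j) = x_i^{\top} x_j                                   % linear (dot)
K(x_i, x_j) = \left(\gamma\, x_i^{\top} x_j + r\right)^{d}      % polynomial
K(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right)  % radial basis function (RBF)
K(x_i, x_j) = \tanh\!\left(\gamma\, x_i^{\top} x_j + r\right)   % sigmoid

% Dual problem of epsilon-SVR
\max_{\alpha,\,\alpha^*}\;
  -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}
     (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, K(x_i, x_j)
  \;-\; \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*)
  \;+\; \sum_{i=1}^{n} y_i\, (\alpha_i - \alpha_i^*)

\text{subject to}\quad
  \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0,
  \qquad 0 \le \alpha_i,\ \alpha_i^* \le C,\quad i = 1, \dots, n
```

The data enter the dual problem only through $K(x_i, x_j)$, which is what makes the kernel trick possible. Note that the sigmoid kernel satisfies Mercer's condition only for some parameter values.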
In classic SVR, the proper value for the parameter ε is difficult to determine beforehand. This problem is partially resolved in the nu-SVR (ν-SVR) algorithm, in which ε itself is a variable in the optimisation process and is controlled by another parameter ν ∈ (0,1). Parameter ν is an upper bound on the fraction of error points (points lying outside the ε-insensitive tube) and a lower bound on the fraction of support vectors. Thus, a good ε can be found automatically by choosing ν, which adjusts the accuracy level to the data at hand. This makes ν a more convenient parameter than the one used in ε-SVR. Fan, Chen, and Lin (2005) proposed an SVM learner that has been implemented in the LIBSVM software since version 2.8 (Chang & Lin, 2011). LIBSVM supports both ε-SVR and ν-SVR. In this article we used that SVM learner.
In this article we used the open source data mining platform RapidMiner (RM) (www.rapidminer.com) as a tool for k-means clustering, as well as for SVR and KLR. In RM, KLR is implemented as a Java implementation of the MyKLR programme, and SVR is implemented as a LIBSVM learner. To evaluate the models we used the RM Split Validation operator (split ratio = 0.7) with stratified sampling. It randomly splits the example set into a training set and a test set.
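The idea behind a stratified 70/30 split can be sketched in plain Python as follows. This is an illustration of the principle on a binary label, not a reproduction of the RM operator; the function name `stratified_split` and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, train_ratio=0.7, seed=0):
    """Split records so that each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)  # random order within each stratum
        cut = round(len(group) * train_ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical portfolio: 100 policies, 30 of them with a claim.
policies = [{"id": i, "claim": int(i < 30)} for i in range(100)]
train, test = stratified_split(policies, label_of=lambda r: r["claim"])
print(len(train), len(test), sum(r["claim"] for r in train))
```

Stratification matters most when one class is rare, because a purely random split could leave the test set with too few examples of that class.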

A case study
In this section, we describe a case study of risk assessment in car insurance using data mining techniques. We describe the data used and the results of applying data mining techniques to them. We compare and discuss the results, and point out the advantages and disadvantages of the data-driven approach compared to standard methods for risk assessment.

Description of data
In the research we used motor third party liability data for 2009, 2010 and 2011 from the insurance company Sava Montenegro. The data include 35,521 policies, of which only 3528 have total claim sizes other than zero. We took the appropriate number of policies without any claims. The size of this data-set is 7285 records. We used the following policy data: region, age, sex, type of vehicle, number of claims per policy, years of policy ownership, number of insured cases per user and average claim size.
In the preparation process, the data were cleaned. We removed records with unknown values and categorised age into Old (over the age of 65), Young (up to the age of 25) and Middle (aged 25-65). Policies with extremely low and extremely high average claim sizes were removed. Because of the regression methods, categorical variables with multiple categories were replaced with dummy (indicator) variables. Some of the parameters that describe the initial data-set are presented in Table 1.
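The age categorisation and dummy coding described above can be sketched as follows. The handling of the boundary age 25 (assigned to Young here) and the column naming scheme `is_<category>` are assumptions made for illustration.

```python
AGE_GROUPS = ["Young", "Middle", "Old"]


def age_group(age):
    """Map age in years to a category; age 25 is treated as Young (assumption)."""
    if age <= 25:
        return "Young"
    if age <= 65:
        return "Middle"
    return "Old"


def one_hot(value, categories):
    """Dummy (indicator) coding: one 0/1 column per category."""
    return {f"is_{c}": int(value == c) for c in categories}


for age in (22, 40, 70):
    group = age_group(age)
    print(age, group, one_hot(group, AGE_GROUPS))
```

In a regression with an intercept, one category is usually dropped as the reference level to avoid perfectly collinear columns; the sketch keeps all columns for clarity.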

Clustering of data and estimation of claim size
As the initial data-set for claim size prediction, we took only policies with one or more claims (3021 records). In order to obtain homogeneous groups of policies, this data-set was clustered. The result of the clustering is 12 clusters.
We then applied SVR to the clusters defined in this way. The target variable is Avg Claim Costs, while the remaining 10 variables from the initial data-set are predictor variables.
To define the SVR models, we used a 'grid-search' strategy for optimisation of the parameters (gamma, C, nu and epsilon). We implemented it using the RM operator Optimise Parameters (Grid). We divided the data within each cluster into a training set and a test set in the proportion 70%:30%. The performance vector is obtained from the test data-set (unknown data). The defined SVR models and their predictive performance are presented in Table 2.
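A grid-search strategy simply evaluates every combination of candidate parameter values and keeps the best one. The sketch below shows the principle with a toy error function standing in for "train an SVR model and measure its error on held-out data"; the grid values and the function `toy_error` are hypothetical and are not the settings used in the case study.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the one with the lowest error."""
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_error(params):
    """Stand-in for a validation error; smallest at C = 10 and gamma = 0.1."""
    return abs(params["C"] - 10) + abs(params["gamma"] - 0.1)


grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best, err = grid_search(grid, toy_error)
print(best, err)
```

The cost of the search is the product of the grid sizes, so coarse grids on a logarithmic scale are a common starting point.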
The performance vectors in Table 2 show that relative errors are less than 10% for most of the clusters. Cluster 9, with the highest claim sizes and a small number of policies (only 23), has the largest relative error (14.00% +/- 5.94%). Table 3 shows the results we obtained by applying the SVR models to the full clusters (which include the 30% of unknown data). Average claim costs per cluster are presented in the first column. The model provided deviations lower than 10% for 69% of the data, while for 95% of the data deviations were lower than 20%. Policies with an unknown claim size can be assigned to the appropriate clusters based on the probability of belonging. This probability can be determined using logistic regression based on relation (5).
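The two kinds of figures reported here, relative error and the share of records whose deviation stays below a threshold, can be computed as in the following sketch. The actual and predicted values are hypothetical and serve only to show the calculation.

```python
def relative_errors(actual, predicted):
    """Absolute deviation of each prediction as a fraction of the actual value."""
    return [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]


def share_within(errors, threshold):
    """Fraction of records whose relative error is below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)


# Hypothetical average claim sizes and model predictions.
actual = [1000.0, 2000.0, 500.0, 800.0]
predicted = [1050.0, 1700.0, 520.0, 790.0]

errors = relative_errors(actual, predicted)
print(share_within(errors, 0.10), share_within(errors, 0.20))  # prints: 0.75 1.0
```

The relative error is defined only for non-zero actual values, which holds here because the clusters contain only policies with claims.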
The SVR models in Table 2 give $\mathrm{PredictedClaimSize}_i$ inside the clusters. According to relation (1), this size is one of the factors necessary for calculation of the expected claim size, i.e., the net risk premium. The second factor is the probability of claim occurrence, for which the model is defined in the following section.

Estimation of claim occurrence probability
To predict the occurrence of at least one claim we used logistic regression. The target variable is Number of Claims (0/1), whose values are 0 or 1 (for policies with one or more claims the value is set to 1). For this analysis, we used all 7,284 records. Using the KLR procedure (kernel type = 'dot', C = 1.0; these parameters were determined using the 'grid-search' strategy), we obtained the model in Table 4. Because a linear (dot) kernel was applied, the resulting weight coefficients can be interpreted as the intensity of the factor impacts on the target variable. Table 5 shows the performance of this model. It can be seen from the confusion matrix that policies with claims are predicted with an accuracy of 92.69%. In other words, out of 643 policies with claims in the test set, 47 are misclassified. The overall accuracy of the model is 76.70%.
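The 92.69% figure is the share of correctly classified policies within the class of policies with claims (often called class recall). It can be checked from the two counts given above; the function name `class_recall` is introduced here only for illustration.

```python
def class_recall(total_in_class, misclassified):
    """Share of a class that the model classifies correctly."""
    return (total_in_class - misclassified) / total_in_class


recall_claims = class_recall(643, 47)  # counts quoted in the text
print(f"{recall_claims:.2%}")  # prints: 92.69%
```

The overall accuracy of 76.70% is lower because it also counts the errors made on policies without claims, which this per-class figure does not include.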
According to relation (1), calculation of the net risk premium requires $P(\mathrm{ClaimOccur}_i)$. This value is calculated based on the model from Table 4 as well as on relation (2).

Analysis and discussion of the results
The standard method for calculating the premium for motor third party liability in Montenegro is defined by a system of premium tariffs adopted by the Montenegro National Bureau of Insurers. This system defines eight basic tariff groups, depending on the type of vehicle. Each of these groups is divided into subgroups depending on the engine power, the bearing capacity for cargo vehicles, the type of transportation for buses, the type of trailer and the purpose of special and work vehicles. For each of the subgroups, three bonus premium classes, one basic class and three malus classes are defined, i.e., seven premium classes, which altogether produces a large number of tariff classes. Given that the Montenegrin insurance market is quite small, only a very small number of policies fall into each tariff class. Average claim sizes for certain tariff classes are 0, which means that these classes contain no policies with claims. The question is whether such a small number of policies can provide satisfactory predictions. Under these conditions, tariff classes are unusable for estimation of claim size, i.e., for calculation of the net risk premium. It became obvious that more sophisticated methods have to be applied.
By applying the clustering method, insurance policies are classified into 12 homogeneous groups (with similar claim sizes) containing enough data for claim size estimation with high accuracy. Analysis of the prediction results shows that the majority of the models have accuracy from approximately 80% to 95%.
In practice, the claim size is not always known exactly (if a big accident occurs in December, the exact claim size will often not be known at the end of the year and perhaps not even at the end of the following year). In order to construct a new insurance tariff for the next year, in this case, a statistician will have to use appropriate predictions of the exact claim size. Hence, the empirical distribution of the claim sizes is, in general, a mixture of really observed values and of estimated claim sizes (Christmann, 2004).
SVR models in Table 2 have good predictive performance (relative_error < 10% for most of clusters) on the test data-set for which the claim sizes are unknown (the generality of the model is achieved by adjusting parameter C). So, SVR can be successfully applied to predict the unknown claim sizes on the small data-sets, too.
Model of linear regression on clusters, used by Yeo et al. (2001), provided for 57% of data deviations lower than 10%, while for as much as 90% of data, deviations were lower than 20%. Our model of SVR provided deviations lower than 10% for 69% of data, while for 95% of data deviations were lower than 20% (Table 3). In a previously mentioned paper they used the sample of 146,326 policies with claims. Our data-set contained only 3528 policies with claims. This shows that the method of clustering data and SVR can achieve good results on a small data-set.
The model for prediction of risky policies in Table 4 shows that the following have a positive impact on claim occurrence probability: years of policy ownership, young policy holders, motorcycles and towing vehicles. The model provided accuracy of around 80%, although the percentage of policies that are predicted as risk-free but are in fact risky is around 30%, which is quite a high percentage (Table 5). However, by analysing our data from 2011, we determined that out of 1182 policies with claims in that year, only 52 policies had claims in the previous two years. This means that 1130 out of 1182 (95.6%) risky policies were recognised as risk-free by the standard methods for premium calculation. These policies even included bonuses. Models obtained using data mining methods have significantly better accuracy than risk assessment based on premium tariff tables, which is the standard method for motor third party liability for domestic insurance companies.
Still, our approach can be used not to replace but to complement the traditional methods used in practice. This approach will be especially important from 2017, when premium liberalisation is planned to be introduced in the Montenegrin car insurance market.
However, this approach also has its own disadvantages. One of the main problems we noticed in this article is an insufficient quantity of data. In other words, on small markets such as the Montenegrin one, the number of policies is too small to obtain models with high accuracy. We also noted the lack of certain data which could have a significant influence on premium predictions. For example, in car insurance, if we had accurate data about policy owners, such as occupation, wealth, tendency to use alcohol, condition of health, habits, etc., the prediction itself would be more accurate. Models for prediction of claim occurrence, i.e., for risk classification, classify policies as risky or risk-free. However, within the risky policies there are levels of risk depending on the number of claims. A model which predicts the level of risk, i.e., classifies policies according to the number of claims, would be much more useful. But on a small initial data-set, as in our case, this prediction does not provide good results.
So, the quality and availability of data are the most important prerequisites for the success of the data mining process. The other problem that appears is the adequacy of the applied model, i.e., whether it is good enough for predictions on the specific data (Pichler, 2014). Selecting an appropriate model for a given data-set is a precondition for good results of the data mining process. The approach of clustering combined with SVM regression performs well on a small number of policies, such as in our example.

Conclusion
In this article we have discussed methods for risk assessment in car insurance. Standard methods imply classification of policies into a large number of tariff classes and calculation of premiums based on them. Using data-driven methods, it is possible to get better results in risk assessment and premium estimation.
On the case study data we showed that nonparametric data mining methods can predict claim sizes and the occurrence of claims with better accuracy than the standard methods, and this represents the basis for calculation of the net risk premium. In this approach, the level of premium is not determined based on tariff classes. Using the clustering method we classified policies into groups with the same level of risk, without fragmenting the data into too many small groups. We predicted the expected claim size using the SVR method and claim occurrence using KLR. We achieved prediction accuracy of around 80% or more on a small data-set in which 30% of the policies have unknown claim sizes.
The main advantage of the proposed approach, even in a small data-set, is its good predictive performance for unknown claim sizes, which is common in car insurance. Also, this research is important for the Montenegrin insurance companies, due to expected premium liberalisation in 2017, because they will be able to use their own methods for risk prediction.
The proposed approach has its drawbacks, which are reflected mainly in the lack of data on a small and still underdeveloped market such as the Montenegrin one.
Some future research could determine how much the use of some other data mining techniques would contribute to better results in car insurance risk assessment, especially on small data-sets, where the dependencies are harder to notice. Analysis and prediction of customer loyalty (churn prediction) in car insurance, using data mining techniques, would also contribute to better risk assessment.