Bayesian regularized artificial neural networks for the estimation of the probability of default

Artificial neural networks (ANNs) have been extensively used for classification problems in many areas such as gene, text and image recognition. Although ANNs are popular also to estimate the probability of default in credit risk, they have drawbacks; a major one is their tendency to overfit the data. Here we propose an improved Bayesian regularization approach to train ANNs and compare it to the classical regularization that relies on the back-propagation algorithm for training feed-forward networks. We investigate different network architectures and test the classification accuracy on three data sets. Profitability, leverage and liquidity emerge as important financial default driver categories.


Introduction
Credit scoring became popular in the USA during the 1950s. The boosting economy in the next two decades required the need for accessible credit and it was during this period when the methods used for automated credit scoring became more advanced (Tufféry 2011). However, the origin of credit scoring goes back to the early 1940s in the USA, when it was initially applied to differentiate between good and bad customers (Durand 1941).
Among the many options offered and investigated in the literature for credit scoring, artificial neural networks (ANNs) are a flexible and rich concept to solve not only classification problems but also to offer solutions to clustering, time series and function approximation problems (Bell 2015). The flexibility of ANNs inspired researchers to investigate their applicability to classification tasks. Recently, an extensive research has been conducted to utilize and apply ANNs for corporate credit scoring given the large amount of financial data collected. Heaton et al. (2016) and Pérez-Martín et al. (2018) advocated the extensive use of ANNs with many layers, the so-called deep learning approach. Furthermore, Bonini and *Corresponding author. Email: eduard.sariev.11@ucl.ac.uk Caivano (2018) showed that artificial intelligence methods including ANNs outperform traditional statistical methods. Nonetheless, the performance advantages of ANNs are questioned by Kalayci et al. (2018) and by Addo et al. (2018), who show that ANNs underperform when compared to respectively logistic regression and decision trees.
Here we focus on the overifitting issue of ANNs. Several recent studies have been devoted to this problem. Zhang et al. (2018) investigated various ways of detecting overfitting in an ANN and advocated splitting the data into training and validation as a main way of dealing with overfitting. Using a genetic algorithm, Nicolae-Eugen (2016) prevented overfitting in an ANN by encoding the weights of the ANN into binary chromosomes and applying high-probability mutation in the genetic algorithm. A different approach to avoid overfitting was proposed by Vincent et al. (2010). They applied a drop-out strategy combined with a stacked denoising autoencoder to reduce overfitting. They found that this strategy outperforms a single drop-out strategy and is computationally more efficient. One of the reasons for overfitting is the noise in the training data. In this context, Hindi and Al-Akhras (2011) recommended smoothing the decision boundaries by eliminating border instances from the training set before training an ANN. This is achieved by using a variety of instance reduction techniques.
In contrast to these studies on overfitting in ANNs, we take a Bayesian approach to solve the issue. Bayesian estimation in ANNs has become applicable only since the advancement of computational power has allowed its use. Initially, Bayesian learning in ANNs was used to offer a solution for creating an optimal network architecture. For example, Neal (1992) explored the difficulties related to the selection of the prior knowledge as well as the problems associated with the computation of the posterior distribution. Neal (1996) studied the effect of using different priors for the estimation of the network weights. Rasmussen (1996) investigated how to estimate the weights of a network using dynamic simulation. Furthermore, Lampinen and Vehtari (2001) applied an ANN with Bayesian learning to regression and classification. Titterington (2004) reviewed the various approaches taken to determine the network architecture, involving the use of Gaussian approximations and of non-Gaussian but deterministic approximations called variational approximations.
The Bayesian estimation of an ANN for credit scoring implies that the optimal architecture of the neural network is important to the performance because the architecture greatly impacts the estimation efficiency of the network (Heaton et al. 2016). However, in this study, we focus on the Bayesian regularization of the network in order to avoid overfitting. We compare our approach to the classical regularization approach examined in Ashiquzzaman et al. (2017). For that reason, we report our results as an average performance over a range of different network architectures, i.e. combinations of layers and neurons. Overfitting is one of the main challenges faced by statisticians today. The volume and the complexity of the data increase every year, which requires special attention to not overfit the classification algorithm. In this work, we use a combination of different network architectures combined with early stopping (ES) and regularization to tackle the problem of overfitting. We define ES as the process where we monitor the test error in n consecutive runs, while training the network. If the test error increases n times then the training of the network is terminated. We used n = 6, which is a typical choice for most classification problems.
Drawing on the studies devoted to Bayesian learning, in this paper, we improve an approach recommended by MacKay (1992) to estimate the regularization parameters. MacKay (1992) proposed a Gauss-Newton approximation to the posterior distribution of the regularization parameters. In this Gauss-Newton approximation, an objective function with parameters α and β is maximized. MacKay (1992) proposed an iterative solution for α and β by applying the Levenberg-Marquardt algorithm. We on the other hand apply a MCMC scheme to estimate the regularization parameters. In our approach, α and β are considered random variables and are based on the mean of a posterior distribution. Finally, we compare the improved Bayesian regularization approach to classical regularization and to Bayesian regularization based on the Gauss-Newton approximation. With respect to the above discussed articles on ANNs, we contribute to the literature first by proposing an update to the estimation of the regularization parameters and secondly by exploring classical and Bayesian regularization in the estimation of a network with different architectures.
The rest of the article is organized as follows. Section 2 presents the theoretical formulation of an ANN in a classical and in a Bayesian framework. Section 3 presents the results from the regularized networks. Section 4 discusses the policy implications of the selected default factors and their business intuition. Finally, Section 5 concludes the paper by summarizing the main findings.

Theoretical foundations
In theory, there are several neural network architectures. In practice, most researchers (Demuth et al. 2014) focus on three main types: feed-forward, competitive and recurrent networks. While competitive and recurrent networks are definitely an interesting area of research, in this article, we explore the most popular kind of network architecture, the feedforward network. It is called a feed-forward network because data moves in forward direction only: initially, the data input is processed in the first layer of the network, then it is pushed forward to the next layer until it reaches the final output layer. In a feed-forward network, data is not fed back from a layer to the previous, which instead happens in a recurrent network. A detailed description of a feed-forward network is given in the next section.

Feed-forward neural network architecture
In this section, we briefly introduce the most basic theoretical concepts behind an ANN. A detailed discussion is given in Kim et al. (1996). A multilayer ANN can be described as a system with the following elements: (i) An input data vector x ∈ R p and a categorical variable y ∈ {0, 1}. (ii) An output y = P(Y = 1 | X = x). (iii) Layers k = 1, . . . , l with m units per layer; the layers with k < l are hidden, the layer l is the output layer. Each layer has a bias b k ∈ R and each unit has an activation h k i ∈ R. The units in layer k are connected to those in the previous layer by weights w k ij ∈ R, i, j = 1, . . . , m, k = 1, . . . , l. (iv) A previous layer is defined as layer k − 1 with respect to layer k. (v) The individual inputs x ∈ R p are each weighted by weights w k ij . Each neuron i is weighted in each layer k. (vi) The final output has a bias b l+1 ∈ R and is connected to the units of the output layer by weights w l+1 j ∈ R. (vii) An activation function s k i (.) for layer k and unit i. An activation function determines how each node reacts in an ANN and what output each node generates. This output is then used as input for the next node in an iterative procedure until the estimation process converges to a local or global optimum. The most popular choices of activation functions are the logistic sigmoid and the hyperbolic tangent (Farhadi 2017).
Below we present the sequence in which the estimation of the network weights is performed. The first step in the estimation process of the network weights w is to feed data into the first layer of the network. The unit activations of the first layer are computed from the input data as where s 1 i (·) is an activation function. After receiving the output from the first layer, we can proceed with the estimation of the second layer activation functions. The unit activations of the next layer are computed from those of the previous layer as After reaching the final output by sequentially moving through each hidden layer k, the output probability is estimated as In the estimation process described above the activation function s(·) plays a vital role. In our analysis, we apply the logistic function which is the most common non-linear activation function.
The above estimation process can be described as a learning process where the weights of the network are estimated through learning from the data. In particular, the weights w k ij for layer k and neuron i in the neural network are estimated sequentially and iteratively. Afterwards, the network performance with weights learned from the training data is monitored on test data.
In order to estimate the network weights a cost function is required. The purpose of the cost function is to serve as an objective to be minimized during the learning process. A typical choice of a cost function is the mean squared error (MSE) where N is the number of observations, i.e. the number of input data vectors and categorical variables. Another popular cost function is the cross entropy (CE) where p i and q i are discrete probabilities. A common issue in estimating network weights is the overfitting of the network, upon which the network cannot generalize well and subsequently the network performance on new data is poor. When overfitting occurs, the network weights are calculated in a way that maximizes the network performance on the training data but this is achieved through significantly decreasing the performance on the test data. The most common way of solving the overfitting issue that occurs in the estimation process is applying regularization during the estimation (Deng et al. 2014). Regularization can be applied to penalize the cost function with the squared sum of the weights so that the generalization performance of the network is maintained. For the MSE cost function, this can be written as where γ ∈ (0, 1) is a regularization constant. Usually the backpropagation algorithm, Dreyfus (1990) is used to estimate the weights. A common optimization algorithm used to make the estimation procedure converge is the gradient descent algorithm. Although classical regularization as described above works adequately, in this paper, we recommend a Bayesian approach to regularization which we describe in the next section. We advocate that the Bayesian approach to regularization allows for more flexibility by reducing the bias inherent to classical regularization (through the choice of the regularization constant) and therefore leading to a higher performance.

A Bayesian approach for feed-forward neural networks
After we have explained what a feed-forward network is, in this section, we present the theory behind our proposed approach to regularization. The networks are trained using supervised learning, with a training data set of inputs and targets D = {(x 1 , y 1 ), (x 2 , y 2 ), . . . , (x N , y N )}. We choose an interpolating function of the form where φ h (x) are basis functions and w h are coefficients inferred from the data. We assume that the targets are generated by where g(x i ) is an unknown function and i are independent Gaussian random variables with average zero and variance σ 2 . The initial objective of the training process is to minimize the sum of squared errors where y i represents the neural network response to observation i. An extensive work on Bayesian estimation and regularization has been done by MacKay (1992). In summary, the Bayesian regularization requires the Hessian matrix of the objective function. For the MSE cost function and regularization by the sum of squared weights, it follows that the Hessian matrix is a quadratic function and can be approximated using the Levenberg-Marquardt algorithm (Gill and Murray 1978). The objective function becomes ( 1 1 ) where E W was defined in equation (7), and α and β are objective function parameters.
In the Bayesian framework (Foresee and Hagan 1997), the weights of the network are considered random variables. Given the data, the probability density function of an array w of network weights is where M is the particular neural network model used; is the prior density, which represents our knowledge of the weights before any data is collected; is the likelihood function, which is the probability of the data occurring given the weights; f (D | α, β, M ) is a normalization factor, which guarantees that the total probability is 1. Under the assumption of Gaussian noise, the probability of the data given the parameters w is where The density of the prior can be written as where Z W (α) = exp(−αE W ) dw. If equations (13) and (14) are substituted into equation (12), we obtain (15) where Z F (α, β) = exp(−F) dw. In this Bayesian framework, the optimal weights should maximize the posterior probability.

Optimizing the regularization parameters
After we showed that the weights are a function of the parameters α and β, we optimize the latter using Bayes' theorem, If a uniform prior density f (α, β | M ) is taken for the regularization parameters α and β, then maximizing the posterior is achieved by maximizing the likelihood function f (D | α, β, M ). This likelihood function is the normalization factor in equation (12). Since all probabilities have a Gaussian form, the posterior can be expressed as Z D (β) and Z W (α) are known from equations (13) and (14). Z F (α, β) can be expanded in a Taylor series. Since the objective function has a quadratic shape in the surrounding of the minimum, we can expand Z F (w) around the minimum point of the posterior density w MP , where the gradient is zero. We refer to w MP as the most probable interpolant and therefore F can be written as where It follows that Z F is a Gaussian integral that can be expressed as Thus we can rewrite the log evidence for α and β as Notice that this expression contains the logarithm of the Occam factor (2π) k/2 (det H) −1/2 /Z W (α), which can control the overfitting. Substituting Z D from equation (13) and Z W from equation (14), We differentiate the log evidence with respect to α and β to find the condition that is satisfied at the maximum. Differentiating with respect to α and setting the result equal to zero gives differentiating with respect to β and setting the result equal to zero gives Here γ = k − 2α MP tr H −1 MP is the effective number of parameters and k is the total number of parameters in the network. The parameter γ is a measure of how many parameters in the neural network are effectively used in reducing the error function; it can range from zero to k.
Summarizing, the steps required for the Bayesian optimization of the regularization parameters with a quadratic approximation of the Hessian matrix are: (i) Initialize the parameters α, β and the weights w.
(ii) Take one step of the Levenberg-Marquardt algorithm to minimize the objective function F(w) = αE W + βE D . (iii) Compute the effective number of parameters γ = k − 2α tr H −1 using the Gauss-Newton approximation of the Hessian available in the Levenberg-Marquardt training algorithm, where J is the Jacobian matrix of the training set errors. (iv) Compute new estimates of the objective function

Markov chain Monte Carlo estimation of α and β
We propose an improvement on the estimation of the regularization parameters developed by MacKay (1992). We advocate applying a Markov chain Monte Carlo (MCMC) scheme to estimate α and β rather than approximating Z F (α, β) and consequently approximating the Hessian matrix to estimate the parameters α, β. Collecting these paramenters into the two-dimensional vector x and indicating their estimate with X, the MCMC method (Gelfand and Smith 1990) can be described as We apply a standard normal prior distribution to the MCMC scheme.

Application of neural networks to financial data
As discussed in the literature, neural networks are a powerful concept that can be applied to different problems ranging from function approximation to clustering. There are many articles devoted to the comparison of neural networks to each other or to other algorithms. Specht (1990) investigated probabilistic neural networks; Wang and Peng (2000) explored vector-quantization networks; Stallkamp et al. (2012) compared convolutional neural networks with linear discriminant analysis and decision trees. In contrast to the above studies, in our analysis, we focus on the concept of regularization and how it is applied in the context of neural networks. We test the performance of feed-forward networks with and without regularization. Furthermore, we combine the regularization with ES. However, the main focus of the analysis is on the Bayesian approach to regularization for neural networks. In contrast to the classical approach to regularization, in the Bayesian approach, the regularization parameters are inferred from the data. We propose an improvement of the Bayesian estimation over the one suggested by MacKay (1992). Our estimation approach provides objectivity to the estimation and reduces the bias. We now apply the methodology proposed in Section 2 to three different data sets on (1) corporate obligors based in Eastern Europe, (2) corporate obligors based in Poland, and (3) retail obligors based in Germany. We use data on corporates from Poland and Eastern Europe because these are developing markets where the relations between the risk factors and the default event are not yet well investigated. In a developing market, the group of default drivers could be significantly different from what is observed in a developed market. By using data from developing markets, we try to find out whether the default drivers in these markets are significantly different from the default drivers in developed markets. Finally, we examine retail data from a developed market to check whether the default identification of our proposed algorithm is adequate on a data set that is not corporate.

East-European data set
The data set contains information for 7996 observations on 33 independent variables (covariates or features) and on 1 binary target variable which indicates whether a default occurred one year after the issue of the financial statement. The 33 covariates are constructed based on data from the entity's financial statements. These financial ratios are split into several groups and further analysed. For the feature names and construction refer to Appendix 1, table A1. The data are on an annual basis from the period 2007 to 2012. The data set is not publicly available, but the authors can share the data set if requested.

Polish data set
This data set is publicly available (Tomczak 2016) and was collected from Emerging Markets Information Service, which is a database containing information on emerging markets around the world. The bankrupt companies are analysed in the period 2000-2012, while the still operating companies are evaluated from 2007 to 2013. The data set has 5910 observations on 64 independent variables. The default indicator shows the bankruptcy status after 1 year. For the feature names and construction, refer to Appendix 1, table A2.

German data set
Also this data set is freely available (Hofmann 1994). It contains retail data for German credit borrowers with 1000 observations on 20 independent variables (covariates or features) and on 1 binary target variable which indicates the presence of default. The data set contains categorical and numerical variables. Following Agresti (2019), who explains that the choice of scores for the categorical variables has little impact on the final result, for clarity and simplicity we transform the categorical variables on a numerical scale by mapping them to integer numbers corresponding to the level of each category. For example, a categorical variable with categories small, medium and large is mapped to an integer variable with values 1, 2 and 3, respectively. After that all the variables (continuous and categorical) are standardized. There are no missing values. For the feature names and construction refer to Appendix 1, table A3.

Feature selection
The literature offers a variety of algorithms for variable selection such as filter and wrapper methods. However, the main goal of our analysis is to examine the effect of Bayesian regularization on ANNs. Therefore, we apply a simple approach to variable selection based on the 80% percentile of the vector containing the absolute value of the correlation with the target variable. We select only variables whose correlation is equal or above the 80% percentile of the vector containing the absolute value of the correlation with the target variable. This leads to a balanced number of variables that are shown in table 1. Nonetheless, not to bias our results based on a single combination of variables, we report our results for different number of variables by changing the percentile value from 0% to 90%, see Appendix 2. This is consistent with the principle applied in Sariev and Germano (2019), where model performance is assessed comparing a model on a different set of variables.

Results
Table 2 presents eight different feed-forward neural network architectures. Prior to applying the networks on the data, we need to make a choice on the number of neurons and the number of layers for each network. Determining the number of neurons and layers is driven by many factors such as the number of variables in the model, the number of data points, etc. In order to avoid reporting biased results, we run each network on a range of different combinations of layers and neurons. This allows us to monitor the performance of the network over different network architectures and to summarize the network performance. Although there is no clear rule on selecting the number of neurons and layers, we follow Demuth et al. (2014) who argue that the number of neurons should be lower than the number of variables used in the network. Furthermore, Table 1. Selected variables by data set based on the 80% percentile of the correlation to the target variable.

Data set Selected variables
East-European data payables turnover, return on assets, cash ratio, income from sales/total assets, liquid assets/total assets, interest coverage Polish data total costs/total sales, (sales − cost of products sold)/sales, profit on sales/sales, working capital, logarithm of total assets, sales(n)/sales(n − 1), sales/inventory, working capital/total assets, sales/receivables, short-term liabilities/total assets, total liabilities/total assets, sales/total assets German data duration in months of the account, credit history, checking account status the number of hidden layers should not be more than two to three because most problems are tackled even with one hidden layer. Adding many hidden layers on small data sets (less than one million observations) does not result in a better performance. Therefore, we decide to report the performance of the network on a combination of neurons that range from 1 to 25 and hidden layers that range from 1 to 3. We investigate combinations with more neurons than Demuth et al. (2014) suggested so that our results are more comprehensive. However, we show that increasing the number of layers does not lead to a higher performance; see Section 3.6. We acknowledge that our decision on the number of neurons and layers is subjective, but we aim to cover a wide enough range of neurons and layers so that our results are less biased than using just a single combination of hidden layers and neurons. For classical regularization, the regularization parameter should be determined before applying the network to the data. Based on 10-fold cross-validation, we estimated the parameter γ for classical regularization, see equation (7), to be 0.05. All networks are trained with the MSE loss function. One third of the data is left for testing the networks, two-thirds of the data are allocated for training and validating.
Following the above logic, we report our results in table 2. The first column shows the architecture type; the second column shows the regularization type: classical, Bayesian or Bayesian based on MCMC; the third column indicates whether ES was applied or not. The fourth to seventh columns give the mean percentage of correctly classified observations, the percentage of non-defaulted obligors, the percentage of defaulted obligors and the Gini coefficient on the grid of 1-25 neurons and 1-3 hidden layers. All results are reported on test data. Finally, the eighth column presents the average CPU time in seconds to compute a network; the CPU was an Intel Celeron N2840 with 2.16 GHz.
Based on table 2, we observe that: (i) The percentage of overall correctly classified obligors is the highest for a network architecture where the regularization parameters are estimated by the Bayesian approach with the proposed MCMC estimation rather than the Gaussian approximation. (ii) In some cases, the ES procedure can lead to a better performance but in other cases ES undermines the network performance. (iii) The computational time needed for the MCMC estimation of the network is significantly higher than for the other networks but the Bayesian estimation automatically estimates the regularization thus reducing the bias.
The above observations are valid for all data sets. Below we examine the results for each data set separately.
(i) For the East-European data Bayesian regularization with MCMC leads to the highest overall performance. The improvement in performance is 4% which is high enough to make a difference from a practical point of view. In terms of identification of bad obligors and Gini coefficient, Bayesian regularization with MCMC performs similarly to classical regularization. (ii) For the Polish data Bayesian regularization with MCMC leads to the highest overall performance. The improvement in performance is 1%, which can be ignored for practical purposes. However, in terms of identification of bad obligors and Gini coefficient, Bayesian regularization with MCMC performs significantly better than the other methods. (iii) For the German data, Bayesian regularization with MCMC leads to the highest overall performance. The improvement in performance is 2%, which may make a difference in a situation where overall performance is of utmost importance. However, in terms of identification of bad obligors and Gini coefficient, Baeysian regularization with MCMC under-performs compared to the other methods.
The results in table 2 are based on the 80% percentile of the correlation to the target variable. In Appendix 2, one can see the results for the other combinations of variables but the conclusion stays the same. In all cases, Bayesian regularization with MCMC overall outperforms the other methods.
Finally, in table A4 in Appendix 2, we apply a two-sample t-test on the overall performance of our proposed method namely architectures 7 and 8. The two-sample t-test is a parametric test that compares the location parameter of two independent data samples. The statistics of the test follows a Student's t distribution. The null hypothesis states that the means of two populations are equal. In table A4, 1 indicates a rejection of the null hypothesis. Therefore, we can claim that our results are statistically different for each data set. Figure 1 presents distributions of the overall correctly classified obligors per data set and per network architecture that are shown in table 2. We can see from figure 1 that the distribution of the overall correctly classified obligors for architectures 7 and 8 is right skewed for the East-European and Polish data. The results in figure 1 are based on the 80% percentile of the correlation to the target variable. In Appendix 2 one can see the results for the other combinations of variables but the conclusion stays the same. In all cases, Bayesian regularization with MCMC overall outperforms the other methods.

Neural network performance on increasing the number of layers
Inspired by the flexibility of the deep neural network paradigm, we tried to increase the number of layers with the goal of increasing the performance on the test data. However, contrary to our expectations, the generalization power of the network decreased for each data set, as can be seen from figure 2. The decrease in performance is different for each data set. For the Polish and German, the decrease of performance is not significant but for the East-European data the decrease is significant. The reason is that our data sets are not big enough to allow the application of many layers. The networks with more than 4 layers significantly overfit the data.

Comparison to other classification algorithms
The main objective of our research is to analyse the effect of Bayesian regularization and compare it to classical regularization for ANNs. We report our results as an average over different combinations of layers and neurons. Therefore, we do not report the maximum classification accuracy that can be achieved rather we aim to present the effect of Bayesian regularization with MCMC over different network architectures and advocate that on average our proposed approach leads to higher performance when compared to other regularization approaches for ANNs. However, for the purpose of completeness, we apply two other non-linear classification methods to the three data sets. The first is a support vector machine (SVM) and the second is k-nearest neighbours (KNN). The results in terms of classification accuracy are shown in table 3.
Overall the performance is similar to our proposed method. On the Polish corporate data SVM and KNN outperform but as we emphasized before the results we report in table 2 are averaged over a grid of different neurons and layers and therefore are not directly comparable to the results in table 3. Therefore, the performance reported in table 2 is not the    highest that could be achieved using Bayesian regularization but this average performance is close to the maximum performance we achieve when we apply SVM and KNN to the data.

Policy implications
Identifying a classification method to estimate the PD is an important factor but equally important is deriving business intuition from the selected default factors. Typically, PD models are used by a non-technical audience and the interpretation of the default factors from an industry prospective is of utmost importance. For that reason, we split the selected ratios into three categories † (i) Leverage category: ratios that signal how much debt and debt-related costs a company utilizes against its equity or assets. Effectively this category indicates the level of indebtedness of a company. (ii) Profitability category: ratios that signal the ability of a company to generate income relative to its equity or assets. Effectively this category indicates how efficiently a company utilizes it assets. (iii) Liquidity category: ratios that signal a company's ability to meet the current liabilities when they become due with its current assets. Effectively this category indicates the ability of a company to pay off its short-term obligations. Table 4 presents the allocation of the selected ratios from the variable selection method on the two corporate data sets (East-European and Polish) to each of the above three categories. As can be seen from table 4, the default risk in the Polish data set is driven mainly by the profitability ratios, followed by the leverage ratios. Liquidity ratios do not play an important role in determining the default risk of the Polish obligors. We compare our approach with that of Liang et al. (2016) where they split the financial ratios into several groups namely: solvency (leverage) ratios, profitability ratios, capital structure ratios, cash flow ratios, ownership ratios, turnover ratios, and growth ratios. They found that leverage and profitability ratios are the most important categories in identifying defaults. Interestingly, they have used data from the Taiwan Stock Exchange. The fact that their findings align with ours proofs the significance and the universality of the leverage and profitability ratios. Another study by Al-Kassar and Soileau (2014) also indicates the importance of profitability and leverage ratios through the use of factor analysis. However, they advocate that non-financial data are also important in identifying and measuring default risk. Furthermore, a study by Chen et al. (2011) emphasizes the role of the profitability and leverage ratios. The analysis is done on 20 000 solvent and 1000 insolvent companies. Their study applies SVM on German companies and shows the importance of profitability and leverage in identifying defaults.
Similarly the default risk in the East-European data set is driven by profitability and leverage ratios but it is also driven by liquidity ratios. We compare our approach with that of Marilena and Alina (2015) where liquidity and leverage ratios are identified as a major default driver. Their work applies † Payables turnover = supply payables × 360/cost of goods sold (East-European data) and working capital/total assets as well as logarithm of total assets (Polish data) cannot be allocated to these three groups. Table 4. Selected ratios based on the 80% percentile of the correlation to the target, allocated into three main financial categories: leverage, profitability and liquidity on the East-European (E) and on the Polish (P) data.

Category Ratio
Leverage ratios interest coverage (E), short-term liabilities/total assets (P), total liabilities/total assets (P), total costs/total sales (P) Profitability ratios return on assets (E), income from sales/total assets (E), sales/total assets (P), sales/inventory (P), sales/receivables (P), sales(n)/sales(n − 1) (P), profit on sales/sales (P), (sales − cost of products sold)/sales (P) Liquidity ratios cash ratio (E), liquid assets/total assets (E), working capital (P) multiple discriminant analysis, logistic regression and ANNs. The data used are from the Bucharest Stock Exchange principal market. Moreover, a study by Tian et al. (2015) also indicates the importance of liquidity and leverage ratios. They use North American financial data on corporate obligors and apply the LASSO method for variable selection. Finally, Tian et al. (2015) claim that their approach is superior to the one given by the popular distance to default model proposed by Merton (1974). We note that the major difference in default drivers between the East-European data and the Polish data is the higher importance of liquidity ratios for the former, the reason being that Polish companies are in general bigger and liquidity is not a major indication of default risk. Practically, larger companies have access to cheaper funding, whereas smaller companies incur higher funding costs.
Due to the low number of selected features in the German retail data set, we are not able to allocate them into different groups. However, most of the variables in the German retail data are based on the status and duration of the current account. This is aligned with the study of Barrell et al. (2010), which shows evidence that the status of the current account is a major predictor of mortgage defaults.

Discussion and conclusions
In this paper, we propose an improvement of a Bayesian approach to regularize feed-forward neural networks. The Bayesian approach is attractive because it provides automatic determination of the regularization parameters. Moreover, we demonstrate that the improved Bayesian approach performs well when compared to the classical regularization approach for neural networks. We find that using a MCMC scheme to estimate the Bayesian regularization parameters leads to higher performance than using a Gauss-Newton approximation. Furthermore, the application of early stopping on the network does not guarantee higher performance.
We analysed three data sets; two are corporate and one retail. From a policy prospective, three groups of financial ratios are identified as major drivers of default risk: profitability ratios, leverage ratios and liquidity ratios. The effect of liquidity ratios is higher on the East-European data and the effect of profitability ratios is higher on the Polish data.
The findings of this paper yield promising insights into the potential of Bayesian regularization to efficiently estimate the network weights. Practically, this leads to making better informed and less biased credit risk decisions.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
The funding of the Systemic Risk Centre by the Economic and Social Research Council (ESRC) is gratefully acknowledged [grant number ES/K002309/1].