Generalized additive model with embedded variable selection for bankruptcy prediction: Prediction versus interpretation

: This paper explores the properties of using a generalized additive model with embedded variable selection for the prediction of bankruptcy. The main purpose is to explore an innovative way to close the gap between interpretation and prediction that has prevented widespread use of methods based on machine learning. An additive model enables the incorporation of nonlinear effects for each predictor, thereby enhancing the predictive power over classical linear models, while simultaneously keeping the marginal effects for interpretation separated. In addition, we propose a penalization likelihood approach that automatically selects important financial ratios and classifies them under linear and nonlinear effects, thereby improving the interpretation of the estimations. We implemented the proposed model on data from the retail industry in Colombia. The results demonstrate a good generalization performance of the algorithm and a prediction accuracy not far below typical black box algorithms such as random forest and support vector machines.


PUBLIC INTEREST STATEMENT
Early prediction of bankruptcy may help firms, auditors and financial institutions to support lending decisions, monitor companies and take corrective actions that may help prevent impacts on profitability, financial distress and bankruptcy waves. Many predictive methodologies are based on data-driven algorithms that are difficult to interpret by humans, preventing the use in many real applications due to it is difficult to explain why a firm will fail. Our primary contribution is to illustrate the importance of interpretability of the bankruptcy's model, knowing the effect of each financial ratio, and to propose a novel methodology that has interpretable results and have good predictive performance.

Introduction
Bankruptcy is the outcome of a chronic condition that occurs when the total liabilities of a firm exceed a fair valuation of its total assets (Altman & Hotchkiss, 2010). Therefore, knowing in advance when a firm is going to go bankrupt is valuable information that can impact many business decision-making problems (Kirkos, 2012). Bankruptcy prediction has been an important issue for bankers, investors, asset managers, auditors and academics (Altman, Iwanicz-Drozdowska, Laitinen, & Suvas, 2014;Jones, 1987). In particular, for financial institutions, bankruptcy has a significant impact on their lending decisions and profitability (Shin, Lee, & Kim, 2005).
Although pioneering work on corporate bankruptcy prediction using financial ratios began in the 1930s (Patrick, 1932;Smith & Winakor, 1935), it was not until influential papers by Beaver (1966) and Altman (1968) that the study of new methodologies and applications exploded in the literature. In the vast number of articles that appeared during the following decades, most of the models were based on classical statistical classification techniques such as linear discriminant analysis (LDA), logistic regression (Dimitras, Zanakis, & Zopounidis, 1996;Ohlson, 1980;Waqas & Md-Rus, 2018), and probit model (Kim & Gu, 2006;Zmijewski, 1984). These methods gained great popularity thanks to the simplicity of the interpretations, as they estimate a linear discriminant function on which the marginal effect of each predictor can be reduced to a single parameter. The price paid for this simplicity, however, is possible misrepresentation of the real relation between the predictor variables and the response, which is likely nonlinear and too complex to be expressed analytically.
During the last two decades, there have been numerous theoretical and experimental studies of new methodologies for bankruptcy prediction based on machine learning and artificial intelligence methods. These algorithms are by design-oriented more toward prediction performance using a set of financial ratios. Among the most popular of these new classification models are artificial neural networks (NN) (Atiya, 2001;Hu, 2009;Iturriaga & Sanz, 2015;Wilson & Sharda, 1994), support vector machines (SVM) (Lin, Yeh, & Lee, 2011;Shin et al., 2005), different ensemble methods (Abellán & Castellano, 2017;Du Jardin, 2018;Kim & Kang, 2010;Wang & Ma, 2012), and deep learning (Mai, Tian, Lee, & Ma, 2019;Ribeiro & Lopes, 2011). In general, all of these have better classification performance than do classical parametric statistical models such as LDA or logistic regression (Chen, 2011;Jones, Johnstone, & Wilson, 2016). To obtain high prediction performance, it is necessary to allow for a flexible (non-parametric) relation between the predictors and the response (Friedman, Hastie, & Tibshirani, 2001). Therefore, behind the NN, SVM, deep learning and ensembles classification success, there is a multivariate nonlinear complex discriminant function that is estimated nonparametrically. This flexibility, however, entails a difficult interpretation of the model results. These black box predictive algorithms are not able to explain the mechanism by which the predictors affect the mean response. Consequently, although they work very well to predict when bankruptcy will occur, they fail to account for why this happens.
Together with bankruptcy prediction performance, interpretability is a very important issue that distinguishes parametric statistical models (e.g., LDA and logistic regression) from recent machine learning methodologies (Virág & Nyitrai, 2014). There can be different positions found in the literature about the definition of interpretability in terms of predicting bankruptcy models. Some call a model interpretable if it generates interpretable rules (e.g. sales to total assets less than 5) (Florez-Lopez & Ramon-Jeronimo, 2015;Obermann & Waack, 2015), whereas another some others define a model as interpretable if it assess the effect of the predicting variables (Kainulainen et al., 2014). From the latter approach, knowing the effect of each financial ratio on the bankruptcy risk is often important in practical applications, as it helps financial institutions and auditors to support lending decisions and monitor companies, and allows them to take steps that may help prevent impacts on profitability, financial distress and bankruptcy waves. Despite the clear dominance of methods such as NN and SVM in terms prediction accuracy, the black box nature of the estimation process and the impossibility of validation by human experts, have prevented the implementation in many real situations (Jones et al., 2016;Martens et al., 2010). This is especially true for banks and financial institutions that embrace the Basel II Capital Accord (Wu, 2005). For example, in the United States of America, the Equal Credit Opportunity Act (ECOA) call for understandable reasons when credit is denied by financial institutions (Bank of Boston, 1993).
The interpretability of financial predictive models, in particular for bankruptcy, has attracted the attention of several researchers during the last decade (e.g. Martens, Baesens, Van Gestel, and Vanthienen (2007); Virág and Nyitrai (2014); Tomczak and Zieba (2015); Obermann and Waack (2016)). Most of the efforts are done on extracting comprehensible rules from the black box type of function estimated in a machine learning algorithm. However, these methodologies in general only represent an approximation of the original model and therefore might lose accuracy (Obermann & Waack, 2015). Our approach is different in the sense that we take an interpretable model and modify it to preserve both: (i) a simple functional form on which the effects of each ratio can be understood separately, and (ii) a more flexible form in every variable, so the model can have a better specification. We think that this path has been less explored and providing some new evidence is still necessary.
In addition to the complex discriminant function, an elevated number of predictors may be another aspect that makes a bankruptcy classification model difficult to understand. According to the literature review published by Bellovary, Giacomino, and Akers (2007), the number of financial ratios considered in bankruptcy predictive models ranges from one to 57, for a total of 752 different variables. One of the reasons for the popularity of parametric classification models is the selection of variables to obtain a very understandable function (Altman et al., 2014). The Z-score model introduced by Altman (1968), probably the methodology most in use to the date (Altman, 2000), utilizes five predictors selected from a group of 22. However, if the purpose is to achieve high predictive power, there is no reason not to use a larger set of financial ratios. Studies based on new machine learning methods generally use more variables than do classical parametric models, increasing the obscurity in interpretations (Jones et al., 2016;Virág & Nyitrai, 2014).
The aim of this paper is to explore a new model that could help to close the gap between classical interpretable models and black box predictive algorithms for the problem of bankruptcy prediction. We propose a generalized additive model (GAM) for binary classification with an embedded mechanism for variable selection. The model we use has the ability to select for each predictor between linear, non-linear or zero effect through the penalization of the likelihood function. When each effect is a linear function, we obtain a single parameter that represents its marginal contribution to the odds ratio. In the non-linear case, we obtain a non-linear function that can be interpretable because of the additive property (Berg, 2006). Furthermore, the embedded variable selection will produce a sparse estimation in which non-important predictors are discarded. We applied our methodology to retail firms in Colombia using financial information obtained between 2012 and 2013. Although we used data from two years, our analysis is more oriented to the evaluation of the methodology, and our results can be extended to larger horizons. To the best of our knowledge, bankruptcy prediction studies in Colombia have been sparse and more oriented to the identification of relevant factors that explain bankruptcy using logistic regression models (cite perez 2013, caro 2013 and others..). Therefore, the present study provides further evidence on how the bankruptcy phenomenon occurs in the local economy, and gives benchmark values to the predive performance that could be achieved with the different methods that are compared.
The remainder of this paper is organized as follows: Section 2 provides a description of the generalized additive model with a binary response for the bankruptcy prediction problem. Section 3 provides a description of the embedded variable selection we propose to incorporate in the GAM model. Section 4 describes the data that we use as a case of study to apply the proposed methodology. Section 5 summarizes and analyzes the obtained results with some discussion of the marginal effects of each financial ratio. Finally, in section 6, we present the conclusions of our study.

Generalized additive model for bankruptcy prediction
We propose to use a nonparametric statistical model for the bankruptcy prediction problem that balances out the interpretability of the results and the ability to make accurate predictions. In particular, we put forward a GAM model for binary classification with embedded variable selection that separates linear and nonlinear terms. Naturally, we represent the response of the model as the indicator variable (Y) that takes the value of 1 if there is bankruptcy and 0 if there is not.
We will use the standard logit function on the probability p as the canonical parameter for the generalized model. In this framework, let the financial ratios selected as predictors be noted as X 1 ; X 2 ; . . . ; X p . Therefore, the link between the conditional mean of Y and these variables is modeled through the logarithm of the odds ratio as log . . . ; X n Þ is the conditional probability of PðY ¼ 1Þ given the values taken by the predictor variables, and f X 1 ; Á Á Á ; X p À Á is the general function of interest that have to be estimated from data. In a generalized linear model (GLM), the link function is restricted to be Therefore, the marginal contribution of each variable X j (j ¼ 1; . . . ; p) to the odds ratio can be summarized by a single parameter. Although this makes the estimations easy to understand, the assumption of linearity may be too strong in real applications with a large number of observations, and limits the flexibility needed to achieve better predictions. Most likely, the most successful and straightforward adaptation of the linear model to a more flexible one is the generalized additive model (GAM) proposed by Hastie and Tibshirani (1990) (Wood, 2017). In this model, the contribution of each variable is replaced by arbitrary univariate functions f j ðX j Þ that can be estimated by smoothing techniques. Therefore, using a GAM model, the conditional expectation of Y is modeled as log PðX 1 ; . . . ; X n Þ 1 À PðX 1 ; . . . ; with the assumption that E f j ðX j Þ Â Ã ¼ 0 for each j, in order to make the model identifiable. Since the effect of each variable is represented by a separated function (additive property), the generalized additive model retains most of the interpretability of the linear model, although all the marginal variation can no longer be summarized in a single parameter. At the same time, the additivity imposes a structural condition on f X 1 ; X 2 ; . . . ; X p À Á that avoids the deterioration of the estimations when the number of variables increases, a characteristic of nonparametric methods usually referred as the curse of dimensionality (Hughes, 1968).
The estimation of the functions f 1 ; f 2 ; . . . ; f p will be made nonparametrically, that is, each of them results from a smoothing technique (e.g. k-neighbors, kernel smoothing, etc.). We opt to use the smoothing method of penalized likelihood given that provides a better conceptual framework to evaluate the estimators properties and allows the use of functional basis. For the penalization, we assume that each f j belongs to H j ðΩ j Þ, where H j ðΩ j Þ is a Sobolev space of order 2, that is, the space of functions on Ω j with two continuous derivatives, whose second derivative is integrable. This assumption is very weak and is the standard form to define smoothing splines (Wahba, 1991). Therefore, the estimatorsf 1 ;f 2 ; . . . ;f p are defined as the solution of the optimization problem min α2R;f 1 2H 1 ;...;fp2Hp . . . ; f p À Á , y is the column vector of length n with the binary responses, and X 2 R nÂp is the matrix with all the predictors, such that the row i contains the financial ratios of the i-th firm in the sample. , n ðÁÞ is the negative log-likelihood function evaluated at ðX; yÞ, that is, The optimization problem 3 has a unique solution that is not defined on a finite number of arguments (each H j needs an infinite number of basis). However, according to the Representer theorem (Schölkopf et al., 2001), the solution of 3 implies that eachf j ðx j Þ is a natural cubic spline function with knots on the observed x ij È É , and therefore it can be expressed aŝ where c 1j , c 2j and d j1 ; . . . ; d jn are coefficients derived from the data and K j ðÁ; ÁÞ is the kernel function associated with the space H j . Details about the reproducing kernel Hilbert space approach with Sobolev spaces can be found in Wahba (1991).
Equation 5 implies that the estimation of the p functions f j can be rewritten in terms of the parameters c 1j , c 2j and d j1 ; . . . ; d jn . However, the number of equations needed to solve the nonlinear minimization problem can become very large (in the order of np Â np). An iterative method as the local scoring procedure is more suitable to solve 3 (Hastie & Tibshirani, 1990).

Embedded variable selection
The selection of adequate accounting measures as predictors for bankruptcy has an inevitable effect on the performance of the models. During the last five decades, the effectiveness of many financial ratios has been empirically proved with different datasets and in different countries (Jackson & Wood, 2013). However, there still exists a vast amount of possible variable combinations that could be included in a prediction methodology, which in conjunction with the high correlations among them, make the model selection task very difficult. In a review of bankruptcy prediction studies performed by Bellovary et al. (2007), the authors found more than 165 referenced articles that used a total of 752 different financial indicators. Among them, the return on assets, the current ratio, and the working capital to total assets have higher frequencies of use.
In several studies, the approach to selecting which predictors to include in a particular model depends on the expert opinion of the researcher and the significance that those measures had in similar datasets, frequently resulting in a group of measurements of important characteristics, such as working capital, cash flow, earnings and leverage (Altman & Hotchkiss, 2010;Beaver, McNichols, & Rhie, 2005;Jones et al., 2016). Some methodologies oriented more toward artificial intelligence and machine learning, use a broader pool of variables and implement data-driven selection methods, such as wrapper or filter techniques, to choose which of them are worth keeping.
We propose a different approach to variable selection on a generalized additive model that use in-built mechanisms to discard non-important predictors. In that sense, the estimation procedure sets the effect of some financial ratios as zero, avoiding the need to estimate several models and compare their performances afterwards. Different from sequential wrapper procedures such as forward or stepwise selection, the use of these methods usually depends on the intended specific predictive model, and assures asymptotic optimality under some criteria. Given that the generalized additive model is estimated by penalized likelihood, a convenient approach to inserting an embedded model selection method is to modify the penalties used in estimator 3. Different kinds of penalizations would force some of the non-informative predictors to be set to have a zero effect, that is,f j ðx j Þ ¼ 0. This has been vastly explored in linear models using lasso type models (Tibshirani, 1996). For generalized additive models, several methods have been proposed to extend the lasso effect, the most notorious being: COSSO (Lin & Zhang, 2006), SpAM (Ravikumar, Lafferty, Liu, & Wasserman, 2009), GAMSEL (Chouldechova & Hastie, 2015), among others (Lou, Caruana, & Gehrke, 2012;Meier, Van De Geer, & Bühlmann, 2009).
We use the generalized additive model selection (GAMSEL) method given its ability to select for each variable between linear, non-linear and zero effects, and the computational performance when the number of predictors scales. The first property helps to make the estimation more interpretable given that, apart from selecting which subset of variables to use, their effects are differentiated between linear and non-linear. This methodology uses the fact that eachf j estimated in problem 3 is a natural cubic spline with interval limits in the observations x ij . This allows for the use of a different basis to represent the solution parametrically. For convenience, the Demmler-Reinsch basis u 1 ðxjÞ; u 2 ðx j Þ; . . . ; u n ðx j Þ È É is used to represent f j ðx j Þ such that: Given that the basis u 1 ðxjÞ; u 2 ðx j Þ; . . . ; u n ðx j Þ È É has an increasing order of complexity, it is truncated to the first m j functions to reduce the computing cost. Making β j ¼ β 1j ; β 2j ; . . . ; β m j ;j where D 1 ; . . . ; D p À Á are the resulting smoothing penalty matrices, and k β j k D j ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffi β T j D j β j q . The negative log-likelihood function has the form Note that when λ ¼ 0, the estimator 7 is equivalent to the standard GAM model 3. The term that multiplies λ is the part of the penalty that permits the variable selection, separating the linear and non-linear effects. Because the non-linear part requires m j parameters, a group-lasso type of norm is used.

Data description and implementation
The dataset we use in the present study corresponds to small and medium-size retail firms in Colombia. According to Colombian regulations (law 905 in 2004), a retail firm is considered small when it has between 11 and 50 workers or its assets are between $120 thousand and $1.2 million, whereas it is considered medium when it has between 51 and 200 workers or its assets are between $1.2 and $7.3 million.
We aim to predict bankruptcy from data obtained from public financial statements (balance sheet, income statement and cash flow statement) reported to governmental institutions (Superintendencia de Sociedades de Colombia) during the years 2012 and 2013. In total, we considered 2922 firms after removing some observations that were classified as invalid given the inconsistency of some measures (e.g. extreme negative working capital to sales or negative total assets to sales). The time horizon was fixed at one year, therefore the input variables are from 2012 and the response variable (bankruptcy indicator) is for 2013. We consider the bankruptcy of all the firms that stopped reporting statements during 2013 eliminating those that were sold or went into merger. Given that one of our main objectives is to compare the performance of the methodologies we propose with other models and techniques, we randomly divided the sample into two parts, 70% for training and the 30% left as a validation set.
The depurated dataset of firms (before splitting) shows a severe situation of class imbalance given that 115 (3.93%) of them were bankruptcy cases, and 2807 (96.06%) were non-bankruptcy cases. Estimating a model with these proportions could produce a large bias towards the majority class, thus undermining the predictive power measured by the AUC (area under the ROC curve). To address the class imbalance, we use a synthetic minority over-sampling technique (SMOTE, Chawla, Bowyer, Hall, and Kegelmeyer (2002)) on the training set. This method has been shown to work better than classical under (over) sampling by synthetically generating data points from the minority class while removing some random points from the majority class. After performing some experiments with the data, we set to over-sample the minority class in 20% and undersample the majority in 80%, obtaining a proportion of 19.76% to 80.23% of the positive cases (bankruptcy) and negative cases, respectively.
Regarding the financial ratios chosen as predictors in the model, we use 30 classical measures that have been proposed in the literature (Altman, 1968;Beaver, 1966;Bellovary et al., 2007;Kirkos, 2012). As we present a methodology for embedded variable (and effect) selection, the only preprocessing and screening of variables we performed was to eliminate some that presented pairwise correlations higher than 0.9 (quick assets to current liabilities, accounts receivable to sales, current assets to sales, EBIT to total assets, inventory to sales, retained earnings to total assets and total assets to sales) with other variables. The final set of variables we use in our model is presented in Table 1.
For comparative purposes, we also fitted different methodologies to the same training data. In particular, we use random forest, support vector machine (radial and linear), GAM (generalized additive model without embedded variable selection), lasso penalty in logistic regression and linear discriminant analysis. of these, it is clear that random forest and radial support vector machine should have greater predictive power, however, they are difficult to interpret given that the information about the influence that each financial ratio has on bankruptcy occurrence is limited to its relative importance or to the projected marginal effect.

Results and analysis
To investigate the properties of the model we propose, we begin by presenting the predictive performance on the validation set using the area under the ROC curve (AUC) as the metric for evaluation. We compare our model with the performance of the other methods taken into account. In addition to exploring the differences in terms of predictive ability, we pretend to evidence the tradeoff between predictability and interpretability and how the methodology we propose may offer a good balance. Table 2 presents the AUC of the generalized additive model with variable selection evaluated on the validation data (0.8391). Additionally, we present the AUC of commonly used methods that cover the spectrum from very flexible (e.g. random forest and radial support vector machines) to parametric interpretable models (e.g. logistic regression with lasso). The AUC of the seven analyzed methodologies ranges between 0.76 and 0.86; as expected, the predictive power depends on the flexibility of each methodology. Models designed to increase the performance as random forest have higher AUC, whereas linear discriminant analysis has the lowest AUC of them all. Figure 1 shows the ROC curves for each model with their respective AUCs. On the other hand, as discussed above, models such as the linear logistic regression model are easily interpretable because the effect of each predictor is summarized on a single parameter. Table 3 presents a comparison of the variables selected by the lasso penalty in a linear logistic regression model versus the respective effects estimated by our methodology, differentiating for each one if it was excluded, linear or non-linear.
The performance of the GAMSEL model is relatively close to the random forest and SVM with radial kernel ones. The main difference is in the level of interpretability of each financial ratio. For each predictive variable of GAMSEL, there are three groups of estimated effects: linear relation,  non-linear relation or non-significant. We group the variables depending on the effect type, as shown in Figures 2 and 3.
The first cluster of variables has a linear relation (see Figure 2). Intuitively, it is very important for the survival of retail companies that they have enough liquidity, high profits and low leverage Table 3. Predictors selected effects in logistic regression with lasso penalty and GAMSEL

Predictor
Lasso LR Gamsel Figure 1. ROC for the fitted models. (Beaver, 1966). Therefore, as seen in Figure 2, when liquidity ratios (working capital to total assets and no-credit interval (NCI)) decrease, the probability of bankruptcy increases confirming that these ratios are signals of financial health in the retail business, as have been discussed in previous studies (e.g. Campbell, Hilscher, and Szilagyi (2008); Chiaramonte and Casu (2017)). Likewise, the probability of bankruptcy increases when the profitability ratios (cash flow to net worth, net  income to total assets and net income to total debt) and net worth to sales decrease, as shown in Table 3. In contrast to all negative slopes, a high leverage ratio of total liabilities to total assets indicates that a firm with high indebtedness increases its probability of bankruptcy. Consequently, a firm with reduced liquidity, low profitability or excessive leverage has a high probability of bankruptcy.
The second group of variables have non-linear effects (see Figure 3). Consistent with previous results in the retail industry, decrements of profitability (net income to sales), leverage (current liabilities to total asset) and efficiency (sales to total assets) have a monotone increasing effect on bankruptcy's probability. Interestingly, liquidity ratios (current ratio and quick assets to sales) have a concave effect, meaning that the excess of liquidity is not a good symptom in the retail industry. On the one hand, low proportions of current assets or quick assets also decrease the probability of bankruptcy. Given the operative business practices in retail, it is not surprising that both high current liabilities and sales, have a decreasing effect. In contrast, profitability ratio (cash to total assets) and liquidity ratio (quick assets to total assets) have a convex effect on bankruptcy probability. Thus, very low levels of money, quick assets or total assets increase the probability of bankruptcy.

Conclusions
In this study, we explore the benefits of using a generalized additive model for bankruptcy prediction that incorporates a novel approach to solve the model selection problem. Our experimental results demonstrate that the generalized additive models with embedded variable selection perform well in terms of predictive power and generalization, being only slightly outperformed by methodologies that do not separate the estimated effects of each predictor, and are therefore much less interpretable.
As an explanation of why a model is selecting some prone to fail firms is required by many managers and regulatory agencies, the implementation of predictive methodologies with limited interpretability such as ensembles, neural networks or support vector machines, is restricted because they do not make explicit what is fostering this situation. This opens the playing field to new models that allow an elucidation of the effect of each financial ratio on bankruptcy probability, but with a competitive prediction performance. Classical linear models such as LDA and logistic regression (with or without penalization) are considered the gold standard to obtain interpretable results; however, our results prove that those models are one step below in terms of foreseeing future failure cases. The logistic regression model with lasso type penalization had an AUC of 0.7757 on the validation set, whereas the GAMSEL model presented an AUC of 0.8391, which position the latter closer to the performance of non-interpretable algorithms than to the linear methods.
Furthermore, our investigation shows that linear models fail to represent the functional effects of important financial ratios. Allowing for nonlinear relations not only improves dramatically the generalization capacity of the model and its predictive power, but also helps to understand some interesting patterns. As long as the nonlinearity terms are structured in an additive function, the entire model retains interpretability, in the sense that the function associated with each variable represent its marginal effect. This is not true in general for non-additive functions, given what one can observe of the conditional effects given particular values of the other predictors for this type of model.
For future research, it remains to be seen how the automatic effect selection produced by the penalization scheme performs in high dimensional settings. Although we selected a set of approximately 30 financial ratios, the possibilities are much wider in terms of the number of predictors. An interesting area for future research is the embedded model selection when there are groups of financial ratios that present similar characteristics, and thus are highly correlated. In terms of data, another avenue of future research is to explore how the model performs in different countries, and with different validation frameworks (e.g. default horizons and multi-period sequential predictors).

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.