Credit Risk Assessment Using Learning Algorithms for Feature Selection

The firefly algorithm is one of the latest outstanding bio-inspired algorithms and can be applied to solving continuous or discrete optimisation problems. In this context, we have combined the firefly algorithm with five well-known classifier models for feature selection to obtain an accurate estimation of risk and to improve the interpretability of credit card prediction. One of the significant challenges in real-world datasets is how to select features: as most datasets are unbalanced, feature selection is biased towards the majority class, which is not fair. To overcome this issue, we have balanced the data using the SMOTE method. Our experimental results on four datasets show that balancing the data increases accuracy. In addition, using the hybrid firefly algorithm, the optimal combination of features that predicts the target class label is achieved. The features selected by the proposed method, besides being reduced in number, can represent both the majority and minority classes.


Introduction
Over the last several decades, the increase in financial transactions has led to a tendency towards investigation of financial risk, particularly studies that evaluate the risk of customer credit transactions in commercial banks. Studies have shown that the majority of a bank's loss risk originates from customer credit risk, which is one of the most rapidly expanding issues in the banking industry. Hence, banks need to employ an evaluation system to measure customers' credit risk. Finding new ways of doing less risky business that is nonetheless efficient and profitable is a permanent challenge for banks. In retail, risks are often taken with no exact estimate of their degree and conceivable results.
There exist fundamental decisions in the process of lending to a consumer in financial institutions that essentially need to rely on algorithmic results rather than human decisions. For instance, in the loan-approval process of many banks, artificial intelligence models play the primary role in decision-making. Thereby, a weak credit risk assessment model, with its attendant misclassification costs, could lead to non-optimal capital allocation [1,2].
Machine learning techniques have been broadly utilised to improve precision in many real-world problems. Evaluation of credit risk in the financial industry is one such problem that has been investigated with these techniques in recent years. Most studies have focused on methods that incorporate the results of various classifier algorithms to obtain optimal results [2]. Two different entities, the dataset and the algorithm, both have a vital role in improving performance. The databases used for credit risk assessment commonly contain high-dimensional data with anomalies, and irrelevant features reduce the precision of classification during training. Feature selection, the process of selecting the essential features, is required to provide better predictive accuracy and to increase speed and scalability. Feature selection techniques are typically applied in the preprocessing phase of classification [1,3]. Classifying objects is at the heart of much research in pattern recognition, artificial intelligence, vision analysis, and medicine. In this regard, many studies have investigated important classification methods such as the support vector machine (SVM) and random forest [4], K-nearest neighbour (KNN) [5,6], the fuzzy KNN algorithm [7,8], and the decision tree [4,5]. The performance of classifiers is usually estimated according to the predictive accuracy of constructed models. However, much real-world data, including financial data, is unbalanced. Machine learning algorithms cannot classify unbalanced data well and sometimes mistakenly assign minority-class data to the majority class. In other words, standard classification algorithms suffer a loss of performance on unbalanced datasets.
Over-sampling and under-sampling are two well-known methods for classifying unbalanced datasets by changing the distribution of data: under-sampling alters the distribution by omitting majority-class instances, and over-sampling by increasing the minority class. However, these methods run the danger of eliminating some potentially useful data or, in the case of over-sampling, increasing the probability of over-fitting. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known technique that adds new samples to the minority class [9].
On the other side, selecting a suitable set of features is a challenge in credit risk assessment algorithms. To measure the performance of selected features, various benchmark classifiers are used [1]. Piramuthu, in [10], brought up decision support applications for evaluating credit risk from a machine learning outlook. A hybrid algorithm composed of a genetic algorithm and neural networks (HGA-NN) is proposed in [1]. It recognises a suitable subset of features and leads to improved classification accuracy and scalability in the evaluation of credit risk. The efficiency of that method is measured on two real credit datasets, the Croatian bank dataset and the German credit dataset, the latter chosen from the UCI repository. The results show 78.9% accuracy for the German credit dataset, and the average prediction precision of the HGA-NN method on the Croatian dataset is 82.88%. In 2007, Huang et al. considered a hybrid algorithm combining SVM and GA for feature selection on the German credit data and tested its efficiency against a neural network, genetic programming, and C4.5; they showed that SVM outperforms the other methods [11]. A comparison between SVM ensembles and neural networks on the German credit data was studied in [12], which showed that the performance of the bagged SVM ensemble is better than that of the other methods. Khashman et al. [13] introduced a new evaluation system for estimating credit risk that employs supervised neural network methods; it achieved its best performance with 74.46% accuracy. A study of a new classification technique based on a genetic algorithm and a neural network for credit risk assessment was carried out in [14], with the proposed algorithm applied to the Croatian and German retail credit datasets. Xue et al. [3] proposed a hybrid model based on the RIPPER algorithm. Their algorithm introduces a new strategy for data pretreatment and then, for feature selection, eliminates the redundant features. To overcome the obstacle of the unbalanced distribution of credit card datasets, a sampling algorithm for the minority class is used to normalise the samples. Finally, taking advantage of the rules produced by the RIPPER algorithm, default credit card users are predicted.
Our contribution: We present a comparative analysis of different feature selection models, with the motivation of improving the scalability and precision of algorithms for evaluating retail credit risk. The efficiency of the proposed algorithms is evaluated using the real-world German credit dataset downloaded from the UCI repository. As this financial data is unbalanced, we first apply the SMOTE method to balance it. Then, the striking features of the credit card data are identified (feature selection) using the binary firefly algorithm together with the classifiers KNN, Fuzzy KNN, Random Forest, Decision Tree, and SVM. Our results differ from [11-13] in significant ways and indicate that the proposed method attains acceptable results.
The current article is organised into five sections. In Section 2, a unified description of the tools used is given. The proposed method is then discussed in detail in Section 3. The parameters and criteria used in our model, together with the results, are given in Section 4, where we analyse the performance of the proposed algorithms in deciding whether to provide a loan, and examine in detail the impact of the features retained by each classifier on the final results. In Section 5, the experimental results and the improvements achieved by each classifier are discussed, and the study's conclusion is given.

Preliminaries
In this section, we briefly review fundamental notation and describe the methods KNN, Fuzzy KNN, Random Forest, Decision Tree, SVM, and the firefly algorithm, which are used in the following sections.

K-Nearest Neighbours
The K-Nearest Neighbours (KNN) method is a simple, popular, and highly efficient algorithm based on a similarity measure, for instance a distance function. From the machine learning perspective, KNN is a non-parametric and lazy classification method [15]. It is non-parametric in that it makes no presumption about the underlying data distribution, and lazy in that it does not use the training data points to perform any generalisation. The main purpose of KNN is to separate the data points into several classes and then predict to which class a new sample point belongs. It is based entirely on feature similarity via the majority vote of neighbours: a given data point is assigned to the class that is most common among its k nearest neighbours [16].
The KNN algorithm can be summarised as: (1) Input is a new sample along with a positive integer k, (2) For the given sample, select the k closest entries in the database, (3) Determine the most common classification among these entries, (4) Output the class to which the new sample belongs [17].
In other words, let X = {x_1, x_2, . . . , x_n} be a training set containing N instances that belong to C classes, and let Q be the set of test samples. For a test instance y ∈ Q, KNN classifies y using the weighted Euclidean distance between y and the points of X,

d(x, y) = ( Σ_{j=1}^{d} w_j (x_j − y_j)^2 )^{1/2},

and assigns the class label

class(y) = argmax_{c ∈ {1,...,C}} |{x ∈ N_k(X, y) : label(x) = c}|,

where N_k(X, y) denotes the k-neighbourhood of the point y in the set X. Hence, the KNN classifier assigns to the instance y the class to which the majority of the k neighbours of y in the training set belong. However, KNN does not take into account how far each neighbour lies from the test point. Besides, KNN provides only one single class for each training point and assigns a unique class label to each test point. Fuzzy KNN handles these issues.
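The voting rule above can be sketched in a few lines of Python (an illustrative sketch with unweighted Euclidean distance; the function name is ours, and this is separate from the paper's Matlab implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), following the decision rule sketched above."""
    dists = np.linalg.norm(X_train - x, axis=1)     # distance to every training point
    nearest = np.argsort(dists)[:k]                 # indices of the k closest entries
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                # most common class among neighbours
```

For example, with two well-separated clusters, a query near one cluster is assigned that cluster's label.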

Fuzzy K Nearest Neighbours
The Fuzzy K-Nearest Neighbour algorithm (FKNN), presented in [8], preserves the fundamental decision rule of the KNN method and attempts to overcome both KNN defects mentioned in the preceding section. Comparisons between FKNN and other classifiers such as crisp KNN, neural networks, Bayesian classifiers, and linear discriminant functions illustrate this method's superiority [18]. Let X = {x_0, x_1, . . . , x_N} be a training set consisting of N instances distributed into C classes, and let Q be the set of test patterns. In short, FKNN assigns a membership μ_i(y) to each test point y for each class i; that is, μ_i(y) is the membership of the data point y in class i. The classifier assigns membership to a data point y according to the following equation.
μ_i(y) = ( Σ_{j=1}^{k} μ_ij ||y − K_j||^{−2/(m−1)} ) / ( Σ_{j=1}^{k} ||y − K_j||^{−2/(m−1)} ),

where μ_ij is the membership of the j-th nearest neighbour in class i, derived from the number of instances in that neighbour's neighbourhood belonging to class i (for crisp training labels, μ_ij is 1 if the neighbour belongs to class i and 0 otherwise). The constant parameter k could take an initial value in the interval from 3 to 9. A data point is then classified using the maximum membership value:

class(y) = argmax_i μ_i(y),

where K_j denotes the j-th nearest neighbour x_j, the real number m > 1 specifies the strength of the fuzzy distance weighting, and ||y − K_j|| is the L-norm distance between y and K_j [18].
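Under these definitions, the membership computation can be sketched in Python (an illustrative sketch, not the paper's Matlab code; crisp neighbour labels are used as μ_ij, and the helper name is ours):

```python
import numpy as np

def fknn_memberships(X_train, y_train, x, k=3, m=2.0, classes=None):
    """Fuzzy KNN memberships of test point x in each class, using the
    distance-weighted rule above with crisp neighbour labels as mu_ij."""
    if classes is None:
        classes = np.unique(y_train)
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                                   # k nearest neighbours
    w = 1.0 / np.maximum(d[nn], 1e-12) ** (2.0 / (m - 1.0))  # distance weights
    mu = np.array([np.sum(w[y_train[nn] == c]) for c in classes])
    return classes, mu / mu.sum()                            # memberships sum to 1
```

The predicted class is the one with the largest membership, but the full membership vector is retained, which is exactly what crisp KNN cannot provide.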

Decision Tree
The decision tree is one of the most widely used classification techniques. A decision tree uses a recursive partitioning method on the training instances: it constructs a tree-like graph in a top-down manner, in every iteration partitioning the instances according to the best feature [19].
Decision trees are commonly used in operational research, particularly in decision analysis, to help identify a strategy for reaching a particular goal, and they are also a well-known tool in machine learning. A decision tree is a flowchart-like structure in which each inner node represents a 'test' on a feature, each branch represents an outcome of the test, and each leaf holds a class label (the decision taken after evaluating all attributes). The classification rules are derived from the paths from root to leaf. In decision analysis, a decision tree and the related influence diagram are used as a visual and intuitive decision support tool in which the expected values (or expected utilities) of competing alternatives are computed [20].

Random Forest
Random Forest is a machine learning algorithm that produces reliable and outstanding results even without tuning specific parameters and can be utilised for classification and regression. A random forest is an ensemble learning technique for classification, regression, and other tasks that operates by constructing a multitude of decision trees; its output is the mode of the classes of the individual trees (classification) or their mean prediction (regression) [21]. Known for its accuracy, this method is efficient on large datasets with thousands of input variables, and working without over-fitting or any requirement for data pruning is one of its attractive properties [22]. Tin Kam Ho [23] introduced the first idea of random decision forests, implementing the 'stochastic discrimination' approach to classification proposed by Eugene Kleinberg [24-26]. Other extensions of this idea were developed by Leo Breiman [27]. Breiman's 'bagging' concept, combined with the random selection of features introduced first by Ho [23] and later independently by Amit and Geman [28], builds a set of decision trees with controlled variance. Technically, the idea behind random forest is to build an ensemble of learners from the original dataset, each trained on a bootstrap sample D_b drawn by the following sampling method: let D be the original dataset with N examples, and let D_b be created by randomly selecting k samples from D with replacement. If N is large enough and k = N, then after eliminating duplicates the set D_b is expected to contain approximately two-thirds (about 1 − 1/e ≈ 63.2%) of the distinct samples of D. The ensemble's prediction emerges from the separate decisions by majority voting (classification) or averaging (regression).
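The two-thirds claim about bootstrap samples is easy to verify empirically (a small standalone check, not part of the paper's experiments):

```python
import random

def bootstrap_unique_fraction(n=10000, seed=0):
    """Draw n samples with replacement from n items and return the fraction
    of distinct originals that appear; expected value is 1 - 1/e ~ 0.632."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # set keeps only distinct indices
    return len(drawn) / n
```

For large n the fraction concentrates tightly around 0.632, matching the "approximately two-thirds" statement; the remaining one-third of samples are "out-of-bag" and can be used to estimate generalisation error.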
Compared with the base models, the variance of the final ensemble is reduced, which helps to avoid over-fitting [29].

Support Vector Machine
A Support Vector Machine (SVM) is a discriminative classifier for classification and regression analysis; it is a supervised learning model with associated learning algorithms applied in a wide range of real-world problems [30]. Let X be a training set consisting of N instances that belong to two classes, denoted I_1 and I_2. For i = 1, . . . , N, define pairs (x_i, y_i), where x_i is an object with d features and y_i is its label: y_i equals 1 if instance x_i belongs to I_1 and −1 if x_i belongs to class I_2. The SVM method uses a linear separating hyperplane to classify an object x as

f(x) = sign( Σ_{i=1}^{N} a_i y_i (x_i · x) + b ),

where each parameter a_i ≥ 0 is a Lagrange multiplier and, together with the input vectors x_i, the constant b determines the optimal separating hyperplane. For the non-linearly separable case, the classifier is modified as

f(x) = sign( Σ_{i=1}^{N} a_i y_i K(x_i, x) + b ),

where K(·, ·) is the kernel function, a non-linear mapping from the input space into a (high-dimensional) feature space. There are well-known kernel functions used to create the decision rules, such as the radial basis function (RBF) kernel, the linear kernel, the polynomial kernel, and the multilayer perceptron (MLP) kernel. Choosing suitable parameter values plays a vital role in constructing a classification model with stable, precise predictions, although there are no general rules that guarantee the best parameter values for a given problem [31].
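The kernelised decision function can be sketched directly from the second equation (an illustrative sketch with names of our own choosing; training, i.e. solving for the multipliers a_i and the bias b, is omitted and the values are assumed given):

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.linalg.norm(u - v) ** 2)

def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Evaluate sign(sum_i a_i y_i K(x_i, x) + b) for given multipliers."""
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1
```

With hand-picked multipliers, a query near a positive support vector is classified +1 and one near a negative support vector −1, illustrating how the kernel localises each support vector's influence.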

Feature Selection
In machine learning, algorithms perform classification based on a set of features, and the feature space may involve hundreds of variables in machine learning or pattern recognition applications. Several methods have been developed to address the problem of reducing redundant and irrelevant variables in challenging tasks [32]. Variable selection (feature selection) leads to improvements in data comprehensibility, data visualisation, and prediction performance. It also reduces the training time of the learning algorithm, the computational requirements, and the effect of the curse of dimensionality [17,32]. Feature selection methods are broadly classified into three main categories: filter, wrapper, and embedded. Filter methods are among the oldest feature selection methods and work without any learning algorithm; they are used as a preprocessing tool to rate the features. In these methods, features are ranked based on a particular evaluation parameter and selected according to four criteria: distance, dependency, consistency, and information. Wrapper methods find a subset of features based on a classifier, the feature selection criterion being the performance of the predictor; the subset of features with the highest predictor performance is selected. Embedded methods combine the benefits of both wrapper and filter methods by taking different rating parameters at various steps of the search. This model performs feature selection during the training process without splitting the data into training and testing sets [32,33]. The embedded idea is similar to wrappers but less prone to overfitting and less computationally expensive; its significant restriction is that it takes decisions based on the classifier [17,33].

Synthetic Minority Over-sampling Technique
SMOTE, the Synthetic Minority Over-sampling Technique proposed by Chawla et al., applies a screening method to balance the original unbalanced training set. The simplest approach is repetition of minority-class instances, but the main idea of SMOTE is to introduce artificial prototypes: points lying between two nearby minority (negative) class examples are assigned to that class, so SMOTE synthesises new negative instances among the negative-class examples [9]. The new synthetic negative-class samples are created as follows. Let m be the sampling rate. For each negative-class instance x_i, SMOTE obtains its k nearest neighbours, randomly selects m of them, y_ij with j = 1, 2, . . . , m, and eventually generates new points

p_j = x_i + r · (y_ij − x_i), j = 1, 2, . . . , m,

where r ∈ (0, 1) is a random number [33,34]. The re-sampled training set, combining the new synthetic negative-class instances with the original dataset, leads to a significant improvement in the degree of imbalance of the original dataset [3].
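The synthesis formula p_j = x_i + r · (y_ij − x_i) can be sketched as follows (an illustrative sketch of the interpolation step only, with names of our own choosing; neighbour search and integration into a full SMOTE pipeline are omitted):

```python
import numpy as np

def smote_points(x, neighbours, m, rng=None):
    """Generate m synthetic minority samples p_j = x + r * (y_j - x),
    each interpolated towards a randomly chosen neighbour with r ~ U(0, 1)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(neighbours), size=m, replace=True)
    out = []
    for j in idx:
        r = rng.random()                        # random r in (0, 1)
        out.append(x + r * (neighbours[j] - x)) # point on the segment x -> y_j
    return np.array(out)
```

Every synthetic point lies on a line segment between the minority instance and one of its minority-class neighbours, which is what keeps the new samples inside the minority region rather than in majority territory.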

Firefly Algorithm
The firefly algorithm is classified as a swarm intelligence algorithm with highly effective performance that has increasingly been applied to solving optimisation problems; it is based on the flashing behaviour of fireflies [34]. The firefly algorithm (FA), proposed by Yang [35], is inspired by fireflies' ability to produce flashing light. The main tasks of the flashes can be expressed briefly in two steps: first, they attract mating partners, and second, they warn potential predators of the firefly's bitter taste [36]. To summarise the FA, we can introduce it through three phases [37]:
Initialise: A light signal is assigned to each firefly according to the distance between two fireflies and the coefficient of atmospheric absorption:

I(r) = I_0 e^{−γ r^2}, (7)

where I is the strength of the light source, r is the distance between the two considered fireflies, γ is the absorption coefficient, and I_0 is the strength of the light source when r = 0.
Attractiveness: it can be expressed by

β(r) = β_0 e^{−γ r^2}, (8)

where β_0 is the attractiveness of the firefly when r = 0.
The moving step: In each pair of fireflies across the whole population, the less fit firefly moves towards the fitter one, using the following model:

x_j = x_j + β_0 e^{−γ r_ij^2} (x_i − x_j) + α ε_j, (9)

where α is the mutation coefficient, often made a self-adaptive parameter that decreases through the iterations, and ε_j is a random number drawn from the interval [−0.5, 0.5]. Pseudo-code of the FA is illustrated in Algorithm 1, from which it can be seen that the algorithm consists of the above steps:

Algorithm 1 Firefly Algorithm
1: Generate an initial population of N fireflies and set t = 0
2: while t < Max-iteration do
3:  for i = 1 to N do
4:   for j = 1 to N do
5:    if fitness(x_i) > fitness(x_j) then
6:     Move firefly x_j towards x_i according to Equation (9)
7:    end if
8:    Vary attractiveness with distance r using Equation (8)
9:    Evaluate new solutions and update light intensity
10:   end for
11:  end for
12:  Rank the fireflies and find the current global best g
13:  t = t + 1
14: end while
15: Post-process results and visualisation
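One sweep of the double loop can be sketched in Python (an illustrative sketch under our own naming, not the paper's Matlab code; here each firefly moves towards every brighter one, which is an equivalent phrasing of the pairwise rule):

```python
import numpy as np

def firefly_step(pos, fitness, beta0=1.0, gamma=1.0, alpha=0.2, rng=None):
    """One FA sweep over an (N, d) position array: each firefly moves towards
    every brighter firefly via x <- x + beta0*exp(-gamma*r^2)*(x_b - x) + alpha*eps."""
    rng = np.random.default_rng(rng)
    pos = pos.copy()
    n = len(pos)
    for i in range(n):
        for j in range(n):
            if fitness(pos[j]) > fitness(pos[i]):        # firefly j is brighter
                r2 = np.sum((pos[i] - pos[j]) ** 2)
                beta = beta0 * np.exp(-gamma * r2)       # attractiveness, Eq. (8)
                eps = rng.uniform(-0.5, 0.5, size=pos.shape[1])
                pos[i] = pos[i] + beta * (pos[j] - pos[i]) + alpha * eps
    return pos
```

With the mutation term switched off (α = 0), each move is a convex combination towards a fitter firefly, so on a fitness like −||x||^2 the swarm contracts towards the optimum at the origin.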
In the next section, we consider methods KNN, FKNN, Random forest, Decision tree, and SVM as the different objective functions.

The Proposed Method
In this section, our primary goal is to compare classification results with the hybrid algorithm in the retail credit risk evaluation domain. The hybrid algorithm combines the firefly algorithm with five classification methods: KNN, FKNN, Random Forest, Decision Tree, and SVM. Classification performance is measured by different efficiency measurement tools, with the hybrid algorithms focusing on classification accuracy. First, the dataset is preprocessed; then the proposed algorithm is applied to the preprocessed dataset. The procedure of the proposed method for unbalanced data is as follows: Step 1: The dataset is preprocessed, which includes normalising and balancing it. Some feature values lie in different ranges, and better results depend on feeding appropriately scaled data to the algorithm. Therefore, the dataset is normalised to the interval [0, 1] using

x' = (x − x_MIN) / (x_MAX − x_MIN),

where x_MIN and x_MAX are the minimum and maximum values of each feature. The data is then balanced by the SMOTE method. The performance of the classifier algorithms is also improved by the normalised dataset.
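The min-max normalisation in Step 1 can be sketched as (an illustrative sketch; the guard for constant columns is our own addition, since the formula is undefined when x_MIN = x_MAX):

```python
import numpy as np

def minmax_normalise(X):
    """Scale each feature column of X to [0, 1]: x' = (x - x_min)/(x_max - x_min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid divide-by-zero on constant columns
    return (X - x_min) / span
```

After this step every feature contributes on the same [0, 1] scale, so no single large-range feature dominates the distance computations used by KNN, FKNN, or the RBF kernel.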
Step 2: The algorithm parameters are initialised. Let N be the number of fireflies, Max-iteration be the number of iterations for optimisation, γ be the absorption coefficient, β_0 be the attractiveness of the firefly, α be the mutation coefficient, and f(x_i) be the objective function.
Step 3: Initialise a population of N fireflies' positions at random and set t = 0.
Step 4: The fitness of each firefly is evaluated. This step runs one of the five classifier algorithms on the feature subset encoded by the firefly, and the classifier's accuracy is used as the fitness value.
Step 5: Update the positions of the fireflies by comparing their fitness, then evaluate the new solutions and update the light intensities as follows: let (x_i, x_j) be a pair of fireflies. If fitness(x_i) is greater than fitness(x_j), firefly x_j moves towards x_i according to Equation (9). Otherwise, the attractiveness at distance r is calculated using Equation (8), the new solutions are evaluated, and the light intensity is updated by the objective function.
Step 6: Rank the fireflies, find the current global best, and set t = t + 1.
Step 7: If the termination condition t < Max-iteration holds, repeat Steps 5 and 6; otherwise, return the optimal firefly position and its fitness value.
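The wrapper evaluation in Steps 4-5 can be sketched as follows: a binary mask selects feature columns, and the accuracy of a supplied classifier on those columns serves as the fitness. This is an illustrative sketch; the `classify` callable and the toy 1-NN used in the usage example are our own assumptions, not the paper's Matlab implementation.

```python
import numpy as np

def mask_fitness(mask, X_train, y_train, X_test, y_test, classify):
    """Wrapper fitness of a binary feature mask: accuracy of `classify`
    trained and evaluated on the selected columns only."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0                       # an empty feature subset scores worst
    preds = classify(X_train[:, cols], y_train, X_test[:, cols])
    return float(np.mean(preds == y_test))
```

In the hybrid algorithm, each firefly's position is decoded to such a mask, so fireflies that keep informative features and drop noisy ones receive higher fitness and attract the rest of the swarm.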

Experimental Results
In this section, we explain the dataset and assessment method used in our experiments. Then the classification results and assessments of the proposed algorithm will be illustrated.

Description of the Datasets
The experiments are applied over four real benchmark datasets: the German Credit Data and the South German Credit Dataset from the UCI Repository of Machine Learning [38], the credit card econometrics dataset from Kaggle [39], and the Thomas dataset [40]. Table 1 gives a description of the datasets. For instance, the German credit dataset involves 700 instances of worthy credit customers and 300 instances of unreliable credit customers. It consists of 24 regular features for each applicant that describe vital properties such as personal information, credit background, account balances, loan aims, loan amounts, employment status, etc. The features of the datasets used are shown in Tables 2 and 3.

Evaluation Method
In the interest of evaluating the efficacy of the proposed model, we use accuracy, precision, recall, and F-score for comparisons. In the following, we define these evaluation metrics.
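The four metrics follow directly from the binary confusion matrix; a compact reference implementation (our own sketch, assuming the positive class denotes a bad credit outcome unless stated otherwise):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-score from a binary confusion matrix."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```

On unbalanced data, precision, recall, and F-score on the minority class are more informative than raw accuracy, which is why all four are reported.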

Results and Discussion
The experiments were run on a desktop computer with an Intel Core i7-6700 @ 3.40 GHz processor, 32 GB of memory, and a Windows 10 OS configuration; all algorithms were implemented in Matlab R2018a. The global and optimiser-specific parameter settings are outlined in Table 4. First, for the unbalanced datasets, results were obtained from 10 independent runs of the five classifier algorithms. The results are depicted in Table 5.
Tables 6-9 outline the performance of the hybrid firefly algorithm and different classification on our datasets. In our approach, the binary firefly algorithm is used to select features of the credit datasets, and the classifier is performed with the diverse classifiers of machine learning. In each epoch, the maximum number of function evaluations was used from all data sets. It was 7500 for 15 dimensional problems (15D) with 50 iterations and 15,000 for 25 dimensional problems (25D) with 100 iterations.
The proposed method was evaluated with various classifiers to obtain its best performance (feature selection with the optimised hybrid algorithm). Table 6 shows the accuracy of the hybrid firefly algorithm with the different classifiers on the German credit dataset. HFA-FKNN (the hybrid firefly algorithm with the FKNN classifier) yields better accuracy than the other classifiers. The best accuracy for each dataset is shown in bold in the tables. HFA-FKNN achieves an accuracy of 87.14%, with seven features selected.
Comprehensive studies have been done on the German credit dataset with machine learning algorithms. Table 10 compares the proposed approach with other studies in this area. In this study, the hybrid firefly algorithm achieves higher accuracy than the plain machine learning algorithms, and the combined HFA-FKNN algorithm attains the highest accuracy with fewer attributes. Compared to previous studies, the proposed method is superior.

Conclusion
In financial risk management, the evaluation of credit cards plays a vital role for banks. Banks often need to evaluate customer credit to make decisions in the face of risk and competition, and identifying the influential factors of the credit card can help them. Moreover, feature selection using machine learning algorithms can evaluate these factors with higher accuracy, whereas traditional techniques require restrictive assumptions. Real-world data such as financial data is unbalanced, and machine learning algorithms do not classify it well. Therefore, in this paper the data is first balanced by the SMOTE method, and then the binary firefly algorithm is used to identify the effective features of the credit cards. Besides, this study investigates various machine learning algorithms, namely KNN, Fuzzy KNN, Random Forest, Decision Tree, and SVM, for effective classification of the credit card data (Figure 1).

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors
Zeinab Hassani received the M.Sc. degree in computer science from the Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran, in 2011. She works as an instructor at the Department of Basic Science, Kosar University of Bojnord. Her current research interests include artificial intelligence, neural networks, and computational geometry.