A feature selection method with feature ranking using genetic programming

Feature selection is a data processing method that aims to select effective feature subsets from the original features. Feature selection based on evolutionary computation (EC) algorithms can often achieve better classification performance because of their global search ability. However, feature selection methods using EC cannot get rid of invalid features effectively: a small number of invalid features still remain at the termination of the algorithms. In this paper, a feature selection method using genetic programming (GP) combined with feature ranking (FRFS) is proposed. It is assumed that the more often the original features appear in the GP individuals' terminal nodes, the more valuable these features are. To further decrease the number of selected features, a variant of FRFS using a multi-criteria fitness function, named MFRFS, is investigated. Experiments on 15 datasets show that FRFS can obtain higher classification performance with a smaller number of features than the feature selection method without feature ranking. MFRFS further reduces the number of features while maintaining the classification performance compared with FRFS. Comparisons with five benchmark techniques show that MFRFS can achieve better classification performance.


Introduction
Feature selection (Sreeja, 2019; Too & Abdullah, 2020) selects effective feature subsets from high-dimensional original features and is one of the key issues in machine learning. High-quality features play a key role in building an efficient model, while irrelevant or redundant features may cause difficulties (Xue et al., 2013). Therefore, feature selection methods have been extensively applied to practical classification tasks (Espejo et al., 2010; Liang et al., 2017; Loughran et al., 2017; Mei et al., 2017; Patel & Upadhyay, 2020). Recently, the Gaining Sharing Knowledge based Algorithm (GSK) (Mohamed et al., 2020) was proposed for feature selection and achieved good performance (Agrawal, Ganesh, & Mohamed, 2021a; Agrawal, Ganesh, Oliva, et al., 2022). Evolutionary computation (EC) algorithms (Al-Sahaf et al., 2019; Hancer, 2019) have been widely used for feature selection due to their global search ability (Ma & Gao, 2020b; Xue et al., 2014). EC algorithms are population based and need to initialise individuals randomly. Sometimes, the genetic operators of genetic algorithm (GA) and genetic programming (GP), or the updating strategy of particle swarm optimisation (PSO) (Nagra et al., 2020; Too et al., 2021), are not sufficient to delete invalid features, so a small number of invalid features still remain at the termination of the algorithms.
GP can be used for feature selection due to its global search ability and has been shown to achieve good classification performance. However, as GP evolves, a large number of individuals with the same best fitness are generated. GP only outputs the best individual, which may lose some good individuals. Moreover, the features in the output individual may still contain redundant features.
Feature ranking (Ahmed et al., 2014;Friedlander et al., 2011;Neshatian, 2010) is the ranking of original features based on specific evaluation criteria, which is usually a step of feature selection. It is employed to find out which features or feature sets are more important. So, this paper proposes a GP-based feature selection method combined with feature ranking (FRFS), which considers the large number of best individuals generated during the evolution of GP instead of only one best individual. A correlation-based evaluation criterion is used as the fitness function. It is assumed that the more the original features appear in the GP individuals' terminal nodes, the more valuable these features are. Therefore, the individuals with the best fitness are stored. The occurrence times of features appearing in the individuals' terminal nodes are counted and the top β features are selected as the candidate feature subset. To decrease the number of selected features while maintaining the classification performance, another feature selection method that combines FRFS with a multi-criteria fitness function is investigated. The motivation of this algorithm is to set higher fitness values for the individuals with the same correlation value in the case of smaller feature number.
The overall objective of this paper is to propose a feature selection algorithm that combines feature ranking with a multi-criteria fitness function, and to verify both the improvement brought by feature ranking and the dimensionality-reduction effect of the multi-criteria fitness function. To achieve the overall objective, the following objectives will be investigated.
Objective 1. Propose a feature selection method combined with feature ranking using GP (FRFS) and verify whether it can achieve better classification performance with a smaller number of features than feature selection alone (FS).
Objective 2. Investigate a feature selection method that combines FRFS with a multi-criteria fitness function, named MFRFS, and verify whether the multi-criteria fitness function can reduce the number of features of FRFS while maintaining the classification performance.
The rest of this paper is arranged as follows. Section 2 outlines the background information of this paper. Section 3 describes in detail the three GP-based feature selection methods. Section 4 shows the experimental design. Section 5 presents the experimental results and discussion. Section 6 is the conclusion and future work.

Genetic programming (GP)
GP is one of the evolutionary computation algorithms (Koza, 1992; Koza et al., 1999). It is very similar to the genetic algorithm (GA); the main difference between GP and GA is the representation of individuals. Due to its flexible representation, GP can be used to construct high-level features (Ma & Gao, 2020a, 2020b), construct classifiers (Muni et al., 2006; Neshatian, 2010) and solve practical industrial problems (Bi et al., 2021a, 2021b; Peng et al., 2020). The commonly used representation of GP is based on a tree structure. The terminal nodes (constants and variables) are randomly selected from a terminal set, and the functions (mathematical and logical operators) are randomly selected from a function set to constitute the internal nodes. GP randomly initialises the first generation of individuals and evaluates each individual using a fitness function. Genetic operators including selection, crossover and mutation are then performed to produce the next generation's population. This step is iterated until the termination criterion is reached, and GP outputs the optimal individual. For feature selection, the terminal nodes contain the features selected by GP.
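The tree representation and the extraction of selected features from terminal nodes can be sketched as follows. This is a minimal illustrative sketch, not the ECJ implementation used in the paper; the function set, terminal set, depth limit and growth probability here are placeholder choices.

```python
import random

FUNCTIONS = ['+', '-', '*', '/']            # internal (function) nodes
TERMINALS = ['f0', 'f1', 'f2', 'f3', 'f4']  # original features as terminals

def random_tree(depth, max_depth=4):
    """Grow a random expression tree; leaves are feature terminals."""
    if depth >= max_depth or (depth > 0 and random.random() < 0.3):
        return random.choice(TERMINALS)
    return (random.choice(FUNCTIONS),
            random_tree(depth + 1, max_depth),
            random_tree(depth + 1, max_depth))

def terminal_features(tree):
    """Collect the set of features appearing as terminal nodes."""
    if isinstance(tree, str):
        return {tree}
    _, left, right = tree
    return terminal_features(left) | terminal_features(right)

random.seed(0)
ind = random_tree(0)
selected = terminal_features(ind)  # the feature subset this individual selects
```

For feature selection, only `terminal_features` matters: whatever features survive in the best individual's leaves form the selected subset.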

Correlation
The correlation-based feature evaluation criterion was proposed by Hall (1999); it takes into account both the correlation between features and classes and the correlation between features. This evaluation criterion has been shown to select feature subsets with low redundancy and high discrimination (Hall & Smith, 1999; Ma & Gao, 2020b), and is adopted as the evaluation criterion in this paper. The correlation of a feature subset is calculated as

Correlation_S = k · C_fc / √(k + k(k − 1) · C_ff)    (1)

where Correlation_S is the correlation value of a set S containing k features, C_fc denotes the average correlation between each feature and the class, and C_ff denotes the average correlation between each pair of features. Both are measured by the symmetrical uncertainty in formula (2):

SU(X, Y) = 2 [H(X) − H(X | Y)] / [H(X) + H(Y)]    (2)

where H(X) and H(Y) are the information entropies of X and Y respectively, and H(X | Y) is the conditional entropy.
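Hall's correlation measure can be sketched in a few lines, assuming (as in Hall, 1999) that the feature–class and feature–feature correlations are computed as symmetrical uncertainty over discrete values; the helper names are our own.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(xs):
    """Shannon entropy of a discrete sequence, in bits."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)), using I = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))
    return 2 * (hx + hy - hxy) / (hx + hy) if hx + hy else 0.0

def cfs_merit(features, labels):
    """Hall's correlation-based merit of a feature subset (list of columns)."""
    k = len(features)
    c_fc = sum(symmetrical_uncertainty(f, labels) for f in features) / k
    if k == 1:
        return c_fc
    pairs = list(combinations(features, 2))
    c_ff = sum(symmetrical_uncertainty(a, b) for a, b in pairs) / len(pairs)
    return k * c_fc / math.sqrt(k + k * (k - 1) * c_ff)
```

A perfectly predictive, non-redundant single feature scores 1.0; adding a feature that is independent of the class lowers the merit, which is what drives redundant features out of the subset.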

Feature selection methods based on GP
Due to GP's flexible representation, GP can be used for feature selection. Recent studies have focused on feature selection methods using evolutionary algorithms (Canuto & Nascimento, 2012; Papa et al., 2017; Ribeiro et al., 2012; Tan et al., 2008) due to their global search ability. According to whether a classification algorithm is involved in the fitness function, feature selection is divided into filter-based (Davis et al., 2006; Lin et al., 2008; Neshatian & Zhang, 2009b; Purohit et al., 2010; Ribeiro et al., 2012) and wrapper-based (Hunt et al., 2012) methods. Filter-based feature selection methods use information measures such as mutual information (Vergara & Estévez, 2014), information gain, consistency and correlation (Hall, 1999; Neshatian & Zhang, 2009c) as the evaluation criterion, and need less running time than wrapper-based methods. Moreover, the models produced by filter-based methods are more general, so filter-based feature selection is investigated further in this paper. The fundamental research goal of feature selection is to find the feature subset with the best classification performance and fewer features. Lin et al. (2008) constructed a classifier using layered genetic programming, which had the characteristics of feature selection and feature extraction. Neshatian and Zhang (2012) combined information entropy and conditional entropy to evaluate the correlation between feature sets and classes for feature selection. Purohit et al. (2010) proposed a GP classifier construction method with feature selection and investigated a new crossover operator to discover the best crossover site. Davis et al. (2006) proposed a two-stage feature selection method: in the first stage, a feature subset is selected using GP, and then GP classifiers are evolved using the selected features. Neshatian and Zhang (2009b) proposed a filter-based multiple-objective feature selection method for binary classification tasks.
GP has flexible representation ability. In general, GP-based feature selection methods take the original features as the terminal set and the terminal nodes as the selected features. However, some researchers have investigated other representations. Hunt et al. (2012) developed GP-based hyper-heuristics for adding and removing features; each individual generated by GP is a series of operations on the feature set. Ribeiro et al. (2012) used four feature selection criteria (information gain, chi-square, correlation and odds ratio) as the terminal set of GP and three set operations (intersection, union and difference) as the function set. Viegas et al. (2018) proposed a GP-based feature selection method for skewed datasets and validated the method on four high-dimensional datasets. Papa et al. (2017) used binary strings as individuals' terminal nodes, where 1 means a certain feature is selected and 0 means it is not; AND, OR and XOR are regarded as non-terminal nodes. The output of an individual is also a binary string, which encodes the selected features.

Feature ranking methods based on GP
The goal of feature ranking is to show the importance of the original features under certain evaluation criteria and is often used as a basis of feature selection. Neshatian (2010) proposed a feature ranking method based on GP, which used GP to evolve many weak classifiers and preserved the optimal weak classifiers. The score of each feature is proportional to the fitness value of the weak classifier that contains it. This method needs to run a large number of GPs, which is time-consuming. Friedlander et al. (2011) proposed a feature ranking method based on weight vectors. GP updates the weights for each feature at the end of each generation of evolution. After 10 generations of evolution, terminal nodes are no longer randomly selected during the generation of new subtrees in mutation operation, but are selected according to different probabilities based on feature weights. Ahmed et al. (2014) proposed a two-stage GP-based classification method which first ranked the original features then constructed classifiers. Neshatian and Zhang (2009a) constructed GP programs to measure the goodness of feature subsets using a virtual program structure and an evaluation function.
At present, there has been much research on GP-based feature selection methods, but few studies focus on the large amount of evolutionary information generated during the evolution process. Some researchers have begun to use GP to rank features (Friedlander et al., 2011; Neshatian, 2010), which indicates that the frequency of features appearing in terminal nodes can show the importance of features. Based on this motivation, it should be investigated whether feature ranking can help quantify the importance of features and improve the classification performance, and whether a multi-criteria fitness function can help select more effective features.

Method
In this paper, our proposed methods are based on standard GP, and correlation is chosen as the fitness function. The fundamental purpose of this paper is to investigate whether the feature selection method based on correlation evaluation criteria using GP can effectively get rid of invalid features, and whether feature ranking and a multi-criteria fitness function can improve the classification performance. We have different variants of algorithms to verify the research objectives. FS denotes the feature selection method using GP. FRFS denotes the feature selection method combined with feature ranking. MFRFS denotes the algorithm that combines FRFS with a multi-criteria fitness function. The details of the algorithms are described below.

Feature selection using GP (FS)
This algorithm employs GP as the search algorithm and correlation as the evaluation criterion. Since a smaller fitness value is better in our GP implementation, the fitness function of FS is shown in Formula (3):

Fitness = 1 − Correlation    (3)
where Correlation is the correlation value of the selected features and classes. The output of FS is an individual that contains the optimal feature subset. The terminal nodes are the selected features. The parameter settings of GP are shown in Section 4.1.

FS combined with feature ranking (FRFS)
FS only considers the best individual generated by GP and does not make full use of the information from GP's evolutionary process. As GP converges, there are a large number of individuals with the same best fitness. Our next step, feature ranking, is based on the assumption that the more frequently features occur in these individuals, the more valuable they are. All the individuals with the same best fitness after the jth generation are preserved. The occurrences of features in these individuals' terminal nodes are counted, and the features are ranked according to their occurrence. Our experiments found that GP starts to converge after 30 iterations, so the parameter j is set to 30. Suppose the number of features selected by FS is k. Our goal is to select the top f_best features among the top k ranked features. The pseudo-code of FRFS is shown in Algorithm 1. To determine how many of the k features are best, first f_best = k/n features are tested on the classification algorithms. Then the remaining k − k/n features are added into the f_best feature subset and tested one by one according to their ranking until the classification accuracy begins to decline. The top f_best features are then the selected features. In this paper, the parameter n is set to 3, that is, the top one-third of the k features are initially placed into the f_best feature subset. We start with the top k/n features instead of the single top feature to improve the search efficiency; the value n = 3 was chosen based on our experiments.
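The ranking and forward-addition steps above can be sketched as follows. Here `best_individuals` (each reduced to the list of features in its terminal nodes) and the `accuracy` callback are hypothetical stand-ins for the stored GP individuals and the wrapped learning algorithm.

```python
from collections import Counter

def rank_features(best_individuals):
    """Count how often each feature appears in the terminal nodes of the
    stored best individuals; rank features by occurrence (descending)."""
    counts = Counter(f for ind in best_individuals for f in ind)
    return [f for f, _ in counts.most_common()]

def frfs_select(ranked, k, accuracy, n=3):
    """Start from the top ceil(k/n) ranked features, then add the remaining
    ranked features one by one until classification accuracy declines."""
    f_best = max(1, -(-k // n))          # ceil(k / n)
    subset = ranked[:f_best]
    best_acc = accuracy(subset)
    for f in ranked[f_best:k]:
        acc = accuracy(subset + [f])
        if acc < best_acc:               # accuracy begins to decline: stop
            break
        subset, best_acc = subset + [f], acc
    return subset
```

The early stop means redundant low-ranked features never enter the final subset, which is how FRFS shrinks the subset selected by FS.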

FRFS combined with a multi-criteria fitness function (MFRFS)
During GP evolution, we found that two individuals with the same fitness value are not necessarily the same individual. To give priority to individuals with the same fitness value but fewer features, we use the multi-criteria fitness function in Formula (4). The FRFS method using the multi-criteria fitness function is named MFRFS.

Fitness = (1 − Correlation) + α · f_num    (4)

where f_num is the number of features in an individual and α is the penalty coefficient that trades off Correlation against f_num. The parameter α is discussed in Section 5.4.
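One plausible form of such a multi-criteria fitness can be sketched as below, under the assumption that fitness is minimised and the feature-count penalty enters as a simple weighted term; this is an illustration of the idea, not a verbatim copy of the authors' Formula (4).

```python
def multicriteria_fitness(correlation, f_num, alpha=0.01):
    """Assumed multi-criteria fitness: smaller is better. Among individuals
    with equal correlation, the one with fewer features gets better fitness."""
    return (1.0 - correlation) + alpha * f_num
```

With this shape, two individuals with identical correlation but different sizes are no longer tied: the penalty term breaks the tie in favour of the smaller feature subset.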

Datasets and parameter settings
To verify the effectiveness of our proposed feature selection methods, 15 datasets are collected from the UCI machine learning repository (Dheeru & Karra Taniskidou, 2017). The details of the datasets are shown in Table 1, where #features indicates the number of original features, #instances the number of instances and #classes the number of class labels. The datasets have different numbers of classes and features, and are representative classification problems. A random 7:3 split is used: 70% of each dataset is used as the training set and 30% as the testing set (Ma & Gao, 2020b; Xue et al., 2013). The training set is used to select and rank the original features. The testing set is used to evaluate the classification performance of the selected and ranked features by 10-fold cross-validation. To account for the stochastic nature of GP, 30 independent experiments are run (Ma & Teng, 2019), and the classification accuracy and number of features are averaged over the 30 runs. K-nearest Neighbours (KNN, K = 5), the C4.5 decision tree and Naive Bayes (NB) are used to evaluate the performance of the proposed methods. The ECJ platform (Luke, 2017) is used to run GP. The parameter settings of GP are shown in Table 2 and are commonly used GP parameters (Ma & Gao, 2020b; Ma & Teng, 2019). The function set is composed of four arithmetic operators, and the terminal set is the original features. The population size is set to 500; initial GP experiments indicate that this population size can evolve good solutions. The number of generations is set to 50 and the maximum tree depth to 17; restricting irrelevant and redundant features can also reduce the depth of GP programs. Mutation and crossover probabilities are 10% and 90% respectively, which balance exploration and exploitation during the evolution. The parameter α in Formula (4) is set to 0.01, as discussed in Section 5.4.
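The evaluation protocol (a random 7:3 split and averaging over 30 independent runs) can be sketched as follows; `run` is a hypothetical callback that performs one seeded experiment and returns (accuracy, number of features).

```python
import random

def split_70_30(instances, seed):
    """Random 7:3 split: 70% of instances for training, 30% for testing."""
    rng = random.Random(seed)
    idx = list(range(len(instances)))
    rng.shuffle(idx)
    cut = int(0.7 * len(idx))
    return [instances[i] for i in idx[:cut]], [instances[i] for i in idx[cut:]]

def average_over_runs(run, n_runs=30):
    """Repeat an experiment independently and average accuracy and #features,
    as done over the paper's 30 independent GP runs."""
    results = [run(seed) for seed in range(n_runs)]
    accs = [a for a, _ in results]
    feats = [f for _, f in results]
    return sum(accs) / n_runs, sum(feats) / n_runs
```

Seeding each run separately keeps the 30 experiments independent while remaining reproducible.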

ReliefF
ReliefF (Kira & Rendell, 1992) is an extension of the Relief algorithm. Relief can only be used for binary classification problems, while ReliefF can also handle multi-class problems. Both are feature weighting algorithms, which assign weights to features according to the correlation between each feature and the class; features with a weight below a certain threshold are deleted. The weight of a feature is proportional to its classification ability. In ReliefF, a sample is randomly selected from the dataset, then the k nearest neighbour samples are selected from each class. The weights of all features are updated, and this operation is repeated n times. Finally, the final weight of each feature is obtained and the features are ranked according to their weights. To facilitate comparison with the proposed MFRFS, ReliefF selects the same number of features as MFRFS on all datasets.
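The weight-update idea can be sketched for the simpler binary Relief case; this illustrative version uses one nearest hit and one nearest miss per sampled instance rather than ReliefF's k neighbours per class, and assumes both classes are present.

```python
import random

def relief_weights(X, y, n_iter=100, seed=0):
    """Simplified binary Relief: reward features that differ on the nearest
    miss (other class) and penalise those that differ on the nearest hit."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        m = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(n_feat):
            w[f] += abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])
    return w
```

A feature that separates the classes accumulates positive weight; a feature that varies independently of the class drifts toward zero or below.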

LFS
LFS (Gutlein et al., 2009) starts from an empty set and adds features one by one, restricting the number of features considered in each step of the forward selection. When the evaluation criterion shows that performance remains unchanged or decreases after adding a feature, LFS terminates. LFS has been shown to be faster than standard forward selection and to obtain good results (Gutlein et al., 2009). To compare fairly with the proposed MFRFS method, correlation is also used as the evaluation criterion.
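The restricted forward search can be sketched as follows; `score` is a hypothetical subset-evaluation callback (correlation in this paper) and `limit` plays the role of LFS's cap on the number of candidate features considered per step.

```python
def linear_forward_selection(features, score, limit=5):
    """LFS-style search: at each step only the `limit` best-scoring unused
    features are considered; stop when no addition improves the score."""
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        # restrict candidates to the top-`limit` features by individual score
        candidates = sorted(remaining, key=lambda f: score([f]), reverse=True)[:limit]
        gains = [(score(selected + [f]), f) for f in candidates]
        new_best, f = max(gains)
        if new_best <= best:             # no improvement: terminate
            break
        selected.append(f)
        remaining.remove(f)
        best = new_best
    return selected
```

The candidate cap is what makes LFS cheaper than standard forward selection, at the cost of occasionally missing a feature that only helps in combination.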

GSK related methods
Gaining-sharing knowledge-based optimisation algorithm (GSK) (Agrawal, Ganesh, & Mohamed, 2021b) is a human-related metaheuristic algorithm. The basic principle of GSK is to simulate the process of human gaining and sharing knowledge in the life span. Recently, GSK has been used for feature selection and proved to be effective, so three GSK-related methods including BGSK (Agrawal, Ganesh, & Mohamed, 2021b), pBGSK (Agrawal, Ganesh, & Mohamed, 2021b) and PbGSK-V 4 (Agrawal, Ganesh, Oliva, et al., 2022) were selected as benchmarks for comparison.

Experimental results and discussion
To verify the effectiveness of our proposed methods, four experiments are done. (1) FRFS is compared with two baselines, i.e. original features (F0) and FS, to verify whether FRFS can achieve higher classification accuracy with fewer features. (2) MFRFS is compared with FRFS to verify whether the multi-criteria fitness function used by MFRFS can maintain the classification performance and reduce the number of features. (3) MFRFS is compared with five benchmarks, i.e. ReliefF, LFS, BGSK, pBGSK and PbGSK-V4, to verify whether MFRFS can achieve better classification performance in terms of accuracy and number of features. (4) The parameter α used in MFRFS is discussed to determine an appropriate setting.
To justify the significance of the various feature selection methods, the performance differences between paired feature selection methods are shown in the boxplots in Figures 1 and 2. The post-hoc analysis for the Friedman test (Hollander & Wolfe, 1999), abbreviated as the Post-Friedman test, is also marked on the boxplots using different colours. If the Post-Friedman test achieves p-value ≤ 0.05 or 0.05 < p-value ≤ 0.4, the box plots are marked green or yellow respectively, indicating that the performance differences between the paired methods are significantly different or borderline significantly different. Figures 1 and 2 thus show whether one of the paired methods is significantly better than the other.

Comparison between FRFS and F0, FS
In this section, experiments comparing FRFS with F0 and FS were done and the results are shown in Table 3. In the table, A-KNN, A-C4.5 and A-NB represent the average classification accuracy over 30 independent experiments using KNN, C4.5 and NB respectively. F-KNN, F-C4.5 and F-NB denote the average number of selected features over the 30 experiments using KNN, C4.5 and NB respectively. The box plots of performance differences between FRFS and the two baselines, marked by the Post-Friedman test, are shown in Figure 1. From the Friedman significance test, we can see that there are significant differences between FRFS and the two baselines: FRFS is significantly better than F0 in all learning algorithms and borderline significantly better than FS in all learning algorithms.
As shown in Table 3, FRFS achieves a smaller number of features than F0 and FS in all learning algorithms. FRFS achieves better classification accuracy than F0 and FS in C4.5 on 12 out of 15 datasets, in NB on 11 out of 15 datasets, and in KNN on 10 out of 15 datasets.
FS can select several features which have stronger discrimination ability than the original features on datasets such as audiology, planning-relax, seismic-bumps, statlog-image-segmentation and thoracic-surgery. However, compared with FS, FRFS can further reduce the number of features while maintaining the classification accuracy, which shows that the feature subset selected by FS is not optimal. For example, on the high-dimensional Lymphoma dataset, FS not only selects about 20 features from the original 4026 features but also improves the classification accuracy; FRFS then selects about 10 of the 20 features chosen by FS and improves the classification accuracy further. We obtain similar results on the robot-failure-lp4 dataset, where the number of features selected by FRFS is reduced to nearly 50% of that selected by FS.
The experiments in this section confirm our assumption, that is, the more often the original features appear in the GP individuals' terminal nodes, the more valuable these features are. Based on this assumption, feature ranking can help a feature selection method further reduce the number of features while maintaining the classification accuracy.

Comparison between FRFS and MFRFS
In this section, MFRFS is compared with FRFS. The experimental results are shown in Table 4. As shown in Table 4, MFRFS can further reduce the number of features compared with FRFS, and can maintain or even improve the classification accuracy in the C4.5, KNN and NB learning algorithms on 12, 9 and 8 datasets respectively.
MFRFS further reduces the number of features by more than 50% on the arrhythmia dataset. Compared with FRFS on Lymphoma, MFRFS not only selects one-third as many features but also improves the classification accuracy, especially with the NB learning algorithm. As shown in the experimental results, a single feature is obtained by MFRFS on the climate-simulation-crashes, planning-relax and thoracic-surgery datasets, and it is enough to obtain equal or higher classification accuracy than FRFS.
The experimental results show that the multi-criteria fitness function can restrict redundant features from appearing as GP's terminal nodes, decreasing the number of selected features while maintaining the classification performance. However, an appropriate value of the parameter α in Formula (4) should be set; the effect of α is discussed in Section 5.4.

Comparison between MFRFS and benchmarks
In this section, MFRFS is compared with five benchmark techniques, i.e. ReliefF, LFS, BGSK, pBGSK and PbGSK-V4. To make a fair comparison with our proposed method, in the GSK-related methods the parameter MAXNFE is set to 25,000, and the number of generations is set to 500 when the dimension ≤ 20 and to 250 when the dimension > 20. The experimental results are shown in Table 5. The box plots of performance differences between MFRFS and the benchmarks are shown in Figure 2.

Comparison between MFRFS and ReliefF
Like MFRFS, ReliefF also ranks the features. In the ReliefF experiments, we select the same number of features as MFRFS to verify whether MFRFS can achieve higher classification accuracy when the two methods use the same number of features. On the statlog-image-segmentation dataset, both MFRFS and ReliefF select three features, but the accuracy of MFRFS is 10% higher than that of ReliefF. MFRFS and ReliefF achieve comparable classification accuracy with only one feature on the climate-simulation-crashes, seismic-bumps, thoracic-surgery and planning-relax datasets. On the arrhythmia, hepatitis, robot-failure-lp4, robot-failure-lp5, spambase and statlog-heart datasets, MFRFS achieves better classification accuracy than ReliefF with the same number of features. From the experimental results in Table 5, we can see that MFRFS is more robust than ReliefF. The box plots marked by the Post-Friedman test in Figure 2 show that MFRFS is significantly better than ReliefF in the KNN and NB learning algorithms, and borderline significantly better in C4.5. In general, the experimental results show that MFRFS obtains better classification results than ReliefF.

Comparison between MFRFS and LFS
The comparison results between MFRFS and LFS are shown in Table 5. In general, LFS can achieve comparable or better classification accuracy than MFRFS; the box plots in Figure 2 show that MFRFS is borderline significantly better than LFS in the KNN and NB learning algorithms. However, LFS selects more features than MFRFS. On the arrhythmia and audiology datasets, the number of features selected by LFS is nearly four times that of MFRFS; yet on audiology, LFS does not obtain better classification accuracy with its additional features. On the robot-failure-lp4 and robot-failure-lp5 datasets, LFS selects nearly six times as many features as MFRFS. On the Lymphoma dataset, LFS even uses more than 20 times as many features as MFRFS, and when NB is used as the learning algorithm, the accuracy of LFS is 10% lower than that of MFRFS. On the other datasets, MFRFS selects a smaller number of features and achieves comparable or better classification accuracy than LFS. The experimental results show that LFS's dimensionality reduction performance is weak: the features selected by LFS still contain irrelevant and redundant features, which confirms that some invalid features are not helpful for classification. Our proposed MFRFS can select more effective features while maintaining the classification accuracy.

Comparison between MFRFS and GSK related methods (BGSK, pBGSK, PbGSK-V 4 )
Compared with BGSK, MFRFS selects a smaller feature subset and achieves better classification performance in most cases in all learning algorithms. On the higher-dimensional Lymphoma dataset, BGSK cannot reduce the feature dimensionality well and selects more than 1300 features. We obtain similar results with pBGSK: in most cases a larger feature subset is selected, especially on the Lymphoma dataset. The box plots marked by the Post-Friedman test in Figure 2 show that MFRFS is borderline significantly better than pBGSK in the KNN learning algorithm and significantly better than pBGSK in the NB learning algorithm.
As shown in Table 5, PbGSK-V4 selects the minimum number of features among the six methods. However, it cannot achieve good classification performance, especially on the audiology, robot-failure-lp5, statlog-image-segmentation and spambase datasets, where an average of fewer than 2 features is selected but the classification accuracy is very low. Compared with MFRFS, PbGSK-V4 reduces the number of features, but in most cases the overall classification accuracy of MFRFS is better. Figure 3 shows the convergence behaviours of the four EC-based methods on the arrhythmia and spambase datasets. As shown in Figure 3, MFRFS starts to converge in about 10 generations and obtains good solutions, while the GSK-related methods need more generations to converge.

The parameter setting of α in MFRFS
The parameter α in Formula (4) is used to restrict redundant and irrelevant features from appearing as GP's terminal nodes and to further reduce the number of selected features. Section 5.2 showed that MFRFS can further reduce the number of features compared with FRFS. To show the impact of α on the experimental results and to set an appropriate value, different values of α are tested; the results are shown in Table 6. As shown in Table 6, the number of features selected by MFRFS is inversely related to α: the smaller the value of α, the more features are selected.
When α = 0.001, the classification performance of MFRFS is similar to FRFS, and α has little effect on the classification performance. When α = 0.005, it has a certain dimensionality reduction effect. When α = 0.01, Table 6 shows the best overall performance. When α = 0.025, the number of features further decreases, but the average classification accuracy starts to decline to some extent. When α = 0.05 and 0.1, the number of features is reduced substantially, but the accuracy also declines considerably. Therefore, the parameter α has a great impact on the classification accuracy: if α is too small, it cannot achieve dimensionality reduction compared with FRFS; if α is too large, it removes effective features and reduces the classification accuracy. Through the experiments, α = 0.01 meets the purpose of decreasing the number of selected features while maintaining the classification performance.

Further discussion
In this paper, FRFS, a feature selection method combined with feature ranking using GP, is proposed to select a smaller number of features compared with feature selection alone (FS). Based on a multi-criteria fitness function, MFRFS is proposed to obtain a smaller number of features compared with FRFS. Here are some further discussions.
(1) Evolutionary computation algorithms will produce a large number of individuals with the best fitness. If only one of the best individuals is used for feature selection, some good individuals may be lost. In this paper, GP individuals after the 30th generation with the best fitness are preserved. The features are ranked according to their occurrence times in these individuals. Experiments show that feature ranking can help select more effective features.
To further demonstrate how feature ranking works, we take the arrhythmia dataset as an example. The results of feature ranking are shown below: a total of 12 best features were selected. The value before the "-" is the index of a feature, and the value after the "-" is the number of times that feature occurs across all the best individuals.
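The ranking step itself reduces to counting feature occurrences across the preserved best individuals. The following is a minimal sketch with made-up terminal-node contents (the feature indices are hypothetical, not the arrhythmia results):

```python
from collections import Counter

# Hypothetical terminal-node contents of the preserved best GP individuals;
# each inner list holds the feature indices used by one individual.
best_individuals = [
    [3, 7, 3, 12],
    [7, 12, 5],
    [3, 12, 7, 7],
]

# Count how often each feature index appears across all best individuals.
counts = Counter(f for ind in best_individuals for f in ind)

# Rank features by occurrence, most frequent first ("index-count" pairs).
ranking = counts.most_common()
print(ranking)  # [(7, 4), (3, 3), (12, 3), (5, 1)]
```

Each pair corresponds to one "index-count" entry of the ranking described above.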
According to the method described in Section 3.2, the classification accuracies for different numbers of top-ranked features from the f_best feature subset, on both the training and testing sets, are shown in Table 7. On the training set, the classification accuracy first increases and then decreases as the number of top-ranked features grows; it is highest when the number of top-ranked features is 9, 10 and 9 for KNN, C4.5 and NB, respectively.
On the testing set, Table 7 likewise reports the classification accuracies of the corresponding ranked features for the KNN, C4.5 and NB learning algorithms; for C4.5 and NB, feature ranking achieves the highest classification accuracy on the corresponding ranked features.
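The search over "top-k" prefixes of the ranking can be sketched as follows. Here `evaluate` is a placeholder standing in for training a learner (e.g. KNN) on the chosen feature subset and returning its accuracy; both the ranked indices and the accuracy values are hypothetical:

```python
# Placeholder for training a classifier on the given feature subset and
# returning its accuracy; the hard-coded values are purely illustrative.
def evaluate(feature_subset):
    scores = {1: 0.60, 2: 0.66, 3: 0.71, 4: 0.69, 5: 0.65}
    return scores[len(feature_subset)]

ranked = [7, 3, 12, 5, 9]  # hypothetical feature indices, best-ranked first

# Evaluate each prefix of the ranking and keep the size with highest accuracy.
best_k, best_acc = max(
    ((k, evaluate(ranked[:k])) for k in range(1, len(ranked) + 1)),
    key=lambda t: t[1],
)
print(best_k, best_acc)  # 3 0.71
```

This mirrors the "accuracy first increases and then decreases" pattern observed in Table 7: the prefix size at the peak is the number of top-ranked features kept.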
(2) Feature ranking is based on the assumption that the more often the original features appear in the GP individuals' terminal nodes, the more valuable those features are. The experiments support this assumption: GP can be used to quantify the importance of features and to select more important feature sets.
(3) MFRFS restricts the number of irrelevant and redundant features in GP's terminal nodes through a multi-criteria fitness function. By adjusting the parameter α, MFRFS selects a smaller number of features than FRFS while maintaining classification accuracy.
(4) In general, MFRFS achieves better classification performance than the five benchmarks. However, MFRFS must store the individuals with the best fitness, rank the features by how often they appear in those individuals' terminal nodes, and then search the ranked features for a smaller subset with higher classification performance. MFRFS therefore has no advantage in running time or computational complexity, especially on datasets with more features. In practical applications, a balance must be found between classification performance and computational cost.
(5) Feature selection is a key data preprocessing step that can alleviate "the curse of dimensionality". The proposed MFRFS can greatly reduce the dimensionality of the feature space and obtain better classification performance. Most importantly, the feature ranking method in this paper makes it possible to identify which features are more important. In industrial applications, our methods may greatly reduce a model's training time and improve classification performance.

Conclusion and future work
This paper proposes a feature selection method with feature ranking (FRFS) using GP, which ranks features according to the number of their occurrences in the individuals with the best fitness. Building on FRFS, another feature selection method using a multi-criteria fitness function (MFRFS) is proposed to further reduce the number of selected features. Experiments on 15 datasets show that FRFS achieves better classification performance with a smaller number of features than FS. Compared with FRFS, MFRFS further removes redundant and irrelevant features and selects a smaller number of features while maintaining classification performance. Compared with five other existing feature selection methods, MFRFS achieves better classification performance.
Feature selection is important for building an efficient classification model, and feature ranking has proved effective for feature selection. The fitness function of the methods in this paper is based on correlation; whether the methods remain effective under other evaluation criteria needs further investigation. Moreover, the dimensionality of the selected datasets is relatively low. In future work, feature ranking methods will be investigated on high-dimensional datasets.

Disclosure statement
No potential conflict of interest was reported by the author(s).