An ensemble feature selection method for high-dimensional data based on sort aggregation

ABSTRACT With the rapid development of the Internet, big data is now used in a wide range of applications. However, high-dimensional data often contain redundant or irrelevant features, so feature selection is particularly important. Because the feature subset obtained by a single feature selection method may be biased, this paper proposes SA-EFS, an ensemble feature selection method based on sort aggregation, oriented to classification tasks. For high-dimensional data sets, the results of three feature selection methods, the chi-square test, the maximum information coefficient, and XGBoost, are aggregated by a specific strategy, and the effects of the arithmetic-mean and geometric-mean aggregation strategies on the model are analyzed. To evaluate the classification and prediction performance of the selected feature subsets, three well-performing classifiers, KNN, Random Forest, and XGBoost, are tested, and the influence of the selection threshold on classification performance is analyzed. The experimental results show that, compared with single feature selection methods, arithmetic-mean aggregation ensemble feature selection can effectively improve classification accuracy, and a threshold interval of 0.1 is a good choice.


Introduction
With the rapid development of the Internet and information technology, the scale of data that various industries must process has grown continuously, bringing problems such as the 'curse of dimensionality'. Feature selection is a critical step in data preprocessing and an important research topic in data mining and in machine learning tasks such as classification.
Feature selection reduces the feature dimension and improves classification accuracy and efficiency by deleting irrelevant and redundant features from data sets. It also helps denoise the data and prevents machine learning models from over-fitting (Chandrashekar & Sahin, 2014).
Feature selection usually works in the search space composed of all combinations of data features: a feature-subset search algorithm looks for a subset of features that is highly correlated with the pattern recognition problem at hand (such as a classification learning problem), and the obtained optimal feature subset is then used to improve the recognition performance of the learning algorithm. According to the feature-subset evaluation strategy, feature selection algorithms can be divided into the following types: Filter, Wrapper, Embedded, and Hybrid, along with the ensemble methods (Yang, 2017) developed in recent years. Ensemble learning is an effective machine learning technique whose core idea is to achieve better results by combining different learning models; among many machine learning methods, ensemble methods are often superior to single models. Ensemble feature selection, which has developed rapidly in recent years, draws on this idea. Unlike conventional feature selection methods, which seek only a single optimal feature subset, the goal of ensemble feature selection is to obtain multiple optimal feature subsets and aggregate the learning results based on them. Thanks to this integration, ensemble feature selection algorithms have better stability and robustness than other feature selection algorithms when dealing with high-dimensional data that admit multiple optimal feature subsets.
CONTACT Jie Wang wangjie@cnu.edu.cn; Chengan Zhao zhaochengan@cnu.edu.cn
This paper proposes an ensemble feature selection method based on sort aggregation (SA-EFS), oriented to the classification task. On a high-dimensional data set, feature selection methods (the chi-square test, the maximum information coefficient, and XGBoost) first produce multiple feature subsets, each sorted by feature importance. After the ranking results are normalized, importance weights are obtained for the features, yielding candidate sets for the optimal feature subset. On this basis, the results of the multiple candidate sets are aggregated according to different rules, such as the arithmetic mean and the geometric mean, to obtain the optimal feature subset. Finally, three well-performing classification models, KNN, RF (Random Forest), and XGBoost, are tested, and the area under the receiver operating characteristic curve (AUC) (Yang, 2017; Zomaya, 2013) is used to evaluate the classification performance of the different prediction models. Because the sort-aggregation method integrates the results of multiple feature selection methods and normalizes and aggregates the feature importances, it can effectively avoid the low-performance prediction models caused by one or a few feature selection methods mis-selecting feature subsets. Experimental results on three UCI machine learning data sets show that the SA-EFS method fully exploits the strengths of the different feature selection methods: the aggregated feature subsets are more accurate, and higher predicted AUC values are obtained with the KNN, RF, and XGBoost classification algorithms.

Related works
Feature selection is an important step in the data mining preprocessing stage. Many domestic and foreign scholars have carried out extensive research on Filter, Wrapper, Embedded, Hybrid, and Ensemble Feature Selection.
At present, multiple optimal feature subsets are mainly obtained from two perspectives: samples and features (Yang, 2017). From the sample perspective, ensemble feature selection first samples or partitions the training data in some form and then obtains multiple feature subsets based on the partitioned samples. From the feature perspective, ensemble feature selection applies a feature-subset search algorithm to obtain multiple optimal feature subsets directly (Zomaya, 2013).
One commonly used ensemble feature selection approach integrates base classifiers trained on multiple feature subsets. For example, Kim et al. proposed an EA-based Ensemble Classifier (2008), which aims to optimize the generalization performance of the classification: a genetic algorithm optimizes the combination of multiple filter feature selectors and multiple base classifiers and integrates them into a strong classifier. Bolon-Canedo et al. used the multiple optimal feature subsets selected by various filter methods to construct classifiers and integrate them, proposing two kinds of Ensemble of Filters for gene microarray data (2014). Another type of ensemble feature selection selects the best features from multiple feature subsets and integrates them into an optimal feature set. Song, Ni, and Wang (2013) proposed a feature selection algorithm based on cluster analysis (FAST), which clusters a minimum spanning tree built from feature-feature and feature-class relationships and selects from each tree cluster the features most relevant to the class.
These selected features constitute the optimal feature subset. Jiang, Jiang, and Yu (2018) proposed a feature selection method based on sort integration. The method first obtains feature ranking sequences using three feature selection methods, including ReliefF, then sums the weights each method assigns to a feature as that feature's total weight and selects the final feature set accordingly. Shi et al. (2018) proposed a Bagging-based ensemble learning algorithm driven by the accuracy and diversity of feature selection, and constructed an ensemble of feature selectors from heterogeneous learners to achieve effective selection of the main factors of electricity theft. Du et al. (2019) introduced the idea of ensemble learning into feature selection and applied it to three different feature selection methods: each method produces a sequence sorted by feature importance, and the different results are aggregated into a final sequence.

Chi-square test feature selection
The chi-square test is a widely applicable hypothesis-testing method for count data. It belongs to the category of nonparametric tests and is mainly used to compare two or more sample rates (composition ratios) and to analyze the correlation of two categorical variables. The basic idea is to measure the degree of agreement, or goodness of fit, between the theoretical frequencies and the actual frequencies. The statistic is \[ \chi^2 = \sum \frac{(A - E)^2}{E}, \] where A is the actual (observed) frequency and E is the theoretical (expected) frequency.
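As a minimal sketch of this statistic (a toy illustration, not the paper's implementation), the chi-square score of one categorical feature against the class labels can be computed from its contingency table:

```python
import numpy as np

def chi_square_score(feature, labels):
    """Chi-square statistic between a categorical feature and class labels.

    For each cell of the contingency table, accumulate (A - E)^2 / E,
    where A is the observed count and E the expected count under independence.
    """
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    n = len(labels)
    score = 0.0
    for f in np.unique(feature):
        for c in np.unique(labels):
            A = np.sum((feature == f) & (labels == c))          # observed count
            E = np.sum(feature == f) * np.sum(labels == c) / n  # expected count
            score += (A - E) ** 2 / E
    return score

# Toy example: a feature perfectly aligned with the class gets a high score,
# while an uninformative feature scores zero.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
informative = np.array([0, 0, 0, 0, 1, 1, 1, 1])
uninformative = np.array([0, 1, 0, 1, 0, 1, 0, 1])
print(chi_square_score(informative, y))    # → 8.0
print(chi_square_score(uninformative, y))  # → 0.0
```

Ranking features by this score (higher means stronger association with the class) gives the chi-square feature ordering used as one of the three base selectors.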

Feature screening of maximum information coefficient
The maximum information coefficient (MIC) is developed on the basis of mutual information; it can quickly detect various types of associations and evaluate functional relationships. MIC is suitable for exploring the potential relationships between pairs of variables in big data sets, and it is both equitable and general. Given a finite data set D of ordered pairs <X, Y> with sample size m, the x-y plane is divided into small cells by the x and y values; a division into i columns and j rows is called an i × j grid G, and all points of D fall into the cells of G (Reshef et al., 2011). The maximum mutual information between X and Y when D is partitioned by an i × j grid is defined as (Equation 3) \[ I^*(D, i, j) = \max_G I(D|G), \] where the maximum is taken over all i × j grids G and I(D|G) is the mutual information of the partition of D induced by G. For a data set D containing the two variables X and Y, the characteristic matrix of X and Y is defined as (Reshef et al., 2011) \[ M(D)_{i,j} = \frac{I^*(D, i, j)}{\log \min\{i, j\}}. \] Restricting the grid size to i × j < B(m), with B(m) = m^{0.6}, the maximum information coefficient of X and Y is defined as (Reshef et al., 2011) \[ \mathrm{MIC}(D) = \max_{i \times j < B(m)} M(D)_{i,j}, \] where i × j < B(m) limits the dimensions of the grid G. Clearly, 0 ≤ MIC ≤ 1. Since mutual information between random variables is symmetric, the maximum information coefficient is also symmetric: MIC(X, Y) = MIC(Y, X).
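The definitions above can be sketched in a few lines of Python. Note a labeled simplification: the true MIC maximizes mutual information over all i × j grids, whereas this sketch only tries quantile (equal-frequency) bins for each grid size, so it is an approximation for illustration, not the reference algorithm (for which libraries such as minepy exist):

```python
import numpy as np

def mutual_information(x_bins, y_bins):
    """Mutual information I(X;Y) in bits from two discretized variables."""
    mi = 0.0
    for xv in np.unique(x_bins):
        for yv in np.unique(y_bins):
            pxy = np.mean((x_bins == xv) & (y_bins == yv))
            px = np.mean(x_bins == xv)
            py = np.mean(y_bins == yv)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

def mic_approx(x, y):
    """Approximate MIC: over grid sizes with i*j < B(m) = m**0.6, bin both
    variables by quantiles (a simplification of the grid search), normalize
    the mutual information by log2(min(i, j)), and keep the maximum."""
    m = len(x)
    B = m ** 0.6
    best = 0.0
    for i in range(2, int(B) + 1):
        for j in range(2, int(B) + 1):
            if i * j >= B:
                continue
            xb = np.searchsorted(np.quantile(x, np.linspace(0, 1, i + 1)[1:-1]), x)
            yb = np.searchsorted(np.quantile(y, np.linspace(0, 1, j + 1)[1:-1]), y)
            best = max(best, mutual_information(xb, yb) / np.log2(min(i, j)))
    return best

# A deterministic functional relation scores near 1; independent noise scores low.
rng = np.random.default_rng(0)
x = rng.random(200)
noise = rng.random(200)
print(mic_approx(x, 2 * x) > mic_approx(x, noise))  # → True
```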

XGBoost feature selection and classification principle
To improve the efficiency of generating new trees during training, XGBoost assigns an importance score to each feature in each iteration. This score shows how much each feature contributes to model training and provides a basis for building the next tree along the gradient direction in the following iteration. These importance statistics can be used directly as the basis for feature selection.
In this paper, we first classify XGBoost based on all features, then get the importance of feature variables (FI) and sort them in descending order based on the information in the generated model process. Finally, we input the filtered features into the classifier to construct the prediction model.
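This rank-then-filter step can be sketched as follows. For portability, scikit-learn's GradientBoostingClassifier stands in for XGBoost here (an assumption for illustration only; with the xgboost package, XGBClassifier exposes the same `feature_importances_` attribute), and the data and the threshold alpha are synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: only features 0 and 1 carry signal, the rest are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fit on all features, then read off the learned feature importances.
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Sort features by importance (descending) and keep the top alpha fraction.
alpha = 0.3
order = np.argsort(model.feature_importances_)[::-1]
selected = sorted(order[: int(alpha * X.shape[1])].tolist())
print(selected)  # the informative features should appear in the selection
```

The selected columns would then be fed to the downstream classifier to construct the prediction model.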
As a classifier, XGBoost achieves accurate classification by iteratively combining weak classifiers (Torlay, Perrone-Bertolotti, Thomas, & Baciu, 2017). It supports both weak classification and weak regression models and is also suitable for building regression models. Because of its fast computation, good model performance, and excellent efficiency in practice, it has been widely praised in academia.
The model representation is shown in Formula 6: \[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \] where K denotes the number of trees, \(\mathcal{F}\) denotes the space of all possible regression trees, and \(f_k\) denotes a specific regression tree; the model thus consists of K regression trees. The goal of XGBoost is to build K regression trees so that the predicted value of the tree ensemble is as close as possible to the true value (accuracy) while having the greatest generalization ability. The objective function is defined in Formula 7: \[ \mathrm{Obj} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k). \] The objective function consists of two parts: the first is the loss function (the training error) and the second is the regularization term, which measures the complexity of each tree, defined in Formula 8 as \[ \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2, \] where T is the number of leaves and w the vector of leaf weights. At iteration t, the training objective can then be written as Formula 9: \[ \mathrm{Obj}^{(t)} = \sum_{i} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t). \]

Forecast model evaluation indicators
When we study a two-class problem in data set prediction, a prediction can have four different results, as shown in Table 1. Column P indicates that the prediction is a positive-class sample and column N that the prediction is a negative-class sample; row p indicates that the sample is actually positive, and row n that it is actually negative. A good prediction should have large values on the main diagonal of the matrix and small values off it. The true positive rate (TPR) is the number of instances correctly predicted to be positive divided by the total number of instances whose actual class is positive: TPR = TP / (TP + FN).
The false positive rate (FPR) is the number of instances incorrectly predicted to be positive divided by the total number of instances whose actual class is negative: FPR = FP / (FP + TN).
The area under the receiver operating characteristic (ROC) curve is the AUC value (Huang & Ling, 2005), which captures the relationship between the true positive rate and the false positive rate of the predictive model, as shown in Figure 1. The x-axis is the false positive rate and the y-axis the true positive rate; each point on the curve corresponds to a threshold. For a classification method, each threshold yields one TPR and one FPR, and the ROC curve is obtained by sweeping the threshold of the prediction model. A better predictive model has a lower false positive rate and a higher true positive rate, so its ROC curve lies closer to the point (0, 1), that is, near the upper left corner. The main diagonal represents a model making random predictions.
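The threshold-sweeping construction of the ROC curve and its area can be sketched directly (a didactic implementation; in practice a library routine such as sklearn.metrics.roc_auc_score would be used):

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC by sweeping thresholds: at each cut-off compute TPR = TP / P and
    FPR = FP / N, then integrate TPR over FPR with the trapezoid rule."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    P = np.sum(y_true == 1)
    N = np.sum(y_true == 0)
    tpr, fpr = [0.0], [0.0]
    for t in np.sort(np.unique(scores))[::-1]:   # thresholds, high to low
        pred = scores >= t
        tpr.append(np.sum(pred & (y_true == 1)) / P)
        fpr.append(np.sum(pred & (y_true == 0)) / N)
    auc = 0.0
    for k in range(len(tpr) - 1):                # trapezoid rule over the curve
        auc += 0.5 * (tpr[k] + tpr[k + 1]) * (fpr[k + 1] - fpr[k])
    return auc

# A perfect ranker reaches the (0, 1) corner and has AUC 1;
# a fully reversed ranker has AUC 0; random scoring gives about 0.5.
y = np.array([0, 0, 1, 1])
print(roc_auc(y, np.array([0.1, 0.2, 0.8, 0.9])))  # → 1.0
print(roc_auc(y, np.array([0.9, 0.8, 0.2, 0.1])))  # → 0.0
```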

Overall framework
This paper proposes an ensemble feature selection method based on sort aggregation (SA-EFS), which first uses different feature selection methods to obtain candidate sets of multiple optimal feature subsets. Then, according to different rules, the learning results of the multiple candidate sets are aggregated to obtain the optimal feature subset. Finally, three classification algorithms with good performance are used to verify the proposed method. The overall framework of SA-EFS feature selection is shown in Figure 2. The specific steps of the method are as follows:
(1) Use the chi-square method, the maximum information coefficient method, and the XGBoost method to select features, sort the features by importance, and obtain multiple sorted optimal feature subsets FS_1, FS_2, ..., FS_t.
(2) For each feature at position j in FS_i, normalize its importance as (n - j)/n (there are n features in total), obtaining the feature weight set of the i-th feature selection method W_i = {w_i1, w_i2, ..., w_in}.
(3) According to a chosen aggregation strategy (such as the arithmetic mean or the geometric mean), compute the total weight of each feature across FS_1, FS_2, ..., FS_t, sort the n features by total weight, and obtain the sorted feature sequence W.
(4) Based on the threshold α, select the top α% of features from the sorted sequence to form the optimal feature subset.
(5) Based on the optimal feature subset, verify the performance of SA-EFS using a variety of well-performing classifier algorithms.
The detailed process of the SA-EFS algorithm is as follows (Table 2):
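The normalization and aggregation steps above can be sketched as follows (a simplified illustration of the scheme with hypothetical base rankings, not the paper's implementation):

```python
import numpy as np

def sa_efs(rankings, alpha=0.3, strategy="arithmetic"):
    """Sketch of the SA-EFS aggregation step.

    rankings: list of t feature orderings (best first), one per base
    feature selection method. A feature at position j (0-based) in a
    ranking of n features gets the normalized weight (n - j) / n; the
    per-method weights are combined by arithmetic or geometric mean,
    and the top alpha fraction of features is kept.
    """
    n = len(rankings[0])
    weights = np.zeros((len(rankings), n))
    for i, order in enumerate(rankings):
        for j, feat in enumerate(order):
            weights[i, feat] = (n - j) / n
    if strategy == "arithmetic":
        total = weights.mean(axis=0)
    else:  # geometric mean of the per-method weights
        total = weights.prod(axis=0) ** (1.0 / len(rankings))
    ranked = np.argsort(total)[::-1]          # sort by total weight, descending
    return sorted(ranked[: int(alpha * n)].tolist())

# Three hypothetical base rankings over 10 features (standing in for
# chi-square, MIC, and XGBoost); features 0-2 are consistently near the top.
r1 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
r2 = [1, 0, 2, 4, 3, 6, 5, 8, 7, 9]
r3 = [2, 1, 0, 3, 5, 4, 7, 6, 9, 8]
print(sa_efs([r1, r2, r3], alpha=0.3))  # → [0, 1, 2]
```

With α = 0.3 both aggregation strategies keep the three consistently top-ranked features here; the two strategies can diverge when the base rankings disagree more strongly.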

Experimental data sets
The experimental data in this paper all come from the international UCI machine learning repository: the sonar, hcc-survival, and musk data sets. Details are shown in Table 3. Each data set has 2 classes, the number of samples ranges from 208 to 776, and the feature dimension ranges from 49 to 167.

Experimental design
In order to verify the effectiveness of the sort-aggregation-based feature selection method proposed in this paper, the method was implemented in code. The experimental environment is Windows 10, 64-bit, 8 GB RAM, Intel Core i5-7200U (2.70 GHz). Combined experiments were performed using five feature selection methods and three data sets, as shown in Table 4.
First, a data set is selected and features are chosen using the chi-square test, the maximum information coefficient, and XGBoost, and the multiple feature selection results are aggregated using specific strategies (arithmetic-mean aggregation and geometric-mean aggregation). Then a prediction model is constructed with XGBoost; its performance is evaluated by 5-fold cross-validation to obtain the performance index AUC. Next, repeated experiments are conducted on the three data sets to compare the performance of the proposed method with that of the single maximum information coefficient, XGBoost, and chi-square test methods. Finally, the interval between the threshold values of α is halved, and KNN and RF classifiers are added to construct prediction models; by comparing the results before and after the interval is reduced, an appropriate threshold interval for α is sought.
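The evaluation step of this design can be sketched as follows. The data are synthetic and KNN stands in for the paper's classifiers (both are assumptions for illustration); the pattern of scoring a selected feature subset by 5-fold cross-validated AUC is the point:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: feature 0 determines the label, the rest is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

subset = [0, 1]  # hypothetical optimal feature subset produced by SA-EFS

# 5-fold cross-validation with AUC as the scoring metric.
scores = cross_val_score(KNeighborsClassifier(), X[:, subset], y,
                         cv=5, scoring="roc_auc")
print(scores.mean())  # mean AUC across the 5 folds
```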

Classification accuracy results
Since the number of features in the three data sets is moderate, the threshold α is set between 10% and 100%, and the prediction accuracy of the model under the different settings is compared. The experimental results are shown in Table 5.
Tables 5-7 show the classification accuracy obtained by XGBoost on the feature subsets produced by the different feature selection methods under different thresholds α. The bold figures mark the highest accuracy among the feature selection methods at the same threshold, and the boxed figures mark the highest accuracy among all methods and thresholds on that data set. The experimental results show that the feature subsets obtained by the SA-EFS ensemble feature selection method have better prediction performance than those of the single methods. Of the two aggregation strategies used, arithmetic-mean aggregation shows the best performance on the musk, sonar, and hcc-survival data sets, while geometric-mean aggregation is not very effective on the three binary data sets; this suggests that different aggregation methods should be chosen for different data sets. A threshold α of about 0.3 achieves good classification accuracy. When α lies in the range [0.4, 1], the classification accuracy does not increase but decreases. The reason is that, because of correlations between features, too many features cause data redundancy and increase noise, lowering classification performance. How to reduce the redundancy of feature subsets is a direction for future work.

Threshold alpha impact analysis
The threshold determines the percentage of features selected, which in turn affects the final classification result. To obtain more objective and accurate results, the impact of the threshold α is analyzed: within the original interval [0.1, 1], the step of α is reduced to 0.05 to examine whether the value of α strongly affects the accuracy, and the random forest and KNN classifiers are added for comparison experiments.
The experimental results are shown in Figure 3, which gives the classification accuracy obtained using XGBoost, random forest, and KNN as classifiers on the feature subsets selected by the different feature selection methods at the smaller step size. The thresholds giving the highest classification accuracy on XGBoost for the sonar, hcc-survival, and musk data sets are 0.3, 0.3, and 0.1 (0.2, 0.3), respectively; on random forest they are 0.25, 1, and 0.75; and on KNN they are 0.1, 0.65, and 0.7. This shows that there are certain differences between data sets and that different classifiers respond differently to the threshold α. The highest classification accuracies of random forest and KNN are lower than that of XGBoost but still comparatively high, and with arithmetic-mean ensemble feature selection random forest also achieves better classification performance than the single feature selection methods, which means that the mean-aggregation ensemble feature selection method adapts well to different classifiers. A suitable threshold α must be found for each particular data set, and using a smaller threshold step does not cause large changes in the accuracy curve, which indicates that a threshold interval of 0.1 is a good choice.

Conclusion
This article uses the internationally recognized UCI machine learning data sets. After data preprocessing, five feature selection methods and three data sets are combined to compare the performance of the proposed method with that of the other methods. The results show that the sort-aggregation-based feature selection method selects better features, and models built on these features achieve higher AUC values. The main work and research results of this paper are summarized as follows: (1) The three high-dimensional data sets used in this paper are real data provided by the international UCI machine learning platform, with large data volumes and many feature variables. To improve the accuracy and efficiency of the model, a large part of the time was spent on data preprocessing.
(2) This paper introduces the idea of ensemble learning into feature selection and constructs an ensemble feature selection model. For the high-dimensional two-class data sets, the three feature selection methods, maximum information coefficient, XGBoost, and chi-square test, are combined, and the effect of the geometric-mean and arithmetic-mean aggregation strategies on the model is analyzed. Compared with single feature selection methods, arithmetic-mean ensemble feature selection can effectively improve classification accuracy. (3) To verify the effectiveness of the proposed sort-aggregation-based feature selection method, XGBoost is first used to construct the prediction model. Then the performance of the prediction model is evaluated by 5-fold cross-validation to obtain the AUC performance index, and the proposed method is compared with the three single methods. Finally, the interval between the threshold values of α is halved, and prediction models are also constructed with the KNN and random forest classifiers.
The results before and after the interval reduction are compared to find an appropriate threshold interval. (4) The model is evaluated with a comprehensive evaluation index. The sort-aggregation-based feature selection method constructed in this paper achieves its best results on all three data sets with a threshold interval of 0.1, with AUC values of 0.873, 0.840, and 0.859, respectively, all at a high level. This shows that the proposed sort-aggregation-based feature selection method has higher accuracy and greater practical value.
Owing to time constraints, the experiments were conducted on only three data sets, so they have certain limitations; the method needs to be applied to more high-dimensional data sets to further verify the validity of the model. This study also found that only a few features bring useful information to the classification model: too many features make the feature subset redundant and reduce the prediction accuracy of the classification model. Therefore, considering the correlations between different features and reducing the redundancy of feature subsets is also a problem that needs further study.