A gene selection approach based on the Fisher linear discriminant and the neighborhood rough set

ABSTRACT In recent years, tumor classification based on gene expression profiles has drawn great attention, and related research results have been widely applied to the clinical diagnosis of major genetic diseases. These studies are of tremendous importance for accurate cancer diagnosis and subtype recognition. However, microarray gene expression data suffer from small sample sizes, high dimensionality, heavy noise and data redundancy. To further improve the classification performance on microarray data, a gene selection approach based on the Fisher linear discriminant (FLD) and the neighborhood rough set (NRS) is proposed. First, the FLD method is employed for a preliminary reduction of the gene data, yielding features with strong classification ability that form a candidate gene subset. Then, neighborhood precision and neighborhood roughness are defined in a neighborhood decision system, calculation approaches for neighborhood dependency and attribute significance are given, and a reduction model of neighborhood decision systems is presented. On this basis, a gene selection algorithm based on FLD and NRS is proposed. Finally, four public gene data sets are used in simulation experiments. Experimental results under the SVM classifier demonstrate that the proposed algorithm is effective: it selects a smaller, more discriminative gene subset and obtains better classification performance.


Introduction
With the development of gene expression profiling, the analysis and modelling of gene expression profiles has become an important topic in the field of bioinformatics research [1][2][3]. However, the high dimensionality of tumor gene expression data, often in the thousands or even tens of thousands, increases the learning cost and deteriorates learning performance. This is widely known as the ''Curse of Dimensionality'', which costs time and reduces the effectiveness of classification when a classifier is used to predict new samples [4,5]. Thus, dimensionality reduction has become a research hotspot in different fields as an important step in pattern recognition, machine learning and data mining [6][7][8][9][10][11][12][13][14].
In general, dimensionality reduction algorithms can be categorized as feature extraction and feature selection. Feature extraction constructs a new low-dimensional space out of the original high-dimensional data through projection or transformation, while the aim of feature selection is to reduce the dimensionality of microarray data [15,16] and to enhance classification accuracy [17,18]. The existing feature selection methods can be broadly categorized into the following three classes: filter, wrapper, and hybrid [19]. A good feature selection algorithm should be reasonable and efficient; the algorithm should be able to find a typical genome containing fewer genes [20].
Many scholars have conducted research on gene selection and have generated many results. FLD is a classical technique in pattern recognition; Robert Fisher first developed FLD in 1936 for taxonomic classification [21]. FLD can be used to select features that carry class-discriminative information, eliminate redundant attributes, and achieve dimensionality reduction of gene data. Since Pawlak proposed rough set theory in the early 1980s, it has been widely used in various fields [22]. However, classical rough set theory is only applicable to discrete-valued information systems and is not suitable for real-valued data sets. To overcome this weakness, Dai and Xu presented a gene selection method based on fuzzy rough sets and a fuzzy gain ratio [23]. Hu et al. proposed a neighborhood rough set model with a δ-neighborhood parameter to address both discrete and continuous data sets, which can retain rich information for classifying the data sets [24].
To further improve the classification performance on microarray data, effectively remove redundant genes, and reduce the computational time complexity of gene selection, the FLD method is employed to conduct a preliminary dimensionality reduction of the microarray gene data; FLD effectively removes genes that do not contribute to classification. The neighborhood rough set can process continuous data sets and avoid the information loss caused by discretization. Then, a new neighborhood dependency and its attribute significance are given, and an attribute reduction method for neighborhood decision systems is presented; on this basis, a gene selection approach based on FLD and NRS is proposed. A number of simulation experiments were conducted on public gene data sets, and the best parameters were determined from the experimental results. High classification accuracy can thus be obtained using the selected gene subset under the support vector machine (SVM) classifier [25].
The remainder of this paper is structured as follows. Section 2 introduces the related concepts of FLD. An effective and efficient feature selection method based on FLD and NRS is given in Section 3. To evaluate the performance of the proposed algorithm, it is compared with five related algorithms on four public gene expression data sets; the experimental results are described in Section 4. Finally, the conclusion is drawn in Section 5.

Fisher linear discriminant model
The Fisher linear discriminant is a classical algorithm in the field of pattern recognition and artificial intelligence [21]. The basic idea of the FLD model is to project the samples onto a line through a linear transformation so that the projected samples are best separated; that is, the dispersion between the transformed sample classes reaches its highest level while the dispersion within each class reaches its minimum, which increases the distinction among the categories. Therefore, FLD can be used to select features that carry class-discriminative information, eliminate redundant attributes, and achieve dimensionality reduction of gene data. The method is an effective, supervised dimensionality reduction technique. The related concepts of FLD are described as follows.
Let c be the number of classes of the sample matrix X ∈ R^{d×n}, where n_i is the number of samples belonging to the i-th class ω_i, and Σ_{i=1}^{c} n_i = n. The centre of the i-th class is m_i = (1/n_i) Σ_{x∈ω_i} x, and the centre of all samples is m = (1/n) Σ_{j=1}^{n} x_j, where x_j is the j-th sample. The between-class scatter matrix S_B and the within-class scatter matrix S_W can be expressed, respectively, as

S_B = Σ_{i=1}^{c} n_i (m_i − m)(m_i − m)^T, (1.1)
S_W = Σ_{i=1}^{c} Σ_{x∈ω_i} (x − m_i)(x − m_i)^T. (1.2)

On the basis of Formulas (1.1) and (1.2), the between-class scatter J_B and the within-class scatter J_W of the samples after projection by a matrix W are expressed, respectively, as

J_B = tr(W^T S_B W), (1.3)
J_W = tr(W^T S_W W). (1.4)

The objective function established by the Fisher discriminant criterion is

max_W J_B / J_W. (1.5)

If the k-th column w_k of W is considered, the objective function can be transformed into

max_{w_k} w_k^T S_B w_k subject to w_k^T S_W w_k = 1. (1.6)

A Lagrangian function is established as

L(w_k) = w_k^T S_B w_k − λ (w_k^T S_W w_k − 1). (1.7)

Taking the derivative with respect to w_k and setting it equal to 0 yields

S_W^{-1} S_B w_k = λ w_k. (1.8)

To maximize the value of J_B / J_W, the projection matrix W can therefore be constructed by simply taking the eigenvectors of S_W^{-1} S_B corresponding to the k largest eigenvalues.
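The derivation above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the toy data, the helper name fld_projection, and the use of a pseudo-inverse to guard against a singular S_W are assumptions.

```python
import numpy as np

def fld_projection(X, y, k):
    """Project the (d, n) sample matrix X (columns are samples, labels y) onto the
    k leading eigenvectors of S_W^{-1} S_B, following Formulas (1.1)-(1.8)."""
    d, n = X.shape
    m = X.mean(axis=1, keepdims=True)            # centre of all samples
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)      # centre of class c
        S_B += Xc.shape[1] * (mc - m) @ (mc - m).T
        S_W += (Xc - mc) @ (Xc - mc).T
    # pinv guards against a singular S_W (common when d is large and n is small)
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(-vals.real)               # eigenvalues in descending order
    W = vecs[:, order[:k]].real                  # top-k eigenvectors form W
    return W, W.T @ X                            # projection matrix and X' = W^T X

# toy example: 4 features, 6 samples, 2 classes
X = np.array([[1.0, 2, 1, 7, 8, 9],
              [0, 1, 0, 5, 6, 5],
              [2, 2, 3, 2, 3, 2],
              [1, 0, 1, 1, 0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
W, Xp = fld_projection(X, y, k=1)
print(W.shape, Xp.shape)  # (4, 1) (1, 6)
```

Since S_B has rank at most c − 1, at most c − 1 useful projection directions exist; for a two-class gene data set this yields a single discriminant direction.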

Fisher linear discriminant and neighborhood rough set based gene selection method
When classical rough sets are used to solve continuous data problems, the data set must be discretized; however, this processing changes the original properties of the data, and some useful information is lost [26]. The neighborhood rough set was proposed to solve the problem that the classical rough set cannot handle numerical attributes [27,28]. However, when the classical neighborhood rough set model is applied directly to high-dimensional gene data, its reduction effect is limited. To resolve this issue, this paper proposes a feature selection method based on FLD and NRS, which is applied to gene selection on cancer data sets.
Let U be a set of samples in an N-dimensional real space, and let D: R^N × R^N → R. D is called a metric on R^N, and (U, D) is called a metric space, when D meets the following three conditions for any x_1, x_2, x_3:

(1) D(x_1, x_2) ≥ 0, and D(x_1, x_2) = 0 if and only if x_1 = x_2;
(2) D(x_1, x_2) = D(x_2, x_1);
(3) D(x_1, x_3) ≤ D(x_1, x_2) + D(x_2, x_3),

where D(x_1, x_2) is a distance function between the two elements x_1 and x_2. Commonly used distance functions include the Manhattan distance, the Euclidean distance, and the p-norm distance. Since the Euclidean distance can reflect the basic situation of unknown data [29], it is used in this paper:

D(x_1, x_2) = sqrt( Σ_{k=1}^{N} (f(x_1, a_k) − f(x_2, a_k))^2 ), (2.1)

where f(x, a_k) is the value of sample x on attribute a_k.

Let U = {x_1, x_2, …, x_m} be a nonempty finite set on a given real space, and let N be a neighborhood relation on U; the pair NA = (U, N) is called a neighborhood approximation space. The δ-neighborhood of a sample x is δ(x) = {y ∈ U | D(x, y) ≤ δ}. For any X ⊆ U, the lower and upper approximations of X in the neighborhood approximation space NA = (U, N) can be defined, respectively, as

N̲X = {x ∈ U | δ(x) ⊆ X}, (2.2)
N̄X = {x ∈ U | δ(x) ∩ X ≠ ∅}. (2.3)

The approximate boundary region of X is defined as

BN(X) = N̄X − N̲X. (2.4)

Let NDS = (U, A ∪ D) be a neighborhood decision system, where A is a conditional attribute set, D is a decision attribute, and U/D = {X_1, X_2, X_3, …, X_n}. For any conditional attribute subset B ⊆ A, the lower and upper approximations of the decision attribute D with respect to B are expressed, respectively, as

N̲_B D = ∪_{i=1}^{n} {x ∈ U | δ_B(x) ⊆ X_i}, (2.5)
N̄_B D = ∪_{i=1}^{n} {x ∈ U | δ_B(x) ∩ X_i ≠ ∅}, (2.6)

where δ_B(x) is the neighborhood of x generated by the attribute subset B. The lower approximation is also called the positive region,

Pos_B(D) = N̲_B D, (2.7)

and it follows that the boundary region of the decision system can be expressed as

BN(D) = N̄_B D − N̲_B D. (2.8)

The existence of the boundary region causes the uncertainty of the set: the larger the boundary region, the greater the uncertainty. This paper studies the boundary region of the neighborhood decision system and investigates various uncertainty measures.
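Under these definitions, the δ-neighborhood and the lower and upper approximations of the decision classes can be sketched as follows. This is an illustrative reconstruction rather than the paper's code; the radius value, the toy data, and the function names are assumptions.

```python
import numpy as np

def neighborhood(X, i, delta):
    """Indices of the samples within Euclidean distance delta of sample i (rows of X)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    return set(np.flatnonzero(dist <= delta))

def lower_upper(X, y, delta):
    """Lower and upper approximations of the decision classes U/D."""
    lower, upper = set(), set()
    for i in range(len(X)):
        labels = {y[j] for j in neighborhood(X, i, delta)}
        if len(labels) == 1:          # label-pure neighborhood: x is in the positive region
            lower.add(i)
        upper.add(i)                  # every sample intersects at least its own class
    return lower, upper

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.5, 0.5]])
y = np.array([0, 0, 1, 1, 0])
low, up = lower_upper(X, y, delta=0.75)
print(sorted(low))  # → [0, 1, 3]: the neighborhoods of samples 2 and 4 mix both labels
```

Because the decision classes partition U and each sample lies in its own neighborhood, the upper approximation is the whole universe here; the interesting quantity is the lower approximation, whose size drives the dependency measure below.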
The roughness measure, a quantitative index for processing uncertain information by using the rough set theory, is the basis of resource management, system optimization, and many other decision-making problems [30].
Let NDS = (U, A ∪ D) be a neighborhood decision system with U/D = {X_1, X_2, X_3, …, X_n}. For any conditional attribute subset B ⊆ A, the neighborhood precision of U/D with respect to B is described by

α_B(D) = |N̲_B D| / |N̄_B D|. (2.9)

The neighborhood roughness of U/D with respect to B is expressed as

ρ_B(D) = 1 − α_B(D). (2.10)

Let NDS = (U, A ∪ D) be a neighborhood decision system and B ⊆ A any conditional attribute subset. The dependency of the decision attribute D with respect to B is defined as

K(B, D) = |Pos_B(D)| / |U|. (2.11)

The inner and outer significance of an attribute a are then measured, respectively, by

SIG_inner(a, B, D) = K(B, D) − K(B − {a}, D), (2.12)
SIG_outer(a, B, D) = K(B ∪ {a}, D) − K(B, D). (2.13)

On this basis, the gene selection algorithm based on FLD and NRS is described as follows.

Step 1: Input the gene expression data matrix X ∈ R^{d×n} together with the class labels of the samples.
Step 2: Compute the between-class scatter matrix S_B and the within-class scatter matrix S_W with Formulas (1.1) and (1.2).
Step 3: Decompose the eigenvalues of S_W^{-1} S_B obtained from Formula (1.8) and sort the eigenvalues in descending order.
Step 4: Take the eigenvectors corresponding to the first k eigenvalues to form the projection matrix W.
Step 5: Calculate X′ = W^T X, X′ ∈ R^{d′×n}, and obtain the conditional attribute subset C after dimensionality reduction, that is, NDS = (U, C ∪ D), where d′ is the number of attributes of C and n is the number of samples.
Step 6: Initialize the reduction set red = ∅.
Step 7: For any attribute a ∈ B with a ∉ red, calculate SIG_inner(a, B, D) with Formula (2.12); if SIG_inner(a, B, D) > 0, a is an indispensable attribute, and let red = red ∪ {a}.
Step 8: For any attribute a_k ∈ C − red, calculate SIG_outer(a_k, red, D) with Formula (2.13); select the most important attribute a_k according to this ranking and add it to the reduction set: red = red ∪ {a_k}.
Step 9: Calculate the dependencies K(red, D) and K(C, D) with Formula (2.11).
Step 10: If K(red, D) ≠ K(C, D), update the conditional attribute set C = C − {a_k} and return to Step 8.
Step 11: Output the reduction set red.
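The reduction loop of the algorithm (Steps 6–11) can be sketched as a greedy forward search driven by the dependency function K(B, D). The code below is a simplified illustration under assumed names (dependency, greedy_reduct) and an assumed stopping tolerance, not the authors' exact procedure.

```python
import numpy as np

def dependency(X, y, attrs, delta):
    """K(B, D): fraction of samples whose delta-neighborhood, measured on the
    attribute subset attrs, contains a single decision label (the positive region)."""
    if not attrs:
        return 0.0
    Xb = X[:, attrs]
    pure = 0
    for i in range(len(Xb)):
        nb = np.flatnonzero(np.linalg.norm(Xb - Xb[i], axis=1) <= delta)
        if len(set(y[nb])) == 1:
            pure += 1
    return pure / len(Xb)

def greedy_reduct(X, y, delta=0.3, eps=1e-5):
    """Forward selection: repeatedly add the attribute with the greatest outer
    significance SIG_outer(a, red, D) = K(red + {a}, D) - K(red, D)."""
    red, k_red = [], 0.0
    while True:
        rest = [a for a in range(X.shape[1]) if a not in red]
        if not rest:
            break
        best_gain, best_a = max((dependency(X, y, red + [a], delta) - k_red, a)
                                for a in rest)
        if best_gain <= eps:          # no attribute raises the dependency: stop
            break
        red.append(best_a)
        k_red += best_gain
    return red

X = np.array([[0.0, 5.0], [0.1, 4.9], [1.0, 5.1], [1.1, 5.0]])
y = np.array([0, 0, 1, 1])
print(greedy_reduct(X, y))  # → [0]: attribute 0 alone separates the two classes
```

In this toy run, attribute 1 never enters the reduct because adding it cannot raise the dependency above the value already achieved by attribute 0, which mirrors how redundant genes are excluded in Steps 8–10.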
For a group of gene expression data, assume that the number of samples is K and the number of attributes is T. After dimensionality reduction with the FLD algorithm, M candidate genes are obtained. To select a gene, it is necessary to add K

Experimental results and analysis
To verify the effectiveness of the proposed FLD-NRS algorithm, simulation experiments are performed on four public gene expression profile data sets, namely, colon, leukaemia, lung, and prostate cancer data, downloaded from http://bioinformatics.rutgers.ed/Static/Supplemens/CompCancer/datasets. The specific description of the data sets is shown in Table 1. It is noted that the values of some gene columns in the lung and prostate data sets are all zero. Thus, the 121 columns of noise gene data in the lung cancer set and the 394 columns of noise gene data in the prostate set are eliminated. After this step, the gene number of the lung cancer data set is 12412, and that of the prostate cancer data set is 12206.
In Table 1, each of the four data sets has two categories; that is, they are binary classification problems. Taking the colon cancer data set, with its high dimension and small sample size, as an example: there are 2000 conditional attributes and 62 samples, of which 40 are positive and 22 are negative. Since gene data are externally represented as a numerical matrix, the described model treats the gene data set as a high-dimensional data matrix before reducing its dimensionality. To ensure that the gene data do not lose their identities, each gene is labeled; that is, the genes and their markers in the data set correspond one-to-one.
In this paper, FLD is employed for preliminary dimensionality reduction, and the neighborhood rough set algorithm is used to further reduce the attributes, which removes the redundant data of the original data set and makes the effect of dimensionality reduction obvious. The neighborhood radius parameter λ is set for each data set, and the lower limit of attribute significance is 0.00001. The FLD-NRS algorithm is used to reduce the attributes of the four data sets in Table 1, respectively. The experimental results of the selected gene subsets are shown in Table 2.
To verify the classification performance of the selected gene subsets, four classifiers are employed in this experiment on each data set. The results are shown in Figure 1.
According to Figure 1, SVM has the best classification performance on the four data sets when compared with the other three classifiers. Then, to verify the validity of the proposed algorithm for selecting a gene subset with strong classification ability, the classification accuracy of the gene subset after reduction is evaluated on SVM. The FLD-NRS algorithm is compared with three related algorithms on the four gene data sets, where the original data processing (ODP) algorithm classifies the original data set directly, the Lasso [31,32] algorithm is a feature selection method based on coefficient compression estimation, and the NRS [24] algorithm is a feature selection method using the neighborhood rough set theory. The experimental results are illustrated in Table 3, where m denotes the gene number after gene selection and Acc denotes the optimal classification accuracy. Meanwhile, the time complexities of these algorithms are given in Table 3. It can be seen from Table 3 that, although the classification accuracy on the leukemia data set is 94.4% with the ODP algorithm, ODP classifies the original data directly, so the size of the selected gene subset is very large. The NRS algorithm can effectively remove irrelevant genes to obtain a smaller gene subset; however, some genes with strong classification ability are also removed, which lowers the classification accuracy of the selected gene subset. For example, the classification accuracy on leukemia with NRS is reduced to 64.5%. In the process of gene selection, the scale of the selected gene subset and its classification accuracy are two important aspects. The FLD-NRS algorithm presented in this paper can select a smaller gene subset, and the classification accuracy is clearly improved. For the colon data set, our algorithm also selects fewer genes with higher classification accuracy than the other two algorithms.
For the leukemia and lung data sets, although our accuracy is lower than those of the ODP and Lasso algorithms, the selected gene subset is much smaller than those of the above two algorithms. Meanwhile, for the prostate data set, the classification accuracy with FLD-NRS is higher than those of the ODP and NRS algorithms, and the selected gene number is smaller than those of the ODP and Lasso algorithms. These results prove the effectiveness of the proposed algorithm for gene selection.
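The evaluation protocol used above (classify the selected gene subset with SVM and report accuracy) can be reproduced in outline as follows; the synthetic data, the scikit-learn setup, and the 5-fold cross-validation are assumptions standing in for the paper's exact experiments.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 60, 200                        # small sample, high dimension, like microarray data
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d))
X[y == 1, :5] += 2.0                  # only the first 5 "genes" carry class signal

selected = [0, 1, 2, 3, 4]            # stand-in for the reduct found by FLD-NRS
acc_full = cross_val_score(SVC(), X, y, cv=5).mean()
acc_sel = cross_val_score(SVC(), X[:, selected], y, cv=5).mean()
print(f"all {d} genes: {acc_full:.3f}, selected {len(selected)} genes: {acc_sel:.3f}")
```

On such data, the small informative subset typically matches or exceeds the accuracy of the full feature set while the classifier trains on a fraction of the dimensions, which is the trade-off the m and Acc columns of Tables 3 and 4 quantify.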
To further investigate the performance of the proposed algorithm, the FLD-NRS algorithm is compared with two random forest algorithms, where RF represents the classical random forest algorithm [33] and SNRRF [34] represents an improved random forest algorithm. The time complexity of the random forest algorithms can be approximated as O(kTK(log K)^2), where k is the number of random classifiers in a random forest. The experimental results, together with the time complexities of these algorithms, are shown in Table 4.
According to Table 4, the classification accuracy on the colon data set with the FLD-NRS algorithm is 88%, which is higher than those of the RF and SNRRF algorithms. For the lung data set, the accuracy of the proposed algorithm is 88.9%, which is basically equivalent to those of the two random forest algorithms, but the selected gene number is very small. These results demonstrate the validity of the proposed algorithm. However, for the leukemia and prostate data sets, the classification accuracy of this algorithm is slightly lower than those of the two random forest algorithms. These results suggest that, when FLD is used to filter irrelevant genes, some genes with a large influence on classification may be mistakenly filtered out, which reduces the classification accuracy.
From the time complexity analyses presented in Tables 3 and 4, it is obvious that the Lasso algorithm costs significantly more time than the other four algorithms, although the classification accuracy of its selected gene subset is high. The gene number of the original data set is usually much larger than that of the selected gene subset, so the time complexity of the proposed algorithm is clearly lower than those of the other five algorithms.
The above experimental results show that the FLD-NRS algorithm can effectively handle the high dimensionality and high redundancy of gene expression profile data. The selected gene subset is smaller, and the dimensionality reduction effect is obvious. Hence, when the three indicators of selected gene number, classification accuracy, and computational time complexity are considered together, the FLD-NRS algorithm is superior to the other algorithms discussed in this paper. Therefore, the FLD-NRS algorithm accomplishes dimensionality reduction well, and the selected gene subset has strong classification ability.

Conclusion
Selecting genes that carry important classification information from tens of thousands of gene expression profiles is an important problem in the field of bioinformatics. In this paper, a gene selection method based on FLD and NRS is proposed in view of the poor stability, large feature subset size, and time-consuming computation of various existing gene selection algorithms. The FLD approach is applied to the preliminary dimensionality reduction of the gene data to obtain a candidate gene subset. A novel feature reduction algorithm in neighborhood decision systems is proposed to optimize the features after dimensionality reduction; a gene subset with strong classification ability is then selected. The experimental results show that the FLD-NRS algorithm can select a gene subset of smaller scale and stronger classification ability. The proposed algorithm is of great practical significance for the future study of clinical cancer diagnosis.

Disclosure of potential conflicts of interest
No potential conflicts of interest were disclosed.