Feature Selection Empowered by Self-Inertia Weight Adaptive Particle Swarm Optimization for Text Classification

ABSTRACT Text classification (TC) is crucial for organizing a vast number of documents. The computational complexity of the TC process is usually high because of the large dimensionality of the feature space. Feature Selection (FS) procedures extract the helpful information from the feature space and thereby reduce its dimensionality. An FS method that reduces the dimensionality of the feature space without compromising categorization accuracy is therefore desirable. This paper proposes a Self-Inertia Weight Adaptive Particle Swarm Optimization (SIW-APSO) based FS methodology to enhance the performance of text classification systems. SIW-APSO converges quickly because of its high search competency and its ability to find feature sub-sets efficiently. For text classification, the K-nearest neighbors algorithm is used. The experimental analysis shows that the proposed method outperforms the existing state-of-the-art algorithms on the Reuters-21578 data set, achieving 98.60% precision, 96.56% recall, and 97.57% F1 score.


Introduction
In today's world of big data and large collections of digital documents, Text Classification (TC) has gained tremendous importance, especially for companies seeking to improve their workflow or even profits. TC is a challenging task with several applications, such as product prediction, movie recommendation, text mining, and sentiment analysis. It is one of the popular domains of Natural Language Processing (NLP) and allows a program to assign free-text documents to pre-defined classes. The classes can be based on topic, genre, or sentiment in the text. TC is of practical value given the vast amount of online text in the form of digital libraries, web-sites and e-mails (Kowsari et al. 2019). The performance of the classifier depends mainly on the feature selection process: if the features are not selected properly, the prediction accuracy suffers.
The high dimensionality of the feature space is the main challenge for TC. The feature space is characterized by many unique terms and a huge number of terms actually used. Thus, reducing the size of the feature space without reducing categorization accuracy is desirable (Ikonomakis, Kotsiantis, and Tampakas 2005).
Feature Selection (FS) is widely used in multiple fields to filter out irrelevant and redundant information from the original data (Seal et al. 2015). It reduces the dimensionality of the data set by selecting the appropriate features without compromising the prediction accuracy. If the dimensionality of the data is large, the computational complexity of classifiers becomes higher, which may have an adverse effect on their performance (Zahran and Kanaan 2009). FS algorithms are used in many fields, e.g., text categorization, data mining, pattern recognition, digital imaging, and signal processing (Zhao et al. 2010).
Finding a sub-set of optimal features is a difficult task, as no single criterion is defined for it (Ge et al. 2016). Generally, the FS procedure consists of sub-set generation, a termination criterion, sub-set evaluation, and result validation. Sub-set generation uses a search technique that selects a feature sub-set for evaluation according to a specific search method (Shang et al. 2007). The search method may be a forward combination method, a backward combination method, a forward selection method, or a backward elimination method. The selection and evaluation of feature sub-sets continue until the given termination condition is satisfied. After the best feature sub-set has been selected, it must be validated on another data set (Miao and Niu 2016).
Approaches to selecting a feature sub-set can be divided into wrapper, filter, and embedded approaches. A filter separates FS from the learning algorithm and chooses a sub-set that does not depend on any particular learning algorithm (Lee et al. 2019). In the wrapper approach, an evaluation method is used to select the feature sub-set, and it is based on the exact learning algorithm to be used in the next step. During the evaluation process, the efficiency and sustainability of the sub-sets are checked to find the better one, which includes comparing each sub-set with the prior best one. A stopping condition is checked at each iteration to decide whether the FS should continue or stop. The wrapper approach is considered a better solution generator because it is complex and can break into many features. If the FS and the learning algorithm are interleaved, the FS procedure falls in the domain of the embedded approach (Wu et al. 2013).
Particle Swarm Optimization (PSO) is a technique with a global search ability that can select the optimal feature sub-set. A known problem with this algorithm is that it loses its diversity easily, which causes premature convergence (Karol and Mangat 2013). To overcome this problem, a new algorithm called Self-Inertia Weight Adaptive Particle Swarm Optimization (SIW-APSO) was introduced. It stabilizes the exploration capability of PSO through an improved inertia weight. In SIW-APSO, every particle keeps improving its position and velocity in a loop and is updated through an evolutionary process. With these features, SIW-APSO has been found to perform better than the previous versions of PSO (Nagra, Han, and Ling 2019).
In this work, an SIW-APSO based feature selection technique is proposed that does not require a priori knowledge. SIW-APSO mitigates the premature convergence of PSO. It has a fast rate of convergence and the ability to find the feature sub-set efficiently. For text classification, the K Nearest Neighbors (KNN) algorithm is used because of its simplicity, quick response, and ease of use for multi-class problems. The experimental analysis shows that the proposed method outperforms the existing state-of-the-art algorithms on the Reuters-21578 data set, achieving 98.60% precision, 96.56% recall, and 97.57% F1 score.
The rest of the paper is organized as follows: Section 2 reviews the literature. Section 3 gives a brief overview of the SIW-APSO algorithm and describes the proposed technique for feature selection. The experimental analysis is presented in Section 4. Finally, Section 5 concludes the paper.

Literature Review
Several feature selection techniques have been proposed in the literature for TC systems. Yang and Pedersen (1997) conducted a study comparing five FS criteria for TC: mutual information, information gain, document frequency, term strength, and the χ² test (CHI). They observed that χ² and information gain are more appropriate for optimizing classification results.
Aghbari and Saeed (2021) leverage association rules in FS to classify text. A hybrid approach is proposed by Lee et al. (2019) to select multi-label text features. It improves the learning performance through competition among selective operators. Unlike traditional approaches, the algorithm applies each operator selectively and refines the sub-set of features according to the operator's effectiveness and relative efficiency. Results on multiple text datasets reveal that it outperforms the traditional techniques. Amazal, Ramdani, and Kissi (2021) present a parallel global TF-IDF feature selection method that uses Hadoop for big-data text classification. Asgarnezhad, Monadjemi, and Soltanaghaei (2021) apply MOGW optimization to feature selection in text classification.
Chun-Feng, Kui, and Pei-Ping (2014) presented an algorithm that enhances Artificial Bee Colony (ABC), which has poor exploitation, by combining it with PSO in order to select feature sub-sets in text categorization effectively. For better performance, this algorithm works in multiple steps. First, a good point sub-set is chosen instead of a random selection, which improves the convergence speed. Second, the exploitation ability is enhanced by using PSO to search for a new subset. Finally, a disordered search process is applied to the solution of the most recent iteration, which improves the search ability. The proposed algorithm is compared with several other algorithms, and the results show that it performs better.
Khew and Tong (2008) examined three techniques, i.e., centroid, LDA/GSVD, and orthogonal-centroid. These techniques are designed for dimensionality reduction in TC by minimizing the dimensions of clustered data. Similarly, a thorough review of FS for TC is presented by Forman (2007), who also introduces a case study on the selection of text features.
Zhang et al. introduced a PSO-based multi-objective and multi-label feature selection algorithm. The algorithm specifies the probability-based encoding, which transforms the nature of the feature selection problem into a continuous feature selection issue suitable for PSO. The specified algorithm is evaluated and contrasted with other methods, like RF-BR and MI-PPT. The results reveal that the algorithm can search for the best subset (Zhang et al. 2017).
The original PSO is expected to give an optimal feature sub-set due to its simplicity in searching in a one-dimensional search space. Multi-swarm PSO (MSPSO) is very effective because it outperforms the genetic algorithm (GA) and other rival algorithms, e.g., standard PSO and grid search. Several variants of PSO using filter and wrapper techniques are presented in the literature, all aiming to make feature selection more efficient. These investigations indicate that PSO and its variants are relatively more effective at selecting a sub-set containing optimal features (Vashishtha 2016).
Zahran and Kanaan (2009) use a Radial Basis Function (RBF) network as a text classifier. The proposed technique is compared with document frequency, TF-IDF, and Chi-square statistic algorithms in terms of efficiency. Results on an Arabic dataset demonstrate the superiority of the proposed algorithm. The fitness function, inertia parameter (w), and position-updating strategy are critical performance parameters of PSO. There are three major contributions of this work: First, evolutionary algorithms that use explicit or implicit memory and are applied to dynamic fitness functions are reviewed. Second, a new benchmark is suggested, based on an examination of previous test problems, to fill the gap between simple and complex real-world applications. The proposed benchmark is derived from Branke (Branke 1999) and is not limited to memory-based approaches. It offers a fresh way to discover the advantages of a memory-based system while reducing its side effects.
Eberhart and Kennedy (1995) described the relationships among artificial intelligence, particle swarm optimization, and evolutionary computation. Benchmark testing of both paradigms is discussed, and applications are proposed, including neural network training and robot task learning. Three variants were tested: the first is the "GBEST" model, in which each agent knows the best evaluation of the group. The other two are variations of the "LBEST" model, with neighborhoods of six and two, respectively. The GBEST version performs exceptionally well in terms of quick convergence, while the LBEST version with a neighborhood of two is the most resistant to local minima.
Exhaustive search, which evaluates all sub-sets, is the simplest way to discover the optimal subset of features. However, it is impractical even for a moderate-sized feature set. To avoid this complexity, FS methods normally rely on random search strategies, so the optimality of the final feature sub-set is often reduced. Among the methods suggested for FS, genetic algorithms (GAs), ant colony optimization, and PSO have received the most attention from researchers. These methods try to obtain better solutions by using knowledge collected in previous steps. GAs are optimization methods based on natural selection that apply mechanisms from natural genetics to the search space. In data mining, GAs have been widely used as a tool for FS because of their advantages. PSO was first presented by Kennedy and Eberhart as a model for social learning and influence. In swarm-based methods, particles follow a very basic mechanism, i.e., emulating the success of nearby particles. The optimal regions of the multidimensional search space are found through the emergent collective behavior.
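To make the cost of exhaustive search concrete, the short sketch below counts the non-empty sub-sets of a feature set. The function name is hypothetical and serves only as an illustration; it shows that even 30 features already yield more than a billion candidates.

```python
def count_candidate_subsets(n_features: int) -> int:
    """Number of non-empty feature sub-sets an exhaustive search would evaluate."""
    return 2 ** n_features - 1


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(n, count_candidate_subsets(n))
```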
Aghdam and Heidari presented a modified PSO-based FS technique. In this technique, a document is represented as a set of phrases or words, and each position of the feature vector corresponds to a given term of the document. The evaluation is done using the bag of words model on text features. Abdelhalim et al. proposed a novel hybrid technique that combines PSO with the Nelder-Mead (NM) simplex search algorithm. It was intended to solve nonlinear unconstrained optimization problems. This method improves the positions and velocities of the particles by using the NM scheme inside the PSO method. The technique was evaluated on more than 20 optimization test functions of variable dimensions. A detailed comparison with other techniques shows that the suggested technique is both reliable and competitive.
In a nutshell, most of the existing feature extraction techniques for text classification demand prior knowledge, which requires additional computational resources. Moreover, some of the techniques also suffer from premature convergence. To address these issues, this work presents an SIW-APSO based feature selection approach.

Overview of SIW-APSO
SIW-APSO is a modified version of the PSO technique that mitigates the premature convergence of PSO. In this method, a swarm of particles is initialized at random, with each particle representing a position in the search space, and a fitness value is computed for each particle from a pre-defined fitness function, which guides the search toward the required optimal solution. The particles move through the multi-dimensional solution space, and the velocity and position of each particle are updated with the help of equation 1 and equation 2, respectively.
where $Z_i^d$ is the $d$th position component of particle $i$, $W_i^{t+1}$ is the inertia weight of the $i$th particle at iteration $(t+1)$, $k_1$ and $k_2$ are constants, $r_1$ and $r_2$ are random functions, $p_b^d$ is the $d$th component of the best position of particle $i$, and the remaining term is the $d$th component of the global best position found by the whole swarm. According to equation 3, the iterative behavior of the inertia weight is very helpful for finding an accurate fitness.
where $W_i^{t+1}$ and $f(Z_i^t)$ represent the inertia weight at the $(t+1)$th iteration and the best fitness value at the $t$th iteration, respectively.
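For orientation, the velocity and position updates that these symbol definitions describe follow the standard PSO pattern. A sketch in that standard form is given here, assuming the conventional symbols $V_i^d$ for the velocity and $p_g^d$ for the global best position (both symbol names are assumptions, not taken from the text):

```latex
\begin{align}
V_i^d(t+1) &= W_i^{t+1}\, V_i^d(t)
            + k_1 r_1 \left(p_b^d - Z_i^d(t)\right)
            + k_2 r_2 \left(p_g^d - Z_i^d(t)\right) \\
Z_i^d(t+1) &= Z_i^d(t) + V_i^d(t+1)
\end{align}
```

The first term carries the particle's previous motion scaled by its own inertia weight, while the other two pull it toward its personal best and the swarm's global best.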
In equation (1), the proposed inertia weight reflects the improvement in the best fitness. Equation (4) uses $f(Z_i^t)$ and $f(Z_i^{t-1})$, the fitness values of the swarm at the $t$th and $(t-1)$th iterations, respectively. If $W_i^{t+1} = W_i^t$, the particle has been unable to find a better solution, because $f(Z_i^{t-1}) = f(Z_i^t)$ and the inertia weight does not improve. Otherwise, by equation (4), the change in the inertia weight is positive and the fitness improves. The fitness improves when the best position is close to another particle in the swarm. $W_i^t$ grows to $W_i^{t+1}$ at the $(t-1)$th iteration, which benefits both exploration and exploitation. When $W_i^{t+1} = 0.9$, as described in equation (3), the inertia weight is constant and the algorithm behaves like ordinary PSO.
In SIW-APSO, each particle begins the optimization with $w = 0.9$, the value given in equation (3). The inertia weight is then updated using equation (4), which reflects the global best fitness. For optimization functions, the inertia weight oscillates strongly in the first iterations, which helps the particle sustain its diversity. The oscillations decrease toward the end of the iterations, resulting in fine adjustments to the desired solution. In this way, SIW-APSO enhances the accuracy of PSO.
Our investigation found that at some stages the inertia weight is equal to zero and does not grow for several consecutive iterations, which indicates that the swarm has stalled in a local optimum. This condition may lead to premature convergence because the current position of the particle is lower than a predetermined threshold value. To address this shortfall, a linear function is used to map the possible range of the inertia weight values, where $w_i$ is the initial value of the inertia weight, $w_h$ is its final value, $t$ is the current iteration of the algorithm, $t_h$ is the maximum number of iterations, and the range of the inertia weight is [0,1,2]. According to equation (4), the inertia weight of every particle is changed independently at each iteration, using the improvement in its personal best fitness. The change of fitness acts as a trigger: at any iteration, when a particle's personal best fitness improves, the particle changes its current direction and its inertia weight increases, which favors global exploration; otherwise its inertia weight is set to zero and the particle searches locally. Because the inertia weight of each particle is updated individually, different particles in the swarm can have different inertia weights, so some particles search locally and others globally. Balancing global and local search in this way improves diversity and is very useful for exploration. The algorithmic flow of the SIW-APSO algorithm is given in Figure 1.
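As a rough illustration of the two ingredients discussed in this section, the sketch below pairs a linear inertia map with a single-particle update that takes that particle's own inertia weight as input. It is a minimal sketch under stated assumptions: the linear form w_i + (w_h - w_i) * t / t_h is the common textbook schedule and is assumed here, the function names are hypothetical, and the adaptive rule of equation (4) is deliberately not reproduced.

```python
import random


def linear_inertia(w_i: float, w_h: float, t: int, t_h: int) -> float:
    """Assumed linear map: equals w_i at t = 0 and w_h at t = t_h."""
    return w_i + (w_h - w_i) * t / t_h


def pso_step(z, v, w, p_best, g_best, k1=1.0, k2=1.0, rng=random):
    """One velocity and position update for a single particle.

    z, v      current position and velocity (lists of floats)
    w         this particle's own inertia weight
    p_best    the particle's personal best position
    g_best    the swarm's global best position
    """
    new_z, new_v = [], []
    for d in range(len(z)):
        r1, r2 = rng.random(), rng.random()
        vel = (w * v[d]
               + k1 * r1 * (p_best[d] - z[d])
               + k2 * r2 * (g_best[d] - z[d]))
        new_v.append(vel)
        new_z.append(z[d] + vel)
    return new_z, new_v
```

With w set to zero the previous velocity is discarded, so the particle is driven only by the attraction toward its personal and global best positions, which matches the local-search behavior described above.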

SIW-APSO for Feature Selection
This approach optimizes the problem in a continuous, multidimensional search space. SIW-APSO starts with a group of random particles. The behavior of each particle is adjusted according to its velocity, and the particles tend to move toward better regions of the search space. The SIW-APSO algorithm is described mathematically in equations (1) and (2) above and is designed for searching multi-dimensional continuous search spaces. In this work, it is adapted to the feature selection problem, in which each feature sub-set is considered a point in the feature space. The initial swarm is scattered randomly over the search space. Every particle occupies one location, and the aim of the particles is to move to the best position. Particles change their positions by communicating with each other, and they search for the local best and global best positions. Eventually, they reach a good, possibly optimal, position; meanwhile, their exploration ability lets them perform the FS and find the optimal subset. SIW-APSO must be extended in order to deal with feature selection. The particle's position is represented as a binary bit vector, where a bit value of 1 denotes a selected feature and a bit value of 0 denotes a non-selected feature. Each position is therefore a feature subset, so SIW-APSO can be used effectively to optimize the problem. The algorithmic flow of the proposed feature selection algorithm is shown in Figure 2.
The first step after the text documents are input into the system is preprocessing, which is very important in text classification. In this step, the stop words are removed and stemming is applied to the remaining text in the document, which helps reduce the feature space of the problem.
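The preprocessing step can be sketched as follows. This is only an illustration: the stop-word list is a tiny hypothetical sample, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's, which the text does not name.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "on", "for"}


def crude_stem(token: str) -> str:
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lower-case, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

Both operations shrink the vocabulary, so fewer distinct terms reach the feature selection stage.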

Initializing Particles
In this method, every particle is represented by a binary vector whose length equals the number of features. A vector element is set to 1 when the corresponding feature is selected and to 0 when it is not. In the proposed method, the desired features are a subset of the extracted feature set. A velocity vector is then produced for each particle, with the same length as the particle vector. Each cell of the velocity vector is set to a random value in the range [0, 1].
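The initialization just described can be sketched as below. The function names and the fixed seed are assumptions made for illustration, and Python's `random.random()` draws from the half-open interval [0, 1) rather than the closed range stated in the text.

```python
import random


def init_particle(n_features: int, rng: random.Random):
    """Return (position, velocity) for one particle.

    position: 0/1 list, 1 meaning the feature is selected
    velocity: list of the same length with values in [0, 1)
    """
    position = [rng.randint(0, 1) for _ in range(n_features)]
    velocity = [rng.random() for _ in range(n_features)]
    return position, velocity


def init_swarm(n_particles: int, n_features: int, seed: int = 0):
    """Create a randomly scattered swarm; the seed only makes runs repeatable."""
    rng = random.Random(seed)
    return [init_particle(n_features, rng) for _ in range(n_particles)]
```

For example, `init_swarm(50, 20)` would build a swarm of 50 particles over 20 candidate features.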

Updating Velocity
Each particle has a velocity, represented as a positive number. The velocity is bounded and indicates how many of the particle's features should be changed to be the same as those of the best point, so that the particle moves toward the best position. In this variant of PSO, the velocity is updated according to equation 1.

Updating Position
The position of each particle is updated using its newly computed velocity. If the new velocity is V, then V bits of the particle that differ from those of gbest are changed at random. The particle thus moves toward the global best while still exploring the search area, instead of simply becoming identical to gbest. In SIW-APSO, each particle changes its position according to its velocity, as described in equations 1 and 2.
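One possible reading of this bit-level update is sketched below. It assumes that the velocity has been turned into an integer count V (how this is done is not stated in the text) and that the V chosen bits are set to gbest's values; the function name is hypothetical.

```python
import random


def move_toward_gbest(position, gbest, v, rng=random):
    """Set up to v randomly chosen differing bits to gbest's values.

    Only bits where `position` and `gbest` disagree are candidates, so the
    particle approaches gbest without necessarily becoming identical to it.
    """
    differing = [d for d in range(len(position)) if position[d] != gbest[d]]
    chosen = rng.sample(differing, min(int(v), len(differing)))
    new_position = list(position)  # leave the input untouched
    for d in chosen:
        new_position[d] = gbest[d]
    return new_position
```

When V is smaller than the number of differing bits, some disagreement with gbest remains, which preserves part of the particle's own search direction.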

Fitness Function
The fitness function is given by equation 6. In this equation, $F_i(t)$ is the feature subset found by particle $i$ at iteration $t$, and $|F_i(t)|$ is its length. Fitness is calculated to jointly measure the accuracy of the classifier, $\gamma(F_i(t))$, and the length of the feature subset. $\delta$ and $\psi$ are two parameters that control the relative weight of the classifier performance and the feature subset length, with $\delta \in [0, 1]$ and $\psi = 1 - \delta$. According to this formula, both classifier performance and feature subset length affect FS. The proposed method weights the classifier performance more heavily than the sub-set length, so we set $\delta = 0.8$ and $\psi = 0.2$. Figure 2 illustrates the complete steps of the proposed algorithm for feature selection in text categorization.
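A fitness of this kind is often written as a weighted sum of classifier accuracy and the fraction of features removed. The sketch below assumes that form; the exact expression of equation 6 may differ, and the function and parameter names are hypothetical. Only the weights 0.8 and 0.2 come from the text.

```python
def fitness(accuracy: float, n_selected: int, n_total: int,
            delta: float = 0.8) -> float:
    """Assumed weighted-sum fitness.

    accuracy    classifier performance on the selected subset, in [0, 1]
    n_selected  number of selected features, |F_i(t)|
    n_total     total number of available features (assumed normaliser)
    delta       weight of the accuracy term; psi = 1 - delta weights the size term
    """
    psi = 1.0 - delta
    return delta * accuracy + psi * (n_total - n_selected) / n_total
```

Under this assumed form, a subset reaching 0.9 accuracy with half of the features selected scores 0.8 * 0.9 + 0.2 * 0.5 = 0.82, so smaller subsets are preferred only when accuracy is comparable.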

Experimental Results
In this work, MATLAB is used to implement the proposed algorithm. To evaluate the performance of the proposed feature selection technique, several experiments are conducted on an Intel Core i7 machine with the Windows 10 operating system, a 3.6 GHz CPU, and 8 GB RAM. For the experimental analysis, the ten most frequently used classes of the Reuters-21578 document dataset (Lewis 1997) are considered. Table 1 lists the classes and the number of documents that belong to each class. The dataset is divided into two sets, one for training and the other for testing the classifier. The training and testing sets consist of 5785 and 2299 documents, respectively. The number of document samples taken from each class for training and testing is unbalanced.
To demonstrate the efficacy of the proposed feature selection approach, a KNN classifier is used [24]. A classifier cannot interpret text documents directly. To address this problem, an indexing procedure is applied uniformly to the documents to transform their content into a compact representation. In this representation, a text $T_j$ is generally described by a feature vector of term weights, in which each position corresponds to a specified phrase or word. This representation is known as the bag of words model. In this work, the normalized TF-IDF function given in equation 7 is used to calculate the weights.
where $f_k$ ranges over the set of features that occur at least once in the training documents, and $0 \le P_{kj} \le 1$ denotes the contribution of $f_k$ to the semantics of document $T_j$.
In equation 8, $N$ denotes the number of occurrences of $f_k$ in $T_j$, $T_s$ denotes the training set, and $L$ denotes its length. $N_d$ represents the number of documents in $T_s$ in which $f_k$ occurs.
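A common way to compute weights of this kind is TF-IDF followed by cosine normalization, which keeps every weight between 0 and 1. The sketch below assumes that standard form, namely count times log(L / N_d) divided by the Euclidean norm of the document's weights; the function name and the toy input format (lists of tokens) are assumptions.

```python
import math


def tfidf_weights(docs: list[list[str]]) -> list[dict[str, float]]:
    """Cosine-normalized TF-IDF weights P_kj for tokenized training documents."""
    n_docs = len(docs)                 # L: size of the training set
    doc_freq: dict[str, int] = {}      # N_d: number of documents containing f_k
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    weighted = []
    for doc in docs:
        raw = {}
        for term in set(doc):
            count = doc.count(term)    # N: occurrences of f_k in T_j
            raw[term] = count * math.log(n_docs / doc_freq[term])
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append(
            {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}
        )
    return weighted
```

A term that appears in every training document receives a weight of zero, since it carries no information for telling the classes apart.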
To evaluate the performance of the classifier, the following parameters are used, where $TP_i$ (True Positive) and $FP_i$ (False Positive) denote the numbers of documents correctly and incorrectly classified under class $i$, respectively. The number of documents of class $i$ incorrectly assigned to other classes is represented by $FN_i$ (False Negative). These parameters are computed with the help of a contingency table for each class $C_i$ on the specified dataset.
Table 1. Classes of the Reuters-21578 dataset and the number of training and testing documents in each class.
No.  Class        Training  Testing  Total
1    Grain        72        32       104
2    Ship         122       42       164
3    Wheat        153       51       204
4    Interest     165       74       239
5    Corn         170       53       223
6    Crude        288       126      414
7    Trade        297       99       396
8    Money-fx     313       106      419
9    Acquisition  1484      664      2148
10   Earn         2721      1052     3773
In the case of multiple classes, the micro average and macro average of the above-mentioned parameters can be computed. In the macro average, all classes are treated equally regardless of the number of documents that belong to each class. In the micro average, on the other hand, all documents are weighted equally, which favors the performance on common classes. Equations 12-15 give the macro and micro averages of recall and precision. The global contingency table, shown in Table 2, is obtained by adding the class-specific contingency tables.
In these equations, $\mu$ marks the micro-averaged quantities, as opposed to the macro-averaged ones. The proposed technique is compared with four previous feature selection approaches, which are based on PSO, CHI, Information Gain (IG), and Genetic Algorithm (GA). For the experimental analysis, the population size is set to 50, the maximum number of iterations is set to 100, the values of C1 and C2 are set to 1, and w is taken in the range [0.4, 1.4]. Table 3 shows that the proposed technique outperforms the existing techniques in terms of precision, recall, and F1-score.
To show graphically how the proposed feature selection technique progresses toward the optimal solution, the F1 measure is plotted against the percentage of selected features, which shows how the best particle improves as the number of features increases. Figure 3 and Figure 4 illustrate the macro- and micro-averaged F1 scores against the number of selected features for all feature selection techniques. These graphs show that the proposed technique surpasses the existing techniques once the percentage of selected features exceeds 12%. Table 4 gives the micro and macro F1 score-based comparison of the proposed technique with the existing techniques. It shows that the best performance is achieved by SIW-APSO.
In comparison with the existing techniques, SIW-APSO quickly determines the optimal solution, generally within tens of iterations. Exhaustive search cannot be applied to the Reuters-21578 data set to determine the optimal feature subset because billions of candidate subsets exist. With SIW-APSO, on the other hand, the optimum solution is found at the 100th iteration.
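The evaluation measures used in this section can be sketched as follows, assuming the standard definitions of precision, recall, and F1 from per-class counts of true positives, false positives, and false negatives. The function names are hypothetical, and the macro F1 is taken here as the mean of the per-class F1 scores, which is one of several conventions.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall, and F1 from one contingency table."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(tables):
    """Average the per-class scores, so every class counts equally."""
    scores = [precision_recall_f1(*t) for t in tables]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


def micro_average(tables):
    """Add the per-class tables first, so every document counts equally."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return precision_recall_f1(tp, fp, fn)
```

Each entry of `tables` is a `(tp, fp, fn)` tuple for one class. With unbalanced classes such as those in Table 1, the micro average is dominated by the large classes, while the macro average gives small classes the same influence as large ones.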

Conclusion
This paper proposed an FS technique based on SIW-APSO. The technique first searches the solution space and then uses an evolutionary process to iteratively update the velocity and position of each particle. The length of the selected feature subset and the classifier performance are used as heuristic information. To show the efficacy of the proposed technique, a comparative analysis is performed with four existing techniques based on PSO, CHI, Information Gain (IG), and Genetic Algorithm (GA). The experimental results show that the proposed technique outperforms all of its competitors on the Reuters-21578 data set, achieving 98.60% precision, 96.56% recall, and 97.57% F1 score. In the future, this work can be applied to other classification problems.

Disclosure statement
No potential conflict of interest was reported by the author(s).