Optimal feature selection and invasive weed tunicate swarm algorithm-based hierarchical attention network for text classification

Through social media platforms and the internet, the world is becoming increasingly connected and is producing enormous amounts of data. Texts are collected from social media, newspapers, user reviews of products, company press releases, etc. Classification correctness depends mainly on the kind of words used in the corpus and the features used for classification. Owing to the rapid growth of text data on the Internet, accurately organising and managing text data has become a great challenge. Hence, in this research, an effective Invasive Weed Tunicate Swarm Optimization-based Hierarchical Attention Network (IWTSO-based HAN) is implemented for text categorisation. Here, features are mined from the text, and the optimal features are then acquired to perform the classification. The incorporation of the parametric features of each optimisation enables the proposed method to improve convergence towards global solutions and thereby the categorisation effectiveness. The proposed method obtained better text classification performance in terms of accuracy, True Positive Rate (TPR), True Negative Rate (TNR), precision, and False Negative Rate (FNR), with values of 92.4%, 92.4%, 94.1%, 95.4%, and 0.0758, respectively.


Introduction
Text data is collected from various sources, such as emails, social media, chats, insurance claims, tickets, web data, customer-service questions and answers, and user reviews. Text is a particularly rich source of data, but acquiring insights from it is a time-consuming and complex task because of its unstructured nature (Minaee et al., 2020; Fan et al., 2023). In general, text documents contain high-dimensional, noisy, and irrelevant terms. The presence of noise and irrelevant factors increases computational complexity, introduces noise into the learning procedure, and hence degrades classifier performance. It is therefore necessary to process text documents before classification, in order to reduce dimensionality and eliminate noisy and irrelevant terms (Wang & Hong, 2019). Electronic text processing has become ubiquitous in recent decades, with an ever-growing number of files posing substantial challenges for meeting users' information requirements. One approach is to classify textual data automatically, so that users can easily retrieve, manipulate, and extract information, recognise patterns, and thereby create knowledge. Organising electronic files into various categories has become a growing interest for many organisations and individuals (Koller & Sahami, 1997; Stein et al., 2019). Text classification addresses these issues and draws on a combination of knowledge fields, such as data mining, natural language processing (NLP), information retrieval, and machine learning. It is commonly treated as a supervised machine-learning problem, in which a model is trained on examples to classify unseen pieces of text (Vidyadhari et al., 2019; Stein et al., 2019; Kou et al., 2020). The internet has enabled communication and data sharing, and the text files present on the Internet hold huge
commercial value. Mining the useful data hidden in text benefits greatly from text classification, which aims to assign text files of unidentified classes into a fixed number of classes with certain classifiers (Wang & Hong, 2019). Hence, text classification mechanisms are used to organise text data suitably. Text classification is a supervised learning model that allocates text documents to specified categories (Thirumoorthy & Muneeswaran, 2020). It is otherwise termed text categorisation: the procedure of assigning tags or labels to textual units such as documents, paragraphs, queries, and sentences. Text classification is achieved either by automatic labeling or by manual annotation. Deep learning (Roy et al., 2023) and machine learning methods are commonly used for effective text classification. One machine learning approach, Structured Logistic Regression (SLR), has been used in the context of text classification and also for large-scale classification problems in bioinformatics (Pedersen et al., 2014). With the increasing growth of text data in industrial applications, automatic text classification is becoming more important (Minaee et al., 2020). Accordingly, text classification appears in many applications: complaint feedback for services on social networks; classification of audio music genres and TV programmes for entertainment data selection; identification of specific topics; detection of disease outbreaks and of accidents in the oil industry; analysis of essay answers and classification of materials into reading levels for students in education; e-mail filtering (Drucker et al., 1999; Guzella & Caminhas, 2009; Mohammadzadeh & Gharehchopogh, 2021); classification of pediatric asthma in biomedicine (Wang & Hong, 2019; Pedersen et al., 2014); e-mail classification (Günal et al., 2006; Yu & Zhu, 2009); topic and author gender identification; sentiment classification (ElAmine Chennafi et al., 2022); speech detection (Aldjanabi et al., 2021); and web page classification (Chen & Hsieh, 2006; Anagnostopoulos et al., 2004; Thirumoorthy & Muneeswaran, 2020; Han et al., 2022).
The indivisible unit of text is termed a word, feature, or term. In the text classification domain, unstructured content is specified as a feature vector (Feng et al., 2023). A noise feature does not carry any information about the category, so it is very difficult to draw conclusions from it. For example, a feature or term that is present in every text document is not useful for classification (Thirumoorthy & Muneeswaran, 2020). Accordingly, feature acquisition is a significant module of the text classification process: it aims to find a feature subset that contains a smaller number of relevant features and is sufficient for maintaining or increasing classification effectiveness (Belazzoug et al., 2020). In general, feature selection mechanisms are employed to reduce data dimensionality, lower the classifier's error rate, and decrease computational time (Thirumoorthy & Muneeswaran, 2020; Parlak & Uysal, 2023; Şahin & Kılıç, 2019; Saeed & Al Aghbari, 2022). They select a subset of the full feature set via various analyses (Liu et al., 2020; Zhou et al., 2022; Jin et al., 2023; Coban, 2022). Feature selection approaches are found in many domains, such as document clustering, text categorisation, computer vision, bioinformatics, image processing, and industrial applications (Belazzoug et al., 2020).

Problem statement
Text classification is the method of categorising a set of text files into various classes from a predefined set. It is an essential method for processing and managing a huge number of files in digital form. Commonly, text classification is useful for extracting and summarising information and for text retrieval. Many text classification methods have been implemented in previous studies. The problems identified in those studies are discussed below:
• The computation time was increased owing to the huge dimension of the information.
• The feature space with huge dimensionality contains noisy, redundant, and unrelated features, which affect the correctness of the model.
• Some feature selection methods did not effectively select the important features, so the error rate of the classifier increased.
Hence, in this research, IWTSO-based HAN is developed for the efficient classification of text documents.
In this research, a text classification framework is designed with HAN to classify the sentences of text documents. Here, the input text is passed to the pre-processing phase, and then the text data is subjected to the feature extraction module. The acquired features are passed to the feature selection phase, in which the best features are selected by employing the IWTSO. Then, in the classification module, the HAN is used to classify the text data, with the HAN trained by the IWTSO algorithm. The IWTSO is formed by integrating improved Invasive Weed Optimization (IWO) (Misaghi & Yaghoobi, 2019) and the Tunicate Swarm Algorithm (TSA) (Kaur et al., 2020).
Major contributions of the research:
• IWTSO-based HAN: An effective text classification framework is designed with HAN to classify the sentences of text documents. The IWTSO-based HAN achieves better results through its fitness determination.
• The IWTSO algorithm is formed by integrating the IWO and TSA, and is used for the effective training of the HAN classifier.
The remainder of the manuscript is organised as follows: Section 2 discusses various text classification approaches, Section 3 explains the IWTSO-based HAN, Section 4 discusses the results of the IWTSO-based HAN, and Section 5 concludes the research.

Literature survey
For the effective selection of features, Belazzoug et al. (2020) established an improved sine cosine algorithm (ISCA). This algorithm identified novel search-space regions and generated the best solution; the solutions considered were the better solution location and the search-space location. Efficiency was improved and premature convergence was removed. The feature selection issues were solved, but the method did not apply to some complicated problems. Lim and Kim (2020) developed a quadratic programming-based model to compute the optimal balance between two dependencies for selecting features. It used a similarity measure for eliminating redundant terms, computed using mutual information and a ranking method. Goudjil et al. (2018) developed a support vector machine (SVM) model to solve high-dimensional classification issues. It used active learning for text classification, minimising labeling effort by selecting the samples to be labeled, and used the posterior probability to choose the most informative samples. It increased classification accuracy, but it needed more labeled samples for training the classifier. Ranjan and Prasad (2018) introduced a lion fuzzy neural network (LFNN) for accomplishing text categorisation. It considered a dynamic database for the classification process. Features were mined from the words to reduce the dimension of the search space, but the error rate was not reduced. Liu et al. (2020) developed a relative document term frequency difference (RDTFD) method for selecting features. This method partitions the features of the total text files into small feature sets based on the capability of the features to discriminate negative and positive samples. It reduced feature redundancy and improved categorisation performance, but it did not increase the running speed. Borhani (2020) developed an artificial neural network for text classification: a fast text classifier with an updating formula to tune the neural network. Discriminative features were used for text mining; the model discovered knowledge and achieved better performance in mining information, but resulted in high computational complexity. Gasmi (2022) implemented a Bidirectional Encoder Representations from Transformers (BERT) model based on an optimal deep learning scheme, in which deep-learning parameter selection was done by Particle Swarm Optimization (PSO) and the prediction of the matching response by the k-Nearest Neighbors algorithm (KNN). The accuracy of this model was high, but it had complexity issues. Maragheh et al. (2022) implemented the Spotted Hyena Optimizer (SHO)-Long Short-Term Memory (SHO-LSTM) approach for text classification. Here, the Skip-gram approach was used for word embedding, and the weights of the LSTM were optimised by the SHO to improve the correctness of the approach. It had better convergence capability, but the parameter optimisation was not that effective.

Review on meta-heuristics algorithms
The recent meta-heuristic optimisation algorithms are reviewed in terms of their advantages and disadvantages, and the details are tabularised in Table 1.

Research gaps
• The quadratic programming model designed by Lim and Kim (2020) minimises the number of redundant terms chosen and increases the classification accuracy. However, this approach considers a greater number of dependencies between the terms, which increases the processing time and results in a major limitation.
• In the classification mechanism, eliminating noise, such as redundant and irrelevant features, from a large volume of text documents is a complex task. Hence, dimension reduction schemes, namely feature selection and feature extraction, are required to solve such issues (Liu et al., 2020).
• Machine learning and deep learning methods obtain higher accuracy in sentiment analysis and topic classification, but this performance often depends on the quality and size of the training samples, which are often difficult to collect (Wei & Zou, 2019).
• Much traditional text classification research concentrates only on a certain type of sentence or phrase. These techniques rely on target words or target sentences to solve text classification issues without considering the relationship between the words (Liu & Guo, 2019).
• Unlike documents or paragraphs, short texts are unclear, as they do not contain sufficient context information, which is a major issue for classification.

Proposed IWTSO-based HAN
The IWTSO-based HAN is devised in this research for classifying text. Here, the best features are acquired by employing IWTSO. Finally, the text classification process is carried out with HAN, which is trained by the IWTSO formed by merging improved IWO (Misaghi & Yaghoobi, 2019) with TSA (Kaur et al., 2020). Figure 1 portrays the schematic view of the IWTSO-based HAN.

Acquisition of text data
Assume the dataset D with n text documents, represented as D = {D_1, D_2, ..., D_n}, where D denotes the dataset, D_i indicates the text data available at the i-th dataset index, and n signifies the overall number of text files. For the Reuter dataset n = 19043, for the 20-Newsgroup dataset n = 19997, and for the real-time data n = 5000, respectively. The input text data D_i is used to execute the classification process.

Table 1. Review on recent meta-heuristics algorithms.
Reference | Algorithm | Advantages | Disadvantages
(unspecified) | (unspecified) | Easy implementation; applies to engineering design problems | Not applicable to some engineering design issues
(Tanhaeean et al., 2022) | Boxing Match Algorithm (BMA) | Solved various difficult numerical issues | Accuracy of the optimal solution was reduced for some mathematical functions
(Abdollahzadeh et al., 2022) | Mountain Gazelle Optimizer (MGO) | Avoids premature convergence | No analysis for real-world engineering issues
(Misaghi & Yaghoobi, 2019) | Improved IWO | Fast convergence and high accuracy; improved population diversity | Performance reduced on some complex functions
(Kaur et al., 2020) | TSA | Applicable to complex unconstrained design issues | Not applicable to multi-objective optimisation issues

Pre-processing
D_i is subjected to pre-processing, which is done by employing stop word removal and stemming. Pre-processing is an important step in information retrieval and text mining because the data is unstructured; its purpose is to remove noise and clean the text before the classification task. At this phase, the dimension of the data can be reduced through handling missing values, removing duplication, or minimising the overall features. Here, data in unstructured format is changed to a structured text representation: the procedure of cleaning and preparing the text for classification.
Stop word removal: Stop words are parts of natural language that do not contribute meaning in a text processing system. The most common words in text documents are articles, pronouns, and prepositions; words that do not provide meaning to the document are treated as stop words. They are not considered keywords in text applications, so they must be eliminated from the documents. This step removes frequent, common words that have no important influence on the sentence, e.g. but, also, to, have, and can.
Stemming: This is the process of obtaining the root or base of a word by eliminating prefixes and suffixes, reducing word variants to their root form. The library used for stemming is the Porter Stemmer. The outcome of this phase is signified as A with size [U × V].
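As a rough illustration of these two pre-processing steps, the sketch below applies a toy stop-word list and a deliberately simplified suffix-stripping rule in place of the full Porter algorithm; the word list and suffix rules here are illustrative assumptions, not the paper's exact configuration.

```python
import re

# Illustrative stop-word list; a real pipeline would use a fuller set
# (e.g. NLTK's English stop-word corpus).
STOP_WORDS = {"but", "also", "to", "have", "can", "the", "a", "an", "is", "of"}

def simple_stem(word):
    """Very simplified suffix stripping, standing in for the Porter stemmer."""
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [simple_stem(t) for t in tokens]              # stemming

print(preprocess("The classifiers can also have classified documents"))
```

Note how crude suffix stripping maps "classified" to "classifi"; the real Porter stemmer applies staged rules to avoid many such artefacts, which is why the paper relies on a library implementation.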

Extraction of features from text data
The extracted features are explained below.
Wordnet-based feature: WordNet is a commonly used lexical resource for NLP tasks. It is a network of concepts in the form of word nodes, organised by the semantic associations among words depending on their sense. A semantic relation is a pointer between synsets, and WordNet is specifically used to find synsets (Liu et al., 2015). The synsets identified from the text data are specified as f_1.
Co-occurrence-based feature: The use of term sets or item sets in documents is termed the co-occurrence of terms, i.e. the frequent joint occurrence of words in the text corpus. It is specified as C(i, t) = R_it / Z_t, where R_it denotes the co-occurrence frequency of words i and t, and Z_t indicates the frequency of word t. The co-occurrence-based features acquired from the text data are specified as f_2.
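A minimal sketch of this feature, assuming the normalisation R_it / Z_t implied by the definitions above; the co-occurrence window size and whitespace tokenisation are illustrative choices, not the paper's stated configuration.

```python
from collections import Counter

def cooccurrence_features(documents, window=2):
    """Co-occurrence frequency R_it of ordered word pairs within a window,
    normalised by the frequency Z_t of the second word."""
    pair_counts = Counter()   # R_it
    word_counts = Counter()   # Z_t
    for doc in documents:
        tokens = doc.lower().split()
        word_counts.update(tokens)
        for k in range(len(tokens)):
            for j in range(k + 1, min(k + 1 + window, len(tokens))):
                pair_counts[(tokens[k], tokens[j])] += 1
    return {(i, t): r / word_counts[t] for (i, t), r in pair_counts.items()}

feats = cooccurrence_features(["text mining helps text classification"])
```

In the toy corpus, "text" and "mining" co-occur once and "mining" occurs once, so C(text, mining) = 1/1 = 1.0.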
TF-IDF: This feature contains two parts, namely TF and IDF. TF measures the frequency of individual words in a text, and IDF signifies how rare a word is across the texts (Wu et al., 2020). The term frequency is specified as

TF(a) = n(a) / n,

where n(a) indicates the number of occurrences of entry a in the class, and n denotes the total number of entries. The inverse text frequency in IDF is specified as

IDF(a) = log(N / N(a)),

where N denotes the overall number of texts in the corpus, and N(a) indicates the total number of texts in the corpus that contain the word a. The TF-IDF is represented as

TF-IDF(a) = TF(a) × IDF(a),

and this feature is specified as f_3. The overall extracted feature vector is f = {f_1, f_2, f_3}.
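The TF-IDF computation described above can be sketched directly from its two parts; this is a plain, unsmoothed TF-IDF (no smoothing terms, whitespace tokenisation), which is an assumption since the paper does not state those details.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Per-document TF-IDF: TF(a) = n(a)/n, IDF(a) = log(N / N(a))."""
    N = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    df = Counter()                 # N(a): number of documents containing term a
    for tokens in docs:
        df.update(set(tokens))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)       # n(a) per term
        n = len(tokens)            # total entries in the document
        scores.append({a: (c / n) * math.log(N / df[a]) for a, c in tf.items()})
    return scores

scores = tf_idf(["good film good story", "bad film"])
```

A term that appears in every document (here "film") gets IDF = log(1) = 0, illustrating why such ubiquitous terms carry no discriminative weight.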

Feature selection by IWTSO algorithm
After the extraction of features from the text data, the unique and important features are selected using the IWTSO. Selecting the important features from the text data increases the categorisation correctness, as redundant and duplicate data are removed. IWO is a population-based algorithm in which a plant is represented as a weed that grows unintentionally in the environment. It is very effective in converging to optimum solutions through the essential behaviours of competition, growth, and seeding in the weed colony; the basic characteristics of reproduction, spatial spread, and competitive exclusion are employed to simulate the habitat behaviour of the weeds. TSA is a bio-inspired algorithm that models the swarm behaviour and jet propulsion of tunicates.
Solution encoding: This is the representation of the solution vector, in which the selected optimal feature subset is indicated as X, where X < f. Figure 2 portrays the solution encoding.
Fitness measure: This is the process of computing the optimal features among a set of features by consideration of the accuracy measure. The equation for this is shown in Eq. (37).
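A minimal sketch of the solution encoding and fitness idea: a candidate solution is a binary mask over the extracted features, and its fitness is the classification accuracy obtained with the selected subset. The `toy_accuracy` function below is a hypothetical stand-in for actually training and evaluating the classifier, and the "informative" feature indices are invented for illustration.

```python
def fitness(mask, evaluate_accuracy):
    """Fitness of a candidate feature subset: the classification accuracy
    achieved with the features selected by the binary mask."""
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0                       # empty subsets get the worst fitness
    return evaluate_accuracy(selected)

# Hypothetical stand-in for training/evaluating the classifier on a subset:
# it simply rewards subsets containing the (assumed) informative features.
informative = {0, 2, 5}
toy_accuracy = lambda sel: len(informative & set(sel)) / len(informative)

mask = [1, 0, 1, 0, 0, 1, 0, 0]          # candidate solution X, with |X| < |f|
print(fitness(mask, toy_accuracy))
```

In a real run, `evaluate_accuracy` would train and validate the downstream classifier on the masked feature matrix, which is what makes the search expensive and motivates a fast-converging optimiser.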

Algorithmic procedure of IWTSO
(i) Initialisation: Let H be the population of weeds in the solution space, and H_best the finest position of weeds.
(ii) Fitness computation: This step discovers the best solution in the feature selection process, determined using the difference between the actual and target values.
(iii) Update solution: Following the standard TSA formulation (Kaur et al., 2020), with the symbols used here, the update vector is computed as

B = C / I, with C = b_2 + b_3 − K, K = 2·b_1, and I = ⌊Q_min + b_1(Q_max − Q_min)⌋,

and the location is updated towards the best agent as

H(s + 1) = H_best + B·|H_best − r·H(s)| if r ≥ 0.5, and H(s + 1) = H_best − B·|H_best − r·H(s)| otherwise,

where B denotes the update vector, C implies the gravity force, I signifies the social force between the search agents, K indicates the water flow advection, x(s) represents the chaotic mapping, and r, b_1, b_2, and b_3 are random numbers in the interval [0, 1]. Here, Q_min is set to 1 and Q_max is set to 4, respectively.

(iv) Feasibility evaluation:
The factor of fitness is computed for every result in such a way that the result with the best fitness value is declared as the best result.

(v) Termination:
The aforementioned steps are repeated until the best result is attained.
Algorithm 1 represents the pseudo-code of the IWTSO.
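Algorithm 1 itself is not reproduced in this chunk, so the following is only an illustrative sketch of an IWTSO-style hybrid loop on a toy objective: a TSA-style jet-propulsion update towards the best agent combined with IWO-style Gaussian seed dispersal whose spread shrinks over iterations. The population size, bounds, dispersal schedule, and test function are all assumptions for illustration, not the paper's settings.

```python
import math
import random

def iwtso_minimise(f, dim, iters=200, pop=20, lo=-5.0, hi=5.0,
                   q_min=1, q_max=4, seed=0):
    """Illustrative IWTSO-style hybrid: TSA update plus IWO seed dispersal."""
    rng = random.Random(seed)
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop)]
    best = min(X, key=f)
    for it in range(iters):
        sigma = (1 - it / iters) * 0.5            # IWO dispersal shrinks over time
        new_pop = []
        for x in X:
            # TSA-style vector: B = C / I with C = b2 + b3 - K, K = 2*b1
            b1, b2, b3 = rng.random(), rng.random(), rng.random()
            K = 2 * b1
            C = b2 + b3 - K
            I = math.floor(q_min + b1 * (q_max - q_min))  # I is 1, 2, or 3
            B = C / I
            r = rng.random()
            cand = [bi + B * abs(bi - r * xi) if r >= 0.5
                    else bi - B * abs(bi - r * xi)
                    for xi, bi in zip(x, best)]
            # IWO-style seed: Gaussian dispersal around the candidate, clipped
            seed_pt = [min(hi, max(lo, c + rng.gauss(0, sigma))) for c in cand]
            new_pop.append(min((x, cand, seed_pt), key=f))   # keep the fittest
        X = new_pop
        best = min([best] + X, key=f)             # elitism: never lose the best
    return best, f(best)

sol, val = iwtso_minimise(lambda v: sum(vi * vi for vi in v), dim=3)
```

The elitist `best` update mirrors step (v) of the procedure: the loop terminates after a fixed budget, returning the best result attained so far.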

Text classification using proposed IWTSO-based HAN
After selecting the features, text categorisation is achieved by employing the HAN. Text classification is a fundamental task in NLP whose key goal is to allocate labels to text. The advantage of using HAN for text classification is that it captures two basic insights about document structure. First, as a document has a hierarchical structure, the document representation is constructed by modeling sentence representations and then aggregating them into a text representation. Second, different sentences and words in a document are differently informative, and the importance of a sentence is generally context-dependent: the same sentence or word can have different importance in different contexts. To enhance the categorisation performance, the HAN includes two levels of attention: word level and sentence level.
Word encoder: Here, the input feature X is embedded into vectors using the embedding matrix E. The bidirectional gated recurrent unit (GRU) consists of a forward GRU, which reads from first to last, and a backward GRU, which reads from last to first. The annotation for the given feature X is obtained by concatenating the forward hidden state →g and the backward hidden state ←g, such that g = [→g, ←g].

Word attention:
All words do not contribute equally to the representation of the sentence meaning; word attention extracts the significant words in the sentence, and the representations of the informative words are merged to form the sentence vector. First, the word annotation g is passed through a one-layer MLP to obtain its hidden representation

y_X = tanh(E·g + q),  (28)

then the importance of the word is measured as the similarity of y_X with the word-level context vector y_v, and the normalised weight β is obtained using the Softmax function. The sentence vector V is then determined as the weighted sum of the word annotations using these weights. Figure 3 portrays the structure of HAN.
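The word-attention computation just described can be sketched with plain Python lists; the tiny dimensions, identity projection matrix, and context vector below are illustrative assumptions, not trained parameters.

```python
import math

def word_attention(annotations, context, W, q):
    """HAN word-level attention: y = tanh(W·g + q), score = y·context,
    weights via softmax, sentence vector = weighted sum of annotations."""
    def proj(g):                                   # one-layer MLP: tanh(W g + q)
        return [math.tanh(sum(w * gi for w, gi in zip(row, g)) + qi)
                for row, qi in zip(W, q)]
    scores = [sum(yi * ci for yi, ci in zip(proj(g), context))
              for g in annotations]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]       # numerically stable softmax
    beta = [e / sum(exps) for e in exps]           # normalised weights
    dim = len(annotations[0])
    sentence = [sum(b * g[d] for b, g in zip(beta, annotations))
                for d in range(dim)]               # weighted sum of annotations
    return beta, sentence

# Toy example: three word annotations of dimension 2, identity projection.
ann = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
beta, V = word_attention(ann, context=[1.0, 0.0],
                         W=[[1.0, 0.0], [0.0, 1.0]], q=[0.0, 0.0])
```

With this context vector, the first annotation aligns best with the context and therefore receives the largest weight, which is exactly the "informative words dominate the sentence vector" behaviour the text describes.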
Sentence encoder: V is the sentence vector, which is used to derive the document vector; a bidirectional GRU is used to encode the sentences. The annotation of a sentence is obtained by concatenating →g_u and ←g_u, i.e. g_u = [→g_u, ←g_u], which summarises the neighbouring sentences around the sentence.

Sentence attention:
The significance of sentences is determined using the sentence-level context vector W:

y_u = tanh(E_z·g_u + q_z).  (33)

The document vector, denoted as Y, holds the details of the text and has dimension [1 × 22].

(ii) Training process of HAN:
The HAN is trained with the IWTSO.

Solution encoding:
This is useful in identifying the accurate optimal solution. Here, the solution is H = [1 × κ], where κ indicates the number of weights; the weight factor is updated in every iteration.

Fitness function:
The error variation between the target and the actual output is computed to determine the fitness measure, where the actual value is indicated as O and the classified outcome is represented as Y.
The other steps of the IWTSO algorithm are discussed in Section 3.4.
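The training fitness above compares the target O with the classified output Y; the paper's exact equation is not reproduced in this chunk, so the sketch below uses a mean squared error as one common instantiation of such an error measure (an assumption, not the paper's stated formula).

```python
def training_fitness(targets, outputs):
    """Mean squared error between target values O and classified outputs Y.
    MSE is assumed here as a common form of the target/actual error variation."""
    n = len(targets)
    return sum((o - y) ** 2 for o, y in zip(targets, outputs)) / n

print(training_fitness([1, 0, 1, 1], [0.9, 0.2, 0.8, 1.0]))
```

Lower fitness means the HAN weights encoded in the solution vector H produce outputs closer to the targets, which is what the IWTSO minimises during training.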

Results and discussion
The results and analysis of the IWTSO-based HAN regarding the performance measures are explained in this section.

Experimental setup
The IWTSO-based HAN is implemented in Python with the TensorFlow and Keras libraries, on a Windows 10 machine with an Intel processor and 2 GB RAM. Table 2 shows the experimentation parameters.

Evaluation metrics
The evaluation is done using the accuracy, TPR, TNR, precision, and FNR metrics.
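These are the standard confusion-matrix metrics; a short sketch of how each is computed from the true/false positive and negative counts (the counts below are invented example values):

```python
def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics used in the evaluation."""
    return {
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
        "TPR":       tp / (tp + fn),          # true positive rate (recall)
        "TNR":       tn / (tn + fp),          # true negative rate (specificity)
        "precision": tp / (tp + fp),
        "FNR":       fn / (fn + tp),          # false negative rate = 1 - TPR
    }

m = metrics(tp=90, tn=85, fp=5, fn=10)
```

Note that FNR is the complement of TPR, so a high TPR together with a low FNR, as reported for the IWTSO-based HAN, is internally consistent.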

Performance analysis
This section explains the performance analysis made by the IWTSO-based HAN.
(a) Analysis with Reuter dataset
Figure 4(c) depicts the performance analysis by considering the TNR measure. At 80% of the training data, the IWTSO-based HAN achieved a TNR of 0.799 with feature size 100, 0.815 with 200, 0.894 with 300, 0.914 with 400, and 0.924 with 500. Figure 4(d) portrays the FNR analysis. When considering 80% of the training data, the developed method achieved an FNR of 0.193 with feature size 100, 0.165 with 200, 0.096 with 300, 0.075 with 400, and 0.059 with 500. The precision analysis is illustrated in Figure 4(e).
At 80% of the training data, the precision computed by the IWTSO-based HAN is 0.865 with feature size 100, 0.887 with 200, 0.904 with 300, 0.924 with 400, and 0.948 with 500.
(b) Analysis with 20-Newsgroup dataset
Figure 5 depicts the performance analysis made by the IWTSO-based HAN considering the 20-Newsgroup dataset. Figure 5(a) depicts the accuracy analysis. At 60% of the training data, the accuracy measured by the developed method is 0.760 with feature size 100, 0.784 with 200, 0.824 with 300, 0.874 with 400, and 0.924 with 500. Figure 5(b) portrays the TPR analysis. At 60% training data, the TPR of the IWTSO-based HAN is 0.741 with feature size 100, 0.774 with 200, 0.814 with 300, 0.864 with 400, and 0.913 with 500. The TNR analysis is given in Figure 5(c). At 60% training data, the TNR measured by the IWTSO-based HAN is 0.724 with feature size 100, 0.754 with 200, 0.804 with 300, 0.877 with 400, and 0.905 with 500.
Figure 5(d) portrays the FNR analysis. At 70% of the training data, the FNR of the IWTSO-based HAN is 0.226 with feature size 100, 0.215 with 200, 0.175 with 300, 0.126 with 400, and 0.076 with 500. The precision analysis is portrayed in Figure 5(e). At 80% of the training data, the precision computed by the IWTSO-based HAN is 0.874 with feature size 100, 0.897 with 200, 0.914 with 300, 0.924 with 400, and 0.945 with 500.

Comparative analysis
This section explains the comparative analysis made by the IWTSO-based HAN with three kinds of datasets.

Comparative discussion
In the IWTSO-based HAN, pre-processing is done by stop word removal and stemming, in which the unstructured format is changed to a structured text representation. The swarm behaviour and jet propulsion of TSA are integrated with the weed behaviour of improved IWO, so the convergence rate of the optimisation is increased, enabling it to generate the global best solution by eliminating local optima. Moreover, to enhance the classification performance, the two attention levels of HAN, word level and sentence level, are used. Thus, the performance of the IWTSO-based HAN is improved compared to the other existing methods.

Conclusion
An effective classifier named HAN is developed for performing the text classification process. The IWTSO-based HAN involves different phases to classify the text document. At first, the input text data is pre-processed; then, feature extraction acquires the features associated with the text data. The feature selection process selects the optimal features of the data to enhance the classification performance. The HAN is employed to classify the text document, with the HAN trained by the IWTSO algorithm. The IWTSO-based HAN obtained higher performance in terms of accuracy, TPR, TNR, and precision, and a lower FNR, of 0.924, 0.924, 0.941, 0.954, and 0.0758, respectively. Text classification is used in various fields, such as data mining, artificial intelligence, information retrieval, and NLP. The major applications of text classification are spam detection in emails, language detection, sentiment analysis, speech recognition, topic labeling, and intent detection. However, the efficiency of the feature selection method is not evaluated, which may affect the accuracy of the model. A future direction of this research would be the consideration of publicly available larger datasets; also, the performance of the implemented feature selection method will be evaluated against other filter-based feature selection approaches.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Structure of HAN: The structure of HAN is composed of different parts (Yang et al., 2016). The input taken by the classifier has dimension [40 × 50], whereas the result generated by the bidirectional layer has size [40 × 100]. The attention layer processes data of dimension [40 × 100] and produces result data of size [1 × 100].

Figure 4
Figure 4 portrays the analysis of the IWTSO-based HAN with the Reuter dataset. Figure 4(a) shows the accuracy analysis. Considering the training data of 80%, the accuracy computed by the IWTSO-based HAN is 0.813 with feature size 100, 0.854 with 200, 0.897 with 300, 0.935 with 400, and 0.954 with 500. The performance analysis made by the TPR is portrayed in Figure 4(b). For 90% of training data, the TPR measured by the IWTSO-based HAN considering feature sizes 100, 200, 300, 400, and 500 is 0.846, 0.874, 0.924, 0.931, and 0.954, respectively. Figure 4(c)

Table 1 .
Review on recent meta-heuristics algorithms.
As J is the best TSA search agent, it is substituted into the H_best of the improved IWO.
Table 4 portrays the comparative discussion of the IWTSO-based HAN. Considering the Reuter dataset, the accuracy measured by the IWTSO-based HAN is 0.913, whereas the TPR