An evolutionary approach for depression detection from Twitter big data using a novel deep learning model with attention based feature learning mechanism

One of the main factors causing suicide is depression, yet many cases of depression go undetected or are diagnosed incorrectly. An increasing number of people with mental illnesses express their emotions online using tools like social media (SM) and specialized websites. Recently, efforts have been made to use Machine Learning (ML) and Deep Learning (DL) models to predict depression from SM platforms; however, most current ML algorithms provide no explanation for their predictions. As a result, this study proposes a novel DL model, a residual network 50 combined with an optimal long short-term memory network (RNT-OLSTM), for Depression Detection (DD) on Twitter data. In addition, to address the issue of data imbalance in the Twitter data, a cluster-based oversampling approach is used, which considerably reduces the possibility of bias towards the dominant (non-depressed) class. Finally, the embedding layers are fed into the RNT-OLSTM for DD, in which the hyperparameters of the network are tuned using the Sine Chaotic map and constriction-factor-based Coyote Optimization Algorithm (SCCOA) to minimize the prediction loss. The outcomes prove that the proposed system performs better than the existing schemes for the DD of imbalanced Twitter data, with higher detection rates.


Introduction
Users have altered their methods of communication and information gathering on the web due to the rapid development of computer science and Internet technologies. Through blogs and SM, consumers now actively participate in creating digital content in the Web 2.0 era. Utilizing the features offered by these new platforms, such as Facebook, Twitter, Reddit or related sites [1][2][3], users post their thoughts, ideas and sentiments about a subject, topic or issue online [4]. SM content from different websites and platform providers, such as posts, comments, tweets and reviews, has dramatically influenced the development of big data. Data analytics and artificial intelligence are currently experiencing a new era of excitement due to the explosion of big data from SM. Finding effective methods to gather a high volume of social data and draw insights from the enormous amount of data obtained is the main problem faced by big social data.
By capturing public opinions, DD in big social data might offer commercial insights. Depression is a severe mental disorder that has a detrimental effect on one's feelings and thoughts [5,6]. In 2021, the World Health Organization estimated that 280 million individuals worldwide experience depression, many of whom go untreated in nations with low and moderate incomes. During the coronavirus epidemic, the number of instances of depression in the US tripled. In Thailand, 60% of suicides are motivated by depression, and the number of cases of depression is still rising [7].
Due to a lack of awareness of depression, there are still numerous occurrences of misdiagnosed depression, even though many patients are receiving treatment. As a result, patients will experience negative thoughts, including self-isolation, erratic behaviour and suicidal thoughts, and will become dependent on drugs like antidepressants [8,9]. Prediction is crucial before a person is impacted by a psychological disorder [10]. Data-cleansing techniques are more appropriate for detecting depression; the majority of label-cleaning techniques, however, rely on, or make assumptions about, the dataset's noise distribution. As a result, numerous researchers have recently focused ML algorithms on diagnosing depression [11,12]. ML is one of the most important aspects of artificial intelligence [13]. Finding the most helpful term features to make a forecast has demonstrated exemplary performance in predicting whether SM users have depression [14].
To do so, a variety of ML approaches, such as random forest (RF), logistic regression (LR), naive Bayes (NB) and support vector machines (SVM) [15], can be employed to identify depression. The effectiveness of these algorithms can be assessed using a confusion matrix. A successful algorithm will have a high precision score and help accurately forecast sentiment, which can be either positive or negative [16]. However, this approach relies on the grammatical accuracy of SM data and requires many attributes to diagnose depression. It also only recognizes explicit aspects while failing to extract implicit ones. Several authors have recently focused on DL, a subfield of ML, for early DD in SM data [17,18]. Even without feature extraction, DL models can handle high-dimensional data with little information loss [19].
To address the classification problem, DL models do not require the hand-crafted classifiers of hand-modelled approaches [20]. DL was designed to accurately forecast depression from SM data by training a nonlinear, complex relationship among input factors utilizing their prior features. This strategy needs a more sophisticated model to study the effects of various parameters and their modifications. The issue of class imbalance is also a significant challenge in detecting depression. Still, the class imbalance problem is only partially resolved, because imbalanced data hinders the classifier's performance, which in turn lowers the prediction rate. This research subsequently suggests a novel DL method for DD of Twitter data. The proposed work's primary contributions are listed below. The following outlines the remaining parts of the paper: Section 2 surveys the recently developed ML and DL models for DD of SM data. A brief explanation of the suggested research model is given in Section 3. The outcomes of the proposed and existing research frameworks regarding some performance metrics are given in Section 4. Finally, the conclusion and future research directions are described in Section 5.

Related works
This section discusses the recent methodologies of DD using various ML and DL models. The limitations of the surveyed models are presented at the end of the literature review, along with the solution offered by the proposed system to overcome those limitations.
Lei Tong et al. [21] presented a DD approach called cost-sensitive boosting pruning trees (CBPT) for Twitter SM data. Initially, the system performed pre-processing operations such as reducing influential noisy samples, stemming, and removing irregular words. Afterwards, features such as the user profile, linguistic features and social interactions were extracted from the pre-processed data. Finally, the extracted features were fed into the CBPT for DD. The system utilized two datasets, the CLPsych 2015 (CLPsych2015) Twitter dataset and the Tsinghua Twitter Depression Dataset (TTDD) from the UCI repository. The outcomes illustrated that the method attained higher accuracy and f-score than the existing related schemes.
Amna Amanat et al. suggested a DL strategy for DD from textual data [22]. The data was initially gathered from the freely available Tweets-Scraped dataset on Kaggle, which included more than 4000 tweets. After that, mentions, stop words, retweets, and URLs were deleted from the dataset as part of pre-processing. Then, features were extracted from the dataset using one-hot encoding, and feature visualization was done using principal component analysis (PCA). Finally, a recurrent neural network (RNN) and long short-term memory (LSTM) were used for DD. The outcomes showed that the system attained an accuracy of 99% with a lower false positive rate.
Based on multimodal analysis, Ramin Safa et al. [23] demonstrated automatic detection of depression symptoms on Twitter. Data was initially gathered from tweets using a regular expression over self-reported diagnoses. All retweets and URLs were eliminated, and the sentiment polarity score was then determined using a lexicon-based methodology. The correlation characteristic was then measured using harmonic analysis, and the dimensionality was reduced by the singular value decomposition (SVD) method. Finally, a multimodal ML technique was used to perform the DD. The method attained accuracies of 91% and 83% when working independently on bio-text and tweets, which were satisfactory compared to the results of the existing approaches.
To detect depression, Raymond Chiong et al. [24] investigated sentiment lexicons and content-based characteristics for DD. The input features were first extracted from the collected data, and all of them were subsequently normalized using min-max normalization on a scale of 0-1. Finally, four standard single classifiers, namely LR, decision tree (DT), multilayer perceptron (MLP) and SVM, as well as four ensemble models, namely adaptive boosting (AB), bagging predictors (BP), gradient boosting (GB) and RF, were used to find depression in tweets. Experiments were conducted on two openly available Twitter datasets, and the findings showed that the proposed system attained an accuracy of 96%, which was superior to the existing detection techniques.
Hamad Zogan et al. [25] presented an explainable multi-aspect DD with a hierarchical attention network (MDHAN) for the DD of SM data. The system first gathered three complementary datasets: a depression dataset, a non-depression dataset, and a depression-candidate dataset. After that, the dataset's social data and interactions, emoji sentiment, topic distribution, and domain-specific features were retrieved. The user tweets were then encoded using an RNN, whose classification layer assessed whether the user was depressed or not. The experimental results demonstrated that the MDHAN outperformed previous approaches with an accuracy of 0.895, precision of 0.902, recall of 0.892, and f-measure of 0.893, ensuring substantial evidence to support the prediction.
Multimodal DD on Instagram was suggested by Chun Yueh Chiu et al. [26], considering the period between the posts. The system gathered a dataset of depressive and non-depressive Instagram users to perform DD on SM. Then, data pre-processing operations such as word segmentation, redundancy removal, and emoticon decoding were carried out. The pre-processed dataset was then used to extract features via a convolutional neural network (CNN). Finally, a bidirectional LSTM was employed to identify persons with depression. According to the experimental results, the system can detect depressive users with an F1 score of up to 0.835.
Raymond Chiong et al. [27] suggested an ML model incorporating a textual feature learning strategy for DD. The technique first removed punctuation, numerals and stop words, and performed word correction. It then utilized the bag-of-words (BOW) approach to extract features from the dataset. Finally, DD was carried out using logistic regression, SVM, MLP, decision tree, RF, adaptive boosting, bagging, and gradient boosting approaches. Initially, two openly available Twitter datasets were utilized for training and testing the ML classifiers. After that, the system used three non-Twitter SM data sources, namely Reddit, Facebook and an electronic diary, to validate the presented classification approaches. According to the test results, the system could accurately and successfully detect depression in SM texts with an accuracy of 99.28%.
A summarization-boosted deep framework for DD on SM was introduced by Hamad Zogan et al. [28]. The dataset's most pertinent embedded summary sentence features were first extracted using a CNN. Bidirectional Gated Recurrent Units (BiGRU), a type of RNN, were then utilized to extract long-term dependencies and sequential information from the features extracted by the CNN. After that, the dimensionality was reduced using Latent Dirichlet Allocation (LDA). Finally, the DD was done using the fully connected layer of the BiGRU. A dataset of depressed users, non-depressed users, and depression candidates was used in the investigation. Comprehensive experiments showed that the system performed substantially better than existing methods (+3% accuracy and +6.5% F-score).
A DD system based on ML and natural language processing over social interactions was provided by Tanya Nijhawan et al. [29]. The system first performed tokenization, the elimination of undesirable patterns, the elimination of stop words, and stemming. Then, using the bag-of-words method, the system extracted feature vectors. Finally, using ML models and Bidirectional Encoder Representations from Transformers (BERT), the depressed users were detected. A collection of tweets made up the dataset used to train the binary sentiment analysis system. The experimental findings demonstrated the system's accuracy of 97.78%, which was higher than the state-of-the-art techniques.
Using the multinomial naive Bayes approach, Rinki Chatterjee et al. [30] suggested DD from SM posts. The system first retrieved data via the Facebook API and the Twitter API, and counted how many positive, negative, and neutral sentiments there were overall. Following the classification of the remarks, the levels of depression and non-depression were determined. Finally, the system detected depression using the multinomial naive Bayes classifier. The experimental findings demonstrated that the system outperformed traditional approaches regarding precision, f-measure, accuracy, and recall, achieving values of 0.77, 0.86, 0.31, and 0.46, respectively.
Recent publications have focused on ML- and DL-based techniques for DD. Numerous authors have suggested various ML techniques for diagnosing depression, including decision trees, NB, SVM, K-nearest neighbours, and LR. These ML-based methods produce good outcomes, but they struggle to identify depressed individuals at the user level, making them vulnerable to inaccurate prediction [21][22][23][24][25][26][27][28]. Handling the enormous amount of data provided by Twitter is a difficult challenge, so most authors use DL-based techniques for depression identification to get around these problems [29][30][31][32][33]. As a result, the DL approaches are more reliable and offer greater precision for severe depression utilizing user history. However, none of the studies mentioned above focused on tuning the randomly initialized hyperparameters of the DL model, which affects the system's performance; consequently, the approaches mentioned above produce predictions with lower accuracy and need more processing time. The suggested method uses a novel DL technique, called OLSTM, to address these flaws, and SCCOA is used to tune the LSTM's hyperparameters. Additionally, the low prevalence of depression in the general population and the resulting uneven class composition present challenging issues for DL techniques. Previous research has yet to look into the issue of class imbalance, and an unbalanced dataset can reduce the model's accuracy. With the help of the SDFCM-SMOTE technique, our suggested study resolves this problem.

Proposed methodology
This paper proposes a novel deep-learning model for the DD of SM data. Initially, the Twitter data is collected from a publicly available Kaggle dataset. Afterwards, the class imbalance problem is solved by SDFCM-SMOTE. Then, pre-processing operations, such as removing emojis, stop words and punctuation, lowercasing, lemmatization, and tokenization, are performed on the balanced dataset. Next, word embedding is performed on the pre-processed data utilizing TF-IDF and Word2vec (W2V), which gives the feature representation of the pre-processed data. Finally, the RNT-OLSTM is employed for DD, where ResNet50 is utilized to extract spatial features and SCCOA tunes the hyperparameters of the LSTM. The structure diagram of the proposed system is shown in Figure 1.

Dataset balancing
Initially, Twitter data is gathered from publicly available data sources. The class imbalance problem of the dataset is rectified using the proposed clustering model. A class-imbalanced dataset is one in which the majority class greatly outnumbers the minority class. This severe issue can lead to overfitting problems and a high false positive rate (FPR) in the classification task. This article suggests an SDFCM-SMOTE to solve the class imbalance issues. The suggested algorithm includes three steps: clustering, filtering, and oversampling. In the first step, clustering, SDFCM is used to build clusters using the acquired Twitter data as input. FCM, one of the most well-known algorithms, performs clustering effectively by minimizing an objective function and iterating membership and centroid updates. It builds k clusters before assigning each data point to a cluster. The choice of the initial cluster centres and the initial membership values affects how well the FCM algorithm performs. The FCM algorithm will converge very quickly, and processing time can be greatly decreased, if an initially chosen cluster centre is near the true final cluster centre.
The clustering results are highly unpredictable because the initial cluster centres are typically chosen at random by the conventional FCM algorithm. So, the optimal k-value is first determined using the Silhouette Distance (SD) before FCM clustering is applied. This silhouette-distance-based FCM for selecting optimal cluster centroids is named SDFCM. The proposed algorithm's next phase involves filtering procedures that keep clusters with a high percentage of minority samples and synthesize more minority samples into sparse clusters. Finally, the clusters with a low density of minority samples are oversampled using the SMOTE method. These steps are thoroughly discussed in the sections below.
Step 1: Clustering

Let the input Twitter dataset be IT_s = {x_1, x_2, x_3, ..., x_A}; the FCM algorithm divides the samples into k random clusters. However, randomly created initial clusters are likely to get stuck in a local optimum and never converge to the global optimum. Hence, the proposed system uses the silhouette distance to create an optimal number of cluster centres. The SD's value ranges from −1 to +1, where +1 denotes the best fit for the cluster and −1 denotes the worst fit, and it is expressed as:

SD(x) = (v(x) − u(x)) / max(u(x), v(x)) (1)

where u(x) refers to the average distance between instance x and the other instances within its cluster, and v(x) refers to the average distance between the instance and the nearest neighbouring cluster. The FCM algorithm's objective function is then defined as follows:

J = Σ_{m=1}^{A} Σ_{n=1}^{k} (x_mn)^w D²_mn (2)

subject to Σ_{n=1}^{k} x_mn = 1, x_mn ∈ [0, 1] (3)

where w refers to the fuzzy weighted index or smoothing factor, which controls the degree of fuzziness between fuzzy classes, and D²_mn indicates the squared distance between the element (IT_s)_m in the dataset IT_s and the cluster centre y_n. Afterwards, the degree of membership x_mn can be obtained by minimizing the objective function. Like the pixel probability in a mixed-modelling assumption, the degree of membership in fuzzy clustering is crucial. The iterative update functions of membership and centroid are given in the following equations:

x_mn = 1 / Σ_{p=1}^{k} (D_mn / D_mp)^{2/(w−1)} (4)

y_n = Σ_{m=1}^{A} (x_mn)^w (IT_s)_m / Σ_{m=1}^{A} (x_mn)^w (5)

Each sample point's final membership matrix with respect to all class centres is obtained by minimizing the objective function; thus, the best classification for all sample data is found.
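As a concrete illustration of how the silhouette distance can guide the choice of k before FCM is run, the sketch below (plain Python, with toy 2-D points and hypothetical hard cluster labels) scores two candidate partitions and keeps the one with the higher mean silhouette. All names and data here are illustrative, not the authors' implementation.

```python
from math import dist

def mean_silhouette(points, labels):
    """Mean silhouette: u = average intra-cluster distance, v = average distance
    to the nearest other cluster, s = (v - u) / max(u, v)."""
    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not same:                      # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        u = sum(dist(p, q) for q in same) / len(same)
        v = min(
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c) / labels.count(c)
            for c in set(labels) - {labels[i]}
        )
        scores.append((v - u) / max(u, v))
    return sum(scores) / len(scores)

# two well-separated toy blobs: the 2-cluster partition should score higher
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
k2 = [0, 0, 0, 1, 1, 1]                   # natural partition (k = 2)
k3 = [0, 0, 1, 2, 2, 2]                   # over-split partition (k = 3)
best_labels = max([k2, k3], key=lambda lab: mean_silhouette(points, lab))
```

With the two blobs above, the k = 2 labelling wins, so SDFCM would seed FCM with two centres.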
Step 2: Filtering

Here, the clusters having a higher proportion of minority samples are retained. The Euclidean distance matrix is then computed for each filtered cluster C_α, ignoring the majority samples. Each cluster's mean distance d(C_α) is computed by summing all non-diagonal elements of the distance matrix and dividing by the count of non-diagonal elements. The density of each filtered cluster is computed using equation (6):

density(C_α) = MS(C_α) / d(C_α)^l (6)

where MS(C_α) indicates the count of minority samples in the cluster and l refers to the count of features. Then the sparsity of each filtered cluster is calculated using the following equation:

sparsity(C_α) = 1 / density(C_α) (7)

At last, using equation (8), the weight of each filtered cluster WF_c is calculated:

WF_c(C_α) = sparsity(C_α) / Σ_α sparsity(C_α) (8)
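The filtering step's density, sparsity and weight computation can be sketched in a few lines, assuming the standard cluster-SMOTE forms (density = minority count over mean pairwise distance raised to the feature count; weight = sparsity normalised to sum to 1). Function and variable names are illustrative.

```python
def cluster_weights(clusters, n_features):
    """Sampling weight per filtered cluster: normalised sparsity.
    Each cluster is a list of minority samples (tuples of feature values)."""
    def mean_distance(cluster):
        total, count = 0.0, 0
        for i, p in enumerate(cluster):
            for j, q in enumerate(cluster):
                if i != j:                 # non-diagonal entries only
                    total += sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                    count += 1
        return total / count
    densities = [len(c) / mean_distance(c) ** n_features for c in clusters]
    sparsities = [1.0 / d for d in densities]           # sparsity = 1 / density
    return [s / sum(sparsities) for s in sparsities]    # weights sum to 1

# a tight cluster and a sparse cluster of minority samples
weights = cluster_weights(
    [[(0, 0), (0, 0.1), (0.1, 0)], [(0, 0), (5, 5), (10, 0)]],
    n_features=2,
)
```

The sparse cluster receives the larger weight, so it gets proportionally more synthetic samples in the oversampling step.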
Step 3: Oversampling

Finally, SMOTE oversampling is performed for each filtered cluster. SMOTE is an oversampling method in which synthetic samples are created for the minority class. This strategy seeks to avoid the overfitting problem induced by random oversampling. It concentrates on the feature space to develop new instances by interpolating between positive instances that lie close together. It is mathematically written as follows:

s_new = p_m + RaN × (q_m − p_m) (9)

where s_new indicates a newly generated sample, p_m denotes a randomly selected minority sample in the cluster, q_m refers to a nearest-neighbour minority sample of p_m, and RaN is a random number in [0, 1].
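The SMOTE interpolation above can be sketched as follows (plain Python with illustrative names; a production implementation such as imbalanced-learn's SMOTE additionally handles nearest-neighbour search efficiently):

```python
import random

def smote(minority, n_new, k=2, rng=random.Random(0)):
    """Generate n_new synthetic minority samples via s = p + RaN * (q - p)."""
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        # q: one of the k nearest minority neighbours of p
        neighbours = sorted(
            (q for q in minority if q != p),
            key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)),
        )[:k]
        q = rng.choice(neighbours)
        gap = rng.random()                  # RaN drawn uniformly from [0, 1]
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic

new_samples = smote([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=5)
```

Each synthetic point lies on the segment between a minority sample and one of its neighbours, so the new samples stay inside the minority region rather than duplicating existing points.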

Data pre-processing
Tweets, in particular, frequently include noise and unhelpful information such as typos, slang words, user-generated abbreviations and white spaces, which are unwanted and make the learning process harder for DD. To solve these issues, this paper performs the removal of stop words, emojis and punctuation, lowercasing, lemmatization, and tokenization on the balanced dataset, which makes the learning process easier and improves prediction accuracy. These are explained in the following steps:

Step 1: Removal of Emojis, Stop Words and Punctuation

Stop-word removal is one of the most often utilized pre-processing stages. Stop words such as "is," "was," "at," and "if," which do not contribute any meaning to the sentences, are removed from the tweets. This not only reduces the vector space but also improves performance by increasing execution speed and accuracy. The collected Twitter data may also contain emojis. An emoji is a small digital image or icon that conveys a thought or emotion; emojis must also be removed to improve classification accuracy. Finally, punctuation is removed from the Twitter data as an additional pre-processing step that helps treat every text in the tweets equally. For instance, after punctuation removal, the words "data" and "data!" are regarded as identical.

Step 2: Lowercasing
Lowercasing is another pre-processing approach used in natural language processing. It changes the input data into lowercase, so that "DATA," "Data," "DaTa," and "DATa" all become "data." Each word in our dataset is first converted to its original general form and then turned into lowercase to maintain consistency and prevent confusion when conducting a DD study.

Step 3: Lemmatization
After that, a lemmatization process is carried out, which converts each word to its meaningful base form to make the system more accurate.
Step 4: Tokenization

Finally, tokenization is performed on the dataset to increase the system's efficiency. Tokenization separates each sentence into parts, such as words, phrases, and other elements.
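Steps 1-4 can be sketched as one small pipeline. The stop-word set and the lemma table below are tiny illustrative stand-ins (a real system would use, e.g., NLTK's stop-word list and a WordNet lemmatizer):

```python
import re
import string

STOP_WORDS = {"is", "was", "at", "if", "the", "a", "an"}     # illustrative subset
LEMMAS = {"feeling": "feel", "felt": "feel", "days": "day"}  # toy lemma table

def preprocess(tweet):
    tweet = re.sub(r"[^\x00-\x7F]+", " ", tweet)  # strip emojis / non-ASCII symbols
    tweet = tweet.lower()                         # lowercasing
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = tweet.split()                        # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]       # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]     # lemmatization
```

For example, `preprocess("I was feeling SAD at work!!")` returns `["i", "feel", "sad", "work"]`: the stop words "was" and "at" are dropped, punctuation disappears, and "feeling" is reduced to its base form.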

Word embeddings
The process of converting words into vectors is known as word embedding. It is a helpful representation of words and often improves performance in the DD task. Word embeddings maintain valuable syntactic and semantic features of the words. In this work, the proposed system uses TF-IDF and W2V to extract vector representations of words.

TF-IDF
The TF-IDF technique efficiently recovers keywords and determines each token's importance and relevance in a particular context. It determines the relative importance of each word used in the text. Term Frequency (TF) and Inverse Document Frequency (IDF) are the two mathematical components TF-IDF uses. This paper uses TF-IDF because of its superior performance, particularly in increasing accuracy and reducing error values. Let DP_n denote a set of documents and dp a single document (dp ∈ DP_n). Every document is represented as a collection of sentences and words WoR. The first component, TF, analyzes the occurrence of specific words to determine their similarity using equation (10):

TF(WoR, dp) = N_WoR(dp) / |dp| (10)

where N_WoR(dp) indicates the count of occurrences of the word WoR in the document dp, and |dp| refers to the size of the document dp, which is computed using the following equation:

|dp| = Σ_{WoR ∈ dp} N_WoR(dp) (11)

After that, the second component, IDF, which reflects how many documents in the corpus include a particular word, is expressed as follows:

IDF(WoR) = log(|DP_n| / |{dp ∈ DP_n : WoR ∈ dp}|) (12)

Finally, TF-IDF is calculated by multiplying the TF value by the IDF value for each word. This is mathematically shown as follows:

WM_{t,dp} = TF(t, dp) × IDF(t) (13)

where WM_{t,dp} indicates the weight of the term t in a document dp. This weighting equation is applied to each word in a document.
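The TF-IDF computation maps directly onto a few lines of Python; here documents are token lists, and the corpus and names are illustrative:

```python
from math import log

def tf(word, doc):
    return doc.count(word) / len(doc)               # N_WoR(dp) / |dp|

def idf(word, docs):
    containing = sum(1 for d in docs if word in d)  # documents that include the word
    return log(len(docs) / containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)          # WM_{t,dp} = TF x IDF

docs = [["sad", "day"], ["happy", "day"], ["sad", "sad", "night"]]
```

A word concentrated in one document ("sad" in the third) gets a higher weight there than a word spread evenly across the corpus ("day"), which is exactly the behaviour the weighting is meant to capture.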

Word2vec
Another technique for obtaining word embeddings is W2V, which represents words as continuous vectors. It combines efficient use of the continuous bag-of-words (CBOW) architecture and the skip-gram design for computing word vector representations. The CBOW technique uses a context as input and seeks to forecast a specific word, whereas the skip-gram model uses a word as input and seeks to predict a target context. The similarity between skip-gram and CBOW is based on the presumption that words with comparable meanings appear in similar contexts, and hence receive comparable vector values. Because the window contains word-location information, W2V learns word embeddings using nearby words.
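To make the window idea concrete, the sketch below enumerates the (centre, context) training pairs the skip-gram model is asked to predict; CBOW simply reverses the direction. In practice a library such as gensim handles this together with the neural training, so this is only an illustration of the pair generation.

```python
def skipgram_pairs(tokens, window=2):
    """(centre, context) pairs within the given window on each side of a token."""
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs
```

For example, `skipgram_pairs(["i", "feel", "sad"], window=1)` yields `[("i", "feel"), ("feel", "i"), ("feel", "sad"), ("sad", "feel")]`, so "feel" is trained to predict both of its neighbours.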

Depression detection
Once word embeddings are obtained, DD is performed in this phase. The proposed system uses an RNT-OLSTM for DD, which classifies the word-embedded data into depressed and non-depressed tweets. The proposed approach involves five layers: the embedding layer, convolution layer, pooling layer, LSTM layer, and SoftMax layer. Here, ResNet50 extracts features primarily via convolution and pooling. ResNet-50 is a CNN with 50 layers. By introducing a shortcut to achieve residual learning, the residual block in ResNet50 integrates the features of the input layer with those of the neighbouring output layer. The OLSTM layer then uses the spatial information obtained from ResNet50 to identify depression.
Because LSTM has feedback connections, it can process all of the data in a sequence. However, the choice of hyperparameters has a significant impact on the LSTM model's performance: random hyperparameter initialization increases the processing time and computational complexity. Thus, the proposed system uses the SCCOA approach to optimally select the LSTM model's hyperparameters. This improvement over conventional LSTM makes the system better. The combination of ResNet50 and OLSTM, termed RNT-OLSTM, achieves higher classification performance in predicting depressed users on Twitter. Figure 2(a) depicts the process of big-data analytics on the Twitter network for real-time depression detection.
(a) Embedding Layer

An embedding layer is an example of a hidden layer in a neural network. Simply put, this layer converts input data from a high-dimensional to a lower-dimensional space, enabling the network to better understand the relationships between inputs and process the data. This layer of the model transforms each index associated with a particular word in the input array into a dense vector. The embedding layer's primary responsibility is to generate the following input embedding matrix for each word chosen from the training set:

EL(WoR) ∈ ℝ^{VS_v × VD_D} (14)

where EL(WoR) indicates the embedding matrix of the word WoR, ℝ denotes the real number system, and VS_v and VD_D refer to the vocabulary size and the dimension of the word embedding vector, respectively.
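A minimal sketch of what the embedding layer does at lookup time, assuming a randomly initialised table of shape (VS_v, VD_D) that training would later refine; all names are illustrative:

```python
import random

def build_embedding_table(vocab, dim, rng=random.Random(0)):
    """One dense vector of length VD_D per vocabulary word (VS_v rows in total)."""
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for w in vocab}

def embed(tokens, table):
    return [table[t] for t in tokens]    # each word index maps to its dense vector

table = build_embedding_table(["sad", "happy", "day"], dim=4)
vectors = embed(["sad", "sad", "day"], table)
```

The same word always maps to the same vector, and the sequence of vectors is what the convolutional layer consumes next.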

(b) Convolutional Layer

After the word embedding matrix is obtained, it is fed into the ResNet50 model's convolutional layer. This layer forms the basic building block of the entire neural network. The convolution layer's primary function is to extract spatial characteristics from the word embedding matrix. It performs the initial convolution with a 7 × 7 kernel size.

(c) Max Pooling Layer

The max pooling layer comes next. A maximum-pooling operation finds the highest value in each patch of each feature map. This layer helps the model retain only the needed information, and it also aids in reducing the overfitting issues that arise in neural networks. It performs the pooling operation with a kernel size of 3 × 3.
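The pooling operation reduces each patch of the feature map to its maximum value; a plain-Python sketch (shown with a 2 × 2 kernel for readability, while the layer above uses 3 × 3):

```python
def max_pool2d(fmap, k=2, stride=2):
    """Slide a k x k window over a 2-D feature map, keeping each patch's maximum."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[r + i][c + j] for i in range(k) for j in range(k))
         for c in range(0, w - k + 1, stride)]
        for r in range(0, h - k + 1, stride)
    ]

pooled = max_pool2d([[1, 2, 3, 4],
                     [5, 6, 7, 8],
                     [9, 10, 11, 12],
                     [13, 14, 15, 16]])
# each 2 x 2 patch collapses to its maximum: [[6, 8], [14, 16]]
```

The output is a quarter of the input size but keeps the strongest activation from each region, which is what makes pooling both a compression and a regularisation step.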
(d) LSTM Layer

LSTM is a kind of recurrent neural network (RNN) which can take features from a previous layer and use them to detect depression. The LSTM layer's four gating components maintain the cell state: an input gate, a forget gate, an output gate, and a cell-update (candidate) gate. The input gate determines which input should be added to the cell state, while the forget gate chooses which parts of the previous cell state to forget. LSTM is much less susceptible to the vanishing gradient problem and is very efficient at modelling complex sequential data. Each layer comprises the weights and biases between the input and the hidden layer and between the hidden layer and the output layer. These weights and biases are randomly assigned at the start of the training process. However, such random weight and bias values weaken the system and require more computational power. So, the proposed system uses SCCOA to optimally select the weights and biases of the conventional LSTM model. This optimal selection of the hyperparameters in LSTM is termed OLSTM. These are explained in the following steps. The structure of OLSTM is shown in Figure 2(b).
Step 1: Hyperparameter tuning using SCCOA

The COA is a bio-inspired optimization algorithm that models the social structure and environmental adaptation of coyotes to tackle continuous optimization problems. However, the standard COA has a slow convergence rate, low solution accuracy, and a tendency to settle on locally optimal solutions. So here, a sine chaotic map is utilized to initialize the population of individuals in the search space, which boosts the population diversity of the algorithm. Additionally, the positions of the individuals are controlled using a constriction factor to prevent the algorithm from converging prematurely to a local optimum. These two improvements to the conventional COA are termed SCCOA. The SCCOA initializes the positions R^{e,ξ}_a of the coyotes (the a-th coyote in the e-th pack at the ξ-th iteration) using the sine chaotic map, with the control parameters φ and β set to 10 and 0.5, respectively. Although the basic sine map has drawbacks such as a small parameter range and an uneven sequence distribution, it provides a more diverse initial population than uniform random sampling. Then, the fitness of the individuals in the population is computed using equation (16):

FiTN^{e,ξ}_a = f_f(R^{e,ξ}_a) (16)
where f_f indicates the fitness evaluation function.
Then, the probability of a coyote leaving or being evicted from a pack is formulated as follows:

P^{e,ξ}_ev = 0.005 × K_a^{ρ_pr} (17)

where K_a indicates the number of coyotes per pack, with K_a ≤ √200, and ρ_pr is greater than 1 (ρ_pr = 2 in the standard COA, which with the bound on K_a keeps the probability below 1). The best coyote at each iteration is considered the alpha coyote α, computed using equation (18):

α^{e,ξ} = R^{e,ξ}_a for a = arg min_a FiTN^{e,ξ}_a (18)

After that, the cultural tendency CT^{e,ξ}_b of the coyotes is estimated using their median social conditions, which is mathematically expressed using equation (19).
CT^{e,ξ}_b = M^{e,ξ}_{((K_a+1)/2), b} if K_a is odd; otherwise (M^{e,ξ}_{(K_a/2), b} + M^{e,ξ}_{((K_a/2)+1), b}) / 2 (19)

where M^{e,ξ} refers to the coyotes' ranked social conditions. Next, the birth of a new coyote (pup) is estimated dimension-wise as:

pup^{e,ξ}_b = R^{e,ξ}_{δ1,b} if RaN_b < SP_s; R^{e,ξ}_{δ2,b} if RaN_b ≥ SP_s + AP_p; D_b otherwise (20), (21)

where SP_s and AP_p refer to the scatter and association probabilities, δ1 and δ2 represent two arbitrary coyotes of the e-th pack, and D_b and RaN denote random variables in the range [0, 1]. Each a-th coyote in the e-th pack then updates its social condition at each iteration using equation (22):

R^{e,ξ+1}_a = R^{e,ξ}_a + f_cf × (τ1 + τ2) (22)
where τ1 and τ2 refer to the alpha influence and the pack influence, respectively, and FiTN^{e,ξ}_a represents the fitness of the coyote. Also, in equation (22), the term f_cf refers to the constriction factor, which avoids the premature convergence problem and increases the global search ability of the algorithm; it decays as the iteration count approaches the total number of iterations T_IR. Finally, the best coyote is selected based on its social condition obtained at the end of the iterations.
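The exact SCCOA update is involved, but its two added ingredients, chaotic initialization and a constriction factor decaying over T_IR iterations, can be illustrated with a deliberately simplified optimizer. This is a sketch of the idea, not the authors' algorithm: the particular sine-map form and the greedy move toward the best individual are assumptions made for brevity.

```python
import math
import random

def sine_chaotic_init(n, dim, rng=random.Random(1)):
    """Chaotic initial population: iterate a sine map instead of sampling uniformly."""
    population = []
    for _ in range(n):
        x, coyote = rng.random(), []
        for _ in range(dim):
            x = abs(math.sin(math.pi * x))  # one common sine-map form; stays in [0, 1]
            coyote.append(x)
        population.append(coyote)
    return population

def optimise(f, n=15, dim=2, iters=100, rng=random.Random(1)):
    """Minimise f over [0, 1]^dim with chaotic init and a decaying constriction factor."""
    pop = sine_chaotic_init(n, dim, rng)
    best = min(pop, key=f)
    for t in range(iters):
        f_cf = 1.0 - t / iters              # constriction factor decaying over T_IR
        for i, coyote in enumerate(pop):
            candidate = [x + f_cf * rng.random() * (b - x)
                         for x, b in zip(coyote, best)]
            if f(candidate) < f(coyote):    # greedy acceptance (simplification)
                pop[i] = candidate
        best = min(pop, key=f)
    return best
```

Minimising a toy loss such as `sum((x - 0.5) ** 2 for x in p)` returns a point near (0.5, 0.5); in the paper's setting the "position" would encode the LSTM weights and biases and f_f the prediction loss.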
Step 2: Once the optimal hyperparameters are selected, the following mathematical computations are performed in the LSTM to obtain the network's output.
The gate computations of the LSTM follow the standard formulation:
G_f = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) (forget gate)
G_i = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) (input gate)
\tilde{c}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) (candidate state)
c_t = G_f \odot c_{t-1} + G_i \odot \tilde{c}_t (cell state)
G_o = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) (output gate)
h_t = G_o \odot \tanh(c_t) (hidden state)
(a) SoftMax layer: Finally, the SoftMax layer receives the output from the LSTM layer and classifies the input data as belonging to depressed or non-depressed users.
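The SoftMax layer maps the LSTM's final output logits to class probabilities. A minimal, numerically stable sketch (the two-logit example and the label ordering are illustrative assumptions, not the paper's actual outputs):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy two-class head: index 0 = depressed, index 1 = non-depressed (assumed order)
logits = np.array([[2.0, 0.5]])
probs = softmax(logits)
pred = probs.argmax(axis=-1)
```

Subtracting the row maximum before exponentiating leaves the probabilities unchanged but prevents overflow for large logits.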

Results and discussion
Here, the performance of the proposed DD of SM big data using RNT-OLSTM is analyzed by comparing its results with those of conventional schemes. First, the datasets used to analyze the proposed work's efficiency are described; then the outcomes of the proposed and existing schemes are compared and discussed with respect to several performance metrics. The system was implemented in Python 3.7 on a machine configured with an Intel(R) Core i7 CPU at 3.0 GHz, 64 GB of memory, and the Windows 10 OS.

Dataset description
Twitter is a well-known social network that offers open and straightforward data access. The experimentation is done using two datasets: clean_d_tweets.csv, dataset 1 (DS1), and CLPsych2015, dataset 2 (DS2). DS1 contains depressed and non-depressed class labels for Twitter data from users, can be obtained via https://www.kaggle.com/datasets/hyunkic/twitter-depression-dataset, and contains more than 4000 tweets; in this dataset, d_tweets represents depressed tweets and non_d_tweets represents non-depressed tweets. Next, DS2 is obtained via https://github.com/clpsych/shared_task and contains 2000 tweets from users for detecting their depression levels.
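Before training, it is useful to quantify the class imbalance that motivates the paper's cluster-based oversampling step. The sketch below uses stand-in records mimicking DS1's layout; the field names and texts are illustrative assumptions, not the dataset's actual schema.

```python
from collections import Counter

# Stand-in labeled tweets mimicking DS1; labels follow the d_tweets /
# non_d_tweets naming described for the dataset, texts are invented.
tweets = [
    {"text": "feeling hopeless again",   "label": "d_tweets"},
    {"text": "great run this morning",   "label": "non_d_tweets"},
    {"text": "nothing matters anymore",  "label": "d_tweets"},
    {"text": "coffee with friends",      "label": "non_d_tweets"},
    {"text": "excited for the weekend",  "label": "non_d_tweets"},
]

# Count tweets per class and compute the majority/minority ratio
counts = Counter(t["label"] for t in tweets)
imbalance_ratio = counts["non_d_tweets"] / counts["d_tweets"]
```

A ratio well above 1 indicates the non-depressed (majority) class dominates, which is exactly the bias the SDFCM-SMOTE stage is designed to counteract.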

Performance evaluation of the techniques for clustering
In this section, the proposed SDFCM is compared against the existing FCM, Mean Shift (MS) clustering, Low-Energy Adaptive Clustering Hierarchy (LEACH), and Density-Based Clustering (DBC) with respect to processing time, as shown in Table 1. The table demonstrates the efficacy of the proposed work in terms of processing time, an important metric because it indicates how long the clustering process takes to complete; a system that clusters the data in minimal time is regarded as a good one. The existing FCM, MS, LEACH, and DBC take 1.05, 1.91, 2.38, and 3.14 min, respectively, whereas the proposed system takes just 0.82 min. In the basic FCM algorithm, initial centroids are selected randomly from the input data, so the resulting clusters vary between runs, which increases the number of iterations and the total processing time on each run over the same data. In the proposed SDFCM algorithm, the initial centroids are calculated using the standard deviation (SD), so the number of iterations remains constant and the processing time decreases. This is why the proposed SDFCM clustering algorithm is more efficient than the basic FCM, MS, LEACH, and DBC algorithms.
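The deterministic seeding idea behind SDFCM can be sketched as below. The paper does not spell out how the SD yields the centroids, so this is a plausible illustration: initial centroids are spaced along mean + k·sd with k spread over [−1, 1], which makes every run on the same data start from identical centroids.

```python
import numpy as np

def sd_initial_centroids(data, n_clusters):
    """Deterministic centroid seeding from the data's mean and SD.

    Centroids are placed at mean + k * sd for k evenly spaced in [-1, 1],
    so repeated runs on the same data start identically and the FCM
    iteration count stays constant (unlike random seeding).
    """
    mean = data.mean(axis=0)
    sd = data.std(axis=0)
    offsets = np.linspace(-1.0, 1.0, n_clusters)
    return np.array([mean + k * sd for k in offsets])

data = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
centroids = sd_initial_centroids(data, n_clusters=3)
```

These centroids would then seed the standard FCM membership/centroid update loop.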

Performance evaluation of the techniques for word embedding
This section compares the results of the proposed hybrid TF-IDF and Word2Vec embedding with existing word-embedding schemes, namely TF-IDF, Word2Vec, GloVe, and fastText, in terms of accuracy. Figure 3(a) shows the results achieved for DS1 and DS2. The proposed word-embedding approach attains a better accuracy rate than the existing methods. For example, fastText has accuracies of 88.32% and 88.04% for DS1 and DS2, and GloVe has accuracies of 90.96% and 90.45%. The existing TF-IDF and Word2Vec approaches, each used to embed the words alone, obtain accuracies of 95.14% and 93.47% for DS1 and 94.86% and 93.02% for DS2, but the proposed hybrid of TF-IDF and Word2Vec yields the highest accuracies of 98.78% and 98.12% for DS1 and DS2, respectively. TF-IDF represents words well by determining their weights but does not capture semantics, whereas Word2Vec has the opposite characteristics: it handles semantics well. Observing the advantages and disadvantages of the two models, it is reasonable to presume that they complement each other, and combining them increases the classification accuracy. Hence, the hybrid embedding scheme represents the terms' vector representations better than the existing schemes and leads to higher accuracy.
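One common way to combine the two models is to build each document vector as a TF-IDF-weighted average of its words' Word2Vec vectors; the paper does not detail its fusion rule, so the sketch below is an assumption, using toy two-dimensional word vectors in place of trained Word2Vec output.

```python
import math
from collections import Counter

# Toy pretrained word vectors (stand-ins for Word2Vec output)
vectors = {
    "sad":   [0.9, 0.1],
    "tired": [0.8, 0.2],
    "happy": [0.1, 0.9],
}

docs = [["sad", "tired", "sad"], ["happy", "happy", "tired"]]

def tfidf_weighted_doc_vector(doc, docs, vectors):
    """TF-IDF-weighted average of word vectors for one document."""
    n_docs = len(docs)
    tf = Counter(doc)
    vec = [0.0] * len(next(iter(vectors.values())))
    total_w = 0.0
    for word, count in tf.items():
        if word not in vectors:
            continue
        df = sum(1 for d in docs if word in d)
        idf = math.log(n_docs / df) + 1.0  # a smoothed IDF variant
        w = (count / len(doc)) * idf       # tf * idf weight
        total_w += w
        for i, v in enumerate(vectors[word]):
            vec[i] += w * v
    return [v / total_w for v in vec] if total_w else vec

doc_vec = tfidf_weighted_doc_vector(docs[0], docs, vectors)
```

The TF-IDF weights emphasize discriminative terms while the word vectors contribute the semantics, which is the complementarity the paragraph above describes.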

Performance evaluation of the techniques for depression detection
This section investigates the outcomes of the proposed RNT-OLSTM against the conventional LSTM, RF, Deep CNN (DCNN), and SVM for the two datasets, DS1 and DS2. The evaluation includes accuracy, f-measure, recall, precision, specificity, area under the curve (AUC), and error rate. Table 2 and Figure 3(b) demonstrate the efficiency of the proposed and existing methods for DS1. Among these, the existing SVM performs worst: it attains 91.48% accuracy, 90.81% precision, 91.59% recall, 91.18% f-measure, 85.57% specificity, an 85.42% AUC score, and an error rate of 21.98, which are low outcomes compared with all the other methods, whereas the proposed scheme achieves superior outcomes to all the existing methods. The reason is that, although the existing LSTM, DCNN, RF, and SVM are well-known algorithms that produce satisfactory classification results, they all rely on manually extracted features from the embedded words for classification, whereas the proposed system uses the RNT method to extract features efficiently, which leads to higher performance than the existing methods. Also, the existing classification methods randomly choose hyperparameters such as the weights and biases for the corresponding inputs; this random selection directly degrades the classification performance and increases the error rate. The proposed scheme addresses these issues with the SCCOA approach, which selects the hyperparameters optimally, making the proposed system faster and enabling high-level outcomes. For example, the proposed RNT-OLSTM's accuracy improved by 1.98%, 4.29%, 5.97%, and 8.07% over the existing LSTM, DCNN, RF, and SVM methods, respectively. The proposed scheme also attains 99.66% precision, 99.89% recall, 99.68% f-measure, 98.24% specificity, 96.25% AUC, and an error rate of 1.22, which are likewise better than the existing methods. From
Table 2, it is concluded that the proposed scheme achieves better outcomes than the conventional methods. Next, the outcomes of the detection models for DS2 are discussed, with their numerical results tabulated in Table 3, which illustrates the efficiency of the proposed approach against the existing methods on DS2. The proposed RNT-OLSTM attains an accuracy of 99.23%, higher than the existing methods, while the existing SVM attains the lowest results on all metrics: 91.32% precision, 91.52% recall, 91.46% f-measure, 84.18% specificity, 83.78% AUC, and an error rate of 18.41. The proposed method achieves 99.38% precision, 99.58% recall, 99.45% f-measure, 97.87% specificity, 95.78% AUC, and an error rate of 1.68, which is better than LSTM, DCNN, RF, and SVM. These results prove the effectiveness of the proposed model over the existing methodologies. One reason is that the proposed scheme uses the SDFCM-SMOTE approach to address the data imbalance problem, which leads to higher predictive accuracy for DD. When the data are imbalanced between majority and minority classes, classifiers learn a biased model that yields poorer predictive accuracy for the minority class than for the majority class; the existing methods do not address this imbalance problem and therefore attain lower outcomes than the proposed scheme. In addition, tuning the hyperparameters of the DL model is essential, because random selection increases the classifier's training time and error rate. The proposed work therefore uses SCCOA for the optimal selection of the LSTM hyperparameters, which enables better detection results with a minimal error rate. The entire set of experimental findings therefore demonstrates that the proposed strategy works better than current approaches.
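All of the metrics reported in Tables 2 and 3 derive from the four confusion-matrix counts; a minimal sketch (the counts below are hypothetical, purely for illustration, not the paper's data):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics reported in Tables 2-3 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity / detection rate
    specificity = tn / (tn + fp)         # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    error_rate = 1.0 - accuracy
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "specificity": specificity,
            "error_rate": error_rate}

# Hypothetical counts for a depressed/non-depressed classifier
m = classification_metrics(tp=95, fp=5, tn=90, fn=10)
```

Note that with imbalanced classes a high accuracy can coexist with poor minority-class recall, which is why specificity and f-measure are reported alongside accuracy.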

Figure 1. Structural design of the proposed system.

Figure 2. (a) Process of big-data analytics on the Twitter network for real-time depression detection. (b) Structure of the OLSTM.

Figure 3. (a) Accuracy analysis of the proposed system. (b) F-measure and run-time comparison with the proposed work.

Table 2. Performance evaluation of the models for DS1.

Table 3. Performance evaluation of the models for DS2.