Robust Rumor Detection based on Multi-Defense Model Ensemble

ABSTRACT The development of adversarial technology, represented by adversarial text, has brought new challenges to rumor detection based on deep learning. To improve the robustness of rumor detection models under adversarial conditions, we propose a robust detection method based on an ensemble of multiple defense models, building on several mainstream defense methods: data augmentation, random smoothing, and adversarial training. First, multiple robust detection models are trained based on different defense principles; then, two different ensemble strategies are used to integrate these models, and the detection effect under each ensemble strategy is studied. Test results on the open-source dataset Twitter15 show that the proposed method compensates for the shortcomings of a single model by ensembling different decision boundaries, effectively defends against mainstream adversarial text attacks, and improves the robustness of rumor detection models compared to existing defense methods.


Introduction
Rumor detection is a sub-task of text classification in natural language processing that judges the authenticity of a message from its input text and other characteristics (Shu et al. 2017). With the development of deep learning and its widespread application in natural language processing, rumor detection models based on deep learning have greatly improved detection accuracy and become the mainstream approach. Methods of this type mainly treat rumor detection as a text classification task and apply deep neural network models to build a high-level representation of the input text and classify it (Gao, Liang, and Jiang et al. 2020). However, as rumor detection technology is widely deployed in the real world, malicious actors exploit the fragility of deep neural networks to deceive rumor detection models and evade supervision, which brings new challenges to rumor detection technology.
Adversarial text is the current mainstream adversarial method; it deceives target models by adding perturbations to characters, words, or sentences. Related studies (Cheng et al. 2020a; Goodfellow, Shlens, and Szegedy 2015; Ling, Ji, and Zou et al. 2019) show that natural language processing models based on deep neural networks are highly vulnerable to maliciously generated adversarial text. As (Zhou, Guan, and Bhat et al. 2019) noted, tampering with words or characters in news text may mislead the detector into classifying rumors as real news. To keep attacks from being noticed by humans, attackers typically use synonyms to generate adversarial text (Jin et al. 2020; Li, Ji, and Du et al. 2019; Ren, Deng, and He et al. 2019), which circumvents defenses used in detection, such as automatic spelling and grammar checking, while preserving the original semantic information well.
To address this synonym substitution attack problem, existing research has proposed defense methods based on data augmentation (Si, Zhang, and Qi et al. 2020; Wang and Bansal 2018) and adversarial training (Madry, Makelov, and Schmidt et al. 2018; Miyato, Dai, and Goodfellow 2017; Zhu, Cheng, and Gan et al. 2019) to enhance the robustness of English text classification models. These methods can also be transferred to rumor detection models. The key idea of the former approach is to add adversarial text generated by manually defined rules to the training set to assist classifier training, but it targets only specific types of attacks and is difficult to extend to multiple attack types, whereas real-world attacks on text input are ever-changing and the search space of adversarial text grows exponentially. The adversarial training-based approach borrows the min-max optimization idea from the image field: it improves the regularization ability of the model by adding norm-bounded perturbations to the word embeddings, which expands the decision boundary and enhances the robustness of the model. Other studies (Gupta et al. 2022) use machine translation, translating the input text from the source language to a target language and back into the source language before feeding it to the classifier, but this method incurs a large semantic loss.
To improve the robustness of rumor detection models under adversarial conditions, we study the effectiveness of current mainstream adversarial text defense methods and propose a defense method based on model ensemble. By setting a reasonable ensemble strategy, the method compensates for the decision failures of a single robust model on adversarial text and further raises the success rate of rumor detection models in handling adversarial text. Specifically, our approach has four main points of innovation: (1) Following the idea of data augmentation and current mainstream adversarial text generation methods, synonyms are selected from two large synonym knowledge bases, HowNet and WordNet, to replace the important words in the original text. The training set is expanded with the adversarial text generated in this way to train robust rumor detection model 1; (2) The idea of random smoothing from image processing is extended to the discrete, structured space of text, and robust rumor detection model 2 is trained through random perturbation; (3) The standard adversarial training method PGD is used to train robust rumor detection model 3; (4) In the detection stage, model ensemble is adopted to integrate the results of rumor detection models 1, 2 and 3 to further improve the adversarial defense effectiveness of the model.

Rumor Detection Classifier
Rumor detection is essentially a text classification problem whose input is a sequence of words and whose output is a single label. Current rumor detection models based on deep learning have achieved good results on open-source rumor detection datasets, indicating that rumor detection based on text features extracted by deep neural networks is effective (Gao, Liang, and Jiang et al. 2020).
In general, convolutional neural networks (CNN) can be used to extract text semantic features (Yuan, Ma, and Zhou et al. 2019), recurrent neural networks (RNN) and their variants such as GRU and LSTM are used to model the dependencies between text and forwarding sequences (Ma et al. 2016; Ruchansky, Seo, and Liu 2017; Shu, Cui, and Wang et al. 2019), and some studies (Shu, Cui, and Wang et al. 2019; Yuan, Ma, and Zhou et al. 2019) have also introduced attention networks to further extract deep features of message text. Most existing rumor detection classifiers integrate neural networks with different structures to model and classify rumors and their propagation processes end-to-end.

Adversarial Text Generation
Traditional methods of adversarial text generation in the NLP domain include character-level methods (Eger, Şahin, and Rücklé et al. 2019; He, Lyu, and Xu et al. 2021), such as substituting similar letters or adding symbols between characters, and word-level synonym substitution (Jin et al. 2020; Li, Ji, and Du et al. 2019; Ren, Deng, and He et al. 2019). However, the former can be easily corrected by spell checking (Hládek, Staš, and Pleva 2020), while most of the adversarial text faced in rumor detection tasks is handmade by netizens, who prefer a word variation strategy, that is, modifying certain important words without affecting semantics. (Ren, Deng, and He et al. 2019) use WordNet as a thesaurus to generate adversarial text by substitution, and (Zang, Qi, and Yang et al. 2019) further introduce the HowNet knowledge base to expand the search scope of synonyms, which not only increases the sample size but also keeps the generated adversarial text closer to the original semantics.

Defense Method Against Synonym Substitution Type Attacks
(Wang and Yang 2015) first attempted to add perturbed text to training to improve the robustness of the model, but because adversarial text generation was inefficient, the robustness gain was limited. Methods based on random smoothing (Ye, Gong, and Liu 2020) train on input text by constructing random perturbation sets and use the statistical properties of these sets to certify robustness. (Madry, Makelov, and Schmidt et al. 2018; Zhu, Cheng, and Gan et al. 2019) propose adversarial training methods such as PGD and FreeLB based on the min-max optimization formulation. (Wang, Tang, and Lou et al. 2021) propose wordDP, a privacy framework based on the exponential mechanism, which applies differential privacy to certify robustness against synonym substitution attacks in text classification, ensuring that small changes in input do not lead to sharp changes in output. (Li, Song, and Zeng et al. 2022) propose a rebuild-ensemble framework that reconstructs text using the mask-fill capability of pre-trained models and uses these reconstructed texts, which carry weaker adversarial effects, for prediction to obtain better robustness.

Method Based on Multi-Defense Model Ensemble
As shown in Figure 1, without a defense strategy the adversarial text successfully deceives the original detection model $f_{Original}$, misleading it into giving the erroneous result "non-rumor". To improve robustness, we first apply three defense methods to the detection model $f_{Original}$ and obtain a rumor detection model based on data augmentation $f_{Data}$, a rumor detection model based on random smoothing $f_{RS}$, and a rumor detection model based on adversarial training $f_{PGD}$.
In the detection process, the above three robust rumor detection models are used to detect the input text, and then the detection results of the three models are integrated to further improve the effectiveness of adversarial defense.
The detection model and ensemble strategy under the three single defense strategies are introduced below.

Data Augmentation Based Robust Detection Model
In real-world adversarial text scenarios, attackers often locate important words through changes in confidence scores because they cannot obtain gradient information about the classifier. This paper extends the training set by mimicking the attacker's method of generating adversarial text, and improves the robustness of the model through data augmentation. Given an input sentence $x = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ represents the $i$-th word, we use the following scoring function to determine the importance of the $j$-th word in $x$:

$$C_{x_j} = f_y(x) - f_y(x \setminus x_j) \quad (1)$$

In formula (1), $f_y$ is the classifier model (its confidence for label $y$), and $C_{x_j}$ is the difference in confidence between the classification results before and after the word $x_j$ is deleted from the original text. Using this method, the confidence difference of each word is calculated in turn and the words are sorted by the size of the difference. The larger the confidence difference, the higher the importance score.
The words are then replaced, in order of importance, by synonyms from WordNet and HowNet until the prediction of the classifier changes, which yields two adversarial texts, one per knowledge base. WordNet is a large, hand-organized semantic dictionary in which synonyms are grouped into synsets. HowNet is a knowledge base of sememes; in general, words that share the same sememes carry the same meaning and can be substituted for each other. To ensure that the semantics do not change after substitution, we require that the average revision rate of sentences does not exceed 20%.
In this way, adversarial text data twice the size of the original training set can be generated and added directly to the original training set. Training the rumor detection model on this expanded set yields a robust model $f_{Data}$ based on data augmentation.
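The scoring and substitution steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_classifier`, the `synonyms` table, and the function names are hypothetical stand-ins for a trained rumor detector and for WordNet or HowNet lookups, and the 20% cap is applied per sentence for simplicity.

```python
from typing import Callable, Dict, List, Sequence

Classifier = Callable[[Sequence[str]], List[float]]

def toy_classifier(words: Sequence[str]) -> List[float]:
    """Hypothetical black-box detector: returns [P(non-rumor), P(rumor)]."""
    triggers = {"shocking", "secret", "hoax"}
    hits = sum(w in triggers for w in words)
    p_rumor = min(0.95, 0.25 + 0.3 * hits)
    return [1.0 - p_rumor, p_rumor]

def importance_scores(words: Sequence[str], label: int, clf: Classifier) -> List[float]:
    """Confidence drop for `label` when each word is deleted in turn."""
    base = clf(words)[label]
    return [base - clf(list(words[:j]) + list(words[j + 1:]))[label]
            for j in range(len(words))]

def substitute(words: Sequence[str], label: int, clf: Classifier,
               synonyms: Dict[str, List[str]], max_rate: float = 0.2) -> List[str]:
    """Greedily replace the most important words until the prediction changes."""
    scores = importance_scores(words, label, clf)
    order = sorted(range(len(words)), key=lambda j: scores[j], reverse=True)
    budget = int(max_rate * len(words))  # revision-rate cap (assumed per sentence)
    out, changed = list(words), 0
    for j in order:
        if changed >= budget:
            break
        for cand in synonyms.get(words[j], []):
            trial = out[:j] + [cand] + out[j + 1:]
            if clf(trial)[label] < clf(out)[label]:  # keep only helpful swaps
                out, changed = trial, changed + 1
                break
        probs = clf(out)
        if probs.index(max(probs)) != label:  # prediction flipped: stop
            break
    return out

sentence = "shocking secret report shows the city plans a new park today".split()
synonyms = {"shocking": ["surprising"], "secret": ["private"]}  # toy thesaurus
adversarial = substitute(sentence, label=1, clf=toy_classifier, synonyms=synonyms)
```

Running the same routine once with a WordNet-derived table and once with a HowNet-derived table would give the two adversarial texts per training example that the section describes.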

Random Smoothing Based Robust Detection Model
The smoothing model in SAFER (Ye, Gong, and Liu 2020) is defined as follows:

$$f_{RS}(x) = \arg\max_{c} \; \mathbb{P}_{z \sim Q_x}\big(f(z) = c\big) \quad (2)$$

As shown in Equation (2), where $f$ is the base model and $Q_x$ is the perturbation distribution of $x$, the robust detection model $f_{RS}$ obtained by random smoothing should satisfy the following: when a sentence $x$ in the original input text is randomly perturbed into $z$, the model still predicts that $z$ belongs to the original category $c$.
Unlike the data augmentation-based defense method in Section 3.1, which directly transforms the input text, the random smoothing-based defense method used in this section transforms the word embedding representation of the input text at the embedding layer of the model so that the model learns more adversarial forms. Therefore, instead of substituting words in the original text based on a dictionary or knowledge base as in Section 3.1, this section constructs a perturbation set $P_x$ in the context-aware word vector space, that is, the embedding of each original word in the text is replaced by one of its K nearest neighbor embeddings. Drawing on existing research (Ye, Gong, and Liu 2020), we use the GloVe model for word embedding and set K to 10.
For a sentence $x = \{x_1, x_2, \ldots, x_n\}$ in the input text, its perturbation distribution $Q_x$ is defined by randomly perturbing each word $x_i$ into a word in its perturbation set $P_{x_i}$ with equal probability:

$$Q_x(z) = \prod_{i=1}^{n} \frac{\mathbb{I}\{z_i \in P_{x_i}\}}{|P_{x_i}|}$$

where $z = \{z_1, z_2, \ldots, z_n\}$ is the sentence after the perturbation, $|P_{x_i}|$ is the size of the perturbation set of the word $x_i$, and $\mathbb{I}$ is the indicator function.
The smoothed representation of the original text embedding obtained in this way is fed into the rumor detection model for training, which expands the data distribution the model is exposed to and yields a more robust detection model $f_{RS}$.
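The perturbation distribution and the smoothed prediction can be illustrated with a small sketch. It is a simplification under stated assumptions: the perturbation sets are hand-written rather than built from the K = 10 nearest GloVe neighbors, `toy_base_predict` is a hypothetical base model, and smoothing is shown at prediction time by Monte Carlo voting instead of inside the embedding layer during training.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence

def sample_perturbation(words: Sequence[str], sets: Dict[str, List[str]],
                        rng: random.Random) -> List[str]:
    """Draw z ~ Q_x: each word is replaced uniformly from its perturbation set."""
    return [rng.choice(sets.get(w, [w])) for w in words]

def perturbation_prob(z: Sequence[str], words: Sequence[str],
                      sets: Dict[str, List[str]]) -> float:
    """Q_x(z): product over positions of 1/|P_{x_i}| if z_i is in P_{x_i}, else 0."""
    p = 1.0
    for zi, xi in zip(z, words):
        s = sets.get(xi, [xi])
        p *= (1.0 / len(s)) if zi in s else 0.0
    return p

def smoothed_predict(words: Sequence[str], base_predict: Callable[[Sequence[str]], int],
                     sets: Dict[str, List[str]], n_samples: int = 200, seed: int = 0) -> int:
    """Monte Carlo estimate of the most likely label under random perturbation."""
    rng = random.Random(seed)
    votes = Counter(base_predict(sample_perturbation(words, sets, rng))
                    for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Hypothetical perturbation sets and a base model that one synonym can fool.
neighbours = ["rocket", "missile", "projectile", "bullet"]
sets = {"rocket": neighbours, "bullet": neighbours, "launch": ["launch", "liftoff"]}

def toy_base_predict(words: Sequence[str]) -> int:
    return 0 if "bullet" in words else 1

clean = ["north", "korea", "rocket", "launch"]
attacked = ["north", "korea", "bullet", "launch"]
q = perturbation_prob(["north", "korea", "missile", "liftoff"], clean, sets)
label_clean = smoothed_predict(clean, toy_base_predict, sets)
label_attacked = smoothed_predict(attacked, toy_base_predict, sets)
```

In this toy setting the base model is fooled by the substituted word, while the smoothed vote over perturbed copies is dominated by the unaffected neighbors.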

Adversarial Training Based Robust Detection Model
Sections 3.1 and 3.2 augment the data available for training by transforming the raw text and the text embedding layer, respectively, to improve the robustness of the model. This section adopts PGD (Madry, Makelov, and Schmidt et al. 2018), a defense method based on adversarial training.
The principle of PGD is to minimize the loss over the model parameters under worst-case perturbations within a bounded range, as shown in the formula:

$$\min_{\theta} \; \mathbb{E}_{(X, y) \sim D} \Big[ \max_{\|\delta\| \le \varepsilon} L\big(f_\theta(X + \delta), y\big) \Big]$$

where $D$ is the distribution of inputs, $X$ is the embedded representation of the input sentence, $y$ is the classification label, and $L$ is the loss function of the classifier, whose parameters are denoted by $\theta$. To solve the inner maximization problem, PGD employs a projected gradient algorithm:

$$\delta_{t+1} = \Pi_{\|\delta\| \le \varepsilon} \Big( \delta_t + \alpha \, \frac{g(\delta_t)}{\|g(\delta_t)\|} \Big)$$

where $g(\delta_t) = \nabla_\delta L(f_\theta(X + \delta_t), y)$ is the gradient of the loss function $L$ with respect to the perturbation $\delta$, $\Pi_{\|\delta\| \le \varepsilon}$ represents the projection onto the $\varepsilon$-norm ball, and each iteration $t$ takes an ascent step of size $\alpha$ toward the "worst-case" perturbation $\delta$.
During training of the rumor detection model, the PGD algorithm iterates K times on $\delta$, and the model parameters are then updated using the gradient computed at the final perturbation, which yields a robust detection model $f_{PGD}$ after adversarial training.
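The inner ascent loop and the subsequent parameter update can be sketched on a toy model. Everything below is an assumption made for illustration: a linear classifier with logistic loss replaces the neural detector so the gradients can be written by hand, the perturbation is bounded in the L2 norm, and the values of `eps`, `alpha`, `steps`, and `lr` are arbitrary.

```python
import math
from typing import List, Tuple

def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))

def loss(w: List[float], x: List[float], y: int) -> float:
    """Logistic loss for a linear model, with label y in {-1, +1}."""
    return math.log1p(math.exp(-y * dot(w, x)))

def grad_wrt_input(w: List[float], x: List[float], y: int) -> List[float]:
    """Gradient of the loss with respect to the (perturbed) embedding x."""
    s = 1.0 / (1.0 + math.exp(y * dot(w, x)))
    return [-y * s * wi for wi in w]

def project(delta: List[float], eps: float) -> List[float]:
    """Project delta back onto the L2 ball of radius eps."""
    n = norm(delta)
    return delta if n <= eps else [eps * d / n for d in delta]

def pgd_perturbation(w: List[float], x: List[float], y: int,
                     eps: float = 0.5, alpha: float = 0.2, steps: int = 3) -> List[float]:
    """K normalized gradient-ascent steps on delta, each followed by projection."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        g = grad_wrt_input(w, [xi + di for xi, di in zip(x, delta)], y)
        gn = norm(g)
        if gn == 0.0:
            break
        delta = project([d + alpha * gi / gn for d, gi in zip(delta, g)], eps)
    return delta

def adversarial_step(w: List[float], x: List[float], y: int,
                     lr: float = 0.1) -> Tuple[List[float], List[float]]:
    """Update the parameters using the gradient at the final perturbation."""
    delta = pgd_perturbation(w, x, y)
    x_adv = [xi + di for xi, di in zip(x, delta)]
    s = 1.0 / (1.0 + math.exp(y * dot(w, x_adv)))
    grad_w = [-y * s * xa for xa in x_adv]
    return [wi - lr * gi for wi, gi in zip(w, grad_w)], x_adv

w0, x0, y0 = [1.0, -2.0], [0.5, 0.3], 1
delta = pgd_perturbation(w0, x0, y0)
w1, x_adv = adversarial_step(w0, x0, y0)
```

With a real detector the two hand-written gradients would come from automatic differentiation, and the perturbation would be applied to the word embedding matrix of each sentence.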

Multi-Defense Model Ensemble Strategy
For the three robust models generated by the above methods, this section examines the ensemble effects under two ensemble strategies: logits-summed and majority-vote (Cheng et al. 2020b). Logits refer to the input of the softmax function. The former directly averages the logits generated by each classifier, and the latter counts the detection results of each classifier and uses the voting result as the output of rumor detection.
We define the input as $(x, y)$, where $x$ is the sequence of input words and $y \in [C]$ is the classification label. The logits of the $i$-th classifier are denoted by $f_i(x)$, $i \in \{1, 2, 3\}$, and the label predicted by each classifier is:

$$\hat{y}_i = \arg\max_{c \in [C]} [f_i(x)]_c$$

The output of the integrated classifier with the logits-summed strategy is $f(x) = \sum_{i=1}^{3} f_i(x)$ (the average differs only by a constant factor and gives the same label), and the prediction label is:

$$\hat{y} = \arg\max_{c \in [C]} [f(x)]_c$$

The prediction label of the integrated classifier that uses the majority-vote strategy is:

$$\hat{y} = \arg\max_{c \in [C]} \sum_{i=1}^{3} \mathbb{I}(\hat{y}_i = c)$$

Through the above strategies, the detection results of multiple models can be integrated to enhance the defense ability of rumor detection models in the face of unknown attacks.
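The two ensemble strategies can be compared on a small example. This is an illustrative sketch: the logit values are invented, and the tie-breaking rule in `majority_vote` (lowest class index among the most-voted labels) is an assumption, since the text does not specify one.

```python
from collections import Counter
from typing import List

def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=lambda c: values[c])

def logits_summed(logits_per_model: List[List[float]]) -> int:
    """Sum the logits class by class, then take the arg max."""
    summed = [sum(column) for column in zip(*logits_per_model)]
    return argmax(summed)

def majority_vote(logits_per_model: List[List[float]]) -> int:
    """Each model votes for its own arg max; the most-voted label wins."""
    votes = Counter(argmax(logits) for logits in logits_per_model)
    best = max(votes.values())
    return min(c for c, n in votes.items() if n == best)  # assumed tie-break

# Invented logits over the four Twitter15 classes for f_Data, f_RS, f_PGD.
logits = [
    [0.2, 0.1, 0.0, 0.0],   # f_Data predicts class 0
    [0.3, 0.2, 0.0, 0.0],   # f_RS predicts class 0
    [-3.0, 4.0, 0.0, 0.0],  # f_PGD predicts class 1 with a large margin
]
label_sum = logits_summed(logits)
label_vote = majority_vote(logits)
```

Here a single model with a large logit margin dominates the summed output, while the vote follows the two models that agree, which is one way the two strategies can diverge.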

Datasets and Rumor Detection Models
The dataset is Twitter15, a classic dataset collected from Twitter, the most popular social media site in the United States, with tweets averaging about 15 words in length and four labels: "False Rumor" (FR), "True Rumor" (TR), "Unverified" (UR), and "Non-Rumor" (NR).
The rumor detection models are CSI (Ruchansky, Seo, and Liu 2017), dEFEND (Shu, Cui, and Wang et al. 2019), GLAN (Yuan, Ma, and Zhou et al. 2019), and BERT (Devlin, Chang, and Lee et al. 2019). These four models use CNN, RNN, and attention mechanisms, the current mainstream techniques in natural language processing, to extract deep semantic information for rumor detection, and their detection methods and results are representative.
CSI uses LSTM to extract feature vectors from the time-series text input, combines them with user history information, and applies a fully connected layer to classify the output feature vectors. dEFEND first uses GRU to encode the words and concatenates the encodings to obtain context-aware word representations; because each word contributes differently to the sentence, the weight of each word is learned through the attention mechanism and the sentence representation is computed as the weighted combination. The sentence representations are then fed into a bidirectional GRU to extract contextual temporal features for rumor detection. The global-local attention network (GLAN) detects rumors accurately by combining the text content of rumors with the local semantic information and global structural information of the dissemination process. First, the input sentence is converted into vector form, and a CNN is applied to the word vector matrix to extract features; then, the same method is used to obtain a feature representation of the forwarded text; finally, a multi-head attention mechanism integrates the features of the original text and the forwarded text into a higher-level semantic representation. The BERT-based rumor detection model represents detection methods that use a large-scale pre-trained model.

Attack Methods
To evaluate the performance of the model under different attacks in real-world scenarios, we use two black-box attack methods: TextFooler and PWWS. To keep the attacks from being easily noticed by humans, the average proportion of modified words is limited to no more than 20%.
• TextFooler (Jin et al. 2020): First measure the impact of words on the classification results and sort them, then construct candidate substitution words based on counter-fitting word vectors, and select the words that change the target label as replacements.
• PWWS (Ren, Deng, and He et al. 2019): Sort all words by probability-weighted word saliency scores, and then greedily traverse the candidate substitution words until the label predicted by the model changes. We use two synonym thesauruses, HowNet and WordNet, respectively, to generate candidate replacement words.
Table 1 shows examples of adversarial text generated by the two attack methods. It can be seen that, for the same attack algorithm, the main reason for the difference between the adversarial texts lies in the different sets of external synonyms.

Evaluation Indicators
We use accuracy (Acc), the prevailing metric in rumor detection, and additionally introduce the attack success rate (Suc) and the average number of times the attacker queries the model (Que).
• Acc: The number of texts correctly detected by the model divided by the number of all texts;
• Suc: The number of adversarial texts that successfully fooled the model divided by the number of all adversarial texts;
• Que: A classic metric for evaluating robustness; the higher the average number of times an attacker queries a model, the harder the model is to attack.
According to these definitions, a robust rumor detection model should have high accuracy, a low attack success rate, and a high number of queries.
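The three indicators reduce to simple ratios and averages, as the following sketch shows. The function names and the example values are hypothetical and are not results from the paper.

```python
from typing import List, Sequence

def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Acc: correctly detected texts divided by all texts."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

def attack_success_rate(fooled: Sequence[bool]) -> float:
    """Suc: adversarial texts that fooled the model divided by all adversarial texts."""
    return sum(fooled) / len(fooled)

def average_queries(query_counts: Sequence[int]) -> float:
    """Que: mean number of model queries the attacker needed per text."""
    return sum(query_counts) / len(query_counts)

# Invented values: four texts over the four Twitter15 classes.
labels: List[int] = [0, 1, 2, 3]
preds: List[int] = [0, 1, 1, 3]
fooled: List[bool] = [True, False, False, True]
queries: List[int] = [30, 50, 40]

acc = accuracy(preds, labels)
suc = attack_success_rate(fooled)
que = average_queries(queries)
```

In this invented example one of four texts is misclassified, two of four adversarial texts succeed, and the attacker needs 40 queries on average; a more robust model would push the first and third numbers up and the second down.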

Results and Analysis
To construct robust defensive detection methods, this section tests the four common rumor detection models ($f_{Original}$) described in Section 4.1 and their improved versions under the defense methods of Sections 3.1 ($f_{Data}$), 3.2 ($f_{RS}$) and 3.3 ($f_{PGD}$). The test data are the Twitter15 test set from Section 4.1, attacked using the methods in Section 4.2; the results are shown in Table 2.

Rumor Detection Model Vulnerability Analysis
The experimental results for attacks on the original models, in the first row of Table 2, show that adversarial rumor text causes a significant decrease in the detection accuracy of all four rumor detection models, with the CSI model showing the largest decrease and the highest attack success rate. This is because the CSI model has the simplest structure and uses only an RNN to extract propagation timing features, so it has the weakest robustness. The dEFEND model, on the other hand, is more robust than the CSI model because it uses a multi-layer attention mechanism to deeply integrate text semantics and comment information. The GLAN model obtains a more robust node representation by applying the graph attention mechanism while effectively fusing multidimensional features for detection, which reduces the sensitivity of the model to adversarial text and provides better robustness. BERT is the most robust rumor detection model because its bidirectional Transformer structure can capture deeper semantic information for classification.

Table 1. Examples of adversarial text generated by the attack methods.
North korea may be develop for a rocket launch.
PWWS(Wordnet): North korea may be preparing for a bullet launch. North korea may be preparing for a projectile launch.
PWWS(Hownet): North korea may be preparing for a bullet launch. North korea may be preparing for a cartridge launch.

Effectiveness of Defense Methods
To facilitate analysis, the results of Table 2 are plotted as histograms in Figure 2. The abscissa represents the three types of defense methods, the four colors represent the four rumor detection models introduced in Section 4.1, and the ordinate represents the improvement in each robustness evaluation index. The results for the three attack methods described in Section 4.2 are shown in Figure 2(a-c), respectively. Figure 2 shows how much the robustness of the four models improves under the three types of attack, that is, compared with the original model under adversarial text attack, the increase in recognition accuracy, the decrease in attack success rate, and the increase in the number of queries for the robust model using the defense method.
All rumor detection models improve their robustness after applying the three types of defense methods proposed in Sections 3.1-3.3, which shows that the three defense methods we adopt are effective on all rumor detection models. As shown in the first column of the figure, among the four models, the CSI model ranks worst in detection accuracy improvement after applying the defense methods; in particular, the method based on adversarial training improves its accuracy by less than 10%, so the robustness gain is limited. Our conjecture is that, because the neural network structure of CSI is relatively simple, it remains vulnerable to attack even when defended. In contrast, dEFEND and GLAN apply attention mechanisms and BERT uses a bidirectional Transformer structure, all of which can extract deep semantic information for rumor detection and thereby reduce the sensitivity of the model. These three models significantly improve their robustness after applying the defense methods.

Complementarity of Defense Methods
Figure 2 also allows a further analysis of how the robust models based on the three types of defense methods perform against different types of attacks. Against TextFooler attacks, which substitute words using counter-fitting word vectors, the robust model based on random smoothing has the best defense performance, and the detection accuracy of all four models improves by more than 20%. At the same time, the experimental results show that the data augmentation method based on the WordNet and HowNet knowledge bases achieves the best defense against both PWWS attacks, because the training process of the robust model shares the same thesaurus with the PWWS attack process, which is equivalent to an "open-book exam." Against TextFooler attacks, which use a different substitution strategy, the data augmentation method performs only moderately, but it is still superior to the traditional adversarial training method PGD. In other words, PGD is the weakest defense against all three types of attacks, which is consistent with previous experiments on adversarial training in other text classification tasks (Madry, Makelov, and Schmidt et al. 2018). The principle of traditional adversarial training is to add small perturbations to the embedding layer to expand the decision boundary, but a perturbed embedding vector does not necessarily match any entry in the original embedding table, so the perturbation of the embedding layer cannot correspond to a real text input, which is inconsistent with the actual attack scenario. Therefore, the PGD method has limited ability to improve the robustness of the model, and acts more as a regularization method that improves the generalization of the rumor detection model.
In summary, when facing an attack based on an unknown thesaurus, we prefer the random smoothing approach to defense.

Ensemble Strategy Effectiveness Analysis
Through the analysis of sections 5.2 and 5.3, it can be seen that the robust performance of the model in the face of adversarial text can be improved to varying degrees using a variety of defense methods.From the perspective of multimodel ensembles, this section studies the application effects of the two ensemble strategies in Section 3.4, and the experimental results are shown in Tables 3-6.
The logits-summed ensemble strategy only averages the outputs of the models, which weakens the overall diversity of the ensemble to a certain extent, so its robustness improvement is limited, and its detection accuracy is sometimes lower than that of a single model. The voting-based strategy performs best on both clean text and adversarial text. Note that the voting-based ensemble strategy can effectively resist the three types of attacks based on different thesauruses and improve the robustness of the model. This is because individual models trained with different loss functions have different decision boundaries, and combining them into an ensemble brings more diversity, so the models compensate for one another's deficiencies. Therefore, integrating the detection results of multiple robust models through a voting strategy can improve the robustness of the model.

Conclusion
To enhance the robustness of rumor detection models against maliciously produced adversarial text in practice, this paper proposes an adversarial defense method for rumor detection based on an ensemble of multiple defense models. The method applies mainstream defense strategies such as data augmentation, random smoothing, and adversarial training, and compensates for the shortcomings of a single model by ensembling different model decision boundaries, thus effectively defending against mainstream adversarial text attacks and achieving more robust rumor detection. Through experimental evaluation on an open-source rumor dataset, we show that the proposed method can effectively improve rumor detection under adversarial conditions.

Figure 1. Rumor detection framework based on multi-defense model ensemble.
Figure 2. Robust indicators improvement value of each rumor detection model under adversarial conditions.

Table 2. Results of different models on the Twitter15 dataset and its adversarial text.