Towards data fusion-based big data analytics for intrusion detection

ABSTRACT
 Intrusion detection is seen as one of the most promising means of securing computer systems. It is used to protect computer networks against different types of attacks. The central problem in the literature is the classification of data into two main classes: normal and intrusion. Several approaches have been proposed to solve this problem, but false alarms remain an issue. To address it, we propose a new intrusion detection approach based on data fusion. The main objective of this work is to suggest a data fusion-based Big Data analytics approach to detect intrusions; it builds a single dataset that combines various datasets and contains all the attack types. This research consists of merging the heterogeneous datasets and removing redundant information using Big Data analytics tools: Hadoop/MapReduce and Neo4j. In the next step, machine learning algorithms are implemented for learning. The first algorithm, called SSDM (Semantically Similar Data Miner), uses fuzzy logic to generate association rules between the different item sets. The second algorithm, called K2, is a score-based greedy search algorithm for learning Bayesian networks from data. Experimental results prove that, in both cases, data fusion contributes to very good results.


Introduction
Intrusion detection has been the subject of numerous studies in industry and academia, but cybersecurity analysts still want greater accuracy and more comprehensive threat analysis to secure their systems in cyberspace. Improvements to intrusion detection could be achieved by adopting a more comprehensive approach to monitoring security events from many heterogeneous sources.
In previous work (Abid et al., 2022), we proposed a distributed intrusion detection system for industrial control systems. As a follow-up to that work, we propose to merge multiple datasets and turn the data into visuals in order to produce more consistent information and increase the reliability of intrusion detection.
Data fusion is a crucial technique in the field of data analysis, particularly in the context of intrusion detection. It involves combining multiple sources of information to create a more complete and accurate representation of the data being analyzed. This is particularly important in intrusion detection, where a wide variety of data sources can be used to identify and respond to potential security threats. By fusing data from multiple sources, such as network logs, system events, and user behaviour, a more comprehensive picture of the system can be created, allowing for a more effective response to potential threats. This can lead to a reduction in false positives and false negatives, allowing security systems to more accurately identify and respond to actual threats. Data fusion can also help to identify patterns and relationships between different data sources that may not be immediately apparent, leading to a deeper understanding of the system being monitored and a more effective response to potential security threats. Overall, the use of data fusion is an essential component of effective intrusion detection, helping to ensure that the data being analyzed is accurate, complete, and actionable.
Merging security events from heterogeneous sources can offer a more holistic view and better knowledge of the cyber threat situation. A problem with this approach is that, at present, even a single event source (for example, network traffic) can pose big data challenges when considered alone. Attempts to use more heterogeneous data sources pose an even greater big data challenge. Several Big Data technologies for intrusion detection can help solve these heterogeneous data problems. These technologies can help to effectively manage and analyze the large amounts of data generated by security systems, making it easier to identify potential threats and respond quickly and effectively. One of these technologies is Hadoop, an open-source framework for distributed storage and processing of large datasets. Hadoop can be used to store and process large amounts of log data generated by security systems, making it easier to identify patterns and anomalies that may indicate potential threats.
Besides, to improve intrusion detection, visualizing these security events in the form of graphs and diagrams improves alert accuracy and gives a more complete overview of cyber threats from a global perspective. Data visualization allows security analysts to quickly and effectively understand complex security data. It provides an intuitive way to represent large amounts of data, making it easier to identify patterns, relationships, and anomalies. Data visualization can help security analysts to quickly identify and respond to potential threats, reducing the time it takes to detect and respond to an attack. For example, security data can be visualized in the form of graphs and charts, allowing security analysts to quickly identify trends and anomalies in network activity. This can help to identify potential security threats and respond to them before they cause significant harm.
In this study, Big Data analytics tools including Hadoop/MapReduce and the Neo4j graph database are used to merge and clean the massive data to improve intrusion detection performance. Machine learning algorithms (SSDM and K2) are then used to learn from the data and detect upcoming attacks. The use of machine learning algorithms in intrusion detection systems allows these systems to be more effective and efficient in detecting and responding to potential security threats, due to their ability to analyze data, identify patterns and anomalies, and make predictions based on the insights gained from that analysis.
The research problem in this paper is to improve the accuracy and comprehensiveness of intrusion detection in cybersecurity by adopting a more comprehensive approach to monitoring security events from many heterogeneous sources. The paper proposes the use of data fusion, which involves combining multiple sources of information to create a more complete and accurate representation of the data being analyzed.
The contributions of this paper can be summarized as follows:
1. The paper proposes the use of Big Data analytics tools such as Hadoop/MapReduce and the Neo4j graph database to merge and clean the massive data to improve intrusion detection performance.
2. The paper also proposes the use of machine learning algorithms, specifically SSDM and K2, to learn from data and detect upcoming attacks.
3. The paper highlights the importance of data visualization to improve alert accuracy and provide a more complete overview of cyber threats.
4. The paper discusses how the use of data fusion can lead to a reduction in false positives and false negatives, allowing security systems to more accurately identify and respond to actual threats.
The paper aims to propose a more comprehensive approach to intrusion detection that incorporates data fusion, machine learning, and data visualization to improve the accuracy and comprehensiveness of intrusion detection in cybersecurity.
This paper is structured into 9 sections covering all stages of the work. The next section is an overview of related work, the third section introduces intrusion detection systems, the fourth section describes the most popular intrusion detection approaches in the literature, the fifth section presents the KDD99 and DARPA99 datasets, and the sixth section introduces Big Data technologies and the benefits of using the Hadoop ecosystem, MapReduce, and the Neo4j database. The seventh section describes our contribution, which includes the pre-processing of data and the combination of the KDD99 and DARPA99 datasets. The experimental results and the use of machine learning algorithms (SSDM and K2) are presented in section 8. Section 9 concludes the paper.

Related work
Intrusion detection based on data fusion consists of building alert databases by merging many different intrusion detection datasets. Several works have been proposed in this area. Jeyepalan and Kirubakaran (2019) presented a new approach for the detection of heterogeneous intrusions based on data fusion. Essid and Jemili (2016) proposed combining two heterogeneous data sources using MapReduce and Hadoop. Ben Fekih and Jemili (2018) proposed an approach that combines intrusion detection datasets (the NSL-KDD, Mawilab, and DARPA'99 datasets) and implemented the Naïve Bayes algorithm to build and analyze the detection model. Radhakrishna et al. (2019) proposed an approach for discovering temporal patterns by introducing the concept of data fusion with respect to the temporal pattern tree; the tree is generated for each timeslot, and the trees obtained for individual timeslots are then merged, or fused, to get the overall tree for the entire dataset. The concept of tree-based data fusion helps to prune elements efficiently and well ahead during the pattern mining process. Singh et al. (2022) employed principal component analysis (PCA) to compute the appropriate coefficients for data fusion. Qiu et al. (2021) exploited feature fusion and long-term context dependencies for simultaneous ECG heartbeat segmentation and classification. Sun et al. (2022) proposed an end-to-end deep fusion framework for travel time estimation, which exploits multisource heterogeneous traffic information within an encoder-decoder architecture. He et al. (2019) proposed multi-view data fusion for intrusion detection; their work focuses on combining information from multiple sources, such as network traffic, system logs, and user behaviour, to improve the accuracy of intrusion detection. Khraisat et al. (2020) proposed a hybrid intrusion detection system based on data fusion that combines information from multiple sources, such as network traffic and system logs, to improve the accuracy and effectiveness of intrusion detection. Anjum et al. (2021) proposed a fusion of heterogeneous data sources for intrusion detection, combining information from sources such as network traffic, system logs, and user behaviour to improve the accuracy of intrusion detection. Aleroud and Karabatis (2017) proposed a context-aware data fusion approach that considers contextual information, such as the location and time of an intrusion, to improve the accuracy of intrusion detection. Meng et al. (2020) proposed a deep learning-based data fusion approach that combines information from multiple sources to improve the accuracy and effectiveness of intrusion detection.
These works focused on developing frameworks for integrating multiple sources of data, including network traffic data, system logs, and user behaviour data, to demonstrate the effectiveness of data fusion techniques in improving the accuracy and effectiveness of intrusion detection systems. However, the majority of these works, except (Essid and Jemili, 2016) and (Ben Fekih and Jemili, 2018), did not mention any specific machine learning algorithm used in their approach.
As shown in table 1, more research is needed to develop data fusion techniques that can handle large amounts of data, provide a way to visualize graph structures, and operate in real time to support the needs of distributed networks. The previous works focused on various approaches to data fusion and its application in different domains, such as intrusion detection, traffic estimation, physical access control analysis, and employee activity monitoring. Other works focused on data visualization. Geepalla and Asharif (2020) developed a graph-based method for the analysis of Physical Access Control (PAC) log data to detect normal and abnormal behaviour. They developed an Eclipse application (AC2Neo4j) that transforms PAC log data into Neo4j automatically, thus allowing powerful analysis to take place using Cypher queries. Velampalli et al. (2019) used a graph-based approach that analyzes the data for suspicious employee activities. They focused on graph-based knowledge discovery in structural data to mine for interesting patterns and anomalies. They first reported the normative patterns in the data and then discovered any anomalous patterns associated with the previously discovered patterns. For visualizing the suspicious patterns, they used the enterprise graph database Neo4j.
All the works listed above address various aspects of data fusion, data visualization, and learning algorithms in the context of intrusion detection and abnormal behaviour detection. While the approaches proposed in these works are promising, they also have some limitations and challenges. In terms of data fusion, most of the works (Jeyepalan and Kirubakaran, 2019; Radhakrishna et al., 2019; Singh et al., 2022; Qiu et al., 2021; Sun et al., 2022; He et al., 2019; Khraisat et al., 2020; Anjum et al., 2021; Aleroud and Karabatis, 2017) did not include big data analytics or data visualization. Instead, they focused on combining information from multiple sources, such as network traffic, system logs, and user behaviour, to improve the accuracy of intrusion detection. However, the effectiveness of these approaches largely depends on the quality and completeness of the input data. Moreover, protection against information redundancy in distributed data fusion is an ongoing challenge. Big data analytics can help overcome the limitations caused by the quality and completeness of the input data by processing large datasets, identifying patterns and anomalies, and deriving insights that improve the accuracy of intrusion detection. Big data analytics can also help to improve the efficiency and accuracy of distributed data fusion by identifying and removing redundant information. Additionally, data visualization tools can help interpret the data by representing it visually, aiding in identifying potential vulnerabilities and threats.
In terms of learning algorithms, some works, such as (Ben Fekih and Jemili, 2018) and (Meng et al., 2020), propose the use of machine learning algorithms, such as Naïve Bayes and SVM, for intrusion detection and abnormal behaviour detection, respectively. While these algorithms have shown promising results, Naïve Bayes is limited by its handling of unseen categories and its performance on small datasets, and SVMs are limited by the sensitivity of their classification performance to the choice of kernel function. Learning algorithms that can deal with incomplete data and are not sensitive to metric choices can therefore represent a good alternative.
Besides, some of these works lack a comprehensive evaluation of their proposed methods. For example, the work in (Essid and Jemili, 2016) used Hadoop/MapReduce for centralized data fusion and the K2 algorithm for learning, but it did not compare the performance of the proposed method with other state-of-the-art approaches. Moreover, a centralized approach can be limited in its ability to capture the complexity and distribution of real-world data. Similarly, the work in (Ben Fekih and Jemili, 2018) used the Naïve Bayes algorithm to build and analyze the detection model, but it did not evaluate the performance of other machine learning algorithms or compare the results with other intrusion detection systems.
Regarding data visualization, the works in (Geepalla and Asharif, 2020) and (Velampalli et al., 2019) proposed graph-based approaches for the analysis of physical access control log data and suspicious employee activities, respectively. These approaches provide a powerful way to represent complex data and identify patterns and anomalies. However, the visual representation of graphs can also be complex and may require additional tools or skills to interpret the results. These approaches did not include data fusion or big data analytics. Data fusion and big data analytics can help to simplify the visual representation of complex graphs by integrating multiple data sources and using advanced analytics techniques to extract insights from large datasets. This can make it easier to identify patterns and relationships, leading to a more comprehensive and understandable visual representation of the data.
In conclusion, while the works presented above propose promising approaches for intrusion detection and abnormal behaviour detection, they also have limitations and challenges related to data fusion, learning algorithms, and data visualization. In terms of data fusion, the mentioned works did not incorporate big data analytics or data visualization, which may limit their effectiveness. The quality and completeness of input data is also a critical factor that affects the performance of these approaches. Furthermore, protecting against information redundancy remains an ongoing challenge. Regarding learning algorithms, some of the proposed methods lack a comprehensive evaluation of their performance, and the algorithms used may have limitations such as difficulties in handling unseen categories or sensitivity to kernel functions. Additionally, some works did not compare their results with other state-of-the-art approaches or evaluate other machine learning algorithms. The works that utilized graph-based approaches for data visualization did not include big data analytics or data fusion. Although graph-based approaches can effectively represent complex data, the visual representation of graphs can be challenging to interpret, requiring additional tools or expertise.
To address these limitations, our research work proposes the development of a new IDS that utilizes data fusion and visualization. By combining data fusion and visualization, we create a system capable of handling large amounts of data and removing redundancies in distributed environments, using the power of Big Data analytics tools such as Hadoop/MapReduce and the Neo4j graph database. We also leverage two learning algorithms, SSDM and K2, to enable our system to make accurate detections even when data is incomplete. By using these algorithms and tools, we aim to develop a more effective IDS that addresses the limitations of current systems. We compare our approach with other approaches to demonstrate its effectiveness. The integration of these algorithms and tools allows for more efficient data processing and accurate intrusion detection, providing a significant improvement in the overall security of distributed systems.

Intrusion detection system (IDS)
An Intrusion Detection System (IDS) (Aleroud and Karabatis, 2017) is a defense system that detects malicious activity in a network. An important characteristic of these intrusion detection systems is their ability to provide a view of unusual activity and to issue alerts notifying administrators and/or blocking a suspected connection. In addition, IDS tools can distinguish between internal attacks and external attacks. An IDS consists of five main components, as shown in figure 1:
- Load Balancer: collects the data sources and dispatches them to the sensors.
- Sensor: collects information on the evolution of the state of the system and provides a sequence of events that reflect this evolution.
- Analyzer: the core of the intrusion detection system; it determines whether a subset of the events provided by the sensor contains intrusions or not.
- Monitor: collects alerts from the sensors, formats them, and presents them to the Manager.
- Manager: responsible for the appropriate response to stop the attack.
The IDS allows to:
- Detect attacks and all types of security breaches that are not recognized by preventive security tools.
- Act as quality control for security design.
- Provide accurate information about intrusions, and diagnose, recover, and correct the root causes of these intrusions.
- Respond to attacks and block intrusions before they damage the information system and programmes.
- Prevent staff from violating the security policy. If employees are aware that their actions are being monitored by an intrusion detection system, they will be less likely to commit violations due to the risk of detection.

Intrusion detection approaches
Various approaches to intrusion detection are currently in use, each with its own advantages and drawbacks.

Intrusion detection based on support vector machines (SVM)
SVM is a supervised learning method. It performs classification by constructing an N-dimensional hyperplane that optimally separates the data into different categories. In the basic case, SVM classifies data into two categories. Given a set of training instances, labeled pairs {(x, y)}, where y is the label of instance x, SVM works by maximizing the margin to get the best classification performance. This method has some drawbacks; its main problem is that the complexity of training is very dependent on the size of the dataset, being known to be at least quadratic in the number of data points. Using the hierarchical clustering algorithm BIRCH overcomes this problem of SVM (Ben Sujitha et al., 2020); it is applied in the construction of an intrusion detection system to reduce the number of data points. The combined SVM and BIRCH detection system achieved an accuracy of 95.72% with a false positive rate of 0.73%.
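The margin-maximization idea can be illustrated with a minimal linear SVM trained by sub-gradient descent on the hinge loss. This is a hedged sketch on invented 2-D toy data, not the BIRCH-accelerated system cited above:

```python
# Minimal linear SVM: minimize hinge loss + L2 regularization by
# sub-gradient descent. Toy data and parameters are illustrative only.

def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=500):
    """points: list of (x1, x2); labels: list of -1/+1 class labels."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            # Update with the loss gradient only when the margin is violated
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:
                # Otherwise only the regularizer shrinks the weights
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, point):
    # Sign of the decision function gives the predicted class
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1
```

A kernelized SVM generalizes this to non-linear separating surfaces; the quadratic training cost mentioned above comes from the pairwise kernel computations that this linear sketch avoids.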

Intrusion detection based on Naive Bayes (NB)
This method uses Bayes' rule, under the hypothesis that the attributes of each example are independent given the class. The learning phase estimates, from the training examples, the frequency of occurrence of each class and of each attribute value within each class. The execution phase assigns an unknown instance to the class with the largest conditional probability. One of its strengths is that it needs only a small amount of data for the learning phase (Kharche and Patil, 2020).
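The two phases described above can be sketched in a few lines; this is a hedged, minimal illustration in which the attribute values and class labels are invented, not taken from the paper's datasets:

```python
# Naive Bayes sketch: the learning phase counts class frequencies and
# per-class attribute-value frequencies; the execution phase scores each
# class by prior * product of per-attribute likelihoods.
from collections import Counter, defaultdict

def train_nb(examples):
    # examples: list of (attribute_tuple, class_label)
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> counts
    for attributes, label in examples:
        for i, value in enumerate(attributes):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def classify(attributes, class_counts, value_counts):
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total  # class prior
        for i, value in enumerate(attributes):
            score *= value_counts[(label, i)][value] / n  # likelihood
        if score > best_score:
            best_class, best_score = label, score
    return best_class
```

A production version would add smoothing so that unseen attribute values do not zero out a class score, which is exactly the "unseen categories" limitation discussed in the related-work section.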

Intrusion detection based on fuzzy genetic algorithm (FGA)
The fuzzy genetic algorithm system is a fuzzy classifier; it is a hybrid approach that combines the strengths of fuzzy logic and genetic algorithms (Meng et al., 2020). The following are the steps for using FGA for intrusion detection (Abid et al., 2022):
1. Preprocessing: collect network data and preprocess it for analysis. This may include data cleaning, normalization, and feature extraction.
2. Fuzzy rule generation: use fuzzy logic to define rules that relate input data to output data. These rules are typically 'if-then' statements that define the conditions for an intrusion to occur.
3. Fitness function: define a fitness function that measures the effectiveness of the IDS. This function is used to evaluate the quality of each candidate solution in the genetic algorithm.
4. Genetic algorithm (GA): use the genetic algorithm to optimize the fuzzy rules. The genetic algorithm evolves a population of candidate solutions by selecting the fittest individuals and combining them through crossover and mutation operations.
5. Evaluation: evaluate the performance of the IDS using the fitness function and test data. If the performance is not satisfactory, refine the fuzzy rules and repeat the optimization process.
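As a hedged illustration of steps 2-5, the sketch below evolves the midpoint of a single hypothetical fuzzy membership function (a rule of the form "IF traffic_rate is HIGH THEN intrusion") with a toy genetic algorithm; the data, rule, and GA parameters are all invented for illustration:

```python
import random

# Toy labeled data: (traffic_rate, is_intrusion). Low rates are normal,
# high rates are intrusions; the GA must find a good rule threshold.
DATA = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]

def membership_high(x, mid):
    # Simple ramp membership function for "HIGH", centered at `mid`
    return max(0.0, min(1.0, (x - mid) * 5 + 0.5))

def fitness(mid):
    # Step 3: fraction of records the rule classifies correctly
    correct = sum((membership_high(x, mid) > 0.5) == bool(label)
                  for x, label in DATA)
    return correct / len(DATA)

def evolve(pop_size=20, generations=30):
    # Step 4: selection, crossover, and mutation over candidate midpoints
    random.seed(0)  # deterministic for illustration
    pop = [random.random() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a + b) / 2                # crossover
            child += random.uniform(-0.05, 0.05)  # mutation
            children.append(min(1.0, max(0.0, child)))
        pop = parents + children
    return max(pop, key=fitness)               # step 5: best rule found
```

A real FGA would evolve whole rule bases (several membership functions and antecedents per rule) rather than a single threshold, but the select/crossover/mutate loop is the same.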
The fuzzy genetic algorithm steps are shown in figure 2.

Intrusion detection based on fuzzy neural network (FNN)
Many argue that Fuzzy Neural Networks (FNN) can improve the performance of intrusion detection systems compared to traditional methods. However, the accuracy of FNN-based intrusion detection for low-frequency attacks still needs improvement. An FNN IDS based on fuzzy clustering divides the data into small groups (subsets). The objective is to reduce the detection time while increasing the detection rate and identifying attacks (Arachchilage and Wijekoon, 2021).
In our contribution, we opted for Fuzzy Decision Tree (FDT) and Bayesian Network (BN)-based intrusion detection approaches. Experimentation has shown the effectiveness of these approaches compared to the other ones.

Datasets
In this part, we focus on the description of the KDD99 and DARPA99 datasets used for experimentation.

KDD99
The data of the KDD-Cup99 database (KDD Cup, 1999) were prepared and managed by the MIT Lincoln Laboratory for the DARPA 1998 intrusion detection evaluation programme. Lincoln Labs set up an environment to acquire nine weeks of raw TCP dump data from a local-area network (LAN).
The raw training data was about four gigabytes of compressed binary TCP dump data from seven weeks of network traffic. This was processed into about five million connection records. Similarly, the two weeks of test data yielded around two million connection records. The training data contains 22 of the 39 attack types present in the test data. In this work, we use the KDD training dataset consisting of 10% of the original dataset (Essid and Jemili, 2016). Each connection is labeled as a normal connection or as an attack, with exactly one specific attack type. The attacks fall into four main categories (Tavallaee et al., 2019) (see table 2):
1. Denial of Service Attack (DoS): an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
2. User to Root Attack (U2R): a class of exploit in which the attacker starts with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and exploits some vulnerability to gain root access to the system.
3. Remote to Local Attack (R2L): occurs when an attacker who can send packets to a machine over a network but who does not have an account on that machine exploits some vulnerability to gain local access as a user of that machine.
4. Probing Attack: an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.
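In practice, mapping the per-connection labels to these four categories is a simple lookup. The sketch below covers only a subset of the KDD99 attack names, and the helper function name is ours:

```python
# Partial mapping of KDD99 attack labels to the four attack categories.
# KDD99 labels carry a trailing dot (e.g. "smurf."), which we strip.
ATTACK_CATEGORY = {
    "smurf": "DoS", "neptune": "DoS", "back": "DoS",
    "buffer_overflow": "U2R", "rootkit": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L",
    "portsweep": "Probe", "ipsweep": "Probe",
    "normal": "normal",
}

def categorize(label):
    """Return the attack category for a raw KDD99 connection label."""
    return ATTACK_CATEGORY.get(label.rstrip("."), "unknown")
```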

Darpa99
In the DARPA99 database, the data set consists of weeks 1, 2, and 3 for the training data and weeks 4 and 5 for the test data (DARPA, 1999). In the training data, weeks 1 and 3 contain normal traffic and week 2 is marked for attacks. The attacks marked during this traffic are classified into four categories: Probe, DoS, R2L, and U2R. Table 3 shows all attacks from the DARPA99 dataset. In our work, we focus on all the data from weeks 1-3, which we aim to combine with the KDD99 dataset.

Big data analytics tools
Big data is data that exceeds the processing capacity of conventional database systems.
The data is too big, moves too fast, or doesn't fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it. The intrusion detection field often involves Big Data analysis. The Cloud Security Alliance reported that, in 2013, a company like HP could generate one trillion events per day, or about 12 million events per second (Group BDW, 2020). All these events are stored in large datasets. Correlating these events and managing this large amount of data with old techniques would be very difficult, so Big Data techniques are used to analyze this huge volume of data. In this section, we present the different technologies used in the context of Big Data.

Hadoop
The Hadoop ecosystem supports a distributed master-slave architecture that consists of the Hadoop Distributed File System (HDFS) for data storage and the MapReduce technique for computing capabilities.
Figure 3 shows the distributed architecture of Hadoop. Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is (Hadoop Architecture, 2020):
- Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).
- Robust: because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
- Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
- Simple: Hadoop allows users to quickly write efficient parallel code.
Hadoop uses a programming model called MapReduce for parallelization, scalability, and fault tolerance.

Mapreduce
MapReduce is a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers (Hadoop, 2020).
Map and Reduce functions receive and emit (key, value) pairs:
- The Map function receives an input pair and can produce any number of output pairs: none, one, or more.
- The Reduce function receives a list of input pairs (produced by the Map function) all having the same key and produces a pair that contains the expected result. This pair can have the same key as the input.
Steps of a MapReduce job:
- Input File: contains the input files, which are mostly stored in HDFS.
- Input Format: selects the files to be processed. It defines the Input Split, which separates the application into tasks depending on the file; each task will therefore be a Map. Finally, it provides a generic reader for the file(s).
- Input Split: splits the file into several pieces to process; it knows how the file is divided.
- Record Reader: transforms an Input Split into key-value pairs. It is reused until the Input Split is complete.
- Mapping: the basic task of MapReduce. It transforms the key-value pairs into a new list of key-value pairs as needed.
- Combine: groups the results of the mapping carried out on a node. This makes it possible to group the data before sending it over the network.
- Sort: reorganizes the input data of a Reducer, then groups the values by key.
- Reducer: the second important task of MapReduce. It gives an output value for each key it processes.
- Output Format: the last phase of a MapReduce job; it saves the results on the HDFS system.
In this work, we use this technique to eliminate redundancies in the two datasets.
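The mapper/reducer contract described above can be sketched with the classic word-count job. This hedged, in-memory simulation stands in for the distributed framework, which would run the Map tasks and the shuffle across many nodes:

```python
# Word count under the MapReduce model: Map emits (word, 1) for every word,
# the framework groups pairs by key (shuffle & sort), and Reduce sums the
# values per key.
from collections import defaultdict

def map_fn(_, line):
    # One input pair in, any number of output pairs out
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    # Stand-in for the framework's Shuffle & Sort phase
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # All values for one key in, one result pair out
    return key, sum(values)

def run_job(lines):
    pairs = [p for i, line in enumerate(lines) for p in map_fn(i, line)]
    return dict(reduce_fn(k, v) for k, v in shuffle(pairs).items())
```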

Neo4j database
Neo4j is a graph-oriented, open-source database used to manage and analyze big data. It allows storing, querying, and browsing thousands of nodes in a few milliseconds.
A Neo4j cluster is made up of a single master instance and zero or more slave instances. All instances in the cluster have full copies of the data in their local database files. Figure 4 shows a basic architecture of a Neo4j cluster. Each instance in the Neo4j database has two parts:
- HA Database: the element that allows data to be stored and retrieved.
- Cluster Manager: helps ensure the high availability and fault tolerance of a cluster.
The HA Database communicates directly with other instances for data replication using the Cluster Manager. Each instance contains the logic needed to coordinate with the other instances for data replication and election management. Each slave instance communicates with the master to keep the databases up to date.
Neo4j is a highly scalable, fully transactional ACID (atomicity, consistency, isolation, and durability) graph database that stores data structured as graphs.It allows developers to achieve excellent performance in queries over large, complex graph datasets, and at the same time, it is very simple and intuitive to use (Ankur, 2021b).
Neo4j can store billions of nodes, each with a set of properties, and is queried with its query language, Cypher.
Cypher is the most powerful tool in the hands of any graph database developer when it comes to Neo4j.Cypher is a language for querying graphs and came into existence because the Java API for Neo4j was considered too verbose and that of Gremlin too prescriptive.Cypher was designed to strike the right balance by being declarative and easy for those who come from a SQL background (Ankur, 2021b).
The advantage of using a graph database is that it provides a partition-free solution.The usage of data on Hadoop will lead to the partitioning of the data between systems.This will lead to inaccurate solutions when performing operations such as Clustering or Classification which require the availability of the entire data rather than a subset of the data (Jeyepalan and Kirubakaran, 2019).
Moreover, the advantage of using a graphical database is that it provides processing of millions of data nodes in the order of milliseconds as opposed to any other mechanism that will take a few seconds or even minutes, depending on the size of the data set.In our work, we used the Neo4j graph database to merge similar data in both datasets and combine them into a single one.

Contribution
In this contribution, we have attempted to address the shortcomings of similar approaches in the intrusion detection context mentioned in Table 1. We dealt with a large amount of data and used Big Data techniques to process it in real time and in a distributed environment. We also used a data visualization tool to simplify the data fusion process and reduce the data size by keeping only relevant data, without redundancies and irrelevant records. After this step, we applied two algorithms that handle uncertainty in the data to ensure effective intrusion detection. Comparative studies conducted in the experimental part demonstrated the relevance of our choices.
In this section we explain the three main stages (1, 2, and 3) of our work as shown in Figure 5:

Remove redundancies
The KDD99 and DARPA99 datasets contain a very large number of redundant records, which must be eliminated to avoid problems for the learning algorithms and to reduce the data size. For this phase, we used the MapReduce technique. A MapReduce job includes several phases:
1. Pretreatment: preparation of the input data, e.g. decompressing files.
2. Split: separation of the data into blocks treatable separately, formed of (key, value) pairs, e.g. rows or tuples.
3. Map: application of the map function to all (key, value) pairs formed from the input data; it produces further (key, value) pairs as output.
4. Shuffle & Sort: redistribution of the data so that the pairs produced by Map with the same keys end up on the same machines.
5. Reduce: aggregation of the pairs having the same key into the final result.
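As a rough illustration of how such a job removes duplicates, the Map, Shuffle & Sort, and Reduce phases above can be simulated in a few lines of Python (a sketch of the idea, not the actual Hadoop job we ran):

```python
from itertools import groupby

def deduplicate(records):
    """MapReduce-style deduplication: Map emits (record, 1) with the whole
    connection record as the key, Shuffle & Sort groups identical keys,
    and Reduce emits each distinct record exactly once."""
    pairs = [(rec, 1) for rec in records]                          # Map
    pairs.sort(key=lambda kv: kv[0])                               # Shuffle & Sort
    return [k for k, _ in groupby(pairs, key=lambda kv: kv[0])]    # Reduce
```

Because the entire record serves as the key, all exact duplicates land in the same reduce group and collapse to a single output line.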
Figure 6 explains, with a simple example, the procedure of MapReduce. To run our algorithm on the two datasets KDD99 and Darpa99, we use the following command: hadoop jar /mnt/hgfs/D/remove/remove.jar /mnt/hgfs/D/kdd99.txt /mnt/hgfs/D/Darpa99.txt.
After the execution of our algorithm, we obtain two files that correspond to the two datasets KDD99 and Darpa99 without redundancies. We notice that the number of data lines is greatly reduced following the elimination of redundant data.
Table 4 shows the number of records in the two datasets before and after the redundancy elimination phase. After this phase, we move to the next step, which consists of merging similar data after importing the non-redundant sets of KDD99 and Darpa99 into the Neo4j database.

Merging similar data
Our goal in the merging phase is to increase and improve the intrusion detection rate. In this section, we used simple queries written in the Cypher language to merge similar nodes (each connection has been turned into a node) based on a set of properties with higher weight in the connection.
Wei W., Gombault S., and Guyet T. (Ben Fekih and Jemili, 2018) proposed to use the 10 most common attributes that different methods simultaneously selected to form the key set of attributes, shown in Table 5, for the detection of different categories of attacks.
For example, two nodes with the same attack type, Neptune, and the same properties of the DoS category are merged into a single node. After choosing the attributes for each dataset, we launch the Neo4j instance to run our queries:
- LOAD CSV WITH HEADERS FROM: imports a CSV file from its location.
- MERGE: browses our database line by line, merges connections with the same attributes, and creates the result as a node.
- ON CREATE SET: used to set the other attributes of each node.
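Outside Neo4j, the effect of this merge can be sketched in Python: records sharing the same values on the selected key attributes collapse into a single node (the attribute names below are illustrative, not the exact Table 5 selection):

```python
def merge_records(records, key_attrs):
    """Merge records that share the same values on key_attrs into one node,
    counting how many raw records each merged node represents."""
    nodes = {}
    for rec in records:
        key = tuple(rec[a] for a in key_attrs)
        if key not in nodes:
            # First occurrence becomes the node, annotated with a merge counter.
            nodes[key] = dict(rec, merged_count=1)
        else:
            nodes[key]["merged_count"] += 1
    return list(nodes.values())
```

In the actual contribution, the Cypher MERGE clause performs this grouping directly inside the graph database.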
After executing these queries, we obtain all the records (connections) in the KDD_Darpa99 database, created in the form of nodes. Each node contains all the data and the specific type of attack, as shown in Figure 7 below. When creating the data in the Neo4j database, storing it in a single database (KDD_Darpa99) makes it easier to work on the next phase, which is the vertical combination of all the data into a single uniform database. After executing the query, the data is retrieved in a well-organized table, as shown in Figure 8. Subsequently, we exported the final database in CSV format.

Data fusion results
After the elimination of redundancies and the merging of similar data, the number of records in the two datasets is reduced. Table 6 shows the record counts before and after fusion. The number of records after fusion has decreased considerably, which is explained by the large amount of redundant and insignificant data deleted during the fusion process.

Fuzzy decision tree: SSDM algorithm
Different intrusion detection approaches are currently in use, each with its strengths and weaknesses. The fuzzy decision tree has shown its performance in intrusion detection. Fuzzy decision trees are classification structures built on granules of information in fuzzy clusters (Li and Li, 2021): the dataset is grouped into clusters so that similar data are put together, and these groups are completely characterized by their prototype (centroid). As the tree grows, nodes (clusters) are divided into granules of less diversity, i.e. greater homogeneity (Ahmadian and Yazdi, 2021). The performance, robustness, and usefulness of classification algorithms improve when relatively few features are involved in the classification; thus, the selection of relevant features for classifier construction requires a lot of attention. Several feature selection approaches have been explored (Aleroud and Karabatis, 2017).
The goal is to identify the best subset of features for the construction of the fuzzy decision tree. The use of the fuzzy decision tree for IDS development with a data partition is based on horizontal fragmentation, which repartitions the tuples into subsets, where each subset may contain data that share common properties. Horizontal fragmentation is defined by expressing each fragment as a selection operation on the overall relation (Pradhan and Das, 2021). SSDM (Figure 10) is an algorithm from the Apriori algorithm family to which notions of fuzzy sets are added to strengthen the search for associations on more robust bases. As its name suggests, SSDM relies on the semantic similarity of elements to perform its searches. The algorithm uses similarity matrices (Figure 9) and requires setting a minimum similarity value, which determines whether the algorithm should merge two elements of the same domain into a single entity.
The algorithm looks for cycles of similarity between elements, and these cycles are used when generating candidates, exactly as in Apriori. We then end up with a set of pure candidates, possibly containing fuzzy candidates. Where applicable, the support of a fuzzy candidate is evaluated according to the weight of its similarities by the following formula, where weight(c) is the weight of candidate c containing elements a and b, and weight(a) is the number of occurrences of a in the transactional database.
The general formula for n elements is given by the following equation, where f is the smallest degree of similarity contained in the considered associations. The generation of the rules is identical to that of Apriori.
The basic principle of this algorithm is to use fuzzy logic to highlight the semantic relationships between different components. This algorithm is based on the generation of association rules between item sets extracted from the base data: X → Y is an association rule, where X and Y are two item sets. We define the following variables:
- Support: percentage of transactions that contain X and Y;
- Confidence: percentage of transactions containing X that also contain Y;
- Minsup: minimum support value;
- Minconf: minimum confidence value;
- Minsim: minimum similarity value before establishing a semantic link between two different words.
All these parameters are defined between 0 and 1.
Steps of the algorithm:
1. Data scanning: identify the elements and domains;
2. Calculate the degree of similarity between the elements of each domain;
3. Identify the elements that are considered similar;
4. Generate the candidates;
5. Calculate the weight of each candidate;
6. Assess each candidate;
7. Generate the rules.
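The candidate-assessment and rule-generation steps rest on the standard support and confidence measures; a minimal Python sketch of evaluating a rule X → Y over a set of transactions (the helper names are illustrative, not the SSDM implementation):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, x, y):
    """confidence(X -> Y) = support(X union Y) / support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)
```

A rule is retained only when its support exceeds Minsup and its confidence exceeds Minconf; SSDM additionally weights fuzzy candidates by their similarity degrees.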
To evaluate the performance of the proposed system, two experiments have been carried out. The first aims to evaluate the performance of the SSDM algorithm compared to other algorithms; the comparison results are described in Tables 7 and 8. The second experiment aims to evaluate the contribution of data fusion to intrusion detection; for this, we performed a comparison with other approaches based on a single dataset, the KDD dataset. The results are shown in Tables 9 and 10.

Experiment 1
In this experiment, we show the effectiveness of the SSDM algorithm compared to other algorithms, all of which learn from the fused dataset.
On one hand, we notice that U2R attacks have the lowest detection rate. U2R (User to Root) attacks are a type of intrusion where an attacker with limited privileges on a system tries to escalate those privileges to gain root-level access. U2R attacks are underrepresented in both the KDD99 and Darpa99 datasets. One possible reason is that they are less common than other types of attacks: DoS (Denial of Service) or probing attacks are more frequent and cause more immediate damage to the system, making them more likely to trigger an alert. Additionally, U2R attacks may take longer to execute and may be more difficult to detect than other types of attacks.
Machine learning algorithms learn patterns from the data they are trained on, and these patterns are used to make predictions on new, unseen data. When proportionally less data is available for a particular class, it can be difficult for machine learning algorithms to accurately learn and generalize from that data.
On the other hand, SSDM shows better performance in detecting DoS and R2L attacks and normal connections in comparison with the other algorithms shown in Table 8. This can be explained by the high rate of missing data in the records of these connections: SSDM, thanks to its fuzzy reasoning, succeeded in detecting these connection types.
FGA is an optimization technique that combines the principles of fuzzy logic and genetic algorithms to solve optimization problems. FGA uses genetic algorithms to search for the optimal solution to a problem, while incorporating fuzzy logic to handle uncertainty and imprecision in the data. This makes it a powerful tool for solving complex optimization problems, especially in situations where the relationships between variables are not well defined. In general, SSDM is considered a better approach than FGA for data mining tasks that require a deeper understanding of the relationships between data instances. This is because SSDM is able to process data in a more semantically aware manner, whereas FGA is more focused on finding the optimal solution to a problem through a combination of genetic algorithms and fuzzy logic.
SSDM and Fuzzy Neural Networks (FNN) are both techniques used for pattern recognition and data mining. However, they differ in the way they process information and make decisions. FNNs are based on fuzzy logic and use fuzzy sets to represent uncertainty in data. They are able to process large amounts of information and make decisions based on that information. However, they can sometimes struggle with complex, non-linear relationships in data.
On the other hand, SSDM is a data mining technique that is based on semantic similarity. It uses natural language processing and ontologies to understand the meaning of data, and then measures the semantic similarity between instances in a dataset. This approach is more effective for processing complex, multi-dimensional data, as it is able to capture the underlying relationships between data instances.
In general, SSDM is considered to be a better approach than FNN for data mining tasks that require a deeper understanding of the relationships between data instances. This is because SSDM is able to process data in a more semantically aware manner, whereas FNNs are more focused on processing raw data.

Experiment 2
In this experiment, we show the contribution of data fusion compared to approaches based on a single dataset. For this purpose, we considered approaches based on the GA, C5, C4.5, and ID3 algorithms, all of which learn from a single dataset, the KDD dataset.
It is clear that data fusion has improved the accuracy and efficiency of the intrusion detection system. By integrating data from multiple sources, the system gains a better understanding of the network traffic; this integration allows it to detect anomalies and potential threats that might be missed with a single data source. Data fusion has improved the accuracy of intrusion detection by reducing false positives and false negatives.
Other machine learning models that take data uncertainty into account have been used in the literature and have shown good results in classification, among which we cite Bayesian networks. In what follows, we present a probabilistic learning algorithm, K2, and compare the results of both classifiers. We classically define a Bayesian network as a directed acyclic graph made up of a set of variables and a set of arcs between the variables. Each variable corresponds to a node in the network, and with each variable Xi having parents pa(Xi), we associate a conditional probability p(Xi | pa(Xi)) (Ben Sujitha et al., 2020). The idea of the K2 algorithm is to maximize the probability of the structure given the data (see Figure 11). The analysis results of our KDD_Darpa99 database are presented in Figure 12.

Experiment 3
In this section, we compared the K2 and SSDM classifiers used in this contribution. The comparative results are described in Tables 11 and 12 in terms of detection rates and false alarms. Finally, we present the performance of the K2-based model in terms of false alarms in Table 12.
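The quantity that K2 greedily maximizes when adding parents to a node can be sketched as follows: a minimal Python implementation of the (log of the) Cooper-Herskovits score for one node, assuming discrete variables encoded as integers 0..r-1 (an illustrative sketch with hypothetical names, not the implementation used in our experiments):

```python
from collections import defaultdict
from math import lgamma

def k2_log_score(data, child, parents, r):
    """Log K2 (Cooper-Herskovits) score of `child` given a candidate parent set.
    data: list of dicts mapping variable name -> discrete value in 0..r[var]-1.
    r: dict mapping variable name -> number of states of that variable."""
    # Count child values per parent configuration.
    counts = defaultdict(lambda: defaultdict(int))
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts[cfg][row[child]] += 1
    ri = r[child]
    score = 0.0
    for child_counts in counts.values():
        nij = sum(child_counts.values())
        # log[(ri - 1)! / (nij + ri - 1)!] via lgamma identities.
        score += lgamma(ri) - lgamma(nij + ri)
        for nijk in child_counts.values():
            score += lgamma(nijk + 1)  # log(nijk!)
    return score
```

K2 starts each node with no parents and, following a fixed variable ordering, repeatedly adds the parent that most increases this score, stopping when no addition improves it.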
The use of the K2 algorithm has shown the highest performance in detecting some types of attacks (probing and R2L), while SSDM has shown better performance in detecting other types (DoS and U2R) and normal connections. SSDM takes into consideration the imprecision due to missing data, while K2 takes into consideration the uncertainty of the data due to its stochastic and random origin. Both could be used in ensemble learning in future research to profit from the advantages of each (Table 13).

Experiment 4
In this experiment, we compare our results of data fusion-based intrusion detection with recent intrusion detection approaches that learn from a single dataset (the KDD99 dataset), in order to emphasize the importance of the data fusion-based contribution.
M. S. S. Islam and A. R. Islam (2021) evaluated the performance of deep convolutional neural networks (CNNs) for intrusion detection using the KDD99 dataset, achieving a detection rate of 99.1%. D. D. P. C. Arachchilage and K. M. K. Wijekoon (2021) presented an ensemble approach using random forest and gradient boosting algorithms, achieving a detection rate of 98.3% on the KDD99 dataset. D. Li and Y. Li (2021) proposed a feature selection method combined with a random forest algorithm, achieving a detection rate of 99.4% on the KDD99 dataset. A. Ahmadian and H. S. Yazdi (2021) proposed a feature selection method based on correlation-based feature selection and principal component analysis, achieving a detection rate of 99.3% on the KDD99 dataset. X. Yang et al. (2021) proposed a feature selection method based on recursive feature elimination and a support vector machine algorithm, achieving a detection rate of 99.3% on the KDD99 dataset. A. B. Pradhan and A. K. Das (2021) proposed a hybrid approach using a rule-based method combined with a support vector machine algorithm, achieving a detection rate of 99.1% on the KDD99 dataset. And H. Farhan et al. (2021) proposed a hybrid approach using a k-nearest neighbours algorithm combined with a rule-based method, achieving a detection rate of 99.2% on the KDD99 dataset.
The idea of combining different datasets and minimizing the number of records has improved the results in terms of detection rate. The partitioning of tasks between the Hadoop ecosystem and the Neo4j database has enabled easier interpretation of data and better data exploration. Besides, the Fuzzy Decision Tree (SSDM) and the Bayesian Network (K2), considered among the most efficient models in the knowledge representation field, have shown their efficiency in detecting intrusions.
In future work, we plan to emphasize security and transparency in managing and analyzing data. Blockchain technology can play a crucial role in big data analytics by providing secure, transparent, and immutable data management solutions.
The study by Balani et al. (2022) highlights the design of a high-speed blockchain-based sidechaining peer-to-peer communication protocol over 5G networks. The proposed protocol is designed to provide a high-speed and secure communication mechanism between peers over 5G networks, enabling faster and more efficient processing of large volumes of data. Blockchain can thus be useful for big data analytics applications that require real-time data processing and analysis.
The study by Mhatre et al. (2022) proposes a blockchain-based counterfeit product identification system (BCPIS) to tackle the problem of counterfeit products in the supply chain industry.BCPIS utilizes blockchain technology to provide an immutable and transparent record of the entire supply chain, ensuring the authenticity of products.The system provides a reliable and secure way of tracking the movement of products in the supply chain, reducing the risk of counterfeit products entering the market.Blockchain can be useful for big data analytics applications that require accurate and reliable data.
The use of blockchain in big data analytics can help in real-time data processing, analysis, and accurate data tracking.

Conclusion
In this article, we presented a new approach based on Big Data analytics tools for combining different datasets, and then implemented the fuzzy decision tree algorithm SSDM and the Bayesian network algorithm K2 to analyze the fused data. In this work, we showed the usefulness of data fusion based on the Hadoop ecosystem (MapReduce) and the Neo4j database to manage and process big data. Despite its advantages, SSDM suffers from some shortcomings. We aim to improve the performance of this algorithm by modifying its data structures and improving the expression of its syntax rules, to reduce false classifications and better detect new intrusions.
In future work, we will continue to develop our approach to combine a greater number of datasets using Big Data analytics tools, in order to improve the Intrusion Detection System's performance. Besides, we intend to improve the SSDM and K2 algorithms and use them in ensemble learning, to take advantage of each of them and improve learning effectiveness. Finally, we intend to use blockchain in big data analytics to provide a reliable, secure, and transparent way of managing and analyzing data.
To combine both datasets into a single one, we used a Cypher query that selects the 24 attributes from the KDD dataset and the 22 attributes from the Darpa dataset. In the Cypher query, we added two attributes with null values to the 22 Darpa attributes to equalize the attribute sets. The MATCH clause in Cypher specifies the pattern of nodes and relationships to be matched in the graph; it is used in combination with the WITH clause to vertically combine the results of multiple MATCH clauses. The WITH clause allows the results of one MATCH clause to be passed as input to another MATCH clause.
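The vertical combination can also be sketched outside Cypher. The following Python illustration pads the Darpa rows with null values for the attributes they lack before appending them to the KDD rows (attribute names here are hypothetical placeholders, not the actual 24-attribute schema):

```python
def combine_datasets(kdd_rows, darpa_rows, kdd_attrs, darpa_attrs):
    """Vertically combine two record sets on the KDD attribute set,
    padding the attributes Darpa lacks with None (null values)."""
    missing = [a for a in kdd_attrs if a not in darpa_attrs]
    combined = [dict(r) for r in kdd_rows]
    for r in darpa_rows:
        row = dict(r)
        for a in missing:
            row[a] = None  # plays the role of the null-valued Cypher attributes
        combined.append(row)
    return combined
```

This mirrors what the MATCH/WITH queries do inside Neo4j: both sources end up with the same attribute set in one uniform table.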

Figure 7. Data created in the Neo4j database after the merge phase.
support = (transactions including X and Y) / (total number of transactions) (3)
Confidence: percentage of transactions containing X which also contain Y:
confidence = (transactions including X and Y) / (number of transactions including X) (4)
The algorithm requires the user to define the default data:

Figure 12. Analysis results of the KDD_Darpa99 database.

Table 4. Number of records with and without redundancies.

Table 5. Selected attributes for each individual attack category.

Table 6. Number of records of the final dataset.