Cybersecurity Deep: Approaches, Attacks Dataset, and Comparative Study

ABSTRACT Cyber attacks are increasing rapidly as attackers adopt advanced digital technologies, making cybersecurity a rapidly growing field. Although machine learning techniques have worked well on large-scale cybersecurity problems, the emergence of deep learning (DL) has allowed information security specialists to improve on those results. The deep learning techniques analyzed in this study are convolutional neural networks (CNN), recurrent neural networks (RNN), and deep neural networks (DNN) in the context of cybersecurity. A framework is proposed, and a real-time laboratory setup is used to capture network packets and examine the captured data using various DL techniques. A comparative analysis of the DL techniques is presented using essential parameters, particularly accuracy, false alarm rate, precision, and detection rate. The experimental results indicate that DL techniques improve the performance of various real-time cybersecurity applications on a real-time dataset. The CNN model provides the highest accuracy of 98.64% with a precision of 98% for binary classification. The RNN model offers the second-highest accuracy of 97.75%. The CNN model also provides the highest accuracy of 98.42% for multiclass classification. The study shows that DL techniques can be effectively used in cybersecurity applications. Future research areas are elaborated, including potential research topics for improving several DL methodologies for cybersecurity applications.


Introduction
Internet usage increased significantly during the pandemic, and sizable interconnected networks now face multiple security threats in cyberspace. Various security organizations worldwide continue to develop innovative techniques to protect peripherals and sensitive data from cyberattacks. Broad security practices include network-based security systems (Zhengibing, Zhitang, and Junqi 2008) and host-based security systems (Hu 2010) that protect connected peripherals from illegal intrusion.
These systems combine multiple devices, primarily firewalls, intrusion detection systems (IDS), and threat protection, with simple control over system practices and alerts raised according to the configured detection priority. Intrusion detection plays an essential role in information security and helps detect illegal access to, changes to, and destruction of information systems (Mukkamala, Sung, and Abraham). IDS are generally divided into signature-based, statistical anomaly-based, and combined approaches. Signature-based detection uses predefined signatures of abuse activity to classify intrusion attempts. Statistical anomaly-based detection models normal behavior and identifies suspicious activity based on deviations from that baseline.
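The difference between the detection approaches can be illustrated with a minimal sketch. The signature list, the z-score rule, and the threshold of 3.0 are assumptions chosen for illustration and are not taken from any product or from the experiments in this study.

```python
from statistics import mean, stdev

# Hypothetical signature list: byte patterns assumed to indicate abuse.
SIGNATURES = [b"' OR 1=1", b"/etc/passwd", b"<script>"]

def signature_match(payload):
    """Signature-based check: flag a payload that contains a known pattern."""
    return any(sig in payload for sig in SIGNATURES)

def anomaly_score(value, baseline):
    """Statistical anomaly check: z-score of a measurement against a baseline."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return abs(value - mu) / sigma if sigma > 0 else 0.0

def is_anomalous(value, baseline, threshold=3.0):
    """Flag a measurement that deviates strongly from the baseline (assumed threshold)."""
    return anomaly_score(value, baseline) > threshold

def combined_detect(payload, value, baseline):
    """Combined approach: raise an alert if either detector fires."""
    return signature_match(payload) or is_anomalous(value, baseline)

# Hypothetical baseline: packets per second observed during normal operation.
normal_rates = [100, 102, 98, 101, 99]
alert = combined_detect(b"GET /index.html", 500, normal_rates)
```

Here the payload matches no signature, but the traffic rate is far outside the baseline, so only the anomaly detector contributes to the alert.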
The combined approach applies both misuse detection and anomaly detection techniques (Alazab et al. 2010). Many vendors, such as Microsoft, Checkpoint, Symantec, McAfee, and Kaspersky, offer advanced anti-malware, anti-virus, and threat protection products to protect networks and user data from attacks. These vendors typically use signature-based approaches to identify malware. Ransomware attacks (Mcintosh et al. 2019), zero-day attacks (Alazab et al. 2011), unauthorized access (Shenfiled. Dey, and Ayesh 2018), denial of service (DoS) (Larson 2016), data breaches (Low 2017), phishing (Binks 2019), social engineering (Krombholz et al. 2015), etc. are common nowadays. These security incidents or cybercrimes intensely impact businesses and people, causing disruption and overwhelming business and financial losses. DL algorithms have found an essential role in solving complex problems. DL can be described as a set of multi-layered ML techniques that capture general representations from complex, vast amounts of data. DL offers cybersecurity companies a way to modernize security systems at an optimal cost. The popularity trend is shown in Figure 1, determined from Google Trends from 2019 to 10th January 2022.
Researchers seek complete and accurate data to advance their approaches. However, obtaining the correct, valuable data is a significant difficulty. The contributions of the work are summarized as follows:
(1) A framework for cybersecurity based on deep learning is proposed.
(2) Real-time data is collected using a lab setup, and the proposed methodology for the evaluation process is illustrated.
(3) A detailed evaluation with live data is carried out using three different deep learning techniques. It shows how deep learning techniques can be used effectively in cybersecurity applications.
(4) The key areas of deep learning in the cybersecurity domain and open gaps are explored so that new studies can be targeted.
The remainder of the paper is structured as follows. Section 2 outlines the relevant works, different deep learning approaches, and common types of cybersecurity attacks. Section 3 introduces the proposed deep learning-based cybersecurity framework. Section 4 discusses the real-time lab setup, the log acquisition process, the methodology employed in the experimentation process, and performance analysis using DL approaches. Section 5 presents the future scope along with open research challenges. Finally, Section 6 presents the conclusion.

Related Works
Several relevant studies apply DL methods to the detection of cybersecurity attacks. We therefore classify the studies listed in Table 1 according to the following criteria:

DL Techniques
This criterion indicates whether the case study defines DL techniques for cybersecurity attack detection systems.

Figure 1. Popularity score of "cyber security" and "deep learning" worldwide from 2019 to 10th January 2022. The x-axis represents the timestamp, and the y-axis represents the popularity score.

Interpretation of DL Techniques
This criterion indicates whether the study evaluates DL techniques for cybersecurity attack detection systems. A further criterion, data related to IDS, indicates whether the work covers the data associated with cybersecurity attack detection systems. (Milenkoski et al. 2015) presented the typical applications of cybersecurity IDS by examining existing systems against model assessment criteria. (Loukas et al. 2019) presented a study on cyber-physical IDS strategies for vehicles. (Mahdavifar and Ghorbani 2019) presented a survey on deep learning cybersecurity applications; the highlighted studies focus on Android-based malware detection and analysis. (Ferrage et al. 2020) presented a comprehensive overview of cybersecurity attack detection using DL approaches. In addition to public datasets, various deep learning models are discussed.
Furthermore, two datasets are evaluated with DL methods. (Ring et al. 2019) presented a survey of intrusion detection data. The study maps different datasets and identifies their special features. Our study focuses on DL techniques intended for cybersecurity IDS on real-time datasets.
However, these works (Mahdavifar and Ghorbani 2019), (Wu et al. 2020), (Yang et al. 2018a) do not provide a comprehensive analysis of DL methods applied to data. Therefore, our study evaluates deep learning approaches using a real-time dataset generated by the proposed lab facility instead of a public dataset.

Deep Learning Approaches
Deep learning (DL) is a branch of artificial intelligence (AI) derived from artificial neural networks (ANN) (Sarker et al. 2020). DL's main advantage over traditional machine learning (ML) techniques is its stronger performance in numerous cases, especially when learning from enormous amounts of data (Yang et al. 2018a). This section explains the DL techniques related to cybersecurity IDS. Three deep learning methods for cybersecurity IDS are used in this study, namely (a) CNN, (b) RNN, and (c) DNN.

Convolution Neural Network (CNN)
A convolutional neural network extracts features at a fine resolution and then combines them into more complex, higher-level features, as shown in Figure 2. (Yang et al. 2018b) proposed a feature training system to categorize malicious traffic. CNN is used to learn the classification, and raw traffic flows are converted into images that are fed to the CNN. The CNN classifies suspicious traffic through image analysis based on the USTC-TRC2016 dataset. (Chowdhury et al. 2017) suggested using CNN-based techniques for IDS. The CNN is trained, and the outputs of successive CNN layers are collected and provided as input to SVM (Support Vector Machine) and KNN (k-nearest neighbor) intrusion detection techniques using the KDD 99 and NSL-KDD records. (Guo, Wang, and Wei 2018) proposed an analysis of malware detection using CNN. Their CNN includes two convolutional layers, pooling layers, and inner product layers. The malicious applications used in that work were downloaded from the Virus Share website. (Khan et al. 2019) proposed a CNN model that automatically extracts the features of intrusion samples. The authors used the KDD99 dataset to test the accuracy of intrusion detection.
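The layer types mentioned above (convolution, pooling, and inner product) can be sketched in plain Python on a one-dimensional input. This is only an illustration of the operations: the input values, the kernel, and the weights are hypothetical, and the sketch is not the architecture of any cited study or of the models evaluated here.

```python
def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation, the core of a convolution layer."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    """Element-wise rectified linear activation."""
    return [max(0.0, x) for x in xs]

def max_pool1d(xs, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def dense(xs, weights, bias):
    """Inner product (fully connected) layer with a single output unit."""
    return sum(x * w for x, w in zip(xs, weights)) + bias

# Hypothetical input: a short sequence of per-packet feature values.
features = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 4.0]
edge_kernel = [-1.0, 1.0]  # responds to increases between neighbouring values

feature_map = relu(conv1d(features, edge_kernel))   # low-level features
pooled = max_pool1d(feature_map, 2)                 # coarser summary
score = dense(pooled, [0.5, 0.5, 0.5], bias=-1.0)   # combined into one output
```

In a trained network the kernel and dense weights are learned from data rather than fixed by hand as they are here.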

Recurrent Neural Network (RNN)
An RNN is a neural network whose connection graph contains at least one cycle, as shown in Figure 3. (Kim et al. 2016b) proposed a framework using the KDD Cup 1999 dataset for a recurrent neural model for intrusion detection. They reported a rate of 98.8% across the total attack patterns. (Yin et al. 2017) integrated a recurrent neural network into an IDS. They used the NSL-KDD dataset and assessed performance based on accuracy, false-positive rate, and true-positive rate. The study also highlighted the benefits of using RNN for IDS. (Kim et al. 2016a) presented an LSTM recurrent neural network method for intrusion detection data. They achieved an accuracy of 93.85% and a FAR of 1.62%. (Brown et al. 2018) presented log anomaly detection using RNN. They evaluated model performance using the Los Alamos National Laboratory (LANL) cybersecurity dataset.
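The cycle in an RNN means that the hidden state computed at one step is fed back in at the next step. A minimal single-unit sketch, assuming the common recurrence h_t = tanh(w_x x_t + w_h h_(t-1) + b) with hypothetical scalar weights, is:

```python
import math

def rnn_forward(xs, w_x, w_h, b, h0=0.0):
    """Run a single-unit recurrent cell over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in xs:
        # The previous state h is fed back in, which forms the cycle in the graph.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# With a recurrent weight, the first input still influences the second state.
with_memory = rnn_forward([1.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
# With the recurrent weight set to zero, the second state depends only on its own input.
without_memory = rnn_forward([1.0, 0.0], w_x=1.0, w_h=0.0, b=0.0)
```

Comparing the two runs shows why recurrent models suit sequential inputs such as packet streams or log sequences: the state carries information from earlier elements.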

Deep Neural Networks (DNN)
A DNN is a multilayer perceptron (MLP) with multiple hidden layers, a class of feed-forward artificial neural network, as shown in Figure 4. (Tang et al. 2016) presented an intrusion detection system using CNN and other DL methods in software-defined networks. They used the NSL-KDD dataset, and the experimental results showed that a learning rate of 0.001 performed better than the other values. A deep adversarial and statistical learning method to detect network intrusions has also been presented; the proposed system uses two main components, a discriminator and a generator. (Wu and Guo 2019) presented a deep learning model for detecting intruders in an extensive network. The models are evaluated using NSL-KDD and UNSW-NB15. Finally, (Rezvy et al. 2018) proposed an intrusion detection and classification model using a deep autoencoder with a dense neural network. They used the NSL-KDD dataset and reported an overall detection accuracy of 99.3%.

Cybersecurity Attacks
Sophisticated attacks penetrate cybersecurity operations. This section reviews several publications on detecting cybersecurity crimes using deep learning, examines some types of attacks, and discusses the variety of intruders and their targets. A summary of common cyber attacks is presented in Figure 5.

Attacker Types
Attackers can be divided into three types. First, in a black-box attack, the attacker has no prior knowledge of the system or the deep learning model. Second, in a gray-box attack, the attacker knows some of the system's information and design elements. Third, in a white-box attack, the attacker has extensive details about the model, which occurs only in the most critical case. The motive also varies. An attack that pushes the targeted neural network into misclassification is called an integrity violation. If the attack prevents the targeted operation from being available or running for some time, it is treated as an availability violation. When an attacker attempts to obtain private information, it is treated as a privacy violation. Attacks are carried out in two main forms: targeted and random. In a targeted attack, the adversary tries to obtain an incorrect output for a specific part of the training samples. In a random attack, the intruder targets the training samples indiscriminately.

Denial of Service Attack (DoS)
A DoS attack is carried out by transmitting a significant volume of traffic to the selected target so that legitimate users can no longer access the network service. The ultimate goal is to permanently or temporarily suspend or terminate the service (Diro and Chilamkurti 2018).

Probing
The attackers scan the networks and obtain information and data about them (Radford, Metz, and Soumithchintala 2015).

User to Root Attack
The attacker gains access to the system through a regular user account and then attempts to obtain root privileges. Credentials can be captured, and confidential information may be compromised (Xiong and Yu 2018).

Remote to Local Attack
The attackers gain access to the system from a remote machine by exploiting vulnerabilities previously introduced into system communication and execution. Remote abuse seems easier to stop, while local attacks are difficult to identify (Mnih et al. 2016).

Adversarial Attacks
Adversarial attacks require that DL-related privacy issues be handled appropriately. For example, (Mahloujifar, Diochnos, and Mahmoody 2019) examined a bias attack technique to address this obstacle in a realistic environment.

Poisoning and Evasion Attacks
Poisoning attacks are performed during the DL training phase. The attacker injects contaminated samples into the training data, reducing the prediction performance of the DL technique. An evasion attack targets the DL prediction process. (Jiang et al. 2020) used the Particle Swarm Optimization (PSO) method to counter poisoning and focused on the training phase. In short, poisoning attacks target the training phase, whereas evasion attacks target the inference phase.

Integrity Attacks
In an integrity attack, information is altered or misrepresented while still appearing to be functional. The attacker often accompanies this attack by encrypting company-owned assets and demanding a large payment for decryption.

Causative Attacks
A causative attack targets the decision-making process to induce misleading classifications by the neural network. (Sihag and Tajer 2020) recommended a method that evaluates the protected framework to detect abuse and isolates the neural network design to overcome this obstacle. The publicly available cybersecurity datasets are summarized in Table 2.
This section examined related work, different DL approaches, and common cybersecurity attacks.

Proposed Deep Learning-based Cybersecurity Framework
This section illustrates a standard DL framework structure for cybersecurity. The design is intended to be as comprehensive as possible so that it can address a variety of cybersecurity challenges. The visual representation of the designed system is shown in Figure 6. The functional area emphasizes the selection of data sources for the proposed framework. The general structure consists of four main elements.

Analysis
The workflow for this structure starts with a static or dynamic examination of different file formats, including network traffic, flow logs, and other log files such as web and cloud logs. In the static analysis phase, the input file is decompressed to extract matching features based on the predefined ruleset. In the dynamic analysis phase, packet streams are recorded using the defined filter patterns. The framework also uses an on-demand control protocol that allows the filtering rules to be corrected as network requirements change.

Feature Extraction
In this module, the static features, comprising web URL features, raw log data, and access control-related details, are extracted from the accessible resources. In addition, the dynamic features are extracted from the collected log files, including system sequences, system resources, traffic flow, registry keys, file details, and domain-based details. The network feature extractor is a flow-based extractor that can derive network traffic characteristics from various capture formats and libraries such as libpcap, WinPcap, PCAPng, and Npcap.

Pre-processing
Input features can be reduced to lower-dimensional subspaces using random projection or principal component analysis before they are fed to the deep network (DN) model, depending on the dimension of the feature matrix. Based on the generated features, nominal values must be converted to numeric values and normalized to a target range such as [0, 1]. However, there is no need to reduce the dimensionality, as the DN is designed to achieve this naturally. A DL model should learn the high-level features from the low-level ones layer by layer. Therefore, feature transformation before training the DN architecture would not be prudent and would defeat the purpose of DL techniques.

DL Classifier
The DL classifier takes the final feature matrix as input and is trained using the greedy layer-by-layer learning technique (Tavallaee et al. 2009). Although a generic DL block is shown in the framework, it can be implemented by an ANN, RNN, CNN, DBN, or DNN, depending on the nature of the input feature matrix. For example, CNN has been very effective at classifying images, and RNN is suited to processing input sequences.

Data Collection and Implementation
The goal is to create an unbiased, real-time intrusion dataset that combines a variety of real-world attacks. This section describes the generation of real-world network attack records, including infiltration, brute force, DDoS, and SQL injection records. The experiments are performed in an Anaconda environment with Python 3.8 using TensorFlow and Keras. The proposed infrastructure includes six elements: (4.1) proposed setup, (4.2) methodology, (4.3) feature selection, (4.4) processing, (4.5) performance evaluation, and (4.6) outcome and discussion, described in Figure 8.

Proposed Lab Setup
The proposed setup consists of a router, a switch, two laptops, and a server. Two virtual machines are installed on each laptop as virtual servers running the Ubuntu operating system, as shown in Figure 7. Snort version 3.0, released in January 2021, is used as the IDS software; Wireshark is used to capture packets; the Scythe and Netssi2 tools are used to generate attack scenarios; and a KIWI Syslog server and a MySQL database are used to store the logs. In this setup, the Snort analyzer summarizes and recognizes the packets; the analytics engine is an integral part of Snort. Attack simulations are generated from both internal and external networks.

Proposed Methodology
The methodology adopted in this study is illustrated in Figure 8. After the logs are captured, they are stored in the MySQL database. The dataset is split into training and testing sets: 70% is used for training and 30% for testing. In the training phase, logs are classified as benign or attack. The records are pre-processed and then normalized, and different deep learning approaches are applied to the training dataset. The same processing and normalization are applied to the testing data. An attack detection model is then developed from the training and test inputs.
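The split and normalization steps can be sketched as follows. This is a minimal illustration under stated assumptions: the shuffle seed, the min-max scaling to [0, 1], and the choice to learn the scaling bounds from the training rows only are not specified in the study, and the helper names are hypothetical.

```python
import random

def train_test_split(rows, train_percent=70, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

def fit_min_max(rows):
    """Per-column minimum and maximum, learned from the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale each column with the given bounds; constant columns map to 0.0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical numeric records: [bytes transferred, duration].
records = [[float(i * 10), 5.0] for i in range(10)]
train, test = train_test_split(records)
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)  # same bounds as training
```

Because the bounds come from the training rows, a test value outside the training range can fall slightly outside [0, 1]; whether to clip such values is a design choice left open here.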

Feature Selection
The information gain mechanism is used for feature selection on the dataset. Logs are collected based on 17 features. Traffic is classified into seven categories, shown in Table 3.
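Information gain measures how much knowing a feature reduces the entropy of the class label. A minimal sketch for categorical features follows; the toy feature values and labels are hypothetical, and numeric features would first need to be discretized, which is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after grouping by the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: protocol separates the classes, port does not.
labels = ["attack", "attack", "benign", "benign"]
protocol = ["udp", "udp", "tcp", "tcp"]
port = ["80", "443", "80", "443"]

gain_protocol = information_gain(protocol, labels)
gain_port = information_gain(port, labels)
```

Features can then be ranked by their gain, keeping those with the highest values.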

Processing
The dataset contains 1,088,365 rows across four files, each row having 17 features. We parsed the files and removed column headers repeated in some data files. About 9,768 samples were dropped during the data clean-up process. Table 4 summarizes the dataset used for the experiments. Each of the cleaned datasets contains 17 features, two of which, target port and protocol, are treated as categorical using 1-to-n encoding; the rest are numeric. Table 5 presents the number of traffic samples of each type across all datasets. The total number of attack samples is 1,078,597, shown graphically in Figure 9.
Figure 9. Attack samples.
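The two clean-up steps described above, dropping repeated header rows and 1-to-n (one-hot) encoding of categorical columns, can be sketched as follows. The column names and sample rows are hypothetical and do not come from the collected dataset.

```python
def drop_repeated_headers(rows, header):
    """Remove rows that are copies of the column header."""
    return [row for row in rows if row != header]

def one_hot(values):
    """1-to-n encoding: one indicator position per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        encoded.append(vector)
    return categories, encoded

# Hypothetical parsed rows in which the header reappears mid-file.
header = ["protocol", "dst_port"]
raw_rows = [["tcp", "80"], ["protocol", "dst_port"], ["udp", "53"], ["tcp", "443"]]

clean_rows = drop_repeated_headers(raw_rows, header)
categories, protocol_encoded = one_hot([row[0] for row in clean_rows])
```

For a high-cardinality column such as the target port, this encoding produces one position per distinct port, so grouping rare ports into a shared category may be worth considering.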

Performance Evaluation
Performance evaluations are performed on this dataset to determine the capacity of deep learning approaches to detect cyberattacks and respond within the performance limitations. The most critical performance indicators, namely detection rate, false alarm rate, precision, and accuracy, are presented in Table 6, where TP denotes True Positive, TN denotes True Negative, FP denotes False Positive, and FN denotes False Negative.
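Table 6 is not reproduced here, so the sketch below assumes the standard confusion-matrix definitions of the four indicators; the counts in the example are hypothetical.

```python
def ids_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics (assumed definitions)."""
    return {
        # Share of all samples classified correctly.
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Share of actual attacks that were detected (also called recall).
        "detection_rate": tp / (tp + fn),
        # Share of benign samples wrongly flagged as attacks.
        "false_alarm_rate": fp / (fp + tn),
        # Share of raised alerts that were real attacks.
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: 100 attack samples and 100 benign samples.
metrics = ids_metrics(tp=90, tn=95, fp=5, fn=10)
```

A low false alarm rate matters in practice because every false positive consumes analyst time, even when accuracy is high.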

Results and Discussion
Deep learning approaches are applied to the dataset, and performance outcomes such as accuracy, detection rate, and false alarm rate are reported. Normalizing the numeric features was explored, but the performance variation was too small to justify normalizing numeric values for all the experiments. The learning rate ranges from 0.01 to 0.8, the number of hidden nodes ranges from 15 to 100, the batch size is 2000, and sigmoid is employed as the activation function. The correlation map of the dataset is presented in Figure 10.
The ROC (Receiver Operating Characteristic) curve with the highest detection rate among the three techniques is presented in Figure 11. Table 8 shows the accuracy and training time of the DL models with various parameters on the dataset. Compared with both the DNN and RNN networks, CNN obtains a higher accuracy of 98.35%.
Tables 9 and 10 illustrate the accuracy results over 100 epochs for the binary-class and multiclass experiments.
In the binary-class experiments, the CNN model provides the highest accuracy, about 98.36%, with a precision of 98%. The RNN model offers the second-highest accuracy of 97.75%. In the multiclass experiments, the CNN model gives the highest accuracy of 98.42%, with a precision of 98%. The RNN model offers the second-highest accuracy of 97.75%. The comparison of the models is presented in Figure 12.

Future Scope and Open Research Challenges
Researchers have introduced various methods using DL algorithms to identify, classify, and predict threats across the diverse field of cybersecurity. Figure 13 describes the main areas where DL can be used for cybersecurity. First, handling unnecessary security alerts and noisy or inaccurate results remains a major challenge in deep learning. Second, deep learning techniques tend to perform poorly when the training instances are of poor quality, for example when they contain bad or irrelevant features or when training data is insufficient. Most research results have been obtained using public datasets. Research should therefore emphasize building real-time setups to validate deep learning approaches, although accessing real-time datasets is itself a challenge. The experiment in this study primarily produces accessible data; researchers can refine the study to examine various open-source data in the future. In cybersecurity, DL procedures are associated with higher costs for error correction. In addition, DL methods behave as black boxes, so the underlying causes of an error are complex to fix. Therefore, researchers should focus on the dominant features of intrusions to develop an effective cybersecurity learning technique. Strong computing power, CPU capacity, extensive storage, and adequate expertise remain the primary resource requirements. DL methods for solving cybersecurity challenges need not be limited to a single technique; researchers can combine a DL design with various machine learning methods to discover essential data. Researchers can also focus on combinations of multiple deep learning models to improve performance in the future.

Conclusion
Rapid technological change makes securing systems a challenging task. It is therefore advisable to adopt more innovative ways of dealing with current threats by taking advantage of deep learning technologies. We present a broad summary of cybersecurity applications of deep learning approaches. In this study, three DL techniques are examined and discussed. First, the common cyber attacks and the publicly accessible datasets are discussed. Then a proposed framework for cybersecurity using DL techniques is illustrated for general applications. Next, a lab setup is used to capture live network packets, analyze real-time cybersecurity attacks, and assess various essential metrics, namely false alarm rate, detection rate, accuracy, precision, recall, etc. Finally, the challenges facing researchers, including technological and operational aspects, are examined, highlighting future directions for research on DL in cybersecurity.

Disclosure statement
No potential conflict of interest was reported by the author(s).