Distributed deep learning approach for intrusion detection system in industrial control systems based on big data technique and transfer learning

ABSTRACT Industry 4.0 refers to a new generation of connected and intelligent factories that is driven by the emergence of new technologies such as artificial intelligence, Cloud computing, Big Data and industrial control systems (ICS) in order to automate all phases of industrial operations. The presence of connected systems in industrial environments poses a considerable security challenge; moreover, with the huge amount of data generated daily, complex attacks occur in seconds and target production lines and their integrity. Until now, however, factories have not had all the necessary tools to protect themselves and mainly rely on traditional protection. In order to improve industrial control systems in terms of efficiency and response time, the present paper proposes a new distributed intrusion detection approach that uses artificial intelligence methods and Big Data techniques and is deployed in a cloud environment. A variety of Machine Learning and Deep Learning algorithms, chiefly convolutional neural networks (CNN), have been tested to compare performance and choose the most suitable model for the classification. We test the performance of our model using the industrial dataset SWaT.


Introduction
Artificial intelligence (AI) in cybersecurity has become essential in increasing the efficiency and precision of threat detection in manufacturing systems. As manufacturing industries expand globally, we generate vast amounts of data to gain insights and improve our product offerings. The global AI in cybersecurity market (Channe, 2019) is predicted to grow at a significant rate, driven by factors such as increasing data privacy regulations, the rise in cyber-attacks, and the adoption of digital and cloud-based solutions. To protect smart manufacturing systems from cyber risks, there is a growing need for systems and programs that can detect, predict, and analyze threats.
AI offers a promising solution to address these challenges, and many industry players are focusing on utilizing AI for cybersecurity. The convergence of AI and security is a natural fit, and machine learning techniques, such as deep learning and convolutional neural networks (CNNs), are filling the gaps left by previous rule-based data security systems. The importance of AI in cybersecurity has been recognized at a global level, with the need for strategic investment and systematic development highlighted in reports such as the White House report on AI (Executive office of the president of the united states, 2018).
The emergence of Industry 4.0 (Deloitte, 2015), characterized by connected, robotic, and intelligent factories, further emphasizes the need for AI in our industrial systems. However, the presence of connected systems in industrial environments poses significant security challenges, as many companies lack accurate standards, management skills, and techniques to effectively implement cybersecurity measures (Lezzi et al., 2018). This increases the vulnerability of our industrial IT systems and the risk of intrusions.
In recent years, industrial control systems (ICS) have been victims of cyber attacks, resulting in financial losses, physical damage, and potential human casualties. Traditional protection tools, such as firewalls and antivirus software, are insufficient to defend against complex attacks. While some factories deploy intrusion detection systems (IDS) to maintain system security, existing IDS solutions like Snort and Bro have limitations in detecting attacks that are not part of their databases.
To address these challenges, we propose a new approach for distributed intrusion detection in ICS using artificial intelligence methods, Big Data techniques, and a cloud environment. Building upon our previous work, which achieved a 99% accuracy rate using a Gradient-Boosted Trees (GBTs) classifier, we extend our approach (Abid et al., 2022) by introducing a Distributed Deep Learning method based on convolutional neural networks (CNNs). We test the proposed approach on the Secure Water Treatment (SWaT) dataset to compare its performance with our previous approach and select the most suitable model for classification. The paper is organized as follows: Section 2 provides an overview of related works in the field of intrusion detection in ICS. Section 3 describes the tools and data used in our study. Section 4 presents our previous proposed approach, while Section 5 introduces the new approach based on Deep Learning and CNNs. Section 6 evaluates the experimental results, and finally, Section 7 presents our conclusions and outlines future works.

Related work
Before delving into the implementation of our approach, it is crucial to conduct a thorough examination of existing solutions that focus on detecting and classifying intrusions in both general network environments and particularly in industrial control systems. By assessing the effectiveness of these established methods, we can determine their relevance and applicability to our own approach. This analysis will help us select the most appropriate techniques to integrate into our system.
In Abedin and Waheed (2022), the authors conducted experiments to train and evaluate various machine learning algorithms, including DT, AdaBoost, GBT, MLP, LSTM, and GRU for binary intrusion classification. They utilized two datasets: UNSW-NB 15 and Network TON_IoT. To address the challenge of imbalanced class distribution, they introduced a feature selection process based on Gini Impurity and weighted RF. The experimental findings indicated that DT, when combined with the feature selection technique, exhibited superior performance. Gu and Lu (2021) proposed an effective approach for intrusion detection systems (IDS) by utilizing Support Vector Machines (SVM) in conjunction with the Naïve Bayes algorithm. Their approach involved employing Naïve Bayes to transform the original features into new data. This transformed data was then utilized to train an SVM model for the classification of intrusions. To assess the performance of their approach, the authors conducted experiments on multiple datasets: UNSW-NB 15, CICIDS2017, NSL-KDD, and Kyoto 2006+. Notably, they achieved promising results across these datasets. For the CICIDS2017 dataset, they achieved an accuracy of 98.92%. The NSL-KDD dataset yielded an accuracy of 99.35%. The UNSW-NB 15 dataset resulted in an accuracy of 93.75%, while the Kyoto 2006+ dataset showed an accuracy of 98.58%.
In Jemili (2023), the authors proposed a novel approach that combines heterogeneous datasets using Big Data analytics tools. They merge diverse datasets, eliminate redundant information, and leverage popular tools like Hadoop/MapReduce and Neo4j. Machine learning algorithms are applied to extract meaningful insights, utilizing the SSDM algorithm for generating association rules and the K2 algorithm for learning Bayesian networks. Through extensive experimentation, the authors demonstrate the effectiveness of their data fusion approach, achieving highly accurate and valuable outcomes by leveraging the power of data fusion with Big Data analytics tools. In addition, Abid et al. (2023) introduced a new approach to real-time intrusion detection in ICS, leveraging Cloud Computing and Big Data techniques for data fusion. The contributions of this work are twofold. Firstly, it proposes a real-time IDS that overcomes the limitations of traditional systems through the efficient processing capabilities of Cloud Computing and Big Data techniques. Secondly, it employs data fusion to integrate diverse data sources, resulting in improved intrusion detection accuracy and efficiency. The proposed IDS achieves higher accuracy rates and demonstrates superior efficiency in detecting intrusions compared to existing solutions.
In their study (Sarkar et al., 2022), the authors proposed an ensemble technique for Intrusion Detection using ML models. To address class imbalance, data augmentation techniques are employed to rebalance the KDD Cup99 and NSL-KDD datasets. The proposed methodology adopts a three-step approach that utilizes a cascaded structure with MLP models to enhance intrusion detection. Furthermore, a cascaded meta-specialized classifier architecture is developed to classify each class separately, improving the detection quality significantly. The approach achieves a classification accuracy of 89.32% and a low False Positive Rate (FPR) of 1.95%. To further enhance detection capability, the authors integrate the predictions of the best algorithms by increasing their weights. This integration leads to improved detection performance, as demonstrated by the high accuracy of 87.63% and a low FPR of 1.68% achieved on the NSL-KDD dataset.
The authors of Moustafa (2021) introduced a recently developed dataset known as the Network TON_IoT dataset. The authors employed a Wrapper Feature Selection technique based on Random Forest (RF) to identify the most relevant features from this dataset. They further evaluated the performance of Gradient Boosting Machine (GBM), Random Forest (RF), Naïve Bayes (NB), and Deep Neural Network (DNN) algorithms using the selected features. The GBM achieved an accuracy score of 93.83%. The RF model obtained an impressive accuracy score of 99.98%, while the DNN model achieved a slightly lower accuracy score of 99.92%. For the Naïve Bayes algorithm, the authors reported an Area Under the Curve (AUC) score of 91.28%.
The authors of the paper (Abid et al., 2022) developed a distributed intrusion detection system utilizing various machine learning algorithms. Their approach leveraged Big Data techniques and employed Apache Spark in a cloud environment, specifically Databricks. The proposed system was evaluated using the Secure Water Treatment (SWaT) dataset, an industrial dataset. By employing the Gradient-Boosted Trees (GBTs) classifier, the distributed intrusion detection system achieved an impressive accuracy rate of 99%. Additionally, the system demonstrated reasonable response times, indicating its efficiency in real-time intrusion detection.
In Khan and Serpen (2019), the authors proposed a misuse intrusion detection system to identify attacks on a gas pipeline SCADA dataset. They employed three Machine Learning algorithms, Naive Bayes, PART, and Random Forest (RF), using WEKA and performed a feature extraction process to improve their results. Results showed acceptable performance for both classification categories (binary and multi-class) for the Random Forest classifier compared to the other classifiers.
Perales Gomez et al. (2020) presented a new Methodology for Anomaly Detection in Industrial Control Systems (MADICS) to detect attacks in ICS. It is based on a semi-supervised anomaly detection paradigm and makes use of deep learning algorithms to model ICS behaviours. It consists of five main steps: pre-processing the dataset to be used with the machine learning and deep learning algorithms; performing feature filtering to remove those features that do not meet the requirements; feature extraction to obtain higher-order features; selecting, fine-tuning, and training the most appropriate model; and validating the model performance. To evaluate MADICS, the authors used the Secure Water Treatment (SWaT) dataset. Experiments showed that MADICS achieved good results, with a precision of 98.4%, a recall of 75% and an F1-score of 85.1%.
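The three figures reported for MADICS are linked by the standard definition of the F1-score as the harmonic mean of precision and recall. The short sketch below simply recomputes that relationship from the reported precision and recall; the function name is ours and is not taken from the cited work.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for MADICS on SWaT in the paragraph above.
madics_f1 = f1_score(0.984, 0.75)
print(f"F1 = {madics_f1:.3f}")
```

The result rounds to 0.851, which agrees with the reported F1-score of 85.1%.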
The authors of Alhaidari and AL-Dahasi (2019) presented an IDS based on three machine learning algorithms, namely J48, Naive Bayes and Random Forest, to detect Distributed Denial of Service (DDoS) attacks in SCADA systems, evaluated on the KDDCup'99 dataset. Their results show that the Random Forest classifier yields the best accuracy with 99.99%, while Naïve Bayes yields the worst accuracy with 97.74%.
Kravchik and Shabtai (2018) introduced a study on detecting cyberattacks on industrial control systems (ICS). This work applies unsupervised deep neural networks, especially convolutional neural networks, to the Secure Water Treatment (SWaT) testbed dataset. They tested the proposed approach by employing various deep neural network architectures, including several variants of convolutional and recurrent networks. This approach achieved a low rate of false positives and showed that 1D-CNN outperformed more complex recurrent networks while being much faster to train.
Wang et al. (2021) created a model based on a deep residual Convolutional Neural Network (CNN) to prevent gradient explosion and assure accuracy. They employed visualizations to convert flow data into visual representations. First, they used the KDDCUP99 dataset to pretrain the Resnet8 model and saved it with its parameters after testing its performance. Second, they loaded the pretrained model with its parameters and, using an ICS dataset (the gas pipeline dataset), constructed an eight-layer residual neural network and applied fine-tuning to transfer the learned knowledge to the detection of abnormal ICS data. This approach showed great results in terms of accuracy (0.9909), recall (0.9906) and precision (0.9955), with a low FPR (0.0085) and a training time of 51 min 58 s.
Lai et al. (2019) developed an anomaly detection method for industrial control systems, especially SCADA systems, based on the CNN deep learning representation model to detect and classify attacks. They proposed a feature mapping method based on the Mahalanobis distance to convert one-dimensional flow data into a two-dimensional matrix to be used as a CNN input. Experiments showed that the proposed system achieved good performance on both binary and multi-class classification, with accuracies of 99.46% and 99.32%, respectively, and low time costs for feature mapping and detection of 0.253 s and 0.192 s, respectively.
The authors of Elnour et al. (2020) introduced a new semi-supervised dual isolation forests-based (DIF) attack detection approach for industrial control systems datasets such as the Secure Water Treatment (SWaT) testbed and the Water Distribution (WADI) testbed. They trained two Isolation Forests (IF) independently. The first IF was trained on the normalized data and the second was trained on a preprocessed version of the data obtained by applying the Principal Component Analysis (PCA) method. The developed system showed improvements in terms of computational complexity, attack detection capability and suitability for complex and high-dimensional data.
In Kim et al. (2020), the authors created an anomaly detection system for ICS using sequence-to-sequence neural networks. The sequence-to-sequence method was applied to model and predict ICS operational data and to interpret their time-series characteristics. To evaluate their method, the authors used the Secure Water Treatment (SWaT) dataset. Experiments showed that this approach detected 29 out of 36 attacks and 25 out of 53 attack points.
In conclusion, the existing works in the literature have proposed solutions for detecting cyber attacks in industrial control systems (ICS) using various Machine Learning and Deep Learning models. However, a common limitation among these works is the focus on local environments, with a lack of consideration for distributed environments.
Traditionally, most papers in the field have relied on feature extraction methods combined with Machine Learning techniques for intrusion classification. However, recent studies indicate that deep learning approaches are more effective and can reduce response time.
Given this ambiguity, our approach aims to experiment with both traditional feature-based methods and deep learning approaches. We will explore different algorithms and combinations to determine the optimal approach in the subsequent section. Our proposed approach will utilize a distributed architecture deployed on the Cloud, specifically leveraging Databricks Community and the Apache Spark tool for data manipulation and analysis using the SWaT dataset, which represents industrial data.
To compare performance and identify the most relevant model for classification, we will employ various Machine Learning and Deep Learning algorithms, with a particular emphasis on convolutional neural networks (CNN).

Industrial data collection
In the literature (Choi et al., 2019), there are several datasets for ICS intrusion detection, such as: the datasets of Morris and Gao (2014), who published five different power generation, gas and water treatment datasets for their intrusion detection research; the dataset of Lemay et al., who provided a network traffic dataset related to covert command-and-control channels in the Supervisory Control and Data Acquisition (SCADA) field; and the SWaT datasets, which represent a scaled-down version of an industrial water treatment plant and comprise sensor, actuator, input/output signal and network traffic data collected over seven days of normal operation and four days of attack scenarios. However, these datasets contain unintended patterns that can be used to easily identify attacks and non-attacks using machine learning algorithms. Although the gas dataset was updated in 2015 to provide more randomness, it was obtained from a small-scale testbed that may not reflect the true complexity of ICS. Therefore, there is no realistic dataset of sufficient complexity from a modern ICS that contains both network traffic and physical properties of the ICS except the SWaT dataset, which is a large-scale labeled dataset collected from a realistic testbed of sufficient complexity (Goh et al., 2017).

Secure water treatment dataset (SWaT)
The SWaT dataset (Goh et al., 2017) is a scaled-down version of a real industrial water treatment plant that utilizes ultrafiltration and reverse osmosis membrane units to produce 5 gallons per minute of filtered water. It was collected over a period of 11 continuous days, where the first 7 days captured normal operation without any attacks, and the remaining 4 days involved 41 different attack scenarios. Both network traffic and physical data from sensors and actuators were collected during this time.
Comprising a total of 53 features, the dataset includes a timestamp and a label indicating whether the data point corresponds to a normal operation or an attack. The remaining 51 features represent numerical values from 51 specific sensors and actuators. These sensors and actuators were sampled every second, providing a high-resolution temporal view of the water treatment plant's functioning.
The SWaT dataset serves as a valuable resource for researchers and professionals interested in studying the security of industrial control systems. It enables the development and evaluation of algorithms and systems for anomaly detection, intrusion detection, and other cybersecurity measures tailored to critical infrastructure such as water treatment plants.
Table 1 provides a more detailed representation of the features included in the dataset. Table 2 offers a comprehensive overview and detailed description of the distinct attack scenarios observed in the SWaT dataset.

Databricks
Databricks (AWS) (Databricks Architecture Overview, 2021) is a data analytics platform founded by the creators of Apache Spark whose goal is to accelerate innovation in data science, analytics and engineering by applying advanced analytics for machine learning and large-scale graph processing. It also uses deep learning to harness the power of unstructured data for AI tasks such as image interpretation, machine translation and natural language processing. In addition, it performs real-time analysis of high-speed sensor and time-series IoT data. A free version hosted on AWS, called Databricks Community Edition, is also available. This platform is a cloud-based Big Data solution designed for developers and data engineers. It provides users with access to a micro-cluster and cluster manager as well as a notebook environment. It is a higher-level platform that allows users to become proficient with Apache Spark. The platform offers a variety of features: it allows cluster sharing, where multiple users can connect to the same cluster, and it also offers security features and data management.

DBFS
DBFS is a distributed file system available on Spark clusters in Databricks. It allows storage objects to be mounted so that data can be accessed without the need for credentials, and it lets users interact with these objects using directory and file semantics. In addition, it keeps files in object storage so that data is not lost after clusters are terminated.

Big data technique: apache spark
Apache Spark is a powerful, flexible and distributed data processing framework. It provides a range of libraries, including the scalable machine learning library MLlib, which contains many machine learning algorithms, such as classification, clustering and regression algorithms.

Machine learning
Machine learning is a branch of artificial intelligence that allows computers to learn without explicit programming. However, to learn and develop, computers need data for analysis and training. Specifically, it is a modern science that discovers models from data and makes predictions based on statistics, data mining, pattern recognition and predictive analysis. The first algorithms were created in the late 1950s, the most famous of which is the perceptron. Machine learning is very effective in situations where information needs to be discovered from a large amount of different and ever-changing data (i.e. big data).
There are different types of machine learning algorithms. Generally, they can be divided into two categories: supervised and unsupervised. In the case of supervised learning, the data used for training have been 'labelled'. Therefore, the machine learning model already knows what to look for in this data (patterns, elements, etc.). At the end of the training, the trained model can find the same elements in unlabeled data. Among the supervised algorithms are classification algorithms and regression algorithms.
Unsupervised learning, as opposed to supervised learning, trains a model on unlabeled data. The machine explores the data without any guidance and tries to find patterns or recurring trends. Unsupervised models include clustering (to find groups of similar objects), association (to find links between objects) and dimensionality reduction (to choose or extract characteristics).
A final approach is reinforcement learning. In this case, the algorithm learns by trying again and again to achieve a well-determined goal. It can try all kinds of techniques to do that. The model gets rewarded if it gets close, or penalized if it fails.

Deep learning
Deep learning is a form of artificial intelligence derived from machine learning. It is a machine learning technique based on a neural network model: tens or even hundreds of layers of neurons are stacked, which brings greater complexity to rule-making. In the human brain, each neuron receives about 100,000 electrical signals from other neurons. Each active neuron can excite or inhibit the neurons connected to it. In artificial networks, the principles are similar: the signal circulates between the neurons. However, artificial neural networks do not use electrical signals; instead, they assign weights to the different neurons. Neurons that receive more weight have more influence on neighbouring neurons. The last layer of neurons produces the response to these signals.
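As a minimal sketch of the weighting idea described above, the following code computes the output of a single artificial neuron as a weighted sum of its inputs passed through an activation function. The sigmoid activation, the function names and the numeric values are illustrative assumptions, not details taken from this paper.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a sigmoid activation (assumed here)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weight_rows, biases):
    """Apply several neurons to the same inputs; one row of weights per neuron."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Toy values: the second neuron weights the first input more heavily.
signals = [1.0, 0.5]
outputs = layer_output(signals, [[0.2, 0.1], [2.0, 0.1]], [0.0, 0.0])
print(outputs)
```

Stacking several such layers, each feeding the next, gives the layered architecture described in this section.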
Currently, Deep Learning has been efficiently applied in several fields including computer vision, speech recognition, natural language processing, image classification and the security field (Nassif et al., 2019; Pang et al., 2021; Torfi et al., 2020; Voulodimos et al., 2018).
In the field of security, deep learning is applied to reduce dimensionality and to detect and classify anomalies: with the fast rise in transmitted traffic, manual feature engineering fails to manage multidimensional, large-scale data, while deep learning models automatically learn from complex data (Wang et al., 2021).
In traditional machine learning, features are often identified by an expert and then coded into a data type, which is a long and difficult task when it comes to large-scale data processing. The main difference between shallow machine learning and deep learning lies in the capability of deep architectures to learn features with different levels of abstraction at different layers of processing without human intervention and directly from the original data (Schmidhuber, 2015). Therefore, deep models automatically find complicated correlations and mappings between raw input and output (Deng & Yu, 2014).
Moreover, compared with traditional machine learning, Deep Learning uses perceptrons, neurons and backpropagation techniques to adjust parameters continuously and obtain excellent performance (Wang et al., 2021).
Several Deep Learning algorithms have been developed. In general, all these algorithms are networks of interconnected neurons organized in layers, which share certain common basic properties. What differentiates them is the architecture of the network, that is, the way neurons are organized in the network and sometimes the way they are trained.

A machine learning approach for intrusion detection in industrial control systems
Our first approach aims to provide an efficient and distributed intrusion detection system based on Machine Learning by adding additional processing power. Our approach is divided into four steps:
(1) Collection of industrial water treatment data via ITrust.¹
(2) Data storage using a deployment architecture: the Databricks Cloud Platform, which gives us a distributed and scalable system and solves several problems, mainly the problem of storing alert databases.
(3) Data structuring through data preprocessing techniques to obtain clean and usable data: data transformation (converting data from the original format to another format), the study of correlation between data, and data cleaning (eliminating irrelevant features and redundant rows).
(4) Data analysis using AI learning mechanisms such as machine learning to properly detect and classify intrusions.
To realize our approach, we choose Databricks as our data analysis, processing and storage platform, which is based on AWS and already has Spark installed and configured.An overview of the proposed architecture of our system is illustrated in Figure 1.

Data storage
Our first step is to load our SWaT intrusion detection dataset into the Databricks DBFS file system.

Data transformation
Since Apache Spark can convert data from one format to another, we first read the CSV files from DBFS and convert them to Apache Parquet format, which compresses and partitions the data; this minimizes storage costs and gives better performance (Abid & Jemili, 2020).
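A minimal sketch of this conversion with the PySpark DataFrame API is given below. It assumes an existing SparkSession is passed in (Databricks notebooks expose one under the name `spark`); the helper name and both paths are hypothetical placeholders, and the reader options shown (`header`, `inferSchema`) are our assumption about the CSV layout rather than settings reported in this paper.

```python
def csv_to_parquet(spark, csv_path, parquet_path):
    """Read a CSV file with Spark and rewrite it in Parquet format.

    `spark` is an existing SparkSession; both paths are placeholders.
    """
    # Assumption: the CSV has a header row and column types can be inferred.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Overwrite any previous output so the step can be re-run safely.
    df.write.mode("overwrite").parquet(parquet_path)
    return df


# Example call inside a Databricks notebook (illustrative paths only):
# csv_to_parquet(spark, "dbfs:/FileStore/swat/swat.csv",
#                "dbfs:/FileStore/swat/swat.parquet")
```

The session is passed as an argument so that the helper does not create or configure a cluster itself.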

Deal with categorical label and features
This step consists of encoding the categorical features so that we can train our model, using the StringIndexer() method, in which indices are assigned according to the frequency of each value; the most frequent value thus obtains the index 0.0. In our dataset, the categorical features come mainly from motorized valves and pumps.
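The frequency-based ordering can be illustrated in plain Python, without Spark. The sketch below only mimics the ordering rule described above; it is not Spark's implementation. The state labels are hypothetical, and breaking ties alphabetically is our assumption.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct value to an index, most frequent value first (index 0.0)."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(values, mapping):
    """Replace each categorical value by its numeric index."""
    return [mapping[v] for v in values]


# Hypothetical states of a motorized valve.
valve_states = ["open", "closed", "open", "open", "closed", "transition"]
mapping = fit_string_index(valve_states)
encoded = transform(valve_states, mapping)
print(mapping)
print(encoded)
```

Here "open" is the most frequent state and therefore receives the index 0.0.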

Correlation matrix
A correlation matrix is used to evaluate the dependency between several variables at the same time. The result is a table containing the correlation coefficients between each variable and the others. The correlation matrix for our features is shown below in Figure 2.

Feature selection
It is the automatic selection of attributes in our data (such as columns in tabular data) that are most relevant to the classification problem we are working on. Feature Selection is the process of selecting a subset of relevant features to use in model building. Feature selection is different from dimension reduction. Both methods seek to reduce the number of attributes in the dataset, but a dimension reduction method does so by creating new combinations of features, whereas feature selection methods include and exclude attributes present in the data without changing them.

Elimination of features that do not vary over time.
After studying the variation of the features, we found that the features P202, P401, P404, P502, P601 and P603, which correspond to pumps, have zero variation. Figure 3 shows a visualization of the variation of these features.
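A minimal sketch of this filtering step is shown below on a toy table. The column names reuse SWaT tags for readability, but the values are invented and the helper names are ours; a column is treated as constant when its population variance is exactly zero.

```python
from statistics import pvariance


def constant_columns(table):
    """Return the names of columns whose values never change (zero variance)."""
    return [name for name, values in table.items() if pvariance(values) == 0]


def drop_columns(table, names):
    """Return a copy of the table without the given columns."""
    excluded = set(names)
    return {name: values for name, values in table.items() if name not in excluded}


# Toy data: P202 never changes state, the other two columns do.
toy_table = {
    "P202": [1, 1, 1, 1],
    "LIT101": [500.1, 502.3, 498.7, 501.0],
    "P101": [1, 2, 2, 1],
}
to_drop = constant_columns(toy_table)
reduced = drop_columns(toy_table, to_drop)
print(to_drop, list(reduced))
```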

Choice of features most correlated with the target.
To determine the most relevant features, we defined a threshold of 0.25, which corresponds to the average of the thresholds found, for a judicious choice that allows us to develop a powerful and efficient classification model. The selected features are those which have a correlation with our target higher than 0.25. At the end, we can define a dataset composed of 24 features as follows: LIT101, P101, AIT201, P203, DPIT301, FIT301, MV302, MV304, P302, AIT402, FIT401, LIT401, P402, UV401, AIT501, AIT502, FIT501, FIT502, FIT503, FIT504, P501, PIT501, PIT502, PIT503.
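The selection rule can be sketched as follows, using the Pearson correlation coefficient between each feature and the target. Two points are assumptions on our part: the text does not say which correlation coefficient was used, and we compare the absolute value of the coefficient with the threshold so that strongly negative correlations are also kept. The column values are invented toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(table, target, threshold=0.25):
    """Keep features whose absolute correlation with the target exceeds the threshold."""
    return [name for name, values in table.items()
            if abs(pearson(values, target)) > threshold]


# Toy data: label 1 marks an attack, 0 marks normal operation.
label = [0, 0, 0, 1, 1, 1]
toy_features = {
    "FIT401": [2.5, 2.4, 2.6, 0.1, 0.2, 0.0],  # drops during the attack
    "NOISE": [1, 2, 3, 3, 2, 1],               # unrelated to the label
    "P101": [1, 1, 1, 2, 2, 2],                # follows the label exactly
}
selected = select_features(toy_features, label)
print(selected)
```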

Intrusion detection and classification using machine learning
After applying the necessary preprocessing and once our database is ready, we arrive at the last step of this approach, which is classification. A judicious choice of Machine Learning algorithms represents a fundamental step for the development of a classification system capable of efficiently distinguishing the different classes with a high accuracy rate and a low error rate. In our work, we tested the performance of our intrusion detection system with several ML classification algorithms: Multi-layer Perceptron (MLP), Decision Tree, Random Forest, Logistic Regression, Gradient-Boosted Trees (GBTs) and Naïve Bayes, using Apache Spark and its MLlib library (Classification & Regression, 2022).
• Decision Trees: Decision trees are supervised learning models used for classification and regression. They are widely used because they are easy to understand and manipulate. They handle categorical and numerical features and support binary and multiclass classification. In decision trees, data are continuously split according to a certain parameter. The decision tree contains two entities: decision nodes and leaves. Decision nodes are where the data is split, and the leaves are the decisions or final outcomes. Decision trees use different attributes to split the data into subsets; this process is repeated until the subsets share the same decision.
• Gradient-Boosted Trees (GBTs): Gradient-Boosted Trees are a supervised machine learning method that is generally used for both classification and regression. GBTs iteratively train decision trees in order to reduce a loss function and obtain an optimal solution.
• Random Forest: Random Forest is a supervised machine learning algorithm that is one of the most popular algorithms thanks to its flexibility and ease of use for both classification and regression. Random Forest fits a number of decision tree classifiers on various subsamples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. The subsample size is controlled by a parameter when bootstrapping is enabled (the default); otherwise the entire dataset is used to build each tree.
• Multi-layer Perceptron (MLP): It is a supervised learning algorithm that learns a function $f: \mathbb{R}^m \to \mathbb{R}^o$ by training on a dataset, where $m$ is the number of input dimensions and $o$ is the number of output dimensions. Given a set of features $X = \{x_1, x_2, \ldots, x_m\}$ and a target $y$, it can learn a non-linear function approximator for classification or regression. It is different from logistic regression in that between the input layer and the output layer there may be one or more non-linear layers, called hidden layers.
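To make the splitting step of a decision node concrete, the sketch below searches for the threshold on a single numeric feature that best separates two classes. The Gini impurity criterion is an assumption chosen for illustration, since the text does not state which impurity measure was used, and the code is a toy version rather than the MLlib implementation. The flow values and labels are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split `value <= threshold`."""
    n = len(values)
    best = (None, gini(labels))  # no split: impurity of the whole node
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data: low flow readings coincide with attacks.
flow = [0.1, 0.2, 0.3, 2.4, 2.5, 2.6]
state = ["attack", "attack", "attack", "normal", "normal", "normal"]
threshold, impurity = best_split(flow, state)
print(threshold, impurity)
```

A full tree repeats this search on each resulting subset, and ensembles such as Random Forest and GBTs combine many such trees.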

Model optimization
Most machine learning models need to be tuned to provide the best results. For example, for a random forest, you have to choose the number of trees to create and the number of variables to consider each time a node is split. Setting these parameters manually quickly becomes very time-consuming. This is where the ParamGridBuilder and the CrossValidator come in. Together they provide a hyperparameter optimization method that allows you to test a series of parameter values and compare the resulting performances to deduce the best parameterization. There are several ways to test model parameters, and combining the ParamGridBuilder with the CrossValidator is one of the simplest. For each parameter, we determine a set of values to test.
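The idea behind this grid search with cross-validation can be sketched in plain Python as follows. This is a conceptual illustration, not the Spark API: the helper names are ours, the grid values are hypothetical, and `toy_score` is a stand-in for training and evaluating a real model on each fold.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {"param": [values, ...]} into a list of every parameter combination."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def grid_search(fit_and_score, grid, n_samples, k=3):
    """Return the parameter set with the best mean score over the k folds."""
    best_params, best_score = None, float("-inf")
    for params in build_param_grid(grid):
        scores = [fit_and_score(params, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, test_idx):
    """Hypothetical stand-in for model training: peaks at 20 trees and depth 5."""
    return -abs(params["numTrees"] - 20) - abs(params["maxDepth"] - 5)


param_grid = {"numTrees": [10, 20, 50], "maxDepth": [3, 5]}
best_params, best_score = grid_search(toy_score, param_grid, n_samples=7, k=3)
print(best_params, best_score)
```

The cost grows with the product of the number of values per parameter and the number of folds, which is why each parameter is usually given only a small set of candidate values.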

A deep learning approach for intrusion detection in industrial control systems
Our second approach consists of several steps, ranging from data collection and preparation to attack detection and classification.
The contributions and improvements of our approach can be outlined as follows:
(1) Industrial Data Collection: We use the ITrust platform to collect industrial data specifically related to water treatment. This ensures that the data used in our approach is relevant and representative of real-world scenarios, enhancing the accuracy and applicability of our solution.
(2) Data Storage and Deployment Architecture: We leverage the Databricks Cloud Platform for efficient data storage and management. This platform provides a robust infrastructure for handling large-scale datasets, enabling seamless access to and retrieval of information during the data processing stages.
(3) Data Structuring, Preparation, and Cleaning: We structure, prepare, and clean the collected data. This involves organizing the data in a meaningful way, handling missing or erroneous values, and removing noise or outliers. These preprocessing steps enhance the quality and reliability of the data used in subsequent stages.
(4) Conversion to Image Data: To leverage the power of convolutional neural networks (CNNs), we convert the original one-dimensional vector data into two-dimensional image data. This transformation enables the CNNs to capture spatial dependencies and patterns in the data, which are crucial for accurate classification. This approach extends the applicability of CNNs to domains that are not traditionally image-based, such as water treatment.
(5) Binary and Multi-class Classification with Transfer Learning: We use a convolutional neural network for these classification tasks, leveraging transfer learning techniques. Transfer learning allows us to reuse models pre-trained on large-scale image datasets, such as ImageNet, to improve the performance of our water treatment classification model. By transferring the knowledge learned from these large datasets, we can achieve higher accuracy even with limited labeled data.
By combining these five steps, our approach offers several contributions and improvements to the field of water treatment:
. Relevant and Representative Data: By collecting industrial data through ITrust, we ensure that our approach is tailored to real-world water treatment scenarios. This increases the practicality and applicability of our solution.
. Efficient Data Storage and Management: Leveraging the Databricks Cloud Platform enables us to handle large-scale datasets efficiently, ensuring smooth data access and retrieval during the different stages of processing.
. Enhanced Data Quality: Through data structuring, preparation, and cleaning, we improve the quality and reliability of the data used in subsequent steps. This increases the accuracy and effectiveness of our solution.
. Utilization of CNNs for Non-Image Domains: By converting the data into image format, we extend the application of convolutional neural networks to domains beyond traditional image classification. This opens up new possibilities for leveraging the power of CNNs in various industries, including water treatment.
. Improved Classification Performance: By incorporating transfer learning, we enhance the classification performance of our model. Transfer learning allows us to benefit from pre-trained models and their learned knowledge, enabling higher accuracy even with limited labeled data.
Our approach offers a comprehensive solution for water treatment classification, addressing the data collection, storage, preprocessing, transformation, and classification stages. These contributions have the potential to improve the efficiency and accuracy of water treatment processes. Figure 4 shows the global architecture of our approach.
Since the first three steps were already covered in the description of the previous approach, the rest of this section explains the subsequent steps.

Feature mapping
Because of the interaction and correlation between the features in industrial control systems, we propose a feature mapping method based on the Mahalanobis Distance (MD) (Xiang et al., 2008) that converts one-dimensional data into a two-dimensional matrix that can be used as CNN input, improving both cost and performance. In our case, we convert each labeled record into a 16 × 16 image.
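To make the one-dimensional to two-dimensional conversion concrete, the sketch below scales a feature vector, zero-pads it and reshapes it into a 16 × 16 grid. It is deliberately not the Mahalanobis-based mapping described above, whose details are not given here: the min-max scaling, the zero padding, the row-by-row ordering and the 51-feature record are all assumptions chosen only to show the shape of the transformation.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def vector_to_image(values, side=16):
    """Zero-pad a 1-D feature vector and reshape it into a side x side grid."""
    if len(values) > side * side:
        raise ValueError("feature vector does not fit in the image")
    padded = list(values) + [0.0] * (side * side - len(values))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

# Hypothetical record with 51 numeric features (illustrative values only).
record = [float(i) for i in range(51)]
image = vector_to_image(min_max_scale(record))
```

The resulting list of 16 rows of 16 values can be handed to a CNN as a single-channel image.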

Convolutional neural network (CNN)
The deep neural network model we selected was the convolutional neural network (CNN), which has been widely used in various domains, especially in the field of anomaly detection (Lai et al., 2019; Pang et al., 2021; Wang et al., 2021). We used this method to classify ICS abnormal traffic because it is among the most successful deep learning architectures. One of its main advantages is automatic feature extraction: the convolution layers process the data directly and play a major role in extracting features and resizing the data over several steps. Therefore, CNNs can not only select features but also classify traffic data. In addition, compared to other DL algorithms, the biggest advantage of a CNN is that it shares the same convolutional kernels, which greatly reduces the number of parameters.
A CNN is typically built from the following types of layers:
. Convolution (CONV): applies filters that scan the input; its hyperparameters include the size of the filter F and the stride S. The output O of this operation is called an activation map or feature map.
. Pooling (POOL): is a sub-sampling operation usually applied after a convolution layer. The best-known types are max pooling and average pooling, which take the maximum and the average value respectively. Its objective is to subsample the feature maps produced by the previous layer, reducing their size to save processing time while preserving the most essential information. This improves the efficiency of the network and helps avoid overfitting.
. Fully Connected (FC): operates on a previously flattened input, where each input is connected to all neurons. Fully connected layers are typically present at the end of CNN architectures and can be used to optimize objectives such as class scores. In general, the final fully connected layer uses a Softmax activation function for multi-class classification and a Sigmoid activation function for binary classification.
Figure 5 provides an example of a CNN architecture.
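As a rough illustration of how the filter size F and the stride S determine the size O of a feature map, and of what max pooling does, consider the following sketch. The relation O = (I - F + 2P) / S + 1 is the standard one for convolutions; the input size I and the padding P are symbols introduced here and are not used in the text above, and the 2 × 2 pooling window is an assumption.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """O = (I - F + 2P) / S + 1, using integer (floor) division."""
    return (input_size - filter_size + 2 * padding) // stride + 1

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]

# A 3x3 filter with stride 1 and no padding shrinks a 16x16 input to 14x14.
size_without_padding = conv_output_size(16, 3)
# One pixel of padding keeps the 16x16 size.
size_with_padding = conv_output_size(16, 3, padding=1)

pooled = max_pool_2x2([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])
```

Pooling halves each spatial dimension here while keeping the largest activation of every 2 × 2 window.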

Activation functions.
. ReLU: The rectified linear unit is an activation function that is applied to all elements of the volume. It is used to introduce non-linearity into the network. The ReLU function is defined by F(x) = max(0, x), where all values x > 0 return x and all values x ≤ 0 return 0.
. Sigmoid: A widely used activation function, employed extensively in models that have to predict a probability as output. Since a probability only takes values between 0 and 1, and the sigmoid maps any real input into that range, it is a natural choice.
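Both activation functions are short enough to write out directly. The sketch below uses the standard definitions, with the sigmoid given by 1 / (1 + e^(-x)); the split into two branches is only a numerical-stability choice made for this example.

```python
import math

def relu(x):
    """F(x) = max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), evaluated in a stable form."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

relu_values = [relu(v) for v in (-2.0, 0.0, 3.5)]
sigmoid_values = [sigmoid(v) for v in (-5.0, 0.0, 5.0)]
```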

Train a neural network.
. Epoch: In the context of training a model, an epoch refers to one iteration in which the model sees the entire training set to update its coefficients.
. Mini-batch Gradient Descent: During the training phase, the coefficients are usually updated neither on the entire training set at once, because of costly computation times, nor on a single point, because of potential noise. Instead, the update step is done on mini-batches, where the number of points in a batch is an adjustable parameter.
. Loss function: In order to quantify the performance of a given model, the loss function is used to evaluate the extent to which the true outputs are correctly predicted by the model.
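The three notions above fit together in a training loop. The following sketch trains a one-variable linear model with mini-batch gradient descent and a mean squared error loss; the model, the loss, the synthetic data and the hyperparameter values are assumptions chosen for brevity and are unrelated to the CNN used in this work.

```python
import random

def mse_loss(w, b, batch):
    """Mean squared error of the model y = w * x + b on a batch."""
    return sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)

def train(data, epochs=200, batch_size=4, lr=0.05, seed=0):
    """Mini-batch gradient descent: one epoch is one pass over all the data."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new batch composition at every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients of the MSE loss averaged over the mini-batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 2x + 1.
data = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
w, b = train(data)
final_loss = mse_loss(w, b, data)
```

The batch size trades off the cost of each update against the noise of the gradient estimate, which is the compromise described in the text.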

Regularization to CNN.
. Dropout: Dropout is a technique meant to prevent overfitting on the training data by dropping units in a neural network with a probability p > 0. This forces the model to avoid relying too much on a particular set of features.
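A minimal sketch of dropout on a list of activations is shown below. It uses the common "inverted" variant, which rescales the surviving units by 1 / (1 - p) so that the expected activation is unchanged; that rescaling, like the function name and arguments, is an assumption of this example rather than something stated in the text.

```python
import random

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # dropout is disabled at prediction time
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
dropped = dropout([1.0] * 1000, p=0.5, rng=rng)
unchanged = dropout([1.0, 2.0], p=0.5, rng=rng, training=False)
```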

Optimization technique: Adaptive Moment Estimation (Adam)
. Adam is currently one of the most popular optimization algorithms used in deep learning. Its main advantages are memory efficiency and reduced computing requirements (Zhang, 2018). Moreover, Kingma and Ba (2014) highlighted that the Adam algorithm combines the advantages of several optimization algorithms, which allows it to converge faster than the others.
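The Adam update rule can be sketched on a one-dimensional problem. The code follows the update described by Kingma and Ba (2014), with first and second moment estimates and bias correction; the quadratic objective, the learning rate of 0.1 and the number of steps are assumptions chosen so that the example runs quickly.

```python
import math

def adam_minimize(grad, x0, steps=1000, lr=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g   # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
```

Only two running averages are stored per parameter, which is where the memory efficiency mentioned above comes from.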

Transfer learning
Recent studies have demonstrated the widespread use of CNNs, which provide innovative support for many classification problems. In general, deep CNN models require a large amount of data to achieve good performance, and the usual challenge with such models is the lack of training data. Indeed, collecting large volumes of data is a daunting task. The problem of undersized datasets is therefore currently addressed with the Transfer Learning method, which is very effective in dealing with the lack of training data (Alzubaidi et al., 2021).
Transfer learning refers to a set of methods that transfer the knowledge acquired by solving a given problem to address another problem. With the rise of deep learning, transfer learning has been a great success. Indeed, the models used in this field often require a lot of computation time and resources. By using pre-trained models as a starting point, transfer learning makes it possible to develop high-performance models quickly and to solve complex problems efficiently.
There are numerous CNN models, such as Xception, Inceptionv3, GoogleNet, VGG, ResNet and AlexNet.

ResNet50
ResNet (Residual Network) was developed by He et al. (2016). Various types of ResNet have been developed based on the number of layers, from 34 layers up to 1202 layers. The most commonly used type is ResNet50, which comprises 49 convolutional layers plus a single fully connected (FC) layer. ResNet won the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC 2015), where it reached a top-5 test accuracy of 96.43% on ImageNet. The most notable addition in this network is the introduction of skip connections that connect non-consecutive layers: each nth layer is connected to the (n+2)th layer. This connection allows the network to make better use of the features of the initial layers and helps mitigate the vanishing gradient problem.
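The effect of a skip connection can be illustrated with a tiny residual block in which the input of the block is added back to the output of two layers. This is only a sketch: a real ResNet block uses convolutions and batch normalization, whereas the dense layers without bias, the weights and the function names used here are assumptions made to keep the example short.

```python
def relu(values):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, v) for v in values]

def linear(values, weights):
    """Dense layer without bias: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), where F is two linear layers with a ReLU between."""
    fx = linear(relu(linear(x, w1)), w2)
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, -2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all-zero weights, F(x) = 0 and the block simply passes ReLU(x) through.
passthrough = residual_block(x, zero, zero)
combined = residual_block(x, identity, identity)
```

Because the input always reaches the output through the addition, the signal (and its gradient) is not lost even when the intermediate layers contribute nothing.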
For these reasons, in the present work, we used the ResNet50 CNN model to classify ICS abnormal traffic. We used the common Adam optimizer, the sigmoid activation to implement binary classification, dropout to mitigate overfitting of the model, and binary cross-entropy as our loss.
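The binary cross-entropy loss mentioned above compares the true labels with the probabilities produced by a sigmoid output. The sketch below uses the standard definition; the clipping constant and the sample labels and probabilities are assumptions added for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels (1 = attack, 0 = normal) and predicted probabilities.
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
hesitant = binary_cross_entropy([1, 0], [0.6, 0.4])
```

The loss decreases as the predicted probabilities move towards the true labels, which is what the optimizer exploits during training.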

Evaluation and validation of results
This last section summarizes the classification results obtained across several models. We present the results of each model to evaluate the effectiveness and efficiency of the proposed method. Then we discuss and validate these results.

Evaluation metrics
Model evaluation is an integral part of the model development process. It helps to find the model that best represents our data and to estimate how well the chosen model will perform in the future. For classification algorithms, two evaluation measures are commonly used (Apache Spark, 2021):

Confusion matrix
A confusion matrix gives a complete picture of the performance of a model. It is defined as shown in Figure 6.

Main indicators
Table 3 contains the indicators that are commonly used to evaluate the performance of classification models. During all the experiments, the dataset was divided into three subsets: 70% for training, 10% for testing and 20% for validation.
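The usual classification indicators can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard formulas for accuracy, precision, recall, F-Measure and false positive rate; the counts passed to the function are hypothetical values chosen for illustration and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common indicators from the cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fpr": fpr,
    }

# Hypothetical counts: 90 attacks detected, 20 missed, 10 false alarms.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```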

Results
In our study, we performed binary classification for our previous Machine Learning-based approach (Abid et al., 2022), and two types of classification, binary and multi-class, for our new Deep Learning-based approach.

A machine learning approach for intrusion detection in industrial control systems
After parameterizing the models, we can confirm that the ParamGridBuilder and the CrossValidator improved the performance of the different models. In the rest of this section, we present the results obtained for each classifier.
The confusion matrix of each classifier is displayed in Figure 7. The classification report, which illustrates the performance across the different metrics for each model, is displayed in Figure 8. Table 4 compares the different models used for the detection and classification of intrusions in our dataset.
The results showed that the Gradient-Boosted Trees (GBTs) classifier gave the best performance in terms of Accuracy (0.99), Precision (0.99), Recall (0.99) and F1-score (0.99), with a low number of FNs (24) and FPs (2427). It was followed by the Random Forest classifier and then the Decision Tree classifier, while the Naïve Bayes and Logistic Regression classifiers gave the lowest detection accuracy (0.97). In terms of speed, as well as implementation and data processing, Apache Spark allowed the system to respond within seconds.
Table 5 shows the time spent by each model in the learning and prediction phases.

Table 3. Main indicators.
Accuracy: The ratio of correctly classified objects to the total number of objects.
Precision: The ratio of data instances predicted as positive that are actually positive.
Recall: The proportion of positive examples that were classified correctly.
F-Measure: The harmonic mean of precision and recall.
False Positive Rate (FPR): The ratio of the number of normal instances detected as attack to the total number of normal instances.

A deep learning approach for intrusion detection in industrial control systems

Binary Classification
In the binary classification task, we assessed the performance of our deep learning algorithm, specifically the CNN-based ResNet50 model, in terms of precision, recall, and accuracy. We selected this algorithm based on its extensive adoption and proven effectiveness in various domains. The performance of this model, as well as the variation of the loss, are depicted in Figures 9 and 10.
After testing both approaches, the traditional machine learning-based approach (Abid et al., 2022) and the deep learning-based approach, we found that the CNN, particularly the ResNet50 model, performs better in the task of classifying intrusions. The ResNet50 model demonstrated excellent results in terms of accuracy, precision, and recall, while maintaining low training and prediction times of 0.19 s and 0.16 s, respectively.

MultiClass Classification
Table 6 presents the evaluation results for the different security attacks, showing how effectively our multi-class classification approach detects and categorizes each of them.

Comparison and discussion of results
In this section, we perform a comparative study between our proposed system and other existing approaches. Our objective is to assess the performance and effectiveness of our intrusion detection system. To conduct the evaluation, we compared our system with other works that used the SWaT dataset. To gauge the system's performance, we considered multiple metrics such as Precision, Recall, F-Measure, and Accuracy (Table 7).
Our previous approach (Abid et al., 2022) attained a precision, recall and accuracy of 0.99 using the Gradient-Boosted Trees (GBTs) classifier.
Similarly, Perales Gomez et al. (2020) achieved a precision of 0.984, with a recall of 0.750 and an F-Measure of 0.851, by employing LSTM. Another approach (Kravchik & Shabtai, 2018), which used a 1D CNN, achieved a precision of 0.968, a recall of 0.791, and an F-Measure of 0.871.
In Inoue et al. (2017), the authors explored different algorithms based on machine learning and deep learning. Their OCSVM approach obtained a precision of 0.925, while the DNN approach achieved a precision of 0.982. In Elnour et al. (2020), the authors achieved good results with the DIF approach, obtaining precision, recall, and F-Measure values of 0.935, 0.835, and 0.882, respectively. Another study (Wang et al., 2021) reported a precision of 0.9955 and a recall of 0.9906 using a CNN-based ResNet8 model. Furthermore, Lai et al. (2019) achieved an accuracy of 0.9946 using the CNN-based LeNet-5 model.
Our findings indicate that our distributed system, employing the ResNet50 CNN model, outperformed our previous approach (Abid et al., 2022) as well as the other approaches on these performance metrics. Specifically, our system achieved a precision of 0.998, a recall of 0.9974, and an F-Measure of 0.9976.
This superiority can be attributed to the advantages offered by the Transfer Learning technique. Additionally, our system demonstrated faster response times, which can be attributed to the benefits of using Databricks and Apache Spark. With the ResNet50 CNN model, our system effectively leveraged the knowledge that the pre-trained model gained from a large-scale dataset. This transfer of knowledge allowed our model to achieve superior performance compared to other approaches. The pre-trained model served as a strong foundation, enabling our system to learn complex patterns and features from the industrial dataset, thereby improving its accuracy in identifying intrusions.
Furthermore, the use of Databricks and Apache Spark provided additional advantages to our system. Databricks, a cloud-based platform, facilitated efficient data storage, processing, and management, enabling seamless integration of our distributed intrusion detection system. Apache Spark, a distributed computing framework, accelerated the processing and analysis of large-scale industrial data, contributing to reduced response times. These technologies played a crucial role in enhancing the overall performance and efficiency of our system.
Our study highlights the effectiveness and efficiency of our distributed intrusion detection system, which combines the ResNet50 CNN model, Transfer Learning, Databricks, and Apache Spark. By leveraging these technologies, we achieved superior performance in terms of accuracy and response time compared to existing intrusion detection works using industrial datasets.
In addition to intrusion detection systems, access control and usage control play crucial roles in ensuring security across various applications, especially in the context of industry. As the future of work in the industry evolves, these mechanisms will become increasingly important. Several notable works have contributed to the field of access control and usage control security, addressing different aspects such as safety decidability, administrative role-based access control, and implementation verification.
One significant contribution is found in the works by Rajkumar and Sandhu (2020) and Rajkumar and Sandhu (2016b), which focus on the safety decidability aspect of pre-authorization usage control. These studies enhance the dependability and security of computing systems, providing valuable insights for industry professionals to safeguard sensitive data and prevent unauthorized access.
Another relevant work, presented in the poster by Rajkumar and Sandhu (2016a), focuses on developing access control models to enhance security in administrative role-based scenarios. With the growing interconnectivity of organizations, establishing robust access control measures for administrative roles becomes essential. This research offers valuable guidance for industry professionals in designing effective access control systems that prevent unauthorized individuals from gaining access to critical resources and sensitive information.
Furthermore, the study conducted by Rajkumar et al. (2010) investigates the correct implementation and functioning of usage control mechanisms in network security applications. As the reliance on networked systems and the Internet of Things (IoT) continues to grow, ensuring proper usage control mechanisms becomes vital. Industry professionals can gain insights from this research to implement and maintain usage control mechanisms effectively, thereby enhancing the security of networked systems in various industrial settings.
In summary, these works collectively contribute to the field of access control and usage control security by addressing different aspects of safety decidability, administrative role-based access control, and implementation verification. By incorporating the findings and recommendations from these studies, future industry professionals can proactively address emerging cybersecurity challenges, ensuring a safer and more secure work environment in the future.

Conclusion and future work
In this paper, we aimed to extend our previous work (Abid et al., 2022), published in ICCCI 2022, by introducing a new distributed Deep Learning approach for intrusion detection in industrial control systems based on a Big Data tool and the Transfer Learning technique. We used the SWaT industrial dataset to evaluate our proposed system.
After testing the previous approach and the new one, we can confirm that CNNs perform better in the task of classifying intrusions, especially the ResNet50 model, which gave good results in terms of accuracy, precision and recall with reduced training and prediction times. These results are due to the Transfer Learning technique and to the performance of Databricks Community, which, despite being a limited version, solved several problems, including dataset storage and the high availability of our cluster.
The proposed research opens several perspectives. A first axis aims to merge two or more streaming datasets in order to produce more consistent information, increase the reliability of intrusion detection and improve the evaluation schemes. A second axis is to improve the data processing speed by opting for continuous streaming (data processing in milliseconds and not only in seconds). A third axis focuses on automating the decision-making process for handling intrusions through the development of an expert system that provides appropriate recommendations for stopping each attack.
Ouajdi Korbaa is a full-time professor at the University of Sousse (Tunisia).

Figure 1. Overview of the proposed approach.
of the most correlated features. Several features are correlated with each other with a correlation of 100%. This kind of correlation is common in industrial control systems because several sensors (or actuators) depend on each other (Perales Gomez et al., 2020). Therefore, we cannot delete any feature in this step.

Figure 4. Overview of the proposed approach.
Figure 5. Example of CNN architecture.

Figure 9. Performance metrics of the proposed approach.
UF feed Pump; Pumps water from UF feed water tank to RO feed water tank via UF filtration.
P-302: UF feed Pump; Pumps water from UF feed water tank to RO feed water tank via UF filtration.
AIT-401: RO hardness meter of water.
AIT-402: ORP meter; Controls the NaHSO_3 dosing (P203), NaOCl dosing (P205).
FIT-401: Flow Transmitter; Controls the UV dechlorinator.
LIT-401: Level Transmitter; RO feed water tank level.
P-401: Pump; Pumps water from RO feed tank to UV dechlorinator.
P-402: Pump; Pumps water from RO feed tank to UV dechlorinator.

Table 2. Different attacks of SWaT.

Table 6. Performance metrics relative to different attacks.

Table 7. Comparison of our approach with other works.
He received his Engineering Diploma from the Ecole Centrale de Lille (France) in 1995 and his Master's degree in Production Engineering and Computer Science from the University of Lille (France) in the same year. He obtained his Ph.D. in Production Management, Automatic Control, and Computer Science from the University of Science and Technologies of Lille (France) in 1998 and his "Habilitation to Supervise Research" degree in Computer Science from the same university in 2003. Pr. Korbaa has published around 150 research papers on Optimisation, Simulation and Modeling, Applied and Computational Mathematics, Manufacturing Engineering and Computer Engineering.