TCN enhanced novel malicious traffic detection for IoT devices

With the development of IoT technology, more and more IoT devices are connected to the network. Because of the hardware constraints of IoT devices, it is difficult for developers to embed security software into them, so it is preferable to protect IoT devices at the traffic level. Malicious traffic detection based on neural networks is promising, but slow computation makes it difficult to deploy AI-based detection systems on edge servers. The Temporal Convolutional Network (TCN) is a fast neural network well suited to massively parallel computation. In this paper, we propose Multi-class S-TCN, an improved TCN-based network that supports multi-class classification for the practical needs of IoT scenarios. We also implement a complete IoT traffic security detection procedure based on deep packet inspection and protocol analysis. The proposed Multi-class S-TCN significantly improves detection speed without degrading detection quality. Experiments show that this work achieves better detection performance and faster detection speed than existing approaches, demonstrating the effectiveness of the proposed detection flow and Multi-class S-TCN in IoT scenarios.


Introduction
With the birth and application of 5G technology (Retal & Idrissi, 2020), its high bandwidth and low delay provide a more robust infrastructure for large numbers of network devices. Emerging mobile terminals and IoT applications are booming.
As a new computing paradigm, edge computing moves computation from the cloud to the network edge, such as base stations and access points, to provide services for nearby users (Santos et al., 2020). In this way, network delay is reduced to meet the needs of delay-sensitive network services. In this setting, much of the user data generated by IoT applications is computed and exchanged directly at the edge. This delay-sensitive data often involves user privacy and even public safety, so protecting it is very important. According to Symantec's 2019 network security threat report (O'Gorman et al., 2019), the number of network attacks increased by 56% in 2018, and Symantec intercepted at least 1.3 million network attacks on endpoint hosts every day. At the same time, one in ten network links was malicious, compared with only one in sixteen in 2017. However, performing compute-intensive and delay-sensitive security tasks is usually not desirable in IoT scenarios (Zhang et al., 2014). Therefore, an effective detection method is urgently needed to ensure the security and effectiveness of IoT services deployed at the edge.
Researchers actively explore and continuously improve network traffic detection technologies to improve network security and service quality (Cai et al., 2022; Liu et al., 2019; Ning et al., 2021). Early network traffic detection depended mainly on transport-layer port allocation, and feature extraction consisted of reading the transport-layer port number. Because detection accuracy decreases as the proportion of traffic using fixed ports falls (Madhukar & Williamson, 2006), this method is not applicable in today's network environment. Deep Packet Inspection (DPI) is a commonly used method in traffic detection. DPI based on fingerprint matching is still widely used in network security today (Xu et al., 2016). In particular, traffic features are expressed as regular expressions and then matched with a regular-expression matching algorithm, which significantly improves detection efficiency.
Because of the complexity of IoT scenarios, traditional traffic detection methods based on signature matching struggle to keep up with the evolution of network attacks. In contrast, machine learning is used more and more frequently in network security because of its precision and its ability to detect unknown attacks. However, machine learning also faces challenges (Ibitoye et al., 2019), such as the demand for large computing resources and poor accuracy.
Traffic detection based on machine learning usually extracts feature information from the vast network data flow and then uses the extracted features to classify the traffic and detect malicious traffic. This paper introduces a novel malicious traffic detection framework for IoT scenarios based on DPI technologies. Compared with existing methods, it can both detect malicious traffic and identify the type of attack. In summary, this paper makes the following contributions:
• We introduce a novel multi-class classification network named Multi-class S-TCN to detect malicious traffic and identify its type.
• We design a feature extraction algorithm based on DPI technologies for Multi-class S-TCN models. This algorithm extracts feature information by deeply unpacking and analysing the network traffic data in the data link layer.
• We implement a prototype and run a series of experiments to evaluate the proposed method. The results show that the method performs well in both detection and training cost.
Section 1 is the introduction, which presents the starting point of this paper. Section 2 introduces the background of this paper and previous work in related fields. Section 3 describes the proposed method in detail. In Section 4, several experiments are designed to evaluate our design. Section 5 summarises the work of this paper.

Deep packet inspection (DPI)
DPI technologies appeared in the early days of the Internet and have been widely used to analyse network traffic. DPI deeply parses traffic data and extracts as much feature information as possible for identification and detection. It can identify buffer overflows, Denial of Service (DoS) attacks, penetration attempts, and even worms encapsulated in data packets (Porter, 2005).
Port identification (Nguyen & Armitage, 2008) obtains the transport-layer unit (segment or datagram) by analysing the payload of the network-layer packet. It locates the sending and receiving endpoints according to the port and distinguishes the traffic accordingly. Port identification is the cheapest and fastest DPI technology, and it does not inspect encrypted content. It is widely used in network access control and network firewalls. However, port identification can only identify traffic on allocated ports.
Protocol analysis is a method of parsing the network layer data to obtain header information to collect data transmission behaviour features and then classify the traffic based on the similarity of behaviour features.
Fingerprint matching identifies the type of network data by comparing the payload with collected fingerprints. Recent methods are mainly based on Finite State Machines (FSMs), which transform the fingerprint database into regular expressions and dramatically improve matching efficiency. Modern network intrusion detection systems, such as SNORT, BRO, and L7 filter, are implemented in this way (Xu et al., 2016). However, these methods are of little use for protocols without fixed fingerprint information or for traffic transmitted over encrypted protocols.
Statistical analysis classifies traffic based on mathematical and statistical analysis. Statistically significant features are computed and assembled into vectors for classification (Rocha et al., 2011; Boukhtouta et al., 2016). Traffic can then be classified by comparing the similarity of the feature vectors.

Malicious traffic detection
Machine Learning (ML) technologies help researchers build better security systems (Ning et al., 2021). AI-driven systems can be divided into two categories: prediction (Chang et al., 2021) and classification (Telikani et al., 2021). There has been much research on traffic detection. Skaruz and Seredynski (2007) converted the detection of SQL attacks into a time series prediction problem and used an improved RNN to identify malicious attacks. Owl-Eye (Liu et al., 2018) is a detection system based on the Hidden Markov Model (HMM), designed for parameter-level detection of HTTP traffic. HAST-IDS (Wei et al., 2018) first uses a CNN to learn low-level spatial features of network traffic and then uses Long Short-Term Memory (LSTM) networks to learn temporal features. n.d. (n.d.) proposed a unified model that combines multi-scale CNNs and LSTM (MSCNN-LSTM) for traffic monitoring. MECGuard (Liu et al., 2021) is a Gated Recurrent Unit (GRU) based attack detection system for mobile edge computing. Mondal et al. (2021) proposed using ML for traffic classification in Docker-based SDN. Malialis et al. (2015) proposed an intrusion detection method based on reinforcement learning. Gajewski et al. (2019, 2020) proposed anomaly detection systems for the Home Area Network (HAN), which is the most common IoT scenario.

Overview
In this paper, we use the HTTP/TCP/IP protocol family as an example to explain the workflow of our design. The malicious traffic detection framework proposed in this paper is divided into six modules, as shown in Figure 1: traffic capture, data stream capture, protocol identification, feature extraction, traffic classification, and monitoring and control.
The traffic capture module uses the Pcap Application Programming Interfaces (Pcap APIs) to read traffic data from the network card. The data stream capture module unpacks the data obtained by the traffic capture module layer by layer and finally extracts the relevant TCP packet payload. In this process, the module saves some data needed as features. The protocol identification module analyses the TCP payload output by the previous module and classifies the application layer protocol carried in the TCP packets through a simple decision tree. The feature extraction module then extracts the feature information of the payload according to the identified protocol. It also integrates the feature information saved by the previous modules and is responsible for submitting the integrated features. The traffic classification module is the Multi-class S-TCN, which detects malicious traffic and identifies its type. Finally, the monitoring and control module is responsible for feeding back the stream monitoring results.

Traffic data preprocess
Network traffic passing through a network node can be copied and forwarded through a bypass. The data capture module copies the data from the network card to memory and stores it temporarily for subsequent processing. At present, most general-purpose traffic monitoring and capture software is implemented on the Pcap APIs, such as Wireshark, Nmap, ntop, TCPDUMP, and WinDUMP, as well as the intrusion detection system Snort. The Pcap APIs use Libpcap (Garcia, 2008; McCanne, 2011) under Linux and WinPcap (Risso & Degioanni, 2001) under Windows. Traffic captured in this way is copied to memory in the Pcap encapsulation format, which includes the capture time, data size, and other information.

Packet header parsing
Pcap is a data storage structure. Although it is not a packet structure used for communication, it can be handled approximately as a data packet. The first two fields can be parsed into a microsecond-level timestamp, and the last two fields are the captured length and original length of the data. We save the timestamp and original length as features. Pcap is the basis of our analysis. We collect the following headers:
• Ethernet Data Frame Header: At the data link layer, the data carrier is an Ethernet data frame. Ethernet has different versions of the protocol. We use Ethernet II here, which is defined in IEEE 802.3x (IEEE, n.d.). The address information can be temporarily stored as the identity information of the data stream.
• IPv4 Packet Header: As defined in RFC791 (Postel, 1981b). All fields are stored as feature information in this paper.
• TCP Header: We use TCP as the target protocol, which is defined in the RFC793 document (Postel, 1981a). The "Flag", "Window Size", "Urgent Pointer", and "Options" fields are temporarily stored as features, and "Source Port" and "Destination Port" are temporarily stored for identification.
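The header collection described above can be illustrated with a minimal sketch based on Python's standard `struct` module. The function names and the returned dictionary keys are hypothetical and are not taken from the prototype; the sketch also assumes a little-endian classic Pcap record header, untagged Ethernet II frames, and only the six control bits of RFC793.

```python
import struct

def parse_pcap_record_header(buf, little_endian=True):
    """Parse the 16-byte Pcap per-packet header (assumed byte order)."""
    fmt = "<IIII" if little_endian else ">IIII"
    ts_sec, ts_usec, captured_len, original_len = struct.unpack(fmt, buf[:16])
    return {
        "timestamp_us": ts_sec * 1_000_000 + ts_usec,  # microsecond timestamp
        "captured_len": captured_len,
        "original_len": original_len,
    }

def parse_ethernet_ii(frame):
    """Split an Ethernet II frame into addresses, EtherType, and payload."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    as_mac = lambda raw: ":".join(f"{b:02x}" for b in raw)
    return {"dst_mac": as_mac(dst), "src_mac": as_mac(src),
            "ethertype": ethertype, "payload": frame[14:]}

def parse_ipv4(packet):
    """Parse the fixed IPv4 header (RFC791) and keep the raw options."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    header_len = (ver_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    return {"version": ver_ihl >> 4, "tos": tos, "total_len": total_len,
            "id": ident, "flags_frag": flags_frag, "ttl": ttl,
            "protocol": protocol, "checksum": checksum,
            "src": ".".join(map(str, src)), "dst": ".".join(map(str, dst)),
            "options": packet[20:header_len],
            "payload": packet[header_len:total_len]}

def parse_tcp(segment):
    """Parse the fixed TCP header (RFC793) and keep the raw options."""
    (src_port, dst_port, seq, ack, off_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = (off_flags >> 12) * 4  # header length in bytes
    return {"src_port": src_port, "dst_port": dst_port,
            "flags": off_flags & 0x3F,   # six control bits of RFC793
            "window": window, "urgent": urgent,
            "options": segment[20:data_offset],
            "payload": segment[data_offset:]}
```

Chaining the four functions on one captured record yields the timestamp, lengths, addresses, ports, and option bytes that the later feature extraction steps consume.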

Application layer protocol identification
The core idea of the application layer protocol identification method in this paper is to use a decision tree as the classifier and to take the beginning of the payload of the entire TCP session as its input. A complete TCP data stream carries at most one formal application layer protocol, and the beginning of the payload contains fingerprint information of that protocol. In theory, the type of most application layer protocols can be determined from the first part of the first TCP segment that carries valid data. Identifying the protocol type is essentially a classification task, for which this paper uses a decision tree. A decision tree has a relatively low computational cost; it is fast and well suited to data with obvious fingerprint information.

Packet feature extraction
After the analysis in the previous steps, ten features are obtained, as shown in Table 1. Variable-length data needs to be transformed into fixed-length data through a specific algorithm. Among these, the content of the TCP Options field is a combination of flag fields that enable specific functions. To simplify the calculation, only the XOR operation is used to reduce it to a fixed-length feature. The IPv4 Options field generally records routing information, such as the IPv4 addresses that the data packet is to pass through. Because the IPv4 addresses of different types of networks have characteristic patterns, summing the IPv4 Options field every 4 bytes also yields informative features.
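The two reductions described above can be sketched as follows. The text does not state the word width of the XOR reduction or how overflow of the sum is handled, so the 4-byte word width, the zero padding of a trailing partial word, and the modulo-2^32 wrap-around are all assumptions; the function names are hypothetical.

```python
def xor_fold(data: bytes, width: int = 4) -> int:
    """XOR consecutive `width`-byte words of `data` into one integer.

    A trailing partial word is zero-padded (an assumption of this sketch).
    """
    data = data + b"\x00" * (-len(data) % width)
    acc = 0
    for i in range(0, len(data), width):
        acc ^= int.from_bytes(data[i:i + width], "big")
    return acc

def sum_words(data: bytes, width: int = 4) -> int:
    """Sum consecutive `width`-byte words, wrapping at 2**(8*width)."""
    data = data + b"\x00" * (-len(data) % width)
    modulus = 1 << (8 * width)
    acc = 0
    for i in range(0, len(data), width):
        acc = (acc + int.from_bytes(data[i:i + width], "big")) % modulus
    return acc
```

For example, `xor_fold` would be applied to the raw TCP Options bytes and `sum_words` to the raw IPv4 Options bytes, so that each variable-length field contributes exactly one fixed-length value to the feature vector.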

HTTP header features
Application layer protocols are more complex and flexible than the IPv4 and TCP protocols. The HTTP header is an ASCII-encoded string whose fields have variable length and are separated by the space character (SP). The HTTP header can be divided into a start line and message headers, with Carriage Return and Line Feed (CRLF) as the boundary (Fielding et al., 2006).
The HTTP message header is similar to the "Options" field of the TCP protocol. Each rule consists of two parts: the name and the parameter. The name indicates the type of function to be set, and the parameter indicates the value to be set. Although HTTP message headers can be divided into three types (request, response, and general), they are processed uniformly. The name can be digitised through a dictionary. Because the value of most parameters is not a numerical range but a character flag of limited length, a simple XOR method is used to obtain a numeric summary of the parameter characters and complete the digitisation.
In the end, we selected the eight features listed in Table 2 for the HTTP header.
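The digitisation of header names and parameters can be sketched as below. The dictionary `NAME_IDS` is a hypothetical example and does not reproduce the eight features of Table 2; the 4-byte XOR width and the use of 0 for unknown names are likewise assumptions.

```python
# Hypothetical name dictionary; the real one would cover the selected fields.
NAME_IDS = {"host": 1, "user-agent": 2, "content-type": 3, "content-length": 4}

def parse_http_header(raw: bytes):
    """Split an HTTP message into its start line and (name, value) pairs."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("ascii", errors="replace")
    lines = head.split("\r\n")
    start_line = lines[0].split(" ", 2)  # fields are separated by SP
    fields = []
    for line in lines[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            fields.append((name.strip().lower(), value.strip()))
    return start_line, fields

def xor_digest(text: str, width: int = 4) -> int:
    """XOR the ASCII bytes of `text` word by word into one integer."""
    data = text.encode("ascii", errors="replace")
    data += b"\x00" * (-len(data) % width)
    acc = 0
    for i in range(0, len(data), width):
        acc ^= int.from_bytes(data[i:i + width], "big")
    return acc

def digitise(fields):
    """Map each (name, value) pair to (dictionary id, XOR summary)."""
    return [(NAME_IDS.get(name, 0), xor_digest(value)) for name, value in fields]
```

Request, response, and general headers all pass through the same `digitise` call, which mirrors the uniform processing described above.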

HTTP payload features
For users, the effective data payload of the entire transmission is the message body of the HTTP message. The message body directly involves user privacy, so it should not be parsed and analysed. Instead, statistical information, or feature information in other dimensions obtained through mathematical transformation, can reflect the similarity of payloads. To reduce the computational cost, a statistical method is adopted for the HTTP payload, and eight features are extracted. Taking bytes as the statistical object, the value range of a byte, 0 to 255, is divided equally into eight intervals, each of width 32. The values of all bytes are counted per interval. Therefore, the features of the HTTP payload are not independent of each other, and their feature type is neither a flag nor a scalar.
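The byte statistics described above amount to an eight-bin histogram, which can be sketched as follows. The function name is hypothetical, and the sketch returns raw counts; whether the prototype normalises them is not stated in the text.

```python
def payload_histogram(payload: bytes, bins: int = 8):
    """Count payload bytes per equal-width interval of the range 0..255."""
    width = 256 // bins          # 32 when bins == 8
    counts = [0] * bins
    for value in payload:        # iterating bytes yields integers 0..255
        counts[value // width] += 1
    return counts
```

Because the eight counts always sum to the payload length, they are not independent of one another, which matches the observation at the end of the paragraph above.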

Build TCN-based classifier
We introduce the Multi-class S-TCN designed in this paper here.It mainly addresses the needs of malicious traffic identification in IoT scenarios with extensive network data streams and strengthens the training performance of the model.It is worth mentioning that the data obtained by the procedure described above is used as the model's input.

Traditional TCN
From a macro perspective, as shown in Figure 2, a TCN is composed of a series of identical residual blocks in sequence. In each residual block, the operation shown in Figure 3 is performed repeatedly. The data entering the module is first padded, and then a one-dimensional expansion (dilated) convolution is performed. After the convolution, the data is longer than the input, so it is cut to make its size consistent with the input, and the above steps are performed again to complete a residual convolution. The input and output lengths of the TCN are controlled by the padding, expansion convolution, and cutting operations. Each input sample consists of N vectors, and the length of each vector is L. After p elements are padded at each end, the length is $L + 2p$; after the convolution with a kernel of effective size $S'$, the length is $L + 2p - S' + 1$. Because the convolution kernel has been expanded with a coefficient of d, the size of the new convolution kernel is $S' = d(S - 1) + 1$, and the length of the vector after convolution is $L + 2p - d(S - 1)$. Finally, the length of the output after cutting the p elements at the end of the vector is $L' = L + p - d(S - 1)$. Because the number of elements to be padded and cut is $p = d(S - 1)$, the length of the final output vector is $L' = L$, which is the same as the length of the input.
Here, taking a nine-element sequence $[x_0, x_1, \ldots, x_8]$ as an example, after three residual blocks with initial convolution kernel size 3, the expansion coefficients d are {1, 2, 4} and the corresponding padding and cutting coefficients p are {2, 4, 8}, respectively. Figure 3 presents, at the level of individual calculations, the computation and data flow of the residual blocks layer by layer from bottom to top. The grey squares in the figure are valid data, which come from the input data or the output of each layer of residual blocks. The white squares are padded data. After the residual module of each layer is padded and cut, the first L elements of the convolution result are retained as the output passed to the next layer.
The lines in Figure 3 represent the convolution operations. The black lines mark the relationship between the last element of the output vector and all the elements that participate in computing it. As the calculation moves up the graph, the useful information gradually moves to the right. The n-th element of the output sequence is calculated from, and affected only by, the first n elements of the input sequence, which forms a causal relationship. The further left an output element lies, the fewer input elements affect it; the further right it lies, the longer its memory. This is similar to the RNN structure. However, the number of calculations in the RNN structure is proportional to the length of the input sequence, whereas in the TCN it is proportional to the square root of the sequence length, which significantly improves the calculation speed. In addition, in an RNN the influence of the initial input elements on the output is weakened at every step, which is particularly detrimental to long-term memory. In the TCN, the number of such weakening steps is also proportional to the square root of the sequence length. In other words, compared with the RNN, the TCN has a significant advantage in long-term memory.
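The padding coefficients of this example follow directly from $p = d(S - 1)$, and they also determine how far back the last output element can see. The short sketch below checks both; it assumes one convolution per residual block unless `convs_per_block` says otherwise, and the function names are hypothetical.

```python
def padding(kernel_size: int, dilation: int) -> int:
    """Elements to pad (and cut) for one expansion convolution: p = d(S - 1)."""
    return dilation * (kernel_size - 1)

def receptive_field(kernel_size: int, dilations, convs_per_block: int = 1) -> int:
    """Number of input elements that can influence one output element."""
    return 1 + convs_per_block * sum(padding(kernel_size, d) for d in dilations)

pads = [padding(3, d) for d in (1, 2, 4)]   # [2, 4, 8], as in the example
field = receptive_field(3, (1, 2, 4))       # 1 + 2 + 4 + 8 = 15
```

With a receptive field of 15 elements, the last output of the nine-element example is reachable from every input element, which is consistent with the black lines described for Figure 3.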

Simplified TCN (S-TCN)
In the residual module of the ordinary TCN, elements are padded equally at both ends, and the cutting operation is performed after the expansion convolution. We make the following improvement: if only p elements are padded on the left side before convolution, the length of the output sequence is $L' = L + p - d(S - 1)$. Letting $p = d(S - 1)$, the output length remains L, so the output does not need to be cut, and the microstructure becomes as shown in Figure 2. The same convolution operation is performed as in the original structure, and the data flow remains the same. The n-th element of the output sequence is affected by, and only by, the first n elements of the input; that is, the output is identical to that of the original structure and is not affected by the change. The remaining module structure is shown in Figure 4.
Consider the amount of calculation generated by convolution when the number of residual modules is Q, the size of the convolution kernel is S, and the input length is L. It is easy to see from Figures 1 and 2 that, for the output sequence length $L_i$ of the i-th residual module, the required number of multiplications is $L_i \times S$ and the required number of additions is $L_i \times (S - 1)$. For the TCN implemented on the original structure, the required multiplications $C_m$ and additions $C_a$ are shown in Equation (1) and Equation (2), respectively. After improving the structure, the required multiplications and additions are shown in Formula (3) and Formula (4), respectively. Formula (5) and Formula (6) show how many times more multiplication and addition operations the original method requires than the improved one. When there are more layers, the amount of calculation increases exponentially. Here we only consider a network with one input sequence and one convolution kernel. If the number of input sequences and convolution kernels is not one, the neural connections between the sequences will also be significantly reduced. Therefore, the optimisation proposed in this section is necessary to improve computing performance: it significantly saves computing resources and improves computing speed.
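The claim that left-only padding leaves the output unchanged can be checked numerically. The following is a minimal pure-Python sketch, not the prototype's implementation; the function names and the example kernel weights are hypothetical.

```python
def dilated_conv(x, w, d):
    """Valid 1-D expansion convolution: y[t] = sum_k w[k] * x[t + k*d]."""
    out_len = len(x) - d * (len(w) - 1)
    return [sum(w[k] * x[t + k * d] for k in range(len(w)))
            for t in range(out_len)]

def tcn_layer(x, w, d):
    """Original structure: pad p zeros at both ends, convolve, cut p at the end."""
    p = d * (len(w) - 1)
    padded = [0.0] * p + list(x) + [0.0] * p
    y = dilated_conv(padded, w, d)   # length L + p
    return y[:len(x)]                # cutting restores length L

def stcn_layer(x, w, d):
    """Simplified structure: pad p zeros on the left only, no cutting."""
    p = d * (len(w) - 1)
    padded = [0.0] * p + list(x)
    return dilated_conv(padded, w, d)  # length is already L

x = [float(v) for v in range(1, 10)]   # nine-element input
w = [0.5, -1.0, 2.0]                   # example kernel of size S = 3
same = all(tcn_layer(x, w, d) == stcn_layer(x, w, d) for d in (1, 2, 4))
```

The retained outputs never read the right-hand padding, so `same` evaluates to True; the simplified layer merely skips computing the p values that the original structure computes and then discards.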
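The per-module counts $L_i \times S$ and $L_i \times (S - 1)$ can be accumulated for both structures with a small sketch. It rests on assumptions that go beyond the text: one convolution per residual module, the convolution output length $L + p_i$ for the original structure (before cutting) and $L$ for the improved one, and the dilations {1, 2, 4} of the earlier example. The function name is hypothetical.

```python
def count_ops(L, S, dilations, simplified):
    """Return (multiplications, additions) summed over all residual modules."""
    mults = adds = 0
    for d in dilations:
        p = d * (S - 1)
        out_len = L if simplified else L + p   # L_i of this module
        mults += out_len * S
        adds += out_len * (S - 1)
    return mults, adds

original = count_ops(9, 3, (1, 2, 4), simplified=False)   # (123, 82)
improved = count_ops(9, 3, (1, 2, 4), simplified=True)    # (81, 54)
```

Under these assumptions the saving grows with the dilation, because the discarded tail of each original convolution has length $p_i = d_i(S - 1)$.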

Multi-class S-TCN
This paper uses S-TCN as the core of the traffic recognition neural network so that the network can capture the causal relationships in the data stream. The features of each data packet can be represented as a vector, and all the feature vectors form a vector sequence. This sequence forms the feature matrix X. Each data packet in the TCP data stream has a feature vector of the same size L. In a TCP session, the features at the same position in the feature vector of each Pcap packet form a feature sequence. That is, the features in the sequence are arranged in the order of packet capture, and there may be a causal relationship in time that the TCN can capture. As mentioned before, in the feature matrix X formed from the data of the TCP data stream, each row is a vector composed of the features of one data packet, which is called the feature vector in this paper.
This paper proposes Multi-class S-TCN, a multi-class neural network based on S-TCN. Its structure is shown in Figure 5. The data first undergoes a padding operation. It then passes through a layer of Fully Connected Feedforward Neural Network (FCFNN), marked as the specification layer, and is input to the TCN after a transposition operation. The output matrix is flattened into a vector, which enters another FCFNN layer for prediction, and the most likely label is selected through the Softmax function. The role and implementation of each sub-module are as follows:
Padding: After the feature extraction of a TCP session sample is completed, the result is not a complete matrix. It lacks the feature vectors of some data packets and some internal features of some data packets. Because the input is required to be a fixed-length vector sequence, the data needs to be padded. Before the feature data of the sample enters the neural network, it is padded with zero elements at the end of each vector and of the matrix, so that the feature data becomes an input matrix X that meets the requirements.
FCFNN: Although the feature normalisation operation has been carried out, the problem of balancing the value ranges and probability distributions of the features within each feature vector still exists. A fully connected neural network can solve this problem, so the initial specification layer is a layer of a fully connected neural network. Here, the number of columns of the output matrix should be set according to the actual effect and the length of the feature vector.
Transpose: Each row of the data matrix output by the FCFNN layer still represents the feature information of one data packet, and each column is the sequence of the features at the same position. Therefore, before entering the TCN, a transpose operation is performed so that the input matches the sequence processing of the TCN.
Flatten: As described above, the data output by the TCN layer is still a matrix. The matrix needs to be changed into a vector so that the data matches the input of the prediction layer. The flattening operation rearranges all the data in the matrix into a vector.
Prediction: The final task of Multi-class S-TCN is classification; that is, it takes sample features as input and outputs a sample type label. The number of sample types is known, so the number of outputs C of the prediction layer should equal the number of sample types. The final output is a vector of length C, $y = [y_0, y_1, \ldots, y_{C-1}]$.
Softmax: If we only select the most likely label, the Softmax function is not needed. To calculate the distance between each label and the actual label, the Softmax function is added at the end; see Formula (7). Substituting the data into the prediction layer gives $y_i$, the probability that the label is i. If only this classification model participates in the judgment, the output is only the numeric label i corresponding to the largest $y_i$. If there are multiple models, the vector $y$ can be used as a score and combined with the score of another network for judgment.
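The first and last sub-modules above can be sketched in pure Python. The sketch uses the standard Softmax definition, $e^{y_i} / \sum_j e^{y_j}$; the function names are hypothetical, and truncating inputs that exceed the target size is an assumption, since the text only describes padding.

```python
import math

def pad_features(packets, num_packets, num_features):
    """Zero-pad a ragged list of feature vectors to num_packets x num_features."""
    rows = []
    for vec in packets[:num_packets]:            # truncation is an assumption
        row = list(vec[:num_features])
        row += [0.0] * (num_features - len(row))  # pad missing features
        rows.append(row)
    while len(rows) < num_packets:                # pad missing packets
        rows.append([0.0] * num_features)
    return rows

def softmax(y):
    """Standard Softmax, shifted by max(y) for numerical stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(y):
    """Index of the largest Softmax probability (same as argmax of y)."""
    probs = softmax(y)
    return max(range(len(probs)), key=probs.__getitem__)
```

Since Softmax preserves the ordering of its inputs, `predict_label` returns the same index with or without it, which is why the text notes that Softmax is only needed when the probabilities themselves are used, for example as a score to combine with another model.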

Evaluation
The

Dataset I
The first dataset uses five standard application layer protocols and takes the first 50 bytes of the TCP session payload as the identification input. As shown in Table 3, the application layer protocol types of the five experimental samples are HTTP, TLSv1.2, SMB, OCSP, and TLSv1.

Dataset II
The  devices to simulate attacks on the network to form network data containing malicious traffic.All traffic data in the dataset can be regarded as traffic flowing through a network node.
The traffic generation realistically simulates seven different network attack scenarios, including Brute Force, DoS, Botnet, and Web Attack, which are close to actual data and meet the dataset requirements of the experiments in this paper. In addition to the primary binary traffic data, the CSE-CIC-IDS2018 dataset also provides the system log of each device and the feature data obtained by analysing the data stream. Because of the application scenarios and main detection methods of our solution, this paper only uses the binary network traffic data encapsulated in Pcap format.
Because the feature extraction part of this paper uses HTTP as the application layer protocol for in-depth analysis, the experiments and tests focus on HTTP malicious traffic types. This paper uses Botnet, DoS, and Web Attack as the malicious traffic types. Web Attack is composed of Brute Force, XSS, and SQLi, and DoS refers to both traditional DoS and DDoS. Benign means normal traffic. In this dataset, all traffic captured during the test time is packaged per device. Malicious data and normal data can be distinguished by the network layer address. Therefore, malicious and normal IPv4 packets can be labelled by the designated IPv4 address.
Through the data preprocessing described above, four types of samples (Botnet, DoS, Web Attack, and Benign) can be screened out from the original data. Among them, Botnet, DoS, and Web Attack are malicious samples, and Benign indicates normal samples. Each sample is based on a TCP data stream and contains a specific number of TCP data segments. The data of each TCP data segment contains up to 40 specific features, and all values of the entire feature sequence are restored before the experiment. Considering time cost and equipment, only the part of the data with sufficient sample size is selected for testing. The number of TCP data stream feature samples participating in the evaluation is shown in Table 4.

Binary classification evaluation metric
When there are only two types of samples, the sample type the researcher is concerned with is called the positive sample, such as the malicious sample in this paper. The other is called the negative sample, such as the benign sample in this paper. TN and TP represent samples that were correctly identified as negative and positive, and FN and FP represent samples that were incorrectly classified as negative and positive. We will evaluate our work using $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$, $\text{Precision} = \frac{TP}{TP + FP}$, $\text{Recall} = \frac{TP}{TP + FN}$, and the F-measure (Fawcett, 2004).
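These metrics can be computed from the four counts as in the following sketch. The F-measure is taken here to be the harmonic mean of Precision and Recall, which is an assumption since the text only cites it; returning 0.0 for an undefined ratio is also a choice of this sketch, and the function name is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall, and F-measure from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```

For instance, `binary_metrics(8, 2, 85, 5)` describes 100 samples of which 8 malicious ones were caught, 5 were missed, and 2 benign ones were flagged by mistake.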

Multi-classification evaluation metric
In the multi-class classification task, the matrix C is formed by counting the predicted labels against the real labels. The element $c_{i,j}$ in row i and column j of C is the number of predictions whose predicted label is i and whose true label is j. When i = j, $c_{i,i}$ is the number of correctly classified samples of the type with label i. Accuracy, Precision, Recall, and F1-measure are extended to multiple classes as follows. $\text{Multi-Accuracy} = \frac{\sum_{i} c_{i,i}}{\sum_{i} \sum_{j} c_{i,j}}$ is the ratio of correctly classified samples over all sample types, so there is only one such index for a multi-class task. Following the meaning of Precision and Recall, $\text{Multi-Precision}_i = \frac{c_{i,i}}{\sum_{j} c_{i,j}}$ and $\text{Multi-Recall}_i = \frac{c_{i,i}}{\sum_{j} c_{j,i}}$ can be obtained for each sample type i. Multi-F1 is the harmonic mean of Multi-Precision and Multi-Recall, as shown in Formula (9): $\text{Multi-F1}_i = \frac{2 \cdot \text{Multi-Precision}_i \cdot \text{Multi-Recall}_i}{\text{Multi-Precision}_i + \text{Multi-Recall}_i}$.
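The multi-class metrics can be read off the matrix C as sketched below, keeping the convention that rows index the predicted label and columns the true label. The function name is hypothetical, and undefined ratios are reported as 0.0 by choice of this sketch.

```python
def multi_metrics(c):
    """Per-class precision, recall, F1 and overall accuracy from matrix C.

    c[i][j] is the number of samples predicted as i whose true label is j.
    """
    n = len(c)
    total = sum(sum(row) for row in c)
    accuracy = sum(c[i][i] for i in range(n)) / total
    per_class = []
    for i in range(n):
        predicted_i = sum(c[i])                    # row sum: predicted as i
        true_i = sum(c[r][i] for r in range(n))    # column sum: truly i
        precision = c[i][i] / predicted_i if predicted_i else 0.0
        recall = c[i][i] / true_i if true_i else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, per_class
```

A single accuracy value is returned for the whole matrix, while precision, recall, and F1 are returned once per sample type.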

Application layer protocol identification evaluation
We verify through experiments whether the application layer protocol identification method proposed in this paper is feasible. The experiment uses the dataset described in Section 4.1.1, and the ratio of training to testing samples is 7:3. For the first 50 bytes of the payload of each TCP session, each byte is used as an 8-bit integer. If the payload is shorter than 50 bytes, it is padded with 0 and then used as the input of the decision tree. The decision tree used here is the implementation provided by Scikit-learn (Pedregosa et al., 2011), and many experiments were performed. The best experimental results are shown in Table 5. Each entry in the table indicates the rate at which samples of the row type are predicted as the given type. For example, the first column of data shows that 99.93% of the test samples that are of HTTP type are recognised as HTTP, and 0.07% are recognised as OCSP. Because the malicious traffic detection method uses the HTTP protocol as an example, this result is particularly emphasised here.
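The input preparation for this experiment can be sketched with the standard library alone; the resulting vectors could then be passed to a decision tree such as the Scikit-learn implementation used above. The function names and the fixed random seed are hypothetical.

```python
import random

def prefix_features(payload: bytes, n: int = 50):
    """First n payload bytes as 8-bit integers, zero-padded to length n."""
    head = list(payload[:n])
    return head + [0] * (n - len(head))

def split_7_3(samples, seed=0):
    """Shuffle the samples and split them 7:3 into training and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = (len(items) * 7) // 10
    return items[:cut], items[cut:]

vector = prefix_features(b"GET / HTTP/1.1\r\n")
```

Here `vector` starts with 71, 69, 84 (the ASCII codes of "GET") and ends with zeros, which illustrates the kind of fingerprint a decision tree can split on.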

DPI-based feature extraction evaluation
This experiment uses the dataset described in Section 4.1.2. The extracted features are mainly divided into data packet features, application layer header features, and application layer payload features. To evaluate whether these features play a role in the recognition and detection process, we masked some features and conducted experiments to determine the impact of each feature group on the results. The number of features of the three types and their positions in the vector are shown in Table 6.
To correspond to a specific malicious sample type, the number of output neurons of Multi-class S-TCN is adjusted to two, which means that the adjusted Multi-class S-TCN becomes a binary classifier. When only one part of the features is masked, the Loss change during training and the recognition accuracy over successive experiments are shown in the first column of Figure 6. When only one of the three parts is valid, only the features of this part are retained, and the result is shown in the second column of Figure 6. The curves in Figure 6 are smoothed.
The experimental results show that data packet features and HTTP payload features alone each have a good classification effect for botnet-type malicious traffic samples. This indicates that, in addition to the prominent behavioural features of botnet-type malicious traffic, the data payload also has similar statistical features. Classification with both types of information performs better than with either one alone. For the DoS samples, although the HTTP data payload can also reflect similarities, the samples can be identified 100% by data packet features alone, which shows that DoS-type malicious traffic has an obvious similarity in data packet features. With the features of the HTTP header alone, the Loss curve hardly converges, so it can be inferred that the features of the HTTP header do not help identify DoS-type malicious traffic. Finally, for the Web Attack samples, masking the packet features and HTTP header features reduces the classification accuracy. These two types of features are helpful for recognising Web Attacks, whereas payload features do not help detect Web Attacks.
This experiment proves that the DPI-based features prepared for Multi-class S-TCN are very suitable for malicious traffic detection without any redundant feature extraction work.
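The two shielding modes used in this evaluation (mask one feature type, or keep only one) can be sketched as below. The feature positions are listed in Table 6 and are not reproduced here, so the slices in `FEATURE_GROUPS` are purely hypothetical placeholders. Shielded positions are replaced with zeros, in line with the all-zero padding denoted N in the figure captions.

```python
# Hypothetical layout: the real counts and positions are given in Table 6.
FEATURE_GROUPS = {
    "P": slice(0, 4),   # data packet features
    "H": slice(4, 7),   # application-layer header features
    "B": slice(7, 10),  # application-layer payload features
}


def shield(vector, groups):
    """Return a copy of `vector` with the named feature groups zeroed out."""
    masked = list(vector)
    for name in groups:
        span = FEATURE_GROUPS[name]
        masked[span] = [0] * len(masked[span])
    return masked


def keep_only(vector, group):
    """Return a copy of `vector` in which only `group` is left unshielded."""
    others = [name for name in FEATURE_GROUPS if name != group]
    return shield(vector, others)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(shield(sample, ["H"]))   # header features masked
print(keep_only(sample, "B"))  # only payload features retained
```

Zeroing a group rather than deleting it keeps the input shape unchanged, so the same network architecture can be trained under every shielding configuration.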

Multi-class S-TCN evaluation
Multi-class S-TCN is the detection network of our design. We use three experiments to evaluate it, covering the detection performance, the DPI-based feature sensitivity, and the cost of Multi-class S-TCN. In addition, we selected SVM, KNN, DNN, and RNN for comparison. All models use the same training set and test set and are trained until convergence. The configurations of these approaches are as follows.
SVM and KNN: SVM (Cortes & Vapnik, 1995) and KNN (Guo et al., 2003) only accept feature input in vector form, so the original data is expanded into a TCP data stream feature vector of length 2000. As baselines from traditional machine learning, SVM and KNN use the SVC classifier and the KNeighborsClassifier provided by Scikit-learn. Scikit-learn has optimised both methods extensively to achieve higher performance. SVC uses RBF as the kernel function. KNeighborsClassifier uses the Euclidean metric to calculate distance, uses the reciprocal of the distance as the weight in decision-making, and sets the K value to 50.
DNN: A typical deep, fully connected neural network (Glorot & Bengio, 2010; Schmidhuber, 2015) is used in the experiment. The model has five hidden layers with [2000, 1000, 100, 100, 100] neurons, respectively, followed by one layer of four neurons for prediction. Finally, the Softmax function outputs the classification probabilities, and the class with the highest probability is the final classification result.
RNN (Lipton et al., 2015): The TCP data stream feature matrix extracted in this paper is a vector sequence, so it can be input to the RNN directly as a sequence of vectors. The detection task in this paper is a multi-class classification task. Therefore, as with Multi-class S-TCN, the effective output of the RNN is flattened and then fed to one fully connected layer for prediction. Finally, the label with the highest probability after the Softmax function is taken as the result. The hidden layer of the RNN has 50 neurons, so each output vector has length 50.
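To make the KNN baseline's decision rule concrete (Euclidean distance, reciprocal of the distance as the vote weight, K nearest neighbours), here is a minimal pure-Python sketch. It is not Scikit-learn's implementation. The function name `knn_predict`, the toy data, and the handling of zero distances are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def knn_predict(train, query, k=50):
    """Distance-weighted KNN vote.

    train: list of (feature_vector, label) pairs.
    Each of the k nearest neighbours votes with weight 1 / distance.
    """
    neighbours = sorted(
        ((math.dist(vec, query), label) for vec, label in train),
        key=lambda pair: pair[0],
    )[:k]

    votes = defaultdict(float)
    exact = [label for dist, label in neighbours if dist == 0.0]
    if exact:
        # Assumption of this sketch: an exact match would have infinite
        # weight, so exact matches vote alone with equal weight.
        for label in exact:
            votes[label] += 1.0
    else:
        for dist, label in neighbours:
            votes[label] += 1.0 / dist
    return max(votes, key=votes.get)


# Toy 2-D data, for illustration only.
train = [
    ((0.0, 0.0), "benign"),
    ((0.0, 1.0), "benign"),
    ((5.0, 5.0), "dos"),
    ((6.0, 5.0), "dos"),
    ((5.0, 6.0), "dos"),
]
print(knn_predict(train, (1.0, 1.0), k=5))
```

In the toy data, the query has three "dos" neighbours and only two "benign" ones among its five nearest, but the two "benign" points are much closer. Their reciprocal-distance weights therefore dominate, which is the effect of distance weighting compared with a plain majority vote.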

Detection performance
The first experiment also uses the dataset described in Section 4.1.2. A four-class classification experiment was performed on three different types of malicious samples and benign samples. In the experiment, the ratio of training to testing samples was 7:3. The number of Web Attack samples is very small, and the training of many machine learning classifiers is easily affected by such an imbalance in the number of samples. Therefore, the samples of this type are replicated 100 times during training.
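A minimal sketch of this data preparation is given below: a 7:3 split followed by replication of the minority class in the training part only. The paper does not state whether the split is random or whether "replicated 100 times" means 100 copies in total, so this sketch assumes a shuffled split and 100 total copies. The function name and label strings are hypothetical.

```python
import random


def split_and_oversample(samples, minority_label, copies=100, seed=0):
    """Split (features, label) pairs 7:3, then replicate the minority class.

    Replication is applied to the training part only, so duplicated
    samples never appear in the test set.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)

    cut = len(shuffled) * 7 // 10          # 7:3 ratio, integer arithmetic
    train, test = shuffled[:cut], shuffled[cut:]

    balanced = []
    for features, label in train:
        repeat = copies if label == minority_label else 1
        balanced.extend([(features, label)] * repeat)
    return balanced, test


# Toy data: 18 benign samples and 2 Web Attack samples, illustration only.
data = [([i], "benign") for i in range(18)] + [([100 + i], "web_attack") for i in range(2)]
train_set, test_set = split_and_oversample(data, "web_attack")
print(len(train_set), len(test_set))
```

Replicating after the split keeps the test distribution untouched, so the reported metrics still reflect the original class proportions.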
The results are shown in Figure 7. All evaluation metrics of Multi-class S-TCN are higher than 99%. The figure also shows the detection results of several commonly used machine learning models, and Multi-class S-TCN performs best among them.
To better evaluate this work, we also compare the performance of Multi-class S-TCN with other approaches on the same dataset. Since CSE-CIC-IDS2018 is a widely used dataset, there is already a large body of work that uses it for validation. We chose three recently published papers for comparison: Kim et al. (2020), Lin et al. (2019), and Zhao et al. (2020). The results are shown in Table 7.
The results show that our solution is very promising compared with the existing work. Only the accuracy of Kim et al. is slightly higher than ours, while our solution has an overwhelming advantage in terms of recall and precision.

DPI-based feature sensitivity
The second experiment further compares and analyses the significance of three types of features (packet features, application-layer head features, and application-layer payload features, hereafter referred to as P, H, and B, respectively) in DPI-based feature extraction for different types of attack traffic. The results are shown in Figure 8. This experiment shows that, whichever shielding method is used and whichever type of malicious sample is detected, the classifier based on Multi-class S-TCN obtains the best F-measure score in almost all cases. The only exception is when the packet features are not shielded, where its detection performance on Web Attack samples is slightly worse than that of SVM. KNN lags significantly behind the other methods in these settings. The detection performance of the DNN drops below 0.7 when only H features are available. Under the other shielding methods, the detection performance on Web Attack samples is also fragile. The performance of the SVM and the RNN is relatively balanced, but their F-measure values are slightly lower than those of our model on average. In general, Multi-class S-TCN performs well in every situation, which means it is stable and suitable for malicious traffic detection.

Detection cost
The third experiment evaluates the cost of Multi-class S-TCN. The system in the final solution needs to be deployed on edge servers or switches and is sensitive to delay. The speed of the model is therefore subject to strict requirements, which is why S-TCN is used as the classifier in this paper. In this experiment, we compare the calculation time that each method requires per classification, as shown in Table 8.
The results show that Multi-class S-TCN has the fastest detection speed. As mentioned earlier, high detection speed is important for deployment in production environments. In addition, our work is an improvement on traditional TCN, which means that Multi-class S-TCN is well suited to parallelism, just like traditional TCN. Parallelism can further improve the detection speed and thus achieve better performance in production environments.
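One common way to obtain a minimum time cost per sample is to time several passes over a batch and keep the fastest pass, which reduces the influence of background load. The paper does not describe its measurement procedure, so the following is only a sketch under that assumption. `min_time_per_sample` and the stand-in classifier are hypothetical.

```python
import time


def min_time_per_sample(classify, samples, repeats=5):
    """Return the smallest observed average time (seconds) per classification.

    The whole batch is classified `repeats` times, and the fastest pass
    is divided by the number of samples.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in samples:
            classify(sample)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed / len(samples))
    return best


def dummy_classifier(vector):
    """Stand-in for a trained model: thresholds the feature sum."""
    return 1 if sum(vector) > 100 else 0


batch = [[i % 7] * 50 for i in range(1000)]
cost = min_time_per_sample(dummy_classifier, batch)
print(f"{cost * 1e6:.2f} microseconds per sample")
```

For a fair comparison, every model would need to be timed on the same hardware with the same batch, and any one-off initialisation (model loading, warm-up) should be done before timing starts.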

Conclusion
This paper proposes an improved TCN named Multi-class S-TCN and, based on it, designs a DPI-based malicious traffic detection solution for the IoT environment. The solution extracts traffic features on the basis of protocol identification and then completes AI-driven malicious traffic detection with the Multi-class S-TCN model. Through a series of experiments, we show that the features selected in this paper are effective and that the Multi-class S-TCN model has good detection performance. Compared with existing approaches, this work offers high accuracy, fast detection speed, and support for parallel detection, making it better suited to the needs of IoT environments.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was partially supported by National Key R&D Program of China [grant number 2020YFC0832500], Ministry of Education - China Mobile Research Foundation [grant number MCM20170206], The Fundamental Research Funds for the Central Universities [grant number lzujbky-2019-kb51] and [grant number lzujbky-2018-k12], National Natural Science Foundation of China [grant number 61402210], Major National Project of High Resolution Earth Observation System [grant number 30-Y20A34-9010-15/17], State Grid Corporation of China Science and Technology Project [grant number SGGSKY00WYJS2000062], Program for New Century Excellent Talents in University [grant number NCET-12-0250], Strategic Priority Research Program of the Chinese Academy of Sciences [grant number XDA03030100], Google Research Awards and Google Faculty Award.

Figure 1. Flow chart of flow detection system.
The proposed S-TCN-based IoT novel malicious traffic detection method consists of several steps: 1. traffic capture; 2. application layer protocol identification; 3. DPI-based feature extraction; 4. malicious traffic detection. The first of these steps is based on the mature traffic capture tool provided by Pcap. The other steps are the main work of this paper, especially steps 3 and 4. DPI-based traffic feature extraction and the S-TCN-based traffic classifier ensure that this work can accurately and efficiently detect malicious traffic in the IoT environment. A series of experiments is set up to verify their effectiveness.

Figure 6. Results of DPI-based features evaluation. P stands for Packet features, H stands for Application layer head features, B stands for Application layer payload features, and N stands for all zero padding.

Figure 7. Comparison between S-TCN and other networks. TCN refers to Multi-class S-TCN.

Figure 8. Results of Multi-class S-TCN evaluation with shielded features. TCN refers to Multi-class S-TCN, P stands for Packet features, H stands for Application layer head features, B stands for Application layer payload features, and N stands for all zero padding. The first row, from left to right, shows the results with shielded data packet features, shielded protocol header features, and shielded protocol payload features. The second row, from left to right, shows the performance when using only data packet features, only protocol header features, and only protocol payload features.

Table 5. Application layer protocol recognition effect.

Table 7. Comparison between Multi-class S-TCN and existing approaches.

Table 8. The minimum time cost per sample. TCN refers to Multi-class S-TCN.