A new approach to software vulnerability detection based on CPG analysis

Abstract Detecting source code vulnerabilities is an essential issue today. In this paper, to improve the efficiency of detecting vulnerabilities in software written in C/C++, we propose combining a Deep Graph Convolutional Neural Network (DGCNN) with the code property graph (CPG). The proposed method has three main phases. Phase 1 builds feature profiles of the source code, using techniques such as Word2vec and one-hot encoding to standardize and analyze the source code. Phase 2 extracts features of the source code from these feature profiles using the DGCNN model. Phase 3 classifies the source code, based on the features extracted in phase 2, as normal source code or source code containing security vulnerabilities. Experimental scenarios comparing the proposed method with other approaches show that it achieves better results. These results indicate that the method is sound and that it opens up a new approach to the task of detecting source code vulnerabilities.


Problem
A vulnerability is a weakness that exists in a system and that attackers can exploit, damaging the safety and security attributes of that system, including confidentiality, integrity, and availability (Shen and Chen, 2020). In that research, source code vulnerabilities are defined and classified in two ways: according to software defects, and according to the software development process. Common Vulnerabilities and Exposures (CVE) (http://cve.mitre.org) reports the danger level of current source code security vulnerabilities. Therefore, the early detection of source code vulnerabilities is now an urgent issue.
To detect source code security vulnerabilities, Z. Li et al. (2018) pointed out some main approaches, including static analysis and dynamic analysis. Among these, the static analysis method uses a combination of techniques such as pattern matching, lexical analysis, data flow analysis, and some analysis methods based on the abstract syntax tree (AST) (Dam et al., 2018; Grieco et al., 2016; Wei
To overcome the above situation, recent studies have tried to analyze the source code into code representation graphs such as AST, CFG (Control Flow Graph), and Program Dependence Graph (PDG), and then used classification algorithms (machine learning and deep learning) such as Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), etc. However, these approaches have the following problems:
(1) The method of representing and synthesizing source code features: With certain data sets or certain vulnerability types, using one of these methods (AST, CFG, or PDG) alone to represent source code security vulnerabilities can bring good results, but it is less effective for other, unknown vulnerabilities. The reason is that each of these source code representation methods focuses on only one feature type of the source code, so it loses other important features. This makes it difficult for these methods to build an information profile that fully represents the features of the source code.
(2) The method of extracting source code features: With traditional approaches, after the source code is synthesized and normalized, it is fed to popular machine learning and deep learning models to search, extract, and train features. However, this way of extracting and training loses important features of the source code, because some features are lost or hidden in the process of normalization and parameter synchronization.
To solve the above problems, in this paper, we propose a new approach that combines deep learning graph networks with a method of building feature profiles of source code. The proposed method addresses the two problems above as follows. For problem (1), instead of using a single representation method such as AST, CFG, or PDG, the article proposes to use CPG to represent the source code and then to use some natural language processing methods to represent the source code information. For problem (2), instead of using traditional deep learning networks, we use DGCNN for source code feature extraction. The proposed method consists of the following tasks:
(1) Task 1: Building feature profiles of the source code. In this study, we propose a method of synthesizing source code features based on the CPG processing method and some natural language processing (NLP) models such as Word2vec and one-hot encoding. The main steps in task 1 include:
• Step 1: Representing the source code by using the CPG graph. This step represents the source code as a graph whose features are carried by vertices and edges. The process of source code processing and analysis using CPG graphs is presented in detail in Section "The method of building feature profiles of source code" of the paper.
• Step 2: Normalizing vertex features using the Word2vec model. At this step, after the source code is analyzed and represented as a CPG graph, it is further processed through the Word2vec model.
• Step 3: One-hot encoding. This step is responsible for embedding the attributes and features on edges.
• Step 4: Synthesizing source code features. At this step, the results of the computation and processing on the vertices and edges are synthesized into an information profile showing the features of the source code.
(2) Task 2: Extracting source code features using DGCNN. The purpose of this step is to use DGCNN to extract source code features based on the graph feature profile built in task 1. Details of this process are presented in section 3.3 of the paper.
(3) Task 3: Classifying the source code. This step is responsible for detecting vulnerabilities in the source code based on the source code features extracted in task 2.

Contribution
Based on the analysis and evaluation of the characteristics and processes of the proposed approach, the practical significance and scientific contributions of this research include:
• Proposing a method of building feature profiles of source code. To build a source code feature profile, we combine several data mining techniques, including the CPG source code representation technique, a data processing technique on the edges, a data processing technique on the vertices, etc. With such an approach, code representation graphs such as AST, CFG, and PDG are combined into a common data structure, so that the source code is represented in both its syntax and its semantics (syntax-based and semantics-based). In addition, some NLP methods such as Word2vec and one-hot encoding are used to normalize and process the data on vertices and edges so that as much of this data as possible is retained. This proposal is important for the task of detecting source code vulnerabilities because it helps to find and represent the relationships between components in the source code, thereby improving the efficiency of the source code vulnerability detection process.
• Proposing to use the DGCNN deep learning graph network for the task of extracting source code features. This is a new proposal for the process of analyzing and extracting features of source code based on graphs. The use of the DGCNN network helps the calculation and feature synthesis avoid loss and redundancy of graph data. The experimental results in the paper show that the DGCNN model is more effective than the other deep learning models evaluated for selecting and extracting graph features.
• Proposing a method of detecting source code security vulnerabilities based on the technique of building source code feature profiles and the DGCNN deep learning graph network. To our knowledge, this approach has not been published in previous studies and proposals. With this proposal, the basic features of the source code are exploited and trained explicitly, helping to improve the efficiency of the process of analyzing and evaluating abnormal features of source code. The experimental results presented in section "Result evaluations" of the paper support the soundness of the proposal.
The rest of the paper is organized as follows: In Section "Related Work", we examine some previous studies on the task of source code vulnerability detection. The contents related to the proposed method are analyzed and presented in Section "The proposed model". The experimental results and evaluations of the effectiveness of the proposed method are presented in Section "Experiments and evaluations". Finally, conclusions and future development directions are presented in Section "Conclusion and development directions" of the paper.

Related Work
Chen et al. (2017) pointed out that current vulnerability detection methods are often based on algorithms that traverse the AST graph and match AST nodes with vulnerability rules. This affects the detection speed because it takes time to match a large number of unnecessary rules. This study proposed an optimized rule-checking algorithm to improve the speed of matching rules on the AST tree, resulting in a 28.7% improvement compared to the original PMD. Tian et al. (2009) applied an improved Apriori algorithm for web application vulnerability detection to improve the ability to detect unknown vulnerabilities. This method used association rules to analyze the logical relationship between the components of the application. It could detect most of the vulnerabilities that exist on all pages of the target site. Hu et al. (2020) described some vulnerabilities such as memory leak, double-free, and use-after-free using the CFG and Pointer-related Control Flow Graph (PCFG) frameworks combined with two algorithms, Vulnerability Feature (VJVF) and Feature Judging, for the detection task. The results showed that this MRVDAVF method gave more effective results than some tools such as cppcheck, flawfinder, and splint on memory leak, double-free, and use-after-free vulnerabilities, but the range of vulnerabilities it could detect wasn't diverse. Lee et al. (2006)

Vulnerability detection using machine learning, deep learning
Rebecca L. Russell et al. (2018) argued that current popular source code vulnerability detection tools only detect a limited set of vulnerabilities based on a predefined ruleset. Meanwhile, applying machine learning and deep learning to vulnerability detection makes it possible to learn features directly from the source code. This is the basis for detecting more types of vulnerabilities. Harer et al. (2018) presented a data-driven approach to vulnerability detection using machine learning, applied to C and C++ programs. In their results, the deep learning model outperformed the traditional machine learning models, with an area under the ROC curve of 0.87. Tang et al. (2021) proposed a combination of the Extreme Learning Machine (ELM) model and Doc2Vec to represent the program. Besides, the authors demonstrated that Doc2Vec provided faster training and better generalization performance than Word2Vec. With the same idea, Do Xuan et al. (2022) presented a method to detect security vulnerabilities using machine learning algorithms and NLP methods. Wu et al. (2020) emphasized that source code feature extraction is an extremely important step that greatly affects the detection results. Because previous methods could not fully exploit this aspect, the authors extracted features using the CPG graph for the vulnerability detection task. M. Li et al. (2021) showed that previous deep learning-based methods ignored the connection between semantic graphs and didn't efficiently process information in graph form. The application of Graph Neural Network (GNN) models gave insight into the problem of security vulnerability detection. The experimental results of the study also showed that the graph-based model was better than the sequence-based model, especially for the semantic graph, by up to 5.01% in accuracy.
Wang et al. (2020) pointed out that approaches using recurrent networks such as Recurrent Neural Network (RNN), LSTM, and Bi-LSTM gave poor results on the task of detecting vulnerabilities. Conventional recurrent models designed for sequential data are not suitable for current code representation methods based on complex graphs such as AST, CFG, CPG, etc. Current GNN networks allow direct manipulation of graph data. This makes it possible to exploit the complex relationships in this data type to improve detection efficiency. Tal Ben-Nun et al. (2018) presented a natural language processing method combined with the inst2vec algorithm for detecting C and C++ source code vulnerabilities. In the experiment part, the authors compared their method with other approaches while applying basic machine learning and deep learning models such as RNN, Tree-Based CNN, etc. The results are higher than those of most approaches on the same experiment dataset. Besides, Yi M. Li et al. (2021) built the IVDetect model, which combines two main ideas: considering the vulnerable statements and their surrounding contexts via data and control dependencies, and artificial intelligence.

The model architecture
The functions and tasks of the sub-blocks in the proposed model shown in Figure 1 are:
(1) Source code: In this paper, the source code sets used are FFmpeg and Qemu (https://ffmpeg.org/download.html). These source codes are written in C and C++.
(2) Standardizing source code: In this block, the source code is pre-processed and normalized to remove empty data, spaces, etc.
(3) Building feature profiles of source code: To do this, we propose to use the following processing sub-blocks in this block:
• CPG block. This block represents the source code as a CPG graph by using the Joern tool (https://joern.io/). In this block, the Joern tool parses the syntax and semantics of the source code and represents the source code as a CPG graph.
• Vertex feature normalization block. When using the Joern tool, the source code is processed into vertices and edges. Next, the vertex information is processed and normalized into feature vectors by using the Word2vec model.
• Edge feature normalization block. This block processes edge data to normalize it into feature vectors. In this paper, we use the one-hot encoding method to normalize the edge features.
• Source code feature synthesis block. This block combines edge and vertex data into an information profile containing information about the structural and semantic features of the source code.
(5) Extracting source code features. In this paper, we propose to use the DGCNN model to extract features of source code represented in CPG form.
(6) Source code classification. This block classifies source code based on the features extracted by the previous block. The output of this block is the label of the source code (normal or containing security vulnerabilities).

Representing source code using the CPG graph
CPG was introduced by Yamaguchi et al. (2014). CPG combines 3 code representation graphs, namely AST (Yamaguchi et al., 2012), CFG (Gascon et al., 2013), and PDG (Ferrante et al., 1989), into a common data structure. A CPG graph consists of the following main components:
• Nodes and node types. Nodes represent the program structure. They include low-level language constructs such as methods, variables, and control structures, as well as higher-level constructs such as HTTP endpoints or findings. Each node has a type that specifies the kind of program structure the node represents. For example, a node with type METHOD represents a method, while a node with type LOCAL represents the definition of a local variable.
• Labeled edges. Relationships between program structures are represented through edges between the respective nodes. For example, to represent a method that contains a local variable, we can create an edge with the label CONTAINS from a METHOD node to the LOCAL node. By using labeled edges, we can represent many relationship types in the same graph. Multiple edges can exist between the same two nodes.
• Key-value pairs. Nodes contain key-value feature pairs, where the valid keys depend on the node type. For example, a method has at least a name and a signature, while a local declaration has at least the name and the type of the declared variable.
In this paper, to represent the source code as a CPG graph, we use the Joern tool. Joern is a platform that supports analyzing source code, bytecode, and binary code. Joern is developed with the goal of providing a support tool for vulnerability detection and research based on static analysis of source code. Given C/C++ program source code as input, the Joern tool analyzes the source code into a CPG graph consisting of edges and vertices.
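The three components listed above (typed nodes, labeled edges, key-value pairs) can be illustrated with a minimal sketch. This is not Joern's data model or API; the class names `Node` and `PropertyGraph`, the helper `out_edges`, and the toy graph are hypothetical and hand-built purely to show the structure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A CPG node: a type plus the key-value pairs valid for that type."""
    node_id: int
    node_type: str                      # e.g. "METHOD", "LOCAL"
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    """Directed multigraph with typed nodes and labeled edges."""
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src_id, dst_id, label)

    def add_node(self, node_id, node_type, **properties):
        self.nodes[node_id] = Node(node_id, node_type, dict(properties))

    def add_edge(self, src, dst, label):
        # Several edges with different labels may join the same two nodes.
        self.edges.append((src, dst, label))

    def out_edges(self, src, label=None):
        """Edges leaving `src`, optionally filtered by label."""
        return [e for e in self.edges
                if e[0] == src and (label is None or e[2] == label)]


# Toy graph for `int main() { int x; }` (hand-built, not Joern output).
cpg = PropertyGraph()
cpg.add_node(1, "METHOD", name="main", signature="int main()")
cpg.add_node(2, "LOCAL", name="x", type_name="int")
cpg.add_edge(1, 2, "CONTAINS")
cpg.add_edge(1, 2, "AST")   # a second, differently labeled edge
```

Keeping edges as a flat list of `(src, dst, label)` triples is one simple way to allow multiple edges between the same pair of nodes; a real CPG store would index them differently.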

The method of normalizing vertex features
As described above, the Joern tool analyzes the source code into the two main components of the CPG graph: edges and vertices. Each vertex of the graph contains three main pieces of information: the index of the vertex, the type of the vertex, and the code on that vertex. Next, the paper proposes a method to normalize this information. To perform vertex feature normalization, we propose 3 tasks as follows:
(1) Task 1: Normalizing vertex types.
Table 1 below describes the vertex types of the CPG graph.
From Table 1, it can be seen that after the source code is analyzed by the Joern tool, it has 39 different vertex types. Each vertex type represents a different property and characteristic of the source code. Next, we normalize these vertex types to obtain their feature vectors. To do this, we use one-hot encoding. One-hot encoding is the process of transforming each value into a binary vector in which exactly one element is 1 and all others are 0. Tomas et al. (2013) and Le and Mikolov (2014) presented the operating principle and implementation steps of the one-hot encoding process in detail. Thus, after being encoded, the vertex types become vectors of equal length, and each vector is unique.
(2) Task 2: Normalizing the code on the vertex.
We propose to use the Word2vec model to standardize this code. Word2vec is a method of representing a word as a distribution over its relationships with the remaining words (Le & Mikolov, 2014; Tomas et al., 2013). Word2vec uses a 2-layer neural network with only one hidden layer. Its input is a large corpus and its output is a vector space in which each unique word in the corpus is associated with a corresponding vector. In Word2vec, a distributed representation of a word is used to create a multidimensional vector. Each word is represented by the set of weights of the elements of its vector. Thus, instead of just having a one-to-one connection between an element in the vector and a word, the word representation is spread over all elements in the vector, and each element in the vector contributes to the definition of many different words. Word2vec has 2 models, Skip-gram and CBOW (Le & Mikolov, 2014; Tomas et al., 2013). This paper only uses the Skip-gram model to analyze and normalize the functions. When using the Skip-gram model, the input is a word in the sentence, and the algorithm looks at the words around it. The number of surrounding words to consider is called the "window size". To make training possible, a dictionary is first built from the text dataset, and each word in the dictionary is then described by a one-hot vector that is fed into the network. Basically, the Skip-gram model includes 3 main components: the input layer, the hidden layer, and the output layer.

Table 1 (continued). METHOD_REF: This node represents a reference to the method/function/procedure as it appears when a method is passed as an argument in a call.
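The Skip-gram input preparation described in Task 2 (a dictionary, one-hot word vectors, and a window of surrounding words) can be sketched as follows. This only generates the (center, context) training pairs; it does not train the network. The function names, the toy token sequence, and the window size of 2 are assumptions for illustration, not values from the paper.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) training pairs for one token sequence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


def one_hot(index, size):
    """Binary vector of length `size` with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical tokenized code taken from one vertex.
tokens = ["int", "x", "=", "foo", "(", ")", ";"]

# Dictionary built from the text: every distinct token gets an index.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

pairs = skipgram_pairs(tokens, window_size=2)

# Network input is the one-hot center word; the target is the context index.
encoded = [(one_hot(vocab[c], len(vocab)), vocab[ctx]) for c, ctx in pairs]
```

A larger window produces more pairs per word and captures looser co-occurrence; the value to use for source code tokens is a tuning choice.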
(3) Task 3: Combining the vertex features.
At this task, we combine the 2 vectors obtained in task 1 and task 2 into a single vector representing the node of the graph. Thus, starting from the discrete information collected and analyzed by the Joern tool, the study uses some NLP methods, including Word2vec and one-hot encoding, to normalize this information. The normalized information contains important values and features representing the vertex of the graph. These features carry the information that distinguishes source code containing vulnerabilities from normal source code.
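A minimal sketch of the three vertex tasks together is given below. Several things here are assumptions rather than details from the paper: only four of the 39 vertex types are listed (`CALL` is included purely for illustration), the `embeddings` dictionary is a hand-written stand-in for a trained Word2vec model, and averaging token vectors is just one possible way to obtain a single vector for the code on a vertex.

```python
# Illustrative subset of vertex types; the paper reports 39 in Table 1.
VERTEX_TYPES = ["METHOD", "LOCAL", "CALL", "METHOD_REF"]


def one_hot_type(vertex_type, types=VERTEX_TYPES):
    """Task 1: encode the vertex type as a one-hot vector."""
    vec = [0] * len(types)
    vec[types.index(vertex_type)] = 1
    return vec


def embed_code(tokens, embeddings, dim):
    """Task 2: average the vectors of known tokens (assumed aggregation)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def vertex_feature(vertex_type, tokens, embeddings, dim):
    """Task 3: concatenate the type vector and the code vector."""
    return one_hot_type(vertex_type) + embed_code(tokens, embeddings, dim)


# Toy 2-dimensional embeddings standing in for Word2vec output.
embeddings = {"x": [1.0, 0.0], "foo": [0.0, 1.0]}
feature = vertex_feature("LOCAL", ["x", "=", "foo"], embeddings, dim=2)
```

With these toy inputs every vertex vector has length 4 + 2; with the full type list and a real embedding size the same concatenation applies.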

Edge feature normalization
Table 2 describes edge types in the graph when source code is analyzed by the Joern tool.
From Table 2, it can be seen that there are 14 edge types analyzed and extracted through the Joern tool. To normalize edge information, the paper uses one-hot encoding. The way to conduct one-hot encoding is presented in task 1 of section "The method of normalizing vertex features" in the paper.

Synthesizing source code features
Thus, based on the process of calculating and normalizing the information presented in sections "The method of normalizing vertex features" and "Edge feature normalization", we have obtained information about the edges and vertices of the CPG graph. Next, this information is synthesized into an information profile. This information profile describes in detail the characteristics and features of the source code. The profile includes many nodes and many edges. We believe that building a source code information profile from the vertex and edge analysis of the CPG graph captures the characteristics of the source code that matter for vulnerability detection.
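One possible shape for such an information profile is sketched below: a node feature matrix, an adjacency matrix, and a one-hot attribute per edge. The paper does not specify the concrete layout, so the function `build_profile`, the dictionary keys, and the three-label edge subset are assumptions for illustration.

```python
# Illustrative subset of edge labels; the paper reports 14 in Table 2.
EDGE_TYPES = ["AST", "CFG", "CONTAINS"]


def build_profile(node_features, edges, edge_types=EDGE_TYPES):
    """Combine vertex vectors and labeled edges into one graph profile.

    node_features: {node_id: feature vector}
    edges: list of (src_id, dst_id, label)
    """
    ids = sorted(node_features)
    index = {node_id: row for row, node_id in enumerate(ids)}
    n = len(ids)

    x = [list(node_features[node_id]) for node_id in ids]   # feature matrix
    a = [[0] * n for _ in range(n)]                          # adjacency matrix
    edge_attr = []                                           # one-hot edge labels
    for src, dst, label in edges:
        a[index[src]][index[dst]] = 1
        vec = [0] * len(edge_types)
        vec[edge_types.index(label)] = 1
        edge_attr.append((index[src], index[dst], vec))
    return {"X": x, "A": a, "edge_attr": edge_attr}


profile = build_profile(
    {10: [1, 0], 20: [0, 1]},
    [(10, 20, "CONTAINS"), (10, 20, "AST")],
)
```

Note that the adjacency matrix collapses parallel edges into a single entry, while `edge_attr` keeps one record per labeled edge, so no edge-type information is lost.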

Source code feature extraction
As described above, the data representation is complex (the source code is represented as a directed CPG graph with a multi-edge, multi-node structure), so feeding it directly into an ordinary classification model is not very effective. Therefore, in this paper, we propose to use a deep learning graph network to further extract and normalize source code features in CPG graph form. Specifically, the deep learning graph network used in this paper is the DGCNN model (Goy and Ferrara., 2018). Deep learning graph networks have been studied and applied in many different fields; however, their application to the task of source code vulnerability classification is still limited (Makarov et al., 2021; Z. Li et al., 2019; Zhou et al., 2020).
DGCNN is proposed in the study (Tomas et al., 2013) to solve two main problems: i) how to extract useful features to characterize the diverse information encoded in a graph for classification purposes; ii) how to sequentially read a graph in a meaningful and consistent order. To achieve that goal, the proposed DGCNN architecture includes 3 main layers as follows:
• The first layer. The graph convolution (GCN) layer computes:
$$Z = f\left(\tilde{D}^{-1} \tilde{A} X W\right), \qquad \tilde{A} = A + I$$
Where: $A$: adjacency matrix; $X$: feature matrix; $I$: unit matrix with the same size as $A$; $f$: activation function; $\tilde{D}$: degree matrix of $\tilde{A}$; $W$: weight matrix; $Z^{i}$: output of layer $i$, with $Z^{0} = X$.
Convolution aggregates information about neighboring nodes to extract local substructure information. To extract substructure features at multiple scales, the convolutional layers are stacked as follows:
$$Z^{t+1} = f\left(\tilde{D}^{-1} \tilde{A} Z^{t} W^{t}\right)$$
Where: $Z^{0} = X$; $Z^{t} \in \mathbb{R}^{n \times c_t}$ is the output of the $t$-th convolutional layer; $c_t$ is the number of output channels of layer $t$; $W^{t}$ is the weight matrix of layer $t$.
• The second layer. DGCNN introduces a new Sort Pooling layer that generates a representation (embedding) of each given graph from the vertex representations learned by the stack of GCN layers. The Sort Pooling layer sorts the features, keeps the high-valued features, and discards the low-valued features. After the initial inputs go through the GCN layers, we have a matrix $Z^{1:h}$ in which each row is the feature descriptor of a vertex and each column is a feature channel. This matrix is generated by concatenating the outputs of the $h$ GCN layers; it has $n$ rows ($n$ is the number of vertices) and $K$ columns ($K$ is the total number of columns of the output matrices of the $h$ GCN layers). The matrix $Z^{1:h}$ is processed as follows. First, its rows are sorted by the value in the last column of the matrix $Z^{h}$. If two rows have the same value in that column, we compare their values in the last column of $Z^{h-1}$. Next, rows are cut from $Z^{1:h}$, or all-zero rows are padded to it, so that the size of this matrix goes from $n$ rows to $k$ rows. The output is a matrix of dimensions $k \times K$, where $k$ is a predefined integer.
• The third layer. The remaining layers are CNN layers and traditional Dense layers (Albawi et al., 2017). The purpose of these layers is to read the sorted graph representation and make predictions. The output of the Sort Pooling layer is used as the input of the 1-D Convolution, Max Pooling, and Dense layers to learn the appropriate graph features to predict the graph labels (Duan et al., 2003).
Thus, the DGCNN model works according to the following principle: a graph of any structure is first passed to the GCN layers, where the vertex information is propagated between nodes. Then the vertex features are sorted and synthesized at the Sort Pooling layer and passed to traditional CNN structures to learn a predictive model.
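The first two stages of this principle can be sketched in plain Python. The sketch assumes the row-normalized propagation rule $f(\tilde{D}^{-1}(A + I) Z W)$ of the standard DGCNN formulation, uses `tanh` as the activation, and uses small hand-picked weights in place of trained ones; the function names and the toy graph are hypothetical.

```python
import math


def graph_conv(a, z, w, activation=math.tanh):
    """One graph convolution layer: f(D~^-1 (A + I) Z W) on nested lists."""
    n = len(a)
    # Add self-loops so each node keeps its own features.
    a_tilde = [[a[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    out = []
    for i in range(n):
        degree = sum(a_tilde[i])            # >= 1 thanks to the self-loop
        # Average the features of node i and its neighbours.
        agg = [sum(a_tilde[i][j] * z[j][c] for j in range(n)) / degree
               for c in range(len(z[0]))]
        # Linear map to the output channels, then the activation.
        out.append([activation(sum(agg[c] * w[c][o] for c in range(len(w))))
                    for o in range(len(w[0]))])
    return out


def sort_pooling(z_concat, k):
    """Sort rows by the last column (ties: earlier columns), keep k rows."""
    rows = sorted(z_concat, key=lambda r: tuple(reversed(r)), reverse=True)
    width = len(z_concat[0])
    rows = rows[:k]                                          # cut
    rows += [[0.0] * width for _ in range(k - len(rows))]    # or pad
    return rows


# Toy directed graph with 3 nodes and 2 input features per node.
a = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1 = [[0.5, -0.5], [0.25, 0.75]]    # 2 -> 2 channels
w2 = [[1.0], [-1.0]]                # 2 -> 1 channel

z1 = graph_conv(a, x, w1)
z2 = graph_conv(a, z1, w2)
z_concat = [r1 + r2 for r1, r2 in zip(z1, z2)]   # Z^{1:h}: n rows, K = 3 columns
pooled = sort_pooling(z_concat, k=4)             # padded from 3 to 4 rows
```

Sorting on the reversed row compares the last column first and falls back to earlier columns on ties, which covers the tie-breaking rule described for the Sort Pooling layer. The fixed-size `pooled` matrix is what the 1-D convolution and Dense layers would consume.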

Source code classification
To classify source code vulnerabilities and normal source code, we use 2 layers, Fully Connected Layer and Softmax Layer, as shown in Figure 1.
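A minimal sketch of a fully connected layer followed by softmax is shown below. The weights, the two-element feature vector, and the label names are hypothetical; in the proposed model the input would be the features produced by DGCNN and the weights would be learned during training.

```python
import math

LABELS = ["normal", "vulnerable"]


def dense(features, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    """Return the predicted label and the class probabilities."""
    probs = softmax(dense(features, weights, bias))
    return LABELS[probs.index(max(probs))], probs


label, probs = classify([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```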

Experimental data
The main datasets used for experimentation and evaluation are FFmpeg and Qemu (https://ffmpeg.org/download.html). Qemu is a collection of software programs that enable the creation, management, and administration of virtual machines and the operation of virtualized environments on physical servers. The main vulnerability types in this dataset are DoS, buffer overflow, etc. FFmpeg includes software programs and libraries for processing video, audio, and other multimedia streams. Both datasets consist of program source code written in C/C++. Table 3 below lists the main components of these two datasets.

Scenarios for the experimental dataset
With the experimental dataset presented in Table 3, this study divides the dataset into different components, then conducts experiments and evaluates the accuracy of the proposed models on them. The division is random: 80% of the dataset is used for training, and the remaining 20% is used for testing.
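A random 80/20 division of this kind can be sketched as follows. The fixed seed is an assumption added for reproducibility, and the split is not stratified by label; the paper does not state either detail.

```python
import random


def train_test_split(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Stand-in sample ids; real samples would be labeled source code graphs.
train, test = train_test_split(range(100))
```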

Evaluation scenarios
In this paper, we propose two main scenarios as follows:
• Scenario 1. Evaluating the effectiveness of the proposed model against some deep learning graph networks from other studies. In particular, in this scenario, we compare and evaluate the DGCNN model with the GCN model (Haridas et al., 2020) and GCN+IndRNN (Cai et al., 2021).
• Scenario 2. Evaluating the effectiveness of the source code vulnerability detection model when the deep learning graph network approach is not applied. Specifically, in this scenario, we want to clarify the role and importance of the deep learning graph network in the task of synthesizing and extracting features of source code. We use some classification algorithms such as CNN, Multilayer Perceptron (MLP), LSTM, and BiLSTM for this task.

Evaluation metrics
We use the following 4 metrics to evaluate the effectiveness of the proposed model. The general formulas of these 4 metrics are expressed through formulas (7, 8, 9, 10).
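The sketch below assumes that the four metrics are the standard accuracy, precision, recall, and F1-score, with vulnerable code as the positive class; this is an assumption, since the metrics are not named here. The counts are derived from the DGCNN confusion matrix discussed in the comments on scenario 1 (1,908 of 2,013 vulnerabilities and 2,385 of 2,458 normal samples predicted correctly).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute four common binary classification metrics from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)      # predicted vulnerable that really are
    recall = tp / (tp + fn)         # real vulnerabilities that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# tp: vulnerabilities found; fn: vulnerabilities missed (2,013 - 1,908);
# tn: normal code kept; fp: normal code flagged (2,458 - 2,385).
metrics = classification_metrics(tp=1908, fp=73, tn=2385, fn=105)
```

The values computed this way follow only from the confusion-matrix counts and need not match the averaged figures reported in the result tables.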

Evaluations of experimental scenario 1
Experimental results of detecting source code vulnerabilities using the DGCNN model. As mentioned above, the purpose of this experimental scenario is to compare and evaluate the ability of several deep learning graph networks to synthesize and classify source code vulnerabilities. Specifically, we compare and evaluate 3 main deep learning graph networks: DGCNN (our proposal), GCN (Haridas et al., 2020), and GCN+IndRNN (Cai et al., 2021). Tables 4, 5, 6 below show the experimental results of this scenario. Table 4 shows the results of source code vulnerability classification based on the DGCNN model.
Table 5 shows the results of evaluating the GCN network (Haridas et al., 2020).
The experimental results in Table 5 show that the source code classification results of the GCN model are relatively stable. Specifically, this model gives an average overall accuracy of 79.55%. This result is not high. Regarding the stability of the GCN model, when using the 2-layer GCN model and randomly changing the number of units, the accuracy of the model changes only slightly. With parameters [64-32-16] and [256-128-32], the GCN model gives the best results in most metrics. Comparing the experimental results in Tables 4 and 5, the DGCNN model performs much better than the GCN model. The reason is that, with the support of GCN and CNN layers, the DGCNN model works better in the task of synthesizing and extracting features of source code. In addition, DGCNN synthesizes its result by combining the outputs of all layers. This mechanism is different from GCN, which just linearizes the features of each node and then takes the final result as output. In addition, because the GCN model only uses GCN and MLP layers to represent information, it loses important features of the graph, and with them the differences between normal source code and source code containing vulnerabilities. This leads to low classification results. Next, Table 6 below presents the experimental results of the GCN+IndRNN model (Cai et al., 2021).
The experimental results in Table 6 show that, with the support of GCN, MLP, and IndRNN layers, the GCN+IndRNN model is more effective than the original GCN model. The difference ranges from 4% to 7% on all metrics. The reason is that the GCN+IndRNN model integrates an additional IndRNN layer to synthesize and represent information of source code, so it can extract some more important features. Comparing the results of Tables 4 and 6, it can be seen that the DGCNN model outperforms the GCN+IndRNN model.

Comments and evaluations for scenario 1
The experimental results presented in Tables 4, 5, 6 show that the proposed method outperforms the other approaches. Next, the study conducts experiments to evaluate the detection and prediction ability of these models through the values of their confusion matrices.
From the experimental results in Figure 2, it can be seen that the DGCNN model is the most effective at accurately detecting vulnerabilities in source code. The DGCNN model correctly predicts 1,908 source code vulnerabilities out of a total of 2,013 vulnerabilities. This result is higher than those of the GCN (Haridas et al., 2020) and GCN+IndRNN (Cai et al., 2021) models by 98 and 51 vulnerabilities, respectively. As for the ability to accurately predict normal source code, the DGCNN model correctly predicts 2,385 normal source codes out of a total of 2,458 source codes. This result is also better than those of GCN (Haridas et al., 2020) and GCN+IndRNN (Cai et al., 2021). Comparing the confusion matrices of GCN and GCN+IndRNN, it can be seen that the GCN+IndRNN model is more effective than GCN. The reason is that the GCN+IndRNN model uses the IndRNN model for the classification task instead of using pure GCN layers.

Experimental results of detecting source code vulnerabilities based on some other approaches
For this scenario, we evaluate the experimental results of some other approaches that do not use the CPG graph and deep learning graph networks. The results shown in Table 7 below are experimental results when applying a model combining NLP and some deep learning methods, including LSTM (Lin et al., 2021), CNN (Rebecca L. Russell et al., 2018), and BiLSTM (Zheng et al., 2020).
The experimental results in Table 7 differ considerably between the models. Specifically, the model combining Word2Vec and CNN gives the best precision for normal source code classification (94%). This result is higher than that of the LSTM and BiLSTM models by 4% and 8%, respectively. For the correct classification of source code vulnerabilities, the BiLSTM model is the most effective at 65%, which is 6% and 7% higher than that of the LSTM and CNN models, respectively. This suggests that, with its ability to remember over long sequences, the BiLSTM model captures many important features of source code vulnerabilities, thereby improving its detection accuracy. Comparing the results of Table 7 with Table 4, our proposed method gives much better results than these approaches. Similarly, comparing the results of Table 7 with Tables 5 and 6, the approach using the CPG graph and a deep learning graph network gives better classification results than the other approaches. These results support the soundness of our proposal in this paper.

Conclusion and development directions
Detecting security vulnerabilities in source code is now an urgent issue because the techniques for exploiting vulnerabilities are developing rapidly. In this paper, based on several data mining techniques, we have built a new approach that enhances the ability to accurately classify source code vulnerabilities. The experimental results show that the proposed method is highly effective both at correctly classifying vulnerabilities and at correctly classifying normal source code.
Three reasons explain the effectiveness of the proposed method:
i) The CPG graph succeeds in representing the syntactic and semantic relationships of the source code. Only when these relationships are expressed in detail are the characteristics of the source code fully represented.
ii) We propose a method of building the source code feature profile based on edges and vertices using NLP techniques. This is a new technique that helps us normalize and fully extract the source code features based on the CPG graph. This task is very important because only when the source code features are fully built and synthesized can the best classification performance be achieved.
iii) We propose using the DGCNN for the task of extracting source code features based on the CPG graph. This is an important proposal because the DGCNN model is well suited to non-linear, structured graphs such as source code graphs.
In the future, in order to improve the performance of the source code vulnerability detection model, two main tasks need to be improved:
i) Task 1: Improving the method of building source code information profiles. Instead of only using NLP methods, other advanced deep learning methods can be applied to search for the relationships between the edges and vertices of the graph.
ii) Task 2: Improving the method of extracting source code features. Other non-linear deep learning graph networks can be used to synthesize and extract more information about the depth of edges and vertices.
Figure 1. The architecture of the proposed model.
Sahu et al. (2021): A disadvantage of current rule-based vulnerability detection tools is that they do not adapt to software in general and are often limited in terms of techniques and technology. The authors proposed a model with three components: Analysis, Rule Processing, and Detection. The Analysis and Rule Processing components mitigated the problem, but the model still had limitations in terms of complexity and time. Other approaches, such as those by Sahu et al. (2021) and Sahu et al. (2020), have also been proposed.

Table 1. Vertex types of source code when analyzed into a CPG graph
Vertex type: Meaning
META_DATA: This node contains the metadata of the CPG graph.
FILE: File nodes represent source files or shared objects from which the CPG was generated. File nodes act as indexes, which means they allow all elements of the code to be looked up by file. For each file, the graph must contain a FILE node.
NAMESPACE: Similar to the FILE node, the NAMESPACE node acts as an index that allows all definitions inside a namespace to be obtained by following the outgoing edges of a NAMESPACE node.
NAMESPACE_BLOCK: Reference to a namespace. This node is inspired by the "namespace block" of C++. A namespace block is a block of code placed in the same namespace by the programmer. This block can be introduced via the "package" statement in Java or the "namespace{}" statement in C++.
MEMBER: This node type represents a member of a class, struct, or union; e.g., for the type declaration `class Foo{int i;}`, it represents the declaration of the variable `i`.
TYPE: This node represents a particular instantiation of a declared type.
TYPE_ARGUMENT: This node type represents an argument used to instantiate a parameterized type, in the same way that an actual argument provides a specific value for a parameter at a method call.
TYPE_DECL: This node represents a declared type, e.g., one declared by a class, struct, etc. In contrast to the TYPE node, this node does not represent a concrete instantiation of a type. For example, for the parameterized type 'List[T]', it would represent 'List[T]', but not 'List[Integer]', where 'Integer' ...
LITERAL: This node represents a fixed value such as an integer or a string.
LOCAL: This node represents a local variable.

Table 1. (Continued)
Vertex type: Meaning
<operator>: This node represents the operators.
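The abstract names one-hot encoding as one of the techniques used when building the feature profile of the source code. As an illustration only, the sketch below one-hot encodes the vertex types that appear in this excerpt of Table 1. The list `VERTEX_TYPES`, its ordering, and the function `one_hot_vertex_type` are assumptions made for this example; the paper's actual vocabulary and encoding layout are not specified here and may differ.

```python
# Vertex types visible in this excerpt of Table 1 (assumption: the full table
# may contain more types, and the ordering here is arbitrary).
VERTEX_TYPES = [
    "META_DATA",
    "FILE",
    "NAMESPACE",
    "NAMESPACE_BLOCK",
    "MEMBER",
    "TYPE",
    "TYPE_ARGUMENT",
    "TYPE_DECL",
    "LITERAL",
    "LOCAL",
    "<operator>",
]


def one_hot_vertex_type(vertex_type, vocabulary=VERTEX_TYPES):
    """Return a one-hot list with a 1 at the position of `vertex_type`."""
    if vertex_type not in vocabulary:
        raise ValueError(f"unknown vertex type: {vertex_type!r}")
    return [1 if candidate == vertex_type else 0 for candidate in vocabulary]


if __name__ == "__main__":
    # Encode two example vertices from a hypothetical CPG.
    for vertex_type in ("FILE", "LOCAL"):
        print(vertex_type, one_hot_vertex_type(vertex_type))
```

Each vertex then contributes a fixed-length vector whose dimension equals the number of vertex types; in a feature profile, such a vector could be concatenated with an embedding of the vertex's code text (for example, one produced by Word2vec).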