
CN119670101A - Source code logic vulnerability detection method, system and device based on multi-view fusion - Google Patents

Source code logic vulnerability detection method, system and device based on multi-view fusion

Info

Publication number
CN119670101A
CN119670101A
Authority
CN
China
Prior art keywords
view
source code
vulnerability detection
fusion
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411987591.6A
Other languages
Chinese (zh)
Other versions
CN119670101B (en)
Inventor
章程
余亚龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202411987591.6A priority Critical patent/CN119670101B/en
Publication of CN119670101A publication Critical patent/CN119670101A/en
Application granted granted Critical
Publication of CN119670101B publication Critical patent/CN119670101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The present invention belongs to the field of software security testing, and specifically relates to a method, system and device for detecting source code logic vulnerabilities by multi-view fusion. The method comprises: S1: encoding the software source code into a fusion feature vector containing feature information of a tree view and a flow view; S2: obtaining a large number of fusion feature vectors and manually adding label information indicating whether a vulnerability exists, to form a sample data set; S3: selecting cross entropy loss as the loss function, dividing the sample data set into a training set and a test set, training and testing a CNN model, and saving the model parameters of the network model that meets the requirements; S4: using the saved network model as the vulnerability detection model to perform vulnerability detection on the fusion feature vector encoded from the software source code to be identified. The present invention solves the problem that existing graph-feature-based vulnerability detection schemes cannot fully extract and utilize the feature information in the software code property graph, which leads to insufficient vulnerability detection accuracy and efficiency.

Description

Multi-view fusion source code logic vulnerability detection method, system and device
Technical Field
The invention belongs to the field of software testing, and particularly relates to a method, a system and a device for detecting source code logic vulnerabilities based on multi-view fusion.
Background
Source code vulnerability detection is an important part of software supply chain security. Traditional software source code vulnerability detection methods suffer from high false positive and false negative rates: existing static analysis methods tend to flag non-vulnerable programs, leading to high false positive rates, while dynamic detection methods still miss many vulnerabilities, leaving high false negative rates. To date, these tools have remained unreliable.
Deep-learning-based software source code vulnerability detection methods are widely applied, and among them detection methods based on graph analysis show good accuracy. However, most of these methods analyze the graph directly and extract its features wholesale, which often incurs significant overhead. For example, the graph representation of software code obtained by graph analysis contains a large amount of information, most of which is irrelevant to the vulnerability characteristics; the model must pick out the vulnerability-related information from this mass of data, which increases the difficulty of recognition. Meanwhile, if a function is large, the resulting graph representation becomes complex, which enlarges the deep learning network, increases the cost of the scheme, and limits its practical value on low-power devices. In addition, because GPU memory and computing resources are limited, the input length of existing neural network models is usually bounded; the network may therefore fail to extract enough feature information directly from the graph, and some features are truncated and never used by the model, which greatly reduces its vulnerability recognition accuracy.
Disclosure of Invention
The invention provides a multi-view fusion source code logic vulnerability detection method, system and device, which solve the problem that existing graph-feature-based vulnerability detection schemes cannot fully extract and utilize the feature information in the software code property graph, resulting in insufficient vulnerability detection accuracy and efficiency.
The technical scheme provided by the invention is as follows:
A multi-view fusion source code logic vulnerability detection method comprises the following steps:
S1, encoding the software source code into a fusion feature vector containing feature information of a tree view and a flow view, wherein the process comprises the following steps:
S11, generating a code property graph CPG from the software source code.
S12, extracting the corresponding edges and nodes from the CPG to construct an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG, respectively.
S13, adding additional edges representing return, loop and jump operations between nodes in the abstract syntax tree AST according to preset rules, so as to obtain an enhanced syntax tree EAST.
S14, taking the enhanced syntax tree EAST as the tree view, and generating the corresponding tree view features through a pre-trained code semantic encoding model.
S15, taking the CFG and PDG as the flow view, and generating the corresponding flow view features through a pre-trained code semantic encoding model.
S16, average-pooling the feature vectors of the tree view and the flow view, and then concatenating them to obtain the required fusion feature vector.
S2, acquiring a large number of fusion feature vectors converted from software code, manually adding label information indicating whether a vulnerability exists, and using the labeled vectors as sample data to form the required sample data set.
S3, selecting cross entropy loss as the loss function, dividing the sample data set into a training set and a test set, training and testing a CNN model, and saving the model parameters of the network model that meets the performance index requirements.
S4, taking the saved network model as the vulnerability detection model, encoding the software source code to be identified into the corresponding fusion feature vector, and inputting the fusion feature vector into the vulnerability detection model to perform vulnerability detection.
As a further development of the invention, in step S11, the software source code is converted into the corresponding code property graph CPG using Joern, a source-code-based static analysis tool.
As a further improvement of the present invention, in step S13, the additional edges added in the enhanced syntax tree EAST, relative to the abstract syntax tree AST, include:
(1) An edge connecting each RETURN node in the AST to the function name node, denoted RETURN_TO.
(2) An edge serially chaining the nodes corresponding to all variables of the same line of code, denoted the inter-layer cascade edge NEXT_TOKEN.
(3) A backward edge REVERSE_EDGE added for every original edge and every inter-layer cascade edge.
As a further improvement of the invention, in step S14, the feature vectors of the tree view are generated by the pre-trained GraphCodeBERT model, and in step S15, the feature vectors of the flow view are generated by the pre-trained CodeBERT model.
As a further improvement of the present invention, in step S15, the features of the CFG and the PDG are each extracted through CodeBERT and then concatenated to form the feature vector of the flow view.
In step S16, the average-pooled feature vector of the tree view is 1×768-dimensional, the feature vector of the flow view is 2×768-dimensional, and the fused feature vector is 3×768-dimensional.
As a further improvement of the present invention, in step S3, the CNN model includes 10 convolution filter layers and 128 hidden layers, and a ReLU is used as an activation function.
As a further improvement of the invention, in step S4, the fusion feature vector is divided into three channels that are input simultaneously into the vulnerability detection model, which outputs a prediction of whether the software source code corresponding to the fusion feature vector contains a vulnerability.
The invention further discloses a multi-view fusion source code logic vulnerability detection system designed on the basis of the above multi-view fusion source code logic vulnerability detection method. The system comprises a code acquisition unit, a CPG generation unit, a tree view generation unit, a flow view generation unit, a fusion feature generation unit and a vulnerability detection model.
The CPG generation unit is used for generating the corresponding code property graph from the input source code using a preset static analysis tool. The tree view generation unit is used for extracting the corresponding nodes and edges from the code property graph to construct an abstract syntax tree AST, and then adding the three types of additional edges to the AST to obtain an enhanced syntax tree EAST as the required tree view. The flow view generation unit is used for extracting the corresponding nodes and edges from the code property graph; the constructed control flow graph CFG and program dependency graph PDG together serve as the required flow view.
The fusion feature generation unit is used for generating the feature vectors corresponding to the tree view and the flow view with the pre-trained code semantic encoding models, average-pooling them, and then concatenating them to obtain the required fusion feature vector. The vulnerability detection model is a CNN model trained by the above multi-view fusion source code logic vulnerability detection method, and is used for generating and outputting a prediction of whether the corresponding software code contains a vulnerability from the input fusion feature vector.
The invention also provides a multi-view fusion source code logic vulnerability detection device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the source code logic vulnerability detection system is instantiated, so that whether the input software source code contains a vulnerability is detected.
The technical scheme provided by the invention has the following beneficial effects:
The vulnerability detection scheme provided by the invention first constructs the CPG of the code from the input source code. Then, on one hand, the abstract syntax tree AST is decomposed from the CPG, the enhanced syntax tree EAST is generated from the AST as the tree view, and its features are extracted. On the other hand, the control flow graph and the program dependency graph are extracted from the CPG, and the feature vectors of the corresponding flow view are generated. The extracted feature vectors of the flow view and the tree view are then concatenated, and the resulting fusion feature vector is fed to the trained CNN model, thereby detecting the vulnerabilities present in the software code.
In this scheme, the tree view and the flow view are extracted separately from the CPG, embedded into vectors, and then input into the model, so that the semantic and syntactic characteristics of the code are separated. This makes it easier for the model to extract the relevant features, improving both the efficiency and the accuracy of vulnerability detection.
Drawings
Fig. 1 is a flowchart of steps of a multi-view fusion source code logical vulnerability detection method provided in embodiment 1 of the present invention.
Fig. 2 is a typical diagram of an enhanced syntax tree generated from an abstract syntax tree according to embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of a source code logical vulnerability detection method of multi-view fusion in embodiment 1 of the present invention.
Fig. 4 is a system architecture diagram of a multi-view fused source code logical vulnerability detection system provided in embodiment 2 of the present invention.
Fig. 5 is a graph showing the performance comparison between the CPG feature extraction method of the invention and the conventional scheme in an ablation experiment.
Fig. 6 is a graph comparing the performance of the inventive scheme using CNN and MLP in an ablation experiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
This embodiment provides a multi-view fusion source code logic vulnerability detection method, which still belongs to the family of detection methods based on graph analysis. Unlike conventional methods, however, this embodiment does not perform feature extraction and network model recognition directly on the code property graph obtained from the software code; instead, it decomposes from the code property graph a tree view and a flow view that contain different feature information. Features are extracted from the information contained in each of the two views; after the feature representations of the tree view and the flow view are obtained, the features of the two views are merged by vector concatenation, and the resulting fused feature vector serves as the input of the deep learning network model for feature learning and vulnerability detection classification.
With the scheme provided by this embodiment, the network model can focus on the core information in the code property graph related to the software's logic, excluding the interference of the large amount of invalid information the graph contains, thereby addressing the low vulnerability detection accuracy and excessive resource consumption of conventional schemes.
Specifically, as shown in fig. 1, the method for detecting a source code logical vulnerability of multi-view fusion in this embodiment includes the following steps:
S1, encoding the software source code into a fusion feature vector containing feature information of the tree view and the flow view. The fusion feature vector is a new data type used in this embodiment for training the network model and realizing software vulnerability detection; its construction process comprises the following steps:
S11, generating a code attribute graph CPG according to the software source code.
The code property graph is a graph-based code representation that provides a combined and compact view of the code, consisting of elements from the control and data flows together with all the feature information of the abstract syntax. The code property graph can provide additional context for syntactic and semantic results related to the source code. Like all graph-structured data, a code property graph can be represented as G = (V, E), where V is the set of nodes and E is the set of edges between the nodes. In a CPG, the node set V is made up of typed code fragments such as CallStatement, ReturnStatement and ArithmeticExpression.
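As a minimal illustration of the G = (V, E) representation just described, the following sketch models a tiny CPG as labeled edge lists; the concrete snippet, node ids and edge labels are illustrative assumptions, not the output of any particular tool:

```python
# Minimal sketch of a code property graph as G = (V, E).
# Node types follow the text above (CallStatement, ReturnStatement,
# ArithmeticExpression); the snippet itself is a hypothetical example.
cpg_nodes = {
    1: {"type": "FunctionDef",          "code": "int add(int a, int b)"},
    2: {"type": "ArithmeticExpression", "code": "a + b"},
    3: {"type": "ReturnStatement",      "code": "return a + b;"},
}
cpg_edges = [
    (1, 2, "AST"),  # syntactic structure
    (1, 3, "AST"),
    (2, 3, "CFG"),  # control flow
    (2, 3, "DDG"),  # data dependence on the computed value
]

def view(edges, kind):
    """Extract the edge set of one view (AST / CFG / DDG) from the CPG."""
    return [(u, v) for (u, v, k) in edges if k == kind]
```

Filtering the shared edge set by label is exactly how the later steps separate the AST, CFG and PDG views from the single combined graph.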
In practical applications, the Joern tool (an advanced source-code-based static analysis tool) may be used to generate the code property graph (CPG), although in other implementations, other tools similar in function to Joern may be used to transcode the software into the corresponding code property graph.
And S12, extracting corresponding edges and nodes from the CPG to respectively construct an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG.
As described above, the CPG is the raw graph-structured data converted directly from the source code. This type of data includes a large amount of information related to the inherent logic of the source code, including syntax, semantic information, data logic relationships, control procedures, and so on, and is therefore a large and complex data set. To achieve better logic vulnerability detection, this embodiment extracts three types of core information from the CPG: the abstract syntax tree AST, the control flow graph CFG, and the program dependency graph PDG. The separated abstract syntax tree enters the tree view branch, while the control flow graph and the program dependency graph enter the flow view branch.
And S13, adding additional edges representing return, circulation and jump operations among nodes in the abstract syntax tree AST according to preset rules, so as to obtain an enhanced syntax tree EAST.
The original abstract syntax tree only contains the relationships among variables. This embodiment enhances the abstract syntax tree by adding additional edges with different purposes, attaching more information to the syntax labels corresponding to the variables. Specifically, as shown in fig. 2, the additional edges added in the enhanced syntax tree EAST, relative to the abstract syntax tree AST, include:
(1) The RETURN_TO edge connects a return statement back to the function declaration, i.e., it connects each RETURN node in the AST with the function name node.
(2) The NEXT_TOKEN edge connects all the variables of the same code statement, i.e., an edge serially chaining the nodes corresponding to all variables of the same code statement is added and denoted the inter-layer cascade edge.
(3) For all types of edges, their respective backward edges (the transposed adjacency matrix) are added, doubling the number of edges and edge types; that is, a backward edge REVERSE_EDGE is added for every original edge and every inter-layer cascade edge. The backward edges allow information to propagate faster through the model and improve model performance.
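The three augmentation rules above can be sketched as follows; the graph is a plain labeled edge list and the node ids are illustrative assumptions (a real implementation would operate on the AST produced by a CPG tool):

```python
# Sketch of the three edge-augmentation rules that turn an AST into an
# enhanced syntax tree (EAST).
def build_east(ast_edges, return_nodes, func_name_node, statement_tokens):
    """ast_edges: [(src, dst)]; statement_tokens: the token node ids of
    one code statement, in source order. Returns a labeled edge list."""
    east = [(u, v, "AST") for (u, v) in ast_edges]
    # (1) RETURN_TO: connect each return node back to the function name node.
    east += [(r, func_name_node, "RETURN_TO") for r in return_nodes]
    # (2) NEXT_TOKEN: serially chain the tokens of one statement.
    east += [(a, b, "NEXT_TOKEN")
             for a, b in zip(statement_tokens, statement_tokens[1:])]
    # (3) REVERSE_EDGE: add a backward edge for every edge built so far,
    # doubling the number of edges and edge types.
    east += [(v, u, k + "_REVERSE") for (u, v, k) in list(east)]
    return east
```

With two AST edges, one return node and a two-token statement, the result contains eight edges: four forward edges plus the same four reversed, matching the "double the number of edges" rule.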
S14, taking the enhancement grammar tree EAST as a tree view, and generating corresponding tree view features through a pre-trained code semantic coding model.
In the branch corresponding to the tree view, this embodiment uses GraphCodeBERT to embed the resulting enhanced AST-based graph. GraphCodeBERT is a pre-trained model for programming languages that takes into account both the inherent structure of the code and its semantics. GraphCodeBERT first preprocesses the enhanced AST-based graph to obtain source code tokens. Similarly to BERT, GraphCodeBERT encodes words with a combination of token embedding and position embedding; its core network consists of 12 Transformer encoder layers with a 12-head multi-head attention mechanism, including feed-forward and layer normalization sublayers. The embedding produced by the pre-trained GraphCodeBERT model yields the corresponding feature vectors, which contain the inherent structure and syntax information of the source code.
And S15, taking the CFG and PDG as flow views, and generating corresponding flow view features through a pre-trained code semantic coding model.
In the flow view branch, feature extraction may be performed separately on the obtained control flow graph and program dependency graph, and the extracted feature vectors are then fused. Because the program dependency graph and the control flow graph have the same vertices but different edges, the fused representation captures both their structural and vertex information. In practice, CodeBERT may be used to extract the features of the control flow graph and the program dependency graph, and their vectors are then merged by vector concatenation.
In particular, considering that the program dependency graph and the control flow graph contain the same vertices but different edges, when the flow view feature vector is generated, the features of the CFG and the PDG can each be extracted through CodeBERT and then concatenated to form the feature vector of the flow view.
And S16, carrying out average pooling on the feature vectors of the tree view and the stream view, and then splicing to obtain the required fusion feature vector.
The dimensions of the extracted feature vectors may differ from view to view. For example, for a function of n lines of code, the pre-trained GraphCodeBERT and CodeBERT models generate n × 1 × 768-dimensional feature vectors, so an n × 1 × 768-dimensional vector is obtained in each view. Considering that the CNN model requires inputs of identical size, this embodiment applies average pooling to reduce the feature vectors from the different views of the same software source code uniformly to 1 × 768-dimensional vectors; the feature vectors of the tree view and the flow view are then concatenated in the vector fusion stage, producing a 3 × 768-dimensional feature vector for each function.
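The pool-then-concatenate step can be sketched in a few lines; the embedding dimension is 768 in the text, but a small dimension is used here for brevity:

```python
# Sketch of step S16: average-pool each view's per-line features to a
# single vector, then concatenate the tree and flow views.
def mean_pool(rows):
    """Reduce an n x d matrix (one row per code line) to one d-dim vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fuse_views(tree_rows, cfg_rows, pdg_rows):
    """Return the 3 x d fused feature: pooled tree, CFG and PDG vectors."""
    return [mean_pool(tree_rows), mean_pool(cfg_rows), mean_pool(pdg_rows)]
```

Whatever the line count n of the input function, the fused output always has the fixed 3 × d shape the CNN expects.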
S2, acquiring a large number of fusion feature vectors converted from software code, manually adding label information indicating whether a vulnerability exists, and using the labeled vectors as sample data to form the required sample data set.
In practical application, all acquired source code test data can be processed with the same method, including construction of the tree and flow views, feature extraction and vector embedding, to obtain the corresponding fusion feature vectors. The label information is added by technicians according to the true vulnerability inspection results of the source code.
S3, selecting cross entropy loss as the loss function, dividing the sample data set into a training set and a test set, training and testing the CNN model, and saving the model parameters of the network model that meets the performance index requirements.
In this embodiment, the CNN model used includes 10 convolution filter layers and 128 hidden layers, and ReLU is used as the activation function. During the training phase of the CNN, the loss function used to penalize misclassification is the cross entropy loss (Cross Entropy Loss, CELoss), which penalizes the model in proportion to the negative log-probability it assigns to the correct class, driving the predicted distribution toward the ground-truth labels.
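For the binary vulnerable / non-vulnerable classification described here, the cross entropy loss reduces to the negative log-probability of the true class; a minimal sketch:

```python
import math

# Cross entropy loss for one sample: the loss is -log of the probability
# the model assigns to the ground-truth class.
def cross_entropy(probs, label):
    """probs: softmax output over {0: non-vulnerable, 1: vulnerable};
    label: ground-truth class index."""
    return -math.log(probs[label])
```

A confident correct prediction (e.g. probs = [0.1, 0.9] with label 1) incurs a smaller loss than an uncertain one (probs = [0.5, 0.5]), which is what drives the model toward the labels during training.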
S4, taking the saved network model as the vulnerability detection model, encoding the software source code to be identified into the corresponding fusion feature vector, and inputting the fusion feature vector into the vulnerability detection model to perform vulnerability detection.
In actual application, the fusion feature vector generated from the source code is divided into three channels that are input simultaneously into the vulnerability detection model, which outputs a prediction of whether the software source code corresponding to the fusion feature vector contains a vulnerability.
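The three-channel split is just a reshaping of the 3 × d fused feature; the channel names below are an illustrative assumption (the text only fixes that one channel comes from the tree view and two from the flow view):

```python
# Sketch of feeding the 3 x d fused feature to the detector as three
# parallel input channels (tree view, CFG, PDG).
def to_channels(fused):
    """Split the 3 x d fused feature vector into named input channels."""
    tree, cfg, pdg = fused
    return {"tree": tree, "cfg": cfg, "pdg": pdg}
```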
In summary, since the CPG contains a large amount of functional information, existing solutions must extract vulnerability-related information from a mass of irrelevant information in practical applications; the model faces a great challenge in distinguishing the relevant information, which affects the accuracy and hardware requirements of the solution. To alleviate the low accuracy and high resource consumption of vulnerability detection, this embodiment uses a CPG-based multi-view fusion method that combines an enhanced abstract syntax tree, a program dependency graph and a control flow graph. As shown in fig. 3, in practical application, the scheme first constructs the CPG of the code from the input source code. Next, on the one hand, the abstract syntax tree AST (tree view) is decomposed from the CPG, a graph based on the enhanced AST is built, and its features are extracted using GraphCodeBERT. On the other hand, the control flow graph and the program dependency graph are extracted from the CPG, and the feature vectors of the corresponding flow view are generated. All extracted feature vectors are then concatenated to obtain a fusion feature vector containing all the feature information. Finally, the trained CNN module extracts all the graph information contained in the fusion feature vector and classifies whether it contains a vulnerability.
Example 2
Based on the method of embodiment 1, this embodiment further provides a multi-view fusion source code logic vulnerability detection system designed with the multi-view fusion source code logic vulnerability detection method of embodiment 1. The system provided by this embodiment is a computer program that, when run, receives the software source code to be detected, transcodes it into a fusion feature vector using the method of embodiment 1, and performs vulnerability detection through the trained CNN model.
Specifically, as shown in fig. 4, the multi-view fusion source code logic vulnerability detection system provided by this embodiment comprises a code acquisition unit, a CPG generation unit, a tree view generation unit, a flow view generation unit, a fusion feature generation unit and a vulnerability detection model. The CPG generation unit is used for generating the corresponding code property graph from the input source code using a preset static analysis tool. The tree view generation unit is used for extracting the corresponding nodes and edges from the code property graph to construct an abstract syntax tree AST, and then adding the three types of additional edges to the AST to obtain an enhanced syntax tree EAST as the required tree view. The flow view generation unit is used for extracting the corresponding nodes and edges from the code property graph; the constructed control flow graph CFG and program dependency graph PDG together serve as the required flow view.
The fusion feature generation unit is used for generating the feature vectors corresponding to the tree view and the flow view with the pre-trained code semantic encoding models, average-pooling them, and then concatenating them to obtain the required fusion feature vector. The vulnerability detection model is a CNN model trained by the above multi-view fusion source code logic vulnerability detection method, and is used for generating and outputting a prediction of whether the corresponding software code contains a vulnerability from the input fusion feature vector.
Example 3
On the basis of the foregoing embodiments, this embodiment further provides a multi-view fusion source code logic vulnerability detection device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the source code logic vulnerability detection system is instantiated, so that whether the input software source code contains a vulnerability is detected.
The multi-view fusion source code logic vulnerability detection device provided in this embodiment is essentially a computer device for implementing the scheme of embodiment 1. The computer device may be any device capable of executing a program, such as an intelligent terminal, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster formed by a plurality of servers).
The computer device indicated in this embodiment includes at least, but is not limited to, a memory and a processor communicatively coupled to each other via a system bus. The memory (i.e., a readable storage medium) includes flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory may be an internal storage unit of the computer device, such as its hard disk or main memory. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the computer device. Of course, the memory may also include both the internal storage units and the external storage devices of the computer device. In this embodiment, the memory is typically used to store the operating system and the various application software installed on the computer device, and can also temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor may be a central processing unit (CPU), a graphics processing unit (GPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or to process the data.
Simulation test
In order to verify the performance and advantages of the multi-view fusion source code logic vulnerability detection method provided by the invention, the technicians formulated experimental plans and simulated and tested the related schemes. The test experiments were as follows:
1. Experimental setup
This experiment implements the method of the present invention in PyTorch. The dataset used in the experiments comes from REVEAL and contains 20494 non-vulnerable functions and 2240 vulnerable functions. These programs come from two open-source projects, the Linux Debian kernel and Chromium (the open-source project behind Chrome), two popular and well-maintained public projects that represent various security issues in two important programming domains (operating systems and browsers). Both projects have a large number of published vulnerability reports.
Furthermore, the ratio of non-vulnerable to vulnerable functions in the dataset selected for this experiment was about 9:1, which is similar to the ratio of vulnerable to non-vulnerable programs in the real world. In addition, this experiment also used the QEMU dataset and the SARD dataset (a project maintained by the National Institute of Standards and Technology (NIST), including 12303 vulnerable functions and 21057 non-vulnerable functions).
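The patent does not disclose its exact split procedure; as an illustration, the roughly 9:1 class ratio mentioned above can be preserved across the training and test sets with a simple stratified split. The function name and the 20% test fraction below are assumptions for illustration only:

```python
import random
from collections import Counter

def stratified_split(samples, labels, test_frac=0.2, seed=42):
    """Split (samples, labels) into train/test lists of (sample, label)
    pairs while preserving each label's proportion in both splits."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        k = int(round(len(group) * test_frac))  # per-class test count
        test += [(s, y) for s in group[:k]]
        train += [(s, y) for s in group[k:]]
    return train, test

# Mimic the REVEAL proportions: 20494 non-vulnerable (0), 2240 vulnerable (1).
labels = [0] * 20494 + [1] * 2240
samples = list(range(len(labels)))
train, test = stratified_split(samples, labels)
print(Counter(y for _, y in train))  # Counter({0: 16395, 1: 1792})
```

Both splits then keep the ~9:1 imbalance, so the reported metrics (especially recall) are measured under the same class distribution as the full corpus.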
In the training phase of the network model, cross-entropy loss is adopted as the loss function, the CNN is trained with the Adam optimizer, and the learning rate is set to 0.001.
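The training configuration above (cross-entropy loss, Adam optimizer, learning rate 0.001) can be sketched in PyTorch as follows. The toy network, tensor shapes and number of steps are illustrative stand-ins, not the patent's exact CNN:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the CNN: 3 input channels (the tree-view row plus the two
# flow-view rows of the fused 3x768 feature vector), binary output.
model = nn.Sequential(
    nn.Conv1d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 2),  # two logits: vulnerable / non-vulnerable
)
criterion = nn.CrossEntropyLoss()                           # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001

x = torch.randn(8, 3, 768)     # a batch of 8 fused feature vectors
y = torch.randint(0, 2, (8,))  # ground-truth vulnerability labels

for _ in range(5):             # a few optimisation steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.4f}")
```

`CrossEntropyLoss` expects raw logits and integer class indices, so no softmax layer is needed at the end of the model.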
2. Performance contrast test
In order to evaluate the true performance of the present invention, several state-of-the-art vulnerability detection methods were selected as control groups and compared with the present invention, including Reveal, VulCNN, Devign, SySeVR, VulDeePecker, and Russell et al. The experiment tests the performance of the present invention and the control schemes on the REVEAL dataset and the SARD dataset respectively, and adopts four widely used indexes to evaluate the vulnerability detection performance of each scheme. These four evaluation indexes are Accuracy, Precision, Recall, and F1 score (F1).
Here, accuracy refers to the proportion of all test cases that are correctly classified. Precision refers to the ratio of correctly predicted vulnerable samples to all samples predicted to be vulnerable. Recall refers to the ratio of correctly predicted vulnerable samples to all vulnerable samples. The F1 score evaluates the overall effect by taking both precision and recall into account.
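These four definitions can be stated directly as code. This is a generic sketch of the standard binary-classification metrics, not tooling from the patent:

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary vulnerability labels
    (1 = vulnerable, 0 = non-vulnerable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

print(detection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
# ≈ (0.6, 0.667, 0.667, 0.667)
```

On the imbalanced REVEAL corpus, accuracy alone is misleading (predicting "non-vulnerable" everywhere already scores ~90%), which is why the comparison below also reports precision, recall and F1.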
In performance control experiments, the performance of the present invention on the REVEAL dataset versus other control protocols is shown in table 1:
TABLE 1 Performance test results of different protocols on REVEAL data set
Analysis of the data in the table shows that the method of the present invention achieves an accuracy and precision of 90.48% and 77.22%, respectively. Compared with the Reveal scheme, the method improves accuracy, precision and F1 score by 8.71%, 45.67% and 21.71% respectively. However, the method of the present invention is 7.47% lower than Reveal in terms of recall, because the ratio of vulnerable to non-vulnerable functions in the REVEAL dataset is unbalanced (vulnerable functions account for only 9% of all functions), and the uneven number of test categories together with considerable data noise leads to poorer recall performance of the method. Compared with Devign, the accuracy is improved by 2.99%, the precision by 45.67%, the recall by 17.02%, and the F1 score by 29.42%. Our method also has significant advantages over the other two methods. In summary, our approach provides significant improvements over the other state-of-the-art approaches on the four widely used indicators.
Further, this experiment compared the present invention with the control schemes on the SARD dataset; the experimental data obtained are shown in Table 2:
TABLE 2 Performance test results of different schemes on the SARD dataset
Analysis of the data in Table 2 shows that the scheme of the present invention also has significant advantages on the other dataset. In summary, the method of the present invention provides a significant improvement over the other state-of-the-art methods on the four widely used criteria. The method of the present invention is superior to the token-based method (Russell et al.) and the slice+token-based methods in almost all four metrics. This is because graph-based models learn the semantic dependencies between nodes through the various graphs, and the graphs largely preserve the semantic and syntactic information of the code, which enables graph-based models to make accurate predictions. Token- and slice-based methods, however, do not fully preserve the semantic information of the program, which may result in the model failing to make correct predictions. Thus, the graph-based approach is significantly better than the slice- and token-based approaches.
3. Ablation experiments
The technical scheme provided by the invention comprises two core elements: feature extraction and feature fusion on the CPG, and learning and training of the extracted fused feature vectors with a CNN model. To investigate the contribution of each to the outstanding performance of the final scheme, the technicians designed the following ablation experiments.
3.1 Different feature extraction modes
This experiment processes the CPG with two methods and compares them, to explore the effect of the tree-view + flow-view CPG processing mode on the performance of the scheme employed by the present invention.
First, in the experiments of the present invention, the source code was normalized following the Devign method and converted into a CPG using the Joern tool. The CPG is then split to obtain the required abstract syntax tree, control flow graph and program dependency graph. The abstract syntax tree is enhanced into the enhanced abstract syntax tree using the corresponding rules and serves as the tree view of the invention; the invention uses the pre-trained GraphCodeBERT for its feature extraction and vector embedding. The control flow graph and the program dependency graph form the flow view, which the invention embeds with CodeBERT. Finally, the tree view and the flow view are vector-fused using the vector-fusion method and fed into the CNN for training and prediction.
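The fusion step described above can be sketched as follows. The random tensors stand in for GraphCodeBERT/CodeBERT per-node embeddings, and the per-view mean pooling and concatenation follow the 1×768 + 2×768 = 3×768 layout stated in the claims:

```python
import torch

torch.manual_seed(0)

# Random stand-ins for the per-node embeddings produced by GraphCodeBERT
# (tree view) and CodeBERT (flow view); 768 is the models' hidden size.
east_tokens = torch.randn(57, 768)  # enhanced-AST node embeddings
cfg_tokens = torch.randn(31, 768)   # control-flow-graph node embeddings
pdg_tokens = torch.randn(42, 768)   # program-dependency-graph node embeddings

# Average-pool each view over its nodes, then concatenate the views.
tree_view = east_tokens.mean(dim=0, keepdim=True)                          # 1 x 768
flow_view = torch.stack([cfg_tokens.mean(dim=0), pdg_tokens.mean(dim=0)])  # 2 x 768
fused = torch.cat([tree_view, flow_view], dim=0)                           # 3 x 768
print(fused.shape)  # torch.Size([3, 768])
```

Each row of the fused vector can then be treated as one input channel of the CNN, matching the three-channel input described in claim 8.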
In the second experiment, a method that does not split the CPG was used. After the code property graph of the source code is obtained with the Joern tool, CodeBERT is used directly to extract features from and embed the whole CPG, without extracting the tree and flow views. The resulting source code feature vectors are then fed into a CNN with the same parameter configuration for training and prediction.
In this experiment, the final performance test results of the two schemes are shown in fig. 5. As can be seen from the data in fig. 5, the method of the present invention is superior to the conventional CPG processing method in terms of efficiency and accuracy over the first 100 rounds of training and prediction on the SARD dataset. This shows that the invention's method of extracting feature information from the CPG is superior to the traditional method in efficiency and precision.
The invention achieves this advantage because the CPG contains a large amount of code semantic and syntactic information. If the CPG is simply vector-embedded and sent to the CNN for feature learning and training, the model has difficulty extracting features related to vulnerable functions from the large amount of irrelevant and complex information, resulting in low model efficiency and accuracy. The scheme of the invention instead separately extracts the tree view and the flow view from the CPG and inputs them into the model after vector embedding, which amounts to a simple classification of the semantic and syntactic features of the code before they enter the model; the model can therefore extract the relevant features more easily, improving the efficiency and accuracy of vulnerability detection.
3.2 Different neural network models
This experiment studies how, under the same method, different training models affect the vulnerability detection performance of the final scheme. Specifically, it investigates two widely used classifier models: the Convolutional Neural Network (CNN) and the Multi-Layer Perceptron (MLP). For the CNN model, the experiment used a model with 10 convolutional filter layers and 128 hidden layers. For the MLP model, the experiment used a network with 3 fully connected (i.e., linear) layers and 2 ReLU activation layers, with a final sigmoid activation layer outputting the binary classification result.
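The two classifiers compared above can be sketched as follows. The quoted layer counts are kept (10 convolutional filter layers with "128 hidden layers", read here as 128 hidden channels, which is an assumption; an MLP with 3 linear and 2 ReLU layers plus a sigmoid output), but kernel sizes and channel widths are illustrative:

```python
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    """10 convolutional filter layers, 128 hidden channels, ReLU, binary head."""
    def __init__(self, channels=3, hidden=128):
        super().__init__()
        layers, in_ch = [], channels
        for _ in range(10):  # 10 convolutional filter layers
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1),
                       nn.ReLU()]
            in_ch = hidden
        self.convs = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(hidden, 2))

    def forward(self, x):  # x: (batch, 3, 768) fused feature vectors
        return self.head(self.convs(x))

class MLPClassifier(nn.Module):
    """3 fully connected layers, 2 ReLU layers, sigmoid binary output."""
    def __init__(self, dim=3 * 768, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, 3, 768), flattened for the MLP
        return self.net(x.flatten(1))

x = torch.randn(4, 3, 768)
print(CNNClassifier()(x).shape, MLPClassifier()(x).shape)
# torch.Size([4, 2]) torch.Size([4, 1])
```

Note the structural difference discussed later in this section: the convolutional layers slide a shared kernel over the 768-dimensional feature axis (local connectivity, weight sharing), whereas every MLP neuron is connected to all inputs of the previous layer.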
This experiment uses the same procedure to process the source code of the SARD and QEMU datasets; after the feature vectors of the source code are obtained with the same feature extraction method, they are fed into the CNN classifier and the MLP classifier, respectively, for feature learning and training. The resulting performance of each scheme is shown in fig. 6.
As can be seen from the data in fig. 6, the results of training and prediction with the CNN model, which are more balanced across the evaluation indexes overall, are slightly better than those of the MLP model, which shows polarized evaluation indexes on the SARD dataset (i.e., higher accuracy but lower recall).
Analysis suggests that the above phenomenon occurs because the structure of the MLP model is relatively simple, so it easily falls into local minima and cannot find the globally optimal solution. Moreover, the fully connected layers of the MLP connect each neuron to all neurons of the previous layer, so the local features of the data cannot be exploited, whereas a CNN can extract local features through local connections and weight sharing, which an MLP cannot. In addition, the MLP's difficulty in processing high-dimensional data may be another reason. The experimental results indicate that, in practical applications, the CNN should be selected as the network model for the feature training and classification required by the invention.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. A multi-view fusion source code logic vulnerability detection method, characterized in that it comprises the following steps:
S1: encoding the software source code into a fused feature vector containing the feature information of a tree view and a flow view, the process comprising:
S11: generating a code property graph CPG from the software source code;
S12: extracting the corresponding edges and nodes from the CPG to construct an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG respectively;
S13: according to preset rules, adding additional edges representing return, loop and jump operations between the nodes of the abstract syntax tree AST, thereby obtaining an enhanced syntax tree EAST;
S14: taking the enhanced syntax tree EAST as the tree view, and generating the corresponding tree-view features through a pre-trained code semantic encoding model;
S15: taking the control flow graph CFG and the program dependency graph PDG as the flow view, and generating the corresponding flow-view features through a pre-trained code semantic encoding model;
S16: average-pooling and then concatenating the feature vectors of the tree view and the flow view to obtain the required fused feature vector;
S2: obtaining a large number of fused feature vectors converted from software code, and manually adding label information indicating whether a vulnerability exists to form sample data, thereby constructing the required sample data set;
S3: selecting the cross-entropy loss as the loss function, dividing the sample data set into a training set and a test set, using them to train and test a CNN model, and saving the model parameters of the network model that meets the performance index requirements;
S4: using the saved network model as the vulnerability detection model, encoding the software source code to be identified into the corresponding fused feature vector, and inputting it into the vulnerability detection model, thereby realizing vulnerability detection.
2. The multi-view fusion source code logic vulnerability detection method according to claim 1, characterized in that: in step S11, the source-code static analysis tool Joern is used to convert the software source code into the corresponding code property graph CPG.
3. The multi-view fusion source code logic vulnerability detection method according to claim 2, characterized in that: in step S13, relative to the abstract syntax tree AST, the additional edges added in the enhanced syntax tree EAST comprise:
(1) an edge connecting the return node in the AST with the function-name node, denoted as the return edge RETURN TO;
(2) an edge chaining in sequence the nodes corresponding to all variables of the same line of code, denoted as the inter-layer cascade edge NEXT_TOKEN;
(3) a backward edge (reverse edge) added between the nodes of every original edge and every inter-layer cascade edge.
4. The multi-view fusion source code logic vulnerability detection method according to claim 3, characterized in that: in step S14, the feature vector of the tree view is generated by the pre-trained GraphCodeBERT; in step S15, the feature vector of the flow view is generated by the pre-trained CodeBERT.
5. The multi-view fusion source code logic vulnerability detection method according to claim 4, characterized in that: in step S15, the features of the CFG and the PDG are first extracted separately by CodeBERT, and the two sets of features are then concatenated to form the feature vector of the flow view.
6. The multi-view fusion source code logic vulnerability detection method according to claim 5, characterized in that: in step S16, the average-pooled feature vector of the tree view is 1*768-dimensional and that of the flow view is 2*768-dimensional, so the fused feature vector is 3*768-dimensional.
7. The multi-view fusion source code logic vulnerability detection method according to claim 6, characterized in that: in step S3, the CNN model comprises 10 convolutional filter layers and 128 hidden layers, and uses ReLU as the activation function.
8. The multi-view fusion source code logic vulnerability detection method according to claim 7, characterized in that: in step S4, the fused feature vector is divided into three channels that are simultaneously input into the vulnerability detection model, and the vulnerability detection model outputs a prediction of whether the software source code corresponding to the fused feature vector contains a vulnerability.
9. A multi-view fusion source code logic vulnerability detection system, characterized in that it is designed on the basis of the multi-view fusion source code logic vulnerability detection method according to any one of claims 1-8, and comprises:
a code acquisition unit, which acquires the source code of the software to be tested;
a CPG generation unit, which uses a preset static analysis tool to generate the corresponding code property graph from the input source code;
a tree-view generation unit, which extracts the corresponding nodes and edges from the code property graph to construct an abstract syntax tree AST, and then adds three types of additional edges to the AST; the resulting enhanced syntax tree EAST serves as the required tree view;
a flow-view generation unit, which extracts the corresponding nodes and edges from the code property graph; the constructed control flow graph CFG and program dependency graph PDG together serve as the required flow view;
a fusion feature generation unit, which uses a pre-trained code semantic encoding model to generate the feature vectors corresponding to the tree view and the flow view, average-pools the two and then concatenates them, thereby obtaining the required fused feature vector;
a vulnerability detection model, which is obtained by training a CNN model with the multi-view fusion source code logic vulnerability detection method according to any one of claims 1-8; the vulnerability detection model generates and outputs, from the input fused feature vector, a prediction of whether the corresponding software code contains a vulnerability.
10. A multi-view fusion source code logic vulnerability detection device, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that: when the processor executes the computer program, the source code logic vulnerability detection system according to claim 9 is created, thereby detecting whether a vulnerability exists in the input software source code.
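The three edge types of claim 3 can be sketched on a toy AST as follows; the node names, edge labels and edge-list representation are invented for illustration:

```python
def build_east(ast_edges, func_node, return_nodes, stmt_tokens):
    """Augment a toy AST (list of (src, dst) node-id pairs) into an EAST:
    (1) RETURN_TO edges from each return node to the function-name node,
    (2) NEXT_TOKEN edges chaining the variable nodes of one statement,
    (3) a reverse edge for every original and NEXT_TOKEN edge."""
    east = [(s, d, "AST") for s, d in ast_edges]
    for r in return_nodes:                      # (1) return edges
        east.append((r, func_node, "RETURN_TO"))
    for tokens in stmt_tokens:                  # (2) chain same-line tokens
        for a, b in zip(tokens, tokens[1:]):
            east.append((a, b, "NEXT_TOKEN"))
    reverses = [(d, s, "REVERSE")               # (3) backward edges
                for s, d, kind in east if kind in ("AST", "NEXT_TOKEN")]
    return east + reverses

edges = build_east(
    ast_edges=[("func", "stmt"), ("stmt", "x"), ("stmt", "y"), ("func", "ret")],
    func_node="func",
    return_nodes=["ret"],
    stmt_tokens=[["x", "y"]],  # variables appearing on one line of code
)
print(len(edges))  # 4 AST + 1 RETURN_TO + 1 NEXT_TOKEN + 5 reverse = 11
```

Per claim 3, reverse edges are added only for the original and inter-layer cascade edges, not for the RETURN TO edges, which the filter in step (3) reflects.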
CN202411987591.6A 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device Active CN119670101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411987591.6A CN119670101B (en) 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411987591.6A CN119670101B (en) 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device

Publications (2)

Publication Number Publication Date
CN119670101A true CN119670101A (en) 2025-03-21
CN119670101B CN119670101B (en) 2025-09-05

Family

ID=94991718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411987591.6A Active CN119670101B (en) 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device

Country Status (1)

Country Link
CN (1) CN119670101B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9201591B1 (en) * 2015-05-19 2015-12-01 Synack, Inc. Automated coverage monitoring of mobile applications
US20210056211A1 (en) * 2019-08-23 2021-02-25 Praetorian System and method for automatically detecting a security vulnerability in a source code using a machine learning model
CN117763560A (en) * 2023-12-08 2024-03-26 扬州大学 Interpretable vulnerability detection method and system based on double-view causal reasoning
CN118761058A (en) * 2024-07-02 2024-10-11 四川大学 A source code vulnerability classification detection method based on multi-feature fusion and self-attention encoder neural network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YALONG YU ET AL.: "A Vulnerability Detection Framework Based on Graph Decomposition Fusion and Augmented Abstract Syntax Tree", BDICN '25: PROCEEDINGS OF THE 2025 4TH INTERNATIONAL CONFERENCE ON BIG DATA, INFORMATION AND COMPUTER NETWORK, 27 May 2025 (2025-05-27), pages 734 *
CHEN Zhaoxuan; ZOU Deqing; LI Zhen; JIN Hai: "Intelligent Vulnerability Detection System Based on Abstract Syntax Tree", Journal of Cyber Security, no. 04, 15 July 2020 (2020-07-15) *

Also Published As

Publication number Publication date
CN119670101B (en) 2025-09-05

Similar Documents

Publication Publication Date Title
CN110532353B (en) Text entity matching method, system and device based on deep learning
Meng et al. [Retracted] A Deep Learning Approach for a Source Code Detection Model Using Self‐Attention
Ferreira et al. Software engineering meets deep learning: a mapping study
CN112487154B (en) Intelligent search method based on natural language
CN117688560A (en) Semantic analysis-oriented intelligent detection method for malicious software
CN115146279A (en) Program vulnerability detection method, terminal device and storage medium
CN118860480B (en) A code defect detection and repair method and device based on large model
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN116663539A (en) Chinese entity and relationship joint extraction method and system based on RoBERTa and pointer network
CN119272275A (en) Source code vulnerability detection method and system based on adaptive graph neural network
CN118886019A (en) A blockchain smart contract vulnerability detection method based on multimodal deep learning
CN117312559A (en) Method and system for extracting aspect-level emotion four-tuple based on tree structure information perception
CN116305119A (en) APT malware classification method and device based on predictive guidance prototype
He et al. VulTR: Software vulnerability detection model based on multi-layer key feature enhancement
CN119783799A (en) A multimodal knowledge graph completion method based on dynamic prompt learning and multi-granularity aggregation
Lian et al. A universal and efficient multi-modal smart contract vulnerability detection framework for big data
Wang et al. Multi-type source code defect detection based on textcnn
CN119670101B (en) Multi-view fusion source code logic vulnerability detection method, system and device
CN116561256B (en) Aspect-level sentiment quadruple extraction method and system based on deep learning
Parisi et al. Making the most of scarce input data in deep learning-based source code classification for heterogeneous device mapping
CN119150302B (en) A method and system for intelligent mining of software security vulnerabilities based on large language model
CN120277682B (en) Intelligent contract vulnerability detection method based on multi-mode selection state space fusion
Ning et al. Collaborative analysis on code structure and semantics
US20230169304A1 (en) Method of extracting information set based on parallel decoding and computing system for performing the same
Wang et al. Python open-source code traceability model based on graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant