
CN119670101A - Source code logic vulnerability detection method, system and device based on multi-view fusion - Google Patents

Source code logic vulnerability detection method, system and device based on multi-view fusion

Info

Publication number
CN119670101A
CN119670101A
Authority
CN
China
Prior art keywords
view
source code
vulnerability detection
fusion
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202411987591.6A
Other languages
Chinese (zh)
Other versions
CN119670101B (en)
Inventor
章程
余亚龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202411987591.6A priority Critical patent/CN119670101B/en
Publication of CN119670101A publication Critical patent/CN119670101A/en
Application granted granted Critical
Publication of CN119670101B publication Critical patent/CN119670101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The present invention belongs to the field of software security testing, and specifically relates to a method, system and device for detecting source code logic vulnerabilities by multi-view fusion. The method comprises: S1: encoding the software source code into a fusion feature vector containing feature information of a tree view and a flow view; S2: obtaining a large number of fusion feature vectors and manually adding label information indicating whether a vulnerability exists, to form a sample data set; S3: selecting cross entropy loss as the loss function, dividing the sample data set into a training set and a test set, training and testing a CNN model, and saving the model parameters of the network model that meets the requirements; S4: using the saved network model as the vulnerability detection model to perform vulnerability detection on the fusion feature vector encoded from the software source code to be identified. The present invention solves the problem that existing graph-feature-based vulnerability detection schemes cannot fully extract and utilize the feature information in the software code property graph, which leads to insufficient vulnerability detection accuracy and efficiency.

Description

Multi-view fusion source code logic vulnerability detection method, system and device
Technical Field
The invention belongs to the field of software testing, and particularly relates to a method, a system and a device for detecting source code logic vulnerabilities based on multi-view fusion.
Background
Source code vulnerability detection is an important part of software supply chain security. Traditional software source code vulnerability detection methods suffer from high false positive and false negative rates: existing static analysis methods tend to flag non-vulnerable programs, leading to high false positive rates, while dynamic detection methods still miss many vulnerabilities, leaving high false negative rates. To date, these tools have remained unreliable.
Deep-learning-based software source code vulnerability detection methods are widely applied, and among them detection methods based on graph analysis show good accuracy. However, most of these methods analyze the graph directly and extract its features wholesale, which often incurs significant overhead. For example, the graph representation of software code obtained by graph analysis contains a large amount of information, most of which is irrelevant to the vulnerability characteristics; the model must pick out the vulnerability-related information from this mass of data, which increases the difficulty of recognition. Meanwhile, if a function is large, the resulting graph representation becomes complex, which enlarges the deep learning network, increases the cost of the scheme, and limits its practical value on low-power devices. In addition, because GPU memory and computing resources are limited, the input length of existing neural network models is usually bounded; the network may therefore fail to extract enough feature information directly from the graph, and some features are truncated and never used by the model, which greatly reduces its vulnerability recognition accuracy.
Disclosure of Invention
The invention provides a multi-view fusion source code logic vulnerability detection method, system and device, which solve the problem that existing graph-feature-based vulnerability detection schemes cannot fully extract and utilize the feature information in the software code property graph, resulting in insufficient vulnerability detection accuracy and efficiency.
The technical scheme provided by the invention is as follows:
A multi-view fusion source code logic vulnerability detection method comprises the following steps:
S1, encoding the software source code into a fusion feature vector containing feature information of a tree view and a flow view, wherein the process comprises the following steps:
S11, generating a code property graph CPG from the software source code.
S12, extracting the corresponding edges and nodes from the CPG to construct an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG, respectively.
S13, adding additional edges representing return, loop and jump operations between nodes in the abstract syntax tree AST according to preset rules, so as to obtain an enhanced syntax tree EAST.
S14, taking the enhanced syntax tree EAST as the tree view, and generating the corresponding tree view features through a pre-trained code semantic encoding model.
S15, taking the CFG and PDG as the flow view, and generating the corresponding flow view features through a pre-trained code semantic encoding model.
S16, average-pooling the feature vectors of the tree view and the flow view, and then concatenating them to obtain the required fusion feature vector.
S2, acquiring a large number of fusion feature vectors converted from software code, manually adding label information indicating whether a vulnerability exists, and using the labeled vectors as sample data to form the required sample data set.
S3, selecting cross entropy loss as the loss function, dividing the sample data set into a training set and a test set, training and testing a CNN model, and saving the model parameters of the network model that meets the performance index requirements.
S4, taking the saved network model as the vulnerability detection model, encoding the software source code to be identified into the corresponding fusion feature vector, and inputting the fusion feature vector into the vulnerability detection model to perform vulnerability detection.
As a further development of the invention, in step S11, the software source code is converted into the corresponding code property graph CPG using Joern, a source-code-based static analysis tool.
As a further improvement of the present invention, in step S13, the additional edges added in the enhanced syntax tree EAST, relative to the abstract syntax tree AST, include:
(1) An edge connecting each RETURN node in the AST to the function name node, denoted RETURN_TO.
(2) An edge serially chaining the nodes corresponding to all variables of the same line of code, denoted the inter-layer cascade edge NEXT_TOKEN.
(3) A backward edge REVERSE_EDGE added for every original edge and every inter-layer cascade edge.
As a further improvement of the invention, in step S14, the feature vectors of the tree view are generated by the pre-trained GraphCodeBERT model, and in step S15, the feature vectors of the flow view are generated by the pre-trained CodeBERT model.
As a further improvement of the present invention, in step S15, the features of the CFG and the PDG are each extracted through CodeBERT and then concatenated to form the feature vector of the flow view.
In step S16, the average-pooled feature vector of the tree view is 1×768-dimensional, the feature vector of the flow view is 2×768-dimensional, and the fused feature vector is 3×768-dimensional.
As a further improvement of the present invention, in step S3, the CNN model includes 10 convolution filter layers and 128 hidden layers, and a ReLU is used as an activation function.
As a further improvement of the invention, in step S4, the fusion feature vector is divided into three channels that are input simultaneously into the vulnerability detection model, which outputs a prediction of whether the software source code corresponding to the fusion feature vector contains a vulnerability.
The invention further discloses a multi-view fusion source code logic vulnerability detection system designed on the basis of the above multi-view fusion source code logic vulnerability detection method. The system comprises a code acquisition unit, a CPG generation unit, a tree view generation unit, a flow view generation unit, a fusion feature generation unit and a vulnerability detection model.
The CPG generation unit is used for generating the corresponding code property graph from the input source code using a preset static analysis tool. The tree view generation unit is used for extracting the corresponding nodes and edges from the code property graph to construct an abstract syntax tree AST, and then adding the three types of additional edges to the AST to obtain an enhanced syntax tree EAST as the required tree view. The flow view generation unit is used for extracting the corresponding nodes and edges from the code property graph; the constructed control flow graph CFG and program dependency graph PDG together serve as the required flow view.
The fusion feature generation unit is used for generating the feature vectors corresponding to the tree view and the flow view with the pre-trained code semantic encoding models, average-pooling them, and then concatenating them to obtain the required fusion feature vector. The vulnerability detection model is a CNN model trained by the above multi-view fusion source code logic vulnerability detection method, and is used for generating and outputting a prediction of whether the corresponding software code contains a vulnerability from the input fusion feature vector.
The invention also provides a multi-view fusion source code logic vulnerability detection device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the source code logic vulnerability detection system is instantiated, so that whether the input software source code contains a vulnerability is detected.
The technical scheme provided by the invention has the following beneficial effects:
The vulnerability detection scheme provided by the invention first constructs the CPG of the code from the input source code. Then, on one hand, the abstract syntax tree AST is decomposed from the CPG, the enhanced syntax tree EAST is generated from the AST as the tree view, and its features are extracted. On the other hand, the control flow graph and the program dependency graph are extracted from the CPG, and the feature vectors of the corresponding flow view are generated. The extracted feature vectors of the flow view and the tree view are then concatenated, and the resulting fusion feature vector is fed to the trained CNN model, thereby detecting the vulnerabilities present in the software code.
In this scheme, the tree view and the flow view are extracted separately from the CPG, embedded into vectors, and then input into the model, so that the semantic and syntactic characteristics of the code are separated. This makes it easier for the model to extract the relevant features, improving both the efficiency and the accuracy of vulnerability detection.
Drawings
Fig. 1 is a flowchart of steps of a multi-view fusion source code logical vulnerability detection method provided in embodiment 1 of the present invention.
Fig. 2 is a typical diagram of an enhanced syntax tree generated from an abstract syntax tree according to embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of a source code logical vulnerability detection method of multi-view fusion in embodiment 1 of the present invention.
Fig. 4 is a system architecture diagram of a multi-view fused source code logical vulnerability detection system provided in embodiment 2 of the present invention.
Fig. 5 is a graph showing the performance comparison between the CPG feature extraction method of the invention and the conventional scheme in an ablation experiment.
Fig. 6 is a graph comparing the performance of the inventive scheme using CNN and MLP in an ablation experiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
This embodiment provides a multi-view fusion source code logic vulnerability detection method, which still belongs to the family of detection methods based on graph analysis. Unlike conventional methods, however, this embodiment does not perform feature extraction and network model recognition directly on the code property graph obtained from the software code; instead, it decomposes from the code property graph a tree view and a flow view that contain different feature information. Features are extracted from the information contained in each of the two views; after the feature representations of the tree view and the flow view are obtained, the features of the two views are merged by vector concatenation, and the resulting fused feature vector serves as the input of the deep learning network model for feature learning and vulnerability detection classification.
With the scheme provided by this embodiment, the network model can focus on the core information in the code property graph related to the software's logic, excluding the interference of the large amount of invalid information the graph contains, thereby addressing the low vulnerability detection accuracy and excessive resource consumption of conventional schemes.
Specifically, as shown in fig. 1, the method for detecting a source code logical vulnerability of multi-view fusion in this embodiment includes the following steps:
S1, encoding the software source code into a fusion feature vector containing feature information of the tree view and the flow view. The fusion feature vector is a new data type used in this embodiment for training the network model and realizing software vulnerability detection; its construction process comprises the following steps:
S11, generating a code attribute graph CPG according to the software source code.
The code property graph is a graph-based code representation that provides a combined and compact view of the code, consisting of elements from the control and data flows together with all the feature information of the abstract syntax. The code property graph can provide additional context for syntactic and semantic results related to the source code. Like all graph-structured data, a code property graph can be represented as G = (V, E), where V is the set of nodes and E is the set of edges between the nodes. In a CPG, the node set V is made up of typed code fragments such as CallStatement, ReturnStatement and ArithmeticExpression.
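As a minimal illustration of the G = (V, E) representation just described, the following sketch models a tiny CPG as labeled edge lists; the concrete snippet, node ids and edge labels are illustrative assumptions, not the output of any particular tool:

```python
# Minimal sketch of a code property graph as G = (V, E).
# Node types follow the text above (CallStatement, ReturnStatement,
# ArithmeticExpression); the snippet itself is a hypothetical example.
cpg_nodes = {
    1: {"type": "FunctionDef",          "code": "int add(int a, int b)"},
    2: {"type": "ArithmeticExpression", "code": "a + b"},
    3: {"type": "ReturnStatement",      "code": "return a + b;"},
}
cpg_edges = [
    (1, 2, "AST"),  # syntactic structure
    (1, 3, "AST"),
    (2, 3, "CFG"),  # control flow
    (2, 3, "DDG"),  # data dependence on the computed value
]

def view(edges, kind):
    """Extract the edge set of one view (AST / CFG / DDG) from the CPG."""
    return [(u, v) for (u, v, k) in edges if k == kind]
```

Filtering the shared edge set by label is exactly how the later steps separate the AST, CFG and PDG views from the single combined graph.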
In practical applications, the Joern tool (an advanced source-code-based static analysis tool) may be used to generate the code property graph (CPG), although in other implementations, other tools similar in function to Joern may be used to transcode the software into the corresponding code property graph.
And S12, extracting corresponding edges and nodes from the CPG to respectively construct an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG.
As described above, the CPG is the raw graph-structured data converted directly from the source code. This type of data includes a large amount of information related to the inherent logic of the source code, including syntax, semantic information, data logic relationships, control procedures, and so on, and is therefore a large and complex data set. To achieve better logic vulnerability detection, this embodiment extracts three types of core information from the CPG: the abstract syntax tree AST, the control flow graph CFG, and the program dependency graph PDG. The separated abstract syntax tree enters the tree view branch, while the control flow graph and the program dependency graph enter the flow view branch.
And S13, adding additional edges representing return, circulation and jump operations among nodes in the abstract syntax tree AST according to preset rules, so as to obtain an enhanced syntax tree EAST.
The original abstract syntax tree only contains the relationships among variables. This embodiment enhances the abstract syntax tree by adding additional edges with different purposes, attaching more information to the syntax labels corresponding to the variables. Specifically, as shown in fig. 2, the additional edges added in the enhanced syntax tree EAST, relative to the abstract syntax tree AST, include:
(1) The RETURN_TO edge connects a return statement back to the function declaration, i.e., it connects each RETURN node in the AST with the function name node.
(2) The NEXT_TOKEN edge connects all the variables of the same code statement, i.e., an edge serially chaining the nodes corresponding to all variables of the same code statement is added and denoted the inter-layer cascade edge.
(3) For all types of edges, their respective backward edges (the transposed adjacency matrix) are added, doubling the number of edges and edge types; that is, a backward edge REVERSE_EDGE is added for every original edge and every inter-layer cascade edge. The backward edges allow information to propagate faster through the model and improve model performance.
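The three augmentation rules above can be sketched as follows; the graph is a plain labeled edge list and the node ids are illustrative assumptions (a real implementation would operate on the AST produced by a CPG tool):

```python
# Sketch of the three edge-augmentation rules that turn an AST into an
# enhanced syntax tree (EAST).
def build_east(ast_edges, return_nodes, func_name_node, statement_tokens):
    """ast_edges: [(src, dst)]; statement_tokens: the token node ids of
    one code statement, in source order. Returns a labeled edge list."""
    east = [(u, v, "AST") for (u, v) in ast_edges]
    # (1) RETURN_TO: connect each return node back to the function name node.
    east += [(r, func_name_node, "RETURN_TO") for r in return_nodes]
    # (2) NEXT_TOKEN: serially chain the tokens of one statement.
    east += [(a, b, "NEXT_TOKEN")
             for a, b in zip(statement_tokens, statement_tokens[1:])]
    # (3) REVERSE_EDGE: add a backward edge for every edge built so far,
    # doubling the number of edges and edge types.
    east += [(v, u, k + "_REVERSE") for (u, v, k) in list(east)]
    return east
```

With two AST edges, one return node and a two-token statement, the result contains eight edges: four forward edges plus the same four reversed, matching the "double the number of edges" rule.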
S14, taking the enhancement grammar tree EAST as a tree view, and generating corresponding tree view features through a pre-trained code semantic coding model.
In the branch corresponding to the tree view, this embodiment uses GraphCodeBERT to embed the resulting enhanced AST-based graph. GraphCodeBERT is a pre-trained model for programming languages that takes into account both the inherent structure of the code and its semantics. GraphCodeBERT first preprocesses the enhanced AST-based graph to obtain source code tokens. Similarly to BERT, GraphCodeBERT encodes words with a combination of token embedding and position embedding; its core network consists of 12 Transformer encoder layers with a 12-head multi-head attention mechanism, including feed-forward and layer normalization sublayers. The embedding produced by the pre-trained GraphCodeBERT model yields the corresponding feature vectors, which contain the inherent structure and syntax information of the source code.
And S15, taking the CFG and PDG as flow views, and generating corresponding flow view features through a pre-trained code semantic coding model.
In the flow view branch, feature extraction may be performed separately on the obtained control flow graph and program dependency graph, and the extracted feature vectors are then fused. Because the program dependency graph and the control flow graph have the same vertices but different edges, the fused representation captures both their structural and vertex information. In practice, CodeBERT may be used to extract the features of the control flow graph and the program dependency graph, and their vectors are then merged by vector concatenation.
In particular, considering that the program dependency graph and the control flow graph contain the same vertices but different edges, when the flow view feature vector is generated, the features of the CFG and the PDG can each be extracted through CodeBERT and then concatenated to form the feature vector of the flow view.
And S16, carrying out average pooling on the feature vectors of the tree view and the stream view, and then splicing to obtain the required fusion feature vector.
The dimensions of the extracted feature vectors may differ from view to view. For example, for a function of n lines of code, the pre-trained GraphCodeBERT and CodeBERT models generate n × 1 × 768-dimensional feature vectors, so an n × 1 × 768-dimensional vector is obtained in each view. Considering that the CNN model requires inputs of identical size, this embodiment applies average pooling to reduce the feature vectors from the different views of the same software source code uniformly to 1 × 768-dimensional vectors; the feature vectors of the tree view and the flow view are then concatenated in the vector fusion stage, producing a 3 × 768-dimensional feature vector for each function.
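The pool-then-concatenate step can be sketched in a few lines; the embedding dimension is 768 in the text, but a small dimension is used here for brevity:

```python
# Sketch of step S16: average-pool each view's per-line features to a
# single vector, then concatenate the tree and flow views.
def mean_pool(rows):
    """Reduce an n x d matrix (one row per code line) to one d-dim vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fuse_views(tree_rows, cfg_rows, pdg_rows):
    """Return the 3 x d fused feature: pooled tree, CFG and PDG vectors."""
    return [mean_pool(tree_rows), mean_pool(cfg_rows), mean_pool(pdg_rows)]
```

Whatever the line count n of the input function, the fused output always has the fixed 3 × d shape the CNN expects.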
S2, acquiring a large number of fusion feature vectors converted from software code, manually adding label information indicating whether a vulnerability exists, and using the labeled vectors as sample data to form the required sample data set.
In practical application, all acquired source code test data can be processed with the same method, including construction of the tree and flow views, feature extraction and vector embedding, to obtain the corresponding fusion feature vectors. The label information is added by technicians according to the true vulnerability inspection results of the source code.
S3, selecting cross entropy loss as the loss function, dividing the sample data set into a training set and a test set, training and testing the CNN model, and saving the model parameters of the network model that meets the performance index requirements.
In this embodiment, the CNN model used includes 10 convolution filter layers and 128 hidden layers, and ReLU is used as the activation function. During the training phase of the CNN, the loss function used to penalize misclassification is the cross entropy loss (Cross Entropy Loss, CELoss), which penalizes the model in proportion to the negative log-probability it assigns to the correct class, driving the predicted distribution toward the ground-truth labels.
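For the binary vulnerable / non-vulnerable classification described here, the cross entropy loss reduces to the negative log-probability of the true class; a minimal sketch:

```python
import math

# Cross entropy loss for one sample: the loss is -log of the probability
# the model assigns to the ground-truth class.
def cross_entropy(probs, label):
    """probs: softmax output over {0: non-vulnerable, 1: vulnerable};
    label: ground-truth class index."""
    return -math.log(probs[label])
```

A confident correct prediction (e.g. probs = [0.1, 0.9] with label 1) incurs a smaller loss than an uncertain one (probs = [0.5, 0.5]), which is what drives the model toward the labels during training.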
S4, taking the saved network model as the vulnerability detection model, encoding the software source code to be identified into the corresponding fusion feature vector, and inputting the fusion feature vector into the vulnerability detection model to perform vulnerability detection.
In actual application, the fusion feature vector generated from the source code is divided into three channels that are input simultaneously into the vulnerability detection model, which outputs a prediction of whether the software source code corresponding to the fusion feature vector contains a vulnerability.
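The three-channel split is just a reshaping of the 3 × d fused feature; the channel names below are an illustrative assumption (the text only fixes that one channel comes from the tree view and two from the flow view):

```python
# Sketch of feeding the 3 x d fused feature to the detector as three
# parallel input channels (tree view, CFG, PDG).
def to_channels(fused):
    """Split the 3 x d fused feature vector into named input channels."""
    tree, cfg, pdg = fused
    return {"tree": tree, "cfg": cfg, "pdg": pdg}
```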
In summary, since the CPG contains a large amount of functional information, existing solutions must extract vulnerability-related information from a mass of irrelevant information in practical applications; the model faces a great challenge in distinguishing the relevant information, which affects the accuracy and hardware requirements of the solution. To alleviate the low accuracy and high resource consumption of vulnerability detection, this embodiment uses a CPG-based multi-view fusion method that combines an enhanced abstract syntax tree, a program dependency graph and a control flow graph. As shown in fig. 3, in practical application, the scheme first constructs the CPG of the code from the input source code. Next, on the one hand, the abstract syntax tree AST (tree view) is decomposed from the CPG, a graph based on the enhanced AST is built, and its features are extracted using GraphCodeBERT. On the other hand, the control flow graph and the program dependency graph are extracted from the CPG, and the feature vectors of the corresponding flow view are generated. All extracted feature vectors are then concatenated to obtain a fusion feature vector containing all the feature information. Finally, the trained CNN module extracts all the graph information contained in the fusion feature vector and classifies whether it contains a vulnerability.
Example 2
Based on the method of embodiment 1, this embodiment further provides a multi-view fusion source code logic vulnerability detection system designed with the multi-view fusion source code logic vulnerability detection method of embodiment 1. The system provided by this embodiment is a computer program that, when run, receives the software source code to be detected, transcodes it into a fusion feature vector using the method of embodiment 1, and performs vulnerability detection through the trained CNN model.
Specifically, as shown in fig. 4, the multi-view fusion source code logic vulnerability detection system provided by this embodiment comprises a code acquisition unit, a CPG generation unit, a tree view generation unit, a flow view generation unit, a fusion feature generation unit and a vulnerability detection model. The CPG generation unit is used for generating the corresponding code property graph from the input source code using a preset static analysis tool. The tree view generation unit is used for extracting the corresponding nodes and edges from the code property graph to construct an abstract syntax tree AST, and then adding the three types of additional edges to the AST to obtain an enhanced syntax tree EAST as the required tree view. The flow view generation unit is used for extracting the corresponding nodes and edges from the code property graph; the constructed control flow graph CFG and program dependency graph PDG together serve as the required flow view.
The fusion feature generation unit is used for generating the feature vectors corresponding to the tree view and the flow view with the pre-trained code semantic encoding models, average-pooling them, and then concatenating them to obtain the required fusion feature vector. The vulnerability detection model is a CNN model trained by the above multi-view fusion source code logic vulnerability detection method, and is used for generating and outputting a prediction of whether the corresponding software code contains a vulnerability from the input fusion feature vector.
Example 3
On the basis of the foregoing embodiments, this embodiment further provides a multi-view fusion source code logic vulnerability detection device, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the source code logic vulnerability detection system is instantiated, so that whether the input software source code contains a vulnerability is detected.
The multi-view fusion source code logic vulnerability detection device provided in this embodiment is essentially a computer device for implementing the scheme of embodiment 1. The computer device may be any device capable of executing a program, such as an intelligent terminal, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster formed by a plurality of servers).
The computer device indicated in this embodiment includes at least, but is not limited to, a memory and a processor communicatively coupled to each other via a system bus. The memory (i.e., a readable storage medium) includes flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory may be an internal storage unit of the computer device, such as its hard disk or main memory. In other embodiments, the memory may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the computer device. Of course, the memory may also include both the internal storage units and the external storage devices of the computer device. In this embodiment, the memory is typically used to store the operating system and the various application software installed on the computer device, and can also temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor may be a central processing unit (CPU), a graphics processing unit (GPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor is typically used to control the overall operation of the computer device. In this embodiment, the processor is configured to execute the program code stored in the memory or to process the data.
Simulation test
In order to verify the performance and advantages of the multi-view fusion source code logic vulnerability detection method provided by the invention, the technicians formulated experimental plans and simulated and tested the related schemes. The test experiments were as follows:
1. Experimental setup
This experiment implements the method of the present invention in PyTorch. The dataset used in the experiments comes from REVEAL and contains 20494 non-vulnerable functions and 2240 vulnerable functions. These programs come from two open-source projects, the Linux Debian kernel and Chromium (the open-source project behind Chrome), two popular and well-maintained public projects that represent various security issues in two important programming domains (operating systems and browsers). Both projects have a large number of published vulnerability reports.
Furthermore, the ratio of non-vulnerable to vulnerable functions in the dataset selected for this experiment was about 9:1, which is similar to the ratio of vulnerable to non-vulnerable programs in the real world. In addition, this experiment also used the QEMU dataset and the SARD dataset (a project maintained by the National Institute of Standards and Technology (NIST), including 12303 vulnerable functions and 21057 non-vulnerable functions).
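The patent does not disclose its exact split procedure; as an illustration, the roughly 9:1 class ratio mentioned above can be preserved across the training and test sets with a simple stratified split. The function name and the 20% test fraction below are assumptions for illustration only:

```python
import random
from collections import Counter

def stratified_split(samples, labels, test_frac=0.2, seed=42):
    """Split (samples, labels) into train/test lists of (sample, label)
    pairs while preserving each label's proportion in both splits."""
    rng = random.Random(seed)
    by_label = {}
    for s, y in zip(samples, labels):
        by_label.setdefault(y, []).append(s)
    train, test = [], []
    for y, group in by_label.items():
        rng.shuffle(group)
        k = int(round(len(group) * test_frac))  # per-class test count
        test += [(s, y) for s in group[:k]]
        train += [(s, y) for s in group[k:]]
    return train, test

# Mimic the REVEAL proportions: 20494 non-vulnerable (0), 2240 vulnerable (1).
labels = [0] * 20494 + [1] * 2240
samples = list(range(len(labels)))
train, test = stratified_split(samples, labels)
print(Counter(y for _, y in train))  # Counter({0: 16395, 1: 1792})
```

Both splits then keep the ~9:1 imbalance, so the reported metrics (especially recall) are measured under the same class distribution as the full corpus.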
In the training phase of the network model, cross-entropy loss is adopted as the loss function, the CNN is trained with the Adam optimizer, and the learning rate is set to 0.001.
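The training configuration above (cross-entropy loss, Adam optimizer, learning rate 0.001) can be sketched in PyTorch as follows. The toy network, tensor shapes and number of steps are illustrative stand-ins, not the patent's exact CNN:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the CNN: 3 input channels (the tree-view row plus the two
# flow-view rows of the fused 3x768 feature vector), binary output.
model = nn.Sequential(
    nn.Conv1d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 2),  # two logits: vulnerable / non-vulnerable
)
criterion = nn.CrossEntropyLoss()                           # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, lr = 0.001

x = torch.randn(8, 3, 768)     # a batch of 8 fused feature vectors
y = torch.randint(0, 2, (8,))  # ground-truth vulnerability labels

for _ in range(5):             # a few optimisation steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.4f}")
```

`CrossEntropyLoss` expects raw logits and integer class indices, so no softmax layer is needed at the end of the model.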
2. Performance contrast test
In order to evaluate the true performance of the present invention, several state-of-the-art vulnerability detection methods were selected as control groups and compared with the present invention, including Reveal, VulCNN, Devign, SySeVR, VulDeePecker, and Russell et al. The experiment tests the performance of the present invention and the control schemes on the REVEAL dataset and the SARD dataset respectively, and adopts four widely used indexes to evaluate the vulnerability detection performance of each scheme. These four evaluation indexes are Accuracy, Precision, Recall, and F1 score (F1).
Here, accuracy refers to the proportion of all test cases that are correctly classified. Precision refers to the ratio of correctly predicted vulnerable samples to all samples predicted to be vulnerable. Recall refers to the ratio of correctly predicted vulnerable samples to all vulnerable samples. The F1 score evaluates the overall effect by taking both precision and recall into account.
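These four definitions can be stated directly as code. This is a generic sketch of the standard binary-classification metrics, not tooling from the patent:

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary vulnerability labels
    (1 = vulnerable, 0 = non-vulnerable)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

print(detection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
# ≈ (0.6, 0.667, 0.667, 0.667)
```

On the imbalanced REVEAL corpus, accuracy alone is misleading (predicting "non-vulnerable" everywhere already scores ~90%), which is why the comparison below also reports precision, recall and F1.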
In performance control experiments, the performance of the present invention on the REVEAL dataset versus other control protocols is shown in table 1:
TABLE 1 Performance test results of different protocols on REVEAL data set
Analysis of the data in the table shows that the method of the present invention achieves an accuracy and precision of 90.48% and 77.22%, respectively. Compared with the Reveal scheme, the method improves accuracy, precision and F1 score by 8.71%, 45.67% and 21.71% respectively. However, the method of the present invention is 7.47% lower than Reveal in terms of recall, because the ratio of vulnerable to non-vulnerable functions in the REVEAL dataset is unbalanced (vulnerable functions account for only 9% of all functions), and the uneven number of test categories together with considerable data noise leads to poorer recall performance of the method. Compared with Devign, the accuracy is improved by 2.99%, the precision by 45.67%, the recall by 17.02%, and the F1 score by 29.42%. Our method also has significant advantages over the other two methods. In summary, our approach provides significant improvements over the other state-of-the-art approaches on the four widely used indicators.
Further, this experiment compared the present invention with the control schemes on the SARD dataset; the experimental data obtained are shown in Table 2:
TABLE 2 Performance test results of different schemes on the SARD dataset
Analysis of the data in Table 2 shows that the scheme of the present invention also has significant advantages on the other dataset. In summary, the method of the present invention provides a significant improvement over the other state-of-the-art methods on the four widely used criteria. The method of the present invention is superior to the token-based method (Russell et al.) and the slice+token-based methods in almost all four metrics. This is because graph-based models learn the semantic dependencies between nodes through the various graphs, and the graphs largely preserve the semantic and syntactic information of the code, which enables graph-based models to make accurate predictions. Token- and slice-based methods, however, do not fully preserve the semantic information of the program, which may result in the model failing to make correct predictions. Thus, the graph-based approach is significantly better than the slice- and token-based approaches.
3. Ablation experiments
The technical scheme provided by the invention comprises two core elements: feature extraction and feature fusion on the CPG, and learning and training of the extracted fused feature vectors with a CNN model. To investigate the contribution of each to the outstanding performance of the final scheme, the technicians designed the following ablation experiments.
3.1 Different feature extraction modes
This experiment processes the CPG with two methods and compares them, to explore the effect of the tree-view + flow-view CPG processing mode on the performance of the scheme employed by the present invention.
First, in the experiments of the present invention, the source code was normalized following the Devign method and converted into a CPG using the Joern tool. The CPG is then split to obtain the required abstract syntax tree, control flow graph and program dependency graph. The abstract syntax tree is enhanced into the enhanced abstract syntax tree using the corresponding rules and serves as the tree view of the invention; the invention uses the pre-trained GraphCodeBERT for its feature extraction and vector embedding. The control flow graph and the program dependency graph form the flow view, which the invention embeds with CodeBERT. Finally, the tree view and the flow view are vector-fused using the vector-fusion method and fed into the CNN for training and prediction.
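The fusion step described above can be sketched as follows. The random tensors stand in for GraphCodeBERT/CodeBERT per-node embeddings, and the per-view mean pooling and concatenation follow the 1×768 + 2×768 = 3×768 layout stated in the claims:

```python
import torch

torch.manual_seed(0)

# Random stand-ins for the per-node embeddings produced by GraphCodeBERT
# (tree view) and CodeBERT (flow view); 768 is the models' hidden size.
east_tokens = torch.randn(57, 768)  # enhanced-AST node embeddings
cfg_tokens = torch.randn(31, 768)   # control-flow-graph node embeddings
pdg_tokens = torch.randn(42, 768)   # program-dependency-graph node embeddings

# Average-pool each view over its nodes, then concatenate the views.
tree_view = east_tokens.mean(dim=0, keepdim=True)                          # 1 x 768
flow_view = torch.stack([cfg_tokens.mean(dim=0), pdg_tokens.mean(dim=0)])  # 2 x 768
fused = torch.cat([tree_view, flow_view], dim=0)                           # 3 x 768
print(fused.shape)  # torch.Size([3, 768])
```

Each row of the fused vector can then be treated as one input channel of the CNN, matching the three-channel input described in claim 8.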
In the second experiment, a method that does not split the CPG was used. After the code property graph of the source code is obtained with the Joern tool, CodeBERT is used directly to extract features from and embed the whole CPG, without extracting the tree and flow views. The resulting source code feature vectors are then fed into a CNN with the same parameter configuration for training and prediction.
In this experiment, the final performance test results of the two schemes are shown in fig. 5. As can be seen from the data in fig. 5, the method of the present invention is superior to the conventional CPG processing method in terms of efficiency and accuracy over the first 100 rounds of training and prediction on the SARD dataset. This shows that the invention's method of extracting feature information from the CPG is superior to the traditional method in efficiency and precision.
The invention achieves this advantage because the CPG contains a large amount of code semantic and syntactic information. If the CPG is simply vector-embedded and sent to the CNN for feature learning and training, the model has difficulty extracting features related to vulnerable functions from the large amount of irrelevant and complex information, resulting in low model efficiency and accuracy. The scheme of the invention instead separately extracts the tree view and the flow view from the CPG and inputs them into the model after vector embedding, which amounts to a simple classification of the semantic and syntactic features of the code before they enter the model; the model can therefore extract the relevant features more easily, improving the efficiency and accuracy of vulnerability detection.
3.2 Different neural network models
This experiment studies how, under the same method, different training models affect the vulnerability detection performance of the final scheme. Specifically, it investigates two widely used classifier models: the Convolutional Neural Network (CNN) and the Multi-Layer Perceptron (MLP). For the CNN model, the experiment used a model with 10 convolutional filter layers and 128 hidden layers. For the MLP model, the experiment used a network with 3 fully connected (i.e., linear) layers and 2 ReLU activation layers, with a final sigmoid activation layer outputting the binary classification result.
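The two classifiers compared above can be sketched as follows. The quoted layer counts are kept (10 convolutional filter layers with "128 hidden layers", read here as 128 hidden channels, which is an assumption; an MLP with 3 linear and 2 ReLU layers plus a sigmoid output), but kernel sizes and channel widths are illustrative:

```python
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    """10 convolutional filter layers, 128 hidden channels, ReLU, binary head."""
    def __init__(self, channels=3, hidden=128):
        super().__init__()
        layers, in_ch = [], channels
        for _ in range(10):  # 10 convolutional filter layers
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3, padding=1),
                       nn.ReLU()]
            in_ch = hidden
        self.convs = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(hidden, 2))

    def forward(self, x):  # x: (batch, 3, 768) fused feature vectors
        return self.head(self.convs(x))

class MLPClassifier(nn.Module):
    """3 fully connected layers, 2 ReLU layers, sigmoid binary output."""
    def __init__(self, dim=3 * 768, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, 3, 768), flattened for the MLP
        return self.net(x.flatten(1))

x = torch.randn(4, 3, 768)
print(CNNClassifier()(x).shape, MLPClassifier()(x).shape)
# torch.Size([4, 2]) torch.Size([4, 1])
```

Note the structural difference discussed later in this section: the convolutional layers slide a shared kernel over the 768-dimensional feature axis (local connectivity, weight sharing), whereas every MLP neuron is connected to all inputs of the previous layer.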
This experiment uses the same procedure to process the source code of the SARD and QEMU datasets; after the feature vectors of the source code are obtained with the same feature extraction method, they are fed into the CNN classifier and the MLP classifier, respectively, for feature learning and training. The resulting performance of each scheme is shown in fig. 6.
As can be seen from the data in fig. 6, the results of training and prediction with the CNN model, which are more balanced across the evaluation indexes overall, are slightly better than those of the MLP model, which shows polarized evaluation indexes on the SARD dataset (i.e., higher accuracy but lower recall).
Analysis suggests that the above phenomenon occurs because the structure of the MLP model is relatively simple, so it easily falls into local minima and cannot find the globally optimal solution. Moreover, the fully connected layers of the MLP connect each neuron to all neurons of the previous layer, so the local features of the data cannot be exploited, whereas a CNN can extract local features through local connections and weight sharing, which an MLP cannot. In addition, the MLP's difficulty in processing high-dimensional data may be another reason. The experimental results indicate that, in practical applications, the CNN should be selected as the network model for the feature training and classification required by the invention.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. A multi-view fusion source code logic vulnerability detection method, characterized in that it comprises the following steps:
S1: encoding the software source code into a fused feature vector containing the feature information of a tree view and a flow view, the process comprising:
S11: generating a code property graph CPG from the software source code;
S12: extracting the corresponding edges and nodes from the CPG to construct an abstract syntax tree AST, a control flow graph CFG and a program dependency graph PDG respectively;
S13: according to preset rules, adding additional edges representing return, loop and jump operations between the nodes of the abstract syntax tree AST, thereby obtaining an enhanced syntax tree EAST;
S14: taking the enhanced syntax tree EAST as the tree view, and generating the corresponding tree-view features through a pre-trained code semantic encoding model;
S15: taking the control flow graph CFG and the program dependency graph PDG as the flow view, and generating the corresponding flow-view features through a pre-trained code semantic encoding model;
S16: average-pooling and then concatenating the feature vectors of the tree view and the flow view to obtain the required fused feature vector;
S2: obtaining a large number of fused feature vectors converted from software code, and manually adding label information indicating whether a vulnerability exists to form sample data, thereby constructing the required sample data set;
S3: selecting the cross-entropy loss as the loss function, dividing the sample data set into a training set and a test set, using them to train and test a CNN model, and saving the model parameters of the network model that meets the performance index requirements;
S4: using the saved network model as the vulnerability detection model, encoding the software source code to be identified into the corresponding fused feature vector, and inputting it into the vulnerability detection model, thereby realizing vulnerability detection.
2. The multi-view fusion source code logic vulnerability detection method according to claim 1, characterized in that: in step S11, the source-code static analysis tool Joern is used to convert the software source code into the corresponding code property graph CPG.
3. The multi-view fusion source code logic vulnerability detection method according to claim 2, characterized in that: in step S13, relative to the abstract syntax tree AST, the additional edges added in the enhanced syntax tree EAST comprise:
(1) an edge connecting the return node in the AST with the function-name node, denoted as the return edge RETURN TO;
(2) an edge chaining in sequence the nodes corresponding to all variables of the same line of code, denoted as the inter-layer cascade edge NEXT_TOKEN;
(3) a backward edge (reverse edge) added between the nodes of every original edge and every inter-layer cascade edge.
4. The multi-view fusion source code logic vulnerability detection method according to claim 3, characterized in that: in step S14, the feature vector of the tree view is generated by the pre-trained GraphCodeBERT; in step S15, the feature vector of the flow view is generated by the pre-trained CodeBERT.
5. The multi-view fusion source code logic vulnerability detection method according to claim 4, characterized in that: in step S15, the features of the CFG and the PDG are first extracted separately by CodeBERT, and the two sets of features are then concatenated to form the feature vector of the flow view.
6. The multi-view fusion source code logic vulnerability detection method according to claim 5, characterized in that: in step S16, the average-pooled feature vector of the tree view is 1*768-dimensional and that of the flow view is 2*768-dimensional, so the fused feature vector is 3*768-dimensional.
7. The multi-view fusion source code logic vulnerability detection method according to claim 6, characterized in that: in step S3, the CNN model comprises 10 convolutional filter layers and 128 hidden layers, and uses ReLU as the activation function.
8. The multi-view fusion source code logic vulnerability detection method according to claim 7, characterized in that: in step S4, the fused feature vector is divided into three channels that are simultaneously input into the vulnerability detection model, and the vulnerability detection model outputs a prediction of whether the software source code corresponding to the fused feature vector contains a vulnerability.
9. A multi-view fusion source code logic vulnerability detection system, characterized in that it is designed on the basis of the multi-view fusion source code logic vulnerability detection method according to any one of claims 1-8, and comprises:
a code acquisition unit, which acquires the source code of the software to be tested;
a CPG generation unit, which uses a preset static analysis tool to generate the corresponding code property graph from the input source code;
a tree-view generation unit, which extracts the corresponding nodes and edges from the code property graph to construct an abstract syntax tree AST, and then adds three types of additional edges to the AST; the resulting enhanced syntax tree EAST serves as the required tree view;
a flow-view generation unit, which extracts the corresponding nodes and edges from the code property graph; the constructed control flow graph CFG and program dependency graph PDG together serve as the required flow view;
a fusion feature generation unit, which uses a pre-trained code semantic encoding model to generate the feature vectors corresponding to the tree view and the flow view, average-pools the two and then concatenates them, thereby obtaining the required fused feature vector;
a vulnerability detection model, which is obtained by training a CNN model with the multi-view fusion source code logic vulnerability detection method according to any one of claims 1-8; the vulnerability detection model generates and outputs, from the input fused feature vector, a prediction of whether the corresponding software code contains a vulnerability.
10. A multi-view fusion source code logic vulnerability detection device, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that: when the processor executes the computer program, the source code logic vulnerability detection system according to claim 9 is created, thereby detecting whether a vulnerability exists in the input software source code.
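The three edge types of claim 3 can be sketched on a toy AST as follows; the node names, edge labels and edge-list representation are invented for illustration:

```python
def build_east(ast_edges, func_node, return_nodes, stmt_tokens):
    """Augment a toy AST (list of (src, dst) node-id pairs) into an EAST:
    (1) RETURN_TO edges from each return node to the function-name node,
    (2) NEXT_TOKEN edges chaining the variable nodes of one statement,
    (3) a reverse edge for every original and NEXT_TOKEN edge."""
    east = [(s, d, "AST") for s, d in ast_edges]
    for r in return_nodes:                      # (1) return edges
        east.append((r, func_node, "RETURN_TO"))
    for tokens in stmt_tokens:                  # (2) chain same-line tokens
        for a, b in zip(tokens, tokens[1:]):
            east.append((a, b, "NEXT_TOKEN"))
    reverses = [(d, s, "REVERSE")               # (3) backward edges
                for s, d, kind in east if kind in ("AST", "NEXT_TOKEN")]
    return east + reverses

edges = build_east(
    ast_edges=[("func", "stmt"), ("stmt", "x"), ("stmt", "y"), ("func", "ret")],
    func_node="func",
    return_nodes=["ret"],
    stmt_tokens=[["x", "y"]],  # variables appearing on one line of code
)
print(len(edges))  # 4 AST + 1 RETURN_TO + 1 NEXT_TOKEN + 5 reverse = 11
```

Per claim 3, reverse edges are added only for the original and inter-layer cascade edges, not for the RETURN TO edges, which the filter in step (3) reflects.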
CN202411987591.6A 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device Active CN119670101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411987591.6A CN119670101B (en) 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411987591.6A CN119670101B (en) 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device

Publications (2)

Publication Number Publication Date
CN119670101A true CN119670101A (en) 2025-03-21
CN119670101B CN119670101B (en) 2025-09-05

Family

ID=94991718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411987591.6A Active CN119670101B (en) 2024-12-31 2024-12-31 Multi-view fusion source code logic vulnerability detection method, system and device

Country Status (1)

Country Link
CN (1) CN119670101B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9201591B1 (en) * 2015-05-19 2015-12-01 Synack, Inc. Automated coverage monitoring of mobile applications
US20210056211A1 (en) * 2019-08-23 2021-02-25 Praetorian System and method for automatically detecting a security vulnerability in a source code using a machine learning model
CN117763560A (en) * 2023-12-08 2024-03-26 扬州大学 Interpretable vulnerability detection method and system based on double-view causal reasoning
CN118761058A (en) * 2024-07-02 2024-10-11 四川大学 A source code vulnerability classification detection method based on multi-feature fusion and self-attention encoder neural network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YALONG YU ET AL.: "A Vulnerability Detection Framework Based on Graph Decomposition Fusion and Augmented Abstract Syntax Tree", BDICN '25: PROCEEDINGS OF THE 2025 4TH INTERNATIONAL CONFERENCE ON BIG DATA, INFORMATION AND COMPUTER NETWORK, 27 May 2025 (2025-05-27), pages 734 *
CHEN Zhaoxuan; ZOU Deqing; LI Zhen; JIN Hai: "Intelligent Vulnerability Detection System Based on Abstract Syntax Tree", Journal of Cyber Security, no. 04, 15 July 2020 (2020-07-15) *

Also Published As

Publication number Publication date
CN119670101B (en) 2025-09-05

Similar Documents

Publication Publication Date Title
CN110532353B (en) Text entity matching method, system and device based on deep learning
Meng et al. [Retracted] A Deep Learning Approach for a Source Code Detection Model Using Self‐Attention
Ferreira et al. Software engineering meets deep learning: a mapping study
CN112487154B (en) Intelligent search method based on natural language
CN117688560A (en) Semantic analysis-oriented intelligent detection method for malicious software
CN115146279A (en) Program vulnerability detection method, terminal device and storage medium
CN118860480B (en) A code defect detection and repair method and device based on large model
CN115934883A (en) Entity relation joint extraction method based on semantic enhancement and multi-feature fusion
CN116663539A (en) Chinese entity and relationship joint extraction method and system based on RoBERTa and pointer network
CN119272275A (en) Source code vulnerability detection method and system based on adaptive graph neural network
CN118886019A (en) A blockchain smart contract vulnerability detection method based on multimodal deep learning
CN117312559A (en) Method and system for extracting aspect-level emotion four-tuple based on tree structure information perception
CN116305119A (en) APT malware classification method and device based on predictive guidance prototype
He et al. VulTR: Software vulnerability detection model based on multi-layer key feature enhancement
CN119783799A (en) A multimodal knowledge graph completion method based on dynamic prompt learning and multi-granularity aggregation
Lian et al. A universal and efficient multi-modal smart contract vulnerability detection framework for big data
Wang et al. Multi-type source code defect detection based on textcnn
CN119670101B (en) Multi-view fusion source code logic vulnerability detection method, system and device
CN116561256B (en) Aspect-level sentiment quadruple extraction method and system based on deep learning
Parisi et al. Making the most of scarce input data in deep learning-based source code classification for heterogeneous device mapping
CN119150302B (en) A method and system for intelligent mining of software security vulnerabilities based on large language model
CN120277682B (en) Intelligent contract vulnerability detection method based on multi-mode selection state space fusion
Ning et al. Collaborative analysis on code structure and semantics
US20230169304A1 (en) Method of extracting information set based on parallel decoding and computing system for performing the same
Wang et al. Python open-source code traceability model based on graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant