
CN119576363A - Adaptive code annotation updating method, system, terminal and storage medium - Google Patents


Info

Publication number
CN119576363A
CN119576363A
Authority
CN
China
Prior art keywords
code
annotation
abstract syntax
constructing
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411503013.0A
Other languages
Chinese (zh)
Inventor
熊壮
王珂
于福勇
张再胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Intelligent Technology Co Ltd
Inspur Intelligent Technology Wuhan Co Ltd
Original Assignee
Inspur Intelligent Technology Co Ltd
Inspur Intelligent Technology Wuhan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Intelligent Technology Co Ltd, Inspur Intelligent Technology Wuhan Co Ltd filed Critical Inspur Intelligent Technology Co Ltd
Priority to CN202411503013.0A
Publication of CN119576363A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • G06F8/65Updates
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/42Syntactic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/73Program documentation

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to the technical field of software development, and in particular to an adaptive code annotation updating method, system, terminal and storage medium. The method comprises: confirming that code updates exist, extracting the updated code set, and constructing an abstract syntax tree for the code set; performing semantic analysis on the abstract syntax tree with a deep learning model to generate semantic vectors, and constructing a program dependency graph based on the semantic vectors; analyzing the dependencies between code elements in the program dependency graph with a graph neural network; and generating code annotations based on the semantic vectors and the dependencies between code elements with a natural language generation model. The invention ensures the timeliness and accuracy of code documentation, greatly improves the readability and maintainability of code, and reduces maintenance cost.

Description

Adaptive code annotation updating method, system, terminal and storage medium
Technical Field
The invention belongs to the technical field of software development, and particularly relates to an adaptive code annotation updating method, system, terminal and storage medium.
Background
In the complex and lengthy software development process, code annotations play a significant role: they are not only a valuable key for developers to understand code logic and grasp program architecture, but also a powerful aid for subsequent maintainers to locate problems quickly and accurately. Clear and detailed code annotations greatly promote communication and collaboration within teams and ensure the effective transfer of project knowledge. However, as software projects continuously advance, requirements change frequently, and code bases are continuously expanded and refactored, existing code annotations tend to become obsolete quickly, and it becomes difficult for them to accurately reflect the actual functions and underlying implementation logic of the current code.
When this happens, it not only weakens the value of annotations as a "code specification", but may also mislead maintainers, increasing the difficulty of debugging and fixing errors and thereby affecting the stability and reliability of the whole software product. Worse still, manually updating these annotations is a time-consuming and laborious task that is easily overlooked or deprioritized during intense development cycles, leading to inconsistencies between the annotations and the actual state of the code, a situation known as "annotation debt". Such debt not only reduces development efficiency, but may also introduce new bugs due to erroneous annotations, severely compromising code quality and user experience.
Therefore, how to ensure the timeliness and accuracy of code annotations in a rapidly iterating development environment has become an important problem to be solved urgently. Exploring automated annotation generation and update tools, or establishing stricter annotation maintenance and auditing mechanisms, may be an effective way to alleviate this dilemma.
Disclosure of Invention
In order to solve the above-mentioned shortcomings of the prior art, the present invention provides a method, a system, a terminal and a storage medium for updating adaptive code annotation, so as to solve the above-mentioned technical problems.
In a first aspect, the present invention provides an adaptive code annotation update method, comprising:
confirming that code updates exist, extracting the updated code set, and constructing an abstract syntax tree for the code set;
carrying out semantic analysis on the abstract syntax tree by using a deep learning model, generating a semantic vector, and constructing a program dependency graph based on the semantic vector;
Analyzing the dependency relationship between the code elements in the program dependency graph by using a graph neural network;
and generating code annotation based on the semantic vector and the dependency relationship between the code elements by using a natural language generation model.
In an alternative embodiment, confirming that code updates exist, extracting the updated code set, and constructing an abstract syntax tree for the code set includes:
Monitoring all commit actions of a code repository by using a custom script; after the custom script is triggered by a commit action, calling a script interface to acquire all change records in the current round of monitoring, wherein the change records comprise lists of added, modified or deleted files; and extracting detailed code information from the corresponding files based on the file lists, wherein the detailed code information comprises the changed code line numbers;
and constructing an abstract syntax tree based on the code file corresponding to the change record and the code detailed information by using a parser.
In an alternative embodiment, constructing, with a parser, an abstract syntax tree based on the code file corresponding to the change record and the code details, includes:
Generating a new code version based on the code file corresponding to the change record and the detailed code information, and preprocessing the new code version, wherein the preprocessing comprises removing dead code, expanding macros and inlining conditional expressions;
converting the preprocessed new code version into an abstract syntax tree by using a parser based on the language rules of the code, and converting code structures such as function calls, control flow statements, loops and conditional judgments into corresponding node types in the abstract syntax tree, wherein the connection relations among the nodes identify the logical and sequential relations of the code;
And removing the repeated nodes in the abstract syntax tree.
In an alternative embodiment, the deep learning model is a bimodal pre-training model having 12 layers, each layer having 12 self-attention heads, each self-attention head having a dimension of 64. The hidden dimension is 768 and the dimension of the feed-forward layer is 3072.
In an alternative embodiment, the pre-training method of the deep learning model includes:
Setting the input as a concatenation of code sequences, natural language sequences, and special tokens, and the output as a context vector representation and a summary sequence representation for each token of code and natural language;
acquiring unimodal data and bimodal data from a database, and performing data filtering on the data set by using a preset rule to obtain training data;
Given a data point of an NL-PL pair x = (w, c) as input, where NL denotes natural language and PL denotes programming language (code), w is the token sequence of the NL and c is the token sequence of the PL, selecting a random set of positions in the NL and PL to mask, then replacing the tokens at the selected positions with a special [MASK] token, where 15% of the tokens in x are replaced;
First training two data generators with unimodal natural language data and code data respectively, so as to generate reasonable alternatives for the randomly masked positions; then having a discriminator learn a fused representation of natural language and code to detect whether a word is the original one, wherein the discriminator is a binary classifier: if the generator produces the correct token, the label of that token is true, otherwise it is false.
In an alternative embodiment, analyzing dependencies between code elements in the program dependency graph using a graph neural network includes:
Capturing the topological structure and the features of nodes in a program dependency graph by using a graph neural network, wherein the nodes are basic elements of the code and comprise variables, function calls or control flow statements;
identifying key parts based on the topology and features of the nodes, the key parts including loop bodies, conditional branches and function calls.
In a second aspect, the present invention provides an adaptive code annotation update system comprising:
The code analysis module is used for confirming that code updates exist, extracting the updated code set, and constructing an abstract syntax tree for the code set;
the relation analysis module is used for carrying out semantic analysis on the abstract syntax tree by utilizing a deep learning model, generating a semantic vector and constructing a program dependency graph based on the semantic vector;
The relation learning module is used for analyzing the dependency relation among the code elements in the program dependency graph by utilizing the graph neural network;
And the annotation generation module is used for generating a code annotation based on the semantic vector and the dependency relationship between the code elements by using a natural language generation model.
In an alternative embodiment, the code parsing module includes:
The update monitoring unit is used for monitoring all commit actions of the code repository by using a custom script; after the custom script is triggered by a commit action, a script interface is called to acquire all change records in the current round of monitoring, wherein the change records comprise lists of added, modified or deleted files, and detailed code information is extracted from the corresponding files based on the file lists, wherein the detailed code information comprises the changed code line numbers;
And the syntax description unit is used for constructing an abstract syntax tree based on the code file corresponding to the change record and the detailed code information by using a parser.
In a third aspect, a terminal is provided, including:
a memory for storing an adaptive code annotation update program;
A processor for implementing the steps of the adaptive code annotation update method as provided in the first aspect when executing the adaptive code annotation update program.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon an adaptive code annotation update program which, when executed by a processor, implements the steps of the adaptive code annotation update method as provided in the first aspect.
The adaptive code annotation updating method, system, terminal and storage medium of the present invention have the following beneficial effects. By performing semantic analysis on the abstract syntax tree with a deep learning model, the modified code portion and its context can be accurately identified. A program dependency graph is constructed on the basis of the semantic analysis, and by mining the program dependency graph with a graph neural network, the dependencies between code elements can be further analyzed, clearly revealing the structure and logical relations of the code and attending both to the overall architecture of the code and to the relations between its parts. As a result, when annotations are subsequently generated, annotations are generated for the modified code lines and updated for other code affected by the modification, improving the overall accuracy of the code annotations. Automatically updating annotations ensures the timeliness and accuracy of the code documentation, greatly improves the readability and maintainability of the code, and reduces maintenance cost.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention.
FIG. 2 is a schematic block diagram of a system of one embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The adaptive code annotation updating method provided by the embodiment of the invention is executed by a computer device, and correspondingly, the adaptive code annotation updating system runs in the computer device.
FIG. 1 is a schematic flow chart of a method of one embodiment of the invention. Wherein the execution subject of fig. 1 may be an adaptive code annotation update system. The order of the steps in the flow chart may be changed and some may be omitted according to different needs.
As shown in fig. 1, the method includes:
step 110, confirming that code updates exist, extracting the updated code set, and constructing an abstract syntax tree for the code set;
step 120, performing semantic analysis on the abstract syntax tree by using a deep learning model, generating a semantic vector, and constructing a program dependency graph based on the semantic vector;
Step 130, analyzing the dependency relationship between the code elements in the program dependency graph by using a graph neural network;
Step 140, generating code annotations based on the semantic vectors and the dependencies between the code elements by using a natural language generation model.
In order to facilitate understanding of the present invention, the following describes the method for updating the adaptive code annotation provided by the present invention in conjunction with the process of updating the adaptive code annotation in the embodiment according to the principles of the method for updating the adaptive code annotation of the present invention.
Specifically, the adaptive code annotation updating method comprises the following steps:
s1, confirming that codes are updated, extracting an updated code set, and constructing an abstract syntax tree for the code combination.
S101, monitoring all commit actions of a code repository by using a custom script; after the custom script is triggered by a commit action, calling a script interface to acquire all change records in the current round of monitoring, wherein the change records comprise lists of added, modified or deleted files; and extracting detailed code information from the corresponding files based on the file lists, wherein the detailed code information comprises the changed code line numbers.
Git hooks (e.g., post-commit or pre-push hooks) are used to monitor all commit actions in the code repository. Each time a developer submits an update to the code base, the corresponding hook triggers an automation script. The script calls the Git API to obtain all change records since the last commit, including a list of files that were added, modified, or deleted. For each changed file, the script further executes a `git diff` operation and extracts specific code change details, such as the changed line numbers and contents. This information is then packaged into event objects and passed to subsequent processing modules for further analysis and annotation updating. This process ensures that code changes are captured quickly and accurately, laying the foundation for efficient annotation updating.
After the code change detection step, the system of the present invention performs a detailed extraction operation on the detected changes. This step involves parsing the change event captured by the Git hook and extracting the specific code change details from it. The system first determines the type of change, including the addition of a new file, the modification of an existing file, or a deletion. For each type of change, the system uses a Git command (e.g., `git diff`) to obtain the detailed change contents since the last commit, including the changed line numbers and the specific code lines added or deleted. The extracted information is organized into a structured data format, such as JSON or XML, in which the changes to each file are recorded in detail. These structured data include not only the changes to the code itself, but also the metadata of the file, such as the file path and the author of the changes, providing complete context information for subsequent semantic analysis and graph structure analysis.
S102, constructing an abstract syntax tree by utilizing a parser based on the code file corresponding to the change record and the code detailed information.
A new code version is generated based on the code file corresponding to the change record and the detailed code information, and the new code version is preprocessed, wherein the preprocessing comprises removing dead code, expanding macros and inlining conditional expressions. The preprocessed new code version is then converted into an abstract syntax tree by a parser based on the language rules of the code: code structures such as function calls, control flow statements, loops and conditional judgments are converted into corresponding node types in the abstract syntax tree, with the connection relations among the nodes identifying the logical and sequential relations of the code. Finally, duplicate nodes in the abstract syntax tree are removed.
For each programming language, the system uses a specific parser to process the code. For example, for Java code the system may use the JavaParser library, and for Python code the built-in ast module may be employed. These parsers understand the grammar rules of the corresponding language and convert source code into an AST. In the AST, code structures such as function calls, control flow statements, loops and conditional judgments are all converted into specific node types, and the connection relations between nodes represent the logical and sequential relations of the code.
In the process of constructing the AST, the system also needs to handle various complex programming patterns and language features, such as template metaprogramming, anonymous functions and decorators. To improve AST accuracy and readability, the system may perform additional processing steps, including removing dead code, expanding macros and inlining conditional expressions. In addition, the system may optimize the AST to reduce its size and increase processing efficiency, for example by merging insignificant nodes or removing duplicate structures.
In the process of constructing the AST, the system also records important elements in the code, such as variable definitions, function signatures, and class and interface declarations, which are important for subsequent semantic analysis and annotation generation. After the AST construction is completed, the system uses this structured information to perform deep code understanding and annotation updating, ensuring that the annotations are consistent with the actual function and structure of the code.
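As a minimal illustration of this step, Python's built-in `ast` module (mentioned above for Python code) can parse a changed file and record important elements such as function signatures and call sites. The selection of elements here is a simplified sketch:

```python
import ast

source = '''
def add(a, b):
    return a + b

result = add(1, 2)
'''

tree = ast.parse(source)

# Record important elements for later semantic analysis:
# function signatures and the names of called functions.
functions, calls = [], []
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        functions.append((node.name, [a.arg for a in node.args.args]))
    elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        calls.append(node.func.id)

print(functions)  # [('add', ['a', 'b'])]
print(calls)      # ['add']
```

A production system would collect many more element kinds (classes, interfaces, variable definitions), but the traversal pattern is the same.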
S2, carrying out semantic analysis on the abstract syntax tree by using a deep learning model, generating a semantic vector, and constructing a program dependency graph based on the semantic vector.
The deep learning model is a bimodal pre-training model, such as the CodeBERT model. The bimodal pre-training model has 12 layers, each layer having 12 self-attention heads, each with a dimension of 64. The hidden dimension is 768 and the dimension of the feed-forward layer is 3072.
The pre-training method of the deep learning model comprises the following steps:
Setting the input as a concatenation of code sequences, natural language sequences, and special tokens, and the output as a context vector representation and a summary sequence representation for each token of code and natural language;
acquiring unimodal data and bimodal data from a database, and performing data filtering on the data set by using a preset rule to obtain training data;
Given a data point of an NL-PL pair x = (w, c) as input, where NL denotes natural language and PL denotes programming language (code), w is the token sequence of the NL and c is the token sequence of the PL, selecting a random set of positions in the NL and PL to mask, then replacing the tokens at the selected positions with a special [MASK] token, where 15% of the tokens in x are replaced;
First training two data generators with unimodal natural language data and code data respectively, so as to generate reasonable alternatives for the randomly masked positions; then having a discriminator learn a fused representation of natural language and code to detect whether a word is the original one, wherein the discriminator is a binary classifier: if the generator produces the correct token, the label of that token is true, otherwise it is false.
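The masking step can be illustrated with a toy sketch. The token lists, the `[MASK]` marker and the fixed seed below are illustrative; a real implementation operates on subword tokens from the model's tokenizer:

```python
import random

def mask_tokens(nl_tokens, pl_tokens, mask_rate=0.15, seed=0):
    """Replace a random 15% of positions in the concatenated NL-PL
    sequence x = (w, c) with a [MASK] token (illustrative sketch)."""
    rng = random.Random(seed)
    x = list(nl_tokens) + list(pl_tokens)
    n_mask = max(1, round(mask_rate * len(x)))
    positions = rng.sample(range(len(x)), n_mask)
    masked = x[:]
    for p in positions:
        masked[p] = "[MASK]"
    return masked, sorted(positions)

nl = ["return", "the", "sum", "of", "two", "numbers"]
pl = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
masked, positions = mask_tokens(nl, pl)
```

The generator's task is then to propose plausible tokens for `positions`, and the discriminator labels each proposed token true or false against the original sequence.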
Semantic analysis is performed on the abstract syntax tree using the trained deep learning model to generate semantic vectors. By embedding the nodes of the AST, the model is able to generate a vector representation reflecting the code's logic and functionality. This process involves converting a code fragment into a series of numerical vectors that characterize its semantics; these vectors capture the structure, logic, and underlying business logic of the code. In addition, the system analyzes the control flow and data flow in the code to identify complex program behavior such as loops, conditional branches and function calls. In this way, semantic analysis provides an accurate semantic basis for the automatic updating of code annotations, ensuring that the annotations closely correspond to the actual functionality of the code.
In one specific example, the semantic analysis process is as follows:
Let us assume that we have a simple piece of Python code as follows:
def add(a, b):
    return a + b
The AST of this piece of code is processed using the CodeBERT model to generate a corresponding semantic vector.
This piece of code is first parsed into an AST using a compiler front-end tool (e.g., tree-sitter). The parsed AST may look as follows (in simplified form):
FunctionDefinition
├── Name: "add"
├── Parameters
│   ├── Name: "a"
│   └── Name: "b"
└── Body
    └── ReturnStatement
        └── BinaryExpression
            ├── Operator: "+"
            ├── Left: Variable
            │   └── Name: "a"
            └── Right: Variable
                └── Name: "b"
The AST is then converted into a serialization format suitable for input to the CodeBERT model. In this example, AST nodes and edges are converted into a series of tokens or node pairs. For example, one possible serialization approach is to convert the AST into a list of strings containing the node types and the relations between nodes, as follows:
["FunctionDefinition", "Name:add", "Parameter:a", "Parameter:b", "ReturnStatement", "BinaryExpression:+", "Left:Variable:a", "Right:Variable:b"].
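A flat token list of this kind can be produced from Python's `ast` module as follows. The mapping of node types to strings is an illustrative sketch and differs slightly from the listing above (e.g., it omits the `Left:`/`Right:` role prefixes):

```python
import ast

def serialize(node):
    """Flatten an AST into type/name tokens roughly matching the
    serialization shown above (the scheme is illustrative)."""
    tokens = []
    if isinstance(node, ast.FunctionDef):
        tokens.append("FunctionDefinition")
        tokens.append(f"Name:{node.name}")
        for arg in node.args.args:
            tokens.append(f"Parameter:{arg.arg}")
    elif isinstance(node, ast.Return):
        tokens.append("ReturnStatement")
    elif isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        tokens.append("BinaryExpression:+")
    elif isinstance(node, ast.Name):
        tokens.append(f"Variable:{node.id}")
    # Recurse in source order over child AST nodes.
    for child in ast.iter_child_nodes(node):
        tokens.extend(serialize(child))
    return tokens

tree = ast.parse("def add(a, b):\n    return a + b")
print(serialize(tree))
```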
The serialized AST is fed into the CodeBERT model, which receives these inputs and processes them to generate semantic vectors.
The CodeBERT model internally processes the incoming AST, capturing the semantic information of the code using its pre-trained weights and architecture. After processing the input, the model outputs one or more semantic vectors. These vectors are typically represented in a high-dimensional space, capable of capturing semantic features of the code.
In this example, assume that the CodeBERT model outputs a 768-dimensional semantic vector (since the CodeBERT model typically has a hidden layer size of 768). This vector can be represented as an array of the following form:
[0.123, -0.456, 0.789, ..., -0.234]  # a 768-dimensional vector; only a few elements are shown.
After the semantic vectors are generated, the dependencies in the source code need to be extracted. These dependencies may include data dependencies, control dependencies, and so on. Data dependencies refer to assignment-and-use relationships between variables, while control dependencies refer to execution-order relationships between statements. There are many methods for extracting dependencies, such as static analysis, dynamic analysis and data flow analysis; these methods identify dependencies from the syntactic structure and semantic information of the source code.
After the dependencies are extracted, the construction of the program dependency graph can begin. A program dependency graph is a directed graph in which nodes represent code fragments (e.g., statements, expressions, etc.) and edges represent dependencies.
When building the program dependency graph, nodes and edges are added to the graph according to the extracted dependencies. At the same time, the semantic vectors can be used to enhance the representational power of the graph. For example, semantic vectors may serve as node attributes, or similarities between nodes may be computed from the semantic vectors, further enriching the structure and information of the graph.
It will be appreciated that a program dependency graph (PDG) is made up of nodes and edges: the nodes represent basic elements of the code, such as variables, function calls or control flow statements, and the edges represent the dependencies between these elements, including data dependencies, control dependencies and call dependencies.
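A minimal sketch of such a PDG, with typed edges for data and control dependencies, can be built with a plain adjacency structure. The class and the example statements are illustrative:

```python
# Build a tiny program dependency graph for:
#   x = 1
#   if x > 0:
#       y = x + 1
# Nodes are statements; edges carry a dependency type.
class PDG:
    def __init__(self):
        self.nodes = {}   # node_id -> attributes (e.g., code text, semantic vector)
        self.edges = []   # (src, dst, dep_type)

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, src, dst, dep_type):
        self.edges.append((src, dst, dep_type))

    def dependencies_of(self, node_id):
        """Incoming dependencies of a node as (source, type) pairs."""
        return [(s, t) for s, d, t in self.edges if d == node_id]

pdg = PDG()
pdg.add_node("s1", code="x = 1")
pdg.add_node("s2", code="if x > 0:")
pdg.add_node("s3", code="y = x + 1")
pdg.add_edge("s1", "s2", "data")     # x defined in s1, used in s2
pdg.add_edge("s1", "s3", "data")     # x defined in s1, used in s3
pdg.add_edge("s2", "s3", "control")  # s3 executes only if s2's condition holds
```

Semantic vectors from the previous step would be attached as node attributes, and edge weights could encode the strength of each dependency.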
S3, analyzing the dependency relationship among the code elements in the program dependency graph by using the graph neural network.
The PDG is analyzed using a graph neural network (GNN). A GNN is a deep learning model dedicated to processing graph-structured data, capable of capturing node topology and node features. In the present invention, GNNs are used to identify key parts of the code, such as loop bodies, conditional branches and function calls, and how they are connected to each other through dependencies. In addition, the GNN can identify the potential effects of the altered code on the existing code structure, including newly introduced cyclic dependencies, broken modular designs, or the effects of the alterations on the code's control flow.
To analyze the PDG, an appropriate GNN model is selected. In this example, a graph convolutional network (GCN) is chosen as the GNN model. The GCN is a commonly used graph neural network model that can effectively capture node features and topology information in a graph structure.
Some preprocessing work is required on the PDG before it is input to the GNN model. This includes:
node feature extraction for each node in the PDG we need to extract its corresponding feature. These features may include the type of node (e.g., statement, expression, etc.), the semantic vector of the node (generated by CodeBERT model), and so on.
Edge weight setting: for each edge in the PDG, a weight can be set according to the strength or importance of the dependency. For example, data dependencies and control dependencies may be given different weights.
Next, the preprocessed PDG is input into the GCN model for training. In the training process, tasks such as node classification, graph classification or link prediction are used for supervising the training of the model. These tasks may help the model learn node characteristics and topology information in the PDG.
After training is completed, the GNN model is used to identify key parts in the code. This may be achieved by classifying nodes in the PDG. For example, key parts such as a loop body, a conditional branch, a function call and the like are respectively marked as different categories, and nodes in the PDG are classified by using a GNN model.
In addition to identifying key parts, the GNN model can also be used to analyze the potential impact of the altered code on the existing code structure. This may be achieved by performing operations such as sub-graph matching or graph embedding on the PDG. For example, the GNN model can be used to compute similarities between the altered code and the existing code, thereby identifying newly introduced cyclic dependencies, broken modular designs, or the impact of the alterations on the code's control flow.
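The propagation performed by one graph-convolution layer can be sketched in plain Python. This sketch uses mean aggregation over each node and its neighbours, a simplification of the symmetric normalization in the standard GCN, and a fixed weight matrix in place of learned parameters:

```python
def gcn_layer(adj, features, weight):
    """One graph-convolution step over node features: aggregate each node
    with its neighbours (self-loops added, row-normalised), then apply a
    linear transform (a simplified, illustrative GCN layer)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    # Row-normalise: average over each node's neighbourhood.
    agg = []
    for i in range(n):
        deg = sum(a_hat[i])
        agg.append([sum(a_hat[i][k] * features[k][f] for k in range(n)) / deg
                    for f in range(len(features[0]))])
    # Linear transform with the (here fixed, normally learned) weight matrix.
    return [[sum(row[f] * weight[f][o] for f in range(len(row)))
             for o in range(len(weight[0]))] for row in agg]

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]          # 3-node chain: 0 - 1 - 2
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 2-dim node features
weight = [[1.0, 0.0], [0.0, 1.0]]                # identity weight for clarity
out = gcn_layer(adj, features, weight)
```

In a real system the adjacency matrix comes from the PDG, the features are the CodeBERT semantic vectors, and stacking several such layers with learned weights yields node embeddings used for node classification (e.g., labelling loop bodies, conditional branches and function calls).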
S4, generating a code annotation based on the semantic vector and the dependency relationship between the code elements by using a natural language generation model.
After the code change has been understood in depth through semantic analysis and graph-structure analysis, the system constructs the annotation text with a natural language generation (NLG) model, using the code semantic vectors and PDG analysis results obtained in the previous steps.
The NLG model is responsible for converting the semantic representation of code into human-readable text. It is trained on a large number of code-annotation pairs so that it learns to generate appropriate annotations from the semantics of the code. When generating an annotation, the model takes into account information such as the function description, parameter list, return values, and possible error handling. In addition, it uses contextual information of the code, such as variable names and method names, to generate more descriptive annotations.
The system also provides a rules engine to assist annotation generation. The rules are based on programming experience and community best practices and guide the style and content of annotations. For example, the rules engine may suggest that an explanation of complex algorithm logic be included in the annotation, or that the usage conditions of particular parameters be specified.
The generated annotations are first checked by a quality assessment module that evaluates their clarity, accuracy, and completeness. If an annotation does not meet the predetermined quality criteria, it is sent back to the NLG model for iterative optimization or handed to the developer for manual editing. In this way, the system ensures that the updating of annotations is both automated and of high quality.
Finally, the updated annotations are integrated into the code base together with the changed code as a new commit. This integration can be performed automatically by a continuous integration (CI) tool, ensuring seamless interfacing between annotation updates and code changes, thereby improving the documentation level and maintainability of the entire code base.
In one specific example, the process of generating annotations is as follows:
S401, integrating code semantic vectors and PDG analysis results
After the code semantic vectors and the PDG analysis results are obtained, the system integrates them. By combining the semantic and structural information of the code, the system can more fully understand the context and potential impact of code changes.
S402, constructing annotation text by Natural Language Generation (NLG) model
Next, the system builds the annotation text using the natural language generation (NLG) model. The model has been trained on a large number of code-annotation pairs, so it can understand the semantics of the code and generate appropriate annotations. When generating an annotation, the model considers the following:
Functional description: accurately describe the function and purpose of the code.
Parameter list: list the function's parameters together with their types and usage.
Return value: describe the type and meaning of the function's return value.
Error handling: describe the error-handling logic and exception conditions in the code.
In addition, the model uses contextual information of the code (such as variable names and method names) to generate more descriptive annotations, improving their readability and comprehensibility.
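The four parts listed above can be assembled into a docstring-style annotation; as a non-limiting illustration, the following template renderer sketches the output format a learned NLG model would target. The function name `render_annotation` and the sample content are hypothetical, and a real system would generate the text from the code semantics rather than from hand-written arguments.

```python
def render_annotation(func_desc, params, returns, errors=None):
    """Assemble a docstring-style annotation from the four parts listed above."""
    lines = [func_desc, "", "Args:"]
    for name, (ptype, desc) in params.items():
        lines.append(f"    {name} ({ptype}): {desc}")
    lines += ["", f"Returns: {returns}"]
    if errors:
        lines += ["", f"Raises: {errors}"]
    return "\n".join(lines)

doc = render_annotation(
    "Compute the moving average of a numeric series.",
    {"window": ("int", "size of the averaging window")},
    "list of floats, one per input position",
    "ValueError if window is not positive",
)
```

In the described system, a trained model fills these slots from the code's semantic vectors and context; the template only fixes the surface form.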
S403, rule engine assisted annotation generation
To further improve annotation quality, the system also provides a rules engine to assist generation. These rules are based on programming experience and community best practices and guide the style and content of the annotation. For example, the rules engine may suggest including an explanation of complex algorithm logic, or specifying the usage conditions of particular parameters. By applying these rules, the system generates more standardized, accurate, and useful annotations.
S404, annotation quality assessment and optimization
The generated annotations are first checked by the quality assessment module, which evaluates their clarity, accuracy, and completeness and ensures that they meet the predetermined quality criteria. Annotations that fail are sent back to the NLG model for iterative optimization or handed to the developer for manual editing. In this way, the system ensures that the generated annotations are both automated and of high quality.
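A minimal stand-in for such a quality gate can be sketched with simple heuristics; the checks below (length, capitalization, placeholder text) are illustrative assumptions, and the module described here would apply richer clarity and accuracy criteria.

```python
def assess_annotation(annotation, min_words=5):
    """Heuristic quality check: returns a list of issues;
    an empty list means the annotation passes the gate."""
    issues = []
    words = annotation.split()
    if len(words) < min_words:
        issues.append("too short to be a complete description")
    if annotation and not annotation[0].isupper():
        issues.append("does not start with a capitalized sentence")
    if "TODO" in annotation or "FIXME" in annotation:
        issues.append("contains placeholder text")
    return issues

good = assess_annotation("Computes the weighted average of the input series.")
bad = assess_annotation("TODO fix")
# good passes (no issues); bad is flagged as too short and as placeholder text
```

Annotations with a non-empty issue list would be routed back to the NLG model or to the developer, as described above.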
S405, annotation integration and continuous integration
Finally, the quality-assessed and optimized annotations are integrated into the code base and submitted as a new commit together with the changed code. This integration may be automated through a continuous integration (CI) tool, ensuring that annotation updates and code changes stay seamlessly connected. In this way, the system improves the documentation level and maintainability of the entire code base, providing developers with clearer, more accurate, and more useful code annotations.
The specific implementation method of step S402 is as follows:
Collect code-annotation pair data: collect a large number of code files and corresponding annotations from open-source projects, code repositories, or software libraries. Ensure the collected data is diverse, covering different programming languages and different domains.
Data preprocessing: parse the code and extract key elements such as functions, classes, and variables together with their attributes (e.g., types and scopes). Clean the annotations, removing irrelevant information and retaining key content such as descriptions of code functionality, parameter descriptions, and return-value explanations. Align code and annotations so that each code segment has a corresponding annotation.
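For Python sources, the alignment step above can be sketched with the standard `ast` module, pairing each function with its docstring as a (code, annotation) training example. This is a simplified illustration and assumes docstrings stand in for annotations; the described system would also handle other languages and comment styles.

```python
import ast

def extract_pairs(source):
    """Pair each function in a Python file with its docstring annotation."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only aligned (code, annotation) pairs
                pairs.append((node.name, [a.arg for a in node.args.args], doc))
    return pairs

sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b
'''
pairs = extract_pairs(sample)
# pairs -> [('add', ['a', 'b'], 'Return the sum of a and b.')]
```

Functions without docstrings are dropped, which corresponds to discarding unaligned code segments during cleaning.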
Select an NLG model architecture: choose a suitable architecture, such as a Transformer-based sequence-to-sequence (Seq2Seq) model or a GPT-series model, according to the task requirements and data characteristics. Consider factors such as the model's generation capacity, interpretability, and computational efficiency.
Design model inputs and outputs: the input is a vector representation of the code fragment (obtainable through code embedding techniques) and possibly context information (e.g., variable names and method names); the output is the generated annotation text.
Define a loss function and evaluation metric: select cross-entropy as the loss function, and use the BLEU score as an evaluation metric for the difference between generated and reference annotations.
Design an optimization strategy, such as using the Adam optimizer and learning-rate decay, to improve the training effect of the model.
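The cross-entropy loss mentioned above can be made concrete for a single output position; the toy 4-token vocabulary and probabilities below are illustrative assumptions, not the model's actual distribution.

```python
import math

def cross_entropy(probs, target_idx):
    """Token-level cross-entropy: -log p(correct token)."""
    return -math.log(probs[target_idx])

# Model distribution over a 4-token vocabulary at one output position
probs = [0.1, 0.7, 0.1, 0.1]
loss_good = cross_entropy(probs, 1)  # correct token got p=0.7 -> low loss
loss_bad = cross_entropy(probs, 0)   # correct token got p=0.1 -> high loss
```

Training minimizes the average of this quantity over all annotation tokens, while BLEU is computed separately on full generated annotations for evaluation.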
Train the model on the preprocessed code-annotation pairs. During training, monitor metrics such as the loss value and generation quality, and adjust the training parameters in time.
Model evaluation and tuning: evaluate the model on a validation set, computing metrics such as the accuracy, recall, and F1 score of the generated annotations. Optimize the model according to the evaluation results, for example by adjusting the model architecture, adding training data, or improving the data preprocessing.
Input the code segment: feed the code segment to be annotated into the trained NLG model.
Generate the annotation: the model generates the corresponding annotation text from the input code segment and its context information.
Post-processing: post-process the generated annotation, for example by removing redundant information and adjusting sentence structure, to improve its readability and accuracy.
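The redundancy-removal step above can be sketched with simple text normalization; collapsing whitespace, dropping duplicated sentences, and enforcing a trailing period are illustrative choices, and a real post-processor might apply further rules.

```python
import re

def postprocess(annotation):
    """Light cleanup of a generated annotation: collapse whitespace,
    drop duplicated sentences, ensure a trailing period."""
    text = re.sub(r"\s+", " ", annotation).strip()
    seen, kept = set(), []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        key = sent.lower()
        if key and key not in seen:  # skip verbatim repeats
            seen.add(key)
            kept.append(sent)
    out = " ".join(kept)
    return out if out.endswith((".", "!", "?")) else out + "."

raw = "Returns the user id.  Returns the user id. Caches the   result"
clean = postprocess(raw)
# clean -> "Returns the user id. Caches the result."
```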
In some embodiments, the adaptive code annotation update system may comprise a plurality of functional modules consisting of computer program segments. The computer programs of these program segments may be stored in a memory of a computer device and executed by at least one processor to perform the adaptive code annotation update functions (see FIG. 1 for details).
In this embodiment, the adaptive code annotation update system may be divided into a plurality of functional modules according to the functions it performs, as shown in FIG. 2. The functional modules of system 200 may include a code parsing module 210, a relationship parsing module 220, a relationship learning module 230, and an annotation generation module 240. A module as referred to in the present invention is a series of computer program segments, stored in a memory, that can be executed by at least one processor to perform a fixed function. The functions of the respective modules are described in detail in the following embodiments.
The code parsing module is used for confirming that the code has been updated, extracting the updated code set, and constructing an abstract syntax tree for the code set;
the relation analysis module is used for carrying out semantic analysis on the abstract syntax tree by utilizing a deep learning model, generating a semantic vector and constructing a program dependency graph based on the semantic vector;
The relation learning module is used for analyzing the dependency relation among the code elements in the program dependency graph by utilizing the graph neural network;
And the annotation generation module is used for generating a code annotation based on the semantic vector and the dependency relationship between the code elements by using a natural language generation model.
Optionally, as an embodiment of the present invention, the code parsing module includes:
The update monitoring unit is used for monitoring all commit actions of the code repository with a custom script; after the script is triggered by a commit action, it calls a script interface to obtain all change records of the current monitoring round, where a change record includes a list of added, modified, or deleted files, and detailed code information, including the changed code line numbers, is extracted from the corresponding files based on the file list;
The syntax description unit is used for constructing an abstract syntax tree, with a parser, based on the code file corresponding to the change record and the detailed code information.
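The change record built by the update monitoring unit can be illustrated by parsing the output of `git diff --name-status` (one status letter and one path per line); the function name and the dictionary layout below are hypothetical, and the actual script interface may differ.

```python
def parse_name_status(diff_output):
    """Turn `git diff --name-status` text into a change record:
    lists of added (A), modified (M), and deleted (D) files."""
    record = {"added": [], "modified": [], "deleted": []}
    kinds = {"A": "added", "M": "modified", "D": "deleted"}
    for line in diff_output.strip().splitlines():
        status, path = line.split(maxsplit=1)
        if status[0] in kinds:
            record[kinds[status[0]]].append(path)
    return record

diff = "A\tsrc/new_module.py\nM\tsrc/core.py\nD\told/legacy.py"
record = parse_name_status(diff)
# record -> {'added': ['src/new_module.py'], 'modified': ['src/core.py'],
#            'deleted': ['old/legacy.py']}
```

The paths in each list would then be used to extract the detailed code information (including changed line numbers) from the corresponding files.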
Optionally, as an embodiment of the present invention, constructing an abstract syntax tree with a parser based on the code file corresponding to the change record and the detailed code information includes:
Generating new-version code based on the code file corresponding to the change record and the detailed code information, and preprocessing the new-version code, where the preprocessing includes removing dead code and resolving macros and inline conditional expressions;
converting the preprocessed new-version code into an abstract syntax tree with a parser based on the language rules of the code, and converting code structures such as function calls, control-flow statements, loops, and conditional judgments into corresponding node types in the abstract syntax tree, where the connection relations among the nodes identify the logical and sequential relations of the code;
and removing duplicate nodes from the abstract syntax tree.
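For the Python case, the mapping from code structures to AST node types can be sketched with the standard `ast` module; the label names and the choice of node types below are illustrative assumptions, and other languages would use their own parsers and node taxonomies.

```python
import ast

# Map the code structures named above to Python AST node types
STRUCTURES = {ast.Call: "function call", ast.If: "conditional",
              ast.For: "loop", ast.While: "loop", ast.Return: "control flow"}

def structure_counts(source):
    """Parse code into an AST and count the structural node types;
    each occurrence in the tree is one node."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        label = STRUCTURES.get(type(node))
        if label:
            counts[label] = counts.get(label, 0) + 1
    return counts

code = "for i in range(3):\n    if i > 1:\n        print(i)"
counts = structure_counts(code)
# counts -> 1 loop, 1 conditional, 2 function calls (range and print)
```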
Optionally, as an embodiment of the present invention, the deep learning model is a bimodal pre-trained model with 12 layers, each layer having 12 self-attention heads of dimension 64; the hidden dimension is 768 and the feed-forward layer dimension is 3072.
Optionally, as an embodiment of the present invention, the pre-training method of the deep learning model includes:
Setting the input as a concatenation of a code sequence, a natural language sequence, and special tokens, and the output as a contextual vector representation of each code and natural-language token together with a summary sequence representation;
acquiring unimodal and bimodal data from a database, and filtering the data set with preset rules to obtain the training data;
Given an NL-PL pair x = (w, c) as a data point, where NL denotes natural language and PL denotes programming language (code), w is the NL token sequence and c is the PL token sequence, a random set of positions in the NL and PL sequences is selected to mask, and the tokens at the selected positions are then replaced with a special mask token, with 15% of the tokens in x replaced;
Pre-training uses a replaced-token-detection objective: data generators are first trained on unimodal natural-language and code data, respectively, to generate plausible alternatives for the masked positions; a discriminator then learns a fused representation of natural language and code by detecting whether each word is the original one. The discriminator is a binary classifier: if the generator generates the correct token, the label of that token is true; otherwise, it is false.
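The 15% masking step that precedes this objective can be sketched as follows; the tokenization, the `[MASK]` and `[SEP]` token names, and the seeding are illustrative assumptions, and the actual model would operate on subword tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Replace a random ~15% of positions with a special mask token,
    returning the corrupted sequence and the chosen positions."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions

nl = "return the sum of two numbers".split()         # w: NL sequence
pl = "def add ( a , b ) : return a + b".split()      # c: PL sequence
x = nl + ["[SEP]"] + pl                              # NL-PL pair x = (w, c)
masked, positions = mask_tokens(x)
```

The generators would propose replacement tokens for `positions`, and the discriminator would label each position of the resulting sequence as original or replaced.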
Optionally, as an embodiment of the present invention, analyzing the dependency relationship between the code elements in the program dependency graph using a graph neural network includes:
Capturing the topological structure and node features in the program dependency graph with a graph neural network, where the nodes are basic code elements including variables, function calls, or control-flow statements;
identifying critical sections based on the topology and node features, the critical sections including loop bodies, conditional branches, and function calls.
Fig. 3 is a schematic structural diagram of a terminal 300 according to an embodiment of the present invention, where the terminal 300 may be used to execute the adaptive code annotation updating method according to the embodiment of the present invention.
The terminal 300 may include a processor 310, a memory 320, and a communication unit 330. These components may communicate via one or more buses. Those skilled in the art will appreciate that the configuration shown in the drawings does not limit the invention: the terminal may use a bus structure or a star structure, may include more or fewer components than shown, or may combine certain components or arrange them differently.
The memory 320 may be used to store instructions for execution by the processor 310 and may be implemented by any type of volatile or non-volatile storage device or combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. When the instructions in memory 320 are executed by processor 310, the terminal 300 can perform some or all of the steps of the method embodiments described below.
The processor 310 is the control center of the storage terminal; it connects the various parts of the entire electronic terminal using various interfaces and lines, and performs the various functions of the electronic terminal and/or processes data by running or executing the software programs and/or modules stored in the memory 320 and invoking the data stored in the memory. The processor may consist of an integrated circuit (IC), for example a single packaged IC, or of multiple packaged ICs with the same or different functions connected to one another. For example, the processor 310 may include only a central processing unit (CPU). In embodiments of the invention, the CPU may have a single computing core or multiple computing cores.
The communication unit 330 is used to establish a communication channel so that the storage terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to them.
The present invention also provides a computer storage medium in which a program may be stored; when executed, the program may perform some or all of the steps of the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like.
Therefore, by performing semantic analysis on the abstract syntax tree with a deep learning model, the invention can accurately identify the changed code portion and its context. A program dependency graph is constructed on the basis of the semantic analysis and mined with a graph neural network to further analyze the dependency relations among code elements, clearly showing the structure and logical relations of the code and attending both to the overall architecture of the code and to the relations between its parts. When annotations are subsequently generated, annotations can thus be produced for the modified code lines and updated for other code affected by the modification, improving the overall accuracy of the code annotations. For the technical effects achieved by this embodiment, reference may be made to the above; they are not repeated here.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the embodiments of the present invention, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium capable of storing program code (such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, or an optical disk), the product including several instructions for causing a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, etc.) to execute all or part of the steps of the methods described in the embodiments of the present invention.
The same or similar parts between the various embodiments in this specification are referred to each other. In particular, for the terminal embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and reference should be made to the description in the method embodiment for relevant points.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into modules is merely a logical functional division, and other divisions are possible in actual implementation; multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, systems, or modules, and may be electrical, mechanical, or of another form.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Various equivalent modifications and substitutions may be made to the embodiments of the present invention by those skilled in the art without departing from the spirit and scope of the present invention, and all such modifications and substitutions are intended to fall within the scope of the present invention as defined by the appended claims.

Claims (10)

1. An adaptive code annotation update method, comprising:
confirming that the code has been updated, extracting the updated code set, and constructing an abstract syntax tree for the code set;
carrying out semantic analysis on the abstract syntax tree by using a deep learning model, generating a semantic vector, and constructing a program dependency graph based on the semantic vector;
Analyzing the dependency relationship between the code elements in the program dependency graph by using a graph neural network;
and generating code annotation based on the semantic vector and the dependency relationship between the code elements by using a natural language generation model.
2. The method of claim 1, wherein confirming that the code has been updated, extracting the updated code set, and constructing an abstract syntax tree for the code set comprises:
monitoring all commit actions of a code repository with a custom script, and after the custom script is triggered by a commit action, calling a script interface to obtain all change records of the current monitoring round, wherein a change record comprises a list of added, modified, or deleted files, and detailed code information, including changed code line numbers, is extracted from the corresponding files based on the file list;
and constructing an abstract syntax tree, with a parser, based on the code file corresponding to the change record and the detailed code information.
3. The method of claim 2, wherein constructing, with a parser, an abstract syntax tree based on the code file and the code details corresponding to the change record, comprises:
generating new-version code based on the code file corresponding to the change record and the detailed code information, and preprocessing the new-version code, wherein the preprocessing comprises removing dead code and resolving macros and inline conditional expressions;
converting the preprocessed new-version code into an abstract syntax tree with a parser based on the language rules of the code, and converting code structures of function calls, control-flow statements, loops, and conditional judgments into corresponding node types in the abstract syntax tree, wherein the connection relations among the nodes identify the logical and sequential relations of the code;
and removing duplicate nodes from the abstract syntax tree.
4. The method of claim 1, wherein the deep learning model is a bimodal pre-trained model with 12 layers, each layer having 12 self-attention heads of dimension 64, a hidden dimension of 768, and a feed-forward layer dimension of 3072.
5. The method of claim 4, wherein the pre-training method of the deep learning model comprises:
setting the input as a concatenation of a code sequence, a natural language sequence, and special tokens, and the output as a contextual vector representation of each code and natural-language token together with a summary sequence representation;
acquiring unimodal and bimodal data from a database, and filtering the data set with preset rules to obtain training data;
given an NL-PL pair x = (w, c) as a data point, wherein NL denotes natural language and PL denotes programming language code, w is the NL token sequence and c is the PL token sequence, selecting a random set of positions in the NL and PL sequences to mask, and then replacing the tokens at the selected positions with a special mask token, wherein 15% of the tokens in x are replaced;
pre-training with a replaced-token-detection objective: first training data generators on unimodal natural-language and code data, respectively, to generate plausible alternatives for the masked positions, and then learning a fused representation between the natural language and the code through a discriminator that detects whether each word is the original word, wherein the discriminator is a binary classifier, and if the generator generates the correct token, the label of that token is true; otherwise, it is false.
6. The method of claim 1, wherein analyzing dependencies between code elements in the program dependency graph using a graph neural network comprises:
capturing the topological structure and node features in the program dependency graph with a graph neural network, wherein the nodes are basic code elements comprising variables, function calls, or control-flow statements;
and identifying critical sections based on the topology and node features, the critical sections comprising loop bodies, conditional branches, and function calls.
7. An adaptive code annotation update system, comprising:
a code parsing module, used for confirming that the code has been updated, extracting the updated code set, and constructing an abstract syntax tree for the code set;
the relation analysis module is used for carrying out semantic analysis on the abstract syntax tree by utilizing a deep learning model, generating a semantic vector and constructing a program dependency graph based on the semantic vector;
The relation learning module is used for analyzing the dependency relation among the code elements in the program dependency graph by utilizing the graph neural network;
And the annotation generation module is used for generating a code annotation based on the semantic vector and the dependency relationship between the code elements by using a natural language generation model.
8. The system of claim 7, wherein the code parsing module comprises:
an update monitoring unit, used for monitoring all commit actions of the code repository with a custom script, and after the custom script is triggered by a commit action, calling a script interface to obtain all change records of the current monitoring round, wherein a change record comprises a list of added, modified, or deleted files, and detailed code information, including changed code line numbers, is extracted from the corresponding files based on the file list;
and a syntax description unit, used for constructing an abstract syntax tree, with a parser, based on the code file corresponding to the change record and the detailed code information.
9. A terminal, comprising:
a memory for storing an adaptive code annotation update program;
a processor for implementing the steps of the adaptive code annotation update method according to any of claims 1-6 when executing said adaptive code annotation update program.
10. A computer-readable storage medium, characterized in that an adaptive code annotation update program is stored thereon, which, when executed by a processor, implements the steps of the adaptive code annotation update method according to any one of claims 1-6.
CN202411503013.0A 2024-10-25 2024-10-25 Adaptive code annotation updating method, system, terminal and storage medium Pending CN119576363A (en)

Publications (1)

Publication Number Publication Date
CN119576363A true CN119576363A (en) 2025-03-07


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120144167A (en) * 2025-05-13 2025-06-13 济南浪潮数据技术有限公司 Code comment generation method, device, electronic device, storage medium and product



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination