CN118796191A

CN118796191A - Code parsing method, device, system, electronic device and storage medium

Info

Publication number: CN118796191A
Application number: CN202311191275.3A
Authority: CN
Inventors: 周成
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Hubei Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Hubei Co Ltd
Priority date: 2023-09-13
Filing date: 2023-09-13
Publication date: 2024-10-18

Abstract

The invention provides a code parsing method, device, system, electronic device and storage medium, and relates to the technical field of code processing. The method comprises: determining, according to the project README file, the code files that need to be run in the project, and dividing the code in the code files into segments with functions as the smallest unit to obtain a plurality of code segments; determining the code hierarchy relationship corresponding to the plurality of code segments according to the calling relationship of the functions in the code files; and, in order from low to high code level, sequentially inputting the plurality of code segments into a first natural language model according to a first preset instruction format to obtain the code comments corresponding to each code segment, wherein the first preset instruction format is used to instruct the first natural language model to generate the code comments of the code segments. By inputting the code segments into the first natural language model in order from low to high code level, the first natural language model can use the calling relationships among the code segments to generate more accurate comments, thereby achieving rapid code interpretation.

Description

Code parsing method, device, system, electronic device and storage medium

Technical Field

The present invention relates to the field of code processing technology, and in particular to a code parsing method, device, system, electronic device and storage medium.

Background Art

There are currently many open-source GitHub projects on artificial intelligence platforms. The README files of many of these projects are neither detailed nor linked to the code. Project comments mostly describe function inputs and outputs, while many core statements inside the functions lack corresponding comments and explanations.

At present, understanding the code that the README file says needs to be run relies mostly on people reading the code repeatedly to work out the project's implementation logic. This approach is inefficient.

Summary of the Invention

The present invention provides a code parsing method, device, system, electronic device and storage medium to overcome the defect in the prior art that understanding a project's implementation logic relies on repeated manual reading of the code, which results in low processing efficiency.

The present invention provides a code parsing method, the method comprising:

determining, according to the project README file, the code files that need to be run in the project, and dividing the code in the code files into segments with functions as the smallest unit to obtain a plurality of code segments;

determining the code hierarchy relationship corresponding to the plurality of code segments according to the calling relationship of the functions in the code files;

in order from low to high code level, sequentially inputting the plurality of code segments into a first natural language model according to a first preset instruction format to obtain the code comments corresponding to each code segment, wherein the first preset instruction format is used to instruct the first natural language model to generate code comments for the code segments, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

According to a code parsing method provided by the present invention, the method further comprises:

determining the literature cited in the project README file, and obtaining a literature file in a preset format corresponding to the literature;

dividing the content of the literature file into a plurality of sentences according to preset delimiters;

sequentially inputting the plurality of sentences into a second natural language model according to a second preset instruction format to obtain a literature answer corresponding to each sentence, wherein the second preset instruction format is used to instruct the second natural language model to generate the literature answer corresponding to the sentence;

establishing an association between the code segments and the sentences based on the text similarity between the code comments corresponding to the code segments and the literature answers corresponding to the sentences, so as to assist code interpretation.
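The association step can be sketched as follows. This is only an illustration: the text does not name a similarity measure, so the standard-library `difflib.SequenceMatcher` ratio stands in for it, and the function names and the 0.5 threshold are assumptions.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for the unspecified metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def associate(code_comments, literature_answers, threshold=0.5):
    """Link each code segment to the sentence whose literature answer is most similar.

    code_comments:      {segment id: generated code comment}
    literature_answers: {sentence id: generated literature answer}
    threshold:          hypothetical cut-off below which no link is made
    """
    links = {}
    for seg_id, comment in code_comments.items():
        best_id, best_score = None, 0.0
        for sent_id, answer in literature_answers.items():
            score = text_similarity(comment, answer)
            if score > best_score:
                best_id, best_score = sent_id, score
        if best_id is not None and best_score >= threshold:
            links[seg_id] = best_id
    return links
```

An embedding-based similarity could replace `text_similarity` without changing the linking logic.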

dividing the text information in the project README file into blocks to obtain a plurality of text blocks;

obtaining an embedding vector set, wherein the embedding vector set includes an embedding vector of each code segment, an embedding vector of each sentence, and an embedding vector of each text block;

upon receiving a query statement from a user, obtaining an embedding vector of the query statement;

selecting, from the embedding vector set, the target embedding vector with the highest first similarity, according to the first similarity between the embedding vector of the query statement and each embedding vector in the embedding vector set;

obtaining answer information corresponding to the query statement according to the target embedding vector.
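The selection of the target embedding vector might look like the sketch below. Cosine similarity is assumed for the "first similarity", which the text does not define, and the embeddings are taken as already computed; `select_target` and the dictionary layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_target(query_vector, embedding_set):
    """Return (text, similarity) for the entry most similar to the query.

    embedding_set: {source text: embedding vector}, covering code segments,
    literature sentences and README text blocks alike.
    """
    scored = ((text, cosine(query_vector, vec)) for text, vec in embedding_set.items())
    return max(scored, key=lambda pair: pair[1])
```

Keeping the source text as the dictionary key makes it straightforward to recover the target text that the selected vector represents.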

According to a code parsing method provided by the present invention, obtaining the answer information corresponding to the query statement according to the target embedding vector comprises:

obtaining the target text represented by the target embedding vector;

inputting the target text, as context information, together with the query statement into a third natural language model to obtain the answer information corresponding to the query statement.
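As a rough illustration of this step, the target text and the query can be combined into a single prompt. The prompt wording and the `third_model` callable are assumptions; the text does not specify either.

```python
def build_answer_prompt(target_text, query):
    """Combine the retrieved target text (context) with the user's query."""
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{target_text}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer_query(target_text, query, third_model):
    """third_model is a hypothetical callable that maps a prompt string to an answer."""
    return third_model(build_answer_prompt(target_text, query))
```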

According to a code parsing method provided by the present invention, determining the literature cited in the project README file and obtaining a literature file in a preset format corresponding to the literature comprises:

matching the text information of the cited literature in the project README file based on a preset regular expression;

parsing the meta information of the cited literature from the text information;

obtaining a literature file in the preset format from an open-source literature library based on the meta information.
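A minimal sketch of the matching and lookup steps is given below. The text names neither the regular expression, the literature library, nor the file format; here arXiv links and DOIs are assumed as the citation forms, arXiv as the library, and PDF as the preset format.

```python
import re

# Assumed citation forms: arXiv abs/pdf links and bare DOIs.
ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s)\]>]+)")


def find_references(readme_text):
    """Return meta information (type and identifier) for each cited work found."""
    refs = [{"type": "arxiv", "id": m} for m in ARXIV_RE.findall(readme_text)]
    refs += [{"type": "doi", "id": m.rstrip(".,;")} for m in DOI_RE.findall(readme_text)]
    return refs


def pdf_url(ref):
    """Build a download URL for arXiv entries; other types are left unresolved here."""
    if ref["type"] == "arxiv":
        return f"https://arxiv.org/pdf/{ref['id']}"
    return None
```

Resolving a DOI to a downloadable file depends on the publisher, so the sketch leaves that case open.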

According to a code parsing method provided by the present invention, the first natural language model is trained as follows:

crawling original code segments that carry original code comments;

inputting the original code segments into an initial natural language model to obtain the predicted comments output by the initial natural language model;

updating the model parameters of the initial natural language model according to a second similarity between the predicted comments and the original code comments until the second similarity meets a preset similarity requirement, thereby obtaining the trained first natural language model.

The present invention also provides a code parsing device, the device comprising:

a first code parsing module, configured to determine, according to the project README file, the code files that need to be run in the project, and to divide the code in the code files into segments with functions as the smallest unit to obtain a plurality of code segments;

a second code parsing module, configured to determine the code hierarchy relationship corresponding to the plurality of code segments according to the calling relationship of the functions in the code files;

a third code parsing module, configured to sequentially input the plurality of code segments, in order from low to high code level, into a first natural language model according to a first preset instruction format to obtain the code comments corresponding to each code segment, wherein the first preset instruction format is used to instruct the first natural language model to generate code comments for the code segments, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the code parsing methods described above when executing the program.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program implements any of the code parsing methods described above when executed by a processor.

The present invention also provides a computer program product, comprising a computer program that implements any of the code parsing methods described above when executed by a processor.

With the code parsing method, device, system, electronic device and storage medium provided by the present invention, the code files that need to be run in the project are first determined according to the project README file, and the code in the code files is divided into segments with functions as the smallest unit to obtain a plurality of code segments. The code hierarchy relationship corresponding to the plurality of code segments is then determined according to the calling relationship of the functions in the code files, so that a bottom-up logical system is built from the code hierarchy relationship. The code segments are then input into the first natural language model one by one, in order from low to high code level, according to the first preset instruction format. Feeding the code segments in this order lets the first natural language model better understand the context and logical flow of the code, so that it can use the code hierarchy relationship between the code segments to generate more accurate comments, thereby achieving rapid code interpretation.

Brief Description of the Drawings

To illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of the code parsing method provided by the present invention;

FIG. 2 is a schematic structural diagram of the code parsing device provided by the present invention;

FIG. 3 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

To make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

There are currently many open-source GitHub projects on artificial intelligence platforms. A GitHub project is a software project stored on the GitHub code hosting platform. GitHub is a version control and collaboration platform that lets developers share, host and collaboratively develop code. On GitHub, developers can create their own code repositories and host code in them. Each repository corresponds to a project, in which developers can carry out code management, version control, issue tracking and similar operations. Other developers can browse and clone these repositories, or take part in the project by submitting code, reporting problems or suggesting improvements. A GitHub project usually consists of multiple files, including source code, configuration files, documentation and other related resources.

The README files of many GitHub projects are neither detailed nor linked to the code. Project comments mostly describe function inputs and outputs, while many core statements inside the functions lack corresponding comments and explanations.

Understanding the code that the README file says needs to be run relies mostly on people reading the code repeatedly to work out the project's implementation logic. This approach is inefficient and error-prone.

To address these problems, an embodiment of the present invention proposes a code parsing method in which the code segments that need to be run in the project are input into a first natural language model one by one, in order from low to high code level, according to a first preset instruction format. Because the code segments arrive in this order, the first natural language model can use the code hierarchy relationship between them to generate more accurate comments, thereby achieving rapid code interpretation.

In addition, some GitHub projects are associated with literature, such as papers, but the mathematical derivations in a paper are not linked to the corresponding implementation in the project code. Many GitHub projects simplify the engineering implementation, so the content of the README file may not be fully consistent with the scheme described in the paper.

Some tools for interpreting literature already exist. These tools generally first parse the text content out of the literature file and then guide a model to interpret it by adding suitable prompts together with the corresponding text. The interpretation usually concentrates on translation, summary generation and text generation. Its results have little connection to the project code, which makes the code hard to interpret.

Therefore, an embodiment of the present invention further proposes fine-tuning the model parameters of the first natural language model to obtain a second natural language model. Several sentences from the literature cited in the project README file are then input into the second natural language model according to a second preset instruction format to obtain the literature answers corresponding to the sentences. The text similarity between the code comments corresponding to the code segments and the literature answers corresponding to the sentences is used to establish an association between the code segments and the sentences, so as to assist code interpretation.

Furthermore, an embodiment of the present invention proposes a method for answering users' questions about the project code. It answers each question specifically, provides customized code parsing, and improves answering efficiency.

FIG. 1 is a schematic flow chart of the code parsing method provided by the present invention. As shown in FIG. 1, the code parsing method includes:

Step 101: determining, according to the project README file, the code files that need to be run in the project, and dividing the code in the code files into segments with functions as the smallest unit to obtain a plurality of code segments;

Here, the project README file refers to the README file. A README file usually includes: the project name (normally used as the title or opening to identify the project clearly); a project description (its purpose, functions and features, which helps readers understand the project's background and goals); installation instructions (a detailed guide to installing and configuring the project, such as the required software dependencies, environment settings and steps); usage instructions (how to use the project, including the steps to start it, run sample code or perform specific operations); the file structure (the README file may list the project's files and folders together with the function and role of each, which helps readers understand the project's organization and code layout); examples and demonstrations (sample code, command-line examples or demo links that show the project's functions and usage and help users understand and verify them); and license information (the license agreement under which the project is released).

In this embodiment, the README file is parsed to extract the code files that need to be run. Suitable regular expressions can be used to match and extract code file information, for example by file name extension (such as .py), by special markers (such as "code file"), or by function name, parameter list and function body. The files matched by the regular expressions serve as candidate code files. Natural language processing techniques such as word segmentation and keyword extraction can also be used to identify and extract text related to code files. For example, sentences or paragraphs containing keywords such as "code", "run" or "file" can be located, and file names or paths can be extracted from them.

After the candidate code files are extracted, they are further organized and filtered so that only the code files that actually need to be run remain. For example, files that do not need to be run, such as documentation and configuration files, are removed according to the needs of the project. Duplicate files can also be removed from the candidates to avoid running the same code more than once. In addition, the code files related to the core functions of the project can be singled out based on the descriptions or instructions in the README file.
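The extraction and filtering described above can be sketched as follows. The regular expression, the list of extensions and the choice of `.py` and `.sh` as "runnable" are illustrative assumptions rather than the patterns used by the method.

```python
import re

# Assumed pattern: path-like tokens ending in a handful of common extensions.
FILE_RE = re.compile(r"[\w./-]+\.(?:py|sh|md|txt|yaml|yml|json)\b")
RUNNABLE_EXTENSIONS = (".py", ".sh")  # hypothetical definition of "needs to be run"


def files_to_run(readme_text):
    """Extract candidate files from README text, then drop non-code files and duplicates."""
    candidates = FILE_RE.findall(readme_text)
    selected = []
    for path in candidates:
        if path.endswith(RUNNABLE_EXTENSIONS) and path not in selected:
            selected.append(path)
    return selected
```

For a README that mentions `requirements.txt`, `train.py`, `configs/base.yaml` and `train.py` again, only `train.py` survives, once.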

It should be noted that a code segment is a contiguous portion of the code in a program. It can be a standalone function or method, or a specific logical operation or algorithm implementation. In this embodiment, the code can be divided, according to its logical structure and function, into segments with functions as the smallest unit.

For example, the code can be divided by function or method according to the definitions and calling relationships of the functions or methods. A code segment can contain one complete function or method, which encapsulates a specific piece of functionality and can be called by other code. The code can also be divided into multiple code segments according to the syntax rules of the programming language, for example by syntactic elements such as statements, expressions, loops or conditionals, or by identifying where code blocks begin and end. In every case, each resulting code segment must contain at least one function.
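For Python source files, function-level splitting can be sketched with the standard `ast` module. This is one possible approach under the assumption that the code files are Python; `split_into_function_segments` is a hypothetical name.

```python
import ast


def split_into_function_segments(source):
    """Return {function name: source text} for every function defined in `source`.

    Nested functions and methods are included as their own segments; functions
    that share a name overwrite each other in this simple sketch.
    """
    tree = ast.parse(source)
    segments = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment needs Python 3.8+ and returns the exact source span.
            segments[node.name] = ast.get_source_segment(source, node)
    return segments
```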

Step 102: determining the code hierarchy relationship corresponding to the plurality of code segments according to the calling relationship of the functions in the code files;

It should be noted that the calling relationship of the functions in a code file refers to how the functions involved in that code call one another, that is, one function calling other functions while it executes. In a program, a function calls another function by using its name and supplying the required arguments.

The calling relationships form a function call graph that represents the dependencies and call flow between functions. In the call graph, each function is a node and each call from one function to another is an edge between nodes. Analyzing the calling relationships clarifies the program's execution flow, dependencies and logical structure.
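A call graph of this kind can be built for Python code with the `ast` module, as in the sketch below. It only tracks direct calls by name between functions defined in the same source text; method calls, imports and dynamic dispatch are ignored, which is a simplifying assumption.

```python
import ast


def build_call_graph(source):
    """Map each function defined in `source` to the set of defined functions it calls."""
    tree = ast.parse(source)
    defs = {
        node.name: node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    graph = {}
    for name, node in defs.items():
        graph[name] = {
            child.func.id
            for child in ast.walk(node)
            if isinstance(child, ast.Call)
            and isinstance(child.func, ast.Name)  # direct calls such as f(x)
            and child.func.id in defs             # skip built-ins and imported names
        }
    return graph
```

Each key is a node of the graph and each element of its set is an outgoing edge.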

In this embodiment, the code hierarchy relationship corresponding to each code segment is determined from the calling relationships between functions, for example for the following three code segments:

Analyzing the calls between the functions shows that print("Hello, World!") is the lowest-level code segment: it calls no other function, so it sits at the lowest level. function2() is called by function1(), so function2() is one level lower than function1(). function1() is called by main(), so function1() is one level lower than main(). main() is the top-level code segment because it is not called by any other function.
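The level assignment can be sketched as below. The three segments themselves are not reproduced in the text, so the call graph in the usage note is a hypothetical reconstruction in which main() calls function1(), function1() calls function2(), and function2() contains the print statement. The numbering (0 for functions that call no other defined function) is also an assumption.

```python
def code_levels(call_graph):
    """Assign a level to each function: 0 if it calls no other defined function,
    otherwise one more than the highest level among its callees.

    call_graph: {function name: set of called function names}
    Recursive cycles are broken arbitrarily in this sketch.
    """
    levels = {}

    def level_of(name, stack=()):
        if name in levels:
            return levels[name]
        if name in stack:  # recursion cycle: stop descending
            return 0
        callees = [c for c in call_graph.get(name, ()) if c != name]
        if callees:
            level = 1 + max(level_of(c, stack + (name,)) for c in callees)
        else:
            level = 0
        levels[name] = level
        return level

    for name in call_graph:
        level_of(name)
    return levels
```

With the hypothetical graph `{"main": {"function1"}, "function1": {"function2"}, "function2": set()}`, sorting the functions by level gives the low-to-high order function2, function1, main.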

Step 103: in order from low to high code level, sequentially inputting the plurality of code segments into a first natural language model according to a first preset instruction format to obtain the code comments corresponding to each code segment, wherein the first preset instruction format is used to instruct the first natural language model to generate code comments for the code segments, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

For example, for code segment 1:

print("Hello, World!")

Code segment 1 can be input into the first natural language model in the following first preset instruction format:

"instruct": "Please generate code comments for the following code snippet:",

"code": "print('Hello, World!')"

Converting the code into the instruct format makes it possible to state the task or question explicitly when interacting with the first natural language model, with the expectation that the model generates a response related to it.
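Building the instruction for an arbitrary code segment can be sketched as follows. The keys "instruct" and "code" come from the example above; wrapping them in a JSON object and the function name `build_instruction` are assumptions.

```python
import json

INSTRUCT = "Please generate code comments for the following code snippet:"


def build_instruction(code_segment):
    """Serialize one code segment in the instruct format as a JSON string."""
    return json.dumps({"instruct": INSTRUCT, "code": code_segment}, ensure_ascii=False)
```

Using `json.dumps` takes care of escaping quotes and newlines in multi-line code segments.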

In this embodiment, to improve the accuracy of the code comments produced by the first natural language model, the code segments are input into the first natural language model one by one, in order from low to high code level, according to the first preset instruction format.

In addition, when the code segments are input into the first natural language model according to the first preset instruction format, the code hierarchy relationship between them can be conveyed by specific markers or comments, implied by specific function names or prefixes, or expressed through function parameter passing. This helps the natural language model understand the function and calling relationships of each code segment, so that it produces better code comments for the code segments.
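One way to combine the bottom-up order with relationship markers is sketched below: comments already generated for a segment's callees are prepended as ordinary comment lines. The marker format `# calls ...` and the `model` callable are hypothetical; the text leaves the exact marking scheme open.

```python
INSTRUCT = "Please generate code comments for the following code snippet:"


def annotate_bottom_up(segments, call_graph, levels, model):
    """Annotate code segments from the lowest code level to the highest.

    segments:   {function name: source text}
    call_graph: {function name: set of called function names}
    levels:     {function name: code level, 0 = lowest}
    model:      hypothetical callable taking the instruction dict, returning a comment
    """
    comments = {}
    for name in sorted(segments, key=lambda n: levels[n]):
        # Callees were annotated earlier, so their comments can mark the
        # calling relationship for the current segment.
        context = [
            f"# calls {callee}: {comments[callee]}"
            for callee in sorted(call_graph.get(name, ()))
            if callee in comments
        ]
        prompt = {"instruct": INSTRUCT, "code": "\n".join(context + [segments[name]])}
        comments[name] = model(prompt)
    return comments
```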

In the code parsing method provided by this embodiment of the present invention, the code files that need to be run in the project are first determined according to the project README file, and the code in the code files is divided into segments with functions as the smallest unit to obtain a plurality of code segments. The code hierarchy relationship corresponding to the plurality of code segments is then determined according to the calling relationship of the functions in the code files, so that a bottom-up logical system is built from the code hierarchy relationship. The code segments are then input into the first natural language model one by one, in order from low to high code level, according to a first preset instruction format that expresses the calling relationships between the code segments. Feeding the code segments in this order lets the first natural language model better understand the context and logical flow of the code, so that it can use the calling relationships between the code segments to generate more accurate comments, thereby achieving rapid code interpretation.

Based on any of the above embodiments, the first natural language model is trained as follows:

In this embodiment, original code segments containing original code comments can, for example, be crawled from the GitHub website and used as training data. The original code segments are then input into the initial natural language model, which generates predicted comments. Based on the similarity between the predicted comments and the original code comments, the model parameters are updated with the backpropagation algorithm until the comment similarity reaches a preset value. This process yields a trained first natural language model that can generate suitable comments for a given code segment.

In this embodiment, the similarity between the predicted comments and the original code comments can be calculated with a similarity measure based on word vectors or text embeddings; no limitation is placed on this.
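As a simple stand-in for a word-vector similarity, the sketch below compares bag-of-words count vectors with cosine similarity and checks a stopping condition. It is not the measure used in training; the 0.8 threshold and both function names are assumptions.

```python
import math
import re
from collections import Counter


def bow_cosine(predicted, original):
    """Cosine similarity between bag-of-words count vectors of two comments."""
    a = Counter(re.findall(r"\w+", predicted.lower()))
    b = Counter(re.findall(r"\w+", original.lower()))
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def meets_requirement(pairs, threshold=0.8):
    """True when the mean similarity over (predicted, original) pairs reaches the threshold."""
    scores = [bow_cosine(p, o) for p, o in pairs]
    return bool(scores) and sum(scores) / len(scores) >= threshold
```

In an actual training loop a differentiable loss would drive the parameter updates; a check like `meets_requirement` would only decide when to stop.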

Based on any of the foregoing embodiments, the method further includes:

In this embodiment, the content of the project README file is parsed to determine the literature it cites. Once the cited literature has been determined, a literature file in a preset format is downloaded from the corresponding literature library.

A project README file usually contains a reference list or a section of cited literature, from which the cited literature can be determined. Alternatively, the project README file can be searched for literature-related information, such as keywords, authors, and journal or conference names, and the cited literature can be determined from that information.

The preset delimiter refers to a predefined symbol that marks the end of a complete paragraph or a complete sentence, such as a period, question mark, exclamation mark, semicolon, or line break.

In this embodiment, the literature file is split using the selected preset delimiters: the text is split into multiple parts at each delimiter position, forming a list of sentences.
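Splitting at preset delimiters can be sketched with a regular expression. The delimiter set below is an assumption taken from the examples listed above (period, question mark, exclamation mark, semicolon, line break, in both ASCII and full-width forms), and `split_sentences` is a hypothetical helper name.

```python
import re

# Assumed delimiter set; adjust to whatever "preset delimiters" are configured.
DELIMITERS = re.compile(r"[.?!;。？！；\n]")

def split_sentences(text):
    """Split text at every delimiter and drop empty or whitespace-only parts."""
    return [part.strip() for part in DELIMITERS.split(text) if part.strip()]
```

A plain character-class split like this also breaks at decimal points and abbreviations (for example "3.14" or "e.g."), so a real pipeline would likely need a more careful tokenizer.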

It should be noted that the second natural language model used in this embodiment has the same network structure as the first natural language model but different model parameters. The second natural language model is obtained by taking the first natural language model as the initial model and training it on sentences labeled with literature answers. Obtaining the second natural language model by fine-tuning the parameters of the first, and using it to generate literature answers for sentences in the literature cited by the project README file, keeps the literature answers and the code comments consistent in knowledge domain and improves readability and comprehensibility.

In this embodiment, each sentence is input into the second natural language model according to a second preset instruction format, which instructs the second natural language model to generate the literature answer corresponding to the sentence.

In addition, to establish associations between code segments and sentences, this embodiment can use a text similarity algorithm (for example, cosine similarity) to compute the similarity between the literature answer corresponding to a sentence and the code comment corresponding to each code segment. A suitable similarity threshold is set according to the actual situation to decide whether a code segment and a sentence are related: if the similarity exceeds the threshold, they are considered related; otherwise they are considered unrelated. If they are related, the current sentence is associated with the corresponding code segment, for example by storing the pair in a list.
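The thresholded association can be sketched as below. As a stand-in for whatever text representation is actually used, the sketch computes cosine similarity over simple word-count vectors; the helper names, the dictionary layout, and the default threshold of 0.5 are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; a simple stand-in for a real text representation."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def associate(answers, comments, threshold=0.5):
    """answers: {sentence: literature answer}; comments: {segment name: code comment}.

    Returns a list of (sentence, segment name) pairs whose similarity exceeds the threshold.
    """
    links = []
    for sentence, answer in answers.items():
        for segment, comment in comments.items():
            if cosine(bag_of_words(answer), bag_of_words(comment)) > threshold:
                links.append((sentence, segment))
    return links
```

The threshold trades recall for precision: a lower value links more sentences to each code segment, at the cost of weaker matches.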

In the code parsing method provided by this embodiment of the present invention, the model parameters of the first natural language model are fine-tuned to obtain a second natural language model. Sentences from the literature cited by the project README file are then input into the second natural language model according to a second preset instruction format to obtain the literature answer corresponding to each sentence. The text similarity between the code comment corresponding to a code segment and the literature answer corresponding to a sentence is used to establish an association between the code segment and the sentence, to assist code interpretation.

A text block generally refers to a continuous piece of text delimited by a certain format or markup. A text block usually carries one specific piece of content, such as the project introduction, installation guide, usage examples, or contribution guide.

In this embodiment, the text information in the project README file can be divided into blocks in several ways. Because a project README file usually has headings and subsections, the text can be divided by identifying these headings: for example, each heading or subheading serves as the name of a block, and all the text beneath it serves as the content of that block. A project README file also usually contains information such as project features, system requirements, and installation instructions, which can be divided into blocks by list or paragraph, for example by locating the start and end of each paragraph or list item. Furthermore, if the project README file uses tags or other special formatting to mark its parts, the text can be divided according to that formatting. For example, a README file in Markdown format uses blank lines or specific symbols (such as #) to define headings, code blocks, lists, and so on, and these can be used to divide the text into blocks.
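Heading-based chunking of a Markdown README can be sketched as follows. This is one possible reading of the paragraph above, covering only `#`-style headings; `chunk_readme` is a hypothetical name, and lines inside fenced code blocks are deliberately not treated as headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker

def chunk_readme(markdown):
    """Split Markdown text into (title, body) blocks at '#'-style headings."""
    blocks = []
    title, body = "", []
    in_fence = False

    def flush():
        # Store the block collected so far, unless it is completely empty.
        text = "\n".join(body).strip()
        if title or text:
            blocks.append((title, text))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return blocks
```

Text that appears before the first heading is kept as a block with an empty title, so no README content is dropped.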

In this embodiment, for the text information in the project README file, each text block is taken as one unit and input into the model to obtain its embedding vector. For the code in the code files, each code segment is taken as one unit and input into the model to obtain its embedding vector. For the literature cited by the project README file, the literature is split into sentences at the preset delimiters (such as periods, semicolons, and exclamation marks) according to its sentence structure, and each sentence is then taken as one unit and input into the model to obtain its embedding vector.

When a query statement is received from a user, the same pre-trained model and the corresponding text processing tools are used to convert the query statement into an embedding vector. The first similarity between the embedding vector of the query statement and the embedding vector of each code segment, sentence, and text block in the embedding vector set is then computed, and the target embedding vector with the highest first similarity is found. Finally, the code segment, sentence, or text block associated with the selected target embedding vector is used to extract the answer.

In this embodiment, each text block, each code segment, and each sentence is taken as one unit and input into the model to obtain its embedding vector, and these vectors form the embedding vector set. Because embedding vectors capture the semantic information of text, answering a user's query statement only requires computing the first similarity between the embedding vector of the query statement and each embedding vector in the set and finding the target embedding vector with the highest first similarity. The most relevant information can thus be retrieved quickly, improving the efficiency of answering.
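The retrieval step reduces to a nearest-neighbour search over the embedding vector set. The sketch below assumes the embeddings have already been produced by some model and are plain lists of floats, and that the "first similarity" is cosine similarity; the `(kind, text, vector)` tuple layout and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_match(query_vector, embedding_set):
    """embedding_set: list of (kind, text, vector) entries.

    Returns the entry whose vector is most similar to the query vector.
    """
    return max(embedding_set, key=lambda entry: cosine(query_vector, entry[2]))
```

A linear scan is adequate for a single project; for large vector sets an approximate nearest-neighbour index would be the usual substitute.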

Based on any of the foregoing embodiments, obtaining the answer information corresponding to the query statement according to the target embedding vector includes:

The target text represented by the target embedding vector is the specific text fragment from which that embedding vector was built. It can be a code segment, a sentence, or a text block, or alternatively the code comment corresponding to a code segment, the literature answer corresponding to a sentence, or a text block.

In this embodiment, the target text is provided to the third natural language model as context information, and the user's query statement is input into the third natural language model as the current question. The third natural language model then generates the corresponding answer information from the context information and the question.

The third natural language model can have the same network architecture as the first natural language model but different model parameters, or it can differ from the first natural language model in both network architecture and model parameters; this is not limited here.

In this embodiment, the target text represented by the target embedding vector with the highest first similarity to the embedding vector of the query statement is used as context information and input into the third natural language model together with the query statement. This provides richer context and background information, which improves the accuracy, coherence, and comprehension of the model and yields better answer information.
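Combining the target text with the query statement can be sketched as simple prompt assembly. The prompt layout is an assumption (the source does not specify one), and `model` is any callable from string to string standing in for the third natural language model.

```python
def build_prompt(context, question):
    """Place the retrieved target text before the user's question (assumed layout)."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def answer(question, target_text, model):
    """model: hypothetical callable str -> str standing in for the third model."""
    return model(build_prompt(target_text, question))
```

Keeping prompt construction separate from the model call makes it straightforward to swap in a different model, which matches the statement that the third model's architecture is not limited.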

Based on any of the foregoing embodiments, determining the literature cited by the project README file and obtaining a literature file in a preset format corresponding to the literature includes:

The meta-information of a piece of literature is the information that describes and identifies it, to facilitate storage, management, retrieval, and use. Meta-information may include the author names, title, publication date, journal name, volume number, issue number, page numbers, DOI, abstract, keywords, and so on.

In this embodiment, regular expressions are used to determine the literature cited by the project README file. Specifically, a suitable preset regular expression can be defined according to the characteristics of citation text and used to match that text. For example, a preset regular expression can match meta-information such as the title, authors, conference or journal name, and year.

The matched citation text is then parsed further to obtain its meta-information, such as the title, authors, conference or journal name, and year. Regular expressions or other text analysis techniques can be used to extract this information.
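Matching and parsing citation text can be sketched with a single regular expression using named groups. The citation layout assumed here (`Authors. "Title". Venue, Year.`) is only one of many styles found in README files, and both the pattern and the sample citation in the usage note are invented for illustration.

```python
import re

# Assumed citation layout: Authors. "Title". Venue, Year.
CITATION = re.compile(
    r'^(?P<authors>[^"]+?)\.\s+"(?P<title>[^"]+)"\.\s+(?P<venue>[^,]+),\s+(?P<year>\d{4})\.?$'
)
# Leading list markers such as "-", "*", "[1]" or "1." are stripped before matching.
LIST_MARKER = re.compile(r"^\s*(?:[-*]|\[\d+\]|\d+\.)\s*")

def parse_citations(readme_text):
    """Return one meta-information dict per README line that matches the pattern."""
    results = []
    for line in readme_text.splitlines():
        match = CITATION.match(LIST_MARKER.sub("", line).strip())
        if match:
            results.append(match.groupdict())
    return results
```

For a line such as `[1] A. Author and B. Writer. "An Example Title". Example Conference, 2020.` (a made-up entry), the sketch yields the authors, title, venue, and year as separate fields. Citations in other styles would need additional patterns.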

From the parsed meta-information, a query request can be constructed to obtain the text file in the preset format. The query request can include meta-information such as the title, authors, conference or journal name, and year. Once constructed, the query request is sent to the open-source literature library and the response is retrieved. Specifically, the request and the parsing of the response follow the API documentation or interface specification of the open-source literature library, so as to obtain the required text file in the preset format.
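Constructing the query request can be sketched with the standard library's URL encoding. The endpoint and parameter names below are purely hypothetical placeholders; a real literature library defines its own in its API documentation, and sending the request and parsing the response are left out because they depend on that interface.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://literature.example.org/api/search"

def build_query_url(meta, base_url=BASE_URL):
    """Build a GET URL from the non-empty meta-information fields."""
    params = {key: value for key, value in meta.items() if value}
    return f"{base_url}?{urlencode(params)}"
```

Dropping empty fields keeps the query valid when only part of the meta-information could be parsed from the citation text.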

The code parsing device provided by the present invention is described below. The code parsing device described below and the code parsing method described above may be referred to in correspondence with each other.

FIG. 2 is a first structural schematic diagram of the code parsing device provided by the present invention. As shown in FIG. 2, the device includes: a first code parsing module 210, configured to determine, according to the project README file, the code files in the project that need to be run, and to divide the code in the code files into fragments with the function as the smallest unit, obtaining a plurality of code segments; a second code parsing module 220, configured to determine the code hierarchy corresponding to the plurality of code segments according to the calling relationships of the functions in the code files; and a third code parsing module 230, configured to input the plurality of code segments in sequence, in order of code level from low to high, into the first natural language model according to a first preset instruction format, obtaining the code comment corresponding to each code segment. The first preset instruction format is used to instruct the first natural language model to generate the code comment of a code segment, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

The code parsing device provided by the present invention first determines, according to the project README file, the code files in the project that need to be run, and divides the code in those files into fragments with the function as the smallest unit, obtaining a plurality of code segments. It then determines the code hierarchy of the code segments from the calling relationships among the functions in the code files, so that the hierarchy provides a bottom-up logical structure. Finally, in order of code level from low to high, the code segments are input in sequence into the first natural language model according to the first preset instruction format. Feeding the code segments in this order lets the first natural language model better understand the context and logical flow of the code, so that it can use the code hierarchy among the code segments to generate more accurate comments, enabling rapid code interpretation.
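One way the low-to-high feeding order could let the model use calling relationships is to include the comments already generated for a segment's callees in that segment's instruction. The sketch below illustrates this idea only; the instruction wording is an assumed format (the source does not disclose the first preset instruction format), and `model` is a hypothetical callable from string to string standing in for the first natural language model.

```python
def build_instruction(code, callee_comments):
    """Assemble an instruction for one code segment (assumed format)."""
    lines = ["Write a concise comment for the function below."]
    if callee_comments:
        lines.append("It calls these already-annotated functions:")
        lines += [f"- {name}: {comment}" for name, comment in sorted(callee_comments.items())]
    lines += ["Function:", code]
    return "\n".join(lines)

def annotate(segments, graph, order, model):
    """Annotate code segments from the lowest code level upward.

    segments: {name: code}; graph: {name: set of callee names};
    order: segment names sorted from the lowest level to the highest.
    """
    comments = {}
    for name in order:
        # Only callees annotated earlier in the order contribute context.
        known = {callee: comments[callee] for callee in graph[name] if callee in comments}
        comments[name] = model(build_instruction(segments[name], known))
    return comments
```

Because lower-level segments are processed first, every callee comment is available by the time its caller is annotated, except within recursion cycles.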

In some embodiments, the code parsing device further includes a fourth code parsing module, configured to: determine the literature cited by the project README file and obtain a literature file in a preset format corresponding to the literature; divide the content of the literature file into a plurality of sentences according to preset delimiters; input the sentences in sequence into a second natural language model according to a second preset instruction format to obtain the literature answer corresponding to each sentence, the second preset instruction format being used to instruct the second natural language model to generate the literature answer corresponding to a sentence; and establish an association between a code segment and a sentence based on the text similarity between the code comment corresponding to the code segment and the literature answer corresponding to the sentence, to assist code interpretation.

In some embodiments, the code parsing device further includes a fifth code parsing module, configured to: divide the text information in the project README file into a plurality of text blocks; obtain an embedding vector set that includes the embedding vector of each code segment, the embedding vector of each sentence, and the embedding vector of each text block; upon receiving a query statement from a user, obtain the embedding vector of the query statement; select from the embedding vector set, according to the first similarity between the embedding vector of the query statement and each embedding vector in the set, the target embedding vector with the highest first similarity; and obtain the answer information corresponding to the query statement according to the target embedding vector.

In some embodiments, the fifth code parsing module is further configured to: obtain the target text represented by the target embedding vector; and input the target text, as context information, together with the query statement into a third natural language model to obtain the answer information corresponding to the query statement.

In some embodiments, the first code parsing module is further configured to: match the citation text in the project README file based on a preset regular expression; parse the meta-information of the cited literature from the citation text; and obtain a text file in a preset format from an open-source literature library based on the meta-information.

In some embodiments, the third code parsing module is further configured to: crawl original code segments that carry original code comments; input the original code segments into an initial natural language model to obtain the predicted comments output by the initial natural language model; and update the model parameters of the initial natural language model according to a second similarity between the predicted comments and the original code comments until the second similarity meets a preset similarity requirement, obtaining the trained first natural language model.

FIG. 3 illustrates a schematic diagram of the physical structure of an electronic device. As shown in FIG. 3, the electronic device may include a processor 310, a communications interface 320, a memory 330, and a communication bus 340, where the processor 310, the communications interface 320, and the memory 330 communicate with one another through the communication bus 340. The processor 310 may call logic instructions in the memory 330 to execute the code parsing method applied to a server-side device, the method including: determining, according to the project README file, the code files in the project that need to be run, and dividing the code in the code files into fragments with the function as the smallest unit, obtaining a plurality of code segments; determining the code hierarchy corresponding to the plurality of code segments according to the calling relationships of the functions in the code files; and inputting the plurality of code segments in sequence, in order of code level from low to high, into a first natural language model according to a first preset instruction format, obtaining the code comment corresponding to each code segment, where the first preset instruction format is used to instruct the first natural language model to generate the code comment of a code segment, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

In addition, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.

In another aspect, the present invention further provides a computer program product that includes a computer program, which may be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the code parsing method applied to a server-side device provided by the methods above, the method including: determining, according to the project README file, the code files in the project that need to be run, and dividing the code in the code files into fragments with the function as the smallest unit, obtaining a plurality of code segments; determining the code hierarchy corresponding to the plurality of code segments according to the calling relationships of the functions in the code files; and inputting the plurality of code segments in sequence, in order of code level from low to high, into a first natural language model according to a first preset instruction format, obtaining the code comment corresponding to each code segment, where the first preset instruction format is used to instruct the first natural language model to generate the code comment of a code segment, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the code parsing method applied to a server-side device provided by the methods above, the method including: determining, according to the project README file, the code files in the project that need to be run, and dividing the code in the code files into fragments with the function as the smallest unit, obtaining a plurality of code segments; determining the code hierarchy corresponding to the plurality of code segments according to the calling relationships of the functions in the code files; and inputting the plurality of code segments in sequence, in order of code level from low to high, into a first natural language model according to a first preset instruction format, obtaining the code comment corresponding to each code segment, where the first preset instruction format is used to instruct the first natural language model to generate the code comment of a code segment, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.

From the description of the embodiments above, those skilled in the art can clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A code parsing method, the method comprising:

determining, according to a project README file, a code file in the project that needs to be run, and dividing the code in the code file into fragments with a function as the smallest unit, to obtain a plurality of code segments;

determining the code hierarchy corresponding to the plurality of code segments according to the calling relationships of the functions in the code file;

inputting the plurality of code segments in sequence, in order of code level from low to high, into a first natural language model according to a first preset instruction format, to obtain code comments corresponding to the code segments, wherein the first preset instruction format is used for instructing the first natural language model to generate the code comments of the code segments, and the first natural language model is obtained by training an initial natural language model on original code segments labeled with original code comments.

2. The code parsing method of claim 1, wherein the method further comprises:

determining the literature cited by the project README file, and acquiring a literature file in a preset format corresponding to the literature;

dividing the content of the literature file into a plurality of sentences according to preset delimiters;

inputting the plurality of sentences in sequence into a second natural language model according to a second preset instruction format, to obtain literature answers corresponding to the sentences, wherein the second preset instruction format is used for instructing the second natural language model to generate the literature answers corresponding to the sentences;

establishing an association between a code segment and a sentence based on the text similarity between the code comment corresponding to the code segment and the literature answer corresponding to the sentence, to assist code interpretation.

3. The code parsing method of claim 2, wherein the method further comprises:

Carrying out blocking processing on text information in the project self-contained file to obtain a plurality of text blocks;

acquiring an embedded vector set, wherein the embedded vector set comprises embedded vectors of all the code segments, embedded vectors of all sentences and embedded vectors of all the text blocks;

acquiring, when a query statement from a user is received, an embedded vector of the query statement;

selecting, from the embedded vector set, a target embedded vector with the highest first similarity according to the first similarity between the embedded vector of the query statement and each embedded vector in the embedded vector set;

and acquiring answer information corresponding to the query statement according to the target embedded vector.
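The retrieval step of claim 3 could be sketched as below. To stay self-contained, the sketch swaps in a toy bag-of-words count vector for a learned embedding model and uses cosine similarity as the first similarity; both are assumptions, as are the names `embed`, `cosine`, `build_index` and `retrieve`.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: token counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(texts: List[str]) -> List[Tuple[str, Counter]]:
    """Embed code segments, sentences and text blocks into one vector set."""
    return [(text, embed(text)) for text in texts]


def retrieve(query: str, index: List[Tuple[str, Counter]]) -> Tuple[str, float]:
    """Return the indexed text whose vector is most similar to the query."""
    if not index:
        raise ValueError("the embedded vector set is empty")
    q = embed(query)
    text, vec = max(index, key=lambda item: cosine(q, item[1]))
    return text, cosine(q, vec)
```

Keeping the source text next to each vector makes it straightforward to recover the text that the target vector represents.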

4. The code parsing method according to claim 3, wherein acquiring the answer information corresponding to the query statement according to the target embedded vector comprises:

acquiring a target text represented by the target embedded vector;

taking the target text as context information, and inputting the context information and the query statement into a third natural language model to obtain the answer information corresponding to the query statement.
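A minimal sketch of this step, assuming the third natural language model is exposed as a plain text-in, text-out callable. The function name `answer` and the prompt layout are hypothetical; the claim only requires that the target text and the query be supplied together.

```python
from typing import Callable


def answer(query: str, target_text: str, model: Callable[[str], str]) -> str:
    """Send the retrieved target text as context together with the query."""
    # Hypothetical prompt layout; any format carrying both parts would do.
    prompt = (
        "Answer the question using only the context.\n"
        f"Context:\n{target_text}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )
    return model(prompt)
```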

5. The code parsing method according to claim 2, wherein determining the document cited by the project from the project self-description file and acquiring the document file in the preset format corresponding to the document comprises:

matching, based on a preset regular expression, text information of the cited document in the project self-description file;

parsing meta information of the cited document based on the text information;

acquiring the document file in the preset format from an open-source document library based on the meta information.
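One way to picture the matching and parsing steps is the sketch below. It assumes, purely for illustration, that citations appear in the self-description file as arXiv links and that arXiv serves as the open-source document library with PDF as the preset format; the claim names no specific library, pattern or format. The function only builds the download address and performs no network access.

```python
import re
from typing import Dict, List

# Assumed "preset regular expression": arXiv abs/pdf links with new-style IDs.
ARXIV_RE = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(?P<id>\d{4}\.\d{4,5})(?P<version>v\d+)?",
    re.IGNORECASE,
)


def find_cited_documents(readme: str) -> List[Dict[str, str]]:
    """Match citations in the README text and parse their meta information."""
    seen = set()
    documents: List[Dict[str, str]] = []
    for match in ARXIV_RE.finditer(readme):
        paper_id = match["id"]
        if paper_id in seen:  # keep the first mention of each paper only
            continue
        seen.add(paper_id)
        documents.append(
            {
                "id": paper_id,
                "version": match["version"] or "",
                "pdf_url": f"https://arxiv.org/pdf/{paper_id}",
            }
        )
    return documents
```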

6. The code parsing method of claim 1, wherein the first natural language model is trained by:

crawling original code segments with original code annotations;

inputting an original code segment into an initial natural language model to obtain a predicted annotation output by the initial natural language model;

updating the model parameters of the initial natural language model according to a second similarity between the predicted annotation and the original code annotation until the second similarity meets a preset similarity requirement, so as to obtain the trained first natural language model.
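The control flow of claim 6 (predict, compare, update, stop once the similarity requirement is met) could be sketched as follows. This is not a real training procedure: `MemorizingModel` is a hypothetical stand-in whose only "parameters" are a lookup table, `difflib.SequenceMatcher` is an assumed choice for the second similarity, and the threshold and epoch limit are arbitrary. A real implementation would derive a loss from the comparison and update the weights by gradient descent.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Protocol, Tuple


class AnnotationModel(Protocol):
    def predict(self, code: str) -> str: ...
    def update(self, code: str, reference: str) -> None: ...


class MemorizingModel:
    """Stand-in model: it memorizes references instead of learning weights."""

    def __init__(self) -> None:
        self.table: Dict[str, str] = {}

    def predict(self, code: str) -> str:
        return self.table.get(code, "")

    def update(self, code: str, reference: str) -> None:
        self.table[code] = reference


def train(
    model: AnnotationModel,
    pairs: List[Tuple[str, str]],   # (original code segment, original annotation)
    required: float = 0.9,          # assumed preset similarity requirement
    max_epochs: int = 10,           # assumed safety limit
) -> float:
    """Update the model until the mean similarity meets the requirement."""
    if not pairs:
        raise ValueError("no training pairs")
    score = 0.0
    for _ in range(max_epochs):
        sims = [
            SequenceMatcher(None, model.predict(code), ref).ratio()
            for code, ref in pairs
        ]
        score = sum(sims) / len(sims)
        if score >= required:
            break
        for (code, ref), sim in zip(pairs, sims):
            if sim < required:
                model.update(code, ref)
    return score
```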

7. A code parsing apparatus, the apparatus comprising:

a first code analysis module, configured to determine, according to a project self-description file, a code file to be run in the project, divide the code in the code file into segments with a function as the minimum unit, and obtain a plurality of code segments;

a second code analysis module, configured to determine the code hierarchy relation corresponding to the plurality of code segments according to the calling relation of the functions in the code file;

a third code analysis module, configured to sequentially input the plurality of code segments, in order of code level from low to high and according to a first preset instruction format, into a first natural language model to obtain code annotations corresponding to the code segments, wherein the first preset instruction format is used to instruct the first natural language model to generate the code annotations of the code segments, and the first natural language model is obtained by training an initial natural language model on original code segments with original code annotations.

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the code parsing method of any one of claims 1 to 6.

9. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the code parsing method according to any one of claims 1 to 6.

10. A computer program product comprising a computer program which, when executed by a processor, implements the code parsing method according to any one of claims 1 to 6.