CN118733450A

CN118733450A - A model testing method, device, equipment and medium

Info

Publication number: CN118733450A
Application number: CN202410788714.7A
Authority: CN
Inventors: 吴沁芸; 彭超; 高鹏飞; 甘浩宇; 江波; 邓志文; 管占明; 汤锦禾
Original assignee: Douyin Vision Co Ltd
Current assignee: Douyin Vision Co Ltd
Priority date: 2024-06-18
Filing date: 2024-06-18
Publication date: 2024-10-01

Abstract

Embodiments of the present disclosure relate to a model testing method, device, equipment and medium. The method comprises: constructing test questions based on standard answers in a basic data set; constructing a test data set based on the test questions and a prompt template; obtaining a target model, and inputting the prompts in the test data set into the target model to obtain the corresponding answers to be tested; performing unit tests on the answers to be tested to determine a unit test result; and determining a model test result of the target model based on the unit test result. Because the unit test result serves as the test indicator of the target model, the method takes into account that output in special scenarios such as code processing must actually be runnable. Compared with the prior art, it can therefore reflect the real capability of the model more objectively and effectively and greatly improve the accuracy of model testing.

Description

A model testing method, device, equipment and medium

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a model testing method, device, equipment and medium.

Background Art

As technology develops, large language models are increasingly applied to code. In the related art, models are tested mainly with either non-real-scenario data or real-scenario data. Tests on non-real-scenario data differ considerably from real scenarios. Tests on real-scenario data usually judge whether a model result is correct by its similarity to the correct answer, which in special scenarios such as code processing cannot reflect model capability effectively and objectively. Model testing in the related art therefore struggles to be both realistic and effective, and needs improvement.

Summary of the Invention

To solve the above technical problem, the present disclosure provides a model testing method, device, equipment and medium.

An embodiment of the present disclosure provides a model testing method, the method comprising:

constructing test questions based on standard answers in a basic data set;

constructing a test data set based on the test questions and a prompt template;

obtaining a target model, and inputting the prompts in the test data set into the target model to obtain the corresponding answers to be tested;

performing unit tests on the answers to be tested to determine a unit test result; and

determining a model test result of the target model based on the unit test result.

An embodiment of the present disclosure further provides a model testing device, the device comprising:

a question module, configured to construct test questions based on standard answers in a basic data set;

a data set module, configured to construct a test data set based on the test questions and a prompt template;

an input module, configured to obtain a target model and input the prompts in the test data set into the target model to obtain the corresponding answers to be tested;

a unit test module, configured to perform unit tests on the answers to be tested to determine a unit test result; and

a result module, configured to determine a model test result of the target model based on the unit test result.

An embodiment of the present disclosure further provides an electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to read the executable instructions from the memory and execute them to implement the model testing method provided by the embodiments of the present disclosure.

An embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program, the computer program being used to execute the model testing method provided by the embodiments of the present disclosure.

Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has the following advantages. The model testing solution constructs test questions based on standard answers in a basic data set; constructs a test data set based on the test questions and a prompt template; obtains a target model and inputs the prompts in the test data set into the target model to obtain the corresponding answers to be tested; performs unit tests on the answers to be tested to determine a unit test result; and determines a model test result of the target model based on the unit test result. In this solution, the test data set is built from the basic data set, the answers that the target model outputs for that test data set are unit tested, and the unit test result is used as the test indicator of the target model. Compared with the prior art, this takes into account that output in special scenarios such as code processing must be runnable, so it can reflect the real capability of the model more objectively and effectively and greatly improve the accuracy of model testing.

Brief Description of the Drawings

The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.

FIG. 1 is a schematic flow chart of a model testing method provided by some embodiments of the present disclosure;

FIG. 2 is a schematic flow chart of another model testing method provided by some embodiments of the present disclosure;

FIG. 3 is a schematic diagram of a model testing process provided by some embodiments of the present disclosure;

FIG. 4 is a schematic diagram of another model testing process provided by some embodiments of the present disclosure;

FIG. 5 is a schematic structural diagram of a model testing device provided by some embodiments of the present disclosure;

FIG. 6 is a schematic structural diagram of an electronic device provided by some embodiments of the present disclosure.

Detailed Description

Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. The drawings and embodiments are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.

It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, the method embodiments may include additional steps and/or omit some of the steps shown. The scope of the present disclosure is not limited in this respect.

As used herein, the term "including" and its variants are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different devices, modules or units, and are not intended to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.

It should be noted that the modifiers "one" and "a plurality of" in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of those messages or information.

In the related art, one way to test code-related models is to use data sets built from non-real scenarios. For example, a data set of basic algorithm problems can be used to measure the pass rate of the code a model generates when it is given the complete context of a function signature and a docstring. This is essentially a scaled-down version of a programming scenario and differs considerably from real development. Another way is to use data from real development repositories, but the test method judges whether the model result is correct by the text similarity score between the model's inference result and the correct answer. That evaluation method ignores the characteristics of code itself and cannot evaluate model capability effectively and objectively. Testing of code-related models in the related art therefore struggles to be both realistic and effective, and needs improvement.

To solve the above problem, embodiments of the present disclosure provide a model testing method, which is described below with reference to specific embodiments.

FIG. 1 is a schematic flow chart of a model testing method provided by some embodiments of the present disclosure. The method may be performed by a model testing device, which may be implemented in software and/or hardware and can generally be integrated into an electronic device. As shown in FIG. 1, the method includes:

Step 101: Construct test questions based on standard answers in the basic data set.

The model testing method of the embodiments of the present disclosure tests a target model to determine its processing capability or processing effect. The target model may be any model whose capability needs to be tested and for which it matters whether the output actually runs and passes. The embodiments take a code processing model as an example. The code processing model is not specifically limited; for example, it may include a code completion model, a unit test generation model, a code optimization model, and the like.

The basic data set may be a data set obtained by filtering existing model training data repositories. Repositories may be filtered by creation time, favorite count, unit test files, and so on. When filtering by creation time, repositories created after a preset time are selected; the preset time can be set according to the actual situation, for example to a historical time close to the current time. To ensure the quality of the test data, repositories whose favorite count is greater than a preset number may be extracted; the preset number can be set according to actual needs, for example 100. When filtering by unit test files, repositories that contain unit test files are selected, so that the selected data has a chance of passing unit tests. Unit test files may include workflow files, test files, and the like.
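The repository filter described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Repo` record, its field names, the `.github/workflows/` path convention and the file-name heuristic for test files are all hypothetical. Only the three criteria (creation time after a preset time, favorite count above a preset number such as 100, presence of unit test files) come from the text, and the sketch happens to apply all three together.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Repo:
    """Hypothetical record describing one candidate repository."""
    name: str
    created: date
    favorites: int
    files: list = field(default_factory=list)


def has_unit_test_files(repo: Repo) -> bool:
    """Assumed heuristic: a workflow file, or a file whose name contains 'test'."""
    for path in repo.files:
        file_name = path.rsplit("/", 1)[-1].lower()
        if path.startswith(".github/workflows/") or "test" in file_name:
            return True
    return False


def select_repos(repos, created_after: date, min_favorites: int = 100):
    """Keep repositories that satisfy all three criteria named in the text."""
    return [
        repo for repo in repos
        if repo.created > created_after
        and repo.favorites > min_favorites
        and has_unit_test_files(repo)
    ]


# Made-up sample data, for illustration only.
candidates = [
    Repo("recent-popular", date(2024, 3, 1), 250, ["src/app.py", "tests/test_app.py"]),
    Repo("too-old", date(2021, 1, 1), 900, ["tests/test_core.py"]),
    Repo("few-favorites", date(2024, 4, 1), 12, [".github/workflows/ci.yml"]),
    Repo("no-tests", date(2024, 5, 1), 400, ["src/main.py"]),
]
selected = select_repos(candidates, created_after=date(2024, 1, 1))
print([repo.name for repo in selected])
```

Keeping each criterion as a separate condition makes it easy to relax one of them, for example lowering `min_favorites`, without touching the others.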

The basic data set may include multiple standard answers, each of which may be a separate, complete item of data in the basic data set. For example, when the target model is a code processing model, each standard answer may be a complete piece of code. A test question is a question designed around the function of the target model so that the model can be tested. For example, when the target model is a code processing model, a test question may be a code context.

Specifically, the model testing device may obtain the basic data set and construct test questions based on the standard answers in it and the function of the target model. There may be multiple test questions.

By way of example, FIG. 2 is a schematic flow chart of another model testing method provided by some embodiments of the present disclosure. As shown in FIG. 2, in one feasible implementation, constructing test questions based on the standard answers in the basic data set may include:

Step 201: Construct a unit test environment based on the unit test environment data in the basic data set.

The unit test environment data may include the environment configuration data and environment configuration instructions for running unit tests, extracted from the unit test execution configuration file, which may be located in the workflow file. The unit test environment is an environment for running unit tests. Specifically, a container engine may be used to build an independent image from the instructions in an image build command text; this image provides the basic runtime environment and the code to run for unit testing. Unit testing is the process of checking and verifying the smallest testable units of an application. These units may be independent operations or functions that can be executed on their own without depending on other parts of the code.

Specifically, the model testing device may look up the workflow file in the basic data set, extract the unit test environment data from it, and then build the unit test environment from that data for later use.

Step 202: Run unit tests on the standard answers in the basic data set in the unit test environment to obtain the unit test coverage.

Unit test coverage indicates the proportion or extent of the application's code that is exercised by tests; for example, it may include line coverage, function coverage, statement coverage, and so on. After building the unit test environment, the model testing device may run unit tests in it on each standard answer in the basic data set and then obtain the unit test coverage.

Step 203: Construct test questions based on the processable answers determined from the unit test coverage and the processable positions in those answers.

A processable answer is a standard answer that is covered by unit tests and from which a question matching the function of the target model can be built through a modification such as hole punching (removing a piece of code) or injection. A processable position may be a code position in a processable answer that is covered by unit tests. For example, when the target model is a code completion model, a processable answer is a standard answer covered by unit tests, and a processable position is any code line covered by unit tests; when the target model is a unit test generation model or a code optimization model, a processable answer is a standard answer covered by unit tests, and a processable position is the location of a function.

The model testing device parses the unit test coverage, extracts one or more standard answers covered by unit tests as processable answers, determines one or more covered positions in each processable answer as processable positions, and then constructs test questions from the processable answers and processable positions. There may be multiple test questions, and their number is not limited. By way of example, when the target model is a code completion model, a hole may be punched at any processable position in a processable answer to obtain a test question; when the target model is a unit test generation model, a hole may be punched at a function in a processable answer to obtain a test question; when the target model is a code optimization model, a bug may be injected at any processable position in a processable answer to obtain a test question.
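For the code completion case, constructing a question by removing one covered line can be sketched in Python as follows. This is only an illustration under stated assumptions: the `CompletionQuestion` container, the function names `punch_hole` and `fill_hole`, and the representation of coverage as a set of 1-based line numbers are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class CompletionQuestion:
    """Hypothetical container for one code completion test question."""
    prefix_lines: list   # code before the hole
    suffix_lines: list   # code after the hole
    removed_line: str    # the line taken out of the standard answer


def punch_hole(source: str, covered_lines: set, target_line: int) -> CompletionQuestion:
    """Remove one unit-test-covered line (1-based) from a standard answer."""
    lines = source.splitlines()
    if not 1 <= target_line <= len(lines):
        raise ValueError("target_line is outside the source")
    if target_line not in covered_lines:
        raise ValueError("target_line is not covered by unit tests")
    return CompletionQuestion(
        prefix_lines=lines[:target_line - 1],
        suffix_lines=lines[target_line:],
        removed_line=lines[target_line - 1],
    )


def fill_hole(question: CompletionQuestion, answer: str) -> str:
    """Put a model's answer back into the hole so the code can be unit tested."""
    return "\n".join(question.prefix_lines + [answer] + question.suffix_lines)


# Made-up standard answer; lines 2 and 3 are assumed to be covered by unit tests.
standard_answer = "def add(a, b):\n    total = a + b\n    return total"
question = punch_hole(standard_answer, covered_lines={2, 3}, target_line=2)
restored = fill_hole(question, "    total = a + b")
```

Restricting holes to covered lines matters because only then can a later unit test run distinguish a correct completion from an incorrect one.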

In the above solution, for a code processing model, a unit test environment is built, unit tests are run on the standard answers in the basic data set, and test questions are constructed from the unit test execution results.

Step 102: Construct a test data set based on the test questions and a prompt template.

A prompt template is a template that configures what a prompt contains and where each part is placed, and is used to generate prompts for the target model. A prompt is context information produced by converting the user's requirement precisely into a form that the model can understand and process effectively. The test data set is a data set used to test the capability of the target model. It may include multiple prompts, each generated from one test question and a prompt template.

In some embodiments, constructing the test data set based on the test questions and the prompt template may include: generating corresponding prompts from the test questions according to at least one prompt template; and combining the at least one prompt to obtain the test data set.

There may be one or more prompt templates, and different prompt templates may include different content. For example, one prompt template may include only the test question, and another may include the test question together with information associated with it. In the embodiments of the present disclosure, the target model includes a code processing model, the test question includes a code context, and the associated information includes a code snippet similar to the code context. The code context may include the preceding code and/or the following code. The similar code snippet may be the snippet with the highest similarity to the code context, selected from the model training data repository in which the code context is located.

After constructing the test questions, the model testing device may obtain at least one prompt template and, following the configuration of each template, combine a test question with other related information to generate the corresponding prompt. Multiple test questions thus yield multiple prompts for each prompt template, and the prompts of all the prompt templates are combined to form the test data set. Unlike the related art, the test data set of the embodiments of the present disclosure does not need standard answers; prompts alone are sufficient.
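The combination of test questions and templates can be sketched in Python as follows. The template wording, the template names `context_only` and `context_with_similar`, and the question dictionary layout are assumptions made for illustration; the text only states that one template may contain the test question alone and another may add a similar code snippet.

```python
# Hypothetical templates: one with the code context only, one that also
# includes a similar code snippet from the same repository.
PROMPT_TEMPLATES = {
    "context_only": "Complete the missing code.\n\n{context}",
    "context_with_similar": (
        "A similar snippet from the same repository:\n{similar}\n\n"
        "Complete the missing code.\n\n{context}"
    ),
}


def build_test_data_set(questions, templates):
    """Generate one prompt per (test question, prompt template) pair."""
    data_set = []
    for question in questions:
        for template_name, template in templates.items():
            data_set.append({
                "question_id": question["id"],
                "template": template_name,
                # str.format ignores keyword arguments a template does not use.
                "prompt": template.format(
                    context=question["context"],
                    similar=question["similar"],
                ),
            })
    return data_set


# Made-up test questions, for illustration only.
questions = [
    {
        "id": "q1",
        "context": "def area(r):\n    return",
        "similar": "def perimeter(r):\n    return 2 * 3.14159 * r",
    },
    {
        "id": "q2",
        "context": "def is_even(n):\n    return",
        "similar": "def is_odd(n):\n    return n % 2 == 1",
    },
]
test_data_set = build_test_data_set(questions, PROMPT_TEMPLATES)
```

Recording the template name next to each prompt keeps it possible to group results by prompt template later. Note that no standard answer is stored in the data set, in line with the text.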

Step 103: Obtain a target model, and input the prompts in the test data set into the target model to obtain the corresponding answers to be tested.

The target model may be any model whose capability needs to be tested and for which it matters whether the output actually runs and passes. The embodiments of the present disclosure take a code processing model as an example. The code processing model is not specifically limited; for example, it may include a code completion model, a unit test generation model, a code optimization model, and the like.

An answer to be tested is the answer that the target model computes for the test question in a prompt of the test data set. Its correctness still needs to be verified. The number of answers to be tested equals the number of prompts and may be more than one.

When a tester needs to test the capability of a target model, the target model may be sent to the model testing device. After obtaining the target model, the model testing device may take the test data set that has already been created and input its prompts one by one into the target model to obtain the corresponding answers to be tested. It can be understood that when the same test question yields two prompts under two different prompt templates, the two resulting answers to be tested may differ. For example, suppose the target model is a code completion model and the test question is a code context. If one prompt template includes only the code context, the answer to be tested (the completion of the code context) may be a function definition; if another prompt template includes the code context and a similar code snippet, the answer to be tested may be a cross-file function definition.

Step 104: Perform unit tests on the answers to be tested to determine the unit test result.

Unit testing is the process of checking and verifying the smallest testable units of an application. These units can be executed on their own without depending on other parts of the code. The purpose of unit testing is to ensure that each unit works as expected and handles input and output correctly when interacting with other units. The unit test result may include the unit test pass rate, the unit test coverage, the compilation and unit test run logs, and so on.

Specifically, the model testing device may use cloud computing services to run unit tests on each answer to be tested and determine the unit test result.

In some embodiments, performing unit tests on the answers to be tested to determine the unit test result includes: constructing multiple unit test environments using multiple services; and performing unit tests on the answers to be tested in the multiple unit test environments to determine the unit test result.

Here, a service may be a cloud computing service that executes in response to events without any servers having to be managed. In the embodiments of the present disclosure, the services run unit tests on the answers to be tested and report the results. There may be multiple services; each service runs unit tests for one answer to be tested, or alternatively for the answers to be tested that correspond to one model training data repository. When running unit tests, the model testing device may distribute the answers to be tested and the tasks to multiple services. Each service uses a container engine to build a container image as its unit test environment, which provides an independent runtime environment and a relatively long running time. The multiple unit test environments then unit test the answers to be tested asynchronously to obtain the complete unit test result. Running unit tests asynchronously through services effectively improves the efficiency of unit testing.
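The fan-out of unit test tasks can be sketched locally in Python as follows. This is a stand-in under stated assumptions, not the disclosed architecture: a thread pool plays the role of the multiple services, and a separate Python subprocess per task plays the role of the container-based unit test environment. The task dictionary layout and the names `run_unit_test` and `run_all` are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_unit_test(task: dict) -> dict:
    """Run one answer together with its unit test in an isolated subprocess."""
    program = task["answer"] + "\n\n" + task["test_code"]
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=30,
        )
        passed = completed.returncode == 0
        log = completed.stderr
    except subprocess.TimeoutExpired:
        passed, log = False, "timeout"
    return {"id": task["id"], "passed": passed, "log": log}


def run_all(tasks, workers: int = 4):
    """Dispatch tasks to a pool of workers and collect results in task order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_unit_test, tasks))


# Made-up answers to be tested: the second one contains a bug.
tasks = [
    {"id": "q1", "answer": "def add(a, b):\n    return a + b",
     "test_code": "assert add(1, 2) == 3"},
    {"id": "q2", "answer": "def add(a, b):\n    return a - b",
     "test_code": "assert add(1, 2) == 3"},
]
results = run_all(tasks)
```

A real deployment along the lines of the text would replace the subprocess with a container image built per repository and the thread pool with event-driven cloud services; the shape of the reported result (pass flag plus run log) could stay the same.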

Step 105: Determine the model test result of the target model based on the unit test result.

The model test result is a test result that reflects the capability or accuracy of the target model. In the embodiments of the present disclosure, the test indicator for the target model may be set to the unit test pass rate, so the model test result can be expressed as a unit test pass rate.

In the embodiments of the present disclosure, when determining the model test result of the target model based on the unit test result, the model testing device may extract the unit test pass rate from the unit test result, divide it according to at least one test dimension, and determine the unit test pass rate of each test dimension as the model test result of the target model.

其中，测试维度可以是反应目标模型在某个方面的能力的具体分类，本公开实施例中测试维度可以包括如下至少一种：数据集、代码语言、提示词模板以及场景知识点等，数据集可以是将整个测试数据集作为一个测试维度，能够确定整体的单测通过率，场景知识点可以是根据目标模型的应用场景对应的分类确定的知识点。单测通过率可以是单元测试的过程中成功通过测试的单元占总的单元数量的比例，一个待测答案的单测通过率越高表示这个待测答案的准确性越高，针对测试数据集对应的全部的待测答案，如果整体的单测通过率越高表示更多的待测答案通过了验证，目标模型的能力越高。Among them, the test dimension can be a specific classification that reflects the ability of the target model in a certain aspect. In the embodiment of the present disclosure, the test dimension can include at least one of the following: data set, code language, prompt word template, and scenario knowledge point, etc. The data set can be the entire test data set as a test dimension, which can determine the overall single test pass rate. The scenario knowledge point can be a knowledge point determined according to the classification corresponding to the application scenario of the target model. The single test pass rate can be the proportion of units that successfully pass the test in the process of unit testing to the total number of units. The higher the single test pass rate of a test answer, the higher the accuracy of the test answer. For all the test answers corresponding to the test data set, if the overall single test pass rate is higher, it means that more test answers have passed the verification, and the ability of the target model is higher.

The model testing device may extract the unit test pass rate from the unit test results and compute the corresponding pass rate for each of the at least one test dimension. Specifically, when the test dimension is the data set, all answers to be tested are taken as a whole to compute the overall unit test pass rate; when the test dimension is the code language, the unit test pass rate is computed separately for each code language; when the test dimension is the prompt word template, it is computed separately for each prompt word template; and when the test dimension is the scenario knowledge point, it is computed separately for each scenario knowledge point. The model testing device may then determine the unit test pass rate of each test dimension as the model test result of the target model.
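The per-dimension calculation described above can be sketched as follows. This is a minimal illustration and not the implementation of the disclosure: the `UnitTestRecord` schema, the field names, and the sample records are all assumptions, and each answer to be tested is simplified to a single pass/fail outcome.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitTestRecord:
    """One answer to be tested and its unit test outcome (hypothetical schema)."""
    language: str          # code language dimension
    template: str          # prompt word template dimension
    knowledge_point: str   # scenario knowledge point dimension
    passed: bool           # True if the answer passed its unit tests


def pass_rate_by(records, dimension=None):
    """Return {group: pass rate}; dimension=None treats the data set as one group."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        key = "overall" if dimension is None else getattr(record, dimension)
        totals[key] += 1
        passes[key] += int(record.passed)
    return {key: passes[key] / totals[key] for key in totals}


# Made-up sample outcomes, used only to exercise the function.
records = [
    UnitTestRecord("python", "with_similar_snippets", "serialization", True),
    UnitTestRecord("python", "plain", "memory management", False),
    UnitTestRecord("go", "plain", "serialization", True),
    UnitTestRecord("go", "with_similar_snippets", "memory management", True),
]

# One pass-rate table per test dimension forms the model test result.
model_test_result = {
    "data_set": pass_rate_by(records),
    "code_language": pass_rate_by(records, "language"),
    "prompt_word_template": pass_rate_by(records, "template"),
    "scenario_knowledge_point": pass_rate_by(records, "knowledge_point"),
}
```

Keeping one flat record per answer means a new test dimension only requires a new field, not a new aggregation routine.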

In the above scheme, using the unit test pass rate as the evaluation indicator when testing the target model not only reflects the true capability of the target model more accurately and effectively, but also supports multi-dimensional model testing, which improves the diversity and flexibility of model testing.

The model testing scheme provided by the embodiments of the present disclosure constructs a test question based on a standard answer in a basic data set; constructs a test data set based on the test question and a prompt word template; obtains a target model and inputs the prompt words in the test data set into the target model to obtain the corresponding answers to be tested; performs unit tests on the answers to be tested to determine unit test results; and determines the model test result of the target model based on the unit test results. With this technical scheme, the test data set is built from the basic data set, the answers output by the target model for that test data set are unit tested, and the model test result is then determined from the unit test results, so that the unit test results serve as the test indicator of the target model. Compared with the prior art, this takes into account the fact that outputs in special scenarios such as code processing are executable, reflects the real capability of the model more objectively and effectively, and greatly improves the accuracy of model testing.

In some embodiments, after the test question is constructed based on the standard answer in the basic data set, the model testing method of the embodiments of the present disclosure may further include: establishing a correspondence between the test question and a scenario knowledge point. Optionally, establishing the correspondence between the test question and the scenario knowledge point may include: determining the background knowledge classification of the target model as initial knowledge points; and determining, based on the initial knowledge points, the test question, and an annotation model, the scenario knowledge point corresponding to the test question, and establishing the correspondence between the test question and the scenario knowledge point.

A scenario knowledge point is a knowledge point related to the application scenario of the target model, and may be determined from the classification corresponding to that application scenario. For example, when the target model is a code processing model, the scenario knowledge points may include memory management and buffer processing under programming basics, and configuration management and serialization under common tools; these are examples only. The annotation model is a model that labels which initial knowledge point a test question belongs to, and may be implemented with a large model.

In the embodiments of the present disclosure, after constructing a test question, the model testing device may establish a correspondence between the test question and a scenario knowledge point. Specifically, the device may first obtain the background knowledge classification of the target model and use it as the initial knowledge points, and then construct a prompt word from the initial knowledge points and the test question. Here the prompt word guides the annotation model to give the classification of the test question (that is, which initial knowledge point it belongs to) together with a reason. For example, the prompt word may be expressed as "Judge the current code snippet:\n{code_snippet}\nWhich of the following knowledge points does it belong to (if it is not in the list, answer Other-category name and give a reason):\n{categories}\nAnswer in the following format:\nCategory: [ ]\nReason: [ ]"; this is an example only. The prompt word is input into the annotation model, the initial knowledge point that the annotation model outputs for the test question is taken as the scenario knowledge point, and the correspondence between the test question and the scenario knowledge point is established. The scenario knowledge points can be used to measure the comprehensiveness of the test data set: the more scenario knowledge points the test questions cover, the more comprehensive the test data set is. Later, when the model test result of the target model is determined, the unit test pass rates corresponding to different scenario knowledge points can be obtained, which enables model testing along the knowledge point dimension.
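The annotation step above can be sketched as follows. This is only an illustration under stated assumptions: the English wording of the template is a rendering of the example prompt word, the function names are hypothetical, and `annotation_model` stands for any callable that takes a prompt string and returns the reply text of a large model.

```python
import re

# English rendering of the example prompt word; the exact wording is illustrative.
ANNOTATION_PROMPT = (
    "Judge the current code snippet:\n{code_snippet}\n"
    "Which of the following knowledge points does it belong to "
    "(if it is not in the list, answer Other-<category name> and give a reason):\n"
    "{categories}\n"
    "Answer in the following format:\nCategory: [ ]\nReason: [ ]"
)


def build_annotation_prompt(code_snippet, initial_knowledge_points):
    """Fill the template with one test question and the candidate knowledge points."""
    return ANNOTATION_PROMPT.format(
        code_snippet=code_snippet,
        categories="\n".join(initial_knowledge_points),
    )


def _field(name, reply):
    """Return the value after 'name:' on its own line, without brackets, or None."""
    match = re.search(rf"^{name}:[ \t]*(.*)$", reply, flags=re.MULTILINE)
    if match is None:
        return None
    return match.group(1).strip().strip("[]").strip()


def parse_annotation(reply):
    """Extract (category, reason) from a reply in the requested format."""
    return _field("Category", reply), _field("Reason", reply)


def annotate(test_questions, initial_knowledge_points, annotation_model):
    """Map each test question id to the knowledge point chosen by the model."""
    mapping = {}
    for question_id, code_snippet in test_questions.items():
        prompt = build_annotation_prompt(code_snippet, initial_knowledge_points)
        category, _reason = parse_annotation(annotation_model(prompt))
        mapping[question_id] = category
    return mapping
```

The number of distinct values in the returned mapping gives a rough measure of how many scenario knowledge points the test data set covers.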

The model testing scheme of the embodiments of the present disclosure is further described below through a specific example. By way of example, FIG. 3 is a schematic diagram of a model testing process provided by some embodiments of the present disclosure. FIG. 3 takes a code processing model as the target model and shows a model testing process with three stages: data set construction, model inference, and model testing. The data set construction stage may include repository screening, runtime environment configuration, coverage analysis, and test question construction, and finally builds the test data set from the test questions and at least one prompt word template. The model inference stage may be implemented by a model inference service: after a user initiates a test of the target model, the model inference service inputs the test data set into the target model for processing and obtains the answer to be tested for each prompt word in the test data set, which is the model inference result in the figure. The model testing stage may include obtaining the model inference results, executing unit tests, parsing the unit test results, and calculating indicators. Finally, based on the unit test pass rate indicator, the unit test pass rate of at least one test dimension is obtained as the model test result, which the user can retrieve.

By way of example, FIG. 4 is a schematic diagram of another model testing process provided by some embodiments of the present disclosure. FIG. 4 takes two target models, a unit test generation model and a unit test optimization model, as an example and shows how unit tests are executed in the model testing stage of FIG. 3. This may be carried out by the test processor in the figure. The test processor can distribute the answers to be tested and the test tasks to multiple cloud computing services, handle rate-limiting errors from the cloud computing services and retry, query the task execution status of the cloud computing services, and obtain their run results. Each cloud computing service uses a container engine to build a container image as a unit test environment. The repositories in the figure represent model training data repositories, and the test questions corresponding to the prompt words in the test data set belong to multiple repositories. When multiple target models are tested, a corresponding repository can be set for each target model, and one repository can be used to test one or more target models. The answers to be tested of different target models for the same repository are unit tested in unit test environments built by different cloud computing services, and the multiple unit test environments unit test the answers asynchronously to obtain the unit test results.
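The dispatch-and-retry behavior of the test processor can be sketched with `asyncio`. This is a simplified illustration, not the architecture of FIG. 4: `RateLimitError`, the round-robin assignment, the backoff parameters, and the fake services are all assumptions standing in for real cloud computing service clients.

```python
import asyncio


class RateLimitError(Exception):
    """Raised by a (hypothetical) cloud service client when it throttles a request."""


async def run_with_retry(run_unit_test, answer, max_attempts=4, base_delay=0.01):
    """Run one unit test task, retrying with exponential backoff when throttled."""
    for attempt in range(max_attempts):
        try:
            return await run_unit_test(answer)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            await asyncio.sleep(base_delay * 2 ** attempt)


async def dispatch(answers, services):
    """Assign answers to services round-robin and unit test them concurrently."""
    tasks = [
        run_with_retry(services[index % len(services)], answer)
        for index, answer in enumerate(answers)
    ]
    # gather returns results in the same order as the submitted answers.
    return await asyncio.gather(*tasks)


def make_fake_service(name):
    """Stand-in for a cloud service that throttles only its first request."""
    calls = {"count": 0}

    async def run_unit_test(answer):
        calls["count"] += 1
        if calls["count"] == 1:
            raise RateLimitError(name)
        return {"answer": answer, "service": name, "passed": True}

    return run_unit_test


results = asyncio.run(
    dispatch(["a1", "a2", "a3"], [make_fake_service("s1"), make_fake_service("s2")])
)
```

Because the retry logic wraps each task individually, a throttled request delays only its own answer while the other unit test environments keep running.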

This scheme provides a model testing scheme that uses the unit test pass rate as the evaluation indicator, so that the indicator accurately reflects the real capability of the model. It selects real development repositories and coding scenarios to improve the authenticity of the evaluation set, supports model testing along multiple dimensions such as language, knowledge point, and prompt word template, and supports model testing for multiple application scenarios, which improves the diversity of model testing.

FIG. 5 is a schematic structural diagram of a model testing apparatus provided by some embodiments of the present disclosure. The apparatus may be implemented in software and/or hardware and may generally be integrated in an electronic device. As shown in FIG. 5, the apparatus includes:

a question module 501, configured to construct a test question based on a standard answer in a basic data set;

a data set module 502, configured to construct a test data set based on the test question and a prompt word template;

an input module 503, configured to obtain a target model and input the prompt words in the test data set into the target model to obtain the corresponding answers to be tested;

a unit test module 504, configured to perform unit tests on the answers to be tested and determine unit test results;

a result module 505, configured to determine the model test result of the target model based on the unit test results.

Optionally, the question module 501 is configured to:

build a unit test environment based on the unit test environment data in the basic data set;

perform unit tests on the standard answers in the basic data set in the unit test environment to obtain the unit test coverage;

construct the test question based on the processable answers determined by the unit test coverage and the processable positions in the processable answers.
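One way to read the coverage-based construction is sketched below. This passage does not spell out how coverage determines processable answers and positions, so the sketch assumes that a standard answer is processable when its unit tests cover at least one non-blank line, and that each covered line is a processable position whose preceding code becomes the code context. The data layout and function name are hypothetical.

```python
def build_test_questions(standard_answers, covered_lines):
    """
    Build test questions from coverage data (illustrative assumptions only).

    standard_answers: {file_path: source text}
    covered_lines:    {file_path: set of 1-based line numbers run by unit tests}
    """
    questions = []
    for path, source in standard_answers.items():
        lines = source.splitlines()
        # Keep only covered, non-blank lines; files without any are skipped.
        positions = sorted(
            number
            for number in covered_lines.get(path, set())
            if 1 <= number <= len(lines) and lines[number - 1].strip()
        )
        for number in positions:
            questions.append({
                "file": path,
                "position": number,
                # Code before the position is the context given to the model.
                "code_context": "\n".join(lines[:number - 1]),
                # The covered line itself is held back as the standard answer.
                "expected_line": lines[number - 1],
            })
    return questions
```

Restricting questions to covered positions means every question has at least one existing unit test that can judge the answer produced by the model.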

Optionally, the data set module 502 is configured to:

generate corresponding prompt words from the test question according to at least one prompt word template;

combine the at least one prompt word to obtain the test data set.
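Generating prompt words from templates and combining them into a test data set can be sketched as follows. The two templates, their placeholder names (`{code_context}`, `{similar_snippets}`), and the question layout are assumptions made for illustration; the disclosure does not fix the template text.

```python
from itertools import product

# Hypothetical prompt word templates; the placeholder names are assumptions.
PROMPT_TEMPLATES = {
    "plain": "Complete the following code:\n{code_context}",
    "with_similar_snippets": (
        "Similar code for reference:\n{similar_snippets}\n"
        "Complete the following code:\n{code_context}"
    ),
}


def build_test_data_set(test_questions, templates=PROMPT_TEMPLATES):
    """Render every test question with every template and collect the prompt words."""
    data_set = []
    for question, (template_name, template) in product(test_questions, templates.items()):
        data_set.append({
            "question_id": question["id"],
            "template": template_name,
            # Unused keyword arguments are ignored by str.format.
            "prompt": template.format(
                code_context=question["code_context"],
                similar_snippets="\n".join(question.get("similar_snippets", [])),
            ),
        })
    return data_set
```

Recording the template name next to each prompt word is what later allows the pass rate to be split along the prompt word template dimension.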

Optionally, the result module 505 is configured to:

extract the unit test pass rate from the unit test results;

divide the unit test pass rate in the unit test results according to at least one test dimension, and determine the unit test pass rate of each test dimension as the model test result of the target model.

Optionally, the apparatus further includes a knowledge point module, configured to: after the test question is constructed based on the standard answer in the basic data set,

establish a correspondence between the test question and a scenario knowledge point.

Optionally, the knowledge point module is configured to:

determine the background knowledge classification of the target model as initial knowledge points;

determine, based on the initial knowledge points, the test question, and an annotation model, the scenario knowledge point corresponding to the test question, and establish the correspondence between the test question and the scenario knowledge point.

Optionally, the test dimensions include at least one of the following: data set, code language, prompt word template, and scenario knowledge point.

Optionally, the prompt word template includes the test question and associated information of the test question;

the target model includes a code processing model, the test question includes a code context, and the associated information of the test question includes code snippets similar to the code context.

The model testing apparatus provided by the embodiments of the present disclosure can execute the model testing method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to that method.

The embodiments of the present disclosure also provide a computer program product, including a computer program/instructions that, when executed by a processor, implement the model testing method provided by any embodiment of the present disclosure.

Referring now to FIG. 6, a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure is shown. The electronic device 600 in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 6, the electronic device 600 may include a processing device (for example, a central processing unit or a graphics processing unit) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Typically, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 608 including, for example, a magnetic tape and a hard disk; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices, wirelessly or by wire, to exchange data. Although FIG. 6 shows an electronic device 600 with various devices, it should be understood that not all of the devices shown need to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the model testing method of the embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), or any suitable combination of the above.

In some embodiments, the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication of any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future developed network.

The computer-readable medium may be included in the electronic device, or may exist separately without being assembled into the electronic device.

The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device is caused to: construct a test question based on a standard answer in a basic data set; construct a test data set based on the test question and a prompt word template; obtain a target model and input the prompt words in the test data set into the target model to obtain the corresponding answers to be tested; perform unit tests on the answers to be tested to determine unit test results; and determine the model test result of the target model based on the unit test results.

Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.

The functions described above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), and complex programmable logic devices (CPLDs).

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the information involved in the present disclosure, and the user's authorization should be obtained.

The above description is only of preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept. One example is a technical solution formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the present disclosure.

In addition, although the operations are depicted in a specific order, this should not be understood as requiring that the operations be performed in the specific order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, individually or in any suitable sub-combination.

Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims

1. A model testing method, comprising:

constructing a test question based on a standard answer in a basic data set;

constructing a test data set based on the test question and a prompt word template;

acquiring a target model, and inputting a prompt word in the test data set into the target model to obtain a corresponding answer to be tested;

performing a unit test on the answer to be tested, and determining a unit test result;

and determining a model test result of the target model based on the unit test result.
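
As an illustration only, the steps of claim 1 can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `run_model_test` and `exec_unit_test`, the question fields `context`, `func` and `cases`, and the use of `exec` to run the completed code are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List


def exec_unit_test(question: dict, answer: str) -> bool:
    """Hypothetical unit test: append the model answer to the code context,
    execute it, and check the named function against the expected cases.
    Model output is untrusted, so a real system would need a sandbox."""
    source = question["context"] + answer
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[question["func"]]
        return all(func(*args) == expected for args, expected in question["cases"])
    except Exception:
        # Any syntax error, runtime error, or missing function counts as a failure.
        return False


def run_model_test(
    questions: List[dict],
    template: str,
    model: Callable[[str], str],
    unit_test: Callable[[dict, str], bool],
) -> Dict[str, object]:
    """Build a prompt per question, query the model, unit-test each answer,
    and report the pass rate as the model test result."""
    outcomes = []
    for question in questions:
        prompt = template.format(context=question["context"])
        answer = model(prompt)  # the answer to be tested
        outcomes.append(unit_test(question, answer))
    passed = sum(outcomes)
    total = len(outcomes)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
    }
```

Here `model` is any callable that maps a prompt string to a completion string, so a stub such as `lambda prompt: "return a + b"` can stand in for the target model when trying the sketch.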

2. The method of claim 1, wherein constructing a test question based on a standard answer in a basic data set comprises:

constructing a unit test environment based on unit test environment data in the basic data set;

performing a unit test on the standard answer in the basic data set based on the unit test environment to obtain a unit test coverage;

and constructing the test question based on a processable answer determined from the unit test coverage and a processable position in the processable answer.
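
One possible reading of claim 2 is that only code actually exercised by the unit tests is turned into questions, so that every answer to be tested can later be checked by those same tests. The sketch below assumes this reading. The function `build_questions`, the field names, and the representation of coverage as a set of 1-based covered line numbers are hypothetical. The coverage data itself would come from whatever coverage tool the unit test environment uses.

```python
from typing import Dict, List, Set


def build_questions(source: str, covered_lines: Set[int]) -> List[Dict[str, object]]:
    """Turn each non-blank line that the unit tests cover into a fill-in question.

    The covered line becomes the standard answer, and the code before and
    after it becomes the context shown to the model.
    """
    lines = source.splitlines()
    questions = []
    for lineno, text in enumerate(lines, start=1):
        if lineno not in covered_lines or not text.strip():
            continue  # skip lines the tests never execute, and blank lines
        questions.append({
            "line": lineno,
            "prefix": "\n".join(lines[:lineno - 1]),
            "standard_answer": text,
            "suffix": "\n".join(lines[lineno:]),
        })
    return questions
```

Masking single lines is only one choice of processable position. The same structure would apply to masking whole statements or function bodies.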

3. The method of claim 1, wherein constructing a test data set based on the test question and a prompt word template comprises:

generating a corresponding prompt word based on the test question according to at least one prompt word template;

and combining at least one prompt word to obtain the test data set.
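
The combination of test questions and prompt word templates in claim 3 can be sketched as a cross product. This is an illustrative sketch only. The function `build_test_dataset`, the template names, and the placeholder `{context}` are assumptions, and templates are assumed to use Python `str.format` placeholders with no other literal braces.

```python
from typing import Dict, List


def build_test_dataset(
    questions: List[Dict[str, str]],
    templates: Dict[str, str],
) -> List[Dict[str, str]]:
    """Render every question with every template and collect the prompts."""
    dataset = []
    for question in questions:
        for template_name, template in templates.items():
            dataset.append({
                "question_id": question["id"],
                "template": template_name,
                # Unused question fields are ignored by str.format.
                "prompt": template.format(**question),
            })
    return dataset
```

Recording the template name with each prompt keeps it available later as a test dimension when the results are aggregated.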

4. The method of claim 1, wherein determining the model test result of the target model based on the unit test result comprises:

extracting a unit test pass rate from the unit test result;

and dividing the unit test pass rate in the unit test result according to at least one test dimension, and determining the unit test pass rate of each test dimension as the model test result of the target model.
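
The per-dimension pass rate of claim 4 amounts to grouping unit test outcomes by a chosen key. A minimal sketch follows, assuming each unit test outcome is stored as a record with a boolean `passed` field plus one field per test dimension. The function name and record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def pass_rate_by_dimension(records: List[dict], dimension: str) -> Dict[str, float]:
    """Group unit test records by one dimension and compute each group's pass rate."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        if record["passed"]:
            passed[key] += 1
    return {key: passed[key] / totals[key] for key in totals}
```

Calling the function once per dimension (for example `"language"`, then `"template"`) yields one pass-rate table per test dimension.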

5. The method of claim 1, wherein after constructing a test question based on a standard answer in a basic data set, the method further comprises:

establishing a correspondence between the test question and a scene knowledge point.

6. The method of claim 5, wherein establishing the correspondence between the test question and the scene knowledge point comprises:

determining a background knowledge classification of the target model as an initial knowledge point;

and determining the scene knowledge point corresponding to the test question based on the initial knowledge point, the test question and a labeling model, and establishing the correspondence between the test question and the scene knowledge point.

7. The method of claim 4 or 5, wherein the test dimension comprises at least one of: data sets, code languages, prompt word templates, and scene knowledge points.

8. The method of claim 1, wherein the prompt word template includes the test question and associated information of the test question;

the target model includes a code processing model, the test question includes a code context, and the associated information of the test question includes code segments similar to the code context.

9. A model testing apparatus, comprising:

a question module, configured to construct a test question based on a standard answer in a basic data set;

a data set module, configured to construct a test data set based on the test question and a prompt word template;

an input module, configured to acquire a target model and input a prompt word in the test data set into the target model to obtain a corresponding answer to be tested;

a unit test module, configured to perform a unit test on the answer to be tested and determine a unit test result;

and a result module, configured to determine a model test result of the target model based on the unit test result.

10. An electronic device, comprising:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the model testing method according to any one of claims 1-8.

11. A computer-readable storage medium, wherein the storage medium stores a computer program for executing the model testing method according to any one of claims 1-8.