CN117648140A - Method, apparatus, device, medium and program product for evaluating model performance

Info

Publication number: CN117648140A
Application number: CN202311685060.7A
Authority: CN
Inventors: 吴程程; 张烨; 刘宝菊; 王泽宇
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2023-12-08
Filing date: 2023-12-08
Publication date: 2024-03-05

Abstract

The present disclosure provides a method, apparatus, device, medium and program product for evaluating model performance, and relates to artificial intelligence fields such as AI development and cloud platforms. One implementation of the method includes: in response to receiving a test request for a target class model, obtaining a model performance evaluation plug-in and an image file corresponding to the target class model, wherein the image file includes a test sample set and plug-in running mode information; and starting the image file and using the model performance evaluation plug-in to evaluate the test sample set according to the plug-in running mode information, so as to determine the performance evaluation indicators corresponding to the target class model.

Description

Method, apparatus, device, medium and program product for evaluating model performance

Technical field

Embodiments of the present disclosure relate to the computer field, specifically to artificial intelligence fields such as AI development and cloud platforms, and in particular to a method, apparatus, device, medium and program product for evaluating model performance.

Background

In practical applications, model performance is usually evaluated with indicators selected according to the specific test scenario of the model.

At present, performance evaluation methods in the related art cannot fully cover all industry scenarios. Moreover, when model performance is evaluated in a specific scenario, a suitable running flow cannot be selected and switched in a personalized way to complete the evaluation.

Summary of the invention

Embodiments of the present disclosure provide a method, apparatus, device, medium and program product for evaluating model performance.

In a first aspect, an embodiment of the present disclosure provides a method for evaluating model performance, including: in response to receiving a test request for a target class model, obtaining a model performance evaluation plug-in and an image file corresponding to the target class model, wherein the image file includes a test sample set and plug-in running mode information; and starting the image file, and evaluating the test sample set with the model performance evaluation plug-in according to the plug-in running mode information, to determine performance evaluation indicators corresponding to the target class model.

In some examples, the model performance evaluation plug-in includes a model inference plug-in and an indicator evaluation plug-in;

evaluating the test sample set with the model performance evaluation plug-in according to the plug-in running mode information, to determine the performance evaluation indicators corresponding to the target class model, includes:

processing, with the model inference plug-in and according to its corresponding plug-in running mode information, a test sample in the test sample set to obtain an inference result corresponding to the test sample; and processing, with the indicator evaluation plug-in and according to its corresponding plug-in running mode information, the inference result corresponding to the test sample and a label corresponding to the test sample, to determine the performance evaluation indicators corresponding to the target class model.

In some examples, the plug-in running mode information includes: plug-in running environment information, plug-in running logic information and plug-in running parameter information.

In some examples, the plug-in running parameter information includes plug-in input parameters and a plug-in output format;

processing, with the model inference plug-in and according to its corresponding plug-in running mode information, a test sample in the test sample set to obtain an inference result corresponding to the test sample, includes:

executing, with the model inference plug-in and according to its corresponding plug-in running logic information, the following running flow on the test samples in the test sample set: parsing the test samples in the test sample set to obtain parsed test samples; traversing the parsed test samples and converting each parsed test sample into plug-in input parameters; and processing the plug-in input parameters to obtain an inference result that matches the plug-in output format.
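By way of illustration only, the parse, traverse, convert and process flow of the model inference plug-in might be sketched as follows. The disclosure does not specify a sample format or a plug-in interface, so the JSON-lines sample format, the function name `run_inference_plugin`, and the `to_input`, `model` and `output_keys` parameters are all assumptions made for this sketch.

```python
import json
from typing import Any, Callable, Dict, List


def run_inference_plugin(
    raw_samples: List[str],
    to_input: Callable[[Dict[str, Any]], Any],
    model: Callable[[Any], Any],
    output_keys: List[str],
) -> List[Dict[str, Any]]:
    """Hypothetical inference plug-in flow (not the patent's actual code)."""
    # Step 1: parse the raw test samples (assumed here to be JSON strings).
    parsed = [json.loads(line) for line in raw_samples]

    results = []
    # Step 2: traverse the parsed samples and convert each one into
    # the plug-in input parameters.
    for sample in parsed:
        plugin_input = to_input(sample)
        # Step 3: process the input parameters with the model under test.
        prediction = model(plugin_input)
        record = {"id": sample["id"], "prediction": prediction}
        # Keep only the fields required by the (assumed) output format.
        results.append({key: record[key] for key in output_keys})
    return results


raw = ['{"id": 1, "x": 2}', '{"id": 2, "x": -3}']
inference_results = run_inference_plugin(
    raw,
    to_input=lambda s: s["x"],
    model=lambda x: int(x > 0),  # toy stand-in for a real model service
    output_keys=["id", "prediction"],
)
```

In this sketch the output format is reduced to a list of required keys; a real plug-in specification could equally describe a file layout or a schema.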

In some examples, the plug-in running parameter information includes an indicator evaluation algorithm;

processing, with the indicator evaluation plug-in and according to its corresponding plug-in running mode information, the inference result corresponding to the test sample and the label corresponding to the test sample, to determine the performance evaluation indicators corresponding to the target class model, includes: executing, with the indicator evaluation plug-in and according to its corresponding plug-in running logic information, the following running flow on the inference result: parsing the inference result corresponding to the test sample and the label corresponding to the test sample, respectively, to obtain a parsed label and a parsed inference result; and processing the parsed label and the parsed inference result according to the indicator evaluation algorithm, to determine the performance evaluation indicators corresponding to the target class model.
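As a minimal sketch of the indicator evaluation flow described above, the following pairs each inference result with its label and hands both to a pluggable algorithm. The record layout (`id`, `prediction`, `label`), the function `run_metric_plugin` and the use of accuracy as the algorithm are illustrative assumptions; the disclosure leaves the algorithm to the plug-in running parameter information.

```python
from typing import Any, Callable, Dict, List, Sequence


def accuracy(labels: Sequence[Any], predictions: Sequence[Any]) -> float:
    """Fraction of predictions that equal their label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def run_metric_plugin(
    inference_results: List[Dict[str, Any]],
    labels: List[Dict[str, Any]],
    algorithm: Callable[[Sequence[Any], Sequence[Any]], float],
) -> float:
    """Hypothetical indicator evaluation flow (not the patent's actual code)."""
    # Parse the inference results and the labels into id-keyed lookups.
    predictions_by_id = {r["id"]: r["prediction"] for r in inference_results}
    labels_by_id = {entry["id"]: entry["label"] for entry in labels}
    # Align both sequences on the sample id before scoring.
    ids = sorted(labels_by_id)
    return algorithm(
        [labels_by_id[i] for i in ids],
        [predictions_by_id[i] for i in ids],
    )


score = run_metric_plugin(
    [{"id": 1, "prediction": 1}, {"id": 2, "prediction": 0}],
    [{"id": 1, "label": 1}, {"id": 2, "label": 1}],
    algorithm=accuracy,
)
```

Passing the algorithm as a parameter mirrors the idea that the evaluation algorithm is configuration rather than fixed code.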

In some examples, the plug-in running parameter information includes a display mode; the method further includes: displaying the performance evaluation indicators of the target class model in the display mode.

In some examples, when the target class model includes models of multiple classes, displaying the performance evaluation indicators of the target class model in the display mode includes: displaying the performance evaluation indicators of the models of the multiple classes in a comparison chart corresponding to the display mode.

In some examples, the method further includes: packaging the model service source code files of the target class model, the test sample set and the plug-in running mode information into the image file.

In a second aspect, an embodiment of the present disclosure provides an apparatus for evaluating model performance, including: an acquisition module configured to, in response to receiving a test request for a target class model, obtain a model performance evaluation plug-in and an image file corresponding to the target class model, wherein the image file includes a test sample set and plug-in running mode information; and a performance evaluation module configured to start the image file and evaluate the test sample set with the model performance evaluation plug-in according to the plug-in running mode information, to determine performance evaluation indicators corresponding to the target class model.

The performance evaluation module includes: a model inference unit configured to process, with the model inference plug-in and according to its corresponding plug-in running mode information, a test sample in the test sample set to obtain an inference result corresponding to the test sample; and a performance evaluation unit configured to process, with the indicator evaluation plug-in and according to its corresponding plug-in running mode information, the inference result corresponding to the test sample and the label corresponding to the test sample, to determine the performance evaluation indicators corresponding to the target class model.

The model inference unit is specifically configured to: execute, with the model inference plug-in and according to its corresponding plug-in running logic information, the following running flow on the test samples in the test sample set: parsing the test samples in the test sample set to obtain parsed test samples; traversing the parsed test samples and converting each parsed test sample into plug-in input parameters; and processing the plug-in input parameters to obtain an inference result that matches the plug-in output format.

The performance evaluation unit is specifically configured to: execute, with the indicator evaluation plug-in and according to its corresponding plug-in running logic information, the following running flow on the inference result: parsing the inference result corresponding to the test sample and the label corresponding to the test sample, respectively, to obtain a parsed label and a parsed inference result; and processing the parsed label and the parsed inference result according to the indicator evaluation algorithm, to determine the performance evaluation indicators corresponding to the target class model.

In some examples, the plug-in running parameter information includes a display mode; the apparatus further includes: a display module configured to display the performance evaluation indicators of the target class model in the display mode.

In some examples, when the target class model includes models of multiple classes, the display module is specifically configured to: display the performance evaluation indicators of the models of the multiple classes in a comparison chart corresponding to the display mode.

In some examples, the apparatus further includes: a packaging module configured to package the model service source code files of the target class model, the test sample set and the plug-in running mode information into the image file.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described in the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method described in the first aspect.

In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program which, when executed by a processor, implements the method described in the first aspect.

Embodiments of the present disclosure provide a method, apparatus, device, medium and program product for evaluating model performance. Because the test request is directed at a target class model, models of different classes can be evaluated. After the test request is received, the model performance evaluation plug-in and the image file corresponding to the target class model are selected. After the image file is started, the model performance evaluation plug-in evaluates the test sample set in the image file according to the plug-in running mode information in the image file, to obtain the performance evaluation indicators of the target class model. The customized plug-in running mode information dictates the running flow by which the model performance evaluation plug-in evaluates the test sample set, thereby achieving personalized performance evaluation.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Description of the drawings

Other features, objects and advantages of the present disclosure will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:

Figure 1 is a diagram of an exemplary system architecture to which the present disclosure may be applied;

Figure 2 is a flowchart of one embodiment of a method for evaluating model performance according to the present disclosure;

Figure 3 is a schematic diagram of the plug-in development process;

Figure 4 is a flowchart of one embodiment of a method for evaluating model performance according to the present disclosure;

Figures 5 and 6 are application scenario diagrams of the method for evaluating model performance according to the present disclosure;

Figure 7 is a schematic diagram of one embodiment of an apparatus for evaluating model performance according to the present disclosure;

Figure 8 is a block diagram of an electronic device used to implement embodiments of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to facilitate understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.

It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features of the embodiments may be combined with each other. The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

Figure 1 shows an exemplary system architecture 100 to which embodiments of the method and apparatus for evaluating model performance of the present disclosure may be applied.

As shown in Figure 1, the system architecture 100 may include a client 101, a network 102 and a test platform 103. The network 102 is the medium that provides a communication link between the client 101 and the test platform 103. The network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

The client 101 interacts with the test platform 103 through the network 102 to send or receive test requests and the like.

The test platform 103 may be a platform that provides various kinds of model performance evaluation. For example, it matches a test request from the client 101 to the model performance evaluation plug-in and image file corresponding to the target class model, and obtains the performance evaluation indicators of the target class model by running the model performance evaluation plug-in according to the image file.

The test platform 103 may be hardware or software. When the test platform 103 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the test platform 103 is software, it may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.

It should be noted that the method for evaluating model performance provided by the embodiments of the present disclosure is generally executed by the test platform 103; accordingly, the apparatus for evaluating model performance is generally provided in the test platform 103.

It should be understood that the numbers of clients, networks and test platforms in Figure 1 are merely illustrative. There may be any number of clients, networks and test platforms, depending on implementation needs.

With continued reference to Figure 2, a flow 200 of one embodiment of a method for evaluating model performance according to the present disclosure is shown. The method for evaluating model performance may include the following steps:

Step 201: in response to receiving a test request for a target class model, obtain the model performance evaluation plug-in and the image file corresponding to the target class model, the image file including a test sample set and plug-in running mode information.

In this embodiment, the execution subject of the method for evaluating model performance (for example, the test platform 103 shown in Figure 1) may, upon receiving a test request for a target class model sent through a client (for example, the client 101 shown in Figure 1), obtain the model performance evaluation plug-in and the image file corresponding to the target class model. The target class model may be a model of a given type, and the type may be reflected in its name.

Here, the image file may include a test sample set used for testing, and plug-in running mode information. The plug-in running mode information may be used to define the running flow of the plug-in, the environment required to run the plug-in, and the parameters, algorithms and the like involved in running the plug-in.
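One way to picture the plug-in running mode information is as a small configuration record with three parts: environment, logic and parameters. The sketch below is only an assumption about how such a record could be modelled; the class name `PluginRunInfo`, its field names and the example values are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PluginRunInfo:
    """Hypothetical container for plug-in running mode information."""

    # Running environment information, e.g. interpreter or library versions.
    environment: Dict[str, str]
    # Running logic information: the ordered steps of the running flow.
    logic: List[str]
    # Running parameter information: inputs, output format, algorithm, display.
    parameters: Dict[str, Any] = field(default_factory=dict)


info = PluginRunInfo(
    environment={"python": "3.10"},  # illustrative value only
    logic=["parse", "traverse", "convert", "infer"],
    parameters={
        "input_params": ["image"],
        "output_format": "json",
        "metric_algorithm": "accuracy",
        "display": "bar_chart",
    },
)
```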

In one example, upon receiving a test request for a target class model, the model performance evaluation plug-in and the image file corresponding to the target class model may be obtained from a corresponding storage path.

It should be noted that the corresponding storage path may be a storage path defined in the specifications set by the plug-in developer, or a storage path personalized by the test user in the image file.

In this embodiment, different model performance evaluation plug-ins are selected or switched by means of the test request, and the plug-in running mode information guides the model performance evaluation plug-in in evaluating the test sample set, so that performance evaluation can be completed for models of different classes.

In one example, as shown in Figure 3, a test user may build a model performance evaluation plug-in on the test platform and import a test sample set. On the test platform, the test user may initiate a test request, and the corresponding model performance evaluation plug-in is obtained according to the test request. After the test is completed, the test user may view the performance evaluation indicators to gain a deeper understanding of the performance of the target class model.

When the model performance evaluation plug-in is obtained, a matching plug-in is selected based on the type of the target class model. The method is therefore applicable to model performance evaluation scenarios of various types and achieves personalized model performance evaluation.
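Selecting a plug-in by model type can be illustrated with a simple registry lookup. This is a sketch under stated assumptions: the registry `PLUGIN_REGISTRY`, the `register` decorator, the `model_type` request field and the plug-in names are all hypothetical, and a real platform would return a plug-in object rather than a string.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a model type name to a plug-in factory.
PLUGIN_REGISTRY: Dict[str, Callable[[], str]] = {}


def register(model_type: str):
    """Decorator that registers a plug-in factory for a given model type."""
    def decorator(factory: Callable[[], str]) -> Callable[[], str]:
        PLUGIN_REGISTRY[model_type] = factory
        return factory
    return decorator


@register("classification")
def classification_plugin() -> str:
    return "classification-eval-plugin"


@register("regression")
def regression_plugin() -> str:
    return "regression-eval-plugin"


def select_plugin(test_request: Dict[str, str]) -> str:
    """Pick the plug-in that matches the model type named in the request."""
    model_type = test_request["model_type"]
    if model_type not in PLUGIN_REGISTRY:
        raise KeyError(f"no evaluation plug-in registered for {model_type!r}")
    return PLUGIN_REGISTRY[model_type]()


selected = select_plugin({"model_type": "regression"})
```

Switching plug-ins then only requires a different `model_type` in the request, and supporting a new model type only requires registering another factory.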

The plug-in developer may develop and debug the model performance evaluation plug-in on the developer's own development terminal, to ensure that the plug-in runs normally in an offline environment. After the plug-in has been debugged successfully, the plug-in developer may upload the plug-in's code package, dependencies, configuration and other information to the test platform, which takes the plug-in under management. After the plug-in passes verification on the test platform, it is released and brought online so that test users can evaluate and use it.

When the plug-in needs to be updated, the updated version may likewise be developed and debugged first and then uploaded to the test platform. After verifying the plug-in, the test platform brings it online so that test users can evaluate and use the updated version.

Here, the plug-in developer may provide the running environment for the plug-in (for example, an image) and define specifications for running the plug-in. The specifications may cover the test sample set, the inference results, the parameters input to the plug-in (for example, the plug-in input parameters), the parameters output by the plug-in (for example, the plug-in output format), the evaluation algorithms for custom performance evaluation indicators, and so on. These specifications may govern storage paths, formats and the like.

It should be noted that a test user may give the plug-in developer test requirements based on the user's specific testing needs. The plug-in developer then develops a new plug-in and shares it on the test platform so that models of different types can be evaluated, which ensures the extensibility of the test platform.

In this embodiment, the running environment of the plug-in and the storage paths of the plug-in's input and output parameter formats may be specified by the plug-in developer or personalized by the test user in the image file.

Step 202: start the image file, and evaluate the test sample set with the model performance evaluation plug-in according to the plug-in running mode information, to determine the performance evaluation indicators corresponding to the target class model.

In this embodiment, the execution subject first starts the image file. It then uses the model performance evaluation plug-in to process the test samples in the test sample set according to the plug-in running mode information, obtaining the inference result corresponding to each test sample. Finally, it determines the performance evaluation indicators corresponding to the target class model according to the inference result and the label corresponding to each test sample.

Here, the performance evaluation indicators corresponding to the target class model may be indicators specific to that class of model. The performance evaluation indicators characterize the performance of the target class model.

Here, starting the image file may include: starting the plug-in running mode information as a container; and then attaching the test sample set to the container so that the test sample set can be shared or accessed.
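The disclosure does not name a container runtime. Purely as an assumption, the sketch below uses the Docker CLI to show what "start as a container, then attach the test sample set" could look like: it builds (but does not execute) a `docker run` command that mounts a sample directory read-only. The function name, image name and paths are hypothetical.

```python
from typing import List


def build_run_command(
    image: str,
    host_samples_dir: str,
    container_samples_dir: str = "/data/samples",
) -> List[str]:
    """Build a `docker run` argument list (assumed runtime: Docker).

    The test sample set is attached as a read-only bind mount so that the
    plug-in inside the container can access it.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{host_samples_dir}:{container_samples_dir}:ro",
        image,
    ]


command = build_run_command("model-eval:latest", "/srv/test-samples")
# The list could be passed to subprocess.run(command, check=True) on a host
# where Docker is installed; it is left unexecuted here.
```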

In one example, the way in which the test sample set is evaluated may be determined by the computation that defines the performance evaluation indicator corresponding to the target class model; for example, the root mean square error is obtained as the square root of the mean squared error.

In one example, the performance evaluation indicators of the target class model may include the following.

Performance evaluation indicators corresponding to a classification model may include: Accuracy: the proportion of correctly classified samples among all samples; Precision: the proportion of true positives among true positives and false positives; Recall: the proportion of true positives among true positives and false negatives; F1 Score: the harmonic mean of precision and recall, used to balance the two; ROC (Receiver Operating Characteristic) curve and AUC (Area Under Curve): the receiver operating characteristic curve and the area under that curve, used for binary classification problems.
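The four count-based classification indicators above can be computed directly from the confusion-matrix counts. The following is a minimal sketch for binary labels (1 for positive, 0 for negative); the function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
from typing import Dict, Sequence


def classification_metrics(
    labels: Sequence[int], predictions: Sequence[int]
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (a convention, not a rule).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


metrics = classification_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```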

Performance evaluation indicators corresponding to a regression model may include: Mean Squared Error (MSE): the average of the squared differences between the actual values and the predicted values; Root Mean Squared Error (RMSE): the square root of the MSE; Mean Absolute Error (MAE): the average of the absolute differences between the actual values and the predicted values; R-squared (R²): a measure of how much of the total variance the model explains.
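The regression indicators follow directly from their definitions. The sketch below uses only the standard library; the function name and the choice to report R² as 0.0 when the actual values have zero variance are assumptions of this example.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """MSE, RMSE, MAE and R-squared from paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # RMSE is the square root of the MSE
    mae = sum(abs(e) for e in errors) / n

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Note that R² can be negative, as in this toy input, when the predictions fit worse than simply predicting the mean of the actual values.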

Performance evaluation indicators corresponding to an object detection model may include: Average Precision (AP): the average of the precision under different IoU (Intersection over Union) thresholds in an object detection task; mAP (mean Average Precision): the average of AP over multiple categories; IoU (Intersection over Union): the ratio of the intersection to the union of the predicted bounding box and the ground-truth bounding box; FROC (Free-Response ROC) curve: the curve relating the false positive rate to the recall in an object detection task.
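Of the detection indicators, IoU is the simplest to show concretely. The sketch below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`; that box convention and the function name are assumptions of this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned bounding boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```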

Performance evaluation indicators corresponding to an image generation model may include: SSIM (Structural Similarity Index): used to compare the structural similarity between the generated image and the original image; PSNR (Peak Signal-to-Noise Ratio): a measure of the signal-to-noise ratio between the generated image and the original image; LPIPS (Learned Perceptual Image Patch Similarity): a measure of the perceptual similarity of images that corresponds more closely to human visual perception.
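As one concrete instance, PSNR is conventionally computed from the mean squared error between the two images as 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below treats an image as a flat sequence of pixel intensities and assumes 8-bit pixels (MAX = 255); both choices are assumptions of this example.

```python
import math
from typing import Sequence


def psnr(
    original: Sequence[float],
    generated: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in decibels for flattened pixel sequences."""
    mse = sum((o - g) ** 2 for o, g in zip(original, generated)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10 * math.log10(max_value ** 2 / mse)


worst_case = psnr([0, 0, 0, 0], [255, 255, 255, 255])  # MSE equals MAX squared
```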

Performance evaluation indicators corresponding to a natural language processing model may include: BLEU (Bilingual Evaluation Understudy): an automatic evaluation indicator for machine translation tasks; ROUGE (Recall-Oriented Understudy for Gisting Evaluation): an evaluation indicator for text summarization tasks; METEOR (Metric for Evaluation of Translation with Explicit ORdering): an automatic evaluation indicator for machine translation; Perplexity: an indicator for evaluating the performance of language models.
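Perplexity is commonly defined as the exponential of the average negative log-probability that a language model assigns to each token. The sketch below takes the per-token probabilities as given; how they are obtained from a model is outside its scope, and the function name is an assumption of this example.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity from the probabilities a model assigned to each true token."""
    # Average negative log-likelihood per token (natural logarithm).
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of about 4, which is why perplexity is often read as an effective branching factor; lower values indicate a better model.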

在本实施例中，上述性能评估指标能够在不同维度对模型性能进行评估并给出评估数值(即，性能评估指标)，某个指标越大时说明模型的分类性能在某个方面表现越好。In this embodiment, the above performance evaluation index can evaluate the model performance in different dimensions and give an evaluation value (ie, performance evaluation index). The larger a certain index is, the better the classification performance of the model is in a certain aspect. .

In this embodiment, the performance evaluation of the target class model can be completed by means of the performance evaluation indicators corresponding to the target class model.

The method for evaluating model performance provided by the embodiments of the present disclosure can evaluate the performance of different classes of models through test requests directed at a target class model. After a test request is received, the model performance evaluation plug-in and the image file corresponding to the target class model are selected. After the image file is started, the model performance evaluation plug-in evaluates the test sample set in the image file according to the plug-in running mode information in the image file, so as to obtain the performance evaluation indicators of the target class model. Because the plug-in running mode information can be customized and directs the running process by which the model performance evaluation plug-in evaluates the test sample set, personalized performance evaluation is achieved.

With further reference to Figure 4, a flow 400 of one embodiment of the method for evaluating model performance according to the present disclosure is shown. The method for evaluating model performance may include the following steps:

Step 401: in response to receiving a test request for the target class model, obtain the model performance evaluation plug-in and the image file corresponding to the target class model, where the image file includes a test sample set and plug-in running mode information.

In this embodiment, the execution body of the method for evaluating model performance (for example, the test platform 103 shown in Figure 1) may, upon receiving a test request for the target class model, obtain the model performance evaluation plug-in and the image file corresponding to the target class model.

Here, the model performance evaluation plug-in may include a model inference plug-in and an indicator evaluation plug-in. The model inference plug-in may be used to process the test sample set to obtain inference results. The indicator evaluation plug-in may be used to process the inference results and the labels to obtain the performance evaluation indicators corresponding to the target class model.

In this embodiment, during model testing, plug-ins may be added according to the specific test request of the test user in order to fulfil that request.

Step 402: process the test samples in the test sample set through the model inference plug-in according to its corresponding plug-in running mode information, to obtain the inference results corresponding to the test samples.

In this embodiment, the execution body may process the test sample set through the model inference plug-in according to the plug-in running mode information corresponding to the model inference plug-in, to obtain the inference results. An inference result may be a predicted value.

Here, the plug-in running mode information of the model inference plug-in may be set by the plug-in developer and/or by the test user. The plug-in running mode information of the model inference plug-in may be used to define the running logic, running parameters, running environment, and so on involved in obtaining the inference results from the test sample set.

In this embodiment, allowing the test user to set the plug-in running mode information of the model inference plug-in makes it possible to satisfy the test user's specific test request.

Step 403: process the inference results corresponding to the test samples and the labels corresponding to the test samples through the indicator evaluation plug-in according to its corresponding plug-in running mode information, to determine the performance evaluation indicators corresponding to the target class model.

In this embodiment, the execution body processes the inference results and the labels through the indicator evaluation plug-in according to the plug-in running mode information corresponding to the indicator evaluation plug-in, and determines the performance evaluation indicators corresponding to the target class model. The plug-in running mode information of the indicator evaluation plug-in may be used to define the running logic, running parameters, running environment, and so on involved in obtaining the performance evaluation indicators from the inference results.

Here, the plug-in running mode information of the indicator evaluation plug-in may be set by the plug-in developer and/or by the test user.

In this embodiment, allowing the test user to set the plug-in running mode information of the indicator evaluation plug-in makes it possible to satisfy the test user's specific test request.

In this embodiment, the specific operation of step 401 has been described in detail in step 201 of the embodiment shown in Figure 2 and is not repeated here.

As can be seen from Figure 4, compared with the embodiment corresponding to Figure 2, the method for evaluating model performance in this embodiment highlights processing the test sample set through the model inference plug-in to obtain the inference results, and then evaluating the inference results and the labels through the indicator evaluation plug-in to obtain the performance evaluation indicators, so that the specific test needs of the test user can be met through pluggable plug-ins.

In some optional implementations of this embodiment, the plug-in running mode information includes at least one of: plug-in running environment information, plug-in running logic information, and plug-in running parameter information.

In this implementation, the running mode of the model performance evaluation plug-in may be configured through at least one of the plug-in running environment information, the plug-in running logic information, and the plug-in running parameter information. The running mode may be used to define the running process by which the model performance evaluation plug-in evaluates the test sample set.

Here, the plug-in running environment information may be information about the environment in which the model performance evaluation plug-in runs on the test platform (for example, the test platform 103 shown in Figure 1). The plug-in running logic information may define the execution logic by which the model performance evaluation plug-in processes the test sample set. The plug-in running parameter information may define parameter-related information used while the model performance evaluation plug-in runs, for example, the parameter format, the parameter type, the algorithm corresponding to a parameter, and the display mode corresponding to a parameter.
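To make the three kinds of information concrete, the following is a minimal Python sketch of how plug-in running mode information might be represented. The class name `PluginRunModeInfo`, its field names, and all example values (including the environment label) are hypothetical and are not defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class PluginRunModeInfo:
    """Hypothetical container for plug-in running mode information."""

    # Running environment information, e.g. which image the plug-in runs in.
    environment: Dict[str, Any] = field(default_factory=dict)
    # Running logic information, e.g. a description or name of the flow.
    logic: str = ""
    # Running parameter information: formats, types, algorithms, display mode.
    parameters: Dict[str, Any] = field(default_factory=dict)

    def is_configured(self) -> bool:
        """True if at least one of the three kinds of information is set."""
        return bool(self.environment or self.logic or self.parameters)


# Illustrative values only; none of them come from the disclosure.
inference_mode = PluginRunModeInfo(
    environment={"runtime": "python3.8"},
    logic="parse -> convert -> infer -> write",
    parameters={"input_format": "json", "output_format": "json"},
)
```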

In this implementation, the running process by which the model performance evaluation plug-in evaluates the test sample set is defined through the plug-in running environment information, the plug-in running logic information, and the plug-in running parameter information.

In some optional implementations of this embodiment, the plug-in running parameter information includes plug-in input parameters and a plug-in output format;

the following running process is executed on the test samples in the test sample set through the model inference plug-in according to its corresponding plug-in running logic information:

parsing the test samples in the test sample set to obtain parsed test samples;

traversing the parsed test samples and converting the parsed test samples into the plug-in input parameters;

processing the plug-in input parameters to obtain inference results that match the plug-in output format.

In this implementation, the execution body, through the model inference plug-in and according to its corresponding plug-in running logic information, parses the test samples in the test sample set to obtain parsed test samples; traverses the parsed test samples and converts them into the plug-in input parameters; and processes the plug-in input parameters to obtain inference results, which are then converted into inference results that match the plug-in output format.

In one example, the running logic of the model inference plug-in may include:

1. Obtain the test sample set from the path of the test sample set, and parse the test sample set;
2. Traverse the parsed test samples and, for each one, execute:
2.1 Convert the parsed test sample into the plug-in input parameters;
2.2 Initiate an inference request to obtain the inference result;
2.3 Convert the format of the inference result into the plug-in output format;
2.4 Write the inference result in the plug-in output format into the result file;
3. If other information needs to be output, it can also be written into the result file so that other plug-ins can read it later.

Here, input_func is the model inference input-parameter script, used to convert a test sample into the plug-in input parameters; output_func is the model output conversion script, used to convert the format of the inference result into the plug-in output format.

It should be noted that the inference results are written into the result file; other data that a subsequent indicator evaluation plug-in may use, such as labels, can also be written, in file form, into the result file or into other files.
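The running logic of the model inference plug-in described above can be sketched in Python as follows. Only the names `input_func` and `output_func` come from the text; the function `run_inference_plugin`, the `infer` callable standing in for the inference request, the JSON-lines sample format, and the in-memory result file are assumptions made for this illustration.

```python
import io
import json
from typing import Any, Callable, Iterable, List, TextIO


def run_inference_plugin(
    raw_samples: Iterable[str],
    input_func: Callable[[Any], Any],
    infer: Callable[[Any], Any],
    output_func: Callable[[Any], Any],
    result_file: TextIO,
) -> List[Any]:
    """Parse samples, run inference on each one, and write the results."""
    # Parse the test sample set (assumed here to be one JSON object per line).
    parsed_samples = [json.loads(line) for line in raw_samples]

    results = []
    for sample in parsed_samples:
        params = input_func(sample)          # sample -> plug-in input parameters
        raw_result = infer(params)           # stand-in for the inference request
        result = output_func(raw_result)     # raw result -> plug-in output format
        result_file.write(json.dumps(result) + "\n")  # one result per line
        results.append(result)
    return results


# Usage with toy callables; a real plug-in would call the deployed model service.
buffer = io.StringIO()
results = run_inference_plugin(
    ['{"x": 1}', '{"x": 2}'],
    input_func=lambda sample: sample["x"],
    infer=lambda params: params * 2,
    output_func=lambda raw: {"prediction": raw},
    result_file=buffer,
)
```

Passing the conversion functions in as arguments mirrors the pluggable design: the loop stays fixed while the test user supplies the sample and output conversions.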

Before the test samples are parsed, the model service code source file also needs to be obtained according to its path and deployed to the model service address, so as to obtain the model performance evaluation plug-in; the test sample set is then processed through the model inference plug-in to obtain the inference results.

Here, the model service address is the address of the device on which the model service is to be deployed (for example, the network address of the user terminal).

Here, the above paths and addresses can all be set in the plug-in running parameter information. For example, the path of the files that store the test sample set and the plug-in input and output results may be a storage path customized by the plug-in developer, or a storage path set by the test user in the plug-in running parameter information.

Correspondingly, in this example, if the "average inference latency" indicator also needs to be collected while the target class model is being evaluated, the model can first be warmed up, so as to obtain the indicator evaluation result corresponding to the "average inference latency".
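A minimal sketch of such a warm-up followed by a latency measurement is shown below. The function name `average_inference_latency`, the `infer` callable, and the default of three warm-up runs are assumptions for illustration; the disclosure does not specify how the warm-up is performed.

```python
import time
from typing import Any, Callable, Sequence


def average_inference_latency(
    infer: Callable[[Any], Any],
    params_list: Sequence[Any],
    warmup_runs: int = 3,
) -> float:
    """Average per-request latency in seconds, measured after a warm-up."""
    if not params_list:
        raise ValueError("at least one request is required")

    # Warm-up requests are issued but not timed, so that one-off start-up
    # costs (lazy loading, cache filling) do not distort the average.
    for _ in range(warmup_runs):
        infer(params_list[0])

    total = 0.0
    for params in params_list:
        start = time.perf_counter()
        infer(params)
        total += time.perf_counter() - start
    return total / len(params_list)
```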

In this implementation, the test user can satisfy a specific test request by setting the plug-in running mode information.

In some optional implementations of this embodiment, the plug-in running parameter information includes an indicator evaluation algorithm;

processing, through the indicator evaluation plug-in according to its corresponding plug-in running mode information, the inference results corresponding to the test samples and the labels corresponding to the test samples to determine the performance evaluation indicators corresponding to the target class model includes:

executing the following running process on the inference results through the indicator evaluation plug-in according to its corresponding plug-in running logic information:

parsing, respectively, the inference results corresponding to the test samples and the labels corresponding to the test samples, to obtain parsed labels and parsed inference results;

processing the parsed labels and the parsed inference results according to the indicator evaluation algorithm, to determine the performance evaluation indicators corresponding to the target class model.

In this implementation, the execution body may, through the indicator evaluation plug-in and according to its corresponding plug-in running logic information, parse the inference results corresponding to the test samples and the labels corresponding to the test samples respectively, to obtain parsed labels and parsed inference results, and then process the parsed labels and the parsed inference results according to the indicator evaluation algorithm to determine the performance evaluation indicators corresponding to the target class model.

In one example, the running logic of the indicator evaluation plug-in may include the following running process:

(1.1) Obtain the inference results and the labels from the result file;
(1.2) Parse the inference results and the labels respectively, where the file path of the inference results is the path of the result file;
(1.3) Determine the performance evaluation indicators corresponding to the target class model through the indicator evaluation plug-in.

It should be noted that the performance evaluation indicators may also be determined by evaluating the inference results and the labels with the indicator evaluation algorithm defined in the plug-in running parameter information.

In this implementation, personalized evaluation of the inference results and the labels is achieved through the indicator evaluation algorithm defined in the plug-in running parameter information.
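The indicator evaluation flow above can be sketched in Python as follows. The result-file layout (one JSON object per line holding a `label` and a `prediction`), the function names `accuracy` and `run_metric_plugin`, and the use of accuracy as the default algorithm are all assumptions for this illustration; the disclosure leaves the algorithm to the plug-in running parameter information.

```python
import json
from typing import Any, Callable, Dict, Iterable, Sequence


def accuracy(labels: Sequence[Any], predictions: Sequence[Any]) -> float:
    """Fraction of predictions that equal their labels."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and aligned")
    correct = sum(1 for lab, pred in zip(labels, predictions) if lab == pred)
    return correct / len(labels)


def run_metric_plugin(
    result_lines: Iterable[str],
    metric_algorithm: Callable[[Sequence[Any], Sequence[Any]], float] = accuracy,
) -> Dict[str, float]:
    """Parse labels and inference results, then apply the chosen algorithm."""
    # Obtain and parse the records written by the inference stage.
    records = [json.loads(line) for line in result_lines]
    labels = [record["label"] for record in records]
    predictions = [record["prediction"] for record in records]
    # The algorithm is swappable, which is what makes the evaluation
    # configurable per test request.
    return {metric_algorithm.__name__: metric_algorithm(labels, predictions)}
```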

In some optional implementations of this embodiment, the method further includes:

encapsulating the model service code source file of the target class model, the test sample set, and the plug-in running mode information into an image file.

In this implementation, before the model performance evaluation, the method for evaluating model performance further includes: encapsulating the model service code source file of the target class model, the test sample set, and the plug-in running mode information into an image file.

Here, the model performance evaluation plug-in corresponding to the target class model may be a file-type model or an image-type model; during indicator evaluation it shares the model service code source file and takes part in the model performance evaluation process in the form of a model service. The model service code source file may be used to define the path of the model service code source file, the model service address, and so on.

Test sample set: it includes labels. The format of the test sample set may be set in the image file by the plug-in developer or by the test user. For example, the plug-in developer may provide the storage path of a sample data package.

Performance evaluation indicators: indicators computed from the labels and the inference results and used to evaluate the target class model, such as accuracy and prediction latency.

Plug-in running mode information: composed of Python code, it defines the execution flow of the plug-in in the model performance evaluation process. For example, the plug-in running mode information includes plug-in running environment information, which defines the running environment (or resources, etc.) in which the plug-in runs and generally consists of an image, for example a Python 3.8+ environment.

In some optional implementations of this embodiment, if the plug-in running parameter information includes a display mode, the method further includes: displaying the performance evaluation indicators of the target class model through the display mode.

In this implementation, the display mode may include: numeric, text, and line chart.

Here, the display mode may refer to the display mode of some or all of the performance evaluation indicators of the target class model.

After the performance evaluation indicators are obtained, they can be written, according to the plug-in running parameter information, into the corresponding storage target, for example an indicator database or a list.

In one example, the line chart may present the performance evaluation results of the target class model through elements such as lines, for example as a polyline chart or a bar chart.

In this implementation, the performance evaluation indicators are displayed through the display mode included in the plug-in running parameter information.

In some optional implementations of this embodiment, if the target class model includes multiple classes of models, displaying the performance evaluation indicators of the target class model through the display mode includes:

displaying the performance evaluation indicators of the multiple classes of models through a comparison chart corresponding to the display mode.

In this implementation, the performance evaluation indicators of the multiple classes of models are displayed through a comparison chart corresponding to the display mode. As defined by the display mode, the comparison chart may display all or part of the performance evaluation indicators of the multiple classes of models together; or cross-analyze all or part of the performance evaluation indicators and display the result; or display all or part of the performance evaluation indicators of the multiple classes of models, for example sample proportions, through a confusion matrix.
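As a simple text-based stand-in for such a comparison chart, the Python sketch below lays out the indicators of several models side by side. The function name `comparison_table`, the input layout (a dictionary mapping model names to indicator dictionaries), and the example numbers used when calling it are hypothetical.

```python
from typing import Dict, List


def comparison_table(metrics_by_model: Dict[str, Dict[str, float]]) -> str:
    """Render one row per model and one column per indicator as plain text."""
    # Collect every indicator name that appears for any model.
    metric_names = sorted({name for m in metrics_by_model.values() for name in m})
    rows: List[List[str]] = [["model"] + metric_names]
    for model, metrics in metrics_by_model.items():
        # A dash marks an indicator that was not computed for this model.
        cells = [f"{metrics[n]:.3f}" if n in metrics else "-" for n in metric_names]
        rows.append([model] + cells)

    # Pad every column to the width of its widest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in rows
    )
```

A graphical display mode would render the same rows as a line chart or bar chart; the tabular form only illustrates how indicators of different models are aligned for comparison.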

Here, all or part of the performance evaluation indicators are cross-analyzed, and the cross-analysis results can reveal certain aspects of model performance; for example, analyzing accuracy together with latency can reflect the latency and resource usage of the model.

In one example, all the performance indicators of the multiple classes of models are displayed through a comparison chart, which can present the model names, a list of the performance evaluation indicators, and a line chart of all the performance evaluation indicators of each model.

In one example, at least some of the performance indicators, for example accuracy, are displayed through a comparison chart, such as a line chart, a bar chart, or a curve chart (for example, a P-R curve or an ROC curve).

In this implementation, the multiple classes of models are displayed through a comparison chart, and during the display the test user can choose what to display (for example, a certain class of model or certain performance evaluation indicators) as needed, so that the specific test needs of the test user can be met.

With further reference to Figures 5 and 6, application scenario diagrams of the method for evaluating model performance according to the present disclosure are shown. In this application scenario, the method for evaluating model performance may include the following steps:

(1) Preparation before evaluation: prepare the model service code source file and the test sample set, and start the model performance evaluation plug-in and publish it as an online service;
(2) Starting the model service: send the target class model and the test sample set to the model inference stage through the test request;
(3) Model inference: the model inference plug-in first parses the test sample set to obtain parsed test samples, converts the parsed test samples into the plug-in input parameters (that is, input_func), and converts the format of the inference results into the plug-in output format (that is, output_func);
(4) Performance evaluation: the indicator evaluation plug-in parses the inference results and the labels, and evaluates them through the indicator evaluation calculation (that is, calculate_metrics_func) to obtain the performance evaluation indicators.
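Stages (3) and (4) can be tied together in a short Python sketch. The names `input_func`, `output_func`, and `calculate_metrics_func` come from the text; their bodies, the `evaluate` driver, the toy `fake_model_service`, and the sample data are hypothetical stand-ins chosen only to make the example runnable.

```python
def input_func(sample):
    """Convert a parsed test sample into plug-in input parameters."""
    return {"features": sample["features"]}


def output_func(raw_result):
    """Convert a raw score into the plug-in output format (a class label)."""
    return int(raw_result >= 0.5)


def calculate_metrics_func(labels, predictions):
    """Indicator evaluation calculation; accuracy is used as the example."""
    correct = sum(1 for lab, pred in zip(labels, predictions) if lab == pred)
    return {"accuracy": correct / len(labels)}


def fake_model_service(params):
    """Toy stand-in for the deployed model service: mean of the features."""
    return sum(params["features"]) / len(params["features"])


def evaluate(samples, model_service):
    """Run model inference on every sample, then evaluate the indicators."""
    predictions = [output_func(model_service(input_func(s))) for s in samples]
    labels = [s["label"] for s in samples]
    return calculate_metrics_func(labels, predictions)


samples = [
    {"features": [0.9, 0.7], "label": 1},
    {"features": [0.1, 0.3], "label": 0},
    {"features": [0.6, 0.8], "label": 0},
]
metrics = evaluate(samples, fake_model_service)
```

With these toy inputs, two of the three predictions match their labels, so the resulting accuracy is two thirds.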

After the performance evaluation indicators are obtained, the method may further include: displaying the performance evaluation indicators through the display mode. After the performance evaluation indicators are displayed, the method may further include middleware cleanup and destruction of the model service, so as to evaluate the resources occupied by the target class model, namely CPU (Central Processing Unit), memory, and GPU (Graphics Processing Unit).

It should be noted that, when the performance of the target class model is evaluated, the resources occupied by the target class model can also be taken into account to assist the evaluation. For example, a model that occupies fewer resources is considered to perform better than a model that occupies more resources.

In this embodiment, the performance evaluation of the target class model is completed through pluggable plug-ins.

With further reference to Figure 7, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for evaluating model performance. The apparatus embodiment corresponds to the method embodiment shown in Figure 2, and the apparatus can be applied in various electronic devices.

As shown in Figure 7, the apparatus 700 for evaluating model performance in this embodiment may include an acquisition module 701 and a performance evaluation module 702. The acquisition module 701 is configured to, in response to receiving a test request for the target class model, obtain the model performance evaluation plug-in and the image file corresponding to the target class model, where the image file includes a test sample set and plug-in running mode information. The performance evaluation module 702 is configured to start the image file and evaluate the test sample set through the model performance evaluation plug-in according to the plug-in running mode information, to determine the performance evaluation indicators corresponding to the target class model.

In this embodiment, for the specific processing of the acquisition module 701 and the performance evaluation module 702 of the apparatus 700 for evaluating model performance, and the technical effects they bring, reference may be made to the descriptions of steps 201 and 202 in the embodiment corresponding to Figure 2, respectively; they are not repeated here.

The performance evaluation module 702 includes: a model inference unit, configured to process the test samples in the test sample set through the model inference plug-in according to its corresponding plug-in running mode information, to obtain the inference results corresponding to the test samples; and a performance evaluation unit, configured to process the inference results corresponding to the test samples and the labels corresponding to the test samples through the indicator evaluation plug-in according to its corresponding plug-in running mode information, to determine the performance evaluation indicators corresponding to the target class model.

In some examples, if the target class model includes multiple classes of models, the display module is specifically configured to: display a comparison chart of the performance evaluation indicators of the multiple classes of models.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

Figure 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the disclosure described and/or claimed herein.

As shown in Figure 8, the device 800 includes a computing unit 801, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

Multiple components of the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or an optical disk; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver. The communication unit 809 allows the device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The computing unit 801 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 801 performs the methods and processing described above, for example the method for evaluating model performance. For example, in some embodiments, the method for evaluating model performance may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the method for evaluating model performance described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the method for evaluating model performance in any other suitable manner (for example, by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that receives data and instructions from a storage system, at least one input device, and at least one output device, and transmits data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
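As one illustration of such program code, the following is a minimal Python sketch of the two-stage evaluation recited in the claims: a model inference plug-in that parses the test samples, converts them into plug-in input parameters, and produces inference results in a plug-in output format, followed by an indicator evaluation plug-in that compares the inference results with the labels. All names here (`PluginModeInfo`, `run_inference_plugin`, `run_indicator_plugin`, `accuracy`, the JSON sample layout, and the toy threshold model) are hypothetical and chosen only for illustration; the disclosure does not specify a data format, a programming interface, or a particular indicator evaluation algorithm, and starting the image file is not modeled.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class PluginModeInfo:
    """Hypothetical stand-in for the plug-in operation mode information."""
    to_input: Callable[[dict], list]         # parsed sample -> plug-in input parameters
    output_format: Callable[[object], dict]  # raw prediction -> plug-in output format
    metric: Callable[[list, list], float]    # indicator evaluation algorithm


def run_inference_plugin(raw_samples, model, mode):
    """Parse the samples, traverse them, and return formatted inference results."""
    parsed = [json.loads(s) for s in raw_samples]  # parse the test samples
    results = []
    for sample in parsed:                          # traverse the parsed samples
        params = mode.to_input(sample)             # convert to plug-in input parameters
        results.append(mode.output_format(model(params)))
    return parsed, results


def run_indicator_plugin(parsed, results, mode):
    """Extract labels and predictions, then apply the indicator evaluation algorithm."""
    labels = [s["label"] for s in parsed]
    preds = [r["prediction"] for r in results]
    return mode.metric(labels, preds)


def accuracy(labels, preds):
    """Fraction of predictions equal to their labels (assumed example indicator)."""
    correct = sum(1 for l, p in zip(labels, preds) if l == p)
    return correct / len(labels) if labels else 0.0


def evaluate(raw_samples, model, mode):
    """Run the inference plug-in, then the indicator evaluation plug-in."""
    parsed, results = run_inference_plugin(raw_samples, model, mode)
    return run_indicator_plugin(parsed, results, mode)


# Toy data and model, invented purely for this sketch.
SAMPLES = [
    '{"features": [0.9], "label": 1}',
    '{"features": [0.2], "label": 0}',
    '{"features": [0.7], "label": 0}',
    '{"features": [0.1], "label": 0}',
]


def toy_model(params):
    """Threshold classifier standing in for the target class model."""
    return int(params[0] > 0.5)


MODE = PluginModeInfo(
    to_input=lambda s: s["features"],
    output_format=lambda y: {"prediction": y},
    metric=accuracy,
)

SCORE = evaluate(SAMPLES, toy_model, MODE)
print(SCORE)
```

In this sketch, swapping the `metric` callable or the `to_input` and `output_format` converters changes the evaluation flow without touching the model code, which is the kind of per-scenario switching the plug-in operation mode information is meant to allow.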

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other.

Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.

It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.

The specific embodiments described above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims

1. A method for evaluating model performance, comprising:

in response to receiving a test request for a target class model, obtaining a model performance evaluation plug-in and an image file corresponding to the target class model, wherein the image file includes a test sample set and plug-in operation mode information; and

starting the image file, and evaluating the test sample set with the model performance evaluation plug-in according to the plug-in operation mode information, to determine a performance evaluation indicator corresponding to the target class model.

2. The method according to claim 1, wherein the model performance evaluation plug-in includes a model inference plug-in and an indicator evaluation plug-in; and

evaluating the test sample set with the model performance evaluation plug-in according to the plug-in operation mode information to determine the performance evaluation indicator corresponding to the target class model comprises:

processing, by the model inference plug-in according to its corresponding plug-in operation mode information, the test samples in the test sample set to obtain inference results corresponding to the test samples; and

processing, by the indicator evaluation plug-in according to its corresponding plug-in operation mode information, the inference results corresponding to the test samples and labels corresponding to the test samples, to determine the performance evaluation indicator corresponding to the target class model.

3. The method according to claim 1 or 2, wherein the plug-in operation mode information includes: plug-in operation environment information, plug-in operation logic information and plug-in operation parameter information.

4. The method according to claim 3, wherein the plug-in operation parameter information includes plug-in input parameters and a plug-in output format; and

processing, by the model inference plug-in according to its corresponding plug-in operation mode information, the test samples in the test sample set to obtain the inference results corresponding to the test samples comprises:

executing, by the model inference plug-in according to its corresponding plug-in operation logic information, the following operations on the test samples in the test sample set:

parsing the test samples in the test sample set to obtain parsed test samples;

traversing the parsed test samples and converting the parsed test samples into the plug-in input parameters; and

processing the plug-in input parameters to obtain inference results that match the plug-in output format.

5. The method according to claim 3, wherein the plug-in operation parameter information includes an indicator evaluation algorithm; and

processing, by the indicator evaluation plug-in according to its corresponding plug-in operation mode information, the inference results corresponding to the test samples and the labels corresponding to the test samples to determine the performance evaluation indicator corresponding to the target class model comprises:

executing, by the indicator evaluation plug-in according to its corresponding plug-in operation logic information, the following operations on the inference results corresponding to the test samples:

parsing the inference results corresponding to the test samples and the labels corresponding to the test samples, respectively, to obtain parsed labels and parsed inference results; and

processing the parsed labels and the parsed inference results according to the indicator evaluation algorithm to determine the performance evaluation indicator corresponding to the target class model.

6. The method according to claim 3, wherein the plug-in operation parameter information includes a display mode; and the method further comprises:

displaying the performance evaluation indicator of the target class model in the display mode.

7. The method according to claim 6, wherein, if the target class model includes multiple class models, displaying the performance evaluation indicator of the target class model in the display mode comprises:

displaying the performance evaluation indicators of the multiple class models in a comparison chart corresponding to the display mode.

8. The method according to claim 1, further comprising:

encapsulating a model service code source file of the target class model, the test sample set, and the plug-in operation mode information into the image file.

9. An apparatus for evaluating model performance, comprising:

an acquisition module configured to, in response to receiving a test request for a target class model, obtain a model performance evaluation plug-in and an image file corresponding to the target class model, wherein the image file includes a test sample set and plug-in operation mode information; and

a performance evaluation module configured to start the image file, and evaluate the test sample set with the model performance evaluation plug-in according to the plug-in operation mode information, to determine a performance evaluation indicator corresponding to the target class model.

10. The apparatus according to claim 9, wherein the model performance evaluation plug-in includes a model inference plug-in and an indicator evaluation plug-in; and

the performance evaluation module comprises:

a model inference unit configured to process, by the model inference plug-in according to its corresponding plug-in operation mode information, the test samples in the test sample set to obtain inference results corresponding to the test samples; and

a performance evaluation unit configured to process, by the indicator evaluation plug-in according to its corresponding plug-in operation mode information, the inference results corresponding to the test samples and labels corresponding to the test samples, to determine the performance evaluation indicator corresponding to the target class model.

11. The apparatus according to claim 9 or 10, wherein the plug-in operation mode information includes: plug-in operation environment information, plug-in operation logic information, and plug-in operation parameter information.

12. The apparatus according to claim 11, wherein the plug-in operation parameter information includes plug-in input parameters and a plug-in output format; and

the model inference unit is specifically configured to execute, by the model inference plug-in according to its corresponding plug-in operation logic information, the following operations on the test samples in the test sample set: parsing the test samples in the test sample set to obtain parsed test samples; traversing the parsed test samples and converting the parsed test samples into the plug-in input parameters; and processing the plug-in input parameters to obtain inference results that match the plug-in output format.

13. The apparatus according to claim 11, wherein the plug-in operation parameter information includes an indicator evaluation algorithm; and

the performance evaluation unit is specifically configured to execute, by the indicator evaluation plug-in according to its corresponding plug-in operation logic information, the following operations on the inference results corresponding to the test samples: parsing the inference results corresponding to the test samples and the labels corresponding to the test samples, respectively, to obtain parsed labels and parsed inference results; and processing the parsed labels and the parsed inference results according to the indicator evaluation algorithm to determine the performance evaluation indicator corresponding to the target class model.

14. The apparatus according to claim 11, wherein the plug-in operation parameter information includes a display mode; and the apparatus further comprises: a display module configured to display the performance evaluation indicator of the target class model in the display mode.

15. The apparatus according to claim 14, wherein, if the target class model includes multiple class models, the display module is specifically configured to: display the performance evaluation indicators of the multiple class models in a comparison chart corresponding to the display mode.

16. The apparatus according to claim 9, further comprising: an encapsulation module configured to encapsulate a model service code source file of the target class model, the test sample set, and the plug-in operation mode information into the image file.

17. An electronic device, comprising:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-8.

18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method according to any one of claims 1-8.

19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.