CN114819159A - Inference method, apparatus, device and storage medium for a deep learning model
- Publication number: CN114819159A (application CN202210404848.5A)
- Authority: CN (China)
- Prior art keywords: inference, data set, scaling factor, inferred, result
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/046—Forward inferencing; Production systems
- G06N5/047—Pattern matching networks; Rete networks
Abstract
Description
Technical Field
The present application relates to the technical field of deep learning networks, and in particular to an inference method, apparatus, device and storage medium for a deep learning model.
Background
At present, deep learning is widely used across industries and has achieved remarkable results in areas that traditional algorithms struggle to solve. One obstacle to wider deployment, however, is its enormous running cost: even though the computing power of GPUs (graphics processing units) has improved considerably, deep learning models whose parameter counts grow year by year consume the entire dividend of that improvement. Optimizing the performance of the model itself is therefore key to whether deep learning can be applied in large-scale production.
Quantization is one method of model performance optimization. A key step in quantization is converting float32-precision input into int8 integer form. This conversion introduces a corresponding error, and most current solutions cannot adjust that error in real time, resulting in a large overall error that degrades the final product metrics.
Summary of the Invention
The present application provides an inference method, apparatus, device and storage medium for a deep learning model, to solve the prior-art problem that quantization during model performance optimization leads to a large overall error.
In a first aspect, the present application provides an inference method for a deep learning model, including:
obtaining a data set to be inferred;
inputting the data set to be inferred into an inference convolution kernel to obtain a floating-point data inference result;
obtaining an output scaling factor corresponding to the data set to be inferred, where the output scaling factor is determined according to the maximum value among the elements of the data inference result;
calculating the product of the data inference result and the output scaling factor to obtain a quantization result; and
continuing the inference of the deep learning model according to the quantization result.
Optionally, obtaining the output scaling factor corresponding to the data set to be inferred includes:
determining the largest element among the elements of the data inference result; and
calculating the quotient of a preset value and the largest element to obtain the output scaling factor, where the preset value is the upper limit of the value range corresponding to the data type of the data set to be inferred.
Optionally, inputting the data set to be inferred into the inference convolution kernel to obtain the floating-point data inference result includes:
obtaining a first scaling factor corresponding to the data set to be inferred;
obtaining a weight corresponding to the inference convolution kernel and a second scaling factor corresponding to the weight;
obtaining a bias of the inference convolution kernel; and
calculating the floating-point data inference result according to the data set to be inferred, the first scaling factor, the weight, the second scaling factor and the bias.
Optionally, calculating the floating-point data inference result according to the data set to be inferred, the first scaling factor, the weight, the second scaling factor and the bias includes:
calculating the product of the data set to be inferred and the weight to obtain a first intermediate result;
dividing the first intermediate result by the first scaling factor and the second scaling factor to obtain a second intermediate result; and
calculating the sum of the second intermediate result and the bias to obtain the floating-point data inference result.
Optionally, obtaining the first scaling factor corresponding to the data set to be inferred includes:
obtaining a floating-point input data set corresponding to the data set to be inferred; and
dividing the data set to be inferred by the floating-point input data set to obtain the first scaling factor.
Optionally, obtaining the data set to be inferred includes:
determining, from a target video, the data set to be inferred corresponding to a current video frame;
and obtaining the output scaling factor corresponding to the data set to be inferred includes:
judging whether a scene switch occurs in the current video frame relative to the previous video frame, to obtain a judgment result; and
determining, according to the judgment result, the output scaling factor corresponding to the data set to be inferred.
Optionally, determining the output scaling factor corresponding to the data set to be inferred according to the judgment result includes:
if the judgment result indicates that a scene switch has occurred, determining the largest element among the elements of the data inference result, and calculating the quotient of a preset value and the largest element to obtain the output scaling factor, where the preset value is the upper limit of the value range corresponding to the data type of the data set to be inferred; and
if the judgment result indicates that no scene switch has occurred, using the output scaling factor corresponding to the previous input data set of the data set to be inferred as the output scaling factor of the data set to be inferred.
In a second aspect, the present application provides an inference apparatus for a deep learning model, including:
a first obtaining module, configured to obtain a data set to be inferred;
a first inference module, configured to input the data set to be inferred into an inference convolution kernel to obtain a floating-point data inference result;
a second obtaining module, configured to obtain an output scaling factor corresponding to the data set to be inferred, where the output scaling factor is determined according to the maximum value among the elements of the data inference result;
a quantization module, configured to calculate the product of the data inference result and the output scaling factor to obtain a quantization result; and
a second inference module, configured to continue the inference of the deep learning model according to the quantization result.
In a third aspect, the present application provides an electronic device, including a processor, a memory and a communication bus, where the processor and the memory communicate with each other through the communication bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory, to implement the inference method for a deep learning model of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the inference method for a deep learning model of the first aspect.
Compared with the prior art, the technical solutions provided by the embodiments of the present application have the following advantage. For each data set to be inferred, the method obtains an output scaling factor matched to that data set, dynamically generated from the actual input at inference time, and multiplies it with the floating-point data inference result produced by inputting the data set into the inference convolution kernel. This achieves a balance between the speed and the quality of int8 integer quantization: the method improves the inference efficiency of the deep learning network model while guaranteeing inference quality, effectively mitigating the problem of a large overall error.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain its principles.
FIG. 1 is a schematic diagram of mapping floating-point data to integer data according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of an inference method for a deep learning model provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for obtaining a floating-point data inference result provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an int8 integer convolution kernel provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an inference apparatus for a deep learning model provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings.
The term "and/or" herein merely describes an association between related objects, indicating that three relationships are possible: for example, "A and/or B" can mean that A exists alone, that A and B exist together, or that B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.
In the training process of deep learning, the current mainstream approach is to train with float32 data for the sake of overall target performance, where "float" denotes the floating-point data type and the number 32 after it denotes the number of bits occupied. For model deployment, however, methods such as int8 integer quantization are applied to speed up inference for production efficiency, where "int" denotes the integer data type and the number 8 after it denotes the number of bits occupied. For the mapping from floating point to integer, the mainstream approach is to use a scaling coefficient: as shown in FIG. 1, the floating-point data is multiplied by this coefficient to expand it into the integer interval [-128, 127], and the scaled floating-point value is then rounded to obtain the final int8 integer value. For example, the floating-point value -0.5 maps to the int8 value -128; the floating-point value 1.5 maps to the int8 value 127; the floating-point value 0 maps to the int8 value -Z; and the floating-point value S·Z maps to the int8 value 0.
This int8 integer value is subsequently used as the input to the inference convolution kernel (inference conv kernel), to improve the inference efficiency of the model.
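The float-to-int8 mapping described above (multiply by a scaling coefficient, round, clamp to [-128, 127]) can be sketched as follows. This is a minimal illustration with the zero point omitted; the function name `quantize_to_int8` is a hypothetical helper, not from the patent.

```python
import numpy as np

def quantize_to_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Map float32 data to int8: multiply by the scaling coefficient,
    round, and clamp to the int8 interval [-128, 127]."""
    q = np.round(x * scale)
    return np.clip(q, -128, 127).astype(np.int8)

# A scale of 127.0 sends values in roughly [-1, 1] across most of the int8 range.
x = np.array([-0.5, 0.0, 0.7], dtype=np.float32)
print(quantize_to_int8(x, scale=127.0))
```

Values that scale beyond the interval are clamped to the boundary, which is where the precision loss discussed below originates.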
In practical applications, however, although the input data type is mapped to int8 to guarantee overall speed, the inference convolution kernel internally retains floating-point results to preserve a certain level of precision, for example in the accumulation after the multiplications and in the bias addition. To guarantee that the output is int8, the floating-point data inference result must then pass through a ReLU function and a round operation, and a value between 0 and 1 may be rounded to 0. The method provided by the prior art therefore sacrifices precision for speed, resulting in a large overall error.
To solve the prior-art problem that quantization causes a large overall error during model performance optimization, an embodiment of the present application provides an inference method for a deep learning model. The method can be applied to a server or a cloud, where the server or cloud stores a deep learning network model containing an inference convolution kernel.
As shown in FIG. 2, the inference method for a deep learning model provided by an embodiment of the present application includes the following steps.
Step 201: obtain a data set to be inferred.
The data set to be inferred may be a matrix mapped from floating-point data to an integer data type; the specific mapping may follow the prior-art method described above. Each floating-point element of the floating-point input matrix is multiplied by a first scaling factor to map it to an integer element, yielding an integer matrix that serves as the data set to be inferred, i.e., the input to the inference convolution kernel.
For example, for an int8 integer convolution kernel running on an NVIDIA GPU, the components of the convolution kernel are the input (activation), the weight and the bias; in practice, the activation is the data set to be inferred.
Step 202: input the data set to be inferred into the inference convolution kernel to obtain a floating-point data inference result.
Specifically, as shown in FIG. 3, this step mainly includes the following sub-steps.
Step 301: obtain the first scaling factor corresponding to the data set to be inferred.
The actual input data set is of floating-point type; to speed up inference, the floating-point input data set is converted into an integer input data set, which serves as the data set to be inferred. Multiplying the floating-point input data set by the first scaling factor yields the data set to be inferred; conversely, when the floating-point input data set corresponding to the data set to be inferred is known, dividing the data set to be inferred by the floating-point input data set yields the first scaling factor.
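Step 301 can be sketched as follows. The patent states only the relation S_I = I_Q / I_float; choosing S_I so that the largest-magnitude element lands on 127 is an assumption made here for illustration, and `first_scaling_factor` is a hypothetical name.

```python
import numpy as np

def first_scaling_factor(float_input: np.ndarray) -> float:
    # Assumed choice: pick S_I so the largest-magnitude float element maps to 127.
    return 127.0 / np.abs(float_input).max()

x_float = np.array([[0.5, -2.0], [1.0, 0.25]], dtype=np.float32)
s_i = first_scaling_factor(x_float)            # 127 / 2.0 = 63.5
x_q = np.round(x_float * s_i).astype(np.int8)  # the data set to be inferred
# Consistent with the text: dividing a quantized element by its float counterpart
# recovers S_I up to rounding, e.g. x_q[0, 1] / x_float[0, 1] == -127 / -2.0 == 63.5.
```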
Step 302: obtain the weight corresponding to the inference convolution kernel and the second scaling factor corresponding to the weight.
Each inference convolution kernel corresponds to a weight, which is usually fixed for a given kernel; the second scaling factor corresponding to the weight is likewise fixed.
Step 303: obtain the bias of the inference convolution kernel.
The bias is another important parameter of the inference convolution kernel, and it too is fixed for a given kernel.
Step 304: calculate the floating-point data inference result according to the data set to be inferred, the first scaling factor, the weight, the second scaling factor and the bias.
Specifically, given the data set to be inferred, the first scaling factor, the weight, the second scaling factor and the bias, the floating-point data inference result is computed according to the operations inside the inference convolution kernel. Taking an int8 integer convolution kernel running on an NVIDIA GPU as an example, the computation includes the following.
Calculate the product of the data set to be inferred and the weight to obtain a first intermediate result; divide the first intermediate result by the first scaling factor and the second scaling factor to obtain a second intermediate result; and calculate the sum of the second intermediate result and the bias to obtain the floating-point data inference result.
Assuming the input is I_Q, the weight is W_Q, the first scaling factor corresponding to the input is S_I, the second scaling factor corresponding to the weight is S_W, and the bias is B, the floating-point data inference result is I_Q × W_Q / S_I / S_W + B; at this point, the output inference result is of floating-point type.
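The kernel arithmetic just described, I_Q × W_Q / S_I / S_W + B, can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions (int32 accumulation of the int8 products, which is common practice), not NVIDIA's actual kernel; the function name is hypothetical.

```python
import numpy as np

def kernel_float_result(i_q: np.ndarray, w_q: np.ndarray,
                        s_i: float, s_w: float, bias: float) -> np.ndarray:
    """Compute I_Q * W_Q / S_I / S_W + B: the int8 multiply-accumulate is done
    in int32 to avoid overflow, then rescaled to float before adding the bias."""
    acc = i_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) / s_i / s_w + bias

i_q = np.array([[10, -20]], dtype=np.int8)
w_q = np.array([[3], [4]], dtype=np.int8)
out = kernel_float_result(i_q, w_q, s_i=2.0, s_w=5.0, bias=1.0)
# (10*3 + (-20)*4) / 2 / 5 + 1 = (30 - 80) / 10 + 1 = -4.0
```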
Step 203: obtain the output scaling factor corresponding to the data set to be inferred, where the output scaling factor is determined according to the maximum value among the elements of the data inference result.
Specifically, determine the largest element among the elements of the data inference result of the data set to be inferred, and calculate the quotient of a preset value and the largest element to obtain the output scaling factor.
In a concrete implementation, the preset value may be set to 127. Taking an input matrix as the input data set, the data inference result is also a matrix; if the maximum value among its matrix elements is MAX_output, the output scaling factor is S_O = 127 / MAX_output.
Computing the output scaling factor from the maximum matrix element guarantees that the other matrix elements also fit within the required range.
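The formula S_O = 127 / MAX_output and the subsequent requantization (step 204) can be sketched together; the helper names are hypothetical, and the rounding/clamping convention follows the mapping described for FIG. 1.

```python
import numpy as np

def output_scaling_factor(result: np.ndarray, preset: float = 127.0) -> float:
    """S_O = preset / MAX_output, computed dynamically from the
    floating-point inference result of the current data set."""
    return preset / result.max()

def requantize(result: np.ndarray) -> np.ndarray:
    """Step 204: multiply by the dynamic output scaling factor,
    then round and clamp back to int8."""
    s_o = output_scaling_factor(result)
    return np.clip(np.round(result * s_o), -128, 127).astype(np.int8)

r = np.array([[0.5, 2.0]], dtype=np.float32)
# S_O = 127 / 2.0 = 63.5, so the largest element maps exactly to 127.
```

Because the factor is derived from the current result's own maximum, the largest element always lands on the int8 upper bound, which is why the other elements are guaranteed to fit.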
In the embodiments of the present application, the output scaling factor matches the data set to be inferred, rather than reusing the same coefficient for every inference as in the prior art. Remapping the floating-point data inference result to int8 through this output scaling factor guarantees both the inference efficiency and the inference quality of the deep learning model.
Step 204: calculate the product of the data inference result and the output scaling factor to obtain a quantization result.
Multiplying the floating-point data inference result by the output scaling factor maps the floating-point output to int8; int8 data supports faster inference than float32.
Step 205: continue the inference of the deep learning model according to the quantization result.
The subsequent inference process may differ between deep learning models; for example, it may involve a ReLU function and a round operation.
To further highlight the difference between the inference method provided by the embodiments of the present application and the prior art, the prior-art inference method is described as follows.
A data set is taken, and the entire quantization process is fine-tuned (finetune) on it to obtain a coefficient that performs well on that data set. When the model is deployed, this coefficient is reused for every inference. The problem with this approach is that the data encountered during actual inference may have a very different distribution from the data set used for fine-tuning. In that case, the coefficient obtained from the fine-tuning data set can no longer be used, and using it anyway directly degrades the quality of the inference.
In the embodiments of the present application, by contrast, for each data set to be inferred an output scaling factor matched to that data set can be obtained, dynamically generated from the actual input, and multiplied with the floating-point data inference result produced by the inference convolution kernel. This achieves a balance between the speed and quality of int8 integer quantization: the method improves the inference efficiency of the deep learning network model while guaranteeing inference quality, effectively mitigating the problem of a large overall error.
To facilitate understanding of the inventive concept of the embodiments, again take an int8 integer convolution kernel running on an NVIDIA GPU as an example. For a deep learning network model, the components of the convolution kernel are the input (activation), the weight and the bias, where the activation in practice is the data set to be inferred, i.e., the target input matrix. To guarantee overall speed while maintaining a reasonable level of precision, NVIDIA's int8 kernel implementation keeps the final bias-addition step in floating-point form.
Since the input and the weight are of type int8, each has its own scaling coefficient, defined here as S_I (the first scaling factor) and S_W (the second scaling factor), respectively; the int8-form input and weight are defined as I_Q and W_Q, and the floating-point bias as B. The final int8-form output also has a corresponding scaling coefficient (the output scaling factor), defined as S_O. Through the inference process shown in FIG. 4, an int8-form output is obtained. The inference process of the convolution kernel can be expressed as: (I_Q × W_Q / S_I / S_W + B) × S_O (1), which expands to: I_Q × W_Q × S_O / S_I / S_W + B × S_O (2).
However, the scaling coefficient corresponding to the final output cannot be known in advance. Changing perspective: if the output is produced in floating-point form rather than int8, the output scaling coefficient is no longer needed at that stage, and computing it dynamically becomes possible.
To achieve this goal, first set S_O to 1, so that formula (1) becomes:
I_Q × W_Q / S_I / S_W + B (3)
The result of formula (3) is taken as the floating-point data inference result. The scaling coefficient S_W corresponding to the weight is fixed for the int8 integer convolution kernel, while the scaling coefficient S_I corresponding to the input can be calculated.
After the floating-point data inference result is obtained, the largest matrix element among its elements is determined, and the quotient of the preset value and that largest element gives the output scaling factor.
In a concrete implementation, the preset value may be set to 127; if the maximum value among the matrix elements of the data inference result is MAX_output, the output scaling factor is S_O = 127 / MAX_output.
In practical applications, for the same video, when the characteristics of consecutive inputs change little (for example, little color change and a similar style), the output scaling factor can continue to be reused rather than recomputed for every inference, which greatly shortens the time required.
In practical applications, to further improve precision, whether a scene switch has occurred between two consecutive video frames can be used to decide whether to reuse the output scaling factor.
Specifically, if a scene switch occurs, the largest element among the elements of the data inference result is determined, and the quotient of the preset value and the largest element gives the output scaling factor, where the preset value is the upper limit of the value range corresponding to the data type of the data set to be inferred; for example, the value range of the int8 data type is -128 to 127, so the preset value is 127. If no scene switch occurs, the output scaling factor corresponding to the previous input data set, i.e., the input data set determined from the frame preceding the current video frame, is used as the output scaling factor of the data set to be inferred.
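The per-frame decision just described can be sketched as follows. Scene-change detection itself is left abstract, as in the source; the function name and the handling of the very first frame (no previous factor yet) are assumptions.

```python
import numpy as np

def output_scale_for_frame(result: np.ndarray, scene_switched: bool,
                           previous_scale=None) -> float:
    """Recompute S_O = 127 / max(result) on a scene switch (or for the first
    frame); otherwise reuse the previous frame's factor to save inference time."""
    if scene_switched or previous_scale is None:
        return 127.0 / result.max()
    return previous_scale

frame_result = np.array([0.3, 2.0], dtype=np.float32)
s_o = output_scale_for_frame(frame_result, scene_switched=True)  # 127 / 2.0 = 63.5
```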
此外,还需要说明的是,在具体实现实现,每一帧视频对应一个输入矩阵,该输入矩阵即为输入数据集,该输入矩阵可以是表征该视频帧特征的矩阵,并对该输入矩阵的数据类型进行数据类型转换,转换后的结果作为推理卷积内核的输入。In addition, it should be noted that, in the specific implementation, each frame of video corresponds to an input matrix, the input matrix is the input data set, the input matrix can be a matrix characterizing the characteristics of the video frame, and the input matrix The data type is converted to the data type, and the converted result is used as the input of the inference convolution kernel.
The method, mentioned in the embodiments of the present application, for determining from two adjacent video frames whether a scene switch has occurred may be any existing scene-change detection method, which is not limited here.
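Since the embodiments leave the scene-change detector open, the following is only one hedged illustration (the threshold and bin count are arbitrary choices, not values from the patent): a normalized histogram-difference test over two grayscale frames.

```python
import numpy as np

def scene_switched(prev_frame, curr_frame, threshold=0.5, bins=32):
    """Return True when the intensity histograms of two frames differ enough.

    The distance is half the L1 difference of the normalized histograms,
    which lies in [0, 1]; 0 means identical intensity distributions.
    """
    h1, _ = np.histogram(prev_frame, bins=bins, range=(0, 256))
    h2, _ = np.histogram(curr_frame, bins=bins, range=(0, 256))
    h1 = h1 / max(h1.sum(), 1)  # normalize to a probability distribution
    h2 = h2 / max(h2.sum(), 1)
    distance = 0.5 * float(np.abs(h1 - h2).sum())
    return distance > threshold
```

A hard cut between a dark and a bright scene drives the distance toward 1, while consecutive frames of the same scene stay near 0.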
In the embodiments of the present application, the output scaling factor is determined according to whether the scene has switched: if a scene switch occurs, the scaling factor is re-determined according to the data set to be inferred; if no scene switch occurs, the scaling factor of the previous input data set is simply reused in order to minimize the time spent on inference.
Based on the same concept, an embodiment of the present application provides an inference apparatus for a deep learning model. For the specific implementation of the apparatus, reference may be made to the description of the method embodiments, and repeated details are not given again. As shown in FIG. 5, the apparatus mainly includes:
a first obtaining module 501, configured to obtain a data set to be inferred;
a first inference module 502, configured to input the data set to be inferred into an inference convolution kernel to obtain a floating-point data inference result;
a second obtaining module 503, configured to obtain an output scaling factor corresponding to the data set to be inferred, where the output scaling factor is determined according to the maximum value of the elements in the data inference result;
a quantization module 504, configured to compute the product of the data inference result and the output scaling factor to obtain a quantization result; and
a second inference module 505, configured to continue the inference of the deep learning model according to the quantization result.
In the embodiments of the present application, for different data sets to be inferred, an output scaling factor matching each data set can be obtained: the output scaling factor is generated dynamically from the data set to be inferred that is input during actual use, and is then multiplied by the floating-point data inference result obtained by feeding that data set into the inference convolution kernel. This achieves a balance between the speed and the quality of int8 integer quantization; that is, the method provided by the embodiments of the present application not only improves the inference efficiency of a deep learning network model but also guarantees inference quality, effectively alleviating the problem of a relatively large overall error.
In a specific embodiment, the second obtaining module 503 is configured to determine the maximum element among the elements of the data inference result of the data set to be inferred, and to compute the quotient of a preset value and the maximum element to obtain the output scaling factor.
In a specific embodiment, the first inference module 502 is configured to obtain a first scaling factor corresponding to the data set to be inferred; obtain the weights corresponding to the inference convolution kernel and a second scaling factor corresponding to the weights; obtain the bias of the inference convolution kernel; and compute the floating-point data inference result according to the data set to be inferred, the first scaling factor, the weights, the second scaling factor, and the bias.
In a specific embodiment, the first inference module 502 is configured to compute the product of the data set to be inferred and the weights to obtain a first intermediate result; divide the first intermediate result by the first scaling factor and the second scaling factor to obtain a second intermediate result; and compute the sum of the second intermediate result and the bias to obtain the floating-point data inference result.
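The two intermediate results just described can be sketched as follows (a minimal illustration, assuming a matrix multiplication in place of the actual convolution kernel; all names are hypothetical):

```python
import numpy as np

def float_inference_result(x_q, w_q, s1, s2, bias):
    """Recover the floating-point data inference result from integer operands.

    x_q: quantized input data set, w_q: quantized kernel weights,
    s1/s2: input and weight scaling factors, bias: kernel bias.
    """
    first_intermediate = x_q.astype(np.int32) @ w_q.astype(np.int32)  # x * w
    second_intermediate = first_intermediate / (s1 * s2)  # undo both scalings
    return second_intermediate + bias  # floating-point inference result
```

For instance, with x_q = [[2]], w_q = [[3]], s1 = 2, s2 = 3, and bias = 1, the integer product 6 is divided by 6 and the bias added, giving 2.0.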
In a specific embodiment, the first inference module 502 is configured to obtain a floating-point input data set corresponding to the data set to be inferred, and to divide the data set to be inferred by the floating-point input data set to obtain the first scaling factor.
In a specific embodiment, the first obtaining module 501 is configured to determine, from a target video, the data set to be inferred corresponding to the current video frame; and the second obtaining module 503 is configured to obtain the output scaling factor corresponding to the data set to be inferred by judging whether a scene switch has occurred between the current video frame and the previous video frame to obtain a judgment result, and determining, according to the judgment result, the output scaling factor corresponding to the data set to be inferred.
In a specific embodiment, the second obtaining module 503 is configured to: if the judgment result indicates that a scene switch has occurred, determine the maximum element among the elements of the data inference result, and compute the quotient of a preset value and the maximum element to obtain the output scaling factor, where the preset value is the upper limit of the value range corresponding to the data type of the data set to be inferred; and, if the judgment result indicates that no scene switch has occurred, take the output scaling factor corresponding to the previous input data set of the data set to be inferred as the output scaling factor of the data set to be inferred.
In the embodiments of the present application, the output scaling factor is determined according to whether the scene has switched: if a scene switch occurs, the scaling factor is re-determined according to the data set to be inferred; if no scene switch occurs, the scaling factor of the previous input data set is simply reused in order to minimize the time spent on inference.
Based on the same concept, an embodiment of the present application further provides an electronic device. As shown in FIG. 6, the electronic device mainly includes a processor 601, a memory 602, and a communication bus 603, where the processor 601 and the memory 602 communicate with each other through the communication bus 603. The memory 602 stores a program executable by the processor 601, and the processor 601 executes the program stored in the memory 602 to implement the following steps:
obtaining a data set to be inferred;
inputting the data set to be inferred into an inference convolution kernel to obtain a floating-point data inference result;
obtaining an output scaling factor corresponding to the data set to be inferred, where the output scaling factor is determined according to the maximum value of the elements in the data inference result;
computing the product of the floating-point data inference result and the output scaling factor to obtain a quantization result; and
continuing the inference of the deep learning model according to the quantization result.
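Assembled end to end, the steps above might be sketched as follows. This is a hedged illustration only: the integer convolution is replaced by a matrix product, and the rounding, clipping, and use of the maximum absolute value are assumptions rather than details taken from the patent.

```python
import numpy as np

def quantized_inference_step(x_float, w_q, s2, bias,
                             prev_factor=None, scene_switched=True,
                             preset=127.0):
    """One step of the method: quantize the input, run the integer kernel,
    rescale to float, then quantize the output with a dynamic factor."""
    # First scaling factor: map the float input onto the int8 range.
    s1 = preset / float(np.max(np.abs(x_float)))
    x_q = np.clip(np.round(x_float * s1), -128, 127).astype(np.int8)

    # Integer kernel (sketched as a matmul), rescaled, plus the bias.
    y_float = (x_q.astype(np.int32) @ w_q.astype(np.int32)) / (s1 * s2) + bias

    # Output scaling factor: recompute on a scene switch, else reuse.
    if scene_switched or prev_factor is None:
        out_factor = preset / float(np.max(np.abs(y_float)))
    else:
        out_factor = prev_factor

    # Quantization result = float result x output scaling factor.
    y_q = np.clip(np.round(y_float * out_factor), -128, 127).astype(np.int8)
    return y_q, out_factor
```

When no scene switch is reported, the previous factor is reused unchanged, which is exactly the time saving the embodiments describe.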
The communication bus 603 mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 603 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in FIG. 6, but this does not mean that there is only one bus or only one type of bus.
The memory 602 may include a random access memory (RAM) or a non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor 601.
The above processor 601 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), or the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present application, a computer-readable storage medium is further provided. The computer-readable storage medium stores a computer program which, when run on a computer, causes the computer to execute the inference method for a deep learning model described in the above embodiments.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared or microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or data center, integrating one or more available media. The available media may be magnetic media (such as floppy disks, hard disks, or magnetic tapes), optical media (such as DVDs), semiconductor media (such as solid-state drives), and the like.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above descriptions are only specific embodiments of the present invention, enabling those skilled in the art to understand or implement the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features claimed herein.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210404848.5A CN114819159B (en) | 2022-04-18 | 2022-04-18 | Reasoning method, device, equipment and storage medium of deep learning model |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210404848.5A CN114819159B (en) | 2022-04-18 | 2022-04-18 | Reasoning method, device, equipment and storage medium of deep learning model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114819159A true CN114819159A (en) | 2022-07-29 |
| CN114819159B CN114819159B (en) | 2024-11-26 |
Family
ID=82536055
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210404848.5A Active CN114819159B (en) | 2022-04-18 | 2022-04-18 | Reasoning method, device, equipment and storage medium of deep learning model |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114819159B (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190050710A1 (en) * | 2017-08-14 | 2019-02-14 | Midea Group Co., Ltd. | Adaptive bit-width reduction for neural networks |
| CN112733964A (en) * | 2021-02-01 | 2021-04-30 | 西安交通大学 | Convolutional neural network quantification method for reinforcement learning automatic perception weight distribution |
| CN114065900A (en) * | 2020-07-30 | 2022-02-18 | 华为技术有限公司 | Data processing method and data processing device |
-
2022
- 2022-04-18 CN CN202210404848.5A patent/CN114819159B/en active Active
Non-Patent Citations (2)
| Title |
|---|
| PIERRE-EMMANUEL NOVAC et al.: "Quantization and Deployment of Deep Neural Networks on Microcontrollers", Sensors, 23 April 2021 (2021-04-23), pages 1-32 * |
| WANG Zifeng: "Development and Algorithm Research of an Artificial Intelligence Chip Software Stack", China Master's Theses Full-text Database, Information Science and Technology, no. 02, 15 February 2021 (2021-02-15), pages 135-397 * |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116011569A (en) * | 2023-03-28 | 2023-04-25 | 山东浪潮科学研究院有限公司 | Quantization error debugging method, device, equipment and storage medium |
| CN116011569B (en) * | 2023-03-28 | 2023-07-18 | 山东浪潮科学研究院有限公司 | Quantization error debugging method, device, equipment and storage medium |
| CN116579400A (en) * | 2023-05-19 | 2023-08-11 | 北京百度网讯科技有限公司 | Quantification method, data processing method and device for deep learning model |
| CN116579400B (en) * | 2023-05-19 | 2024-02-23 | 北京百度网讯科技有限公司 | Quantization method, data processing method and device of deep learning model |
| CN119204212A (en) * | 2024-09-05 | 2024-12-27 | 上海天数智芯半导体有限公司 | Large language model inference method, device, electronic device and storage medium |
| CN120822621A (en) * | 2025-09-17 | 2025-10-21 | 吉林大学 | Active perception method, device and storage medium for sparse urban data |
| CN120822621B (en) * | 2025-09-17 | 2025-11-25 | 吉林大学 | Active sensing method, device and storage medium for sparse city data |
Also Published As
| Publication number | Publication date |
|---|---|
| CN114819159B (en) | 2024-11-26 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114819159A (en) | Inference method, device, equipment and storage medium of deep learning model | |
| CN112085183B (en) | Neural network operation method and device and related products | |
| US11249721B2 (en) | Multiplication circuit, system on chip, and electronic device | |
| CN113313243A (en) | Method, device and equipment for determining neural network accelerator and storage medium | |
| TWI796286B (en) | A training method and training system for a machine learning system | |
| US11853594B2 (en) | Neural network computing chip and computing method | |
| CN111860841B (en) | Optimization methods, devices, terminals and storage media for quantitative models | |
| CN111160516B (en) | A convolutional layer sparsification method and device of a deep neural network | |
| GB2554167B (en) | Approximating functions | |
| TW202138999A (en) | Data dividing method and processor for convolution operation | |
| CN115237991B (en) | Methods and apparatus for data format conversion and matrix processing | |
| CN110837885B (en) | Sigmoid function fitting method based on probability distribution | |
| JP2020067897A (en) | Arithmetic processing unit, learning program, and learning method | |
| CN115237992B (en) | Data format conversion method and device and matrix processing method and device | |
| CN116877284A (en) | Engine torque compensation method, engine torque compensation device, electronic equipment and storage medium | |
| CN109146060B (en) | A method and device for processing data based on convolutional neural network | |
| WO2019114044A1 (en) | Image processing method and device, electronic apparatus, and computer readable storage medium | |
| CN107368596A (en) | A kind of method and device of Bloom filter query set element | |
| CN115238236B (en) | Data processing method, device, electronic equipment, medium and chip | |
| CN118036751A (en) | Model reasoning method, device, electronic device, storage medium and program product | |
| CN115470885A (en) | Adaptive weight bit width quantization method and device based on statistical analysis | |
| WO2019205064A1 (en) | Neural network acceleration apparatus and method | |
| CN115829872A (en) | Method, device and equipment for enhancing image contrast and storage medium | |
| CN114722343A (en) | Clutch position signal filtering method and device, storage medium and terminal | |
| CN113902928A (en) | Image feature extraction method and device and electronic equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |