
CN111814955B - Quantification method and equipment for neural network model and computer storage medium - Google Patents


Info

Publication number: CN111814955B
Authority: CN (China)
Prior art keywords: layer, quantization, neural network, network model, calculation
Legal status: Active (assumed; Google has not performed a legal analysis)
Application number: CN202010568807.0A
Other languages: Chinese (zh)
Other versions: CN111814955A
Inventor: 周旭亚
Current Assignee: Zhejiang Dahua Technology Co Ltd
Original Assignee: Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority application: CN202010568807.0A
Publication of application: CN111814955A
Grant publication: CN111814955B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a quantization method and device for a neural network model, and a computer storage medium. The method comprises: inputting training pictures into the neural network model and performing calculation in a first data type to obtain first input data of each calculation layer in the neural network model; obtaining at least two initial quantization factors for each calculation layer according to at least two algorithms; obtaining, based on the at least two initial quantization factors, at least two sets of quantized second input data for each calculation layer; comparing the correlation between the first input data and each set of second input data in each calculation layer; taking the initial quantization factor corresponding to the second input data with the largest correlation as the final quantization factor of the calculation layer; and inputting the final quantization factor into the neural network model. By calculating the quantization factor of each calculation layer with at least two algorithms and determining the optimal quantization factor of each layer by comparison, the application improves the quantization precision of the whole neural network model.

Description

Quantification method and equipment for neural network model and computer storage medium
Technical Field
The present application relates to the field of neural network technologies, and in particular, to a method and apparatus for quantifying a neural network model, and a computer storage medium.
Background
At present, common neural network quantization methods adopt the same algorithm to quantize the input activation values of all convolution layers and full connection layers. However, because input activation values vary flexibly, using a single algorithm can enlarge the error of a particular layer in the neural network; owing to the feed-forward nature and the complexity of the network structure, this error then grows during inference, and the quantization precision of the neural network model ultimately suffers.
Disclosure of Invention
The application provides a quantization method, equipment and a computer storage medium of a neural network model, which mainly solve the technical problem of how to improve the quantization precision of the neural network model.
In order to solve the above technical problem, the application provides a quantization method of a neural network model, comprising the following steps:
inputting training pictures into the neural network model and performing calculation in a first data type to obtain first input data of each calculation layer in the neural network model;
obtaining at least two initial quantization factors for each calculation layer according to at least two algorithms;
obtaining at least two sets of quantized second input data for each calculation layer based on the at least two initial quantization factors;
comparing the correlation between the first input data and each set of second input data in each calculation layer;
taking the initial quantization factor corresponding to the second input data with the largest correlation as the final quantization factor of the calculation layer;
and inputting the final quantization factor into the neural network model.
According to an embodiment of the present application, the method further includes:
merging a data normalization layer that follows the calculation layer into the calculation layer for calculation.
According to an embodiment of the present application, the calculation layer includes a convolution layer and a full connection layer; the method further comprises the steps of:
and setting the output data type of the previous layer of the convolution layer and the full connection layer as a second data type.
According to an embodiment of the present application, the method further includes:
setting an output data type of a layer preceding the non-calculation layer to a second data type.
According to an embodiment of the present application, the quantization factors include a weight quantization factor and an input quantization factor; the inputting the final quantization factor into the neural network model includes:
And transmitting the input quantization factor of the calculation layer to a previous layer, so that the output data of the previous layer is of a second data type.
According to an embodiment of the present application, the first data type is a floating point type, and the second data type is a fixed point type.
According to an embodiment of the present application, the inputting the final quantization factor into the neural network model further includes:
And calculating according to the weight quantization factor to obtain a quantization weight value, and inputting the quantization weight value into the neural network model.
According to an embodiment of the present application, the inputting the final quantization factor into the neural network model includes:
and converting the offset value of the calculation layer into the output data type of the calculation layer according to the quantization factor.
To solve the above technical problems, the present application provides a terminal device, which includes a memory and a processor coupled to the memory;
the memory is used for storing program data, and the processor is used for executing the program data to realize the quantification method of the neural network model.
In order to solve the above technical problem, the present application further provides a computer storage medium for storing program data, which when executed by a processor, is configured to implement the method for quantifying a neural network model as described above.
The beneficial effects of the application are as follows: inputting training pictures into the neural network model, and calculating a first data type to obtain first input data of each calculation layer in the neural network model; obtaining at least two initial quantization factors for each calculation layer according to at least two algorithms; obtaining at least two second input data quantized by each calculation layer based on at least two initial quantization factors; comparing the correlation of the first input data and each second input data in each calculation layer; taking an initial quantization factor corresponding to the second input data with the largest correlation as a final quantization factor of a calculation layer; the final quantization factor is input into the neural network model. According to the method for quantizing the neural network model, at least two initial quantizing factors of each calculating layer are calculated through at least two algorithms, correlation comparison is carried out on the at least two initial quantizing factors of each calculating layer and first input data of each calculating layer, so that the initial quantizing factor corresponding to second input data with the largest correlation is used as a final quantizing factor, namely the quantizing factor with the optimal precision, the final quantizing factor is input into the neural network model, and the quantizing precision of the whole neural network model is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flowchart of an embodiment of a method for quantifying a neural network model according to the present application;
FIG. 2 is a flow diagram of a prior-art convolution layer and full connection layer quantization operation;
FIG. 3 is a schematic structural diagram of an embodiment of a terminal device provided by the present application;
FIG. 4 is a schematic structural diagram of an embodiment of a computer storage medium provided by the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the prior art, a single algorithm is generally adopted to quantize the input activation values of all convolution layers and full connection layers in the neural network model in order to improve its accuracy. However, because input activation values vary flexibly, quantizing every layer with the same algorithm can enlarge the error of a particular layer in the neural network model; owing to the feed-forward nature and the complexity of the model, this error then grows during inference, and the quantization precision of the neural network model ultimately suffers.
In order to solve the above-mentioned problems, the present application provides a method for quantifying a neural network model, and particularly referring to fig. 1, fig. 1 is a flow chart of an embodiment of a method for quantifying a neural network model according to the present application. The method for quantizing the neural network model in the embodiment can be applied to terminal equipment quantized by the neural network model, and can also be applied to a server with data processing capability. The method for quantifying the neural network model in this embodiment specifically includes the following steps:
S101: and inputting training pictures into the neural network model, and calculating the first data type to obtain first input data of each calculation layer in the neural network model.
In order to quickly train the model and obtain a neural network model with optimal precision, a number of images, for example 100, can be randomly selected from the training images and input into the neural network model, and calculation is performed in the first data type to obtain the first input data of each calculation layer in the neural network model. The first data type is a floating-point type, which a computer uses to approximately represent real numbers. A calculation layer may be a convolution layer or a full connection layer. The first input data are the maximum absolute value of the input data of each convolution layer or full connection layer and the maximum absolute value of its weight data.
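As a concrete illustration of step S101, the per-layer statistics can be gathered as below. This is a minimal Python/NumPy sketch; the function name and the dictionary layout are illustrative assumptions, not part of the patent.

```python
import numpy as np

def collect_first_input_data(layer_inputs, layer_weights):
    """Record, per calculation layer (conv / full connection), the maximum
    absolute value of its float32 input activations and of its weights.
    The dict-of-arrays layout is an illustrative assumption."""
    stats = {}
    for name, activations in layer_inputs.items():
        stats[name] = {
            "input_absmax": float(np.max(np.abs(activations))),
            "weight_absmax": float(np.max(np.abs(layer_weights[name]))),
        }
    return stats
```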
S102: at least two initial quantization factors for each calculation layer are obtained according to at least two algorithms.
In this embodiment, at least two algorithms are adopted to obtain the quantization factors; specifically, two or three algorithms may be used, and the number of algorithms is not limited. As for the kinds of algorithm, those skilled in the art can choose according to the actual situation. For example, the global-scale (global maximum) algorithm and the kl-divergence algorithm can be used to obtain two initial quantization factors for each calculation layer.
In practical application, when the global-scale algorithm and the kl-divergence algorithm are adopted to obtain two initial quantization factors for each calculation layer, the global-scale algorithm yields an initial quantization factor M for each calculation layer, and the kl-divergence algorithm yields an initial quantization factor N for each calculation layer.
The quantization factors include a weight quantization factor and an input quantization factor. Specifically, the global-scale algorithm can be used to obtain a weight quantization factor M1 and an input quantization factor M2 for each calculation layer, and the kl-divergence algorithm can be used to obtain a weight quantization factor N1 and an input quantization factor N2 for each calculation layer.
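Step S102 can be sketched as follows. The global-scale factor maps the absolute maximum onto the top of the signed integer range, while the KL-divergence factor is a simplified entropy-calibration search in the spirit of TensorRT-style calibrators: it picks the clipping threshold whose quantized histogram stays closest, in KL divergence, to the original one. The function names, the int8 default, and the histogram details are assumptions for illustration, not the patent's exact procedure.

```python
import numpy as np

def global_scale_factor(values, num_bits=8):
    # Global maximum algorithm: map the absolute max onto the top int level.
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    absmax = np.max(np.abs(values))
    return absmax / qmax if absmax > 0 else 1.0

def kl_divergence_factor(values, num_bits=8, bins=2048):
    """Simplified KL-divergence calibration sketch."""
    qlevels = 2 ** (num_bits - 1) - 1
    hist, edges = np.histogram(np.abs(values), bins=bins)
    hist = hist.astype(np.float64)
    best_kl, best_i = np.inf, bins
    for i in range(qlevels + 1, bins + 1):
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()              # fold clipped outliers into last bin
        # collapse i bins down to qlevels quantized bins, expand back by averaging
        idx = np.linspace(0, qlevels, i, endpoint=False).astype(int)
        q = np.empty(i)
        for level in range(qlevels):
            mask = idx == level
            q[mask] = p[mask].sum() / mask.sum()
        p /= p.sum()
        q /= q.sum()
        nz = (p > 0) & (q > 0)
        kl = float(np.sum(p[nz] * np.log(p[nz] / q[nz])))
        if kl < best_kl:
            best_kl, best_i = kl, i
    return edges[best_i] / qlevels
```

The KL search never returns a larger scale than the global-maximum scale, since its threshold is at most the absolute maximum of the data.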
S103: at least two second input data quantized for each calculation layer are obtained based on at least two initial quantization factors.
Based on the at least two initial quantization factors of each calculation layer acquired in S102, the initial quantization factors are input into each calculation layer of the convolutional neural network to obtain at least two sets of quantized second input data per layer. Specifically, the input quantization factor and the weight quantization factor under each algorithm are input into each calculation layer, yielding the second input data of each layer under that algorithm. For example, in practical application, the weight quantization factor M1 and the input quantization factor M2 obtained with the global-scale algorithm are input into the convolutional neural network to obtain the second input data under the global-scale algorithm, and the weight quantization factor N1 and the input quantization factor N2 obtained with the kl-divergence algorithm are input to obtain the second input data under the kl-divergence algorithm.
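The quantize-then-dequantize round trip that produces the second input data can be sketched as below, assuming symmetric int8 quantization; the function name is illustrative.

```python
import numpy as np

def fake_quantize(x, scale, num_bits=8):
    """Quantize float data to signed integers with the given scale, then
    dequantize back to float. The round-tripped values play the role of
    the 'second input data' compared against the original activations."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```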
S104: the correlation of the first input data and each second input data in each calculation layer is compared.
Based on the first input data of each calculation layer acquired in S101 and the second input data under each algorithm acquired in S103, the correlation between the first input data and each set of second input data is compared. The correlation indicates how closely the second input data match the first input data: the higher the correlation, the more accurate the neural network model quantized with the corresponding factor.
For example, in practical application, correlation calculation is performed between the second input data obtained under the global-scale algorithm and the first input data to obtain a correlation C, and between the second input data obtained under the kl-divergence algorithm and the first input data to obtain a correlation C'; the correlations C and C' are then compared.
S105: and taking the initial quantization factor corresponding to the second input data with the largest correlation as the final quantization factor of the calculation layer.
Based on the correlation magnitudes compared in S104, the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor of the calculation layer. For example, when the correlation C is greater than the correlation C', the initial quantization factor M corresponding to the second input data of correlation C, that is, the weight quantization factor M1 and the input quantization factor M2, is taken as the final quantization factor.
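Steps S104 and S105 together amount to picking, per layer, the candidate whose round-tripped data best match the float data. A sketch, using cosine similarity as the correlation measure (the patent does not fix a specific measure, and the function name and dictionary layout are assumptions):

```python
import numpy as np

def pick_best_factor(first_input, candidates):
    """candidates maps an algorithm name to (quantization factor,
    second input data). Returns the name and factor whose dequantized
    data correlate most strongly with the original float data."""
    a = first_input.ravel()
    best_name, best_corr = None, -np.inf
    for name, (factor, second) in candidates.items():
        b = second.ravel()
        # cosine similarity; the small epsilon guards against zero vectors
        corr = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        if corr > best_corr:
            best_name, best_corr = name, corr
    return best_name, candidates[best_name][0]
```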
S106: the final quantization factor is input into the neural network model.
Based on the final quantization factor acquired in S105, the final quantization factor is input into the neural network model to acquire a neural network model in which the inference process is accelerated. For example, the weight quantization factor M1 and the input quantization factor M2 corresponding to the second input data of the correlation C are input into the neural network model.
In this embodiment, training pictures are input into the neural network model, calculation is performed in a first data type, and the first input data of each calculation layer are obtained; at least two initial quantization factors are obtained for each calculation layer according to at least two algorithms; at least two sets of quantized second input data are obtained for each calculation layer based on the initial quantization factors; the correlation between the first input data and each set of second input data is compared in each calculation layer; the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor of the layer; and the final quantization factor is input into the neural network model. By computing at least two initial quantization factors per layer with at least two algorithms and comparing the correlation between the resulting second input data and the first input data of each layer, the quantization factor with optimal precision is obtained, improving the quantization precision of the whole neural network model.
Further, in order to effectively mitigate problems such as vanishing and exploding gradients when training a neural network model, the application merges the data normalization layer that follows a calculation layer into that calculation layer for calculation. The data normalization layer may be a BN (Batch Normalization) layer or a scale layer. A BN layer is generally placed after the convolution layer and can accelerate network convergence and control overfitting; however, during inference the operations of the BN or scale layer affect the performance of the neural network model and occupy excessive memory or video memory.
In a specific embodiment, to avoid the above problems caused by BN or scale layers during calculation, when the neural network is actually trained, if network segments such as conv+bn+scale or conv+bn exist, the weights of the BN and scale layers can be folded into the convolution layer, reducing the computation of the data normalization layers and their occupation of excessive memory or video memory. The specific manner of merging is not limited in this embodiment.
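The conv+bn(+scale) folding described above can be sketched as follows for a standard BatchNorm that follows a convolution; the argument layout is an assumption for illustration.

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm (and scale) layer that follows a convolution into
    the convolution's weights and bias, so inference can skip the BN op.
    w: (out_ch, in_ch, kh, kw) conv weights, b: (out_ch,) conv bias;
    gamma/beta/mean/var: per-output-channel BN parameters."""
    std = np.sqrt(var + eps)
    w_folded = w * (gamma / std)[:, None, None, None]
    b_folded = (b - mean) * gamma / std + beta
    return w_folded, b_folded
```

Algebraically, the folded convolution computes gamma/std * (conv(x) + b - mean) + beta, which is exactly the BN output.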
In order that the output data entering the convolution layers, full connection layers, and non-calculation layers of the convolutional neural network model are of the second data type, thereby saving one data-type conversion, the output data type of the layer preceding each convolution layer or full connection layer is set to the second data type, and likewise for the layer preceding each non-calculation layer. A non-calculation layer is a layer of the neural network model that involves no computation, for example a permute layer or a concat layer. The second data type is a fixed-point type.
Specifically, in order to make the output data of the previous layer of the convolutional layer or the fully connected layer of the neural network model be the second data type, the embodiment transmits the input quantization factor of the convolutional layer or the fully connected layer in the neural network model to the previous layer of the convolutional layer or the fully connected layer, so that the output data of the previous layer of the convolutional layer or the fully connected layer is the second data type, namely the fixed point type.
For the mode of inputting the weight quantization factor into the neural network model, in this embodiment, the quantization weight value is obtained by calculating according to the weight quantization factor, and finally, the quantization weight value is input into the neural network model, so as to improve the quantization accuracy of the neural network model.
Referring to FIG. 2, FIG. 2 is a schematic flow chart of a prior-art quantization operation for convolution and full connection layers. In the prior art, inverse quantization is performed on the intermediate data so that its data type matches the offset (bias) data type, but this increases the amount of calculation and occupies excessive memory. To keep the precision uniform during the operation and simplify data conversion, this embodiment converts the offset value of a calculation layer into the layer's output data type according to the quantization factor whenever a convolution layer or full connection layer has an offset value. Specifically, by converting the float32 offset value into an int32 offset value of the output data type, the int32 intermediate data can be summed directly with the int32 offset value, avoiding the inverse quantization of the intermediate data, reducing data operations, and lowering conversion overhead.
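The offset (bias) conversion can be sketched as below: with symmetric quantization, the int32 accumulator of an int8 convolution carries the combined scale input_scale * weight_scale, so rounding the float32 bias by that combined scale lets it be added directly to the accumulator. The function name and argument names are illustrative assumptions.

```python
import numpy as np

def quantize_bias(bias_fp32, input_scale, weight_scale):
    """Convert a float32 bias to int32 in the accumulator's scale so it can
    be added directly to the int32 convolution accumulator, skipping the
    inverse-quantization step on the intermediate data."""
    return np.round(bias_fp32 / (input_scale * weight_scale)).astype(np.int32)
```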
In this embodiment, training pictures are input into the neural network model, calculation is performed in a first data type, and the first input data of each calculation layer are obtained; at least two initial quantization factors are obtained for each calculation layer according to at least two algorithms; at least two sets of quantized second input data are obtained based on the initial quantization factors; the correlation between the first input data and each set of second input data is compared in each calculation layer; the initial quantization factor corresponding to the second input data with the largest correlation is taken as the final quantization factor of the layer; and the final quantization factor is input into the neural network model. By calculating the quantization factors with at least two algorithms and comparing the correlation between the second input data obtained under each algorithm and the first input data of each calculation layer, the quantization factor with optimal precision can be obtained, improving the quantization precision of the whole neural network model. Further, merging the data normalization layer into the calculation layer reduces the data normalization computation and its occupation of excessive memory or video memory; setting the output data type of the layers preceding the convolution layers, full connection layers, and non-calculation layers to the second data type, according to the characteristics of each output layer, reduces the data-movement overhead of each layer; and converting the offset value of a calculation layer into the layer's output data type avoids the inverse quantization of intermediate data, reduces data operations, lowers conversion overhead, and improves the inference speed of the neural network model.
In order to implement the method for quantifying the neural network model according to the above embodiment, another terminal device is provided in the present application, and referring specifically to fig. 3, fig. 3 is a schematic structural diagram of an embodiment of the terminal device provided in the present application.
The device 300 comprises a memory 31 and a processor 32, wherein the memory 31 and the processor 32 are coupled.
The memory 31 is used for storing program data, and the processor 32 is used for executing the program data to implement the quantization method of the neural network model of the above embodiment.
In this embodiment, the processor 32 may also be referred to as a CPU (Central Processing Unit). The processor 32 may be an integrated circuit chip having signal processing capabilities. The processor 32 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The general-purpose processor may be a microprocessor, or the processor 32 may be any conventional processor or the like.
The present application also provides a computer storage medium 400, as shown in fig. 4, where the computer storage medium 400 is configured to store program data 41, and the program data 41, when executed by a processor, is configured to implement a method for quantifying a neural network model according to an embodiment of the method of the present application.
The method described in the embodiments of the quantization method of the neural network model of the present application may, when implemented in the form of a software functional unit and sold or used as a stand-alone product, be stored in a device such as a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes using the descriptions and the drawings of the present application or directly or indirectly applied to other related technical fields are included in the scope of the present application.

Claims (10)

1. A method for quantifying a neural network model, the method comprising:
Inputting training pictures into the neural network model, and calculating a first data type to obtain first input data of each calculation layer in the neural network model;
obtaining at least two initial quantization factors of each calculation layer according to at least two algorithms, wherein the at least two algorithms comprise a global maximum algorithm and a KL divergence algorithm;
Obtaining at least two second input data quantized by each calculation layer based on the at least two initial quantization factors;
comparing the correlation of the first input data and each second input data in each calculation layer;
taking an initial quantization factor corresponding to the second input data with the largest correlation as a final quantization factor of the calculation layer;
the final quantization factor is input to the neural network model.
2. The quantization method according to claim 1, further comprising:
And merging the data normalization layer after the calculation layer to the calculation layer for calculation.
3. The quantization method according to claim 1, wherein the computation layer comprises a convolution layer and a full connection layer; the method further comprises the steps of:
and setting the output data type of the previous layer of the convolution layer and the full connection layer as a second data type.
4. A quantization method according to claim 3, characterized in that the method further comprises:
the output data type of the previous layer of the non-calculation layer is set to the second data type.
5. A quantization method according to claim 3, characterized in that the quantization factors comprise a weight quantization factor and an input quantization factor; the inputting the final quantization factor into the neural network model includes:
And transmitting the input quantization factor of the calculation layer to a previous layer, so that the output data of the previous layer is of a second data type.
6. The quantization method according to claim 5, wherein the first data type is a floating point type and the second data type is a fixed point type.
7. The quantization method according to claim 5, wherein said inputting the final quantization factor into the neural network model further comprises:
calculating a quantized weight value according to the weight quantization factor, and inputting the quantized weight value into the neural network model.
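The weight-quantization step of claim 7 can be sketched as follows, assuming a symmetric signed 8-bit scheme (the claim itself does not fix the bit width or symmetry):

```python
import numpy as np

def quantize_weights(w_fp32, weight_scale, n_bits=8):
    # Map floating-point weights to signed fixed-point codes using the
    # per-layer weight quantization factor (scale).
    qmax = 2 ** (n_bits - 1) - 1
    q = np.clip(np.round(w_fp32 / weight_scale), -qmax - 1, qmax)
    return q.astype(np.int8)
```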
8. The quantization method according to claim 1, wherein said inputting the final quantization factor into the neural network model comprises:
converting the bias value of a calculation layer into the output data type of that calculation layer according to the quantization factor.
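One common convention for the bias conversion in claim 8 — an assumption here, since the claim does not specify the accumulator scheme — is that an int32 accumulator sums products scaled by input_scale × weight_scale, so the bias is re-expressed in that combined scale before being added:

```python
import numpy as np

def convert_bias(bias_fp32, input_scale, weight_scale):
    # Re-express the floating-point bias in the combined fixed-point
    # scale of the accumulator (input_scale * weight_scale).
    return np.round(bias_fp32 / (input_scale * weight_scale)).astype(np.int32)
```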
9. A terminal device, comprising a memory and a processor coupled to the memory;
wherein the memory is configured to store program data, and the processor is configured to execute the program data to implement the quantization method for a neural network model according to any one of claims 1 to 8.
10. A computer storage medium for storing program data which, when executed by a processor, implements the quantization method for a neural network model according to any one of claims 1 to 8.
CN202010568807.0A 2020-06-19 2020-06-19 Quantification method and equipment for neural network model and computer storage medium Active CN111814955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568807.0A CN111814955B (en) 2020-06-19 2020-06-19 Quantification method and equipment for neural network model and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568807.0A CN111814955B (en) 2020-06-19 2020-06-19 Quantification method and equipment for neural network model and computer storage medium

Publications (2)

Publication Number Publication Date
CN111814955A CN111814955A (en) 2020-10-23
CN111814955B true CN111814955B (en) 2024-05-31

Family

ID=72845327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568807.0A Active CN111814955B (en) 2020-06-19 2020-06-19 Quantification method and equipment for neural network model and computer storage medium

Country Status (1)

Country Link
CN (1) CN111814955B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022088063A1 (en) * 2020-10-30 2022-05-05 华为技术有限公司 Method and apparatus for quantizing neural network model, and method and apparatus for processing data
CN114021691A (en) * 2021-10-13 2022-02-08 山东浪潮科学研究院有限公司 Neural network model quantification method, system, device and computer readable medium
CN114021719B (en) * 2021-11-29 2025-04-22 重庆赛迪奇智人工智能科技有限公司 Quantitative model optimization method, device, equipment and computer storage medium
CN114611677A (en) * 2022-03-25 2022-06-10 山东云海国创云计算装备产业创新中心有限公司 Deep learning network model training method, system, equipment and medium
CN115600655A (en) * 2022-10-17 2023-01-13 Zhejiang Dahua Technology Co., Ltd. Method, equipment and storage medium for positioning conversion precision of neural network model
CN116402112B (en) * 2023-04-14 2025-10-31 安谋科技(中国)有限公司 Data processing method, electronic device and medium
CN116702859A (en) * 2023-06-05 2023-09-05 中汽创智科技有限公司 Model quantification method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800316A (en) * 2012-08-30 2012-11-28 重庆大学 Optimal codebook design method for voiceprint recognition system based on nerve network
CN108665067A (en) * 2018-05-29 2018-10-16 北京大学 Compression method and system for deep neural network frequent transmission
CN110020717A (en) * 2017-12-08 2019-07-16 三星电子株式会社 Method and apparatus for generating fixed point neural network
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 A method for full INT8 fixed-point quantization of convolutional neural networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11755880B2 (en) * 2018-03-09 2023-09-12 Canon Kabushiki Kaisha Method and apparatus for optimizing and applying multilayer neural network model, and storage medium
US11948074B2 (en) * 2018-05-14 2024-04-02 Samsung Electronics Co., Ltd. Method and apparatus with neural network parameter quantization

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800316A (en) * 2012-08-30 2012-11-28 重庆大学 Optimal codebook design method for voiceprint recognition system based on nerve network
CN110020717A (en) * 2017-12-08 2019-07-16 三星电子株式会社 Method and apparatus for generating fixed point neural network
CN108665067A (en) * 2018-05-29 2018-10-16 北京大学 Compression method and system for deep neural network frequent transmission
CN111260022A (en) * 2019-11-22 2020-06-09 中国电子科技集团公司第五十二研究所 A method for full INT8 fixed-point quantization of convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Survey of Deep Neural Network Model Compression; Geng Lili; Journal of Frontiers of Computer Science and Technology; Vol. 14, No. 9; pp. 1441-1454 *

Also Published As

Publication number Publication date
CN111814955A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN111814955B (en) Quantification method and equipment for neural network model and computer storage medium
US11373087B2 (en) Method and apparatus for generating fixed-point type neural network
Polino et al. Model compression via distillation and quantization
CN110413255B (en) Artificial neural network adjusting method and device
CN109002889B (en) Adaptive iterative convolution neural network model compression method
US12468946B2 (en) Method and apparatus with neural network parameter quantization
CN111937011B (en) A method and device for determining weight parameters of a neural network model
CN110059733A (en) The optimization and fast target detection method, device of convolutional neural networks
CN112149797B (en) Neural network structure optimization method and device and electronic equipment
US12229668B2 (en) Operation method and apparatus for network layer in deep neural network
CN110992935A (en) Computing systems for training neural networks
WO2018068421A1 (en) Method and device for optimizing neural network
US11704555B2 (en) Batch normalization layer fusion and quantization method for model inference in AI neural network engine
CN111950715A (en) 8-bit integer full-quantization inference method and device based on self-adaptive dynamic shift
CN116992946B (en) Model compression method, device, storage medium and program product
US20240004952A1 (en) Hardware-Aware Mixed-Precision Quantization
KR20220040234A (en) Method and apparatus for quantizing parameters of neural network
CN116472538A (en) Method and system for quantizing neural networks
CN114580625A (en) Method, apparatus, and computer-readable storage medium for training neural network
CN120952084A (en) Large model compression methods, apparatus, task processing methods, equipment and storage media
KR20230000686A (en) Electronic device and controlling method of electronic device
US20250252311A1 (en) System and method for adaptation of containers for floating-point data for training of a machine learning model
US20240086678A1 (en) Method and information processing apparatus for performing transfer learning while suppressing occurrence of catastrophic forgetting
CN112668702B (en) Fixed-point parameter optimization method, system, terminal and storage medium
CN119856181A (en) De-sparsifying convolution for sparse tensors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant