CN110097186B - A Neural Network Heterogeneous Quantization Training Method - Google Patents
A Neural Network Heterogeneous Quantization Training Method
Info
- Publication number
- CN110097186B (application CN201910354693.7A)
- Authority
- CN
- China
- Prior art keywords
- training
- quantization
- data
- neural network
- acceleration module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Feedback Control In General (AREA)
Abstract
The invention provides a neural network heterogeneous quantization training method, belonging to the technical field of artificial neural networks. On the basis of a conventional training architecture based on a CPU, a GPU, or a combination of the two, the invention adds high-speed interface logic and connects a hardware computing acceleration module through the high-speed interface logic. One or more specific computation steps in the middle of the training process are offloaded to the hardware computing acceleration module; after the computation is completed, the result is returned to the original training host through the high-speed interface logic, completing a training process with specific customized functions. Cutting-edge new structures or new algorithms can thus be quickly implemented and deployed into training, improving system flexibility, reducing storage and bandwidth requirements, reducing resource requirements during forward prediction, lowering training complexity, improving training efficiency, and ensuring that the current training apparatus adapts well to the latest neural network structures.
Description
Technical Field
The invention relates to the technical field of artificial neural networks, and in particular to a neural network heterogeneous quantization training method.
Background
Neural network training feeds a set of training samples into the network and adjusts the weights according to the difference between the actual output of the network and the expected output. The training process includes defining the structure of the neural network and the output of forward propagation, computing the error between the result and the expected value, propagating the error back layer by layer, and then updating the weights. The network weights are thus adjusted using the training samples and the expected values.
The CPU excels at logic control, serial operation, and general-purpose data operations, while the GPU focuses on handling many tasks of large-scale parallel computation. CPUs and GPUs can each complete tasks efficiently in their respective fields and together form the mainstream approach to current neural network training.
As research deepens, more and more new structures and new algorithms are continually being proposed, placing higher requirements and challenges on general-purpose CPU and GPU training methods: specific detailed structures are difficult to implement quickly, and training time may become even longer.
Summary of the Invention
In order to solve the above technical problems, the present invention proposes a neural network heterogeneous quantization training method. A heterogeneous approach is used to accelerate the original training process, so that cutting-edge new structures, such as special convolution types, or new algorithms, such as model parameter quantization, can be quickly implemented and deployed into training, improving system flexibility, reducing storage and bandwidth requirements, reducing resource requirements during forward prediction, lowering training complexity, improving training efficiency, and ensuring that the current training apparatus adapts well to the latest neural network structures.
The technical solution of the present invention is as follows:
A neural network heterogeneous quantization training method: on the basis of a conventional training architecture based on a CPU, a GPU, or a combination of the two, high-speed interface logic is added, a hardware quantization acceleration module is connected through the high-speed interface logic, and a quantization step is added to the training process. The quantization computation of the model parameters and the feature map results is offloaded to the hardware quantization acceleration module; the results of the completed quantization computation are returned to the original training host through the high-speed interface logic, the quantized model parameters are updated, and the training process with model parameter and feature map quantization is completed iteratively.
Further, the hardware quantization acceleration module is responsible for the low-bit quantization of the neural network model parameters and the neural network feature map results; it is implemented by a dedicated circuit and forms a heterogeneous structure with the conventional training host CPU or GPU.
Further, the data quantization operations include: data buffering, statistical sorting of data, data compression and decompression, data hashing and table lookup, conversion of floating-point numbers to fixed-point numbers of a specified bit width, shifting, scaling and truncation of floating-point numbers, and inverse quantization of data.
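As an illustration of the shift, scale, clip, and inverse-quantization operations listed above, the following is a minimal Python/NumPy sketch; the bit widths, rounding choices, and function names are illustrative assumptions and are not specified by the patent.

```python
import numpy as np

def quantize_fixed_point(x, num_bits=8, frac_bits=4):
    """Shift/scale a float array to signed fixed point with `frac_bits`
    fractional bits, clipping to the representable maximum and minimum."""
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x * scale), qmin, qmax).astype(np.int32)

def dequantize_fixed_point(q, frac_bits=4):
    """Inverse quantization: map fixed-point values back to floats."""
    return q.astype(np.float32) / (2 ** frac_bits)

w = np.random.randn(4, 4).astype(np.float32)
w_hat = dequantize_fixed_point(quantize_fixed_point(w))
print(np.abs(w - w_hat).max())  # rounding error is at most 2**-(frac_bits+1) inside the clip range
```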
The specific customized functions include but are not limited to: model parameter quantization, conversion of floating-point numbers to fixed-point numbers, and special convolution operations such as dilated convolution, depth-wise convolution, 1x1 multiplier arrays, and fully connected multiply-accumulate arrays. The specific customized functions are implemented by the hardware computing acceleration module, which may be enabled multiple times or only once during the training process to complete a specific function.
Specifically, the method includes the following steps:
1) Under a conventional CPU- or GPU-based training framework, set the initial values of the neural network model parameters and hyperparameters, initialize the hardware quantization acceleration module at the same time, and start training;
2) After the first round of backpropagation has updated the parameters of the last layer of the neural network, pass the updated weight parameters into the hardware quantization acceleration module, where they are first compressed and stored using a general-purpose data compression method such as GZIP or entropy coding; the data are then statistically sorted, further shifted and truncated according to the desired fixed-point bit width, and clipped to maximum and minimum values, yielding the quantized weight parameters. These are returned to the conventional framework, which continues the backpropagation update of the preceding layer's parameters, until the first round of backpropagation is completed and all weight parameters have been obtained (a sketch of this quantization pipeline follows this list);
3) Repeat step 2 to update the weights and complete multiple rounds of backpropagation until the model loss requirement is met, at which point training is complete;
4) In addition to the weight parameters, the feature map results of each layer of the neural network can also be quantized, so as to further quantize the inference of the entire model;
5) As needed, the weight data can be hashed to obtain index values for further compression, or the data can be inverse-quantized immediately after quantization to reduce quantization loss.
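The weight-quantization pipeline of step 2 (compress and store, statistically sort, then shift and truncate to a fixed-point range) could be modelled on the host as in the sketch below. GZIP stands in for any general-purpose codec, and the percentile-based clipping bounds are an assumption, since the patent does not fix how the statistical sorting selects the range.

```python
import gzip
import numpy as np

def accelerator_quantize_layer(weights, num_bits=8):
    """Host-side model of what the hardware quantization module does to one
    layer's updated weights: initial compression and storage, statistical
    sorting to bound the value range, then shift/scale and truncation."""
    # Initial compression of the incoming weights (GZIP as a generic codec).
    stored = gzip.compress(weights.astype(np.float32).tobytes())

    # Statistical sorting: choose clipping bounds from the sorted values
    # (here the 1st/99th percentiles) to limit the maximum and minimum.
    lo, hi = np.percentile(weights, [1, 99])
    clipped = np.clip(weights, lo, hi)

    # Shift/scale to the desired fixed-point bit width and truncate.
    qmax = 2 ** (num_bits - 1) - 1
    scale = qmax / max(abs(lo), abs(hi), 1e-12)
    quantized = np.round(clipped * scale).astype(np.int8)
    return quantized, scale, stored

q, s, blob = accelerator_quantize_layer(np.random.randn(64, 64))
```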
The hardware computing acceleration module is implemented with an FPGA or ACAP through logic configuration and is connected to an external non-volatile memory device; different customized functions can be stored at the same time, and the FPGA or ACAP is configured in real time according to the training requirements to complete different functions within the same training process.
The high-speed interface logic is implemented by means including, but not limited to, a PCIE interface, a USB 3.0 interface, or a 10 Gigabit Ethernet interface, and communicates with the original training host.
The beneficial effects of the present invention are as follows:
A heterogeneous approach is used to accelerate the original training process, so that cutting-edge new structures, such as special convolution types, or new algorithms, such as model parameter quantization, can be quickly implemented and deployed into training, improving system flexibility, reducing storage and bandwidth requirements, reducing resource requirements during forward prediction, lowering training complexity, improving training efficiency, and ensuring that the current training apparatus adapts well to the latest neural network structures.
Detailed Description of Embodiments
In order to make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the neural network heterogeneous quantization training method of the present invention, on the basis of a conventional training architecture based on a CPU, a GPU, or a combination of the two, high-speed interface logic is added, a hardware quantization acceleration module is connected through the high-speed interface logic, and a quantization step is added to the training process. The quantization computation of the model parameters and the feature map results is offloaded to the hardware quantization acceleration module; the results of the completed quantization computation are returned to the original training host through the high-speed interface logic, the quantized model parameters are updated, and the training process with model parameter and feature map quantization is completed iteratively.
The hardware quantization acceleration module is responsible for the low-bit quantization of the neural network model parameters and the neural network feature map results; it is implemented by a dedicated circuit and forms a heterogeneous structure with the conventional training host CPU or GPU. The data quantization operations include: data buffering, statistical sorting of data, data compression and decompression, data hashing and table lookup, conversion of floating-point numbers to fixed-point numbers of a specified bit width, shifting, scaling and truncation of floating-point numbers, and inverse quantization of data.
The method includes the following steps:
1) Under a conventional CPU- or GPU-based training framework, set the initial values of the neural network model parameters and hyperparameters, initialize the hardware quantization acceleration module at the same time, and start training;
2) After the first round of backpropagation has updated the parameters of the last layer of the neural network, pass the updated weight parameters into the hardware quantization acceleration module, where they are first compressed and stored using a general-purpose data compression method such as GZIP or entropy coding; the data are then statistically sorted, further shifted and truncated according to the desired fixed-point bit width, and clipped to maximum and minimum values, yielding the quantized weight parameters. These are returned to the conventional framework, which continues the backpropagation update of the preceding layer's parameters, until the first round of backpropagation is completed and all weight parameters have been obtained;
3) Repeat step 2 to update the weights and complete multiple rounds of backpropagation until the model loss requirement is met, at which point training is complete;
4) In addition to the weight parameters, the feature map results of each layer of the neural network can also be quantized, so as to further quantize the inference of the entire model;
5) As needed, the weight data can be hashed to obtain index values for further compression (see the sketch below), or the data can be inverse-quantized immediately after quantization to reduce quantization loss.
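Step 5's hashing of weights into index values plus a lookup table can be pictured with the following sketch; the value-range bucketing hash and the use of bucket means as table entries are illustrative assumptions rather than the patent's prescribed hash function.

```python
import numpy as np

def hash_index_compress(weights, num_buckets=16):
    """Map each weight to a small index via a value-range hash and keep one
    representative value per bucket, so only the indices and a short lookup
    table need to be stored."""
    w = weights.ravel()
    edges = np.linspace(w.min(), w.max(), num_buckets + 1)
    idx = np.clip(np.digitize(w, edges) - 1, 0, num_buckets - 1)
    table = np.array([w[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(num_buckets)])
    return idx.astype(np.uint8).reshape(weights.shape), table

def hash_index_decompress(idx, table):
    """Table lookup, i.e. the inverse of the hashing/indexing step."""
    return table[idx]

idx, table = hash_index_compress(np.random.randn(8, 8))
restored = hash_index_decompress(idx, table)
```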
The hardware computing acceleration module is implemented with an FPGA or ACAP through logic configuration and is connected to an external non-volatile memory device; different customized functions can be stored at the same time, and the FPGA or ACAP is configured in real time according to the training requirements to complete different functions within the same training process. The high-speed interface logic is implemented by means including, but not limited to, a PCIE interface, a USB 3.0 interface, or a 10 Gigabit Ethernet interface, and communicates with the original training host.
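Putting the pieces together, the end-to-end flow of the embodiment — ordinary host-side forward/backward passes with the quantization step offloaded after each weight update — could look like the following toy sketch. The single-layer model, learning rate, and the `quantize_on_accelerator` stand-in for the module behind the high-speed interface are all illustrative assumptions, not the patent's API.

```python
import numpy as np

def quantize_on_accelerator(w, num_bits=8):
    """Stand-in for the hardware quantization module reached over the
    high-speed interface (hypothetical helper, not the patent's API)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = qmax / (np.abs(w).max() + 1e-12)
    return np.round(w * scale) / scale  # quantize, then inverse-quantize for continued training

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
y = x @ rng.normal(size=(8, 1))
w = rng.normal(size=(8, 1))
for epoch in range(200):
    err = x @ w - y                      # forward pass and error against the expected output
    grad = x.T @ err / len(x)            # backpropagated gradient
    w -= 0.1 * grad                      # ordinary weight update on the host
    w = quantize_on_accelerator(w)       # quantization step offloaded to the module
print(float(np.mean((x @ w - y) ** 2)))  # training loss after quantization-aware updates
```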
The above descriptions are only preferred embodiments of the present invention; they are intended only to illustrate the technical solution of the present invention and not to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention is included within the protection scope of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354693.7A CN110097186B (en) | 2019-04-29 | 2019-04-29 | A Neural Network Heterogeneous Quantization Training Method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910354693.7A CN110097186B (en) | 2019-04-29 | 2019-04-29 | A Neural Network Heterogeneous Quantization Training Method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110097186A CN110097186A (en) | 2019-08-06 |
CN110097186B true CN110097186B (en) | 2023-04-18 |
Family
ID=67446342
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910354693.7A Active CN110097186B (en) | 2019-04-29 | 2019-04-29 | A Neural Network Heterogeneous Quantization Training Method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110097186B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111582476B (en) * | 2020-05-09 | 2024-08-02 | 北京百度网讯科技有限公司 | Automatic quantization strategy searching method, device, equipment and storage medium |
CN111598237B (en) * | 2020-05-21 | 2024-06-11 | 上海商汤智能科技有限公司 | Quantization training, image processing method and device, and storage medium |
CN112258377A (en) * | 2020-10-13 | 2021-01-22 | 国家计算机网络与信息安全管理中心 | Method and equipment for constructing robust binary neural network |
CN114692861A (en) * | 2020-12-28 | 2022-07-01 | 华为技术有限公司 | Computation graph updating method, computation graph processing method and related equipment |
CN112308215B (en) * | 2020-12-31 | 2021-03-30 | 之江实验室 | Intelligent training acceleration method and system based on data sparse characteristic in neural network |
CN113033784A (en) * | 2021-04-18 | 2021-06-25 | 沈阳雅译网络技术有限公司 | Method for searching neural network structure for CPU and GPU equipment |
CN114611697B (en) * | 2022-05-11 | 2022-09-09 | 上海登临科技有限公司 | Neural network quantification and deployment method, system, electronic device and storage medium |
CN116451757B (en) * | 2023-06-19 | 2023-09-08 | 山东浪潮科学研究院有限公司 | Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model |
CN116911350B (en) * | 2023-09-12 | 2024-01-09 | 苏州浪潮智能科技有限公司 | Quantification method based on graph neural network model, task processing method and task processing device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280514A (en) * | 2018-01-05 | 2018-07-13 | 中国科学技术大学 | Sparse neural network acceleration system based on FPGA and design method |
CN109635936A (en) * | 2018-12-29 | 2019-04-16 | 杭州国芯科技股份有限公司 | A kind of neural networks pruning quantization method based on retraining |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8131659B2 (en) * | 2008-09-25 | 2012-03-06 | Microsoft Corporation | Field-programmable gate array based accelerator system |
- 2019-04-29: CN application CN201910354693.7A filed; patent CN110097186B granted (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108280514A (en) * | 2018-01-05 | 2018-07-13 | 中国科学技术大学 | Sparse neural network acceleration system based on FPGA and design method |
CN109635936A (en) * | 2018-12-29 | 2019-04-16 | 杭州国芯科技股份有限公司 | A kind of neural networks pruning quantization method based on retraining |
Non-Patent Citations (2)
Title |
---|
A Concise and Efficient Method for Accelerating Convolutional Neural Networks; Liu Jinfeng; Science Technology and Engineering; 2014-11-28 (No. 33); full text *
Design of a GPU-Based Parallel Quasi-Newton Neural Network Training Algorithm; Liu Qiang et al.; Journal of Hohai University (Natural Sciences); 2018-09-25 (No. 05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110097186A (en) | 2019-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110097186B (en) | A Neural Network Heterogeneous Quantization Training Method | |
Mills et al. | Communication-efficient federated learning for wireless edge intelligence in IoT | |
CN110378468B (en) | A neural network accelerator based on structured pruning and low-bit quantization | |
CN113315604B (en) | Adaptive gradient quantization method for federated learning | |
CN111339027B (en) | Reconfigurable artificial intelligence core and automatic design method for heterogeneous multi-core chips | |
CN113595993B (en) | A joint learning method for vehicle sensing equipment based on model structure optimization under edge computing | |
CN111182582B (en) | Multitask distributed unloading method facing mobile edge calculation | |
CN107330515A (en) | A device and method for performing forward operation of artificial neural network | |
CN108304928A (en) | Compression method based on the deep neural network for improving cluster | |
WO2020238237A1 (en) | Power exponent quantization-based neural network compression method | |
CN111158912B (en) | Task unloading decision method based on deep learning in cloud and fog collaborative computing environment | |
CN109635935A (en) | Depth convolutional neural networks model adaptation quantization method based on the long cluster of mould | |
Li et al. | Anycostfl: Efficient on-demand federated learning over heterogeneous edge devices | |
CN105553937A (en) | System and method for data compression | |
CN111199740B (en) | Unloading method for accelerating automatic voice recognition task based on edge calculation | |
CN112733863B (en) | Image feature extraction method, device, equipment and storage medium | |
CN116962176B (en) | A distributed cluster data processing method, device, system and storage medium | |
CN111488981A (en) | A method for selecting the parameter sparsity threshold of deep network based on Gaussian distribution estimation | |
Struharik et al. | Conna–compressed cnn hardware accelerator | |
CN108985444A (en) | A kind of convolutional neural networks pruning method inhibited based on node | |
CN116306912A (en) | A Model Parameter Update Method for Distributed Federated Learning | |
CN109978144A (en) | A kind of model compression method and system | |
CN110782396B (en) | Light-weight image super-resolution reconstruction network and reconstruction method | |
CN111260049A (en) | A neural network implementation method based on domestic embedded system | |
CN114997382B (en) | Low hardware overhead convolution computing structure and calculation method for lightweight neural networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20230320
Address after: 250000 Building S02, No. 1036, Langchao Road, High-tech Zone, Jinan City, Shandong Province
Applicant after: Shandong Inspur Scientific Research Institute Co., Ltd.
Address before: 250100 First Floor of R&D Building, 2877 Kehang Road, Sun Village Town, Jinan High-tech Zone, Shandong Province
Applicant before: JINAN INSPUR HIGH-TECH TECHNOLOGY DEVELOPMENT Co., Ltd.
TA01 | Transfer of patent application right | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20250717
Address after: 250000 Shandong Province, Jinan City, China (Shandong) Free Trade Pilot Zone, Shunhua Road Street, Inspur Road 1036, Building S01, 5th Floor
Patentee after: Yuanqixin (Shandong) Semiconductor Technology Co., Ltd.
Country or region after: China
Address before: 250000 Building S02, No. 1036, Langchao Road, High-tech Zone, Jinan City, Shandong Province
Patentee before: Shandong Inspur Scientific Research Institute Co., Ltd.
Country or region before: China
TR01 | Transfer of patent right |