
CN111814448A - Pre-trained language model quantization method and device - Google Patents


Info

  • Publication number: CN111814448A (granted; other version CN111814448B)
  • Application number: CN202010636126.3A
  • Authority: CN (China)
  • Original language: Chinese (zh)
  • Prior art keywords: quantization, data, model, quantized, language model
  • Inventors: 俞凯, 赵梓涵, 刘韫聪, 陈露, 刘奇, 马娆
  • Applicant/Assignee: AI Speech Ltd
  • Legal status: Active (application granted)


Classifications

    • G06F40/205: Handling natural language data; natural language analysis; parsing
    • G06F18/23213: Pattern recognition; non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. k-means clustering
    • G06F18/24: Pattern recognition; classification techniques
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a pre-trained language model quantization method and device. The method includes: fine-tuning the pre-trained language model on a downstream task for the first time; using k-means clustering to cluster the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum in the compressed target model; and fine-tuning the quantized model a second time on the downstream task while maintaining quantization, finally obtaining the quantized network. The solution provided by the embodiments of the present application shows that the effect of improving the underlying quantization scheme on quantization performance has been greatly underestimated and overlooked; it also shows that simple k-means quantization, without any additional tricks, can achieve very good compression, indicating that the k-means compression method has great room for development and broad application prospects.

Description

Pre-trained language model quantization method and device

Technical Field

The invention belongs to the field of language model quantization, and in particular relates to a pre-trained language model quantization method and device.

Background Art

In the prior art, several quantization methods for pre-trained language models already exist, including 8-bit fixed-precision quantization and Hessian-matrix-based mixed-precision quantization.

8-bit fixed-precision quantization: all layers of the model that need to be quantized are quantized to 8 bits, and the model is then fine-tuned.

Hessian-matrix-based mixed-precision quantization: the information in the Hessian matrix of each layer's parameters is used to determine that layer's quantization precision. Layers whose Hessian matrices have larger eigenvalues are quantized at higher precision, and vice versa. The model is fine-tuned after quantization.

The lowest-level quantization scheme in both methods above is linear quantization. That is, each tensor to be quantized separately undergoes linear quantization: first the maximum and minimum of the parameters in the tensor are found, then this range is divided into equal parts; to quantize to n bits, it is divided into 2^n parts, i.e., 2^n classes. The mean of all parameters belonging to each class is taken as the center value of that class, and each parameter is replaced by the center value of the class it belongs to. The tensor is thus replaced by a tensor storing the center value of each class and a tensor storing the class index of each parameter.
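The linear scheme just described can be sketched in a few lines of NumPy (an illustrative sketch, not the implementation of the cited prior-art works; the function name is ours):

```python
import numpy as np

def linear_quantize(w, n_bits):
    """Linear quantization as described above: split [min, max] into 2**n_bits
    equal-width bins, then replace every parameter by the mean of its bin.
    Assumes the tensor is not constant (max > min)."""
    w = np.asarray(w, dtype=np.float64)
    n_bins = 2 ** n_bits
    lo, hi = w.min(), w.max()
    width = (hi - lo) / n_bins
    # Bin index of every parameter (the maximum value is clipped into the last bin).
    idx = np.minimum(((w - lo) / width).astype(int), n_bins - 1)
    # Center of each bin = mean of the parameters assigned to it.
    centers = np.array([w[idx == b].mean() if np.any(idx == b) else 0.0
                        for b in range(n_bins)])
    return idx, centers, centers[idx]  # indices, codebook, dequantized tensor

w = np.array([0.0, 0.2, 0.1, 0.9, 1.0, 0.8])
idx, centers, w_q = linear_quantize(w, n_bits=1)  # 1-bit: two classes
```

Note that the bin boundaries depend only on the range [min, max], not on where the parameters actually concentrate, which is exactly the weakness the defects below point at.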

During the process of implementing the present application, the inventors found that the existing solutions have at least the following defects:

The compression effect of linear quantization is not very good: the performance of the quantized model degrades considerably at low precision, so the model cannot be compressed to very low bit-widths.

Linear quantization is not a good clustering method: the quantized vector does not represent the parameter distribution of the original vector well.

Summary of the Invention

Embodiments of the present invention provide a pre-trained language model quantization method and apparatus, which are used to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a pre-trained language model quantization method, including: fine-tuning the pre-trained language model on a downstream task for the first time; using k-means clustering to cluster the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum in the compressed target model; and fine-tuning the quantized model a second time on the downstream task while maintaining quantization, finally obtaining the quantized network.

In a second aspect, an embodiment of the present invention provides a pre-trained language model quantization apparatus, including: a first fine-tuning module configured to fine-tune the pre-trained language model on a downstream task for the first time; a clustering compression module configured to use k-means clustering to cluster the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum in the compressed target model; and a second fine-tuning module configured to fine-tune the quantized model a second time on the downstream task while maintaining quantization, finally obtaining the quantized network.

In a third aspect, an electronic device is provided, including: at least one processor, and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the pre-trained language model quantization method of any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the steps of the pre-trained language model quantization method of any embodiment of the present invention.

The solution provided by the method and apparatus of the present application shows that the effect of improving the underlying quantization scheme on quantization performance has been greatly underestimated and overlooked; it also shows that simple k-means quantization without any additional tricks can achieve very good compression, indicating that the k-means compression method has great room for development and broad application prospects.

Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the following briefly introduces the drawings used in the description of the embodiments. Obviously, the drawings described below are some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a flowchart of a pre-trained language model quantization method according to an embodiment of the present invention;

FIG. 2 is an algorithm diagram of the k-means quantization of a pre-trained language model quantization method according to an embodiment of the present invention;

FIG. 3 compares, for a specific embodiment of the method, the average scores on 8 GLUE tasks of linear and k-means quantization applied to the BERT model;

FIG. 4 compares, for a specific embodiment of the method, the average scores on 8 GLUE tasks of linear and k-means quantization applied to the ALBERT model;

FIG. 5 compares, for a specific embodiment of the method, the average scores on 8 GLUE tasks of the BERT and ALBERT models with k-means quantization;

FIG. 6 compares the performance of the BERT and ALBERT models with k-means quantization for a specific embodiment of the method, where each value is the average score of the quantized model as a percentage of the full-precision model's score;

FIG. 7 is a block diagram of a pre-trained language model quantization apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

In order to make the purposes, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

Please refer to FIG. 1, which shows a flowchart of an embodiment of the pre-trained language model quantization method of the present application. The method of this embodiment can be applied to the allocation and processing of user requests; the present application is not limited in this respect.

As shown in FIG. 1, in step 101, the pre-trained language model is fine-tuned on a downstream task for the first time;

In step 102, k-means clustering is used to cluster the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, with the number of clusters set to 2^n, where n is the number of bits occupied by each datum in the compressed target model;

In step 103, the quantized model is fine-tuned a second time on the downstream task while maintaining quantization, finally obtaining the quantized network.

In this embodiment, for each selected task, the following experiments are performed in sequence: fine-tuning the pre-trained model (e.g., BERT or ALBERT) on the downstream task; quantizing the task-specific model; and fine-tuning the quantized model. The performance of the resulting model is then tested on the validation set of each selected task.

To avoid the influence of other tricks, we apply only two quantization schemes (linear and k-means) following a fixed-precision quantization strategy, without using any additional techniques. We quantize all weights of the embedding layers and fully connected layers (except the classification layer). After quantization, each weight vector is represented by a corresponding cluster-index vector and a mean vector, and each parameter of the weight vector is replaced by the mean of the cluster to which it belongs.
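The storage saving of this index-plus-codebook representation can be estimated with a back-of-the-envelope sketch (illustrative only; the ~110 million parameter count is an assumption at BERT-base scale, and unquantized layers are ignored):

```python
def quantized_size_bytes(num_params, n_bits):
    """Approximate storage after quantization, per the representation above:
    one n-bit cluster index per parameter plus a float32 codebook
    holding the 2**n_bits cluster means."""
    index_bytes = num_params * n_bits / 8
    codebook_bytes = (2 ** n_bits) * 4
    return index_bytes + codebook_bytes

full_bytes = 110_000_000 * 4  # ~110M float32 parameters (assumed BERT-base scale)
ratio = quantized_size_bytes(110_000_000, 3) / full_bytes
# 3-bit indices cost roughly 3/32 of the float32 size, i.e. under 15% of the original
```

This rough arithmetic is consistent with the claim elsewhere in this document that the model can be compressed below 15% of its original size.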

After the model is quantized, we fine-tune it on the corresponding downstream task while maintaining quantization. In the forward pass, each quantized layer is reconstructed from its cluster-index vector and mean vector. In the backward pass, while the remaining parameters are updated normally, the quantized parameters are updated by training the mean vector. More specifically, the gradient of each entry of the mean vector is computed as the average of the gradients of the parameters belonging to the corresponding cluster; the mean vector is then updated by the same backpropagation method.

In some optional embodiments, using k-means clustering to cluster the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer includes: dividing the data into 2^n clusters using k-means++ initialization, and initializing 2^n means for the 2^n clusters; assigning each datum to the cluster whose mean is nearest; after all data have been assigned, updating each mean to the average of all data in its cluster; and repeating the reassignment and mean update until convergence or until a preset maximum number of iterations is reached.
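The clustering loop described above can be sketched as follows (an illustrative NumPy sketch, not the patent's implementation; the `init` argument stands in for the k-means++ seeding of the next embodiment, with a random fallback):

```python
import numpy as np

def kmeans_quantize(w, n_bits, n_iters=3, init=None, seed=0):
    """Cluster the flattened weights of one matrix into 2**n_bits groups by
    k-means, returning an index tensor plus a codebook of cluster means."""
    data = np.asarray(w, dtype=np.float64).ravel()
    k = 2 ** n_bits
    rng = np.random.default_rng(seed)
    means = (np.array(init, dtype=np.float64) if init is not None
             else rng.choice(data, size=k, replace=False))
    for _ in range(n_iters):  # at most a few rounds, as described above
        # Assign every datum to the cluster with the nearest mean.
        idx = np.abs(data[:, None] - means[None, :]).argmin(axis=1)
        for c in range(k):    # move each mean to its cluster's average
            if np.any(idx == c):
                means[c] = data[idx == c].mean()
    idx = np.abs(data[:, None] - means[None, :]).argmin(axis=1)
    return idx.reshape(np.shape(w)), means
```

Each weight matrix is thereafter stored as the n-bit index tensor plus the 2^n-entry codebook.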

In some optional embodiments, dividing the data into 2^n clusters using k-means++ initialization includes: selecting a random datum from the data as the first mean; assigning each remaining datum a probability of becoming the next mean based on its minimum distance to the existing means, and selecting the next mean according to these probabilities; and repeating the probability computation and mean selection until all 2^n means are generated.
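The seeding steps above can be sketched as follows (illustrative; assumes the data are not all identical, and uses squared distance as the standard k-means++ weighting):

```python
import numpy as np

def kmeanspp_init(data, k, seed=0):
    """k-means++ seeding as described above: the first mean is a uniformly
    random datum; each subsequent mean is drawn with probability proportional
    to the squared distance to the nearest mean chosen so far."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=np.float64).ravel()
    means = [rng.choice(data)]
    while len(means) < k:
        d2 = np.min((data[:, None] - np.array(means)[None, :]) ** 2, axis=1)
        means.append(rng.choice(data, p=d2 / d2.sum()))  # far points more likely
    return np.array(means)
```

Because faraway data are favoured, the initial means tend to land in distinct regions of the weight distribution, which helps the short (3-iteration) k-means run converge well.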

Further optionally, the preset maximum number of iterations is set to 3.

In other optional embodiments, during forward computation the quantized network restores the original weight matrix from the stored class index of each datum and the mean of each class, i.e., each datum is replaced by the mean of its class. During backward computation, the network parameters are updated by gradient descent; in particular, for a quantized weight matrix, the gradients of the elements belonging to the same class are averaged, and the result is used as the gradient of that class's mean to update each mean.
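The per-cluster gradient-averaging rule can be sketched as follows (illustrative NumPy; in a real training loop the returned values would feed the gradient-descent update of the codebook means):

```python
import numpy as np

def codebook_gradients(param_grads, idx, k):
    """Backward rule described above: the gradient of each cluster mean is the
    average of the gradients of the parameters assigned to that cluster."""
    g = np.asarray(param_grads, dtype=np.float64).ravel()
    idx = np.asarray(idx).ravel()
    return np.array([g[idx == c].mean() if np.any(idx == c) else 0.0
                     for c in range(k)])

# Two parameters in cluster 0, two in cluster 1; gradients flow to 2 means.
grads = codebook_gradients([0.2, 0.4, 1.0, -1.0], idx=[0, 0, 1, 1], k=2)
```

Only the 2^n mean values are trained for a quantized matrix, so the layer stays quantized throughout the second fine-tuning stage.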

Further optionally, the pre-trained language model is BERT (Bidirectional Encoder Representations from Transformers) or ALBERT.

The following describes some problems encountered by the inventors in the process of implementing the present invention, as well as a specific embodiment of the finalized solution, so that those skilled in the art can better understand the solution of the present application.

To improve the compression ratio of pre-trained language models or the performance of compressed models, most existing work relies on introducing additional techniques, such as variable-precision compression and group-wise compression. These methods either yield limited improvement or increase computation time by tens of times. The improvement that a better underlying quantization scheme can bring has been greatly underestimated, so few attempts have been made in this direction.

By changing the underlying quantization scheme from linear clustering to k-means clustering, we greatly improve the soundness of the grouping, enabling the pre-trained language model to be compressed to less than 15% of its original size while still retaining more than 90% of the original model's performance.

The specific steps are as follows:

1) Fine-tune the pre-trained language model on a specific downstream task;

2) Use k-means clustering to cluster the data in the weight matrices of all embedding layers and linear layers of the model except the classification layer, with the number of clusters set to 2^n (where n is the number of bits occupied by each datum in the compressed target model); use the k-means++ initialization method, with the maximum number of k-means iterations set to 3;

3) Fine-tune the quantized model again on the corresponding downstream task while maintaining quantization, finally obtaining the quantized network.

In addition, during forward computation the quantized network restores the original weight matrix from the stored class index of each datum and the mean of each class, i.e., each datum is replaced by the mean of its corresponding class. During backward computation, the network parameters are updated by gradient descent; in particular, for a quantized weight matrix, the gradients of the elements of the same class are averaged and used as the gradient of that class's mean to update each mean.

This scheme shows that the effect of improving the underlying quantization scheme on quantization performance has been greatly underestimated and overlooked; it also shows that simple k-means quantization without any additional tricks can achieve very good compression, indicating that the k-means compression method has great room for development and broad application prospects.

The following describes the inventors' process of implementing the embodiments of the present application, together with some experiments and corresponding experimental data from that process, so that those skilled in the art can better understand the technical solution of the present application.

Recently, pre-trained language models like BERT have shown excellent performance on a variety of natural language processing tasks. However, the application of these models is limited by the huge space they require. A widely studied and effective way to reduce network size is quantization. However, most work focusing on BERT quantization uses the relatively rudimentary linear clustering method as the quantization scheme, and few works attempt to improve the scheme itself, which greatly limits quantization performance. In this paper, we implement k-means quantization and compare its performance with linear quantization for fixed-precision quantization of BERT. Through this comparison, we verify that the performance gain from improving the underlying quantization scheme has been greatly underestimated, and that k-means quantization has great development potential. In addition, we compare the performance of the two quantization schemes on the ALBERT model to explore differences in robustness to quantization across pre-trained models.

Keywords: k-means quantization, linear quantization, pre-trained language models, GLUE.

1 Introduction

Pre-trained models based on the self-attention mechanism (Transformers) have recently achieved state-of-the-art performance on various natural language processing (NLP) tasks, such as sequence labeling and sentence classification. Among them, the BERT model, based on the Transformer architecture, has attracted particular attention for its excellent performance and generality. However, the memory and computational consumption of these models is prohibitive: even a relatively small version of BERT (e.g., the BERT-base model) contains over 100 million parameters. This over-parameterization makes it challenging to deploy BERT models on resource-constrained devices such as smartphones and robots. Compressing these models is therefore an important requirement of the industry.

A popular and effective method for model compression is quantization. To reduce model size, quantization represents the model's parameters with fewer bits than the original 32. With appropriate hardware, quantization can greatly reduce the memory footprint while speeding up computation. Many works in computer vision focus on quantized models, while much less has been done in NLP. Pilot work on Transformer quantization successfully quantized Transformer models to 8 or 4 bits while maintaining comparable performance. To our knowledge, however, there are only two published works on BERT quantization. One applies 8-bit fixed-precision linear quantization to the BERT model and achieves a 4x compression ratio with little accuracy loss. The other improves quantization performance through group-wise mixed-precision linear quantization based on the Hessian matrices of the parameter tensors.

However, as the underlying quantization scheme, most of the Transformer quantization work above, especially the BERT quantization work, uses linear clustering, a dominant but crude clustering method. Although it is fast and easy to apply, its quantization results do not represent the original data distribution well. While some BERT quantization work achieves higher compression ratios without improving the quantization scheme, the group-wise quantization methods developed there are rather time-consuming and significantly increase latency. Although it is recognized that replacing linear clustering with a better clustering method can improve the performance of quantized models, the effect of improving the quantization scheme has been underestimated. In this paper, we therefore explore the effect of simply upgrading the quantization scheme from linear clustering to k-means clustering and compare the performance of the two schemes. In addition, to examine the effect on other pre-trained language models, we also compare the two quantization schemes on ALBERT, an improved variant of BERT.

Overall, we apply k-means and linear quantization to BERT and ALBERT and test their performance on the GLUE task set. In this way, we verify that a simple improvement of the quantization scheme can lead to a great improvement in performance, and that simple k-means clustering has great potential as a BERT quantization scheme. Furthermore, we show that the number of k-means iterations plays an important role in k-means quantization. Through further comparison, we find that ALBERT is less robust to quantization than BERT, because parameter sharing reduces parameter redundancy.

2 Background: BERT and ALBERT

在本节中,我们简要介绍BERT和ALBERT模型的体系结构,并指出我们在实验中使用的模型的版本。In this section, we briefly describe the architecture of the BERT and ALBERT models and indicate the version of the model we used in our experiments.

2.1 BERT2.1 BERT

BERT模型是一种特殊的基于Transformer的预训练网络。它们主要由嵌入层,编码器块和输出层组成。BERT模型中没有解码器块。每个编码器块包含一个自注意力层(包括与查询,键和值对应的三个并行线性层)和3个前馈层(每个包含一个线性层)。The BERT model is a special Transformer-based pretrained network. They mainly consist of an embedding layer, an encoder block and an output layer. There is no decoder block in the BERT model. Each encoder block contains one self-attention layer (including three parallel linear layers corresponding to query, key, and value) and 3 feedforward layers (each containing one linear layer).

对于每个自注意力层，BERT利用多头技术进一步提高其性能。对于每个自注意力头，存在3个权重矩阵Wq、Wk和Wv，其中Wq, Wk, Wv ∈ R^(dmodel×(dmodel/h))（h是每个自注意力层中的头数）。令X ∈ R^(n×dmodel)表示相应的自注意力层的输入。因此，自注意力头的输出计算如下：For each self-attention layer, BERT utilizes a multi-head technique to further improve its performance. For each self-attention head, there are 3 weight matrices Wq, Wk and Wv, where Wq, Wk, Wv ∈ R^(dmodel×(dmodel/h)) (h is the number of heads in each self-attention layer). Let X ∈ R^(n×dmodel) denote the input of the corresponding self-attention layer. The output of a self-attention head is then computed as:

head(X) = softmax((X·Wq)(X·Wk)^T / √(dmodel/h)) · (X·Wv)

然后,对于每个自注意力层,将其所有自注意力头的输出顺序连接起来,以生成相应层的输出。Then, for each self-attention layer, the outputs of all its self-attention heads are sequentially concatenated to generate the output of the corresponding layer.

具体来说,在我们的工作中,我们使用BERT模型的bert-base-uncased版本来进行以下实验,该模型具有12个编码器块,每个自注意力层有12个自注意力头。Specifically, in our work, we perform the following experiments using a bert-base-uncased version of the BERT model with 12 encoder blocks and 12 self-attention heads per self-attention layer.

2.2 ALBERT2.2 ALBERT

与BERT相比,ALBERT做出了三项主要改进。首先,ALBERT模型将嵌入层的参数分解为两个较小矩阵的乘积。其次,他们采用跨层参数共享来提高参数效率。这两项改进可以显著减少参数总数,并使模型更有效。此外,参数共享还可以稳定网络参数。第三,他们在预训练时用句子顺序预测(SOP,sentence-order prediction)损失代替了下一句预测(NSP,next-sentence prediction)损失。这使得模型专注于建模句子间的连贯性,而不是主题预测,并提高了多句子编码任务的性能。ALBERT makes three major improvements compared to BERT. First, the ALBERT model decomposes the parameters of the embedding layer into the product of two smaller matrices. Second, they employ cross-layer parameter sharing to improve parameter efficiency. These two improvements can significantly reduce the total number of parameters and make the model more efficient. In addition, parameter sharing can also stabilize network parameters. Third, they replaced the next-sentence prediction (NSP, next-sentence prediction) loss with sentence-order prediction (SOP, sentence-order prediction) loss during pre-training. This allows the model to focus on modeling inter-sentence coherence rather than topic prediction, and improves performance on multi-sentence encoding tasks.

具体来说，在本文中，我们使用ALBERT模型的albert-base-v2版本，该版本同样具有12个编码器块（所有参数在层之间共享），每个自注意力层有12个自注意力头。Specifically, in this paper, we use the albert-base-v2 version of the ALBERT model, which also has 12 encoder blocks (all parameters are shared between layers) and 12 self-attention heads per self-attention layer.

3方法论3 Methodology

在本节中,我们首先介绍实验中的量化过程,然后解释我们详细使用的两种量化方案。In this section, we first introduce the quantification process in our experiments and then explain the two quantification schemes we use in detail.

3.1 概述3.1 Overview

为了在基于Transformer的预训练模型上比较线性和k均值量化方案,我们测试了量化模型在不同下游任务上的性能。具体来说,对于每个选定的任务,将依次进行以下实验:在下游任务上对预训练模型(BERT和ALBERT)进行微调;量化特定任务模型;微调量化模型。然后,在每个选定任务的验证集上测试所得模型的性能。To compare linear and k-means quantization schemes on Transformer-based pretrained models, we test the performance of the quantized models on different downstream tasks. Specifically, for each selected task, the following experiments will be performed in sequence: fine-tuning the pre-trained models (BERT and ALBERT) on downstream tasks; quantizing task-specific models; fine-tuning quantized models. Then, the performance of the resulting model is tested on the validation set for each selected task.

为了避免其他技巧的影响，我们仅遵循固定精度量化策略应用两种量化方案（线性和k均值），而没有使用任何技巧。我们量化嵌入层和全连接层（分类层除外）的所有权重。每个权重向量在量化之后将由对应的簇索引向量和均值向量表示，并且权重向量的每个参数都将被其所属簇的均值代替。To avoid the influence of other tricks, we apply only the two quantization schemes (linear and k-means) following a fixed-precision quantization strategy, without using any tricks. We quantize all weights of the embedding layers and fully-connected layers (except the classification layer). After quantization, each weight vector is represented by a corresponding cluster index vector and mean vector, and each parameter of the weight vector is replaced by the mean of the cluster to which it belongs.
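（以下为一个说明性的示意，非原文内容：按上述表示方式，一个含N个float32参数的权重向量量化为k位后，需要存储N个k位的簇索引和2^k个float32均值，可据此估算压缩比。An illustrative sketch, not part of the original text: with the representation described above, a weight vector of N float32 parameters quantized to k bits stores N k-bit cluster indices plus 2^k float32 means, from which the compression ratio can be estimated.）

```python
def compressed_bits(n_params: int, k: int, float_bits: int = 32) -> int:
    """Approximate storage of one quantized weight vector:
    n_params k-bit cluster indices plus a codebook of 2**k float means."""
    return n_params * k + (2 ** k) * float_bits

full_bits = 1_000_000 * 32                  # full-precision float32 storage
quant_bits = compressed_bits(1_000_000, 3)  # 3-bit quantization
ratio = full_bits / quant_bits              # close to 32/3, i.e. roughly 10.7x
```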

模型量化后，我们在保持量化的同时在相应的下游任务上对其进行了微调。对于前向传递，我们通过聚类索引向量和均值向量重构每个量化层。对于后向传递，在正常更新其余参数的同时，我们通过训练均值向量来更新量化参数。更具体地，均值向量中每个参数的梯度被计算为属于对应簇的参数的梯度的平均值。然后，通过相同的反向传播方法更新均值向量。After the model is quantized, we fine-tune it on the corresponding downstream task while maintaining the quantization. For the forward pass, we reconstruct each quantized layer from its cluster index vector and mean vector. For the backward pass, we update the quantized parameters by training the mean vector while updating the remaining parameters normally. More specifically, the gradient of each entry of the mean vector is computed as the average of the gradients of the parameters belonging to the corresponding cluster. Then, the mean vector is updated by the same backpropagation method.
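（下面给出上述“前向重构、按簇平均梯度更新均值向量”过程的一个最小numpy示意，非原文内容，学习率等数值均为假设。A minimal NumPy sketch of the forward reconstruction and cluster-averaged gradient update described above; illustrative only, with an assumed learning rate.）

```python
import numpy as np

def reconstruct(idx, means):
    """Forward pass: rebuild the full weight vector from the cluster
    index vector and the mean (codebook) vector."""
    return means[idx]

def update_means(idx, means, weight_grad, lr):
    """Backward pass: the gradient of each cluster mean is the average of
    the gradients of the parameters assigned to that cluster, and the mean
    vector is then updated by ordinary gradient descent."""
    new_means = means.copy()
    for j in range(len(means)):
        members = idx == j
        if members.any():
            new_means[j] -= lr * weight_grad[members].mean()
    return new_means

idx = np.array([0, 0, 1, 1])               # cluster index vector
means = np.array([-0.5, 0.5])              # mean (codebook) vector
w = reconstruct(idx, means)                # quantized weights used in forward
grad = np.array([0.2, 0.4, -0.1, -0.3])    # gradient w.r.t. each weight
means = update_means(idx, means, grad, lr=0.1)
```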

3.2 线性量化3.2 Linear Quantization

假设我们需要将向量v量化为k位（k位量化）。我们首先搜索其最小值vmin和最大值vmax。然后将范围[vmin, vmax]均分为2^k个簇：Suppose we need to quantize the vector v to k bits (k-bit quantization). We first search for its minimum value vmin and maximum value vmax. The range [vmin, vmax] is then evenly divided into 2^k clusters:

c_j = [vmin + j·(vmax−vmin)/2^k, vmin + (j+1)·(vmax−vmin)/2^k)，j = 0, 1, …, 2^k−1

将函数Q^定义为Define the function Q^ as

Q^(v_i) = min(⌊(v_i − vmin)/(vmax − vmin) · 2^k⌋, 2^k−1)

其值在0到2^k−1之间。这样每个参数v_i都属于第Q^(v_i)个簇。v_i将被第Q^(v_i)个簇的均值代替，即属于该簇的所有参数的平均值。因此，量化函数为：Its value lies between 0 and 2^k−1. Thus each parameter v_i belongs to the Q^(v_i)-th cluster. v_i will be replaced by the mean of the Q^(v_i)-th cluster, i.e. the average of all parameters belonging to it. Therefore, the quantization function is:

Q(v_i) = Σ_{j=0}^{2^k−1} 1{Q^(v_i) = j} · μ_j，其中μ_j为第j个簇中所有参数的平均值。where μ_j is the mean of all parameters in the j-th cluster.

其中,当大括号中语句为真时1{statement}等于1,否则为0。where 1{statement} is equal to 1 when the statement in the braces is true, and 0 otherwise.
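（上述线性量化过程可以用如下numpy代码示意实现，假设vmax > vmin；这只是说明性代码，并非原文的实现。The linear quantization procedure above can be sketched in NumPy as follows, assuming vmax > vmin; this is illustrative code, not the original implementation.）

```python
import numpy as np

def linear_quantize(v, k):
    """Linear (uniform) k-bit quantization of a weight vector:
    split [v_min, v_max] into 2**k equal-width clusters, then replace each
    parameter by the mean of its cluster.
    Returns (index_vector, mean_vector). Assumes v_max > v_min."""
    v = np.asarray(v, dtype=np.float64)
    n_clusters = 2 ** k
    v_min, v_max = v.min(), v.max()
    scale = (v_max - v_min) / n_clusters
    # Map each parameter to a cluster index in [0, 2**k - 1].
    idx = np.floor((v - v_min) / scale).astype(int)
    idx = np.clip(idx, 0, n_clusters - 1)  # v_max would otherwise map to 2**k
    # Each cluster's mean; empty clusters keep a zero placeholder.
    means = np.zeros(n_clusters)
    for j in range(n_clusters):
        members = v[idx == j]
        if members.size:
            means[j] = members.mean()
    return idx, means

v = np.array([-1.0, -0.4, 0.1, 0.3, 0.9, 1.0])
idx, means = linear_quantize(v, k=2)   # 2-bit quantization: 4 clusters
v_quantized = means[idx]               # reconstructed (dequantized) weights
```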

3.3 K均值量化3.3 K-Means Quantization

假设我们需要将向量v量化为k位（k位量化）。对于k均值量化，我们利用k均值聚类和k-means++初始化将向量v划分为2^k个簇。Suppose we need to quantize the vector v to k bits (k-bit quantization). For k-means quantization, we utilize k-means clustering with k-means++ initialization to divide the vector v into 2^k clusters.

我们首先使用k-means++初始化方法为2^k个簇(c_1, c_2, …, c_{2^k})初始化均值(μ_1, μ_2, …, μ_{2^k})。然后，将每个参数v_i分类到离它最近的簇。对v中的所有参数分类后，各均值分别更新为属于对应簇的所有参数的平均值。然后，重复重新分类参数并更新均值，直到满足收敛条件或达到最大迭代轮数。此外，k-means++初始化方法的过程如下：首先，从向量v中随机选择一个参数作为第一个均值；然后，根据每个参数与所有现有均值的最小距离，按比例分配其被选为下一个均值的概率，并按该概率选择下一个均值；最后，重复概率计算和均值选择，直到生成全部2^k个质心。具体算法请参考图2。We first use the k-means++ initialization method to initialize the means (μ_1, μ_2, …, μ_{2^k}) for the 2^k clusters (c_1, c_2, …, c_{2^k}). Then, each parameter v_i is assigned to its nearest cluster. After all parameters in v are assigned, each mean is updated to the average of all parameters belonging to the corresponding cluster. The reassignment and mean update are then repeated until the convergence condition is met or the maximum number of iteration rounds is reached. Furthermore, the k-means++ initialization proceeds as follows: first, a random parameter from the vector v is chosen as the first mean; then each remaining parameter is assigned a probability of being the next mean proportional to its minimum distance from all existing means, and the next mean is sampled according to these probabilities; finally, the probability computation and mean selection are repeated until all 2^k centroids are generated. Please refer to Figure 2 for the specific algorithm.

为了减少由于量化方案改进而导致的效率下降，我们将k均值聚类的最大迭代轮数设置为3。在k均值聚类完成之后，我们将所得的簇编号向量用作聚类索引向量，各簇的均值作为对应的均值向量。每个参数v_i将被其所属簇的均值替换。To reduce the efficiency drop caused by the improved quantization scheme, we set the maximum number of iteration rounds of k-means clustering to 3. After k-means clustering is completed, we use the resulting cluster-number vector as the cluster index vector, and the mean of each cluster as the corresponding mean vector. Each parameter v_i will be replaced by the mean of the cluster to which it belongs.
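（上述k均值量化（k-means++初始化、最大迭代轮数上限）可用如下numpy代码示意实现；其中按与现有均值的平方距离加权采样，是标准k-means++的做法。示意代码，非原文实现。The k-means quantization described above (k-means++ initialization, capped iteration rounds) can be sketched in NumPy as follows; sampling weighted by squared distance to existing means is the standard k-means++ practice. Illustrative code, not the original implementation.）

```python
import numpy as np

def kmeans_pp_init(v, n_clusters, rng):
    """k-means++ initialization: first centroid uniform at random, each
    subsequent one sampled with probability proportional to the squared
    distance from the nearest existing centroid."""
    means = [v[rng.integers(len(v))]]
    for _ in range(n_clusters - 1):
        d2 = np.min((v[:, None] - np.array(means)[None, :]) ** 2, axis=1)
        probs = d2 / d2.sum()
        means.append(v[rng.choice(len(v), p=probs)])
    return np.array(means)

def kmeans_quantize(v, k, max_iter=3, seed=0):
    """k-bit k-means quantization: 2**k clusters, iterations capped at
    max_iter (3 in the experiments). Returns (index_vector, mean_vector)."""
    rng = np.random.default_rng(seed)
    n_clusters = 2 ** k
    v = np.asarray(v, dtype=np.float64)
    means = kmeans_pp_init(v, n_clusters, rng)
    for _ in range(max_iter):
        # Assign each parameter to its nearest mean.
        idx = np.argmin(np.abs(v[:, None] - means[None, :]), axis=1)
        # Update each mean to the average of its members (keep empty ones).
        new_means = np.array([v[idx == j].mean() if np.any(idx == j) else means[j]
                              for j in range(n_clusters)])
        if np.allclose(new_means, means):  # convergence check
            break
        means = new_means
    idx = np.argmin(np.abs(v[:, None] - means[None, :]), axis=1)
    return idx, means

idx, means = kmeans_quantize(np.array([-1.0, -0.95, 1.0, 1.05]), k=1)
```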

4实验4 experiments

在本节中,我们首先介绍我们在实验中使用的数据集,然后解释我们在BERT和ALBERT上进行的实验的细节,最后展示我们的实验结果和相应的讨论。In this section, we first introduce the dataset we used in our experiments, then explain the details of our experiments on BERT and ALBERT, and finally present our experimental results and the corresponding discussion.

4.1 数据集4.1 Dataset

我们在通用语言理解评估（GLUE）任务集上测试了量化模型的性能。其中包含问题回答、情感分析和文本蕴涵等NLU任务。具体来说，我们利用8个任务（QNLI、CoLA、RTE、SST-2、MRPC、STS-B、MNLI和QQP）来测试不同量化方案的性能。每个任务的评估指标如下：CoLA为Matthews相关系数（mcc）；QNLI、RTE、SST-2和MNLI为正确率（acc）；MRPC和QQP为正确率（acc）和F1得分；STS-B为Pearson和Spearman相关系数（corr）。我们遵循数据集的默认划分。数据集可在此处下载：https://gluebenchmark.com/tasks。We test the performance of the quantized models on the General Language Understanding Evaluation (GLUE) task set, which contains NLU tasks such as question answering, sentiment analysis, and text entailment. Specifically, we utilize 8 tasks (QNLI, CoLA, RTE, SST-2, MRPC, STS-B, MNLI and QQP) to test the performance of different quantization schemes. The evaluation metric for each task is as follows: Matthews correlation coefficient (mcc) for CoLA; accuracy (acc) for QNLI, RTE, SST-2 and MNLI; accuracy (acc) and F1 score for MRPC and QQP; Pearson and Spearman correlation coefficients (corr) for STS-B. We follow the default split of each dataset. The datasets can be downloaded here: https://gluebenchmark.com/tasks.

4.2 实验细节4.2 Experimental Details

在量化之前，我们使用Adam优化器（初始学习率为5e-5，线性更新）在8个任务上对BERT模型的bert-base-uncased版本进行了微调。对于ALBERT模型，我们首先在QNLI、CoLA、SST-2、MNLI和QQP上微调albert-base-v2模型，然后基于MNLI的微调结果在RTE、MRPC和STS-B上进行微调。我们使用线性更新的Adam优化器来微调ALBERT，并在{1e-5, 2e-5, 3e-5, 4e-5, 5e-5}中搜索每个任务的初始学习率。Before quantization, we fine-tune the bert-base-uncased version of the BERT model on the 8 tasks using the Adam optimizer (initial learning rate 5e-5, linear schedule). For the ALBERT model, we first fine-tune the albert-base-v2 model on QNLI, CoLA, SST-2, MNLI and QQP, then fine-tune on RTE, MRPC and STS-B based on the MNLI fine-tuning results. We fine-tune ALBERT using the Adam optimizer with a linear schedule and search the initial learning rate for each task in {1e-5, 2e-5, 3e-5, 4e-5, 5e-5}.

表1.GLUE任务集上的BERT的固定精度线性量化结果。Table 1. Fixed-precision linear quantization results of BERT on the GLUE task set.

Figure BDA0002569012940000111

表2.GLUE任务集上的BERT的固定精度k均值量化结果。Table 2. Fixed-precision k-means quantization results for BERT on the GLUE task set.

Figure BDA0002569012940000112

表3.GLUE任务集上的ALBERT的固定精度线性量化结果。Table 3. Fixed-precision linear quantization results of ALBERT on the GLUE task set.

Figure BDA0002569012940000121

表4.GLUE任务集上的ALBERT的固定精度k均值量化结果。Table 4. Fixed-precision k-means quantization results for ALBERT on the GLUE task set.

Figure BDA0002569012940000122

图3示出了在BERT模型上进行线性和k均值量化的8个GLUE任务的平均得分的比较。Figure 3 shows a comparison of the mean scores on 8 GLUE tasks with linear and k-means quantization on the BERT model.

图4示出了比较ALBERT模型上线性和k均值量化的8个GLUE任务的平均得分。Figure 4 shows the average scores for 8 GLUE tasks comparing linear and k-means quantization on the ALBERT model.

量化后,我们会进一步对相应任务的量化模型进行微调。特别地,被量化的层的学习率被乘以10倍(例如,对于所有量化的BERT模型为5e-4),而其他层的学习率保持不变。After quantization, we further fine-tune the quantized model for the corresponding task. In particular, the learning rates of the quantized layers are multiplied by a factor of 10 (e.g., 5e-4 for all quantized BERT models), while the learning rates of other layers remain the same.
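（下面是按上述策略构造两组学习率的一个示意，函数名与层名前缀均为假设，结构仿照PyTorch优化器参数组的形式。Below is a sketch of building the two learning-rate groups following the strategy above; the function name and layer-name prefixes are hypothetical, and the structure mimics PyTorch-style optimizer parameter groups.）

```python
def make_param_groups(named_params, base_lr, quantized_prefixes, factor=10):
    """Split (name, param) pairs into two optimizer groups: parameters of
    quantized layers get base_lr * factor (e.g. 5e-4 when base_lr is 5e-5),
    all other parameters keep base_lr."""
    quantized, regular = [], []
    for name, p in named_params:
        if any(name.startswith(pre) for pre in quantized_prefixes):
            quantized.append(p)
        else:
            regular.append(p)
    return [{"params": quantized, "lr": base_lr * factor},
            {"params": regular, "lr": base_lr}]

# Hypothetical parameter names: quantized BERT body vs. untouched classifier.
groups = make_param_groups(
    [("bert.encoder.layer.0.weight", "w0"), ("classifier.weight", "w1")],
    base_lr=5e-5, quantized_prefixes=("bert.",))
```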

4.3 实验结果与讨论4.3 Experimental results and discussion

我们主要关注1-5位固定精度量化。表1和表2分别显示了BERT的线性和k均值量化的结果,图3显示了两组实验的平均得分之间的进一步比较。类似地,ALBERT的结果和比较分别在表3,表4和图4中显示。We mainly focus on 1-5 bit fixed precision quantization. Tables 1 and 2 show the results of linear and k-means quantification of BERT, respectively, and Figure 3 shows a further comparison between the mean scores of the two sets of experiments. Similarly, the results and comparisons of ALBERT are shown in Table 3, Table 4 and Figure 4, respectively.

4.3.1 BERT4.3.1 BERT

量化方案改进带来的提升。如表1、表2和图3所示，尽管无论采用哪种量化方案，模型在位数较低时性能都较差，但在使用相同位数时，在所有8个任务及其平均值上，用k均值量化的模型的性能都明显好于使用线性量化的模型。从8个任务的平均性能上看，仅通过将量化方案从线性量化改进为k均值量化，我们就能将1-5位量化相对于全精度的性能下降分别从（38.8%、34.7%、27.6%、17.1%、4.8%）降低到（28.6%、3.94%、0.9%、0.3%、-0.2%）。结果表明，仅通过改进量化方案就可以实现很大的性能提升，这说明量化方案的改进空间被大大低估了。为了进一步说明这一点，我们使用分组线性量化方案重复了几组实验，该方案是对线性量化的改进，性能高于简单线性量化。结果显示在表5中。与分组线性量化相比，简单的k均值量化可实现更高或相当的性能，同时节省大量时间。Improvement brought by upgrading the quantization scheme. As shown in Table 1, Table 2 and Figure 3, although the model performs worse at lower bit-widths regardless of the quantization scheme, at the same bit-width the models quantized with k-means perform significantly better than those using linear quantization on all 8 tasks and their average. In terms of the average performance over the 8 tasks, simply by improving the quantization scheme from linear to k-means, the performance drop of 1-5 bit quantization relative to full precision falls from (38.8%, 34.7%, 27.6%, 17.1%, 4.8%) to (28.6%, 3.94%, 0.9%, 0.3%, -0.2%), respectively. The results show that a large performance improvement can be achieved by improving the quantization scheme alone, suggesting that the room for improvement of the quantization scheme has been greatly underestimated. To further illustrate this, we repeated several experiments using a grouped linear quantization scheme, which is an improvement on linear quantization and achieves higher performance than simple linear quantization. The results are shown in Table 5. Compared with grouped linear quantization, simple k-means quantization achieves higher or comparable performance while saving a lot of time.

k均值量化的潜力。如表2所示，仅使用固定精度策略的k均值量化即可很好地压缩模型，并且即使在某些特别低的量化位数下，量化模型仍能保持较好的性能。例如，在任务RTE上，使用k均值量化将模型量化为3位只会导致2.16%的性能下降。对于大多数任务，包括QNLI、SST-2、MRPC、STS-B、MNLI和QQP，量化模型的性能仅在压缩为1位时才有显著下降。值得注意的是，这些结果是通过简单的k均值量化实现的，且最大迭代轮数仅为3，没有使用任何其他技巧，这表明k均值量化具有巨大的发展潜力。The potential of k-means quantization. As shown in Table 2, the model can be compressed well simply using k-means quantization with a fixed-precision strategy, and the quantized model still performs well even at some particularly low quantization bit-widths. For example, on the task RTE, quantizing the model to 3 bits with k-means quantization leads to only a 2.16% performance drop. For most tasks, including QNLI, SST-2, MRPC, STS-B, MNLI and QQP, the performance of the quantized model drops significantly only when compressed to 1 bit. It is worth noting that these results are achieved with simple k-means quantization, with a maximum of only 3 iteration rounds and without any other tricks, suggesting that k-means quantization has great potential for development.

表5. BERT上的k均值量化和分组线性量化之间的比较。最右列是在RTE和MRPC上k均值量化相对于分组线性量化的平均耗时。（在分组量化中，每个矩阵被划分为不同的组，并分别对每个组进行量化。对于前向传递，模型需要为每一层分别重构每个量化组，而不是直接重构每个量化层的整个权重矩阵。这就解释了为什么分组量化非常耗时。具体而言，在我们的分组量化实验中，我们将每个矩阵划分为128个组。）Table 5. Comparison between k-means quantization and grouped linear quantization on BERT. The rightmost column is the average elapsed time of k-means quantization relative to grouped linear quantization on RTE and MRPC. (In grouped quantization, each matrix is divided into different groups, and each group is quantized separately. For the forward pass, the model needs to reconstruct each quantized group of each layer separately, rather than directly reconstructing the entire weight matrix of each quantized layer. This explains why grouped quantization is very time-consuming. Specifically, in our grouped quantization experiments, we divide each matrix into 128 groups.)

Figure BDA0002569012940000141

4.3.2 ALBERT4.3.2 ALBERT

一般来说,从BERT实验得出的两个主要结论仍然成立。如表3,表4和图4所示,我们还可以看到量化方案改进带来的巨大改进以及k均值量化的巨大潜力。但是,有些异常结果值得讨论。In general, two main conclusions from BERT experiments still hold. As shown in Table 3, Table 4 and Figure 4, we can also see the huge improvement brought by the improvement of the quantization scheme and the huge potential of k-means quantization. However, some unusual results are worth discussing.

k均值迭代轮数的影响。第一组异常结果来自QNLI、MRPC和STS-B的1位量化。尽管k均值量化的结果通常优于线性量化，但这三组结果并不符合这一规律。我们认为这是因为参数的分布过于复杂，使k均值仅通过3轮迭代无法得到很好的聚类结果。为了验证这一推测并进一步探讨迭代轮数的影响，我们对这些异常结果进行了重复实验，同时将最大迭代轮数扩大到5、10和20。表6中显示了相应的结果。迭代轮数越多，k均值量化的效果越好，并最终超过了线性量化的结果。但是，过拟合的问题仍然存在：当最大迭代轮数从10增加到20时，QNLI和STS-B的量化性能均出现明显下降。因此，在k均值量化中，k均值最大迭代轮数也是很重要的、需要仔细搜索的超参数。The effect of the number of k-means iteration rounds. The first set of anomalous results comes from the 1-bit quantization of QNLI, MRPC and STS-B. Although the results of k-means quantization are generally better than linear quantization, these three sets of results do not follow this pattern. We believe this is because the distribution of the parameters is too complex for k-means to obtain good clustering results with only 3 iteration rounds. To verify this conjecture and further explore the effect of the number of iteration rounds, we repeated the experiments for these anomalous results while expanding the maximum number of iteration rounds to 5, 10 and 20. The corresponding results are shown in Table 6. The more iteration rounds, the better k-means quantization works, eventually surpassing the results of linear quantization. However, the problem of overfitting still exists: when the maximum number of iteration rounds is increased from 10 to 20, the quantization performance of both QNLI and STS-B drops significantly. Therefore, in k-means quantization, the maximum number of k-means iteration rounds is also an important hyperparameter that needs to be searched carefully.

表6.ALBERT上k均值最大迭代轮数不同时的1位量化性能。Table 6. 1-bit quantization performance with different maximum number of iteration rounds for k-means on ALBERT.

Figure BDA0002569012940000142

图5示出了k均值量化的BERT和ALBERT模型的8个GLUE任务的平均得分的比较。Figure 5 shows a comparison of the mean scores for 8 GLUE tasks for the k-means quantized BERT and ALBERT models.

图6示出了k均值量化的BERT和ALBERT模型的性能比较。每个值是指量化模型的平均分数与全精度模型的分数相比的百分比。Figure 6 shows the performance comparison of the k-means quantized BERT and ALBERT models. Each value refers to the percentage of the quantized model's average score compared to the full-precision model's score.

CoLA的0和MRPC的68.4。另一组异常结果是来自CoLA和MRPC的线性量化,它们是二分类任务。我们发现,经过微调后,量化模型始终会输出“1”。0和68.4仅由验证集上的数据分布决定。换句话说,在通过线性量化将模型量化为1-5位之后,该模型几乎失效,很难在这两个任务上进行训练。此外,我们进一步在两个任务上将模型量化到较高的位数进行了实验,发现量化模型的表现从量化到6位开始便不再是0和68.4。0 for CoLA and 68.4 for MRPC. Another set of anomalous results are linear quantifications from CoLA and MRPC, which are binary classification tasks. We found that after fine-tuning, the quantized model always outputs "1". 0 and 68.4 are only determined by the data distribution on the validation set. In other words, after quantizing the model to 1-5 bits by linear quantization, the model almost fails and it is difficult to train on both tasks. In addition, we further experimented with quantizing the model to higher bits on two tasks, and found that the performance of the quantized model is no longer 0 and 68.4 from quantization to 6 bits.

BERT和ALBERT之间的比较。此外，我们比较了BERT和ALBERT的k均值量化的性能，结果如图5和图6所示。与BERT在k均值2位量化后仍保持其原始性能的96.1%相比，经过k均值4位和3位量化后，ALBERT的性能就已经分别下降至93.4%和72.5%。因此就量化而言，ALBERT的鲁棒性较差（在我们的工作中，对量化的鲁棒性意味着在保持高性能的同时量化至较低位数的能力）。考虑到ALBERT相对于BERT的主要改进是参数共享，而量化也可以视为层内的参数共享，我们推测参数共享和量化具有相似的效果，这意味着通过参数共享和量化去除的冗余信息有部分重叠。考虑到在参数共享之后，与BERT相比，ALBERT去除了大量冗余信息（参数总数从108M下降到12M），在ALBERT上进一步应用量化将很容易损坏有用信息，于是导致了ALBERT对量化的鲁棒性较差。但是，从另一个角度来看，参数共享已经大大减少了参数数量，因此也可以被视为一种模型压缩方法。此外，考虑到全精度ALBERT的性能优于在GPU中占用相似内存的4位和3位BERT模型，参数共享与不使用任何技巧的量化相比，甚至可以获得更好的压缩效果。但是，作为一种压缩方法，参数共享具有不可忽略的缺点：它只能减少内存消耗，而大多数其他压缩方法可以同时减少内存消耗和计算消耗（即时间开销）。Comparison between BERT and ALBERT. Furthermore, we compare the performance of k-means quantization on BERT and ALBERT; the results are shown in Figures 5 and 6. Whereas BERT still maintains 96.1% of its original performance after k-means 2-bit quantization, ALBERT's performance already drops to 93.4% and 72.5% after k-means 4-bit and 3-bit quantization, respectively. ALBERT is therefore less robust in terms of quantization (in our work, robustness to quantization means the ability to quantize to a lower bit-width while maintaining high performance). Considering that the main improvement of ALBERT over BERT is parameter sharing, and that quantization can also be regarded as intra-layer parameter sharing, we speculate that parameter sharing and quantization have similar effects, which means that the redundant information removed by parameter sharing and by quantization partially overlaps. Considering that, after parameter sharing, ALBERT has removed a large amount of redundant information compared with BERT (the total number of parameters drops from 108M to 12M), further applying quantization on ALBERT can easily damage useful information, which makes ALBERT less robust to quantization. However, from another point of view, parameter sharing has greatly reduced the number of parameters, so it can also be regarded as a model compression method. Furthermore, given that full-precision ALBERT outperforms the 4-bit and 3-bit BERT models that occupy similar GPU memory, parameter sharing can even achieve better compression performance than quantization without any tricks. However, as a compression method, parameter sharing has a non-negligible disadvantage: it can only reduce memory consumption, while most other compression methods can reduce both memory consumption and computational consumption (i.e., time overhead).

5结论5 Conclusion

在本文中，我们在BERT和ALBERT模型上比较了k均值量化和线性量化，并得到了三个主要结论。首先，我们发现用k均值量化的模型的性能明显优于使用线性量化的模型。只需改进底层的量化方案，即可实现巨大的性能提升。其次，即使在使用简单的固定精度压缩策略且没有任何其他技巧的情况下，使用k均值量化也可以将模型压缩到相对较低的位数且保持较高的性能。这表明k均值量化拥有巨大发展潜力。第三，k均值的迭代轮数在量化模型的性能中起着重要作用，应谨慎确定。此外，通过比较BERT和ALBERT的k均值量化结果，我们发现ALBERT对于量化的鲁棒性不如BERT。这表明参数共享和量化具有一些相似的效果。因此，在已应用了广泛参数共享的模型上进一步应用量化将较容易损坏有用信息，从而导致性能显著下降。In this paper, we compare k-means quantization and linear quantization on the BERT and ALBERT models and draw three main conclusions. First, we find that models quantized with k-means perform significantly better than models using linear quantization. A huge performance gain can be achieved simply by improving the underlying quantization scheme. Second, even when using a simple fixed-precision compression strategy without any other tricks, k-means quantization can compress the model to a relatively low bit-width while maintaining high performance. This shows that k-means quantization has great potential for development. Third, the number of k-means iteration rounds plays an important role in the performance of the quantized model and should be determined carefully. Furthermore, by comparing the k-means quantization results of BERT and ALBERT, we find that ALBERT is less robust to quantization than BERT. This suggests that parameter sharing and quantization have some similar effects. Therefore, further applying quantization on models that already employ extensive parameter sharing can more easily damage useful information, resulting in significant performance degradation.

请参考图7,其示出了本发明一实施例提供的一种预训练语言模型量化装置的框图。Please refer to FIG. 7 , which shows a block diagram of a pre-trained language model quantization apparatus according to an embodiment of the present invention.

如图7所示,预训练语言模型量化装置700,包括第一次微调模块710、聚类压缩模块720和第二次微调模块730。As shown in FIG. 7 , the pre-trained language model quantization apparatus 700 includes a first fine-tuning module 710 , a clustering compression module 720 and a second fine-tuning module 730 .

其中，第一次微调模块710，配置为将预训练语言模型在下游任务上进行第一次微调；聚类压缩模块720，配置为使用k均值聚类，对微调后的模型除分类层之外的所有嵌入层和所有线性层的权重矩阵中的数据进行聚类，将类别数设置为2^n，其中，n为压缩后目标模型每个数据所占的比特数；以及第二次微调模块730，配置为将量化后的模型在所述下游任务上在维持量化的条件下进行第二次微调，最终得到量化后的网络。The first fine-tuning module 710 is configured to fine-tune the pre-trained language model on a downstream task for the first time; the clustering compression module 720 is configured to use k-means clustering to cluster the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, with the number of categories set to 2^n, where n is the number of bits occupied by each datum in the compressed target model; and the second fine-tuning module 730 is configured to perform a second fine-tuning of the quantized model on the downstream task while maintaining quantization, finally obtaining the quantized network.

在一些可选的实施例中，上述聚类压缩模块720进一步配置为：利用k-means++初始化将所述数据划分为2^n个簇，并为所述2^n个簇初始化2^n个均值；根据每个数据与各均值的距离，将所述每个数据分类到最近的簇；在每个数据都分类完成后，将对应的均值更新为所在簇的所有数据的平均值；以及重复对每个数据进行重新分类并更新质心，直至满足收敛条件或达到预设迭代数。In some optional embodiments, the above clustering compression module 720 is further configured to: divide the data into 2^n clusters using k-means++ initialization, and initialize 2^n means for the 2^n clusters; classify each datum into its nearest cluster according to the distance between the datum and each mean; after all data are classified, update each mean to the average of all data in its cluster; and repeatedly reclassify each datum and update the centroids until convergence is met or a preset number of iterations is reached.

应当理解,图7中记载的诸模块与参考图1中描述的方法中的各个步骤相对应。由此,上文针对方法描述的操作和特征以及相应的技术效果同样适用于图7中的诸模块,在此不再赘述。It should be understood that the modules recited in FIG. 7 correspond to various steps in the method described with reference to FIG. 1 . Therefore, the operations and features described above with respect to the method and the corresponding technical effects are also applicable to the modules in FIG. 7 , and details are not repeated here.

值得注意的是,本申请的实施例中的模块并不用于限制本申请的方案,例如接收模块可以描述为接收语音识别请求的模块。另外,还可以通过硬件处理器来实现相关功能模块,例如接收模块也可以用处理器实现,在此不再赘述。It should be noted that the modules in the embodiments of the present application are not used to limit the solution of the present application, for example, the receiving module may be described as a module that receives a voice recognition request. In addition, the relevant functional modules may also be implemented by a hardware processor, for example, the receiving module may also be implemented by a processor, which will not be repeated here.

在另一些实施例中,本发明实施例还提供了一种非易失性计算机存储介质,计算机存储介质存储有计算机可执行指令,该计算机可执行指令可执行上述任意方法实施例中的预训练语言模型量化方法;In other embodiments, embodiments of the present invention further provide a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the pre-training in any of the foregoing method embodiments Language model quantification methods;

作为一种实施方式,本发明的非易失性计算机存储介质存储有计算机可执行指令,计算机可执行指令设置为:As an embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions, and the computer-executable instructions are set to:

将预训练语言模型在下游任务上进行第一次微调;Perform the first fine-tuning of the pretrained language model on the downstream task;

使用k均值聚类，对微调后的模型除分类层之外的所有嵌入层和所有线性层的权重矩阵中的数据进行聚类，将类别数设置为2^n，其中，n为压缩后目标模型每个数据所占的比特数；Using k-means clustering, cluster the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, and set the number of categories to 2^n, where n is the number of bits occupied by each datum in the compressed target model;

将量化后的模型在所述下游任务上在维持量化的条件下进行第二次微调,最终得到量化后的网络。The quantized model is fine-tuned a second time on the downstream task under the condition of maintaining quantization, and finally a quantized network is obtained.

非易失性计算机可读存储介质可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储根据预训练语言模型量化装置的使用所创建的数据等。此外,非易失性计算机可读存储介质可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实施例中,非易失性计算机可读存储介质可选包括相对于处理器远程设置的存储器,这些远程存储器可以通过网络连接至预训练语言模型量化装置。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The non-volatile computer-readable storage medium can include a stored program area and a stored data area, wherein the stored program area can store an operating system and an application program required by at least one function; the stored data area can store a quantization device according to a pre-trained language model using the created data, etc. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-transitory computer-readable storage medium may optionally include memory located remotely from the processor, the remote memory being connectable to the pretrained language model quantization apparatus through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

本发明实施例还提供一种计算机程序产品,计算机程序产品包括存储在非易失性计算机可读存储介质上的计算机程序,计算机程序包括程序指令,当程序指令被计算机执行时,使计算机执行上述任一项预训练语言模型量化方法。An embodiment of the present invention further provides a computer program product, the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer is made to execute the above Any pretrained language model quantization method.

图8是本发明实施例提供的电子设备的结构示意图,如图8所示,该设备包括:一个或多个处理器810以及存储器820,图8中以一个处理器810为例。预训练语言模型量化方法的设备还可以包括:输入装置830和输出装置840。处理器810、存储器820、输入装置830和输出装置840可以通过总线或者其他方式连接,图8中以通过总线连接为例。存储器820为上述的非易失性计算机可读存储介质。处理器810通过运行存储在存储器820中的非易失性软件程序、指令以及模块,从而执行服务器的各种功能应用以及数据处理,即实现上述方法实施例预训练语言模型量化方法。输入装置830可接收输入的数字或字符信息,以及产生与预训练语言模型量化装置的用户设置以及功能控制有关的键信号输入。输出装置840可包括显示屏等显示设备。FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. As shown in FIG. 8 , the device includes: one or more processors 810 and a memory 820 . In FIG. 8 , one processor 810 is used as an example. The apparatus for pre-training the language model quantization method may further include: an input device 830 and an output device 840 . The processor 810, the memory 820, the input device 830, and the output device 840 may be connected through a bus or in other ways, and the connection through a bus is taken as an example in FIG. 8 . The memory 820 is the aforementioned non-volatile computer-readable storage medium. The processor 810 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory 820, that is, to implement the pre-trained language model quantization method in the above method embodiment. The input device 830 may receive input numerical or character information, and generate key signal input related to user settings and function control of the pre-trained language model quantization device. The output device 840 may include a display device such as a display screen.

上述产品可执行本发明实施例所提供的方法,具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节,可参见本发明实施例所提供的方法。The above product can execute the method provided by the embodiment of the present invention, and has corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.

As an implementation, the above electronic device is applied in a pre-trained language model quantization apparatus, and includes:

at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to:

perform a first fine-tuning of the pre-trained language model on a downstream task;

cluster, using k-means clustering, the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, with the number of clusters set to 2^n, where n is the number of bits occupied by each data element of the compressed target model;

perform a second fine-tuning of the quantized model on the downstream task while maintaining the quantization, finally obtaining the quantized network.
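The clustering stage of the pipeline above (k-means++ initialization, assignment to the nearest mean, mean update, repeated up to a small iteration cap) can be sketched as follows. This is an illustrative reconstruction under our own function names, not the patented implementation; the 2x2 weight matrix is a toy example.

```python
import numpy as np

def kmeans_pp_init(data, k, rng):
    # k-means++ initialization: pick the first mean at random, then pick
    # each next mean with probability proportional to the squared distance
    # to the nearest mean chosen so far.
    means = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        d2 = np.min((data[:, None] - np.array(means)[None, :]) ** 2, axis=1)
        means.append(data[rng.choice(len(data), p=d2 / d2.sum())])
    return np.array(means)

def quantize_weights(w, n_bits, max_iters=3, seed=0):
    # Flatten the weight matrix and cluster its entries into 2**n_bits
    # groups; max_iters=3 mirrors the preset iteration cap in the text.
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    k = 2 ** n_bits
    means = kmeans_pp_init(flat, k, rng)
    for _ in range(max_iters):
        # assign each weight to the nearest mean, then update the means
        labels = np.argmin(np.abs(flat[:, None] - means[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                means[j] = flat[labels == j].mean()
    # final assignment so labels and means are mutually consistent
    labels = np.argmin(np.abs(flat[:, None] - means[None, :]), axis=1)
    return labels.reshape(w.shape), means

w = np.array([[0.1, 0.9], [0.11, 0.88]])
labels, means = quantize_weights(w, n_bits=1)
w_q = means[labels]  # each weight replaced by the mean of its cluster
```

Only the per-weight cluster labels (n bits each) and the 2^n means need to be stored, which is the source of the compression.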

The electronic devices of the embodiments of the present application exist in various forms, including but not limited to:

(1) Mobile communication devices: these devices are characterized by mobile communication capability, with voice and data communication as the primary goal. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing capabilities, and generally also provide mobile Internet access. Such terminals include PDA, MID, and UMPC devices, e.g., iPad.

(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable in-car navigation devices.

(4) Servers: devices that provide computing services. A server comprises a processor, hard disk, memory, system bus, and the like; its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements in processing capacity, stability, reliability, security, scalability, and manageability.

(5) Other electronic devices with data interaction capability.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of their technical features; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of quantizing a pre-trained language model, comprising:
performing a first fine-tuning of the pre-trained language model on a downstream task;
clustering, using k-means clustering, the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, and setting the number of clusters to 2^n, wherein n is the number of bits occupied by each data element of the compressed target model;
performing a second fine-tuning of the quantized model on the downstream task while maintaining the quantization, finally obtaining the quantized network.
2. The method of claim 1, wherein clustering the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer using k-means clustering comprises:
partitioning the data into 2^k clusters with k-means++ initialization, and initializing 2^k means for the 2^k clusters;
classifying each data element into the nearest cluster according to its relation to each mean;
after each data element is classified, updating the corresponding mean to the average of all data in the cluster in which it is located;
repeatedly reclassifying each data element and updating the means until convergence is reached or a preset maximum number of iteration rounds is reached.
3. The method of claim 2, wherein partitioning the data into 2^k clusters with k-means++ initialization and initializing 2^k means comprises:
selecting one random data element from the data as a first mean;
assigning the remaining data elements likelihoods of being the next mean according to their minimum distance from the existing means, and selecting the next mean according to those likelihoods;
repeating the likelihood calculation and mean selection until all 2^k means are generated.
4. The method of claim 2, wherein the preset maximum number of iteration rounds is 3.
5. The method according to claim 1, wherein, when performing a forward computation, the quantized network restores the original weight matrix from the stored cluster of each data element and the mean of each cluster, that is, each data element is replaced by the mean of its corresponding cluster;
when the quantized network performs a backward computation, a gradient descent method is used to update the network parameters; specifically, for the quantized weight matrices, the gradients of the elements in the same cluster are averaged, and the average is used as the gradient of that cluster's mean to update each mean.
6. The method of any one of claims 1-5, wherein the pre-trained language model is BERT or ALBERT.
7. A pre-trained language model quantization apparatus, comprising:
a first fine-tuning module configured to perform a first fine-tuning of the pre-trained language model on a downstream task;
a clustering compression module configured to cluster, using k-means clustering, the data in the weight matrices of all embedding layers and all linear layers of the fine-tuned model except the classification layer, and to set the number of clusters to 2^n, wherein n is the number of bits occupied by each data element of the compressed target model;
a second fine-tuning module configured to perform a second fine-tuning of the quantized model on the downstream task while maintaining the quantization, finally obtaining the quantized network.
8. The apparatus of claim 7, wherein the clustering compression module is further configured to:
partition the data into 2^k clusters with k-means++ initialization, and initialize 2^k means for the 2^k clusters;
classify each data element into the nearest cluster according to its relation to each mean;
after each data element is classified, update the corresponding mean to the average of all data in the cluster in which it is located;
repeatedly reclassify each data element and update the means until convergence is reached or a preset maximum number of iteration rounds is reached.
9. A computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the pre-trained language model quantization method of any one of claims 1-6.
10. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 6.
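The forward restoration and gradient averaging described in claim 5 can be sketched as follows. This is an illustrative reconstruction under our own names (`restore`, `mean_gradients`, the 0.1 learning rate), not the patented implementation; the labels and means continue the toy 2x2 example.

```python
import numpy as np

def restore(labels, means):
    # Forward pass: rebuild the full weight matrix by looking up each
    # entry's cluster mean (each weight is replaced by its cluster mean).
    return means[labels]

def mean_gradients(grad_w, labels, k):
    # Backward pass: average the gradients of all weights sharing a
    # cluster; this average serves as the gradient of the cluster mean.
    return np.array([grad_w[labels == j].mean() if np.any(labels == j) else 0.0
                     for j in range(k)])

labels = np.array([[0, 1], [0, 1]])     # stored per-weight cluster indices
means = np.array([0.105, 0.89])         # stored cluster means
w_q = restore(labels, means)            # weights used in the forward pass

grad = np.array([[0.2, 0.4], [0.6, 0.8]])   # hypothetical dL/dW
g_means = mean_gradients(grad, labels, k=2)  # approx. [0.4, 0.6]
means -= 0.1 * g_means                  # one gradient-descent step on the means
```

Because only the means are updated while the labels stay fixed, the network remains quantized throughout the second fine-tuning.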
CN202010636126.3A 2020-07-03 2020-07-03 Pre-training language model quantization method and device Active CN111814448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010636126.3A CN111814448B (en) 2020-07-03 2020-07-03 Pre-training language model quantization method and device

Publications (2)

Publication Number Publication Date
CN111814448A true CN111814448A (en) 2020-10-23
CN111814448B CN111814448B (en) 2024-01-16

Family

ID=72856262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010636126.3A Active CN111814448B (en) 2020-07-03 2020-07-03 Pre-training language model quantization method and device

Country Status (1)

Country Link
CN (1) CN111814448B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180107926A1 (en) * 2016-10-19 2018-04-19 Samsung Electronics Co., Ltd. Method and apparatus for neural network quantization
CN106897734A (en) * 2017-01-12 2017-06-27 南京大学 K average clusters fixed point quantization method heterogeneous in layer based on depth convolutional neural networks
CN107944553A (en) * 2017-11-22 2018-04-20 浙江大华技术股份有限公司 A kind of method for trimming and device of CNN models
CN108415888A (en) * 2018-02-12 2018-08-17 苏州思必驰信息科技有限公司 Compression method and system for neural network language model
CN110363281A (en) * 2019-06-06 2019-10-22 上海交通大学 A convolutional neural network quantization method, device, computer and storage medium
CN110377686A (en) * 2019-07-04 2019-10-25 浙江大学 A kind of address information Feature Extraction Method based on deep neural network model
CN110597986A (en) * 2019-08-16 2019-12-20 杭州微洱网络科技有限公司 Text clustering system and method based on fine tuning characteristics
CN110489555A (en) * 2019-08-21 2019-11-22 创新工场(广州)人工智能研究有限公司 A kind of language model pre-training method of combination class word information
CN111340186A (en) * 2020-02-17 2020-06-26 之江实验室 Compressed representation learning method based on tensor decomposition

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEXEI BAEVSKI et al.: "EFFECTIVENESS OF SELF-SUPERVISED PRE-TRAINING FOR ASR", IEEE Xplore *
CAO Xuyou; ZHOU Zhiping; WANG Li; ZHAO Weidong: "A Yangtze River Delta patent matching algorithm based on BERT+ATT and DBSCAN", Information Technology, no. 03 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022088444A1 (en) * 2020-11-02 2022-05-05 之江实验室 Multi-task language model-oriented meta-knowledge fine tuning method and platform
US11354499B2 (en) 2020-11-02 2022-06-07 Zhejiang Lab Meta-knowledge fine tuning method and platform for multi-task language model
GB2609768A (en) * 2020-11-02 2023-02-15 Zhejiang Lab Multi-task language model-oriented meta-knowledge fine tuning method and platform
CN114818850A (en) * 2022-03-07 2022-07-29 北京邮电大学 Clustering compression-based network flow space mapping characterization method and device and storage medium
WO2024055694A1 (en) * 2022-09-15 2024-03-21 Huawei Technologies Co., Ltd. Method and device for compressing generative pre-trained language models via quantization
WO2024140407A1 (en) * 2022-12-30 2024-07-04 中国电信股份有限公司 Fine tuning method and apparatus using pre-trained model, device, medium, and program
CN116628493A (en) * 2023-05-06 2023-08-22 中国人民大学 A lightweight method for deepening the number of pre-trained language model layers
CN116524941A (en) * 2023-05-19 2023-08-01 思必驰科技股份有限公司 Self-adaptive quantization compression method and system for voice model and electronic equipment
CN119250136A (en) * 2024-12-04 2025-01-03 中昊芯英(杭州)科技有限公司 Model optimization method and related device

Also Published As

Publication number Publication date
CN111814448B (en) 2024-01-16

Similar Documents

Publication Publication Date Title
CN111814448B (en) Pre-training language model quantization method and device
US11651286B2 (en) Method and system for distributed machine learning
WO2023138188A1 (en) Feature fusion model training method and apparatus, sample retrieval method and apparatus, and computer device
CN108415888A (en) Compression method and system for neural network language model
CN109002889B (en) Adaptive iterative convolution neural network model compression method
CN108304928A (en) Compression method based on the deep neural network for improving cluster
CN119180306A (en) Method and device for coding/decoding neural network model
WO2015089148A2 (en) Reducing dynamic range of low-rank decomposition matrices
CN108734301A (en) A kind of machine learning method and machine learning device
CN108805257A (en) A kind of neural network quantization method based on parameter norm
CN112598129A (en) Adjustable hardware-aware pruning and mapping framework based on ReRAM neural network accelerator
WO2023024749A1 (en) Video retrieval method and apparatus, device, and storage medium
CN114580281A (en) Model quantification method, apparatus, equipment, storage medium and program product
Qi et al. Learning low resource consumption cnn through pruning and quantization
CN116702858A (en) A model processing method, electronic equipment and medium
CN108874754A (en) language model compression method and system
CN111324731B (en) Computer-implemented method for embedding words of corpus
CN116976428A (en) Model training methods, devices, equipment and storage media
CN114638344A (en) A model quantization method and related device
WO2020162190A1 (en) Acoustic model learning device, acoustic model learning method, and program
CN110751274B (en) A neural network compression method and system based on random projection hashing
CN116629328A (en) Pre-training model compression method and system based on knowledge distillation and quantization
CN116644797A (en) Neural network model quantization compression method, electronic device and storage medium
CN116312607B (en) Training method, electronic device and storage medium for audio-visual speech separation model
Eshghi et al. Support vector machines with sparse binary high-dimensional feature vectors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant