CN118349814A

CN118349814A - Optimization method, computing system and storage medium for large model auxiliary verification

Info

Publication number: CN118349814A
Application number: CN202410527518.4A
Authority: CN
Inventors: 戴国浩; 许珈铭
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2024-04-29
Filing date: 2024-04-29
Publication date: 2024-07-16

Abstract

The invention discloses an optimization method, a computing system and a storage medium for large model auxiliary verification, wherein the method comprises the following steps: the first large model M ₁ generates an output one by one based on the input, the output comprising n+1 semantic units; the second large model M ₂ processes the input and output of the first large model M ₁ to obtain the output of the second large model M ₂ so as to verify the correctness of the n+1st semantic unit output by the first large model M ₁; wherein the scale of the first large model M ₁ is smaller than the scale of the second large model M ₂. The method adopts a smaller model to generate semantic units one by one, and uses a large model to verify the generated semantic unit sequence; the method can give consideration to efficiency and accuracy, and can effectively improve the parallelism and the calculation efficiency of the large language model.

Description

A large model-assisted verification optimization method, computing system and storage medium

技术领域Technical Field

本发明涉及人工智能领域，尤其涉及一种大模型辅助验证的优化方法、计算系统及存储介质。The present invention relates to the field of artificial intelligence, and in particular to an optimization method, a computing system and a storage medium for large model-assisted verification.

背景技术Background technique

生成式大语言模型是一种自回归的模型，在机器翻译、智能客服等多个领域有较好的效果。该模型分为两个阶段，分别是预填充(Prefill)和解码(decode)阶段。The generative large language model is an autoregressive model that has good results in many fields such as machine translation and intelligent customer service. The model is divided into two stages: prefill and decode.

预填充阶段的输入是由输入的文字经过嵌入(embedding)之后得到的矩阵，经过模型的处理后，可以得到一个向量，对应于第一个语义单元(token)的嵌入(embedding)，这个向量会作为解码(decode)阶段的输入，经过模型处理得到一个向量，对应于第2个语义单元(token)的嵌入(embedding)，所以显然大语言模型采取的是语义单元逐个生成的方式，由此可以得出生成k个语义单元需要执行模型k次。所以影响大语言模型的性能不仅仅是在模型的准确率上，更在生成相应语义单元的速度上。数据表明，假设输入语义单元长度为2k，对于Llama2-7B的模型在A100 GPU上执行decode阶段，每次生成一个语义单元约15ms。而对于Llama2-70B的模型，在A100 GPU上执行decode阶段，每次生成一个语义单元数量约为35ms，大概是7B模型的两倍多延迟。The input of the pre-filling stage is a matrix obtained by embedding the input text. After being processed by the model, a vector corresponding to the embedding of the first semantic unit (token) can be obtained. This vector will be used as the input of the decoding stage. After being processed by the model, a vector corresponding to the embedding of the second semantic unit (token) is obtained. Therefore, it is obvious that the large language model adopts the method of generating semantic units one by one. It can be concluded that the model needs to be executed k times to generate k semantic units. Therefore, the performance of the large language model is not only affected by the accuracy of the model, but also by the speed of generating the corresponding semantic units. The data shows that, assuming that the length of the input semantic unit is 2k, for the Llama2-7B model, the decode stage is executed on the A100 GPU, and each semantic unit is generated for about 15ms. For the Llama2-70B model, the decode stage is executed on the A100 GPU, and the number of semantic units generated each time is about 35ms, which is about twice the delay of the 7B model.

目前对于加速大语言模型推理的方法主要有：At present, the main methods for accelerating the reasoning of large language models are:

1.模型压缩。对模型进行剪枝，即稀疏化，对权重和activation(激活)进行量化，还有对模型进行蒸馏。这种方式会带来较好的效果，但是本质上改变了模型，所以输出结果会发生一些改变。1. Model compression. Prune the model, that is, make it sparse, quantize the weights and activations, and distill the model. This method will bring better results, but it essentially changes the model, so the output results will change.

2.算子加速。对模型的计算流进行优化，这种方式虽然没有改变结果，但是提升的幅度有限。2. Operator acceleration: Optimizing the model’s computational flow. Although this method does not change the results, the improvement is limited.

3.模型辅助验证。利用大参数模型对小参数模型的输出结果进行优化，这种方式在理想情况下能够实现利用小模型的推理时间得到大模型的推理结果。3. Model-assisted verification: Use a large-parameter model to optimize the output of a small-parameter model. Ideally, this approach can achieve the inference results of a large model using the inference time of a small model.

但是，针对大模型辅助验证，在系统上没有进行很好的优化。However, the system is not well optimized for large model-assisted verification.

发明内容Summary of the invention

有鉴于现有技术的上述缺陷，本发明提供一种大模型辅助验证的优化方法、计算系统及存储介质，该大模型辅助验证的优化方法通过采用小规模的模型逐个生成语义单元，并采用大规模的模型对生成的语义单元进行验证，以提升大模型推理的计算并行度。为实现以上技术目的，本发明提供了：In view of the above-mentioned defects of the prior art, the present invention provides an optimization method, a computing system and a storage medium for large-model assisted verification. The optimization method for large-model assisted verification generates semantic units one by one by using a small-scale model, and verifies the generated semantic units by using a large-scale model, so as to improve the computational parallelism of large-model reasoning. To achieve the above technical objectives, the present invention provides:

一种大模型辅助验证的优化方法，其包括：A large model-assisted verification optimization method, comprising:

第一大模型M₁基于输入逐个生成输出，所述输出包括n+1个语义单元；The first large model _M1 generates outputs one by one based on the inputs, wherein the outputs include n+1 semantic units;

第二大模型M₂对第一大模型M₁的输入以及输出进行处理，得到第二大模型M₂的输出，以验证第一大模型M₁输出的第n+1个语义单元的正确性；其中，第一大模型M₁的规模小于第二大模型M₂的规模。The second largest model M ₂ processes the input and output of the first largest model M ₁ to obtain the output of the second largest model M ₂ to verify the correctness of the n+1th semantic unit output by the first largest model M ₁ ; wherein the scale of the first largest model M ₁ is smaller than that of the second largest model M ₂ .

本发明的进一步改进在于：A further improvement of the present invention is:

第一大模型M₁在生成n+1个语义单元的预填充阶段的输入包括L个语义单元组组合形成的矩阵X；第一大模型M₁输出的前n个语义单元对应的嵌入向量a₁,a₂,a₃……a_n组合形成的矩阵为Y；The input of the first large model M ₁ in the pre-filling stage of generating n+1 semantic units includes a matrix X formed by the combination of L semantic unit groups; the matrix formed by the combination of the embedding vectors a ₁ , a ₂ , a ₃ …… a _n corresponding to the first n semantic units output by the first large model M ₁ is Y;

第二大模型M₂对第一大模型M₁的输入以及输出的处理包括预填充阶段和解码阶段；在预填充阶段，第二大模型M₂的输入为矩阵X和矩阵Y在行方向组合形成的矩阵；在解码阶段第二大模型M₂的输入为矩阵Y，在此阶段，第二大模型M₂在计算矩阵P＝QK^T时采用下三角掩码，其中Q为查询矩阵，K为键矩阵。The processing of the input and output of the first largest model M ₁ by the second largest model M ₂ includes a pre-filling stage and a decoding stage; in the pre-filling stage, the input of the second largest model M ₂ is a matrix formed by combining the matrix X and the matrix Y in the row direction; in the decoding stage, the input of the second largest model M ₂ is the matrix Y. In this stage, the second largest model M ₂ uses a lower triangular mask when calculating the matrix P = QK ^T , where Q is the query matrix and K is the key matrix.

本发明的进一步改进在于：第一大模型M₁以及第二大模型M₂均包括若干变换器模块,变换器模块包括注意力模块和前馈网络。A further improvement of the present invention is that the first large model _M1 and the second large model _M2 both include a plurality of converter modules, and the converter modules include an attention module and a feedforward network.

本发明的进一步改进在于：第一大模型M₁的种类包括Llama2-7B；第二大模型M₂的种类包括Llama2-70B。A further improvement of the present invention is that the types of the first large model _M1 include Llama2-7B; and the types of the second large model _M2 include Llama2-70B.

本发明还提供一种计算系统，所述计算系统包括：The present invention also provides a computing system, the computing system comprising:

存储器，所述存储器用于存储计算机可执行指令；和a memory for storing computer-executable instructions; and

处理器，所述处理器用于执行所述计算机可执行指令，以实现上述的大模型辅助验证的优化方法。A processor is used to execute the computer executable instructions to implement the above-mentioned optimization method for large model assisted verification.

本发明的进一步改进在于：所述处理器包括图形处理单元GPU、神经网络处理单元NPU以及现场可编程门阵列FPGA芯片。A further improvement of the present invention is that the processor includes a graphics processing unit GPU, a neural network processing unit NPU and a field programmable gate array FPGA chip.

本发明还提供一种计算机可读存储介质，所述计算机可读存储介质存储有可执行指令，计算机执行所述可执行指令时能够实现根据权利要求1-4中任一项所述的大模型辅助验证的优化方法。The present invention also provides a computer-readable storage medium, which stores executable instructions. When a computer executes the executable instructions, it can implement the optimization method for large model-assisted verification according to any one of claims 1-4.

本发明提供的技术方案具有以下技术效果：用较小的模型逐个生成语义单元，并利用大的模型对生成的语义单元序列进行验证；这种方式可以兼顾效率以及准确性，可有效提高大语言模型的并行度以及计算效率。The technical solution provided by the present invention has the following technical effects: using a smaller model to generate semantic units one by one, and using a large model to verify the generated semantic unit sequence; this method can take into account both efficiency and accuracy, and can effectively improve the parallelism and computational efficiency of the large language model.

以下将结合附图对本发明的构思、具体结构及产生的技术效果作进一步说明，以充分地了解本发明的目的、特征和效果。The concept, specific structure and technical effects of the present invention will be further described below in conjunction with the accompanying drawings to fully understand the purpose, characteristics and effects of the present invention.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1所示为本发明中大模型辅助验证的优化方法的示意图；FIG1 is a schematic diagram of an optimization method for large model-assisted verification in the present invention;

图2为本发明中采用的下三角掩码(mask)的示意图。FIG. 2 is a schematic diagram of a lower triangle mask used in the present invention.

具体实施方式Detailed ways

以下通过特定的具体实例说明本发明的实施方式，本领域技术人员可由本说明书所揭露的内容轻易地了解本发明的其他优点与功效。本发明还可以通过另外不同的具体实施方式加以实施或应用，本说明书中的各项细节也可以基于不同观点与应用，在没有背离本发明的精神下进行各种修饰或改变。需说明的是，在不冲突的情况下，以下实施例及实施例中的特征可以相互组合。The following describes the embodiments of the present invention by specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can also be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the following embodiments and features in the embodiments can be combined with each other without conflict.

如图1所示，本发明的实施例提供了一种大模型辅助验证的优化方法，包括：As shown in FIG1 , an embodiment of the present invention provides an optimization method for large model-assisted verification, including:

第一大模型M₁逐个生成n+1个语义单元；前n个语义单元对应的嵌入向量分别为a₁,a₂,a₃……a_n；第一大模型M₁在生成n+1个语义单元的预填充阶段的输入包括L个语义单元组组合形成的矩阵X；第一大模型M₁输出的前n个语义单元对应的嵌入向量a₁,a₂,a₃……a_n组合形成的矩阵为Y；The first large model M ₁ generates n+1 semantic units one by one; the embedding vectors corresponding to the first n semantic units are a ₁ , a ₂ , a ₃ …… a _n respectively; the input of the first large model M ₁ in the pre-filling stage of generating n+1 semantic units includes a matrix X formed by the combination of L semantic unit groups; the embedding vectors a ₁ , a ₂ , a ₃ …… a _n corresponding to the first n semantic units output by the first large model M ₁ are formed by the combination of Y;

第二大模型M₂对第一大模型M₁的输入以及输出进行处理，得到输出，以验证第一大模型M₁输出的第n+1个语义单元的正确性。第二大模型M₂对第一大模型M₁的输入以及输出的处理包括预填充阶段和解码阶段；在预填充阶段，第二大模型M₂的输入为矩阵X和矩阵Y在行方向组合形成的矩阵；在解码阶段第二大模型M₂的输入为矩阵Y，在此阶段，第二大模型M₂在计算矩阵P＝QK^T时采用下三角掩码。The second largest model M ₂ processes the input and output of the first largest model M ₁ to obtain an output to verify the correctness of the n+1th semantic unit output by the first largest model M _1. The processing of the input and output of the first largest model M ₁ by the second largest model M ₂ includes a pre-filling stage and a decoding stage; in the pre-filling stage, the input of the second largest model M ₂ is a matrix formed by combining the matrix X and the matrix Y in the row direction; in the decoding stage, the input of the second largest model M ₂ is the matrix Y, and in this stage, the second largest model M ₂ uses a lower triangular mask when calculating the matrix P = QK ^T.

本实施例中，第一大模型M₁的种类为Llama2-7B；第二大模型M₂的种类为Llama2-70B。第一大模型M₁的规模以及运行耗时均远小于第二大模型M₂。In this embodiment, the type of the first large model _M1 is Llama2-7B, and the type of the second large model _M2 is Llama2-70B. The scale and running time of the first large model _M1 are much smaller than those of the second large model _M2 .

本实施例中，第一大模型M₁每生成3～4个语义单元采用第二大模型M₂验证一次；如果第二大模型M₂生成的语义单元与第一大模型M₁一致，则表示验证通过。In this embodiment, the first large model _M1 generates 3 to 4 semantic units and verifies them once using the second large model _M2 ; if the semantic units generated by the second large model _M2 are consistent with those of the first large model _M1 , it means that the verification is passed.

下面结合附图介绍该验证方法的正确性：The following is an introduction to the correctness of the verification method with the attached drawings:

如图1所示，第一大模型M₁在生成n+1个语义单元的预填充阶段的输入包括L个语义单元组组合形成的矩阵X；其依次生成n+1个语义单元；前n个语义单元对应的嵌入向量a₁,a₂,a₃……a_n组合形成的矩阵为Y；矩阵L的维度是L×dim；其中，dim是嵌入向量的维度。As shown in Figure 1, the input of the first large model _M1 in the pre-filling stage of generating n+1 semantic units includes a matrix X formed by the combination of L semantic unit groups; it generates n+1 semantic units in turn; the embedding vectors _a1 , _a2 , _a3 ... _an corresponding to the first n semantic units are combined to form a matrix Y; the dimension of the matrix L is L×dim; where dim is the dimension of the embedding vector.

本实施例中，第一大模型M₁以及第二大模型M₂均包括若干变换器模块(transfromblock),变换器模块包括注意力模块(attention)和前馈网络(feed-forward network)。大模型计算过程中涉及两个计算操作，分别是注意力机制(Attention)计算和输入与权重做矩阵乘法。In this embodiment, the first large model _M1 and the second large model _M2 both include several transformer modules, and the transformer modules include attention modules and feed-forward networks. The large model calculation process involves two calculation operations, namely, attention mechanism calculation and matrix multiplication of input and weight.

一、输入与权重做矩阵乘法的正确性1. Correctness of matrix multiplication of input and weight

首先先验证输入与权重做矩阵乘法的正确性。假设输入是n个嵌入向量a₁,a₂,a₃……a_n组成的矩阵A₀，其与权重矩阵的乘法的表达式为：First, verify the correctness of the matrix multiplication of the input and the weight. Assume that the input is a matrix A ₀ composed of n embedding vectors a ₁ , a ₂ , a ₃ … a _n , and the expression of its multiplication with the weight matrix is:

y₁＝a₁W^T y ₁ = a ₁ W ^T

y₂＝a₂W^T y ₂ = a ₂ W ^T

………

y_n＝a_nW^T y _n =a _n W ^T

Y＝A₀W^T Y＝A ₀ W ^T

Y在本质上是由y₁,......,y_n拼接而成，即：Y is essentially composed of y ₁ ,......,y _n , that is:

所以对于向量依次送入矩阵乘法计算和将所有向量组成一个矩阵送入矩阵乘法计算，这两者的结果是一致的。所以输入与权重做矩阵乘法的正确性得以验证。Therefore, the results of the two methods are the same: the vectors are fed into the matrix multiplication calculation one by one and the matrix multiplication calculation is composed of all vectors. Therefore, the correctness of the matrix multiplication of input and weight is verified.

二、注意力机制计算的正确性2. Correctness of Attention Mechanism Calculation

以下将分别从预填充(prefill)阶段和解码(decode)阶段验证注意力机制(Attention)计算的正确性。The following will verify the correctness of the attention mechanism calculation from the prefill stage and the decoding stage respectively.

Prefill阶段，M₂输入为X与a₁,a₂,a₃……a_n所组成的矩阵Y,在X的行方向上进行扩展，大小是(L+n)×dim。根据上文解释的与权重做矩阵乘法的正确性，可以得到对应的Q,K,V是正确的，大小是(L+n)×dim。接下来将Q与K相乘。In the prefill phase, M ₂ takes as input the matrix Y consisting of X and a ₁ , a ₂ , a ₃ …… a _n , and expands in the row direction of X to a size of (L+n)×dim. Based on the correctness of the matrix multiplication with the weights explained above, the corresponding Q, K, and V are correct and have a size of (L+n)×dim. Next, Q is multiplied by K.

P＝QK^T P＝QK ^T

结果P大小是(L+n)×(L+n)，但是由于有下三角mask的存在，所以P也是由正确结果拼接而成，即：The size of the result P is (L+n)×(L+n), but due to the existence of the lower triangle mask, P is also composed of the correct results, that is:

其中P_x是输入为X的计算结果，P₁,……,P_n，是输入为a₁,a₂,a₃……a_n的计算结果。对应的O＝PV本质上就是矩阵乘法，所以正确性不变。Where P _x is the result of the calculation with input X, and P ₁ , ..., P _n , are the results of the calculation with input a ₁ , a ₂ , a ₃ ... _an . The corresponding O = PV is essentially matrix multiplication, so the correctness remains unchanged.

解码(Decode)阶段，M₂输入为a₁,a₂,a₃……a_n所组成的矩阵Y，值得注意的是解码(Decode)阶段没有掩码(mask)，所以本实施例在计算过程中，需要人为的加上掩码(mask)。Y对应的Q,K,V,大小为n×dim，与KV_cache(键值缓存)拼接之后，K,V的大小均为(L+n)×dim，其中Q为查询矩阵，K为键矩阵，V为值矩阵。所以计算P的过程中，需要人为的加上掩码(mask)，确保P的正确性。In the decoding stage, M ₂ inputs a matrix Y composed of a ₁ , a ₂ , a ₃ …… a _n . It is worth noting that there is no mask in the decoding stage, so in the calculation process of this embodiment, a mask needs to be added manually. The size of Q, K, V corresponding to Y is n×dim. After splicing with KV_cache (key-value cache), the size of K and V is (L+n)×dim, where Q is the query matrix, K is the key matrix, and V is the value matrix. Therefore, in the process of calculating P, a mask needs to be added manually to ensure the correctness of P.

P＝QK^T P＝QK ^T

P的大小是n×(L+n)，对于P[0:n-1,L:L+n-1]这样的一个n×n的方阵需要用下三角的mask来对结果进行纠正。如图2所示，其中斜的虚线下方的下三角矩阵表示掩码，本申请对掩码的具体取值不做限定。The size of P is n×(L+n). For an n×n square matrix such as P[0:n-1,L:L+n-1], a lower triangular mask is needed to correct the result. As shown in FIG2 , the lower triangular matrix below the oblique dotted line represents the mask, and this application does not limit the specific value of the mask.

本发明的实施例还提供一种计算系统，所述计算系统包括：An embodiment of the present invention further provides a computing system, the computing system comprising:

处理器，所述处理器用于执行所述计算机可执行指令，以实现上述的大模型辅助验证的系统优化方法。处理器包括图形处理单元GPU、神经网络处理单元NPU以及现场可编程门阵列FPGA芯片。A processor is used to execute the computer executable instructions to implement the above-mentioned large model-assisted verification system optimization method. The processor includes a graphics processing unit GPU, a neural network processing unit NPU and a field programmable gate array FPGA chip.

本发明的实施例还提供一种计算机可读存储介质，所述计算机可读存储介质存储有可执行指令，计算机执行所述可执行指令时能够实现上述的大模型辅助验证的系统优化方法。An embodiment of the present invention further provides a computer-readable storage medium, wherein the computer-readable storage medium stores executable instructions, and when a computer executes the executable instructions, the above-mentioned system optimization method for large model-assisted verification can be implemented.

上述实施例仅例示性说明本发明的原理及其功效，而非用于限制本发明。任何熟悉此技术的人士皆可在不违背本发明的精神及范畴下，对上述实施例进行修饰或改变。因此，举凡所属技术领域中具有通常知识者在未脱离本发明所揭示的精神与技术思想下所完成的一切等效修饰或改变，仍应由本发明的权利要求所涵盖。The above embodiments are merely illustrative of the principles and effects of the present invention, and are not intended to limit the present invention. Anyone familiar with the art may modify or alter the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or alterations made by a person of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims

1. An optimization method for large model aided verification, comprising the steps of:

The first large model M ₁ generates an output one by one based on the input, the output comprising n+1 semantic units;

The second large model M ₂ processes the input and output of the first large model M ₁ to obtain the output of the second large model M ₂ so as to verify the correctness of the n+1st semantic unit output by the first large model M ₁; wherein the scale of the first large model M ₁ is smaller than the scale of the second large model M ₂.

2. The optimization method for large model aided verification of claim 1, wherein:

The input of the first large model M ₁ in the pre-filling stage for generating n+1 semantic units comprises a matrix X formed by combining L semantic unit groups; the matrix formed by combining the embedded vectors a ₁,a₂,a₃......a_n corresponding to the first n semantic units output by the first large model M ₁ is Y;

The processing of the input and output of the second large model M ₂ to the first large model M ₁ includes a pre-fill stage and a decode stage; in the pre-filling stage, the input of the second large model M ₂ is a matrix formed by combining a matrix X and a matrix Y in the row direction; the input of the second large model M ₂ is matrix Y during the decoding phase, at which stage the second large model M ₂ uses a lower triangular mask in computing the matrix p=qk ^T, where Q is the query matrix and K is the key matrix.

3. The optimization method for large model aided verification of claim 1, wherein: the first large model M ₁ and the second large model M ₂ each include a number of transducer modules including an attention module and a feed forward network.

4. The optimization method for large model aided verification of claim 1, wherein: the categories of the first large model M ₁ include L1ama2-7B; the second large model M ₂ includes Llama2-70B.

5. A computing system, the computing system comprising:

a memory for storing computer-executable instructions; and

A processor for executing the computer-executable instructions to implement the optimization method of large model-assisted verification according to any one of claims 1-4.

6. The computing system of claim 5, wherein the processor comprises a graphics processing unit GPU, a neural network processing unit NPU, and a field programmable gate array FPGA chip.

7. A computer readable storage medium storing executable instructions that when executed by a computer enable the optimization method of large model-assisted verification according to any one of claims 1-4.