CN111916058A - Voice recognition method and system based on incremental word graph re-scoring - Google Patents
- Publication number
- CN111916058A CN111916058A CN202010588022.XA CN202010588022A CN111916058A CN 111916058 A CN111916058 A CN 111916058A CN 202010588022 A CN202010588022 A CN 202010588022A CN 111916058 A CN111916058 A CN 111916058A
- Authority
- CN
- China
- Prior art keywords
- word
- state
- graph
- decoding
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
Abstract
The invention discloses a speech recognition method and system based on incremental word graph rescoring. The method acquires the speech signal to be recognized and extracts acoustic features; a trained acoustic model computes the likelihood probabilities corresponding to the acoustic features; the decoder builds the corresponding decoding network, obtains a state-level word graph from it, and derives a word-level word graph through incremental determinization; the state-level word graph of the remaining decoding network is then determinized and merged with the word-level word graph already obtained to generate the first-pass word graph; the first-pass word graph is composed with a rescoring language model trained on a small corpus via a finite-state transducer composition algorithm to obtain the target word graph; the best-cost-path word graph of the target word graph then yields the corresponding word sequence, which is taken as the final recognition result. The invention reduces the amount of determinization computation left after an ordinary decoder finishes decoding, speeding up decoding, and lowers the word error rate of speech recognition in specific scenarios, improving accuracy.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and in particular relates to a speech recognition method and system based on incremental word graph rescoring.
Background Art
In recent years, with the rapid development of the artificial intelligence industry, speech recognition technology has received growing attention from both academia and industry. As the front-end technology of voice interaction, speech recognition plays a vital role. It is widely used in many human-computer interaction systems, such as intelligent customer service, chatbots, personal intelligent assistants, and smart homes.
Current mainstream speech recognition systems are built on the HMM-DNN framework, whose advantage is that a reasonably accurate recognizer can be trained from relatively little data. The decoder is an essential component of a speech recognition system: it chains the acoustic model, the pronunciation lexicon, and the language model to process the input speech features and build a decoding network, from which it obtains a set of state sequences and their corresponding word graphs, and then outputs the word sequence corresponding to the best state sequence as the final recognition result.
In existing methods, guaranteeing first-pass recognition accuracy requires a large beam search width, which inflates the final word graph so that recognition is still not fast enough. There are methods that avoid one-pass decoding with a large beam width; although they speed up decoding by roughly 2-3x at lower word error rates (WERs), an overly small beam value may cause a large discrepancy between the two decoding passes and compromise the final result. GPU-parallelized decoding is costly, and the large-scale use of such decoders in industrial settings remains questionable.
Summary of the Invention
The technical problem addressed by the present invention, in view of the deficiencies of the above prior art, is to provide a speech recognition method and system based on incremental word graph rescoring. During decoding, part of the audio is first determinized to produce a word-level word graph; when the remaining audio is processed, a new word-level word graph is generated on top of the previous one. Without affecting first-pass recognition accuracy, this greatly speeds up word graph generation, reduces the latency after a long utterance ends in streaming decoding scenarios, and makes it convenient to output intermediate results. A higher-order language model trained on a small-sample corpus then rescores the first-pass result, enabling self-learning in specific scenarios, thereby reducing the word error rate and improving the accuracy of speech recognition.
The present invention adopts the following technical solutions:
A speech recognition method based on incremental word graph rescoring, comprising the following steps:
S1: acquiring the speech signal to be recognized and extracting acoustic features through preprocessing;
S2: computing, with a trained acoustic model, the likelihood probabilities corresponding to the acoustic features;
S3: the decoder building the corresponding decoding network from the trained decoding graph and the acoustic information computed in step S2, obtaining a state-level word graph from the decoding network, and deriving a word-level word graph through determinization as the word graph is updated;
S4: after decoding ends, determinizing the state-level word graph of the remaining decoding network and merging it with the word-level word graph already obtained to generate the first-pass word graph;
S5: composing the first-pass word graph with a rescoring language model trained on a small corpus via the finite-state transducer composition algorithm to obtain the target word graph;
S6: obtaining the best-cost-path word graph of the target word graph, then the word sequence corresponding to the best state sequence of that word graph, which is taken as the final recognition result.
Specifically, in step S1, the speech signal is preprocessed by adding white Gaussian noise (dithering), pre-emphasis, windowing, and the like; a fast Fourier transform converts the preprocessed time-domain signal into the frequency domain to obtain the power spectrum; a dot product with a bank of triangular filters yields the Mel energies, giving acoustic features of the corresponding dimension.
Specifically, in step S2, the acoustic features computed in step S1 are used as the input of the acoustic model: several frames before and after the center frame are fed into the acoustic model together, and after the neural network computation the acoustic posterior probability of every phonetic unit for the center frame is obtained.
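The frame splicing described above (feeding the center frame together with its left and right context) can be sketched in a few lines of NumPy. The edge-padding convention (repeating the first/last frame) and the context sizes are assumptions for illustration:

```python
import numpy as np

def splice_frames(feats, left=13, right=9):
    """Stack `left` past and `right` future frames around each center frame.

    feats: (T, D) array of per-frame acoustic features.
    Returns a (T, (left+1+right)*D) array; edge frames are padded by
    repeating the first/last frame (a common convention, assumed here).
    """
    T, D = feats.shape
    padded = np.concatenate([
        np.repeat(feats[:1], left, axis=0),    # replicate first frame
        feats,
        np.repeat(feats[-1:], right, axis=0),  # replicate last frame
    ])
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])

feats = np.random.randn(100, 40)   # 100 frames of 40-dim features
spliced = splice_frames(feats)     # 13 + 1 + 9 = 23 frames per input
print(spliced.shape)               # (100, 920)
```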
Specifically, in step S3, the decoder searches the decoding graph with the Viterbi dynamic-programming algorithm, builds the decoding network by combining the acoustic costs computed in step S2 with the graph costs in the decoding graph, and constrains the lattice size by threshold-based path pruning; it then obtains a state-level word graph from the decoding network and derives a new word-level word graph through determinization as the word graph is updated.
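A toy sketch of the frame-synchronous Viterbi search with beam pruning described here. The graph, symbols, and costs are hypothetical; a real decoder operates on a full decoding graph with far more states:

```python
import math

# Toy decoding graph: state -> list of (next_state, symbol, graph_cost).
# A real decoding graph is vastly larger; this is only illustrative.
GRAPH = {
    0: [(1, "a", 0.5), (2, "b", 1.0)],
    1: [(1, "a", 0.3), (2, "b", 0.4)],
    2: [(2, "b", 0.2), (1, "a", 0.9)],
}

def viterbi_beam(acoustic_costs, start=0, beam=4.0):
    """Frame-synchronous Viterbi with beam pruning.

    acoustic_costs: list of dicts, one per frame, mapping symbol -> -log p.
    Returns the surviving tokens of the last frame and the per-frame
    token tables needed for backtracking.
    """
    tokens = {start: (0.0, None)}       # state -> (total_cost, predecessor)
    history = []
    for frame in acoustic_costs:
        new_tokens = {}
        for state, (cost, _) in tokens.items():
            for nxt, sym, gcost in GRAPH[state]:
                total = cost + gcost + frame.get(sym, math.inf)
                if nxt not in new_tokens or total < new_tokens[nxt][0]:
                    new_tokens[nxt] = (total, state)
        best = min(c for c, _ in new_tokens.values())
        # Beam pruning: drop tokens worse than best + beam.
        tokens = {s: t for s, t in new_tokens.items() if t[0] <= best + beam}
        history.append(tokens)
    return tokens, history

frames = [{"a": 0.1, "b": 2.0}, {"a": 1.5, "b": 0.2}, {"a": 0.3, "b": 1.0}]
final, _ = viterbi_beam(frames)
best_cost = min(c for c, _ in final.values())
print(round(best_cost, 2))   # → 2.4
```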
Further, the specific steps are as follows:
S301: obtaining the state-level word graph of F from the state sequences of the decoding network, including state numbers and transition arcs;
S302: determinizing the first part of F. The last frame of the first part is the re-determinized state; adding a final state to the arcs of the re-determinized state constitutes finite-state acceptor A. A is determinized into a by merging, from the original graph, transitions with identical input labels and adding them step by step to a new, initially empty graph;
S303: processing the second part, whose first frame is the last frame of the first part, i.e., the re-determinized state. The last of the re-determinized states is taken as the initial state to build finite-state acceptor B; B reuses acceptor A's processing result for the re-determinized states. The arc labels of the re-determinized states are found through the state-to-arc-label mapping table, and the arc labels are mapped to the state numbers that the re-determinized states received in the determinization of the first part. The new state numbers are put into one-to-one correspondence with the re-determinized states, and the state numbers and transition arcs of the following frames are appended in turn, yielding the second-part finite-state acceptor B, which is determinized into b;
S304: combining a and b into finite-state acceptor C. The states of C normally consist of two parts: all states of a whose transition arcs carry no arc label, and all states of b except the first. The arcs of C comprise all arcs of b except those leaving the initial state, plus all arcs of a that start and end at non-re-determinized states. If the initial state of a is not a re-determinized state, it becomes the initial state of C; otherwise the initial state of b is used as the initial state of C. Finally, the final result G is obtained by removing the empty labels in C, which realizes incremental word graph generation.
Furthermore, the present invention is also characterized in that step S302 is specifically:
S3021: creating a new empty graph, adding the initial state of the original graph and the corresponding initial weight to the new graph, creating a queue, and putting the state into the queue;
S3022: taking a state p from the head of the queue and traversing the input labels of all transitions leaving p; for each input label x, adding a new state and a corresponding transition to the new graph, where the new transition's input label is x and its weight is the ⊕ operation over all transitions labeled x in the original graph, thereby merging several transitions of the original graph into one;
S3023: adding the new states created in step S3022 to the queue;
S3024: returning to step S3022 and continuing to process the queue until it is empty; the determinized result is called a.
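Steps S3021-S3024 amount to a queue-driven subset construction. A minimal sketch over the tropical semiring (taking ⊕ as min, and omitting the residual-weight bookkeeping of full weighted determinization) might look like:

```python
from collections import deque

def determinize(arcs, start):
    """Queue-based determinization of a weighted acceptor (S3021-S3024).

    arcs: dict state -> list of (input_label, next_state, weight).
    Uses min as the semiring (+) operation (tropical semiring) and tracks
    subsets of original states; residual-weight pushing is omitted.
    """
    start_set = frozenset([start])
    new_arcs = {}                   # subset -> {label: (next_subset, weight)}
    queue = deque([start_set])      # S3021: seed the queue with the start state
    while queue:                    # S3024: loop until the queue is empty
        subset = queue.popleft()    # S3022: take a state from the head
        if subset in new_arcs:
            continue
        by_label = {}
        for state in subset:
            for label, nxt, w in arcs.get(state, []):
                dest, best = by_label.get(label, (set(), float("inf")))
                dest.add(nxt)
                by_label[label] = (dest, min(best, w))  # merge same-label arcs
        new_arcs[subset] = {}
        for label, (dest, w) in by_label.items():
            dest = frozenset(dest)
            new_arcs[subset][label] = (dest, w)
            queue.append(dest)      # S3023: enqueue the new state
    return new_arcs

# Non-deterministic acceptor: two 'a' arcs leave state 0.
arcs = {0: [("a", 1, 0.5), ("a", 2, 0.7), ("b", 3, 0.1)],
        1: [("c", 3, 0.2)], 2: [("c", 3, 0.4)]}
det = determinize(arcs, 0)
print(sorted(det[frozenset([0])]))   # ['a', 'b'] — one arc per label
```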
Specifically, in step S4, the incrementally generated word-level word graph is obtained after decoding finishes; the last part of the state-level word graph corresponding to the decoding network is then determinized; finally, the two word-level word graphs are merged to generate the target word graph, completing the last part of incremental word graph generation.
Specifically, in step S5, the first-pass word graph and the rescoring language model trained on a small corpus are composed via the finite-state transducer composition algorithm to obtain the target word graph. The long short-term memory neural network language model is trained with Monte-Carlo-based importance sampling. Denote the first-pass word graph as T1 and the G.fst converted from the LSTM-RNN language model as T2; T1 and T2 generate the target word graph T via a breadth-first-search composition algorithm.
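Composing the first-pass word graph with G.fst effectively re-weights every path with the rescoring LM. The sketch below replaces composition with explicit path enumeration on a tiny two-path lattice to keep the idea visible; the bigram scores and lm_scale value are illustrative assumptions:

```python
def rescore(lattice_paths, lm_logprob, lm_scale=0.8):
    """Combine first-pass lattice costs with rescoring-LM costs.

    lattice_paths: list of (word_sequence, first_pass_cost).
    lm_logprob: function word_sequence -> log10 probability.
    Real systems compose the lattice with G.fst instead of enumerating
    paths; enumeration is used here only to keep the sketch small.
    """
    best = None
    for words, cost in lattice_paths:
        total = cost - lm_scale * lm_logprob(words)  # lower cost = better
        if best is None or total < best[1]:
            best = (words, total)
    return best

# Hypothetical bigram scores (log10), standing in for the LM's G.fst.
BIGRAM = {("<s>", "turn"): -0.3, ("turn", "left"): -0.2,
          ("<s>", "burn"): -1.5, ("burn", "left"): -1.0}

def lm_logprob(words):
    score, prev = 0.0, "<s>"
    for w in words:
        score += BIGRAM.get((prev, w), -3.0)  # crude backoff floor
        prev = w
    return score

paths = [(("burn", "left"), 4.0),   # slightly cheaper acoustically
         (("turn", "left"), 4.2)]
best_words, best_cost = rescore(paths, lm_logprob)
print(best_words, round(best_cost, 2))  # the LM flips the decision
```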
Specifically, in step S6, the best-cost-path word graph of the target word graph is obtained; the word-level word graph is converted into a state-level word graph to get the best state sequence; finally, the corresponding best word sequence is obtained by backtracking through the best predecessor nodes and taken as the final recognition result.
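Best-path extraction with predecessor backtracking can be sketched on a toy acyclic word graph (states assumed topologically numbered; the arcs and words are hypothetical):

```python
def best_path(num_states, arcs, start, final):
    """Shortest path in an acyclic word graph, recovered by backtracking
    through best predecessor nodes."""
    INF = float("inf")
    cost = [INF] * num_states
    pred = [None] * num_states          # (previous_state, word) back-pointers
    cost[start] = 0.0
    for s in range(num_states):         # relax arcs in topological order
        if cost[s] == INF:
            continue
        for nxt, word, w in arcs.get(s, []):
            if cost[s] + w < cost[nxt]:
                cost[nxt] = cost[s] + w
                pred[nxt] = (s, word)
    words, s = [], final
    while pred[s] is not None:          # backtrack from the final state
        s, word = pred[s]
        words.append(word)
    return list(reversed(words)), cost[final]

arcs = {0: [(1, "hello", 1.0), (2, "yellow", 1.4)],
        1: [(3, "world", 0.5)], 2: [(3, "world", 0.3)]}
result = best_path(4, arcs, start=0, final=3)
print(result)   # (['hello', 'world'], 1.5)
```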
Another technical solution of the present invention is a speech recognition system based on incremental word graph rescoring, which, according to the described method, comprises:
a signal acquisition and detection module for obtaining the speech signal to be recognized, detecting it, and retaining the valid speech signal;
a preprocessing module for preprocessing the valid speech signal;
a feature extraction module for extracting features from the preprocessed speech signal to obtain an acoustic feature sequence;
an incremental decoding module, in which the decoder, combining the decoding graph and the acoustic model, decodes the acoustic feature sequence to build the decoding network and incrementally generates word-level word graphs;
a word graph generation module for generating the word-level word graph corresponding to the last part of the state sequences of the decoding network and merging it with the word-level word graph of the incremental decoding module to obtain the first-pass word graph;
a rescoring module that rescores the first-pass word graph with a language model trained on scenario-specific corpora to generate the target word graph;
a recognition module that obtains the final recognition result from the target word graph.
Compared with the prior art, the present invention has at least the following beneficial effects:
In the speech recognition method and system based on incremental word graph rescoring of the present invention, the state-level word graph obtained from the decoding network must be determinized so that no state has two outgoing arcs with the same input label; this shrinks the word graph, and the uniqueness of input labels speeds up generation of the best-cost-path word graph. However, an ordinary decoder can only start determinization from the state sequence corresponding to the first frame of the speech signal, so the determinization operation introduces a perceptible delay when a long utterance ends. By spreading determinization in steps across the whole decoding process, the amount of computation left after decoding is greatly reduced; word-level word graphs become available during decoding, so intermediate recognition results can be generated dynamically on that basis. A computationally heavy language model newly trained on a small corpus then rescores the first-pass word graph, increasing its weight through weighted summation in the composition algorithm, which improves recognition accuracy in specific scenarios and achieves adaptation of speech recognition across domains.
Further, for the speech signal in step S1, voice activity detection (VAD) first filters out a portion of long silent frames, retaining the valid speech signal and achieving a denoising effect. The valid speech blocks are then preprocessed and feature-extracted to obtain acoustic features of the corresponding dimension: the 200 sample points of a 25 ms frame at an 8 kHz sampling rate can be reduced to 40-dimensional FBank features through the Mel filter bank, greatly reducing the dimensionality of the speech signal.
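A crude energy-threshold VAD illustrating the idea of dropping long silent frames. Real VADs add smoothing and hangover logic, and the threshold below is an assumption:

```python
import numpy as np

def energy_vad(signal, frame_len=200, threshold_db=-30.0):
    """Mark frames as speech when their energy exceeds a threshold relative
    to the loudest frame. A toy stand-in for a production VAD.

    signal: 1-D array; frame_len=200 matches 25 ms at 8 kHz.
    """
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = np.sum(frames ** 2, axis=1) + 1e-10
    energy_db = 10.0 * np.log10(energy / energy.max())
    return energy_db > threshold_db     # boolean speech mask per frame

rng = np.random.default_rng(0)
sig = np.concatenate([0.001 * rng.standard_normal(1000),   # near-silence
                      0.5 * rng.standard_normal(1000)])    # "speech"
mask = energy_vad(sig)
print(mask.tolist())   # five silent frames False, five speech frames True
```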
Further, the speech features computed in step S1 are used as the input of the acoustic model. The acoustic probability matrices of multiple frames are generally computed at once: a block of multi-frame speech data is spliced with its context information and fed into the acoustic model together, reducing the number of acoustic posterior matrix computations and speeding up decoding.
Further, the decoding network is built with the Viterbi dynamic-programming algorithm with beam-based word graph pruning; all paths exceeding the beam threshold are pruned. Before the state-level word graph is extracted, state pruning is performed once more to slim the decoding network and thus accelerate incremental word graph generation.
Further, determinizing the state-level word graph makes any input sequence at each state node correspond to a unique transition, which greatly reduces the computation of matching sequences in the graph; the redundancy of the word graph after determinization is much lower than before.
Further, determinizing the word graph after decoding ends is what causes the high latency of traditional decoders; here the incremental word graph generation decoder only needs to determinize the last small block of the not-yet-determinized state-level word graph, producing recognition results quickly.
Further, the first-pass word graph generated in step S4 is rescored by a language model trained on a small-sample, scenario-specific corpus, which can be a traditional N-gram language model or a recurrent neural network language model based on long short-term memory, improving recognition accuracy in the training-corpus scenario and reducing the word error rate.
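For intuition on what any rescoring LM, N-gram or neural, provides: the chain-rule probability of a word sequence. The maximum-likelihood bigram sketch below uses a toy corpus, not the patent's models:

```python
from collections import Counter
import math

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])   # counts of words that have a successor

def bigram_logprob(words):
    """log P(W) via the chain rule with ML bigram estimates,
    P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1});
    returns -inf for unseen bigrams (no smoothing in this sketch)."""
    total = 0.0
    for prev, w in zip(words, words[1:]):
        c = bigrams.get((prev, w), 0)
        if c == 0:
            return float("-inf")
        total += math.log(c / unigrams[prev])
    return total

lp = bigram_logprob("the cat sat".split())
print(round(lp, 3))   # log(2/3) + log(1/2) ≈ -1.099
```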
Further, after the best word sequence is obtained as the final recognition result, the final output is produced by a punctuation service.
In the speech recognition system based on incremental word graph rescoring, pruning is performed once more before the decoding network is determinized, so the generated network is smaller than in the non-incremental case, which speeds up network generation and improves the real-time factor. The first-pass word graph produced after incremental word graph generation is the optimal word graph, so composing it with the rescoring language model takes less time, reducing the latency after the speech ends. By post-processing the first-pass word graph, the recognition results in any scenario can be fine-tuned to improve accuracy, achieving precision gains and broad customization at low cost.
To sum up, the present invention incrementally generates word-level word graphs during decoding, making it convenient to output intermediate results; it reduces the determinization computation of an ordinary decoder after decoding ends, speeding up decoding; and it lowers the word error rate of speech recognition in specific scenarios, improving accuracy.
The technical solutions of the present invention are described in further detail below with reference to the accompanying drawings and embodiments.
Brief Description of the Drawings
Fig. 1 is a flow chart of a speech recognition method provided by the invention;
Fig. 2 is a schematic diagram of the incremental word graph generation method provided by the present invention;
Fig. 3 is a flow chart of sizing the determinization data blocks of the incremental word graph provided by the present invention;
Fig. 4 is a flow chart of re-determinized-state processing for the incremental word graph provided by the present invention;
Fig. 5 is a schematic flow chart of the specific modules of the system of the present invention;
Fig. 6 compares the real-time factor of the method of the present invention against an ordinary decoder;
Fig. 7 compares the latency of the method of the present invention against an ordinary decoder;
Fig. 8 compares the word error rate of the method of the present invention against an ordinary decoder in a specific scenario.
Detailed Description
First, the methods and terminology involved in the present invention are explained.
1) Finite-state acceptor (FSA): a weighted finite-state transducer (WFST) consists of a set of states and directed transitions between them, where each transition carries three pieces of information, namely an input label, an output label, and a weight, recorded in the format "input_label:output_label/weight". The decoding network mentioned in the present invention is a WFST. An FSA can be regarded as a simplification of an FST in which each transition has only an input label.
2) State-level word graph: a directed acyclic graph whose transition arcs carry input labels, output labels, and weight values, where the input labels are alignment information and the output labels are word results.
3) Word-level word graph: also called a compact word graph, obtained by determinizing the state-level word graph; unlike the state-level word graph, its alignment information is stored in the weights rather than in the input labels.
4) Determinization: a classic algorithm on finite-state transducers that ensures that no two transitions leaving any state carry the same input label, guaranteeing the uniqueness of input label sequences.
5) Acoustic features: obtained by further processing the frequency-domain information produced from the speech signal by preprocessing and Fourier transform; the acoustic features used in the experiments of the present invention are filter bank (FBank) features.
6) Acoustic model: obtained by modeling pronunciation-related information and training iteratively on acoustic features; its main role is to measure the match between the input acoustic feature sequence and the phonetic unit sequence, usually expressed as a probability. The experiments of the present invention use an acoustic model based on a hidden Markov model and deep neural network (HMM-DNN).
7) Language model: models the associations between words in the language to be recognized. In the experiments of the present invention, first-pass decoding uses a statistical 3-gram language model; the rescoring language model is a long short-term memory (LSTM) recurrent neural network language model, which models continuous spaces better. The language model gives the probability of occurrence of any word sequence W = {w1, w2, ..., wn}.
8) Viterbi algorithm: a dynamic-programming algorithm for finding the optimal path in the decoding network. In speech recognition decoding, the Viterbi algorithm is usually combined with certain threshold constraints to build the corresponding decoding network.
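The WFST/FSA notation of definition 1) can be illustrated with a minimal transducer class; the phone and word labels are hypothetical, and real decoding graphs are nondeterministic and vastly larger:

```python
class WFST:
    """Minimal weighted finite-state transducer: states with directed
    transitions, each storing input label, output label, and weight,
    in the spirit of the "input_label:output_label/weight" notation."""

    def __init__(self, start, finals):
        self.start, self.finals = start, set(finals)
        self.trans = {}   # state -> {input_label: (next_state, out, weight)}

    def add_arc(self, src, ilabel, olabel, weight, dst):
        self.trans.setdefault(src, {})[ilabel] = (dst, olabel, weight)

    def transduce(self, inputs):
        """Map an input label sequence to (output sequence, total weight);
        deterministic lookup, so each state has one arc per input label."""
        state, out, total = self.start, [], 0.0
        for sym in inputs:
            state, olabel, w = self.trans[state][sym]
            out.append(olabel)
            total += w
        assert state in self.finals, "input not accepted"
        return out, total

# Hypothetical fragment of a lexicon-style transducer: phones -> a word.
fst = WFST(start=0, finals=[2])
fst.add_arc(0, "HH", "hello", 0.5, 1)   # first phone emits the word
fst.add_arc(1, "OW", "<eps>", 0.2, 2)   # remaining phones emit epsilon
out, total = fst.transduce(["HH", "OW"])
print(out, round(total, 2))
```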
Referring to Fig. 1, a speech recognition method based on incremental word graph rescoring of the present invention comprises the following steps:
S1: acquiring the speech signal to be recognized and extracting acoustic features through preprocessing;
S101: preprocessing the speech signal: first, a random Gaussian value is added to every sample point of each frame (dithering); the mean of the frame's sample points is then subtracted to remove the DC component; next, 0.97 times the value of the preceding sample point is subtracted from each sample point for pre-emphasis; finally, a dot product with a window function of frame length yields a more stationary speech signal.
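The four preprocessing steps of S101 map directly onto a few NumPy operations; the dither amplitude and window choice below are assumptions:

```python
import numpy as np

def preprocess_frame(frame, dither=1e-5, preemph=0.97):
    """Per-frame preprocessing as in step S101: dithering, DC removal,
    pre-emphasis, and windowing (the exact constants are assumptions)."""
    frame = frame + dither * np.random.randn(len(frame))  # Gaussian dither
    frame = frame - frame.mean()                          # remove DC offset
    frame = np.append(frame[0],
                      frame[1:] - preemph * frame[:-1])   # pre-emphasis
    return frame * np.hamming(len(frame))                 # window of frame length

frame = np.sin(2 * np.pi * 300 * np.arange(200) / 8000)   # 25 ms at 8 kHz
out = preprocess_frame(frame)
print(out.shape)   # (200,)
```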
S102: acoustic feature extraction: first, a filter-bank object is obtained whose length is padded to a power of two; a fast Fourier transform then converts the preprocessed time-domain signal into frequency-domain spectrum samples; the power spectrum is obtained by multiplying the real part of each spectrum sample by its real part and adding the imaginary part multiplied by its imaginary part; finally, half the power-spectrum samples plus one are convolved with the triangular filter bank distributed on the Mel scale to obtain FBank acoustic features.
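A compact sketch of the FBank pipeline in S102 (power-of-two FFT padding, power spectrum as re² + im², Mel-spaced triangular filters); it is illustrative, not a toolkit-accurate implementation:

```python
import numpy as np

def mel(hz):      # Hz -> Mel
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_inv(m):   # Mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(frame, sr=8000, n_filters=40):
    """FBank features for one preprocessed frame, following step S102."""
    n_fft = 1 << (len(frame) - 1).bit_length()   # pad FFT size to power of two
    spec = np.fft.rfft(frame, n_fft)             # frequency-domain samples
    power = spec.real ** 2 + spec.imag ** 2      # n_fft // 2 + 1 bins
    # Triangular filter bank spaced on the Mel scale.
    pts = mel_inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    bank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        bank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        bank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return np.log(bank @ power + 1e-10)          # log Mel energies

frame = np.hamming(200) * np.sin(2 * np.pi * 300 * np.arange(200) / 8000)
feats = fbank(frame)
print(feats.shape)   # (40,)
```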
S2: computing, with the trained acoustic model, the likelihood probabilities corresponding to the acoustic features;
The acoustic features computed in step S1 are used as the input of the acoustic model. To account for the acoustic context of each frame, multiple frames before and after the center frame are fed into the model together: assuming 13 preceding and 9 following frames, a total of 23 frames of acoustic features are input into a time-delay neural network (TDNN); the first layer maps them to 7 frames of acoustic features, the next three hidden layers map the 7 frames to one frame through subsampling, and finally a softmax normalization yields, for each frame of the speech signal, the acoustic probability matrix over the 5696 pronunciation outcomes obtained by triphone clustering.
S3、解码器通过训练好的解码图和步骤S2计算得到的声学信息构建对应的解码网络,从解码网络中获取状态级别的词图并通过更新词图确定化,增量生成词级别的词图;S3. The decoder constructs a corresponding decoding network through the trained decoding map and the acoustic information calculated in step S2, obtains the state-level word map from the decoding network, and determines it by updating the word map, and incrementally generates the word-level word map ;
Decoding proceeds frame by frame: the states reachable at time t are expanded from those at time t-1, and the states generated in each frame are linked by transition edges. Running the dynamic-programming-based Viterbi algorithm with a tolerance (beam) threshold to constrain the lattice size yields a directed acyclic graph containing all recognition results, which is the decoding network.
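A minimal token-passing sketch of this frame-synchronous Viterbi search with beam (tolerance) pruning. The transition structure is illustrative, not the patent's decoding graph:

```python
def viterbi_beam(frames, transitions, init_states, beam=10.0):
    """Frame-synchronous Viterbi with beam pruning.

    transitions: dict state -> list of (next_state, arc_cost) where arc_cost
    is a callable scoring the current frame. Each frame, states reachable
    from time t-1 are expanded, and tokens whose cost exceeds best + beam
    are pruned, bounding the lattice size.
    Returns dict state -> (cost, back-pointer chain) after the last frame.
    """
    tokens = {s: (0.0, None) for s in init_states}
    for frame in frames:
        new_tokens = {}
        for state, (cost, hist) in tokens.items():
            for nxt, arc_cost in transitions.get(state, []):
                c = cost + arc_cost(frame)
                if nxt not in new_tokens or c < new_tokens[nxt][0]:
                    new_tokens[nxt] = (c, (state, hist))
        if not new_tokens:
            break
        best = min(c for c, _ in new_tokens.values())
        # Beam pruning: drop tokens too far from the frame's best cost.
        tokens = {s: t for s, t in new_tokens.items() if t[0] <= best + beam}
    return tokens
```

The surviving back-pointer chains are exactly the transition edges that make up the lattice described above.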
The present invention proposes to split the state-level word graph corresponding to the decoding network into multiple consecutive blocks for determinization, each block covering a range of sequential frame state sequences in the word graph. By determinizing the state sequences of these sequential frames separately and then joining the results, incremental word graph generation is achieved. The method introduces special symbols, namely arc labels, at the split points between consecutive blocks; a state sequence carrying arc labels is called a redeterminized state. The specific steps of incremental word graph generation are as follows:
Referring to Fig. 2, the determinization of the decoding network is carried out in many blocks, processing two adjacent parts (F) at a time. The frame state sequence where the two parts overlap is the redeterminized state, uniquely identified by the arc labels on its transition edges.
S301. Decode to obtain the first part of the states, and collect each state's state number and transition edges to form the first part of the state-level word graph of F, i.e., the first data block.
S302. Determinize the first part of the state-level word graph of F. Its last frame is the redeterminized state; adding a final state to the transition edges of the redeterminized state yields the finite-state acceptor A. The main idea of determinizing A is to repeatedly merge transitions with the same input label from the original graph and add them step by step to a new, initially empty graph, specifically:
S3021. Create a new empty graph, add the initial states of the original graph and their initial weights to the new graph, create a queue, and put these states into the queue.
S3022. Take a state p from the head of the queue and traverse the input labels of all transitions leaving p. For each input label x, add a new state and a corresponding transition to the new graph; the new transition's input label is x and its weight is the ⊕ of the weights of all transitions labeled x in the original graph. This step merges several transitions of the original graph into one.
S3023. Add the new states created in step S3022 to the queue.
S3024. Return to step S3022 and continue processing the queue until it is empty.
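Steps S3021–S3024 can be sketched as a subset-construction-style loop. This toy version assumes the tropical semiring (⊕ = min) and tracks only subsets of states, not the residual weights a full weighted determinization keeps:

```python
import math
from collections import defaultdict, deque

def determinize(arcs, start):
    """Merge, for every reachable state set, all outgoing arcs sharing an
    input label, keeping the minimum weight (tropical ⊕), and build the
    new graph breadth-first from an initially empty one (S3021-S3024).

    arcs: dict state -> list of (input_label, weight, next_state).
    Returns dict new_state (frozenset of old states) -> list of
    (label, weight, new_state).
    """
    new_start = frozenset([start])
    new_arcs = {}
    queue = deque([new_start])                     # S3021: seed the queue
    while queue:                                   # S3024: until empty
        subset = queue.popleft()                   # S3022: pop state p
        if subset in new_arcs:
            continue
        by_label = defaultdict(lambda: (math.inf, set()))
        for s in subset:
            for label, w, nxt in arcs.get(s, []):
                best, dests = by_label[label]
                by_label[label] = (min(best, w), dests | {nxt})
        out = []
        for label, (w, dests) in sorted(by_label.items()):
            dest = frozenset(dests)
            out.append((label, w, dest))           # one merged transition per label
            queue.append(dest)                     # S3023: enqueue new states
        new_arcs[subset] = out
    return new_arcs
```

In the output, the two same-label arcs out of the start state collapse into one, which is exactly the merging S3022 describes.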
Call the determinized result a; the state numbers of the state sequence uniquely identified by the arc labels may have changed.
S303. Next, consider how to obtain the new data block forming the second part of F. Referring to Fig. 3, the experiments constrain the process with two parameters, a determinization minimum block size threshold and a determinization maximum delay threshold: the former defines the minimum number of frames per block, while the latter determines how many new frames must be decoded before the next block can be determinized. When the number of newly decoded frames exceeds the maximum delay threshold, the frame with the fewest states between the minimum block threshold and the maximum delay threshold is selected as the last frame of the new block. The span from that last frame back to the last frame of the already determinized word graph is the new data block; its first frame is the last frame of the first part, i.e., the redeterminized state. The last state of the redeterminized state is taken as the initial state to build the finite-state acceptor B, which must reuse acceptor A's processing results for the redeterminized state.
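The block-boundary selection in S303 can be sketched as follows. The function name `pick_block_end` and the default threshold values are hypothetical, chosen only for illustration:

```python
def pick_block_end(states_per_frame, determinized_upto,
                   min_block=10, max_delay=50):
    """Choose the last frame of the next block (step S303).

    states_per_frame[t] is the number of lattice states alive at frame t.
    Once more than max_delay frames have accumulated past the already
    determinized prefix, pick the frame with the fewest states in the
    window [determinized_upto + min_block, determinized_upto + max_delay];
    a small frontier there keeps the redeterminized state cheap to copy.
    Returns the chosen frame index, or None if not enough frames yet.
    """
    pending = len(states_per_frame) - determinized_upto
    if pending <= max_delay:
        return None                                # keep decoding
    lo = determinized_upto + min_block
    hi = determinized_upto + max_delay
    return min(range(lo, hi + 1), key=lambda t: states_per_frame[t])
```

Choosing the frame with the smallest state frontier minimizes the size of the redeterminized state shared between consecutive blocks.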
Referring to Fig. 4, the core of incremental word graph generation is handling the redeterminized state so that a new block can reuse the earlier determinization results. First, the arc labels of the redeterminized state are looked up in the mapping table between states and arc labels; the arc labels are then mapped to the state numbers the redeterminized state received in the first part's determinization; the mapping table between state nodes and state numbers is updated so that the new state numbers correspond one-to-one with the redeterminized state, and the state numbers and transition edges of the following frames are added in turn, yielding the second finite-state acceptor B. Applying steps S3021, S3022, S3023, and S3024 to B determinizes it into b.
S304. Merge a and b into the finite-state acceptor C. The states of C normally consist of two parts: all states of a whose transition edges do not carry arc labels, and all states of b except the first.
The arcs of C comprise all arcs of b except those leaving its initial state, together with all arcs of a that start and end at non-redeterminized states. If the initial state of a is not a redeterminized state, it becomes the initial state of the finite-state acceptor C; otherwise the initial state of b is used as the initial state of C.
Finally, removing the empty labels from C yields the final result G, completing incremental word graph generation.
S305. Using the G generated in step S304 as the first part, repeat steps S303 and S304 until the states of the last frame of acoustic features have been generated, then prune the states of the last frame.
S4. After the decoding of step S3 finishes, determinize the state-level word graph of the remaining part of the decoding network and merge it with the word-level word graph already obtained to produce the first-pass decoded word graph.
S401. After decoding finishes, obtain the incrementally generated word-level word graph.
S402. Determinize the last part of the state-level word graph corresponding to the decoding network.
S403. Merge the two word-level word graphs into the target word graph, completing the incremental word graph generation of the last part.
S5. Merge the first-pass decoded word graph with the rescoring language model trained on a small corpus via the finite-state transducer composition algorithm to obtain the target word graph.
S501. Train a long short-term memory recurrent neural network language model (LSTM-RNNLM) with Monte-Carlo-based importance sampling.
S502. Denote the first-pass decoded word graph by T1 and the G.fst converted from the LSTM-RNN language model by T2. The steps for generating the target word graph T from T1 and T2 with the breadth-first-search composition algorithm are as follows:
S5021. Denote the sets of transitions of T1 and T2 by E1 and E2, respectively.
S5022. Traverse all transitions e1 ∈ E1 of T1 and e2 ∈ E2 of T2. During traversal, if the output label o[e1] of some e1 equals the input label i[e2] of some e2, take the source-state pair (p[e1], p[e2]) and the target-state pair (n[e1], n[e2]) of e1 and e2 as two states of T, and add to T a transition from state (p[e1], p[e2]) to (n[e1], n[e2]) whose input label is i[e1], whose output label is o[e2], and whose weight is the weighted sum of the weights of e1 and e2.
S5023. Repeat step S5022 until every pair e1, e2 satisfying o[e1] = i[e2] has been processed. After setting the start and final states and their weights, the composite result T of T1 and T2 is obtained.
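Steps S5021–S5023 amount to on-the-fly composition over pair states. The sketch below uses a plain sum for the weight combination; the patent's weighted sum (e.g. the 0.2 rescoring scale mentioned later) would replace it:

```python
from collections import deque

def compose(arcs1, arcs2, start1, start2):
    """Breadth-first composition of two transducers (steps S5021-S5023).

    arcsN: dict state -> list of (in_label, out_label, weight, next_state).
    Whenever an arc e1 in T1 and an arc e2 in T2 satisfy o[e1] == i[e2],
    the pair state (p[e1], p[e2]) gets an arc to (n[e1], n[e2]) with input
    i[e1], output o[e2], and the combined weight.
    Returns dict pair_state -> list of arcs of the composite T.
    """
    start = (start1, start2)
    out = {}
    queue = deque([start])
    while queue:
        s1, s2 = queue.popleft()
        if (s1, s2) in out:
            continue
        arcs = []
        for i1, o1, w1, n1 in arcs1.get(s1, []):
            for i2, o2, w2, n2 in arcs2.get(s2, []):
                if o1 == i2:                       # o[e1] == i[e2]
                    arcs.append((i1, o2, w1 + w2, (n1, n2)))
                    queue.append((n1, n2))
        out[(s1, s2)] = arcs
    return out
```

A full implementation would also handle epsilon labels and designate start/final weights, as S5023 notes.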
S6. Obtain the optimal-cost-path word graph of the above target word graph, then obtain the word sequence corresponding to the optimal state sequence of that word graph, and take it as the final recognition result.
S601. Determine the optimal predecessor of every state node by dynamic programming to obtain the optimal-cost-path word graph of the above target word graph.
S602. Extract the transition IDs from the weight information of the word-level word graph, substitute them for the input labels, and convert the graph into a state-level word graph to obtain the optimal state sequence.
S603. From the optimal state sequence, obtain the word labels on the arcs whose output labels are non-zero, look up the Chinese characters corresponding to these word labels in the word symbol table, and output them in order, completing the recognition process from audio to Chinese characters.
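Steps S601–S603 can be sketched as a forward dynamic-programming pass plus a backtrace. The arc representation is an assumption for illustration, with output label 0 standing for epsilon as in the text:

```python
def best_path(arcs, start, finals):
    """Best path over a topologically sorted word graph (S601-S603):
    record the best predecessor of every state, then backtrace from the
    cheapest final state and emit the non-epsilon output labels.

    arcs: dict state -> list of (out_label, weight, next_state); states
    are assumed to be numbered in topological order. out_label 0 stands
    for epsilon and is skipped on output.
    """
    INF = float('inf')
    cost = {start: 0.0}
    back = {}
    for s in sorted(arcs):                        # topological order
        if s not in cost:
            continue
        for label, w, nxt in arcs[s]:
            c = cost[s] + w
            if c < cost.get(nxt, INF):
                cost[nxt] = c
                back[nxt] = (s, label)            # optimal predecessor (S601)
    end = min((f for f in finals if f in cost), key=cost.get)
    words = []
    s = end
    while s != start:                             # backtrace (S602-S603)
        s, label = back[s]
        if label != 0:                            # drop epsilon outputs
            words.append(label)
    return list(reversed(words)), cost[end]
```

The word labels returned here would then be mapped to Chinese characters via the word symbol table, as S603 describes.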
Referring to Fig. 3, the present invention provides a speech recognition system based on incremental word graph rescoring, comprising:
a signal acquisition and detection module, configured to obtain the speech signal to be recognized, detect it, and retain the valid speech signal;
a preprocessing module, configured to preprocess the valid speech signal;
The preprocessing module includes: a first processing module, which topologically sorts the states of every frame of the target word graph to guarantee topological order; a second processing module, which, by dynamic programming, traverses the whole word graph to determine the optimal predecessor of every state node and obtains the state ID of the best final state; and a third processing module, which backtraces through that state ID to obtain all state IDs on the best path and thereby the arcs connecting them, together forming a compressed word graph containing the optimal state sequence.
a feature extraction module, which extracts features from the preprocessed speech signal to obtain the acoustic feature sequence;
an incremental decoding module, in which the decoder decodes the acoustic features using the decoding graph and the acoustic model to build the decoding network, and incrementally generates the word-level word graph;
The incremental decoding module includes: a first determination module, in which a chain acoustic model based on a time-delay neural network yields, for each frame of features, the acoustic observation probabilities of all pronunciation units; a second determination module, in which the 3-gram language model in the decoding graph determines the probability of the possible target word sequences given the acoustic feature sequence, i.e., the graph cost; a third determination module, in which the decoder combines the acoustic probabilities with the graph cost of the decoding graph and builds the decoding network with the Viterbi dynamic programming algorithm; and a fourth determination module, which obtains the state-level word graph from the decoding network and determinizes it to incrementally generate the word-level word graph.
a word graph generation module, configured to generate the word-level word graph corresponding to the last part of the state sequence of the decoding network and merge it with the word-level word graph from the previous module to obtain the first-pass decoded word graph;
a rescoring module, which rescores the first-pass decoded word graph with a smaller language model trained on domain-specific corpora to generate the target word graph;
The rescoring module includes: a merging module, which adds the graph cost of the first-pass decoded word graph to the rescoring language model score multiplied by 0.2 and generates a new word graph with the composition algorithm; and a pruning module, which redirects transition edges with excessive weights in the new word graph to a specific state ID, then deletes the transition edges pointing to that state ID as well as the state IDs with no incoming arcs.
a recognition module, which obtains the final recognition result from the target word graph.
The recognition module includes: a processing module, which applies a dynamic programming algorithm to the target word graph to obtain a compressed word graph containing only the optimal-cost state sequence; a conversion module, which converts the compressed word graph into a state-level word graph whose input labels are transition IDs, whose output labels are word IDs, and whose weights are graph costs; and a generation module, which reads off the word sequence mapped by the word IDs on the arcs of that state-level word graph as the final recognition result.
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
This embodiment compares the incremental word graph rescoring speech recognition method with a mainstream decoding method on several real-scenario test sets. The data are real conversations collected over the telephone, covering multiple industries, with each test set averaging about 2 h of audio. Compared with the mainstream decoder, the incremental word graph rescoring method requires three extra parameters: the determinization maximum delay threshold, the determinization minimum block size, and the finite-state transducer converted from the rescoring language model that is passed in.
The metrics referenced by the present invention are mainly the real-time factor (RTF) of speech recognition, the word error rate (WER), and the latency caused by last-frame state pruning and word graph determinization after decoding ends. The real-time factor is computed as Real_time_factor = total_time_taken_ / total_audio_, i.e., the total decoding time of the audio divided by the total duration of that audio. These three metrics are intended to verify the decoding speedup, the recognition accuracy gain, and the latency reduction brought by the incremental word graph rescoring method of the present invention. Tables 1 and 2 compare the decoding real-time factor and latency of the traditional decoder's determinization method (DCG) against the incremental word graph rescoring method of the present invention (ADCG).
Table 1
As Table 1 and Fig. 4 show, the incremental word graph rescoring method far outperforms the mainstream determinization-based method in decoding real-time factor (RTF), with a drop of nearly 25%, greatly increasing decoding speed.
Table 2
The latency metric reflects the performance of incremental word graph generation more accurately than the real-time factor, because it focuses on the post-decoding rescoring delay and also captures variation in the time consumed by rescoring. As Table 2 and Fig. 5 show, the incremental determinization decoder reduces latency by 25% on average compared with the traditional decoder, fully demonstrating the superiority of the method of the present invention.
Finally, combining Table 3 with Figs. 6, 7, and 8, the word error rate is computed on the recognition results of the domain-specific test sets. The incremental word graph rescoring method based on the LSTM recurrent neural network language model improves recognition accuracy by nearly 3.14% over the traditional speech recognition method, and in these scenarios achieves accuracy slightly better than the speech recognition of major vendors in the industry.
Table 3
In summary, the speech recognition method and system based on incremental word graph rescoring of the present invention have, through extensive experiments, demonstrated performance superior to traditional speech recognition methods: faster decoding, lower latency, and higher recognition accuracy in specific scenarios, experimentally proving the superiority of incremental word graph rescoring.
The above merely illustrates the technical idea of the present invention and does not limit its protection scope; any modification made on the basis of the technical solutions in accordance with the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010588022.XA CN111916058B (en) | 2020-06-24 | 2020-06-24 | Speech recognition method and system based on incremental word graph heavy scoring |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111916058A true CN111916058A (en) | 2020-11-10 |
| CN111916058B CN111916058B (en) | 2024-08-16 |
Family
ID=73226545
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112466293A (en) * | 2020-11-13 | 2021-03-09 | 广州视源电子科技股份有限公司 | Decoding graph optimization method, decoding graph optimization device and storage medium |
| CN112509557A (en) * | 2020-11-24 | 2021-03-16 | 杭州一知智能科技有限公司 | Speech recognition method and system based on non-deterministic word graph generation |
| CN112669881A (en) * | 2020-12-25 | 2021-04-16 | 北京融讯科创技术有限公司 | Voice detection method, device, terminal and storage medium |
| CN112687266A (en) * | 2020-12-22 | 2021-04-20 | 深圳追一科技有限公司 | Speech recognition method, speech recognition device, computer equipment and storage medium |
| CN112802476A (en) * | 2020-12-30 | 2021-05-14 | 深圳追一科技有限公司 | Speech recognition method and device, server, computer readable storage medium |
| CN112802461A (en) * | 2020-12-30 | 2021-05-14 | 深圳追一科技有限公司 | Speech recognition method and device, server, computer readable storage medium |
| CN112820280A (en) * | 2020-12-30 | 2021-05-18 | 北京声智科技有限公司 | Generation method and device of regular language model |
| CN112815957A (en) * | 2020-12-31 | 2021-05-18 | 出门问问(武汉)信息科技有限公司 | Voice recognition path planning method, system and platform |
| CN112927682A (en) * | 2021-04-16 | 2021-06-08 | 西安交通大学 | Voice recognition method and system based on deep neural network acoustic model |
| CN113096648A (en) * | 2021-03-20 | 2021-07-09 | 杭州知存智能科技有限公司 | Real-time decoding method and device for speech recognition |
| CN114360510A (en) * | 2022-01-14 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Voice recognition method and related device |
| CN115881134A (en) * | 2022-12-30 | 2023-03-31 | 西安讯飞超脑信息科技有限公司 | Voice recognition method, device, storage medium and equipment |
| CN116403573A (en) * | 2023-01-11 | 2023-07-07 | 湖北星纪魅族科技有限公司 | A Speech Recognition Method |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6278973B1 (en) * | 1995-12-12 | 2001-08-21 | Lucent Technologies, Inc. | On-demand language processing system and method |
| WO2002103675A1 (en) * | 2001-06-19 | 2002-12-27 | Intel Corporation | Client-server based distributed speech recognition system architecture |
| CN1499484A (en) * | 2002-11-06 | 2004-05-26 | 北京天朗语音科技有限公司 | Recognition system of Chinese continuous speech |
| DE102004048348A1 (en) * | 2004-10-01 | 2006-04-13 | Daimlerchrysler Ag | Method for adapting and / or generating statistical language models |
| CN104679738A (en) * | 2013-11-27 | 2015-06-03 | 北京拓尔思信息技术股份有限公司 | Method and device for mining Internet hot words |
| CN108415898A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | The word figure of deep learning language model beats again a point method and system |
| CN110808032A (en) * | 2019-09-20 | 2020-02-18 | 平安科技(深圳)有限公司 | Voice recognition method and device, computer equipment and storage medium |
Non-Patent Citations (2)
| Title |
|---|
| ZHOU ZHAO ET AL.: "Abstractive meeting summarization via hierarchical adaptive segmental network learning", 《ARXIV》, 13 May 2019 (2019-05-13), pages 3455 - 3461, XP059020830, DOI: 10.1145/3308558.3313619 * |
| 张剑 等: "基于循环神经网络语言模型的N-best重打分算法", 《数据采集与处理》, vol. 31, no. 2, 31 March 2016 (2016-03-31), pages 347 - 354 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN111916058B (en) | 2024-08-16 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | | |
| SE01 | Entry into force of request for substantive examination | | |
| GR01 | Patent grant | | |