CN1201284C - Rapid decoding method for voice identifying system

Info

Publication number: CN1201284C
Application number: CNB021486824A
Authority: CN
Inventors: 韩疆; 颜永红; 潘接林; 张建平
Original assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Current assignee: Institute of Acoustics CAS; Beijing Kexin Technology Co Ltd
Priority date: 2002-11-15
Filing date: 2002-11-15
Publication date: 2005-05-11
Anticipated expiration: 2022-11-15
Also published as: CN1455387A

Abstract

The invention relates to a fast decoding method in a speech recognition system. The method comprises the following steps: (1) initializing the decoding operation unit of the speech recognition system; (2) taking the feature codeword vector of the next speech frame, in order, from the speech feature codeword sequence of length T that is input to the decoding operation unit, and setting it as the speech frame O_t at time t, 1≤t≤T; (3) filtering the speech frame O_t at time t; (4) based on the valid speech frame O_t^V, examining each active node in the token resource L_t[I] of every layer I of the dictionary tree token resource L_t at time t; (5) processing the tokens located at the nodes of the dictionary tree; (6) adaptively adjusting the pruning-related thresholds according to the maximum local path probability at time t and the maximum local path probability at the time corresponding to the previous valid speech frame; (7) repeating steps (2)-(6) and outputting the text string generated so far that best matches the acoustic model and the language model, which is the speech recognition result. Compared with traditional methods, this strategy speeds up the decoding operation.

Description

A Fast Decoding Method in a Speech Recognition System

Technical Field

The invention relates to a fast decoding method in a speech recognition system.

Background Art

Decoding is a major component of a speech recognition system. Given an acoustic model and a language model, its function is to take the input sequence of acoustic observation feature vectors and have the computer automatically find, in a statically or dynamically constructed search space, the text string that best matches the acoustic model and the language model, thereby converting the user's speech input into the corresponding text.

Fig. 1 is a structural block diagram of a known speech recognition system. The analog speech signal is converted by the analog-to-digital conversion unit 11 into a digital signal that a computer can process. The feature extraction unit 12 then splits this digital signal into frames, typically with a frame length of 20 ms and a frame shift of 10 ms, and extracts the MFCC parameters of each frame to obtain a sequence of MFCC vectors. The decoding operation unit 14 uses the feature vector sequence of the input speech, the acoustic model 13 and the language model 15, together with a search strategy such as depth-first search (the Viterbi algorithm) or breadth-first search, to obtain the recognition result. In large-vocabulary continuous speech recognition, the language model applies language-level knowledge to the speech recognition system and improves its recognition accuracy.

The speech recognizer of Fig. 1 places very high demands on CPU speed and memory capacity. Some current commercial dictation systems, for example IBM's ViaVoice and the dictation module in Microsoft Office XP, require a fast CPU (Intel Pentium II at 400 MHz or above) and a large amount of memory (100 MB or more). In general, decoding takes up more than 90% of the CPU computing resources of the whole recognizer and almost all of its memory; the analog-to-digital conversion module and the feature extraction unit take up less than 10% of the CPU computing resources and very little memory.

Current commercial embedded speech recognition systems mainly use small-vocabulary, speaker-dependent recognition based on simple template matching, for example voice dialing and simple command recognition in mobile phones. Because this technique requires the user to enroll speech data, its ease of use and applicability are limited. Some speaker-independent embedded speech recognition systems mainly target command-word recognition with a small vocabulary, and their computation and memory requirements are still relatively large. For example, IBM's personal voice assistant speech recognition system needs a computing device with 50 DMIPS of computing power for a task domain of 500 words.

The basic principles and concepts of the known decoding operation are as follows:

1. Dictionary tree

The dictionary tree is a tree structure used to organize the pronunciations of all words in the recognition system. Phonemes are the basic units that make up a word's pronunciation, and the triphone is a phoneme unit commonly used in current speech recognition systems. For example, the triphone sequence of the word "中国" (China) is "sil-zh+ong zh-ong+g ong-g+uo g-uo+sil", where "sil" is a special phoneme that describes pauses in the user's speech. A triphone is a context-dependent phoneme: compared with the usual pinyin representation, it can describe the pronunciation variation that a phoneme undergoes in different contexts, and so describes the acoustic characteristics of a word's pronunciation more precisely. Words in the recognition system may share a prefix character or sub-word; for example, the words "中间" (middle) and "中国" (China) share the prefix "中", and this can be described with a tree structure. Suppose the vocabulary of the recognition system contains the five words "abe", "ab", "acg", "acgi" and "ac"; the dictionary tree of this vocabulary is then as shown in Fig. 4.
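A minimal Python sketch of the dictionary tree for the five-word example vocabulary. It is an illustration only: `TrieNode`, `build_lexicon_tree` and `count_nodes` are hypothetical names, and single letters stand in for the triphone units that a real lexicon would store at each node.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    symbol: str                                   # unit stored at this node (a letter here)
    children: dict = field(default_factory=dict)  # maps next unit -> child TrieNode
    is_word_end: bool = False                     # True if a word ends at this node


def build_lexicon_tree(words):
    """Build a prefix tree so that words sharing a prefix share nodes."""
    root = TrieNode(symbol="<root>")
    for word in words:
        node = root
        for unit in word:
            if unit not in node.children:
                node.children[unit] = TrieNode(symbol=unit)
            node = node.children[unit]
        node.is_word_end = True
    return root


def count_nodes(node):
    """Count all nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


root = build_lexicon_tree(["abe", "ab", "acg", "acgi", "ac"])
print(count_nodes(root))  # root plus the nodes a, b, e, c, g, i
```

Because "ab" is a prefix of "abe" and "ac" is a prefix of "acg" and "acgi", the five words need only six non-root nodes instead of one node per letter.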

Each node in the dictionary tree corresponds to a triphone and is associated with the hidden Markov model (HMM) of that triphone. Fig. 5 shows one HMM topology for representing a triphone; an HMM consists of several HMM states.

2. Token definition and token expansion strategy

A token is an active search path from the first frame of the user's speech to the current speech frame. It contains path identification information and the score of the path's match against the acoustic model and the language model. The path identification information comprises all the words on the path and their boundary information. Each token corresponds to one active search path, and tokens differ from one another in their acoustic context and language context.

Every state of the HMM associated with a dictionary tree node can hold movable tokens. Each state of the node has a token linked list that stores all tokens active in that state at a given time. Suppose that at time t an expandable token in the token linked list of state i of some node has score s_i(t-1). During the search, if s_i(t-1) plus the transition probability from state i to state j plus the observation probability of state j for the current speech frame t exceeds the current pruning threshold, a new token with score s_j(t) is generated and attached to state j. Once all tokens residing on the dictionary tree at time t-1 have been processed, the token resources to be expanded at time t have been produced, and all tokens that resided on the dictionary tree at time t-1 are deleted.
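The token score update described above can be sketched in a few lines of Python. This is an illustration under the assumption that all scores are log probabilities, so that "plus" in the text is ordinary addition; `expand_token` is a hypothetical name, not something defined by the patent.

```python
def expand_token(score_prev, log_trans, log_obs, prune_threshold):
    """Return the new token score s_j(t), or None if the path is pruned.

    score_prev      -- s_i(t-1), score of the token in state i
    log_trans       -- log transition probability from state i to state j
    log_obs         -- log observation probability of state j for frame t
    prune_threshold -- current pruning threshold
    """
    new_score = score_prev + log_trans + log_obs
    if new_score > prune_threshold:
        return new_score
    return None


print(expand_token(-10.0, -1.0, -2.0, -20.0))  # survives pruning
print(expand_token(-10.0, -1.0, -2.0, -12.0))  # pruned
```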

During token propagation, candidate words and word boundary information are recorded in a linked list that identifies the path. At the end time T of the speech input, the best-matching word sequence and the corresponding word boundary positions can therefore be extracted by backtracking through the path identification linked list of the token with the best score.

3. Token resource definition of a dictionary tree node

Suppose a dictionary tree node contains M HMM states s_1 … s_M. The token resource definition of the node then contains the following token resource information:

Node token resource: H_s1 H_s2 … H_sM

where H_si (1≤i≤M) is the head of the token linked list for HMM state s_i of the node.
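One possible in-memory layout for the node token resource is sketched below. It is only an assumption-laden illustration: `Token` and `NodeTokenResource` are hypothetical names, and Python lists stand in for the linked lists H_s1 … H_sM described in the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    score: float                    # match score of the path (log domain assumed)
    history: Tuple[str, ...] = ()   # words recognized so far on this path


@dataclass
class NodeTokenResource:
    num_states: int                 # M, the number of HMM states of the node
    heads: List[List[Token]] = field(init=False)

    def __post_init__(self):
        # One token list per HMM state; heads[i - 1] plays the role of H_si.
        self.heads = [[] for _ in range(self.num_states)]

    def add(self, state_index, token):
        """Attach a token to state s_i, using 1-based indices as in the text."""
        self.heads[state_index - 1].append(token)

    def is_active(self):
        """A node is active when at least one of its states holds a token."""
        return any(self.heads)


node = NodeTokenResource(num_states=3)
node.add(2, Token(score=-1.5))
print(node.is_active())
```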

The traditional decoding method places excessive demands on hardware computing power and memory, and its cost-effectiveness is low.

Chinese patent application 02131086.6 discloses a method for compressing the feature vector set of a speech recognition system. In the process of clustering the speech feature vector set to obtain a codebook, it adds a step that dynamically merges and splits subsets according to the number of vectors in each subset and the total distance measure of those vectors. This reduces the sum of the distance measures between the vectors in the clustered set and their corresponding codewords and improves the accuracy of the clustering algorithm. Applying the codebook compressed by that method to a speech recognition system greatly reduces the system's storage requirements while preserving recognition performance. That application also discloses a speech recognition system, whose block diagram is shown in Fig. 2. The system replaces the acoustic model with a feature codebook and a probability table, so Gaussian probabilities need not be computed during decoding; the required probability values are simply looked up in the pre-stored probability table. This greatly reduces the cost of computing Gaussian probabilities in decoding and thus considerably increases the recognition speed of the system.
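The table lookup that replaces Gaussian evaluation might look like the sketch below. The layout is an assumption made for illustration: `prob_table[k][state_id][codeword]` is taken to hold a precomputed log probability for the k-th codeword stream, and the per-stream values are summed as if the streams were independent. Neither the layout nor the combination rule is specified in the text above.

```python
def observation_log_prob(prob_table, state_id, codewords):
    """Look up the log observation probability of one frame for one HMM state.

    prob_table -- prob_table[k][state_id][codeword] = precomputed log probability
    state_id   -- identifier of the HMM state
    codewords  -- the codeword indices of the frame, one per stream
    """
    return sum(prob_table[k][state_id][cw] for k, cw in enumerate(codewords))


# Two streams, one state (id 0), two codewords per stream (hypothetical values).
table = [
    {0: [-1.0, -2.0]},
    {0: [-0.5, -3.0]},
]
print(observation_log_prob(table, 0, (1, 0)))
```

No exponentials or matrix operations are evaluated at decoding time; the cost per state and frame is one table read per stream.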

Summary of the Invention

The object of the invention is to provide an improved fast decoding method for current speaker-independent large-vocabulary continuous speech recognition systems. The method further addresses the problem that current embedded speech recognition technology demands too much hardware computing power and memory relative to the product price the market will accept, so that current speech recognition technology can also be used on embedded hardware platforms such as PDAs, mobile phones and smartphones.

The object of the invention can be achieved by the following measures:

A fast decoding method in a speech recognition system, comprising the following steps:

(1) Initialize the decoding operation unit of the speech recognition system;

(2) Take the feature codeword vector of the next speech frame, in order, from the speech feature codeword sequence of length T that is input to the decoding operation unit, and set it as the speech frame O_t at time t, 1≤t≤T;

(3) Filter the speech frame O_t at time t. If the frame is filtered out, go to step (2); otherwise set the frame O_t as the valid speech frame O_t^V;

(4) Based on the valid speech frame O_t^V, examine each active node in the token resource L_t[I] of every layer I of the dictionary tree token resource L_t at time t. For tokens judged expandable, expand the tokens in the node's token resource table and link each newly generated token into the token resource table of its target node, where I is an index variable, 1≤I≤H, and H is the height of the dictionary tree; otherwise go to step (7);

(5) Process the tokens located at the nodes of the dictionary tree;

(6) Adaptively adjust the pruning-related thresholds according to the maximum local path probability at time t and the maximum local path probability at the time corresponding to the previous valid speech frame;

(7) Repeat steps (2)-(6) to obtain the global path of the token with the best score at the end time T of the input speech, end token expansion, and output the text string generated so far that best matches the acoustic model and the language model as the speech recognition result.

The invention does not cover the expansion of word-node tokens or the related processing algorithms; users can tailor these algorithms to the task domain (for example command-word recognition, Chinese monosyllable recognition, or large-vocabulary continuous speech recognition).

The dictionary tree token resource L_t at time t is the total of the token resources of all active nodes in the dictionary tree at that time. Active nodes at time t are indexed by the layer of the dictionary tree in which they lie: all active nodes in the same layer are chained together into one linked list, every layer of the dictionary tree has such a list, and the whole forms a two-dimensional linked list.

The layer-I token resource L_t[I] of the dictionary tree token resource at time t is the layer-I linked list of the dictionary tree active-node token resource L_t indexed in this way.
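The layer-wise index of active nodes can be pictured with the short sketch below. `ActiveNodeIndex` is a hypothetical name, node identifiers are plain strings, and Python lists replace the linked lists of the text; the point is only that L_t[I] is the list of active nodes found in layer I.

```python
class ActiveNodeIndex:
    """Active nodes at one time step, grouped by dictionary tree layer."""

    def __init__(self, height):
        # One list per layer; layers are numbered 1..H as in the text.
        self.layers = [[] for _ in range(height)]

    def activate(self, node_id, layer):
        """Link a node into the list of its layer if it is not already there."""
        bucket = self.layers[layer - 1]
        if node_id not in bucket:
            bucket.append(node_id)

    def layer(self, i):
        """Return L_t[I], the active nodes of layer I."""
        return self.layers[i - 1]


index = ActiveNodeIndex(height=3)
index.activate("root", 1)
index.activate("a", 2)
index.activate("a", 2)   # already active, so not linked in a second time
print(index.layer(2))
```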

The maximum local path probability at time t is the maximum of all local path scores in the set of local paths corresponding to all tokens newly generated at time t.

The maximum local path probability at the time corresponding to the previous valid speech frame is the maximum of all local path scores in the set of local paths corresponding to all tokens newly generated at that time.

The initialization step (1) further comprises the following steps:

a. Generate a token with a score of zero and link it to the head of the token resource table of the root node of the dictionary tree. At this point the only active node of the dictionary tree is the root node, which lies in the first layer of the tree;

b. Initialize the global pruning threshold L_g to the logarithmic minimum;

c. Initialize the local pruning baseline threshold L_b to the logarithmic minimum;

d. Initialize the pruning width threshold L_w to a positive constant L_w^c, which is preset by the user.
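The four initialization sub-steps can be sketched as follows. The dictionary layout and the name `init_decoder` are hypothetical, and negative infinity is used here as a stand-in for the "logarithmic minimum"; a fixed-point embedded implementation would more likely use a large negative constant.

```python
LOG_MIN = float("-inf")  # assumed representation of the "logarithmic minimum"


def init_decoder(initial_width):
    """Return the initial decoder state for steps a-d."""
    if initial_width <= 0:
        raise ValueError("L_w^c must be a positive constant")
    return {
        "root_tokens": [0.0],             # a: one token with score zero at the root
        "active_layers": {1: ["root"]},   # a: the root is the only active node, layer 1
        "L_g": LOG_MIN,                   # b: global pruning threshold
        "L_b": LOG_MIN,                   # c: local pruning baseline threshold
        "L_w": initial_width,             # d: pruning width threshold
        "L_w_c": initial_width,           # kept for the later update of L_w in step (6)
    }


state = init_decoder(initial_width=50.0)
print(state["L_w"])
```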

The filtering step (3) further comprises the following steps:

3a. If the speech frame O_t at time t is the first speech frame of the user's speech input, set it as a valid speech frame and the filtering operation is complete; otherwise go to step 3b;

3b. Compare the Y feature codeword vectors f_1^t f_2^t … f_Y^t of the speech frame O_t at time t with the Y feature codeword vectors f_1^(t-1) f_2^(t-1) … f_Y^(t-1) of the speech frame O_(t-1) at time t-1 to obtain a similarity measure V;

3c. Compare the similarity measure V with the decision threshold θ. If V≤θ, the speech frame O_t at time t is judged invalid for the decoding operation; otherwise it is judged valid for the decoding operation.

The decision threshold θ is a constant greater than 0 set by the user.
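A possible reading of steps 3a-3c is sketched below. The text does not say how V is computed; since a frame is dropped when V≤θ, this sketch assumes V behaves like a distance and uses the number of codeword positions in which two consecutive frames differ. That choice, and the names `frame_distance` and `filter_frames`, are assumptions for illustration. Frame indices are 0-based in the code.

```python
def frame_distance(curr, prev):
    """Assumed measure V: number of positions where the codewords differ."""
    return sum(1 for a, b in zip(curr, prev) if a != b)


def filter_frames(frames, theta):
    """Return the indices of the frames judged valid for decoding."""
    valid = []
    for t, frame in enumerate(frames):
        # 3a: the first frame is always valid.
        # 3b/3c: later frames are valid only if V > theta, where V compares
        # the frame with its immediate predecessor.
        if t == 0 or frame_distance(frame, frames[t - 1]) > theta:
            valid.append(t)
    return valid


frames = [(1, 2, 3), (1, 2, 3), (1, 2, 4), (5, 6, 7), (5, 6, 7)]
print(filter_frames(frames, theta=1))
```

In this example only the first frame and the frame where all three codewords change survive, so the search would run on two frames instead of five.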

The node token resource expansion step (4) further comprises the following steps:

4a. Based on the valid speech frame O_t^V, perform external expansion on each token in the token resource linked list of the last state of the HMM associated with the current node; that is, expand each such token into the token resource tables of all child nodes of the current node in the dictionary tree;

4b. Take one HMM state of the M-state HMM associated with the current node as the HMM state s_n currently to be processed, where 1≤n≤M;

4c. Take one token from the token resource table of state s_n as the token currently to be processed;

4d. If the score of the current token of state s_n is greater than the global pruning threshold L_g of the time corresponding to the previous valid speech frame, then, following the topology of the HMM associated with the current node, take one state reachable from s_n as the current target state s_m; otherwise go to step 4k;

4e. Compute the score s_m(t) of the token moving from s_n to s_m. The score s_m(t) is the token's current score plus the transition probability from s_n to s_m plus the observation probability of s_m for the current speech frame O_t; the observation probability is obtained by lookup in the probability table input to the decoding operation unit;

4f. Compute the current local pruning threshold L_p = L_b - L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;

4g. If the score of the token moving from s_n to s_m is greater than the current local pruning threshold L_p, generate a new token with score s_m(t); otherwise go to step 4j;

4h. Link the new token generated in step 4g into the token resource table with head H_sm in the node, and check whether the node is already in the active node list of its layer of the dictionary tree; if not, link it in;

4i. Using the new token's score s_m(t), if s_m(t) - L_w > L_b, update the local pruning baseline threshold to L_b = s_m(t);

4j. Take another state reachable from s_n as the current target state s_m and repeat steps 4e-4i until all states reachable from s_n have been processed; then go to step 4k;

4k. Take another token from the token resource table of state s_n as the token currently to be processed and repeat steps 4d-4j until all tokens in the token resource table of s_n have been expanded; then go to step 4l;

4l. Take another HMM state of the M-state HMM associated with the node as the HMM state s_n currently to be processed, where 1≤n≤M, and repeat steps 4c-4k until all token resource expansion operations of the current node are complete.
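Steps 4b-4l amount to three nested loops, which the following sketch makes explicit. It is a simplified illustration rather than the patented implementation: tokens are reduced to bare log-domain scores, the HMM topology is a plain dictionary, observation probabilities are passed in already looked up, and `expand_within_node` is a hypothetical name. The update condition of step 4i is copied from the text as written.

```python
def expand_within_node(state_tokens, transitions, obs_logp, L_g, L_b, L_w):
    """Expand all tokens of one node along its HMM transitions.

    state_tokens -- {state: [token scores from the previous valid frame]}
    transitions  -- {state: [(reachable_state, log_transition_prob), ...]}
    obs_logp     -- {state: log observation prob for the current frame}
    Returns (new_tokens, updated L_b).
    """
    new_tokens = {}
    for s_n, scores in state_tokens.items():              # 4b / 4l: each state
        for score in scores:                               # 4c / 4k: each token
            if score <= L_g:                               # 4d: global pruning
                continue
            for s_m, log_a in transitions.get(s_n, []):    # 4d / 4j: each reachable state
                s_new = score + log_a + obs_logp[s_m]      # 4e: new score s_m(t)
                L_p = L_b - L_w                            # 4f: local pruning threshold
                if s_new > L_p:                            # 4g: keep the path
                    new_tokens.setdefault(s_m, []).append(s_new)  # 4h
                    if s_new - L_w > L_b:                  # 4i: raise the baseline
                        L_b = s_new
    return new_tokens, L_b


tokens, baseline = expand_within_node(
    state_tokens={1: [-5.0, -50.0]},
    transitions={1: [(1, -1.0), (2, -2.0)]},
    obs_logp={1: -1.0, 2: -3.0},
    L_g=-20.0, L_b=float("-inf"), L_w=10.0,
)
print(tokens, baseline)
```

In the usage example the token with score -50.0 falls below L_g and is never expanded, while the token with score -5.0 produces one new token in each reachable state.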

Sub-step 4a of the node token resource expansion step (4) comprises the following steps:

4a-i. If the token's current score is less than or equal to the global pruning threshold L_g of the time corresponding to the previous valid speech frame, the token is not expanded to the child nodes of its node; otherwise go to step 4a-ii;

4a-ii. Take the j-th child node node_j of the J child nodes, in the dictionary tree, of the node holding the current token as the node currently to be processed;

4a-iii. Compute the score s_1(t) of the token arriving at the first state s_1 of node_j. The score s_1(t) is the token's current score plus the transition probability from the last state of the token's node to the first state of node_j plus the observation probability of the first state s_1 of node_j for the current speech frame O_t; the observation probability is obtained by lookup in the probability table input to the decoding operation unit;

4a-iv. Compute the current local pruning threshold L_p = L_b - L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;

4a-v. If the score of the token arriving at the first state s_1 of node_j is greater than the current local pruning threshold L_p, go to step 4a-vi; otherwise go to step 4a-ix;

4a-vi. Generate a new token with score s_1(t);

4a-vii. Link this token into the token resource table with head H_s1 in node_j, and check whether node_j is already in the active node list of its layer of the dictionary tree; if not, link it in;

4a-viii. Using the score s_1(t), if s_1(t) - L_w > L_b, update the local pruning baseline threshold to L_b = s_1(t);

4a-ix. Take another child node of the token's node in the dictionary tree as the current node node_j and repeat steps 4a-iii to 4a-viii until the current token has been expanded to all child nodes of its node in the dictionary tree.
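The external expansion of one token into the child nodes (steps 4a-i to 4a-ix) can be sketched in the same simplified style. Again this is an illustration under assumptions: scores are log-domain floats, each child is identified by a string, and `expand_to_children` is a hypothetical name.

```python
def expand_to_children(token_score, children, obs_logp, L_g, L_b, L_w):
    """Expand one token from the last state of a node into its child nodes.

    children -- [(child_id, log transition prob into the child's first state), ...]
    obs_logp -- {child_id: log observation prob of the child's first state}
    Returns ({child_id: new token score}, updated L_b).
    """
    created = {}
    if token_score <= L_g:                                # 4a-i: globally pruned
        return created, L_b
    for child_id, log_a in children:                      # 4a-ii / 4a-ix: each child
        s1 = token_score + log_a + obs_logp[child_id]     # 4a-iii: score s_1(t)
        L_p = L_b - L_w                                   # 4a-iv: local threshold
        if s1 > L_p:                                      # 4a-v
            created[child_id] = s1                        # 4a-vi / 4a-vii
            if s1 - L_w > L_b:                            # 4a-viii
                L_b = s1
    return created, L_b


created, baseline = expand_to_children(
    token_score=-4.0,
    children=[("b", -1.0), ("c", -30.0)],
    obs_logp={"b": -2.0, "c": -2.0},
    L_g=-100.0, L_b=float("-inf"), L_w=5.0,
)
print(created, baseline)
```

Here the first child raises the baseline to -7.0, so the much poorer path into the second child falls below L_p = -12.0 and is pruned.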

The adaptive pruning step (6), based on the maximum local path probability, comprises the following steps:

6a. From the maximum local path probabilities at the current time t and at the time corresponding to the previous valid speech frame, compute the pruning width threshold adjustment factor L_f:

[equation not reproduced in the source text]

Here the frame count appearing in the formula is the number of all valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at the time corresponding to the previous valid speech frame, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;

6b. Clamp the computed adjustment factor L_f: if $L_{f} > L_{f}^{MAX}$, set L_f to L_f^MAX; if $L_{f} < L_{f}^{MIN}$, set L_f to L_f^MIN, where L_f^MAX is the upper bound and L_f^MIN the lower bound of L_f, both positive constants that can be set by the user;

6c. Using the computed adjustment factor L_f, update the pruning width threshold to $L_{w} = L_{f} L_{w}^{c}$, where L_w^c is the initial pruning width threshold obtained in the initialization step (1);

6d. Update the global pruning threshold for time t to L_g = L_b, in preparation for token expansion on the next valid speech frame;

6e. Reset the local pruning baseline threshold L_b to the logarithmic minimum, in preparation for token expansion on the next valid speech frame.
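Steps 6b-6e can be sketched as a single update function. Because the formula for L_f in step 6a is not available in this text, the sketch takes the raw L_f as an input instead of computing it. `update_pruning` is a hypothetical name, and negative infinity again stands in for the "logarithmic minimum".

```python
LOG_MIN = float("-inf")  # assumed representation of the "logarithmic minimum"


def update_pruning(L_f, L_f_min, L_f_max, L_w_c, L_b):
    """Apply steps 6b-6e and return the new (L_w, L_g, L_b)."""
    L_f = min(max(L_f, L_f_min), L_f_max)   # 6b: clamp the adjustment factor
    L_w = L_f * L_w_c                       # 6c: new pruning width threshold
    L_g = L_b                               # 6d: best local score becomes the global threshold
    L_b = LOG_MIN                           # 6e: reset the baseline for the next valid frame
    return L_w, L_g, L_b


print(update_pruning(L_f=3.0, L_f_min=0.5, L_f_max=2.0, L_w_c=10.0, L_b=-7.0))
```

Clamping keeps the beam width between L_f^MIN·L_w^c and L_f^MAX·L_w^c, so a single unusual frame cannot collapse or explode the search.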

Compared with the prior art, the invention has the following advantages:

Compared with the traditional decoding method, the invention includes two improvements: an adaptive pruning strategy based on the maximum local path probability, and a speech frame filtering strategy based on feature codeword vectors.

The decoding operation of the invention uses a pruned breadth-first search framework based on a dictionary tree and token expansion. The computational complexity of the algorithm is O(MT), where T is the number of speech frames that enter the search computation and M is the average number of active paths per speech frame during the search.

The traditional decoding operation runs the search computation on every speech frame of the user's speech input. In fact, the user's speech input is a time-varying signal that is locally stationary, so in the stationary segments some speech frames that are similar to their neighbors can be dropped; these frames do not enter the search computation, and the accuracy of decoding is not affected. The invention therefore provides a speech frame filtering strategy based on the feature codeword vectors of the speech frames. The strategy effectively removes frames that are redundant for the search, so that the number of frames actually entering the search is smaller than the number of frames in the user's speech input. As the O(MT) expression shows, this speeds up decoding compared with the traditional method.

On the other hand, the O(MT) expression also shows that search speed depends on the average number of active paths per speech frame: for a fixed T, the larger M is, the greater the search cost, and the smaller M is, the smaller the search cost. The size of M depends on the pruning strategy used by the search. The invention therefore provides an adaptive pruning strategy based on the maximum local path probability which, compared with the traditional method, effectively reduces M without noticeably affecting recognition accuracy, further speeding up decoding.
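As an illustration of how the two strategies combine under the O(MT) cost model, suppose that frame filtering reduces the number of frames entering the search from T to T' and that adaptive pruning reduces the average number of active paths from M to M'. The numbers below are hypothetical and chosen only to show the arithmetic; they are not measurements from the invention.

$$\frac{M'T'}{MT} = \frac{M'}{M}\cdot\frac{T'}{T}, \qquad \text{e.g. } \frac{M'}{M} = 0.75,\ \frac{T'}{T} = 0.7 \ \Rightarrow\ \frac{M'T'}{MT} = 0.525$$

Under these assumed ratios the search would cost roughly half of the unfiltered, unpruned baseline, because the two savings multiply.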

Brief Description of the Drawings

Fig. 1 is a structural block diagram of a known speech recognition system

Fig. 2 is a structural block diagram of another known speech recognition system

Fig. 3 is a flowchart of the decoding operation of the invention

Fig. 4 is a diagram of a known dictionary tree structure

Fig. 5 is a schematic diagram of the HMM topology of a known triphone

具体的实施方式specific implementation

图3给出本发明解码运算流程图。由图2、结合图3，基于本发明的解码运算子系统的语音识别系统的运行流程为：将输入的语音模拟信号变换为数字信号；对该数字信号进行分帧处理，提取每一帧语音的特征参数，每一语音帧对应一个特征矢量，得到输入语音的特征矢量序列；利用特征码本对所述特征矢量序列进行量化编码，每一语音帧对应一个特征码字矢量，得到相应的特征码字矢量序列；将语音特征码字矢量序列输入解码运算子系统的语音帧过滤单元，做语音帧过滤操作，从语音特征码字矢量序列中去掉无效语音帧对应的特征码字矢量，得到有效语音特征码字矢量序列；对所述有效语音特征码字矢量序列做搜索计算得到识别结果，在搜索计算过程中，采用基于局部路径最大概率的自适应剪枝策略作局部搜索路径剪枝，对有效语音特征码字矢量序列的各个码字，从概率表中直接查到其在(局部)搜索路径上的观察概率。Fig. 3 shows the flow chart of the decoding operation of the present invention. By Fig. 2, in conjunction with Fig. 3, the operation process of the speech recognition system based on the decoding operation subsystem of the present invention is: the speech analog signal of input is transformed into digital signal; This digital signal is carried out sub-frame processing, extracts each frame of speech The feature parameters of each speech frame correspond to a feature vector to obtain the feature vector sequence of the input speech; use the feature codebook to quantize and encode the feature vector sequence, and each speech frame corresponds to a feature codeword vector to obtain the corresponding feature Codeword vector sequence; the speech feature codeword vector sequence is input into the speech frame filtering unit of the decoding operation subsystem, and the speech frame filtering operation is performed, and the feature codeword vector corresponding to the invalid speech frame is removed from the speech feature codeword vector sequence to obtain an effective Speech feature codeword vector sequence; search and calculate the recognition result to the effective speech feature codeword vector sequence, in the search calculation process, adopt the adaptive pruning strategy based on the local path maximum probability to do local search path pruning, to For each codeword of the effective speech feature codeword vector sequence, its observation probability on the (local) search path can be directly found from the probability table.

Definition of the dictionary tree active-node token resource and its indexing scheme

In the present invention, a node of the dictionary tree that holds an active token at a given time t is called an active node at time t. The active nodes at time t are indexed by the layer they occupy in the dictionary tree: all active nodes in the same layer are chained together into one linked list, each layer of the dictionary tree has one such list, and the structure as a whole is a two-dimensional linked list.

At any time t, the token resources of all active nodes in the dictionary tree taken together are called the dictionary tree active-node token resource at time t; it defines the tokens awaiting expansion at time t. For convenience in what follows, the active-node token resource at time t, indexed as above, is denoted L_t, with index variable I (1≤I≤H), where H is the height of the dictionary tree.
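
The layer-indexed structure described above might be sketched as follows. This is an illustration only: Python lists stand in for the linked lists, and the class and attribute names (`TreeNode`, `ActiveNodeIndex`, `activate`) are hypothetical.

```python
class TreeNode:
    """A dictionary tree node with its own token resource table."""

    def __init__(self, name, layer):
        self.name = name
        self.layer = layer      # 1-based layer of the node in the dictionary tree
        self.tokens = []        # token resource table of this node
        self.active = False     # True once the node is in its layer's active list


class ActiveNodeIndex:
    """Active nodes grouped by layer: layers[I] is the list for layer I, 1 <= I <= H."""

    def __init__(self, height):
        self.height = height
        # Index 0 is unused so that layer numbers can be used directly.
        self.layers = [[] for _ in range(height + 1)]

    def activate(self, node):
        """Link the node into the active list of its layer unless it is already there."""
        if not node.active:
            node.active = True
            self.layers[node.layer].append(node)

    def nodes_in_layer(self, layer):
        """Return a copy of the active nodes of one layer."""
        return list(self.layers[layer])


index = ActiveNodeIndex(height=3)
root = TreeNode("root", layer=1)
child = TreeNode("b", layer=2)
index.activate(root)
index.activate(root)    # second call is a no-op
index.activate(child)
```

Iterating `index.layers[1]` through `index.layers[H]` visits every node that holds tokens awaiting expansion, which is the traversal that step 4 of the method relies on.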

Based on the basic principles of the decoding operation and the related concepts given above, a specific embodiment of the fast decoding method of the present invention is described below.

The decoding method of the present invention comprises the following steps:

1. Initialize the decoding operation unit of the speech recognition system;

2. Take the feature codeword vector of the first speech frame from the speech feature codeword vector sequence of length T that is input to the decoder, and set it as the speech frame O_t at time t (t=1);

3. Filter the speech frame O_t at time t;

4. If O_t is a valid speech frame, then for every active node in the token resource L_t[I] of every layer I (1≤I≤H) of the dictionary tree token resource L_t at time t, expand the tokens in that node's token resource table and link each newly generated token into the token resource table of its target node; otherwise go to step 7;

5. Process the tokens located at word nodes. The present invention does not cover the expansion of word-node tokens or the associated processing algorithms; the user may tailor these to the task domain (for example, command word recognition, Chinese monosyllable recognition, or large-vocabulary continuous speech recognition);

6. Based on the maximum local path probability at time t and the maximum local path probability at the time of the previous valid speech frame, adaptively adjust the pruning-related thresholds, namely the global pruning threshold L_g, the local pruning baseline threshold L_b and the pruning width threshold L_w;

7. Take the next speech frame from the speech feature codeword vector sequence of length T that is input to the decoder; if one is available, set it as the speech frame O_t at time t (t≤T) and go to step 3, otherwise go to step 8;

8. End token expansion and produce the recognition result: by backtracking the global path of the token with the best score at time T, output the text string that best matches the acoustic model and the language model.
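
The control flow of steps 1 to 8 can be outlined as a loop over frames. The sketch below only shows the ordering of the steps; the callables `is_valid`, `expand_tokens`, `process_word_tokens`, `adapt_thresholds` and `backtrace` are hypothetical placeholders for steps 3, 4, 5, 6 and 8, and the decoder state is reduced to a dictionary.

```python
def decode(frames, is_valid, expand_tokens, process_word_tokens,
           adapt_thresholds, backtrace):
    """Drive the decoding steps in order; all callables are supplied by the caller."""
    state = {"valid_frames": 0}                    # step 1 (reduced to a counter here)
    previous = None
    for t, frame in enumerate(frames, start=1):    # steps 2 and 7: next frame, time t
        if is_valid(frame, previous):              # step 3: frame filtering
            state["valid_frames"] += 1
            expand_tokens(state, frame, t)         # step 4: token expansion
            process_word_tokens(state, t)          # step 5: word-node tokens
            adapt_thresholds(state, t)             # step 6: threshold adaptation
        previous = frame                           # the filter compares with frame t-1
    return backtrace(state)                        # step 8: produce the result


# Toy run: a frame is "valid" when it differs from the frame before it.
trace = []
result = decode(
    frames=[1, 1, 2, 2, 3],
    is_valid=lambda frame, previous: previous is None or frame != previous,
    expand_tokens=lambda state, frame, t: trace.append(t),
    process_word_tokens=lambda state, t: None,
    adapt_thresholds=lambda state, t: None,
    backtrace=lambda state: state["valid_frames"],
)
```

In this toy run the expansion step is reached only at times 1, 3 and 5; invalid frames skip steps 4 to 6 entirely, which is where the filtering saves work.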

The sub-steps of step 4 of the above decoding method, in which the token resource of the current node is expanded, are as follows:

4a. For each token in the token resource list of the last state of the HMM associated with the node, expand the token into the token resource tables of all child nodes of that node in the dictionary tree;

4b. Take the first state of the M-state HMM associated with the node as the current HMM state s_n to be processed (n=1);

4c. Take one token from the token resource table of state s_n as the current token to be processed;

4d. If the current score of the token is greater than the global pruning threshold L_g of the time of the previous valid speech frame, then, following the topology of the HMM associated with the current node, take a state reachable from state s_n and set it as the current state s_m to be processed; otherwise go to step 4k;

4e. Compute the score s_m(t) of the token on reaching state s_m as the current score of the token plus the transition probability from state s_n to state s_m plus the observation probability of state s_m for the current speech frame O_t; the observation probability is obtained by lookup in the probability table input to the decoding operation unit;

4f. Compute the current local pruning threshold L_p as L_p=L_b-L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;

4g. If the score of the token on reaching state s_m is less than or equal to the current local pruning threshold L_p, go to step 4j; otherwise generate a new token and set its score to s_m(t);

4h. Link the new token into the token resource table of the node whose list head is H_sm, and check whether the node is already in the active node list of its layer of the dictionary tree; if not, link it in;

4i. Update the local pruning baseline threshold L_b from the score s_m(t): if s_m(t)-L_w>L_b, set L_b=s_m(t); otherwise leave L_b unchanged;

4j. Take another state s_m reachable from state s_n; if there is one, set it as the current state s_m to be processed and go to step 4e, otherwise go to step 4k;

4k. Take another token from the token resource table of state s_n; if there is one, set it as the current token to be processed and go to step 4d, otherwise go to step 4l;

4l. Take the next state of the M-state HMM associated with the node; if there is one, set it as the current HMM state s_n to be processed (n≤M) and go to step 4c, otherwise the token resource expansion of the current node is complete.
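
The within-node loop of steps 4b to 4l can be sketched as three nested loops over states, tokens and reachable states. This is a simplified illustration under stated assumptions: scores are log probabilities, a token is represented by its score alone, the HMM is given as dictionaries, the local pruning threshold follows L_p=L_b-L_w as in step 4a-4, and the bookkeeping of the active node list (step 4h) is left out. The function name and argument layout are hypothetical.

```python
NEG_INF = float("-inf")   # stands in for the "logarithmic minimum value"

def expand_within_node(tokens, transitions, obs_log_prob, L_g, L_b, L_w):
    """Expand the tokens of one node for the current valid frame.

    tokens:       {s_n: [score, ...]} left by the previous valid frame
    transitions:  {s_n: {s_m: log transition probability}}
    obs_log_prob: {s_m: log observation probability for the current frame}
    Returns (new_tokens, updated L_b).
    """
    new_tokens = {}
    for s_n, scores in tokens.items():                        # 4b / 4l: each HMM state
        for score in scores:                                  # 4c / 4k: each token
            if score <= L_g:                                  # 4d: global pruning
                continue
            for s_m, log_a in transitions.get(s_n, {}).items():   # 4d / 4j
                s_m_t = score + log_a + obs_log_prob[s_m]     # 4e: new score
                L_p = L_b - L_w                               # local pruning threshold
                if s_m_t <= L_p:                              # 4g: local pruning
                    continue
                new_tokens.setdefault(s_m, []).append(s_m_t)  # 4g / 4h: keep new token
                if s_m_t - L_w > L_b:                         # 4i: rule as stated in the text
                    L_b = s_m_t
    return new_tokens, L_b


new_tokens, L_b = expand_within_node(
    tokens={1: [-1.0], 2: [-50.0]},
    transitions={1: {1: -0.5, 2: -1.0}, 2: {2: -0.3, 3: -1.5}},
    obs_log_prob={1: -2.0, 2: -1.0, 3: -4.0},
    L_g=-20.0, L_b=NEG_INF, L_w=10.0,
)
```

In the toy call the token in state 2 (score -50) falls below L_g and is never expanded, while the token in state 1 produces one new token in each of states 1 and 2.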

Sub-step 4a of step 4 of the above decoding method, which expands the current token to all child nodes of the node holding it, comprises the following steps:

4a-1. If the score of the current token is less than or equal to the global pruning threshold L_g of the time of the previous valid speech frame, the current token need not be expanded to the child nodes of its node; otherwise go to step 4a-2;

4a-2. Take one child node of the node holding the current token as the current node node_j to be processed (j=1);

4a-3. Compute the score s₁(t) of the token on reaching the first state s₁ of node_j as the current score of the token plus the transition probability from the last state of the token's node to the first state of node_j plus the observation probability of the first state s₁ of node_j for the current speech frame O_t; the observation probability is obtained by lookup in the probability table input to the decoding operation unit;

4a-4. Compute the current local pruning threshold L_p as L_p=L_b-L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;

4a-5. If the score of the token on reaching the first state s₁ of node_j is less than or equal to the current local pruning threshold L_p, go to step 4a-9; otherwise go to step 4a-6;

4a-6. Generate a new token and set its score to s₁(t);

4a-7. Link the new token into the token resource table of node_j whose list head is H_s1, and check whether node_j is already in the active node list of its layer of the dictionary tree; if not, link it in;

4a-8. Update the local pruning baseline threshold L_b from the score s₁(t): if s₁(t)-L_w>L_b, set L_b=s₁(t); otherwise leave L_b unchanged;

4a-9. Take another child node of the node holding the current token; if there is one, set it as the current node node_j to be processed (j≤N) and go to step 4a-3, where N is the number of child nodes of the token's node in the dictionary tree; otherwise the expansion of the current token to all child nodes of its node is complete.
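
Steps 4a-1 to 4a-9 follow the same pruning pattern across node boundaries. The sketch below is an illustration only: scores are assumed to be log probabilities, each child node is a plain dictionary with hypothetical keys (`name`, `entry_log_prob`, `obs_log_prob`, `tokens`), and the returned list of names stands in for linking nodes into the active node list.

```python
NEG_INF = float("-inf")

def expand_to_children(exit_scores, children, L_g, L_b, L_w):
    """Expand tokens from a node's last HMM state into the first state of each child.

    exit_scores: scores of the tokens held in the last state of the current node
    children:    list of dicts with keys "name", "entry_log_prob" (log transition
                 probability into the child's first state), "obs_log_prob" (log
                 observation probability of that state for the current frame) and
                 "tokens" (the child's token table for its first state)
    Returns (names of children that received a token, updated L_b).
    """
    activated = []
    for score in exit_scores:
        if score <= L_g:                                   # 4a-1: global pruning
            continue
        for child in children:                             # 4a-2 / 4a-9
            s1_t = score + child["entry_log_prob"] + child["obs_log_prob"]   # 4a-3
            L_p = L_b - L_w                                # 4a-4
            if s1_t <= L_p:                                # 4a-5: local pruning
                continue
            child["tokens"].append(s1_t)                   # 4a-6 / 4a-7
            if child["name"] not in activated:
                activated.append(child["name"])
            if s1_t - L_w > L_b:                           # 4a-8
                L_b = s1_t
    return activated, L_b


children = [
    {"name": "a", "entry_log_prob": -1.0, "obs_log_prob": -1.0, "tokens": []},
    {"name": "b", "entry_log_prob": -1.0, "obs_log_prob": -30.0, "tokens": []},
]
activated, L_b = expand_to_children([-2.0], children, L_g=-100.0, L_b=NEG_INF, L_w=10.0)
```

Here child "a" receives a token with score -4.0, which raises L_b; child "b" would score -33.0, which is below L_p = -4.0 - 10.0, so no token is created for it.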

Step 1 of the above decoding method, initializing the decoding operation unit, comprises the following steps:

a. Generate a token with a score of zero and link it to the head of the token resource table of the root node of the dictionary tree; at this point the only active node of the dictionary tree is the root node, which lies in the first layer of the tree;

b. Initialize the global pruning threshold L_g to the logarithmic minimum value;

c. Initialize the local pruning baseline threshold L_b to the logarithmic minimum value;

d. Initialize the pruning width threshold L_w to a positive constant L_w^c, which can be read from the user-defined decoder configuration file.
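
A minimal sketch of this initialization is shown below. It assumes that the "logarithmic minimum value" can be represented by negative infinity, that the configuration file has already been parsed into a dictionary with a hypothetical key `pruning_width`, and that L_b is initialized in the same way as L_g, as claim 5 states. The `DecoderState` container is hypothetical.

```python
from dataclasses import dataclass, field

NEG_INF = float("-inf")   # assumed stand-in for the "logarithmic minimum value"

@dataclass
class DecoderState:
    L_g: float                      # global pruning threshold
    L_b: float                      # local pruning baseline threshold
    L_w: float                      # current pruning width threshold
    L_w_c: float                    # configured constant L_w^c
    root_tokens: list = field(default_factory=list)
    active_layers: dict = field(default_factory=dict)

def init_decoder(config):
    """Build the initial decoder state from an already-parsed configuration."""
    L_w_c = float(config["pruning_width"])
    if L_w_c <= 0:
        raise ValueError("the pruning width L_w^c must be a positive constant")
    state = DecoderState(L_g=NEG_INF, L_b=NEG_INF, L_w=L_w_c, L_w_c=L_w_c)
    state.root_tokens.append(0.0)        # one token with score zero at the root
    state.active_layers[1] = ["root"]    # the root is the only active node, in layer 1
    return state

state = init_decoder({"pruning_width": 200})
```

With both thresholds at negative infinity, no token can be pruned on the first valid frame, since every finite score exceeds them.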

Step 3 of the above decoding method, filtering the current speech frame, comprises the following steps:

3a. If the speech frame O_t at the current time t is the first speech frame of the user's speech input, mark it as a valid speech frame and the filtering operation is complete; otherwise go to step 3b;

3b. Compare the feature codeword vector f₁^t f₂^t … f_Y^t of the speech frame O_t at the current time t with the feature codeword vector f₁^(t-1) f₂^(t-1) … f_Y^(t-1) of the speech frame O_(t-1) at time t-1 to obtain a similarity measure V, where Y is the number of feature codewords contained in the feature codeword vector of a speech frame. V is computed as:

$V = \sum_{i = 1}^{Y} C (f_{i}^{t}, f_{i}^{t - 1}),$ where the term $C(f_{i}^{t}, f_{i}^{t - 1})$ is defined by a formula that is not reproduced in this text;

3c. Compare the similarity measure V with the decision threshold θ (a positive constant set by the user, which can be read from the user-defined decoder configuration file); if V≤θ, the speech frame O_t is judged invalid for the decoding operation, otherwise it is judged valid.
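
The filtering rule of steps 3a to 3c can be sketched as follows. The definition of C is not available in this text, so the sketch assumes, purely for illustration, that C is 1 when the two codewords differ and 0 when they are equal; under that assumption V counts the changed codewords, and V≤θ means the frame is nearly identical to the previous one. The function names are hypothetical.

```python
def similarity(current, previous):
    """V = sum over i of C(f_i^t, f_i^(t-1)).

    Assumption (not from the source): C is 1 if the codewords differ, else 0.
    """
    return sum(1 for a, b in zip(current, previous) if a != b)

def filter_frames(frames, theta):
    """Return the valid frames; each frame is a tuple of Y codeword indices."""
    valid = []
    previous = None
    for frame in frames:
        # 3a: the first frame is always valid; 3b/3c: otherwise compare V with theta.
        if previous is None or similarity(frame, previous) > theta:
            valid.append(frame)
        previous = frame          # the comparison is always against frame t-1
    return valid

frames = [(3, 7, 1), (3, 7, 1), (3, 8, 1), (5, 2, 9)]
valid_frames = filter_frames(frames, theta=1)
```

With θ=1, the second frame (V=0) and the third frame (V=1) are dropped, and the fourth frame (V=3) is kept.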

Experimental results show that the above speech frame filtering removes 20% to 30% of the frames in the speech feature codeword vector sequence as invalid, which speeds up the decoding operation. Recognition performance is also less sensitive to the user's speaking rate than with traditional methods: for slow speakers the filter removes more invalid frames, and for fast speakers it removes fewer, so the filtering normalizes the speaking rates of different users to some extent.

Step 6 of the above decoding method further comprises the following steps:

From the first five steps of the decoder it follows that the current global pruning threshold L_g is the maximum local path probability at the time of the previous valid speech frame, and the current local pruning baseline threshold L_b is the maximum local path probability at time t. Step 6 of the decoder is accordingly carried out in the following sub-steps:

6a. From the maximum local path probabilities at time t and at the time of the previous valid speech frame, compute the pruning width threshold adjustment factor L_f as:

$L_{f} = \frac{(L_{b} - L_{g}) \tilde{t}}{L_{b}},$

where $\tilde{t}$ is the number of valid speech frames up to time t;

6b. Clamp the computed adjustment factor L_f: if $L_{f} > L_{f}^{MAX}$ then $L_{f} = L_{f}^{MAX}$; if $L_{f} < L_{f}^{MIN}$ then $L_{f} = L_{f}^{MIN}$, where L_f^MAX is the upper bound of the adjustment factor L_f (for example 1.05) and L_f^MIN is its lower bound (for example 0.5); both are positive constants that can be read from the user-defined decoder configuration file;

6c. Using the computed adjustment factor L_f, update the pruning width threshold L_w as: $L_{w} = L_{f} L_{w}^{c};$

6d. Update the global pruning threshold L_g for time t: L_g=L_b, in preparation for token expansion on the next valid speech frame;
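
The threshold update of steps 6a to 6d can be sketched as a small function. It uses the adjustment-factor formula given in claim 10, L_f = (L_b - L_g)·t̃ / L_b, and the example bounds 0.5 and 1.05 from step 6b. The fallback to L_f = 1.0 when L_b is zero or not finite is an assumption added here to keep the arithmetic well defined; the source does not discuss that case. The function name is hypothetical.

```python
import math

def adapt_thresholds(L_g, L_b, valid_frames, L_w_c, L_f_min=0.5, L_f_max=1.05):
    """Return the updated (L_g, L_w) after a valid frame has been searched.

    L_g: maximum local path score at the previous valid frame
    L_b: maximum local path score at the current frame
    valid_frames: number of valid frames up to the current time
    L_w_c: configured pruning width constant L_w^c
    """
    if math.isfinite(L_b) and L_b != 0.0:
        L_f = (L_b - L_g) * valid_frames / L_b       # 6a: adjustment factor
    else:
        L_f = 1.0                                    # assumption: leave the width unchanged
    L_f = min(max(L_f, L_f_min), L_f_max)            # 6b: clamp to [L_f_min, L_f_max]
    L_w = L_f * L_w_c                                # 6c: new pruning width
    L_g = L_b                                        # 6d: threshold for the next valid frame
    return L_g, L_w

# Tenth valid frame: best score moved from -100 to -110.
L_g_next, L_w_next = adapt_thresholds(L_g=-100.0, L_b=-110.0, valid_frames=10, L_w_c=200.0)

# First valid frame: L_g is still the logarithmic minimum, so L_f is clamped to 0.5.
L_g_first, L_w_first = adapt_thresholds(L_g=float("-inf"), L_b=-3.5, valid_frames=1, L_w_c=200.0)
```

In the first call L_f = (-10 × 10) / (-110) ≈ 0.909, so the beam narrows from 200 to about 181.8; the clamp keeps a single unusual frame from widening or narrowing the beam beyond the configured bounds.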

In traditional search algorithms the pruning width threshold L_w is fixed. In the present invention, after the search computation for the current valid speech frame, L_w is adjusted adaptively according to the maximum local path probability, so that local paths are pruned adaptively when the next valid speech frame is searched. Experimental results show that, without affecting recognition accuracy, the method reduces the average number of tokens M during decoding by 10% to 20%, which further speeds up the decoding operation.

Claims

1. A fast decoding method in a speech recognition system, comprising the steps of:

(1) initializing the decoding operation unit of the speech recognition system;

(2) taking, in sequence, the feature codeword vector of the next speech frame from the speech feature codeword sequence of length T input to the decoding operation unit, and setting it as the speech frame O_t at time t, 1≤t≤T;

(3) filtering the speech frame O_t at time t; if the speech frame is filtered out, performing step (2), otherwise setting the speech frame as the current valid speech frame O_t^V;

(4) based on the valid speech frame O_t^V, examining each active node in the token resource L_t[I] of each layer I of the dictionary tree token resource L_t at time t to determine whether it is expandable; if so, expanding the tokens in the token resource table of the node and linking each newly generated token into the token resource table of its target node, where I is an index variable, 1≤I≤H, and H is the height of the dictionary tree; otherwise performing step (7);

(5) processing the tokens located at dictionary tree nodes;

(6) adaptively adjusting the pruning-related thresholds according to the maximum local path probability at time t and the maximum local path probability at the time of the previous valid speech frame;

(7) repeating steps (2) to (6) until the global path of the token with the best score at the end time T of the input speech is obtained, ending token expansion, and outputting the text string that best matches the acoustic model and the language model as the speech recognition result.

2. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the dictionary tree token resource L_t at time t is the sum of the token resources of all active nodes in the dictionary tree at that time.

3. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the maximum local path probability at time t is the maximum of the local path scores over the set of local paths corresponding to all tokens newly generated at time t.

4. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the maximum local path probability at the time of the previous valid speech frame is the maximum of the local path scores over the set of local paths corresponding to all tokens newly generated at the time of the previous valid speech frame.

5. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the initialization step (1) further comprises the following steps:

a. generating a token with a score of zero and linking it to the head of the token resource table of the root node of the dictionary tree, the only active node of the dictionary tree at this point being the root node, which lies in the first layer of the tree;

b. initializing the global pruning threshold L_g to the logarithmic minimum value;

c. initializing the local pruning baseline threshold L_b to the logarithmic minimum value;

d. initializing the pruning width threshold L_w to a positive constant L_w^c, where L_w^c is preset by the user.

6. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the filtering step (3) further comprises the following steps:

3a. if the speech frame O_t at time t is the first speech frame of the user's speech input, marking it as a valid speech frame, whereupon the filtering operation is complete; otherwise performing step 3b;

3b. comparing the Y feature codewords f₁^t f₂^t … f_Y^t of the speech frame O_t at time t with the Y feature codewords f₁^(t-1) f₂^(t-1) … f_Y^(t-1) of the speech frame O_(t-1) at time t-1 to obtain a similarity measure V;

3c. comparing the similarity measure V with the decision threshold θ; if V≤θ, judging the speech frame O_t at time t to be invalid for the decoding operation, otherwise judging the speech frame O_t at time t to be valid for the decoding operation.

7. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the decision threshold θ is a constant greater than 0 set by the user.

8. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the node token resource expansion step (4) further comprises the following steps:

4a. based on the valid speech frame O_t^V, expanding outward each token in the token resource list of the last state of the HMM associated with the current node, that is, expanding each such token into the token resource tables of all child nodes of the node in the dictionary tree;

4b. taking one state of the M-state HMM associated with the current node as the current HMM state s_n to be processed, where 1≤n≤M;

4c. taking one token from the token resource table of state s_n as the current token to be processed;

4d. if the score of the current token in state s_n is greater than the global pruning threshold L_g of the time of the previous valid speech frame, then, following the topology of the HMM associated with the current node, taking a state reachable from state s_n and setting it as the current state s_m to be processed; otherwise going to step 4k;

4e. computing the score s_m(t) of the token on moving from s_n to state s_m, the score s_m(t) being the current score of the token plus the transition probability from state s_n to state s_m plus the observation probability of state s_m for the current speech frame O_t;

4f. computing the current local pruning threshold L_p as L_p=L_b-L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;

4g. if the score of the token on moving from s_n to state s_m is greater than the current local pruning threshold L_p, generating a new token and setting its score to s_m(t); otherwise performing step 4j;

4h. linking the new token generated in step 4g into the token resource table of the node whose list head is H_sm, and checking whether the node is already in the active node list of its layer of the dictionary tree; if not, linking it in;

4i. if the score s_m(t) of the new token satisfies s_m(t)-L_w>L_b, updating the local pruning baseline threshold L_b to L_b=s_m(t);

4j. taking another state reachable from state s_n, setting it as the current state s_m to be processed, and repeating steps 4e to 4i until all states reachable from state s_n have been processed; then going to step 4k;

4k. taking another token from the token resource table of state s_n as the current token to be processed, and repeating steps 4d to 4j until all tokens in the token resource table of state s_n have been expanded; then going to step 4l;

4l. taking another state of the M-state HMM associated with the node as the current HMM state s_n to be processed, where 1≤n≤M, and repeating steps 4c to 4k until the token resource expansion of the current node is complete.

9. The fast decoding method in a speech recognition system as claimed in claim 8, characterized in that sub-step 4a of the node token resource expansion step (4) comprises the following steps:

4a-i) if the current score of the token is less than or equal to the global pruning threshold L_g of the time of the previous valid speech frame, the current token need not be expanded to the child nodes of its node; otherwise performing step 4a-ii;

4a-ii) taking the j-th of the J child nodes of the node holding the current token as the current node node_j to be processed;

4a-iii) computing the score s₁(t) of the token on reaching the first state s₁ of node_j, the score s₁(t) being the current score of the token plus the transition probability from the last state of the token's node to the first state of node_j plus the observation probability of the first state s₁ of node_j for the current speech frame O_t;

4a-iv) computing the current local pruning threshold L_p as L_p=L_b-L_w, where L_b is the current local pruning baseline threshold and L_w is the current pruning width threshold;

4a-v) if the score of the token on reaching the first state s₁ of node_j is greater than the current local pruning threshold L_p, performing step 4a-vi; otherwise performing step 4a-ix;

4a-vi) generating a new token and setting its score to s₁(t);

4a-vii) linking the new token into the token resource table of node_j whose list head is H_s1, and checking whether node_j is already in the active node list of its layer of the dictionary tree; if not, linking it in;

4a-viii) if the score s₁(t) satisfies s₁(t)-L_w>L_b, updating the local pruning baseline threshold L_b to L_b=s₁(t);

4a-ix) taking another child node of the node holding the current token as the current node node_j to be processed, and repeating steps 4a-iii to 4a-viii until the current token has been expanded to all child nodes of its node in the dictionary tree.

10. The fast decoding method in a speech recognition system as claimed in claim 1, characterized in that the adaptive pruning step (6) based on the maximum local path probability comprises the following steps:

6a. from the maximum local path probabilities at the current time t and at the time of the previous valid speech frame, computing the pruning width threshold adjustment factor L_f as:

$L_{f} = \frac{(L_{b} - L_{g}) \tilde{t}}{L_{b}},$

where $\tilde{t}$ is the number of valid speech frames up to time t, the current global pruning threshold L_g is the maximum local path probability at the time of the previous valid speech frame, and the current local pruning baseline threshold L_b is the maximum local path probability at time t;

6b. clamping the computed adjustment factor L_f: if $L_{f} > L_{f}^{MAX}$, setting L_f to L_f^MAX; if $L_{f} < L_{f}^{MIN}$, setting L_f to L_f^MIN, where L_f^MAX is the upper bound of the adjustment factor L_f and L_f^MIN is its lower bound, both being positive constants that can be read from the user-defined decoder configuration file;

6c. using the computed adjustment factor L_f, updating the pruning width threshold L_w as:

$L_{w} = L_{f} L_{w}^{c},$

where L_w^c is the initial pruning width threshold obtained in the initialization step (1);

6d. updating the global pruning threshold L_g for time t: L_g=L_b, in preparation for token expansion on the next valid speech frame;

6e. resetting the local pruning baseline threshold L_b to the logarithmic minimum value, in preparation for token expansion on the next valid speech frame.