CN106875936A - Voice recognition method and device - Google Patents
- Publication number: CN106875936A
- Application number: CN201710254628.8A
- Authority
- CN
- China
- Prior art keywords
- pronunciation
- speech signal
- probability
- signal frame
- mapped
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Telephonic Communication Services (AREA)
Abstract
An embodiment of the present invention provides a speech recognition method and device. The method comprises: acquiring a feature classification result of a speech signal to be recognized, the result containing the pronunciations that describe the pronunciation features of each speech signal frame and the probability that each frame maps to the corresponding pronunciation; filtering the pronunciations contained in the feature classification result based on those probabilities; and recognizing the speech signal based on the filtered feature classification result. With this embodiment, recognition operations related to the filtered-out pronunciations no longer need to be performed during recognition (for example, paths related to those pronunciations need not be searched in the recognition network), so the time consumed by the speech recognition process is effectively reduced and recognition efficiency is improved.
Description
Technical Field

The present invention relates to the field of computer technology, and in particular to a speech recognition method and device.

Background Art

With the development of computer technology, automatic speech recognition (ASR) technology is applied more and more widely in human-computer interaction and other fields. Current speech recognition technology converts the speech signal to be recognized into text mainly through a signal processing module, a feature extraction module, an acoustic model, a language model (LM), a dictionary, and a decoder.

During speech recognition, the signal processing and feature extraction modules first divide the speech signal to be recognized into multiple speech signal frames, enhance each frame through processing such as noise elimination and channel-distortion compensation, convert each frame from the time domain to the frequency domain, and extract suitable acoustic features from the converted frames. An acoustic model, trained on the feature parameters of a training speech corpus, takes the extracted acoustic features as input, maps them to pronunciations that describe the pronunciation features of the speech signal frames, and computes the probability that each frame maps to each pronunciation, yielding a feature classification result.

The language model contains the associations between different lexical items (characters, words, phrases) and their probabilities, and is used to estimate the likelihood of the various texts those items can form. The decoder builds a recognition network from the trained acoustic model, the language model, and the dictionary; each path in the network corresponds to a particular text and its pronunciation. Given the pronunciations output by the acoustic model, the decoder searches the network for the best path, that is, the path along which the text corresponding to the speech signal can be output with maximum probability, thereby completing recognition.

However, a language model is generally trained on a large corpus and encodes the associations and likelihoods among a huge number of words, so a recognition network built from the language model contains many nodes, each with a very large number of branches. During path search in the recognition network, the number of nodes touched by the pronunciations of each speech signal frame grows exponentially, producing an enormous search space; the search therefore takes considerable time and reduces recognition efficiency.
Summary of the Invention

In view of this, the present invention provides a speech recognition method and device to solve the problems of a time-consuming recognition process and low recognition efficiency.

According to a first aspect of the present invention, a speech recognition method is provided, comprising the steps of:

acquiring a feature classification result of a speech signal to be recognized, the feature classification result containing the pronunciations that describe the pronunciation features of each speech signal frame and the probability that each speech signal frame maps to the corresponding pronunciation;

filtering the pronunciations contained in the feature classification result based on the probabilities it contains; and

recognizing the speech signal based on the filtered feature classification result.
In one embodiment, filtering the pronunciations contained in the feature classification result based on the probabilities it contains includes:

judging whether the probability that any speech signal frame maps to a corresponding pronunciation satisfies a predetermined filtering rule; and

if the corresponding pronunciation satisfies the predetermined filtering rule, filtering it out.

In one embodiment, if the difference between the probability that a speech signal frame maps to a corresponding pronunciation and that frame's maximum mapping probability falls within a predetermined difference range, the corresponding pronunciation is determined to satisfy the predetermined filtering rule;

if the probability that a speech signal frame maps to a corresponding pronunciation is smaller than the probability that the frame maps to each of a predetermined number of pronunciations, the corresponding pronunciation is determined to satisfy the predetermined filtering rule.
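The patent gives no implementation, so the following is only an illustrative sketch of the two rules above. `frame_probs` maps each pronunciation to its frame probability; the names `margin` and `top_n` are assumptions. The difference-range rule is read here in the conventional beam-pruning sense: a pronunciation is removed when its probability falls more than a fixed margin below the frame maximum.

```python
def filter_by_margin(frame_probs, margin):
    # Keep a pronunciation only if its probability is within `margin`
    # of the frame's maximum mapping probability (beam-pruning reading).
    best = max(frame_probs.values())
    return {p: v for p, v in frame_probs.items() if best - v <= margin}

def filter_by_top_n(frame_probs, top_n):
    # Keep only the `top_n` most probable pronunciations; a pronunciation
    # whose probability is below all of them satisfies the filtering rule.
    ranked = sorted(frame_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])
```

Either function can serve as the per-frame `prune` step; in practice such pruning is usually done on log-probabilities, but plain probabilities keep the sketch simple.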
In one embodiment, the predetermined number is either:

the number of pronunciations, among those corresponding to the frame, that are retained in the feature classification result; or

the product of a predetermined ratio threshold and the total number of pronunciations corresponding to the frame.
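A minimal sketch of the second option above, assuming only the arithmetic stated in the text (the function name is illustrative):

```python
def predetermined_number(total_pronunciations, ratio_threshold):
    # With a ratio threshold of 1/4, a frame with 8000 candidate
    # pronunciations would retain 2000 of them.
    return int(total_pronunciations * ratio_threshold)
```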
In one embodiment, filtering the pronunciations contained in the feature classification result based on the probabilities it contains includes:

obtaining a histogram distribution of the probabilities that a speech signal frame maps to each pronunciation;

obtaining a beam width corresponding to the histogram distribution;

determining the pronunciations whose probabilities fall outside the beam width as pronunciations satisfying the predetermined filtering rule; and

deleting the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result.
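The patent does not specify how the beam width is derived from the histogram, so the sketch below makes an assumption common in histogram pruning: a survivor budget `max_kept` is fixed, and bins are walked from the high-probability end until the budget would be exceeded; the lower edge of the last accepted bin acts as the beam cutoff. All names are illustrative.

```python
def histogram_prune(frame_probs, max_kept, n_bins=32):
    # Histogram of one frame's mapping probabilities over n_bins bins.
    probs = sorted(frame_probs.values())
    lo, hi = probs[0], probs[-1]
    if lo == hi:  # all probabilities equal: nothing to prune
        return dict(frame_probs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for p in probs:
        idx = min(int((p - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Walk bins from the highest-probability end, accumulating survivors.
    kept, cutoff = 0, hi
    for b in range(n_bins - 1, -1, -1):
        if kept + counts[b] > max_kept:
            break
        kept += counts[b]
        cutoff = lo + b * width  # lower edge of this bin
    # Pronunciations below the cutoff fall outside the beam and are deleted.
    return {p: v for p, v in frame_probs.items() if v >= cutoff}
```

The design choice here is that the histogram makes the cutoff cheap to find without fully sorting by rank at decode time; other beam-width rules (e.g. a fixed probability mass) would fit the same skeleton.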
In one embodiment, deleting the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result includes:

if the probability that a speech signal frame maps to a corresponding pronunciation satisfies the predetermined filtering rule, determining that pronunciation as a candidate pronunciation;

if, for any of a predetermined number of speech signal frames adjacent to that frame, the probability of mapping to the candidate pronunciation also satisfies the predetermined filtering rule, deleting the candidate pronunciation from the pronunciations contained in the feature classification result; and

if none of the probabilities with which the predetermined number of adjacent frames map to the candidate pronunciation satisfies the predetermined filtering rule, retaining the candidate pronunciation in the pronunciations contained in the feature classification result.
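A hedged sketch of this neighbour-frame confirmation step. `flagged[t]` is assumed to be the set of candidate pronunciations for frame `t` (those whose probability satisfied the filtering rule), and `window` stands in for the predetermined number of adjacent frames; a candidate is only deleted when at least one neighbouring frame also flags it.

```python
def confirm_deletions(flagged, window=1):
    # Returns, per frame, the candidates that are actually deleted.
    deleted = []
    for t, candidates in enumerate(flagged):
        remove = set()
        for p in candidates:
            neighbours = range(max(0, t - window),
                               min(len(flagged), t + window + 1))
            # Delete only if some adjacent frame also flagged p;
            # otherwise the candidate is retained.
            if any(p in flagged[u] for u in neighbours if u != t):
                remove.add(p)
        deleted.append(remove)
    return deleted
```

The point of the check is robustness: a pronunciation that scores poorly in one frame but well in its neighbours may still lie on the correct path, so it survives.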
According to a second aspect of the present invention, a speech recognition device is provided, comprising:

a classification result acquisition module, configured to acquire a feature classification result of a speech signal to be recognized, the result containing the pronunciations that describe the pronunciation features of each speech signal frame and the probability that each frame maps to the corresponding pronunciation;

a pronunciation filtering module, configured to filter the pronunciations contained in the feature classification result based on the probabilities it contains; and

a speech recognition module, configured to recognize the speech signal based on the filtered feature classification result.

In one embodiment, the pronunciation filtering module further includes:

a first filtering module, configured to filter out a corresponding pronunciation when the difference between the probability that a speech signal frame maps to it and the frame's maximum mapping probability falls within a predetermined difference range; and

a second filtering module, configured to filter out a corresponding pronunciation when the probability that a speech signal frame maps to it is smaller than the probability that the frame maps to each of a predetermined number of pronunciations.

In one embodiment, the pronunciation filtering module includes:

a probability distribution module, configured to obtain a histogram distribution of the probabilities that a speech signal frame maps to each pronunciation;

a beam width determination module, configured to obtain a beam width corresponding to the histogram distribution;

a pronunciation determination module, configured to determine the pronunciations whose probabilities fall outside the beam width as pronunciations satisfying the predetermined filtering rule; and

a pronunciation deletion module, configured to delete the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result.

In one embodiment, the pronunciation filtering module includes:

a candidate pronunciation module, configured to determine a pronunciation as a candidate pronunciation when the probability that a speech signal frame maps to it satisfies a predetermined filtering rule;

a candidate pronunciation deletion module, configured to delete the candidate pronunciation from the pronunciations contained in the feature classification result when, for any of a predetermined number of speech signal frames adjacent to that frame, the probability of mapping to the candidate pronunciation satisfies the predetermined filtering rule; and

a candidate pronunciation retention module, configured to retain the candidate pronunciation in the pronunciations contained in the feature classification result when none of the probabilities with which the predetermined number of adjacent frames map to it satisfies the predetermined filtering rule.

By implementing the embodiments provided by the present invention, when recognizing a speech signal, the feature classification result of the signal is first acquired, and the pronunciations it contains are then filtered based on the probabilities it contains. During recognition, operations related to the filtered-out pronunciations no longer need to be performed; for example, paths related to them need not be searched in the recognition network. The time consumed by the recognition process is therefore effectively reduced, and recognition efficiency is improved.
Brief Description of the Drawings

Fig. 1 is a flowchart of a speech recognition method according to an exemplary embodiment of the present invention;

Fig. 2 is a flowchart of a speech recognition method according to another exemplary embodiment of the present invention;

Fig. 3 is a logical block diagram of a speech recognition device according to an exemplary embodiment of the present invention;

Fig. 4 is a logical block diagram of a speech recognition device according to another exemplary embodiment of the present invention;

Fig. 5 is a hardware structure diagram of a speech recognition device according to an exemplary embodiment of the present invention.
Detailed Description

Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless indicated otherwise. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.

The terminology used in the present invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. The singular forms "a", "the", and "said" used in the present invention and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, and so on may be used in the present invention to describe various information, the information should not be limited by these terms; the terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present invention, first information may also be called second information, and similarly second information may be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The speech recognition of the embodiments of the present invention involves an acoustic model and a language model. The acoustic model is a knowledge representation of the variability in acoustics, phonetics, and environment, as well as of differences such as the speaker's gender and accent. It can be obtained by training on the speech contained in a training speech corpus with an LSTM (Long Short-Term Memory) network, a CTC (Connectionist Temporal Classification) model, or a hidden Markov model (HMM), producing a mapping from the acoustic features of speech to pronunciations; the pronunciation is tied to the modeling unit. If the modeling unit is a syllable, the pronunciation is a syllable; if the unit is a phoneme, the pronunciation is a phoneme; if the unit is a state constituting a phoneme, the pronunciation is a state.

When training the acoustic model, pronunciation varies with words, speaking rate, intonation, stress, dialect, and other influencing factors, so the training corpus must cover a large amount of speech spanning those factors. In addition, for recognition accuracy, smaller pronunciation units such as syllables, phonemes, or states can be chosen as modeling units. Training on the large amount of speech in the corpus with such fine-grained modeling units therefore produces a large number of acoustic models. During recognition, classifying the features of the speech signal to be recognized with these models yields a feature classification result containing a large number of pronunciations (classes), for example 3,000 to 10,000.

Furthermore, to recognize the text corresponding to a speech signal, current speech recognition technology must search all possible paths in the recognition network for every pronunciation, and the search produces an exponential increase in paths. If all possible paths involved in 3,000 to 10,000 pronunciations are searched, the storage and computation required may exceed what the recognition system can bear. Current speech recognition therefore consumes large amounts of time and resources and suffers from low efficiency; the present invention proposes a solution for improving recognition efficiency.

To solve this problem, the solution of the present invention improves on the feature classification result produced during recognition. Filtering rules are set in advance according to the device resources involved and the required recognition efficiency. When recognizing a speech signal, the feature classification result of the signal is first acquired, and the pronunciations it contains are then filtered based on the probabilities it contains. During recognition there is then no need to search the recognition network for paths related to the filtered-out pronunciations, so the search time is effectively reduced and recognition efficiency is improved. The recognition process of the present invention is described in detail below with reference to the drawings.
Referring to Fig. 1, a flowchart of a speech recognition method according to an exemplary embodiment of the present invention, the embodiment can be applied to various electronic devices with speech processing capability and may include the following steps S101-S103:

Step S101: acquire a feature classification result of the speech signal to be recognized; the result contains the pronunciations that describe the pronunciation features of each speech signal frame and the probability that each frame maps to the corresponding pronunciation.

Step S102: filter the pronunciations contained in the feature classification result based on the probabilities it contains.

Step S103: recognize the speech signal based on the filtered feature classification result.
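Steps S101-S103 can be sketched end to end as follows. This is only an illustrative skeleton: `acoustic_model`, `prune`, and `decode` stand in for components the patent does not specify, and each frame's classification result is modelled as a pronunciation-to-probability mapping.

```python
def recognize(frames, acoustic_model, prune, decode):
    # S101: per-frame feature classification result
    # (each entry maps pronunciation -> mapping probability).
    classification = [acoustic_model(f) for f in frames]
    # S102: filter out low-probability pronunciations, frame by frame.
    filtered = [prune(frame_probs) for frame_probs in classification]
    # S103: decode only over the pronunciations that survived filtering,
    # so no network paths are searched for the filtered-out ones.
    return decode(filtered)
```

The benefit claimed by the method shows up in the last step: `decode` never sees the pruned pronunciations, so the path search it performs is correspondingly smaller.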
In this embodiment, the speech signal may be speech uttered by a user and collected in real time by a local acquisition device, or speech transmitted remotely from another acquisition device. To obtain its feature classification result, the signal can be preprocessed in real time by a conventional speech preprocessing module, and features can be extracted from the preprocessed signal by a feature extraction module. The extracted features may include PLP (Perceptual Linear Predictive), LPCC (Linear Predictive Cepstral Coefficient), FBANK (Mel-scale Filter Bank), and MFCC (Mel-Frequency Cepstral Coefficient) features. The extracted features are then processed by the acoustic model to obtain the feature classification result; the probabilities it contains express the likelihood that each speech signal frame maps to the corresponding pronunciation. In other examples, the feature classification result may also be received directly from another terminal device.

After the feature classification result is obtained, the solution of the present invention exploits the fact that some pronunciations in the result correlate only weakly with the frames of the speech signal to be recognized and have little effect on recognition accuracy. To reduce the effect that the large number of pronunciations has on recognition efficiency, such pronunciations can be filtered out of the feature classification result before recognition is performed, reducing the number of pronunciations and thereby improving efficiency.

In general, the lower the correlation between a pronunciation and a speech signal frame, the lower the probability that the frame maps to that pronunciation when the acoustic model classifies the signal's acoustic features. The pronunciations in the feature classification result can therefore be filtered based on the mapping probabilities; after filtering, the probability that any frame maps to a filtered-out pronunciation is smaller than the probability that it maps to any retained pronunciation.

In addition, when filtering weakly correlated pronunciations, the accuracy requirements of different application scenarios must be considered, and the effect of the filtered pronunciations on accuracy must be weighed. Accordingly, various filtering rules that bound the effect of filtering on recognition accuracy can be set in advance according to the required accuracy. For each predetermined filtering rule, when filtering the pronunciations in the feature classification result, it is judged whether the probability that a frame maps to a corresponding pronunciation satisfies the rule; if it does, the pronunciation is filtered out. A filtered-out pronunciation generally means one deleted from the feature classification result.

Several ways of filtering the pronunciations in the feature classification result are listed below.

Filtering method one: filter out low-probability pronunciations by a predetermined number. The predetermined number may be the number of pronunciations, among those corresponding to a frame, that are retained in the feature classification result, or the product of a predetermined ratio threshold and the total number of pronunciations corresponding to the frame. During filtering, if the probability that a frame maps to a corresponding pronunciation is smaller than the probability that the frame maps to each of the predetermined number of pronunciations, the corresponding pronunciation is determined to satisfy the predetermined filtering rule.

The predetermined ratio threshold can be set by the designer according to the recognition accuracy to be achieved; for example, 1/4, meaning the ratio of retained pronunciations to all pronunciations.
In one example, during actual filtering, pronunciations can be deleted from the feature classification result in ascending order of probability; when the ratio of retained pronunciations to the original number of pronunciations meets the predetermined ratio threshold, filtering of the feature classification result is complete.
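As an illustrative sketch only: deleting in ascending order of probability until the kept/original ratio reaches the threshold is equivalent to keeping the top fraction of pronunciations, which the code below does directly (names are assumptions).

```python
def keep_by_ratio(frame_probs, ratio_threshold):
    # Equivalent to deleting pronunciations in ascending order of
    # probability until kept/original == ratio_threshold.
    keep_count = int(len(frame_probs) * ratio_threshold)
    ranked = sorted(frame_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:keep_count])
```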
在其他例子中,预定的比例阈值可以指未被过滤掉的发音与被过滤掉的发音的数量比例。实际过滤时,可以按概率从大到小的顺序,在特征分类结果中挑选发音,当挑选出的发音的数量与剩余的发音的数量的比例,满足预定的比例阈值时,完成对特征分类结果的过滤。In other examples, the predetermined ratio threshold may refer to the ratio of the number of unfiltered utterances to the number of filtered utterances. In actual filtering, pronunciations can be selected from the feature classification results in order of probability from large to small. When the ratio of the number of selected pronunciations to the number of remaining pronunciations meets the predetermined ratio threshold, the feature classification results are completed. filter.
In practical applications, when the predetermined number refers to the number of pronunciations, among those corresponding to a speech signal frame, that are retained in the feature classification result, the designer of the present invention may set the predetermined number according to the speech recognition accuracy to be achieved, for example to any value from 2000 to 9000. During filtering, the pronunciations mapped to by each speech signal frame may be arranged in order of increasing probability, and the pronunciations occupying the first predetermined positions deleted from the feature classification result to complete the filtering, the number of positions being equal to the predetermined number.

In other examples, the predetermined number may refer to the number of pronunciations that are not filtered out, for example 1000. During actual filtering, the pronunciations mapped to by each speech signal frame may be arranged in order of decreasing probability; the pronunciations occupying the first predetermined positions are retained in the feature classification result and the remaining pronunciations are deleted from it, the number of positions being equal to the quantity threshold. In other embodiments, other technical means may also be used to filter the feature classification result according to filtering method one, which is not limited by the present invention.
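Filtering method one can be sketched in Python as follows; the function names, the dict-of-probabilities frame representation, and the parameter values are illustrative assumptions, not part of the patent:

```python
def filter_top_n(frame_probs, keep_n):
    """Keep only the keep_n highest-probability pronunciations of one frame.

    frame_probs: hypothetical representation mapping pronunciation -> probability
    for a single speech signal frame; keep_n plays the role of the predetermined
    number of retained pronunciations.
    """
    # Rank pronunciations by probability, largest first, and keep the head.
    ranked = sorted(frame_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:keep_n])


def filter_by_ratio(frame_probs, keep_ratio):
    """Variant where the predetermined number is a ratio threshold (e.g. 1/4)
    times the total number of pronunciations of the frame."""
    keep_n = max(1, int(len(frame_probs) * keep_ratio))
    return filter_top_n(frame_probs, keep_n)
```

For example, with a keep ratio of 1/4 and 8000 candidate pronunciations in a frame, only the 2000 most probable pronunciations would survive filtering.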
Filtering method two: filter out low-probability pronunciations according to a predetermined difference threshold. The difference threshold may be set by the designer of the present invention according to the speech recognition accuracy to be achieved; for example, it may be set to -0.5, denoting the probability difference between a filtered-out pronunciation and the highest-probability pronunciation mapped to by the same speech signal frame. During filtering, if the difference between the probability that any speech signal frame is mapped to a given pronunciation and the frame's maximum mapping probability falls within a predetermined difference range, the given pronunciation is determined to satisfy the predetermined filtering rule and may be filtered out.

In one example, during actual filtering, the pronunciations mapped to by each speech signal frame may be arranged in order of decreasing probability, and the probability with which the frame is mapped to the first-ranked pronunciation taken as the maximum probability. Then, starting from the last-ranked pronunciation, the difference between the probability of each pronunciation and the maximum probability is obtained in turn; if the difference is less than -0.5, the pronunciation is deleted from the feature classification result. In other embodiments, other technical means may also be used to filter the feature classification result according to filtering method two, which is not limited by the present invention.
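A minimal sketch of filtering method two, assuming the feature classification result stores log-probabilities (so that a -0.5 threshold, as in the example above, is a meaningful difference); the function name and dict representation are assumptions:

```python
def filter_by_margin(frame_probs, margin=-0.5):
    """Drop every pronunciation whose (log-)probability falls more than
    |margin| below the best pronunciation of the same frame.

    frame_probs: hypothetical mapping pronunciation -> log-probability for
    one speech signal frame.
    """
    best = max(frame_probs.values())
    # Keep a pronunciation only if its difference to the maximum mapping
    # probability is not below the predetermined difference threshold.
    return {p: s for p, s in frame_probs.items() if s - best >= margin}
```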
Filtering method three: filter out, according to a histogram distribution of the probabilities, the pronunciations that fall outside a beam width. During actual filtering, the histogram distribution of the probabilities that a speech signal frame is mapped to each pronunciation may first be obtained; the beam width corresponding to that histogram distribution is then obtained; the pronunciations whose probabilities fall outside the beam width are determined as pronunciations satisfying the predetermined filtering rule; and finally the pronunciations satisfying the predetermined filtering rule are deleted from the pronunciations contained in the feature classification result. In practical applications, the beam width may be determined by the designer of the present invention according to the speech recognition accuracy to be achieved and the shape of the histogram. For example, if it is preset that 8000 low-probability pronunciations are to be filtered out, 8000 pronunciations may be counted starting from the low-probability side of the histogram, and the position of the 8000th pronunciation taken as the beam-width boundary. In other embodiments, other technical means may also be used to filter the feature classification result according to filtering method three, which is not limited by the present invention.
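Filtering method three might look as follows; the histogram-based search for the beam-width boundary mirrors the 8000-pronunciation example above, but the function name, bin count, and list representation are assumptions for illustration:

```python
import numpy as np

def histogram_beam_cutoff(probs, n_filter, n_bins=100):
    """Find a beam-width boundary from the histogram of one frame's mapping
    probabilities so that roughly n_filter low-probability pronunciations
    fall outside the beam, then keep only the pronunciations inside it."""
    counts, edges = np.histogram(probs, bins=n_bins)
    # Walk the histogram from the low-probability side, accumulating bin
    # counts until about n_filter pronunciations have been covered.
    cum = 0
    boundary = edges[0]
    for i, c in enumerate(counts):
        cum += c
        if cum >= n_filter:
            boundary = edges[i + 1]  # beam-width boundary
            break
    return [p for p in probs if p >= boundary]
```

Because the cutoff is aligned to a bin edge, the number actually discarded can differ slightly from `n_filter`, which is consistent with the boundary being a design-time choice rather than an exact count.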
After filtering the pronunciations contained in the feature classification result by any of the above filtering methods, a predetermined recognition network may be invoked directly to search for the paths related to the pronunciations contained in the filtered feature classification result, find the single best path, and output with maximum probability the text information corresponding to the speech signal to be recognized, completing speech recognition. The recognition network mentioned here may refer to a recognition network built by the decoder, for the speech signal to be recognized, from a trained acoustic model, a language model, and a dictionary.

When searching for the best path, the probabilities contained in the feature classification result (acoustic scores) may be converted into a numerical space close to that of the association probabilities between the words (characters, words, phrases) of the language model (language scores), and the two combined by weighted addition as the overall score of the path search. Each speech signal frame is constrained by a preset threshold: a path whose score falls behind the best path by more than this threshold is discarded, otherwise it is retained. After the search for each frame is completed, all paths are sorted and only a preset maximum number of best paths are kept, until the last frame is processed, yielding the final path graph.
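The per-frame pruning described above can be sketched as follows; `paths`, `beam`, and `max_paths` are hypothetical names for the list of partial paths, the preset score threshold, and the preset maximum path count:

```python
def prune_paths(paths, beam, max_paths):
    """paths: list of (combined_score, path) pairs, where combined_score is
    the weighted sum of the acoustic score and the language score.
    Discard paths falling more than `beam` below the best path, then keep
    at most max_paths of the best survivors."""
    best = max(score for score, _ in paths)
    # Threshold pruning against the best path of this frame.
    survivors = [(s, p) for s, p in paths if best - s <= beam]
    # Rank pruning: keep only the preset maximum number of paths.
    survivors.sort(key=lambda sp: sp[0], reverse=True)
    return survivors[:max_paths]
```

Applied once per speech signal frame, this keeps the search space bounded until the last frame is reached.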
In some examples, the modeling unit of the acoustic model that outputs the feature classification result is small, e.g. a state is used as the modeling unit. Since a single phoneme can consist of three to five states, the speech signal produced by the pronunciation of one phoneme may be divided into multiple speech signal frames; it is therefore common for several consecutive frames to have similar acoustic features, in which case the pronunciations describing those consecutive frames in the feature classification result are likewise similar. In this situation, if the present invention filters the pronunciations mapped to by each frame independently, based on the probabilities contained in the feature classification result and the predetermined filtering rule, pronunciations with a large impact on recognition accuracy are easily filtered out. To avoid mistakenly filtering such pronunciations, the filtering status of consecutive speech signal frames can be considered jointly when filtering the feature classification result. The specific implementation may refer to the method shown in Fig. 2, comprising the following steps S201-S205:
Step S201: obtain the feature classification result of the speech signal to be recognized; the feature classification result contains the pronunciations describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation.

Step S202: if the probability that any speech signal frame is mapped to a corresponding pronunciation satisfies the predetermined filtering rule, determine that pronunciation as a candidate pronunciation.

Step S203: if, for any of a predetermined number of speech signal frames adjacent to that frame, the probability of being mapped to the candidate pronunciation satisfies the predetermined filtering rule, delete the candidate pronunciation from the pronunciations contained in the feature classification result.

Step S204: if none of the probabilities with which the predetermined number of adjacent speech signal frames are mapped to the candidate pronunciation satisfies the predetermined filtering rule, retain the candidate pronunciation among the pronunciations contained in the feature classification result.

Step S205: recognize the speech signal based on the filtered feature classification result.
In the embodiments of the present invention, the predetermined filtering rule may be any of the rules involved in filtering methods one to four described above, or another filtering rule that can limit the impact of the filtered-out pronunciations on recognition accuracy.

The predetermined number of consecutive speech signal frames may be set by the designer of the present invention according to the speech recognition accuracy to be achieved; for example, it may be set to 6, i.e. the three preceding and the three following adjacent frames.
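Steps S201-S205 can be sketched as follows; the list-of-dicts frame representation, the `meets_rule` callback standing in for the predetermined filtering rule, and the `context` parameter (3 frames on each side, matching the example value of 6) are all assumptions for illustration:

```python
def filter_with_context(frame_results, meets_rule, context=3):
    """frame_results: one {pronunciation: probability} dict per frame.
    A pronunciation that satisfies the filter rule in some frame (S202) is
    deleted only if the rule is also satisfied for that pronunciation in at
    least one of the adjacent frames (S203); otherwise it is kept (S204)."""
    filtered = []
    for i, frame in enumerate(frame_results):
        kept = {}
        for pron, prob in frame.items():
            if meets_rule(frame, pron):  # S202: candidate pronunciation
                lo = max(0, i - context)
                hi = min(len(frame_results), i + context + 1)
                neighbours = [j for j in range(lo, hi) if j != i]
                if any(pron in frame_results[j] and meets_rule(frame_results[j], pron)
                       for j in neighbours):
                    continue  # S203: delete the candidate pronunciation
            kept[pron] = prob  # S204: retain the pronunciation
        filtered.append(kept)
    return filtered  # S205 would then run recognition on this result
```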
It can be seen from the above embodiments that when the speech recognition method of the present invention recognizes a speech signal, it first obtains the feature classification result of the signal and then filters the pronunciations contained in that result based on the probabilities it contains. In the process of recognizing the speech signal, recognition operations related to the filtered-out pronunciations no longer need to be performed, e.g. paths related to the filtered-out pronunciations no longer need to be searched in the recognition network; the time consumed by the speech recognition process is therefore effectively reduced, and the efficiency of speech recognition is improved.
Furthermore, the speech recognition method of the embodiments of the present invention can be applied in the human-computer interaction software of various electronic devices, for example voice dialing, voice control, and voice search in a smartphone. When applied to voice search in a smartphone, if a user utters speech within a predetermined range of the phone, the speech recognition method applied to voice search may, after receiving the user's speech collected by the speech acquisition device, first obtain the feature classification result of that speech, filter the pronunciations contained in the feature classification result based on the probabilities it contains, and then search the recognition network only for paths related to the pronunciations that were not filtered out, quickly recognizing through path search the text information corresponding to the user's speech, so that the voice assistant can respond to the user quickly based on the recognition result.
Corresponding to the foregoing method embodiments, the present invention also provides device embodiments.

Referring to Fig. 3, Fig. 3 is a logic block diagram of a speech recognition device according to an exemplary embodiment of the present invention. The device may include a classification result acquisition module 310, a pronunciation filtering module 320, and a speech recognition module 330.

The classification result acquisition module 310 is configured to obtain the feature classification result of the speech signal to be recognized; the feature classification result contains the pronunciations describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation.

The pronunciation filtering module 320 is configured to filter the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result.

The speech recognition module 330 is configured to recognize the speech signal based on the filtered feature classification result.
In some examples, the pronunciation filtering module 320 may include:

a first filtering module, configured to filter out a corresponding pronunciation when the difference between the probability that any speech signal frame is mapped to that pronunciation and the frame's maximum mapping probability falls within a predetermined difference range;

a second filtering module, configured to filter out a corresponding pronunciation when the probability that any speech signal frame is mapped to that pronunciation is smaller than the probability that the frame is mapped to each of a predetermined number of pronunciations.
In other examples, the pronunciation filtering module 320 may further include:

a probability distribution module, configured to obtain the histogram distribution of the probabilities that any speech signal frame is mapped to each pronunciation;

a beam width determination module, configured to obtain the beam width corresponding to the histogram distribution;

a pronunciation determination module, configured to determine the pronunciations whose probabilities fall outside the beam width as pronunciations satisfying the predetermined filtering rule;

a pronunciation deletion module, configured to delete the pronunciations satisfying the predetermined filtering rule from the pronunciations contained in the feature classification result.
Referring to Fig. 4, Fig. 4 is a logic block diagram of a speech recognition device according to another exemplary embodiment of the present invention. The device may include a classification result acquisition module 410, a pronunciation filtering module 420, and a speech recognition module 430. The pronunciation filtering module 420 may include a candidate pronunciation determination module 421, a candidate pronunciation deletion module 422, and a candidate pronunciation retention module 423.

The classification result acquisition module 410 is configured to obtain the feature classification result of the speech signal to be recognized; the feature classification result contains the pronunciations describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation.

The candidate pronunciation determination module 421 is configured to determine a pronunciation as a candidate pronunciation when the probability that any speech signal frame is mapped to that pronunciation satisfies the predetermined filtering rule.

The candidate pronunciation deletion module 422 is configured to delete the candidate pronunciation from the pronunciations contained in the feature classification result when, for any of a predetermined number of speech signal frames adjacent to that frame, the probability of being mapped to the candidate pronunciation satisfies the predetermined filtering rule.

The candidate pronunciation retention module 423 is configured to retain the candidate pronunciation among the pronunciations contained in the feature classification result when none of the probabilities with which the predetermined number of adjacent speech signal frames are mapped to the candidate pronunciation satisfies the predetermined filtering rule.

The speech recognition module 430 is configured to recognize the speech signal based on the filtered feature classification result.
For the implementation of the functions and roles of each unit (or module) in the above device, refer to the implementation of the corresponding steps in the above method; details are not repeated here.

Since the device embodiments basically correspond to the method embodiments, the relevant parts may refer to the description of the method embodiments. The device embodiments described above are merely illustrative; the units or modules described as separate components may or may not be physically separate, and components shown as units or modules may or may not be physical units or modules, i.e. they may be located in one place or distributed over multiple network units or modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention, which can be understood and implemented by those of ordinary skill in the art without creative effort.
The embodiments of the speech recognition device of the present invention can be applied to electronic equipment. They may be implemented by a computer chip or an entity, or by a product with a certain function. In a typical implementation, the electronic device is a computer, the specific form of which may be a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail transceiver, game console, tablet computer, wearable device, Internet TV, smart vehicle, driverless car, smart refrigerator, other smart home device, or a combination of any of these devices.
The device embodiments may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, a device in the logical sense is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from a readable medium such as a non-volatile memory into memory for execution. At the hardware level, Fig. 5 shows a hardware structure diagram of the electronic device in which the speech recognition device of the present invention is located; in addition to the processor, memory, network interface, and non-volatile memory shown in Fig. 5, the electronic device in which the device of the embodiment is located may also include other hardware according to its actual function, which is not repeated here.

The memory of the electronic device may store program instructions executable by the processor; the processor may be coupled to the memory for reading the program instructions stored in the memory and, in response, performing the following operations: obtaining the feature classification result of the speech signal to be recognized, the feature classification result containing the pronunciations describing the pronunciation features of each speech signal frame and the probability that each speech signal frame is mapped to the corresponding pronunciation; filtering the pronunciations contained in the feature classification result based on the probabilities contained in the feature classification result; and recognizing the speech signal based on the filtered feature classification result.
In other embodiments, for the operations performed by the processor, refer to the relevant descriptions in the method embodiments above; details are not repeated here.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (10)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710254628.8A CN106875936B (en) | 2017-04-18 | 2017-04-18 | Voice recognition method and device |
| PCT/CN2017/104382 WO2018192186A1 (en) | 2017-04-18 | 2017-09-29 | Speech recognition method and apparatus |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710254628.8A CN106875936B (en) | 2017-04-18 | 2017-04-18 | Voice recognition method and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN106875936A true CN106875936A (en) | 2017-06-20 |
| CN106875936B CN106875936B (en) | 2021-06-22 |
Family
ID=59162735
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710254628.8A Active CN106875936B (en) | 2017-04-18 | 2017-04-18 | Voice recognition method and device |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN106875936B (en) |
| WO (1) | WO2018192186A1 (en) |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107481718A (en) * | 2017-09-20 | 2017-12-15 | 广东欧珀移动通信有限公司 | Speech recognition method, device, storage medium and electronic equipment |
| CN108694951A (en) * | 2018-05-22 | 2018-10-23 | 华南理工大学 | A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term |
| WO2018192186A1 (en) * | 2017-04-18 | 2018-10-25 | 广州视源电子科技股份有限公司 | Speech recognition method and apparatus |
| CN108877782A (en) * | 2018-07-04 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
| CN108899013A (en) * | 2018-06-27 | 2018-11-27 | 广州视源电子科技股份有限公司 | Voice search method and device and voice recognition system |
| CN108932943A (en) * | 2018-07-12 | 2018-12-04 | 广州视源电子科技股份有限公司 | Command word sound detection method, device, equipment and storage medium |
| CN109192211A (en) * | 2018-10-29 | 2019-01-11 | 珠海格力电器股份有限公司 | Method, device and equipment for recognizing voice signal |
| CN109872715A (en) * | 2019-03-01 | 2019-06-11 | 深圳市伟文无线通讯技术有限公司 | A kind of voice interactive method and device |
| WO2023036283A1 (en) * | 2021-09-10 | 2023-03-16 | 广州视源电子科技股份有限公司 | Online class interaction method and online class system |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4959865A (en) * | 1987-12-21 | 1990-09-25 | The Dsp Group, Inc. | A method for indicating the presence of speech in an audio signal |
| US6714909B1 (en) * | 1998-08-13 | 2004-03-30 | At&T Corp. | System and method for automated multimedia content indexing and retrieval |
| CN101894549A (en) * | 2010-06-24 | 2010-11-24 | 中国科学院声学研究所 | Method for fast calculating confidence level in speech recognition application field |
| CN101944359A (en) * | 2010-07-23 | 2011-01-12 | 杭州网豆数字技术有限公司 | Voice recognition method facing specific crowd |
| CN102426836A (en) * | 2011-08-25 | 2012-04-25 | 哈尔滨工业大学 | A Fast Keyword Detection Method Based on Quantile Adaptive Clipping |
| CN102436816A (en) * | 2011-09-20 | 2012-05-02 | 安徽科大讯飞信息科技股份有限公司 | Voice data decoding method and device |
| CN103730115A (en) * | 2013-12-27 | 2014-04-16 | 北京捷成世纪科技股份有限公司 | Method and device for detecting keywords in voice |
| CN105243143A (en) * | 2015-10-14 | 2016-01-13 | 湖南大学 | Recommendation method and system based on instant voice content detection |
| CN105845128A (en) * | 2016-04-06 | 2016-08-10 | 中国科学技术大学 | Voice identification efficiency optimization method based on dynamic pruning beam prediction |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP4241771B2 (en) * | 2006-07-04 | 2009-03-18 | 株式会社東芝 | Speech recognition apparatus and method |
| CN101030369B (en) * | 2007-03-30 | 2011-06-29 | 清华大学 | Built-in speech discriminating method based on sub-word hidden Markov model |
| CN102779510B (en) * | 2012-07-19 | 2013-12-18 | 东南大学 | Speech Emotion Recognition Method Based on Feature Space Adaptive Projection |
| KR20140147587A (en) * | 2013-06-20 | 2014-12-30 | 한국전자통신연구원 | A method and apparatus to detect speech endpoint using weighted finite state transducer |
| CN106875936B (en) * | 2017-04-18 | 2021-06-22 | 广州视源电子科技股份有限公司 | Voice recognition method and device |
-
2017
- 2017-04-18 CN CN201710254628.8A patent/CN106875936B/en active Active
- 2017-09-29 WO PCT/CN2017/104382 patent/WO2018192186A1/en not_active Ceased
Patent Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4959865A (en) * | 1987-12-21 | 1990-09-25 | The Dsp Group, Inc. | A method for indicating the presence of speech in an audio signal |
| US6714909B1 (en) * | 1998-08-13 | 2004-03-30 | At&T Corp. | System and method for automated multimedia content indexing and retrieval |
| CN101894549A (en) * | 2010-06-24 | 2010-11-24 | 中国科学院声学研究所 | Method for fast calculating confidence level in speech recognition application field |
| CN101944359A (en) * | 2010-07-23 | 2011-01-12 | 杭州网豆数字技术有限公司 | Voice recognition method facing specific crowd |
| CN102426836A (en) * | 2011-08-25 | 2012-04-25 | 哈尔滨工业大学 | A Fast Keyword Detection Method Based on Quantile Adaptive Clipping |
| CN102436816A (en) * | 2011-09-20 | 2012-05-02 | 安徽科大讯飞信息科技股份有限公司 | Voice data decoding method and device |
| CN103730115A (en) * | 2013-12-27 | 2014-04-16 | 北京捷成世纪科技股份有限公司 | Method and device for detecting keywords in voice |
| CN105243143A (en) * | 2015-10-14 | 2016-01-13 | 湖南大学 | Recommendation method and system based on instant voice content detection |
| CN105845128A (en) * | 2016-04-06 | 2016-08-10 | 中国科学技术大学 | Voice identification efficiency optimization method based on dynamic pruning beam prediction |
Cited By (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2018192186A1 (en) * | 2017-04-18 | 2018-10-25 | 广州视源电子科技股份有限公司 | Speech recognition method and apparatus |
| CN107481718B (en) * | 2017-09-20 | 2019-07-05 | Oppo广东移动通信有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment |
| CN107481718A (en) * | 2017-09-20 | 2017-12-15 | 广东欧珀移动通信有限公司 | Speech recognition method, device, storage medium and electronic equipment |
| CN110310623B (en) * | 2017-09-20 | 2021-12-28 | Oppo广东移动通信有限公司 | Sample generation method, model training method, device, medium, and electronic apparatus |
| CN110310623A (en) * | 2017-09-20 | 2019-10-08 | Oppo广东移动通信有限公司 | Sample generation method, model training method, device, medium and electronic device |
| CN108694951A (en) * | 2018-05-22 | 2018-10-23 | 华南理工大学 | A kind of speaker's discrimination method based on multithread hierarchical fusion transform characteristics and long memory network in short-term |
| CN108694951B (en) * | 2018-05-22 | 2020-05-22 | 华南理工大学 | A Speaker Recognition Method Based on Multi-Stream Hierarchical Fusion Transform Features and Long Short-Term Memory Networks |
| CN108899013A (en) * | 2018-06-27 | 2018-11-27 | 广州视源电子科技股份有限公司 | Voice search method and device and voice recognition system |
| CN108877782A (en) * | 2018-07-04 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device |
| CN108932943A (en) * | 2018-07-12 | 2018-12-04 | 广州视源电子科技股份有限公司 | Command word sound detection method, device, equipment and storage medium |
| CN109192211A (en) * | 2018-10-29 | 2019-01-11 | 珠海格力电器股份有限公司 | Method, device and equipment for recognizing voice signal |
| CN109872715A (en) * | 2019-03-01 | 2019-06-11 | 深圳市伟文无线通讯技术有限公司 | A kind of voice interactive method and device |
| WO2023036283A1 (en) * | 2021-09-10 | 2023-03-16 | 广州视源电子科技股份有限公司 | Online class interaction method and online class system |
Also Published As
| Publication number | Publication date |
|---|---|
| CN106875936B (en) | 2021-06-22 |
| WO2018192186A1 (en) | 2018-10-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN106875936B (en) | Voice recognition method and device | |
| JP4568371B2 (en) | Computerized method and computer program for distinguishing between at least two event classes | |
| Fohr et al. | New paradigm in speech recognition: deep neural networks | |
| CN115132170B (en) | Language classification method, device and computer readable storage medium | |
| US11978438B1 (en) | Machine learning model updating | |
| CN114416989A (en) | Text classification model optimization method and device | |
| JP7544989B2 (en) | Lookup Table Recurrent Language Models | |
| CN112331207B (en) | Service content monitoring method, device, electronic equipment and storage medium | |
| JP6622681B2 (en) | Phoneme collapse detection model learning device, phoneme collapse interval detection device, phoneme collapse detection model learning method, phoneme collapse interval detection method, and program | |
| Kumar et al. | Machine learning based speech emotions recognition system | |
| KR101068122B1 (en) | Rejection apparatus and method based on garbage and halfword model in speech recognizer | |
| Moyal et al. | Phonetic search methods for large speech databases | |
| CN113990325B (en) | Streaming speech recognition method and device, electronic device, and storage medium | |
| CN114360514B (en) | Speech recognition method, device, equipment, medium and product | |
| KR20210081166A (en) | Spoken language identification apparatus and method in multilingual environment | |
| CN112489646B (en) | Speech recognition method and device thereof | |
| US20210225366A1 (en) | Speech recognition system with fine-grained decoding | |
| JP3660512B2 (en) | Voice recognition method, apparatus and program recording medium | |
| CN117711376A (en) | Language identification method, system, equipment and storage medium | |
| CN115083397B (en) | Lyrics acoustic model training method, lyrics recognition method, device and product | |
| Ons et al. | A self learning vocal interface for speech-impaired users | |
| CN111640423B (en) | Word boundary estimation method and device and electronic equipment | |
| CN114203159A (en) | Speech emotion recognition method, terminal device and computer-readable storage medium | |
| KR100915638B1 (en) | The method and system for high-speed voice recognition | |
| KR102113879B1 (en) | The method and apparatus for recognizing speaker's voice by using reference database |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||