CN116129879A - Voice recognition method, device, equipment and storage medium

Info

Publication number: CN116129879A
Application number: CN202211619425.1A
Authority: CN
Inventors: 潘敏
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2022-12-14
Filing date: 2022-12-14
Publication date: 2023-05-16
Anticipated expiration: 2042-12-14
Also published as: CN116129879B

Abstract

The invention provides a speech recognition method, apparatus, device and storage medium. The method comprises: receiving speech information to be recognized; performing speech recognition on the speech information to be recognized by using a speech recognition model, to obtain posterior data corresponding to each frame of the speech information to be recognized, wherein the speech recognition model is obtained by training based on a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in a long word vocabulary; decoding the posterior data by using the short word vocabulary, to sequentially obtain a plurality of recognized short words corresponding to the speech information to be recognized; and determining a speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words. The invention can filter out modal particles and other irrelevant words in the speech information to be recognized, thereby safeguarding the accuracy of the speech recognition result and the success rate of speech recognition.

Description

Speech recognition method, apparatus, device and storage medium

Technical Field

The present invention relates to the technical field of artificial intelligence, and in particular to a speech recognition method, apparatus, device and storage medium.

Background Art

With the advancement of science and technology, intelligent speech technology continues to develop. Speech recognition in particular has become widespread, and most smart devices now include it. Low-power devices such as smart wearable devices have limited resources, so they generally support only a set of recognition words prepared in advance, and the number of recognition words they can support is limited.

Most offline speech recognition on current smart devices is based on the product's vocabulary. A speech recognition model is trained according to the vocabulary preset for the product; audio is then fed into the trained model, which outputs posterior probabilities, and the recognition result is obtained from those posterior probabilities. During model training, however, it is generally impossible to predict whether a speaker has particular colloquial or speaking habits. Because everyone speaks differently, once the actual spoken content deviates from the vocabulary preset in the smart device, the speech may not be recognized, which degrades the speech recognition result.

Therefore, there is an urgent need in the art for a speech recognition solution that can improve the accuracy of speech recognition results.

Summary of the Invention

In view of this, embodiments of the present invention provide a speech recognition method, apparatus, device and storage medium, so as to eliminate or mitigate one or more defects in the prior art.

One aspect of the present invention provides a speech recognition method, the method comprising the following steps:

receiving speech information to be recognized;

performing speech recognition on the speech information to be recognized by using a speech recognition model, to obtain posterior data corresponding to each frame of the speech information to be recognized; wherein the speech recognition model is obtained by training based on a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in a long word vocabulary;

decoding the posterior data by using the short word vocabulary, to sequentially obtain a plurality of recognized short words corresponding to the speech information to be recognized; and

determining a speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words.

In some embodiments of the present invention, the training method of the speech recognition model comprises:

obtaining the long word vocabulary of an original speech recognition model;

semantically cutting the long words in the long word vocabulary, so that each long word is cut into a plurality of short words, to obtain the short word vocabulary; and

collecting short word speech samples based on the short word vocabulary, and performing model training with the short word speech samples to obtain the speech recognition model.

In some embodiments of the present invention, the method further comprises:

establishing, according to the source of each short word in the short word vocabulary, a mapping relationship between each short word in the short word vocabulary and the long words in the long word vocabulary;

wherein determining the speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words comprises:

combining the plurality of recognized short words according to the order in which they were recognized, to obtain a combined word; and

matching the combined word against the long word vocabulary according to the mapping relationship, in the long word vocabulary, of each recognized short word in the combined word, to obtain the speech recognition result.
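
By way of non-limiting illustration, the mapping and matching steps above may be sketched in Python as follows. The segmentation table `SEGMENTS` and the function names `build_mapping` and `match_long_word` are hypothetical and are introduced only for this example; the sketch assumes that a combined word matches a long word when the recognized short words, joined in recognition order, equal that long word.

```python
# Hypothetical segmentation of a small long word vocabulary (assumption).
SEGMENTS = {
    "打开空调": ["打开", "空调"],
    "关闭空调": ["关闭", "空调"],
}

def build_mapping(segments):
    """Map each short word to the set of long words it was cut from."""
    mapping = {}
    for long_word, parts in segments.items():
        for part in parts:
            mapping.setdefault(part, set()).add(long_word)
    return mapping

def match_long_word(recognized, segments, mapping):
    """Return the long word formed by the recognized short words, or None."""
    if not recognized:
        return None
    # Only long words that every recognized short word maps to can match.
    candidates = set.intersection(*(mapping.get(w, set()) for w in recognized))
    combined = "".join(recognized)  # combine in recognition order
    for long_word in sorted(candidates):
        if "".join(segments[long_word]) == combined:
            return long_word
    return None

mapping = build_mapping(SEGMENTS)
result = match_long_word(["打开", "空调"], SEGMENTS, mapping)
```

Using the mapping to narrow the candidates first means that only long words sharing all recognized short words are compared against the combined word.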

In some embodiments of the present invention, the posterior data is the probability of each phoneme corresponding to each frame of the speech information to be recognized;

and decoding the posterior data by using the short word vocabulary, to sequentially obtain the plurality of recognized short words corresponding to the speech information to be recognized, comprises:

obtaining, according to the ranking of the phoneme probabilities of each frame in the posterior data, a screened phoneme set for each frame, consisting of the phonemes ranked within a specified rank; and

performing matching decoding on the screened phoneme sets of the frames in sequence by using the short word vocabulary, to sequentially obtain the plurality of recognized short words corresponding to the speech information to be recognized.

In some embodiments of the present invention, decoding the posterior data by using the short word vocabulary comprises:

decoding the posterior data with the Viterbi decoding algorithm, using the short word vocabulary.

In some embodiments of the present invention, determining the speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words comprises:

matching the plurality of recognized short words against the long word vocabulary and, if the matching fails, continuing to recognize the speech information to be recognized by using the speech recognition model, to determine a new recognized short word corresponding to the speech information to be recognized; and

combining the new recognized short word with the plurality of recognized short words, and matching the combined recognized short words against the long word vocabulary, until the speech recognition result of the speech information to be recognized is determined in the long word vocabulary.
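
By way of non-limiting illustration, the match-and-continue procedure above may be sketched in Python as follows. The function name `recognize_incrementally` and the sample vocabulary are hypothetical; the sketch assumes that the decoder yields recognized short words one at a time and that a match means the combined short words equal an entry of the long word vocabulary.

```python
def recognize_incrementally(short_word_stream, long_vocab):
    """Accumulate recognized short words until their combination is a long word.

    short_word_stream: iterable yielding recognized short words in order.
    long_vocab: set of long words (the original vocabulary).
    Returns the matched long word, or None if the stream ends without a match.
    """
    recognized = []
    for short_word in short_word_stream:
        recognized.append(short_word)      # combine the new word with earlier ones
        combined = "".join(recognized)
        if combined in long_vocab:         # matching succeeded
            return combined
        # matching failed: keep recognizing and try again with one more word
    return None

long_vocab = {"打开空调", "关闭空调"}
result = recognize_incrementally(iter(["打开", "空调"]), long_vocab)
```

In this sketch the first attempt, with only "打开", fails to match, so recognition continues and the second attempt, with "打开空调", succeeds.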

Another aspect of the present invention provides a speech recognition apparatus, the apparatus comprising:

a speech receiving module, configured to receive speech information to be recognized;

a posterior data acquisition module, configured to perform speech recognition on the speech information to be recognized by using a speech recognition model, to obtain posterior data corresponding to each frame of the speech information to be recognized; wherein the speech recognition model is obtained by training based on a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in a long word vocabulary;

a short word decoding module, configured to decode the posterior data by using the short word vocabulary, to sequentially obtain a plurality of recognized short words corresponding to the speech information to be recognized; and

a recognition result determination module, configured to determine a speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words.

In some embodiments of the present invention, the apparatus further comprises a vocabulary mapping module, configured to establish, according to the source of each short word in the short word vocabulary, a mapping relationship between each short word in the short word vocabulary and the long words in the long word vocabulary;

The recognition result determination module is specifically configured to:

Another aspect of the present invention provides a speech recognition device, comprising a processor and a memory, wherein the memory stores computer instructions, the processor is configured to execute the computer instructions stored in the memory, and the device implements the above speech recognition method when the computer instructions are executed by the processor.

Yet another aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above speech recognition method.

The speech recognition method, apparatus, device and storage medium provided by the present invention cut the long word vocabulary used in speech recognition into a short word vocabulary and train a speech recognition model with the short word vocabulary. During speech recognition, the new speech recognition model recognizes the speech information to be recognized and determines a plurality of recognized short words corresponding to it; based on the combination of the recognized short words and the original long word vocabulary, the speech recognition result of the speech information to be recognized can be determined accurately. When the speech recognition model trained with the short word vocabulary recognizes the speech information to be recognized, modal particles and other irrelevant words in the speech information can be filtered out, so that speech recognition can be completed regardless of the user's speaking habits, which safeguards the accuracy of the speech recognition result and the success rate of speech recognition.

Additional advantages, objects and features of the present invention will be set forth in part in the description that follows, and in part will become apparent to those of ordinary skill in the art upon examination of the following, or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and the appended drawings.

Those skilled in the art will understand that the objects and advantages that can be achieved by the present invention are not limited to those specifically described above, and the above and other objects that the present invention can achieve will be understood more clearly from the following detailed description.

Brief Description of the Drawings

The drawings described here provide a further understanding of the present invention and constitute a part of this application; they do not limit the present invention. The components in the drawings are not drawn to scale and serve only to illustrate the principles of the invention. To make some parts of the invention easier to show and describe, corresponding parts in the drawings may be enlarged, that is, made larger relative to other components than they would be in an exemplary apparatus actually manufactured in accordance with the invention. In the drawings:

Fig. 1 is a schematic flow chart of the speech recognition method provided in an embodiment of this specification;

Fig. 2 is a schematic diagram of the training process of the speech recognition model in some embodiments of this specification;

Fig. 3 is a schematic flow chart of speech recognition in another embodiment of this specification;

Fig. 4 is a schematic diagram of the module structure of an embodiment of the speech recognition apparatus provided in this specification;

Fig. 5 is a block diagram of the hardware structure of the speech recognition server in an embodiment of this specification.

Detailed Description

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and the accompanying drawings. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not limit it.

It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the structures and/or processing steps closely related to the solution according to the present invention, and omit other details that have little to do with the invention.

It should be emphasized that the term "comprising/including", when used herein, refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components.

It should also be noted that, unless otherwise specified, the term "connection" herein may refer not only to a direct connection but also to an indirect connection through an intermediate.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar components, or the same or similar steps.

Speech recognition devices in general, and low-power speech recognition devices in particular, are configured with recognition words. During speech recognition, the speech input by the user is matched against the preset recognition words in order to recognize it. However, a speech recognition device may be used by all kinds of users, and some users have their own speaking habits, so what they say may not match the preset recognition words. For example, the preset recognition word is "打开空调" ("turn on the air conditioner"), but users in daily life actually say "打开一下空调" or "打开下空调" (colloquial variants of the same command with an extra filler); in that case the utterance may fail to be recognized as "打开空调". Such variations are countless: speaking habits differ from region to region and cannot be enumerated, so it is not feasible to add every variant to the vocabulary.

The embodiments of this specification provide a speech recognition method that cuts the long word vocabulary used in speech recognition into a short word vocabulary and trains a speech recognition model with the short word vocabulary. During speech recognition, the new speech recognition model recognizes the speech information to be recognized and determines a plurality of recognized short words corresponding to it; based on the combination of the recognized short words and the original long word vocabulary, the speech recognition result of the speech information to be recognized can be determined accurately. When the speech recognition model trained with the short word vocabulary recognizes the speech information to be recognized, modal particles and other irrelevant words in the speech information can be filtered out, so that speech recognition can be completed regardless of the user's speaking habits, which safeguards the accuracy of the speech recognition result and the success rate of speech recognition.

Fig. 1 is a schematic flow chart of the speech recognition method provided in an embodiment of this specification. As shown in Fig. 1, in one embodiment of the speech recognition method provided in this specification, the method may be applied in terminal devices such as computers, tablet computers, servers, smartphones and smart wearable devices, and may comprise the following steps:

Step 102: receive speech information to be recognized.

In a specific implementation, during speech recognition the user can state a request to a speech recognition device such as a smart wearable device or a smart speaker; after recognizing the user's intention, the device can output a corresponding result according to the user's request. For example, in a typical smart home the speech recognition device can be networked to control household appliances such as air conditioners and televisions: the user says "打开空调" ("turn on the air conditioner") to the device, and once the device recognizes the utterance it can turn on the air conditioner. The speech information to be recognized may be the speech spoken by the user to the speech recognition device. It may be Mandarin, a dialect, or a language of another country, and can be set according to actual needs, which is not specifically limited in the embodiments of this specification. In the embodiments of this specification the speech information to be recognized is generally Chinese.

Step 104: perform speech recognition on the speech information to be recognized by using a speech recognition model, to obtain posterior data corresponding to each frame of the speech information to be recognized; wherein the speech recognition model is obtained by training based on a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in a long word vocabulary.

In a specific implementation, after the speech information to be recognized is received, a speech recognition model can be used to perform speech recognition on it. The speech recognition model is an intelligent learning model capable of recognizing words or related information in the speech information to be recognized. The speech recognition model in the embodiments of this specification may be a neural network model; the specific model structure can be set according to actual needs and is not specifically limited in the embodiments of this specification. A neural-network speech recognition model can produce posterior data for each frame of the speech information to be recognized; the posterior data may generally be the probability that the speech corresponds to a given pronunciation, word or recognition result. The speech recognition model in the embodiments of this specification differs from existing speech recognition models in that it is trained with a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in the long word vocabulary of an existing speech recognition device. This avoids the step of configuring a new vocabulary, improves the efficiency of training the speech recognition model, and lays the data foundation for subsequent speech recognition. That is, the long words in the long word vocabulary preset in an existing speech device can be semantically cut to obtain the short words corresponding to each long word, and hence the short word vocabulary; a new speech recognition model is trained with the short word vocabulary and is then used for speech recognition. In this way, modal particles and other irrelevant words in the speech information to be recognized are filtered out, so that speech recognition can be completed regardless of the user's speaking habits, which safeguards the accuracy of the speech recognition result and the success rate of speech recognition.

In some embodiments of this specification, the training method of the speech recognition model comprises:

In a specific implementation, Fig. 2 is a schematic diagram of the training process of the speech recognition model in some embodiments of this specification. As shown in Fig. 2, the long word vocabulary preset in an existing speech recognition device is obtained first, for example the long word vocabulary used when the original speech recognition model was trained. The long words in the long word vocabulary are semantically cut, so that each long word is cut into a plurality of short words that make sense semantically. For example, the verbs and nouns in the long words can be cut out: the long word "打开空调" ("turn on the air conditioner") is cut into the two short words "打开" ("turn on") and "空调" ("air conditioner"). In practice, the long word vocabulary can be split by a semantic module or by a semantic splitting algorithm; the specific splitting method is not specifically limited in the embodiments of this specification. After the long words in the long word vocabulary are split, the resulting short words constitute the short word vocabulary. Short word speech samples are collected based on the short word vocabulary, for example by having people record, in turn, the speech corresponding to each short word in the short word vocabulary. As described in the above embodiments, the model structure of the speech recognition model can be set according to actual needs and is not specifically limited in the embodiments of this specification. After the short word speech samples are obtained, the model structure is set and the model is trained with the short word speech samples to build the speech recognition model.

It should be noted that, when the long word vocabulary is cut to obtain the short word vocabulary, only one copy of any repeated short word needs to be kept.
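
By way of non-limiting illustration, building the short word vocabulary with repeated short words kept only once may be sketched in Python as follows. The specification leaves the semantic cutting method open, so the hand-written table `SEGMENTS` below merely stands in for it; that table and the function name `build_short_vocab` are hypothetical.

```python
# Hypothetical result of semantically cutting each long word (assumption):
# a real system would use a semantic module or splitting algorithm instead.
SEGMENTS = {
    "打开空调": ["打开", "空调"],
    "关闭空调": ["关闭", "空调"],
    "打开电视": ["打开", "电视"],
}

def build_short_vocab(long_vocab, segment):
    """Cut every long word and keep each distinct short word once,
    in the order in which it is first seen."""
    short_vocab = []
    seen = set()
    for long_word in long_vocab:
        for short_word in segment(long_word):
            if short_word not in seen:     # repeated short words are kept once
                seen.add(short_word)
                short_vocab.append(short_word)
    return short_vocab

long_vocab = list(SEGMENTS)
short_vocab = build_short_vocab(long_vocab, SEGMENTS.__getitem__)
print(short_vocab)  # ['打开', '空调', '关闭', '电视']
```

In this example three long words yield four short words, because "打开" and "空调" each occur in two long words.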

The embodiments of this specification train the speech recognition model with the short word vocabulary, so that the trained model recognizes the short words in the speech information to be recognized. Modal particles and irrelevant words in the speech information are not in the short word vocabulary, so the speech recognition model does not recognize them; filler words that some users habitually use can thus be filtered out, which avoids recognition failures caused by a user's habitual words being absent from the vocabulary. By combining the recognized short words, the speech recognition result can be determined quickly and accurately, which improves the accuracy and success rate of speech recognition.

Step 106: decode the posterior data by using the short word vocabulary, to sequentially obtain a plurality of recognized short words corresponding to the speech information to be recognized.

In a specific implementation, after the speech recognition model has processed the speech information to be recognized, the model outputs posterior data for each frame. The short word vocabulary used to train the speech recognition model is then used to perform matching decoding on the posterior data, which yields the plurality of recognized short words corresponding to the speech information to be recognized. In some embodiments of this specification, the posterior data may be decoded with the Viterbi decoding algorithm, using the short word vocabulary. The Viterbi algorithm is a general dynamic programming algorithm for finding the shortest (that is, optimal) path through a sequence. With Viterbi decoding, the short words of the short word vocabulary that match the posterior data output by the speech recognition model can be determined quickly and accurately, which improves the efficiency and accuracy of speech recognition.
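
By way of non-limiting illustration, Viterbi decoding of per-frame phoneme posteriors against a short word vocabulary may be sketched in Python as follows. The specification does not fix the decoder's details, so this sketch makes several assumptions: each short word has a hypothetical phoneme sequence in `lexicon`, every frame is assigned exactly one phoneme, and a path may only stay on the current phoneme or advance to the next one. The function names and the sample posteriors are likewise hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_posterior(frame, phoneme):
    """Log-probability of a phoneme in one frame (-inf if absent)."""
    p = frame.get(phoneme, 0.0)
    return math.log(p) if p > 0 else NEG_INF

def viterbi_score(frames, phonemes):
    """Best log-probability of aligning the frames to a left-to-right
    phoneme sequence that starts on the first and ends on the last phoneme."""
    if not frames or not phonemes:
        return NEG_INF
    n = len(phonemes)
    # best[j]: best score of any path that is in phoneme j at the current frame
    best = [NEG_INF] * n
    best[0] = log_posterior(frames[0], phonemes[0])
    for frame in frames[1:]:
        new = [NEG_INF] * n
        for j in range(n):
            stay = best[j]
            advance = best[j - 1] if j > 0 else NEG_INF
            prev = max(stay, advance)      # dynamic programming step
            if prev > NEG_INF:
                new[j] = prev + log_posterior(frame, phonemes[j])
        best = new
    return best[-1]

def decode_word(frames, lexicon):
    """Return the short word with the best Viterbi score, or None."""
    scores = {word: viterbi_score(frames, phs) for word, phs in lexicon.items()}
    word = max(scores, key=scores.get)
    return word if scores[word] > NEG_INF else None

# Hypothetical phoneme sequences and posteriors, for illustration only.
lexicon = {"打开": ["d", "a", "k", "ai"], "空调": ["k", "ong", "t", "iao"]}
frames = [
    {"d": 0.9, "t": 0.1},
    {"a": 0.8, "ai": 0.2},
    {"k": 0.7, "g": 0.3},
    {"ai": 0.6, "a": 0.4},
]
word = decode_word(frames, lexicon)
```

Working in log-probabilities turns the product of frame posteriors along a path into a sum, so finding the most probable alignment is a best-path search of the kind the Viterbi algorithm solves.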

本说明书一些实施例中，所述后验数据为每一帧所述待识别语音信息对应的音素的概率；In some embodiments of this specification, the a posteriori data is the probability of a phoneme corresponding to the speech information to be recognized in each frame;

In a specific implementation, speech recognition usually begins with feature extraction on the speech information to be recognized, for example extracting its pronunciation features. The extracted features are input into the speech recognition model, and the model outputs the corresponding recognition result. As described in the above embodiments, the speech recognition model in the embodiments of this specification may be a neural network model. When a speech recognition model with such a neural network structure performs speech recognition on the speech information to be recognized, it can output the probability of the phoneme corresponding to each frame, that is, the probability of the pronunciation corresponding to each frame. A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. For example, if the speech information to be recognized is "打开一下空调" ("turn on the air conditioner for a moment"), the speech recognition model may output a probability of 90% that the phonemes corresponding to the first character "打" are "d, a". The speech corresponding to one character of the speech information to be recognized may span one or more audio frames, depending on the actual situation, which is not specifically limited in the embodiments of this specification. The speech recognition model may output one or more phonemes for each frame of the speech information to be recognized, and each phoneme has a posterior probability. Based on the ranking of these probabilities, the phonemes ranked within a specified position can be screened out for each frame, giving the screened phoneme set corresponding to that frame. The short word vocabulary is then used to perform matching decoding on the screened phoneme sets of the frames in sequence, thereby obtaining in sequence the plurality of recognized short words corresponding to the speech information to be recognized.
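The per-frame screening step can be illustrated with a minimal sketch. It assumes the model output is available as one dictionary of phoneme posteriors per frame; the function name `screen_phonemes`, the parameter `top_k`, and the probability values are all hypothetical and are not taken from the patent.

```python
from typing import Dict, List


def screen_phonemes(
    frame_posteriors: List[Dict[str, float]], top_k: int = 3
) -> List[List[str]]:
    """For each frame, keep the top_k phonemes ranked by posterior probability."""
    screened = []
    for posteriors in frame_posteriors:
        # Rank the phonemes of this frame from most to least probable.
        ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
        screened.append([phoneme for phoneme, _ in ranked[:top_k]])
    return screened


# Made-up posteriors for two frames (illustrative values only).
frames = [
    {"d": 0.90, "t": 0.06, "b": 0.03, "g": 0.01},
    {"a": 0.70, "ai": 0.15, "ao": 0.10, "e": 0.05},
]
screened_sets = screen_phonemes(frames, top_k=2)
```

Here `top_k` plays the role of the "specified ranking"; a real system would more likely operate on a dense probability matrix than on dictionaries.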

For example, the speech information to be recognized is "打开一下空调". Based on the output of the speech recognition model, the screened phoneme set of the first frame is "d, t" and that of the second frame is "a, ai, ao". Some irrelevant frames may follow. The screened phoneme set of the fifth frame is "k, g, l" and that of the sixth frame is "ai, ao, i, a". After the screened phoneme sets of these frames are combined and looked up in the short word vocabulary, the short word "打开" ("turn on") is found to match the combination "d, a", "k, a, i" drawn from the screened phoneme sets, so "打开" can be determined to be one short word corresponding to the speech information to be recognized.
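The lookup in this example can be sketched as follows. This is a deliberately simplified greedy matcher, not the Viterbi decoding mentioned in the claims: a short word matches if its phonemes can be found in order in the screened sets, skipping frames that do not contain the next expected phoneme. The lexicon, its phoneme spellings, and the "sil" filler frames are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Set


def matches(pronunciation: Sequence[str], frame_sets: Sequence[Set[str]]) -> bool:
    """True if every phoneme is found, in order, in the screened frame sets."""
    position = 0
    for candidates in frame_sets:
        # Advance only when this frame contains the next expected phoneme;
        # otherwise the frame is treated as irrelevant and skipped.
        if position < len(pronunciation) and pronunciation[position] in candidates:
            position += 1
    return position == len(pronunciation)


def decode_short_word(
    frame_sets: Sequence[Set[str]], lexicon: Dict[str, List[str]]
) -> Optional[str]:
    """Return the first short word whose pronunciation matches, else None."""
    for word, pronunciation in lexicon.items():
        if matches(pronunciation, frame_sets):
            return word
    return None


# Hypothetical short word vocabulary with phoneme sequences.
lexicon = {"空调": ["k", "ong", "t", "iao"], "打开": ["d", "a", "k", "ai"]}
frame_sets = [
    {"d", "t"},
    {"a", "ai", "ao"},
    {"sil"},  # irrelevant frames
    {"sil"},
    {"k", "g", "l"},
    {"ai", "ao", "i", "a"},
]
recognized = decode_short_word(frame_sets, lexicon)
```

With these inputs the matcher selects "打开". A production decoder would score competing paths by probability rather than accept the first match.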

In the embodiments of this specification, the speech recognition model identifies the pronunciation corresponding to each frame of the speech information to be recognized together with its posterior probability, from which the phonemes corresponding to each frame are determined. The short word vocabulary is then used to perform matching decoding on the phonemes, identifying the short words in the speech information to be recognized and laying the data foundation for the subsequent speech recognition.

Step 108: Determine the speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words.

In a specific implementation, after the plurality of short words corresponding to the speech information to be recognized have been identified, the long word vocabulary can be used to match them and thereby determine the speech recognition result. For example, the recognized short words can be combined and the combination matched against the long word vocabulary. In the embodiments of this specification, the short word vocabulary is obtained by splitting the long word vocabulary, and the speech recognition model is trained on the short word vocabulary. The short words recognized by the model therefore belong to the short word vocabulary, and every short word in the short word vocabulary can be found in the long word vocabulary. Matching the recognized short words against the long word vocabulary thus yields the speech recognition result accurately and quickly.

In some embodiments of this specification, the method further includes:

In a specific implementation, after the short word vocabulary is obtained, a mapping relationship between each short word in the short word vocabulary and each long word in the long word vocabulary can be established according to the source of each short word, that is, the long word from which the short word was split. For example, the short word vocabulary contains "打开" ("turn on"), and the long word vocabulary contains "打开空调" ("turn on the air conditioner"), "打开电视" ("turn on the TV") and "打开主卧电灯" ("turn on the master bedroom light"). All three long words contain "打开", so the short word "打开" may originate from any of them, and a mapping relationship is established between "打开" in the short word vocabulary and "打开空调", "打开电视" and "打开主卧电灯" in the long word vocabulary. After the recognized short words in the speech information to be recognized are determined, they can be combined according to their order in the speech information or according to the order in which they were recognized, giving a combined word. Using the mapping relationship of each recognized short word in the combined word, the combined word can then be matched quickly against the long word vocabulary to obtain the speech recognition result.

For example, the recognized short words determined from the speech information to be recognized include "打开" ("turn on") and "空调" ("air conditioner"), which combine into "打开空调". When matching against the long word vocabulary, both short words are found to have a mapping relationship with the long word "打开空调", and the combined word is identical to a word in the long word vocabulary, so the speech recognition result can be determined to be "打开空调" ("turn on the air conditioner").
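The mapping and matching described above might be sketched as below. The data layout (a dictionary from each long word to its short-word parts), the function names, and the particular split of "打开主卧电灯" are assumptions for illustration; the patent does not specify them.

```python
from typing import Dict, List, Optional, Set

# Hypothetical segmentation of each long word into its short words.
long_vocab: Dict[str, List[str]] = {
    "打开空调": ["打开", "空调"],
    "打开电视": ["打开", "电视"],
    "打开主卧电灯": ["打开", "主卧", "电灯"],
}


def build_mapping(vocab: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every short word to the set of long words it was split from."""
    mapping: Dict[str, Set[str]] = {}
    for long_word, parts in vocab.items():
        for short_word in parts:
            mapping.setdefault(short_word, set()).add(long_word)
    return mapping


def match_long_word(
    recognized: List[str], mapping: Dict[str, Set[str]]
) -> Optional[str]:
    """Combine the recognized short words in order and match the long vocabulary."""
    if not recognized:
        return None
    combined = "".join(recognized)
    # Only long words that every recognized short word maps to are candidates,
    # so the full long word vocabulary never has to be scanned.
    candidates = set.intersection(*(mapping.get(s, set()) for s in recognized))
    return combined if combined in candidates else None


mapping = build_mapping(long_vocab)
result = match_long_word(["打开", "空调"], mapping)
```

Intersecting the mapped sets narrows the search to long words shared by all recognized short words, which is one plausible reading of how the mapping speeds up matching.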

In the embodiments of this specification, when the long word vocabulary is split to obtain the short word vocabulary, the mapping relationship between the words of the two vocabularies is also established according to the splitting process. When the recognition result is determined, this mapping relationship allows the matching entry in the long word vocabulary to be found quickly and accurately, so the speech recognition result is obtained quickly and the efficiency of speech recognition is improved.

Figure 3 is a schematic flow chart of speech recognition in another embodiment of this specification. As shown in Figure 3, in some embodiments of this specification, determining the speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words includes:

In a specific implementation, as described in the above embodiments, the speech recognition model can first be used to recognize and decode the recognized short words in the speech information to be recognized, and the recognized short words are combined and matched against the long word vocabulary. If the matching fails, that is, no long word matching the combination of the recognized short words is found in the long word vocabulary, the speech recognition model continues to recognize and decode the speech information to be recognized and obtains a new recognized short word. The new recognized short word is combined with the previously recognized short words, and the combined word is matched against the long word vocabulary. If the matching succeeds, the speech recognition result is output. If it fails, recognition and decoding continue, and the above actions are repeated until the combined recognized short words match a long word in the long word vocabulary, at which point speech recognition is complete.
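The retry loop can be sketched as follows, assuming the decoder exposes recognized short words as an iterable stream. The function name `recognize_incrementally` is hypothetical, and the sketch omits any timeout or waiting window.

```python
from typing import Iterable, List, Optional, Set


def recognize_incrementally(
    short_word_stream: Iterable[str], long_vocab: Set[str]
) -> Optional[str]:
    """Accumulate recognized short words until their combination is a long word."""
    recognized: List[str] = []
    for short_word in short_word_stream:
        recognized.append(short_word)
        combined = "".join(recognized)
        if combined in long_vocab:
            return combined  # match succeeded: output the recognition result
        # Match failed: keep decoding and wait for the next short word.
    return None  # the audio ended without a match


long_vocab = {"打开空调", "打开电视"}
# Utterance "打开一下空调": the filler "一下" is not in the short word vocabulary,
# so the decoder is assumed never to emit it.
result = recognize_incrementally(iter(["打开", "空调"]), long_vocab)
```

Because the filler is never decoded as a short word, the combination "打开空调" still matches the long word vocabulary even though the speaker inserted extra words.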

The embodiments of this specification semantically analyze the recognition words in the long word vocabulary and cut them according to their semantics into shorter recognition words. During speech recognition, when such a cut recognition word is recognized, no recognition result is output immediately. Instead, the system waits for a certain period of time to see whether the second half of the cut word appears. If the second half appears in the subsequent audio, the recognition word as it was before cutting is output as the speech recognition result. This avoids recognition failures caused by differences in users' speaking habits, which make the speech to be recognized fail to match the vocabulary content, and it improves the accuracy and success rate of speech recognition.

As shown in Figures 1 to 3 above, the speech recognition process in the embodiments of this specification may proceed as follows:

Step 1. When the long word vocabulary is obtained, a semantic module splits it, cutting each long word into multiple short words that are semantically meaningful.

Step 2. Create a new short word vocabulary from these short words and establish a mapping relationship with the original long word vocabulary.

Step 3. Train the speech recognition model using the short word vocabulary.

Step 4. Decode the posterior data output by the neural network speech recognition model with the short word vocabulary to obtain the recognized short words.

Step 5. Record each recognized short word as it is output, combine each subsequent recognition result with the previous ones, and look up the combination in the long word vocabulary.

Step 6. Once a recognition word that meets the requirements is found in the long word vocabulary, output the final speech recognition result.
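Steps 1 and 2 can be illustrated with a small sketch. The patent does not describe how its semantic module cuts long words, so the sketch substitutes forward maximum matching against a hand-written set of units; the function `segment` and the set `units` are stand-ins, not the patent's method.

```python
from typing import List, Set


def segment(long_word: str, units: Set[str]) -> List[str]:
    """Forward maximum matching: take the longest known unit at each position."""
    parts: List[str] = []
    start = 0
    while start < len(long_word):
        for end in range(len(long_word), start, -1):
            if long_word[start:end] in units:
                parts.append(long_word[start:end])
                start = end
                break
        else:
            # No known unit starts here: keep the single character as a unit.
            parts.append(long_word[start])
            start += 1
    return parts


# Hypothetical unit inventory and long word vocabulary.
units = {"打开", "空调", "电视"}
long_words = ["打开空调", "打开电视"]

# Step 1: cut every long word. Step 2: collect the pieces into a short vocabulary.
short_vocab = sorted({part for word in long_words for part in segment(word, units)})
```

The resulting short vocabulary contains "打开", "空调" and "电视"; recording which long word each piece came from during this pass would give the mapping of step 2.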

The speech recognition method provided by the embodiments of this specification starts from an observation about existing speech recognition devices: although different people have different speaking habits, what they say still contains the preset recognition words, with some additional grammatical structure built around them. The embodiments of this specification split longer recognition words into smaller elements according to their semantics, so that the words in the vocabulary become smaller basic vocabulary elements, and recognition is also performed on these smallest word elements. In this way, the speaker's habitual filler words in the middle of an utterance can be skipped and the speaker's actual content matched against the vocabulary content, which improves the accuracy and success rate of speech recognition.

The embodiments of the above method in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. For related details, refer to the corresponding descriptions of the method embodiments.

Based on the speech recognition method described above, one or more embodiments of this specification further provide a speech recognition apparatus. The apparatus may include devices (including distributed systems), software (applications), modules, components, servers, clients and the like that use the methods described in the embodiments of this specification, combined with the necessary implementation hardware. Based on the same innovative concept, the apparatus in one or more embodiments provided by this specification is described in the following embodiments. Since the apparatus solves the problem in a manner similar to the method, the implementation of the specific apparatus in the embodiments of this specification can refer to the implementation of the aforementioned method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that realizes a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Specifically, Figure 4 is a schematic diagram of the module structure of an embodiment of the speech recognition apparatus provided in this specification. As shown in Figure 4, the apparatus provided in this specification may include:

a voice receiving module 41, configured to receive speech information to be recognized;

a posterior data acquisition module 42, configured to perform speech recognition on the speech information to be recognized by using a speech recognition model to obtain the posterior data corresponding to each frame of the speech information to be recognized, wherein the speech recognition model is obtained by training on a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in a long word vocabulary;

a short word decoding module 43, configured to decode the posterior data by using the short word vocabulary to obtain, in sequence, a plurality of recognized short words corresponding to the speech information to be recognized;

a recognition result determination module 44, configured to determine the speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words.

In some embodiments of this specification, the apparatus further includes a vocabulary mapping module, configured to establish, according to the source of each short word in the short word vocabulary, a mapping relationship between each short word in the short word vocabulary and each long word in the long word vocabulary;

The speech recognition apparatus provided by the embodiments of this specification can perform speech recognition on target text in multiple different languages and realizes cross-language transfer; languages and timbres can be flexibly combined, which improves the flexibility and accuracy of speech recognition and reduces its cost.

Some embodiments of this specification further provide a speech recognition device, including a processor and a memory, wherein computer instructions are stored in the memory and the processor is configured to execute the computer instructions stored in the memory. When the computer instructions are executed by the processor, the device implements the speech recognition method described in the above embodiments, for example:

receiving speech information to be recognized;

It should be noted that the apparatus and device described above may also include other implementations according to the description of the method embodiments. For specific implementations, refer to the descriptions of the related method embodiments, which are not repeated here.

The method embodiments provided in this specification may be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Taking execution on a server as an example, Figure 5 is a block diagram of the hardware structure of a speech recognition server in an embodiment of this specification. The computer terminal may be the speech recognition server or the speech recognition apparatus of the above embodiments. As shown in Figure 5, the server 10 may include one or more processors 100 (only one is shown in the figure; the processor 100 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a non-volatile memory 200 for storing data, and a transmission module 300 for communication functions. Those of ordinary skill in the art will understand that the structure shown in Figure 5 is only illustrative and does not limit the structure of the above electronic device. For example, the server 10 may include more or fewer components than shown in Figure 5, may include other processing hardware such as a database, a multi-level cache or a GPU, or may have a configuration different from that shown in Figure 5.

The non-volatile memory 200 can be used to store software programs and modules of application software, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of this specification. By running the software programs and modules stored in the non-volatile memory 200, the processor 100 executes various functional applications and resource data updates. The non-volatile memory 200 may include high-speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the non-volatile memory 200 may further include memory located remotely from the processor 100, and such remote memory may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission module 300 is used to receive or send data via a network. A specific example of such a network is a wireless network provided by the communication provider of the computer terminal. In one example, the transmission module 300 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission module 300 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.

Corresponding to the above method, the present invention further provides an apparatus that includes a computer device. The computer device includes a processor and a memory, wherein computer instructions are stored in the memory and the processor is configured to execute the computer instructions stored in the memory. When the computer instructions are executed by the processor, the apparatus implements the steps of the method described above.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the aforementioned speech recognition method. The computer-readable storage medium may be a tangible storage medium, such as random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a floppy disk, a hard disk, a removable storage disk, a CD-ROM, or any other form of storage medium known in the art.

Those of ordinary skill in the art will understand that the exemplary components, systems and methods described in connection with the embodiments disclosed herein can be implemented in hardware, software, or a combination of the two. Whether they are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be regarded as going beyond the scope of the present invention. When implemented in hardware, the implementation may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, and so on. When implemented in software, the elements of the present invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium, or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave.

It should be clear that the present invention is not limited to the specific configurations and processes described above and shown in the drawings. For brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method of the present invention is not limited to the specific steps described and shown; after grasping the spirit of the present invention, those skilled in the art can make various changes, modifications and additions, or change the order of the steps.

In the present invention, features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, and/or combined with or substituted for features of other embodiments.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, various modifications and changes may be made to the embodiments of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A speech recognition method, characterized in that the method comprises:

Receive voice information to be recognized;

Performing speech recognition on the speech information to be recognized by using a speech recognition model to obtain posterior data corresponding to each frame of the speech information to be recognized; wherein the speech recognition model is obtained by training based on a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in a long word vocabulary;

Decoding the posterior data by using the short word vocabulary, sequentially obtaining a plurality of recognized short words corresponding to the speech information to be recognized;

A speech recognition result of the to-be-recognized speech information is determined according to the long word vocabulary and the plurality of recognized short words.

2. The method according to claim 1, characterized in that the training method of the speech recognition model comprises:

Obtain the long word vocabulary of the original speech recognition model;

Long words in the long word vocabulary are semantically cut, each long word in the long word vocabulary is cut into a plurality of short words, and the short word vocabulary is obtained;

Collect short word speech samples based on the short word vocabulary, use the short word speech samples to perform model training, and obtain the speech recognition model.

3. The method according to claim 1, wherein the method further comprises:

Establishing, according to the source of each short word in the short word vocabulary, a mapping relationship between each short word in the short word vocabulary and each long word in the long word vocabulary;

The determining the speech recognition result of the to-be-recognized speech information according to the long word vocabulary and the plurality of recognized short words includes:

According to the recognition order corresponding to the multiple recognition short words, combine the multiple recognition short words to obtain the combined words;

According to the mapping relationship of each recognized short word in the combined word in the long word vocabulary, the combined word is matched with the long word vocabulary to obtain the speech recognition result.

4. The method according to claim 1, wherein the a posteriori data is the probability of a phoneme corresponding to the speech information to be recognized in each frame;

The decoding of the posterior data by using the short word vocabulary, and sequentially obtaining a plurality of recognized short words corresponding to the speech information to be recognized includes:

According to the ranking of the probability of the phonemes corresponding to the speech information to be recognized in each frame in the posteriori data, obtain a screened phoneme set corresponding to the speech information to be recognized in each frame within a specified ranking;

The short word vocabulary is used to sequentially perform matching decoding on the screened phoneme sets corresponding to the speech information to be recognized in each frame, and sequentially obtain a plurality of recognized short words corresponding to the speech information to be recognized.

5. The method according to claim 1, wherein said utilizing said short word vocabulary to decode said posteriori data comprises:

The a posteriori data is decoded using the short word vocabulary using a Viterbi decoding algorithm.

6. The method according to any one of claims 1-5, wherein determining the speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words includes:

Matching the plurality of recognized short words against the long word vocabulary and, if the matching fails, continuing to recognize the speech information to be recognized by using the speech recognition model to determine a new recognized short word corresponding to the speech information to be recognized;

Combining the new recognized short word with the plurality of recognized short words, and matching the combined recognized short words against the long word vocabulary, until the speech recognition result of the speech information to be recognized is determined in the long word vocabulary.

7. A speech recognition device, characterized in that the device comprises:

Voice receiving module, used for receiving the voice information to be recognized;

A posterior data acquisition module, configured to perform speech recognition on the speech information to be recognized by using a speech recognition model to obtain the posterior data corresponding to each frame of the speech information to be recognized; wherein the speech recognition model is obtained by training based on a short word vocabulary, and the short word vocabulary is obtained by semantically cutting each long word in a long word vocabulary;

A short word decoding module, configured to use the short word vocabulary to decode the posterior data, and sequentially obtain a plurality of recognized short words corresponding to the speech information to be recognized;

The recognition result determination module is configured to determine the speech recognition result of the speech information to be recognized according to the long word vocabulary and the plurality of recognized short words.

8. The device according to claim 7, characterized in that the device further comprises a vocabulary mapping module configured to: establish, according to the source of each short word in the short word vocabulary, a mapping relationship between each short word in the short word vocabulary and each long word in the long word vocabulary;

The identification result determination module is specifically used for:

9. A speech recognition device, comprising a processor and a memory, wherein computer instructions are stored in the memory and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the device implements the steps of the method according to any one of claims 1-6.

10. A computer-readable storage medium, on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method according to any one of claims 1 to 6 are realized.