
CN106157956A - Speech recognition method and device - Google Patents


Info

Publication number
CN106157956A
Authority
CN
China
Prior art keywords
user
vocabulary
voice
recognition
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510130636.2A
Other languages
Chinese (zh)
Inventor
罗炜
贾鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201510130636.2A
Priority to PCT/CN2015/079317 (published as WO2016150001A1)
Publication of CN106157956A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/24 Speech recognition using non-acoustical features

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a speech recognition method and device. The method acquires speech recognition information for the user's current speech, and also acquires auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech; the final recognition result of the user's current speech is then determined from the speech recognition information together with the auxiliary recognition information. The invention solves the problem in the related art that recognition accuracy is low when the user's speech content is obtained from the audio signal alone, and thereby improves the accuracy of speech recognition.

Description

Speech recognition method and device

Technical field

The present invention relates to the field of communications, and in particular to a speech recognition method and device.

Background

With the development of computers and related software and hardware, speech recognition technology has been applied in more and more fields, and its recognition rate keeps improving. Under favorable conditions such as a quiet environment and standard pronunciation, current speech-to-text input systems achieve recognition rates above 95%. Conventional speech recognition is therefore relatively mature; for mobile terminals, however, speech quality is poorer than in ordinary recognition scenarios, which limits recognition performance. Causes of poor speech quality include background noise at the client, noise from the client's audio capture and call equipment, noise and interference on the communication line, accented or dialectal speech, and slurred or unclear articulation by the speaker. All of these factors can degrade recognition. Because the recognition rate is affected by so many factors, no effective solution has yet been proposed for the poor user experience caused by low recognition rates in the related art. In a car, in loud noise, or with non-standard pronunciation, the recognition rate drops so sharply that the technology cannot serve its practical purpose: the low accuracy prevents precise control, and the results are unsatisfactory. If other methods could be used to assist the judgment and improve recognition accuracy, the practical value of speech recognition would increase significantly.

Human language comprehension is a multi-channel perceptual process. In everyday communication, people perceive the content of others' speech through sound; in a noisy environment, or when the other party's articulation is unclear, the listener also needs to watch the speaker's mouth shape and facial expressions to understand accurately what is being said. Current speech recognition systems ignore this visual side of language perception and rely on the auditory channel alone, so their recognition rates drop significantly in noisy or multi-speaker conditions, which reduces the practicality of speech recognition and limits its range of application.

In the related art, no effective solution has been proposed for the problem that recognition accuracy is low when the user's speech content is obtained from the audio signal alone.

Summary of the invention

The present invention provides a speech recognition method and device to solve at least the problem in the related art that recognition accuracy is low when the user's speech content is obtained from the audio signal alone.

According to one aspect of the present invention, a speech recognition method is provided, including: acquiring speech recognition information for the user's current speech, and acquiring auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech; and determining the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.

Further, determining the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information includes: acquiring one or more first candidate words corresponding to the user's current speech according to the speech recognition information; acquiring, according to the auxiliary recognition information, either a vocabulary category or one or more second candidate words corresponding to the current speech; and determining the final recognition result either from the one or more first candidate words and the vocabulary category, or from the one or more first candidate words and the one or more second candidate words.

Further, determining the final recognition result from the one or more first candidate words and the vocabulary category includes: selecting, from the one or more first candidate words, a first specific word that matches the vocabulary category, and taking that first specific word as the final recognition result of the user's current speech.

Further, determining the final recognition result from the one or more first candidate words and the one or more second candidate words includes: selecting, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and taking that second specific word as the final recognition result of the user's current speech.

Further, acquiring the auxiliary recognition information based on the user's current state corresponding to the current speech includes: acquiring an image indicating the user's current state; extracting image feature information from the image; acquiring, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to it; and taking the vocabulary category and/or the one or more candidate words as the auxiliary recognition information.

Further, acquiring the vocabulary category and/or the one or more candidate words corresponding to the image feature information includes: searching a predetermined image library for the specific image with the highest similarity to the image feature information; and obtaining, from a preset correspondence between images and vocabulary categories or candidate words, the vocabulary category or the one or more candidate words associated with that specific image.

Further, the user's current state includes at least one of the following: the user's lip movement, the user's throat vibration, the user's facial movement, and the user's hand gestures.

Further, before the speech recognition information for the user's current speech and the corresponding auxiliary recognition information are acquired, the method includes: determining that the accuracy of the final recognition result that would be obtained from the speech recognition information alone is below a predetermined threshold.

According to another aspect of the present invention, a speech recognition device is provided, including: an acquisition module, configured to acquire speech recognition information for the user's current speech and to acquire auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech; and a determination module, configured to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.

Further, the determination module includes: a first acquisition unit, configured to acquire one or more first candidate words corresponding to the user's current speech according to the speech recognition information; a second acquisition unit, configured to acquire, according to the auxiliary recognition information, a vocabulary category or one or more second candidate words corresponding to the current speech; and a determination unit, configured to determine the final recognition result either from the one or more first candidate words and the vocabulary category, or from the one or more first candidate words and the one or more second candidate words.

Further, the determination unit is also configured to select, from the one or more first candidate words, a first specific word that matches the vocabulary category, and to take that first specific word as the final recognition result of the user's current speech.

Further, the determination unit is also configured to select, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and to take that second specific word as the final recognition result of the user's current speech.

Further, the acquisition module also includes: a third acquisition unit, configured to acquire an image indicating the user's current state; a fourth acquisition unit, configured to extract image feature information from the image; and a fifth acquisition unit, configured to acquire, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to it, and to take the vocabulary category and/or the one or more candidate words as the auxiliary recognition information.

Further, the fifth acquisition unit also includes: a search subunit, configured to search a predetermined image library for the specific image with the highest similarity to the image feature information; and an acquisition subunit, configured to obtain, from a preset correspondence between images and vocabulary categories or candidate words, the vocabulary category or the one or more candidate words associated with that specific image.

Further, the user's current state includes at least one of the following: the user's lip movement, the user's throat vibration, the user's facial movement, and the user's hand gestures.

Further, the device also includes: a judgment module, configured to determine that the accuracy of the final recognition result that would be obtained from the speech recognition information alone is below a predetermined threshold.

According to another aspect of the present invention, a terminal is also provided, including a processor configured to acquire speech recognition information for the user's current speech and auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech, and to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.

Through the present invention, speech recognition information for the user's current speech is acquired, together with auxiliary recognition information based on the user's current state corresponding to that speech; the final recognition result is then determined from the speech recognition information and the auxiliary recognition information. This solves the problem in the related art that recognition accuracy is low when the user's speech content is obtained from the audio signal alone, and thereby improves the accuracy of speech recognition.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;

Fig. 2 is a structural block diagram of a speech recognition device according to an embodiment of the present invention;

Fig. 3 is a structural block diagram (1) of a speech recognition device according to an embodiment of the present invention;

Fig. 4 is a structural block diagram (2) of a speech recognition device according to an embodiment of the present invention;

Fig. 5 is a structural block diagram (3) of a speech recognition device according to an embodiment of the present invention;

Fig. 6 is a structural block diagram (4) of a speech recognition device according to an embodiment of the present invention;

Fig. 7 is a flowchart of a speech recognition processing method according to an embodiment of the present invention;

Fig. 8 is a structural block diagram of a speech recognition processing device according to an embodiment of the present invention;

Fig. 9 is a flowchart of speech recognition processing according to an embodiment of the present invention.

Detailed description

Hereinafter, the present invention is described in detail with reference to the drawings and in conjunction with embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features within them may be combined with one another.

This embodiment provides a speech recognition method. Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention; as shown in Fig. 1, the flow includes the following steps:

Step S102: acquire speech recognition information for the user's current speech, and acquire auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech;

Step S104: determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.

Through the above steps, speech recognition information for the user's current speech is acquired, along with state feature information describing the user at the moment of speaking, and the latter is used as auxiliary information for recognizing the current speech. Compared with the prior art, in which recognition relies on the current speech alone and accuracy is low, the above steps solve the problem in the related art that recognition accuracy is low when the user's speech content is obtained from the audio signal alone, and thereby improve the accuracy of speech recognition.
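The two steps above can be sketched in code. This is a minimal illustration, not the patent's implementation: the helper names (`recognize_speech`, `extract_auxiliary_info`, `fuse`), the stub candidate lists, and the membership-based fusion rule are all assumptions.

```python
# Minimal sketch of steps S102 and S104. The stub recognizers and the
# membership-based fusion rule are illustrative assumptions only.

def recognize_speech(audio):
    # S102, part 1: stand-in acoustic recognizer returning
    # (candidate word, confidence) pairs, best first.
    return [("call", 0.6), ("cold", 0.4)]

def extract_auxiliary_info(state_image):
    # S102, part 2: stand-in analyzer of the user's current state
    # (e.g. lip shape), returning auxiliary candidate words.
    return ["call"]

def fuse(speech_candidates, auxiliary_candidates):
    # S104: prefer the best acoustic candidate that the auxiliary
    # channel also supports; otherwise keep the top acoustic result.
    for word, _confidence in speech_candidates:
        if word in auxiliary_candidates:
            return word
    return speech_candidates[0][0]

def recognize(audio, state_image):
    return fuse(recognize_speech(audio), extract_auxiliary_info(state_image))

print(recognize(b"<audio>", b"<image>"))  # prints "call"
```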

Step S104 above involves determining the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information. In an optional embodiment, one or more first candidate words corresponding to the current speech are acquired according to the speech recognition information; a vocabulary category or one or more second candidate words corresponding to the current speech are acquired according to the auxiliary recognition information; and the final recognition result is determined either from the one or more first candidate words and the vocabulary category, or from the one or more first candidate words and the one or more second candidate words.

There are many ways to determine the final recognition result from the one or more first candidate words and the vocabulary category. In one optional embodiment, a first specific word that matches the vocabulary category is selected from the one or more first candidate words and taken as the final recognition result. In another optional embodiment, a second specific word with high similarity to the one or more first candidate words is selected from the one or more second candidate words and taken as the final recognition result of the user's current speech.
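The two selection strategies just described can be sketched as follows. The category labels and the use of `difflib`'s character-overlap ratio as the similarity measure are illustrative assumptions; the patent does not fix a concrete similarity metric.

```python
from difflib import SequenceMatcher

def select_by_category(first_candidates, category, category_of):
    # Strategy 1: the first acoustic candidate whose (assumed) category
    # label matches the category inferred from the auxiliary information.
    for word in first_candidates:
        if category_of.get(word) == category:
            return word
    return None

def select_by_similarity(first_candidates, second_candidates):
    # Strategy 2: the auxiliary candidate most similar to any acoustic
    # candidate; similarity here is a simple character-overlap ratio.
    def best_similarity(word):
        return max(SequenceMatcher(None, word, c).ratio() for c in first_candidates)
    return max(second_candidates, key=best_similarity)

acoustic = ["dial", "deal"]
print(select_by_category(acoustic, "telephony",
                         {"dial": "telephony", "deal": "commerce"}))  # prints "dial"
print(select_by_similarity(acoustic, ["dials", "dealt"]))
```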

In the above process of determining the final recognition result from the first and second candidate words, in an optional embodiment, an image indicating the user's current state is first acquired; image feature information is then extracted from the image; a vocabulary category and/or one or more candidate words corresponding to the image feature information are acquired from it; and the vocabulary category and/or the one or more candidate words are taken as the auxiliary recognition information.

In an optional embodiment, the specific image with the highest similarity to the image feature information is looked up in a predetermined image library, and the vocabulary category or one or more candidate words corresponding to that specific image are obtained from a preset correspondence between images and vocabulary categories or candidate words. In this way, the vocabulary category and/or one or more candidate words corresponding to the image feature information are acquired.
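A hypothetical sketch of this lookup follows. Representing image features as plain vectors, using cosine similarity, and the example library contents are all assumptions; the patent specifies neither the feature extraction nor the similarity measure.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup_vocabulary(features, image_library, image_to_vocab):
    # Find the stored image whose features are most similar to the
    # extracted features, then return its preset vocabulary mapping.
    best_image = max(image_library, key=lambda name: cosine(features, image_library[name]))
    return image_to_vocab[best_image]

library = {"lip_open": [0.9, 0.1], "lip_round": [0.2, 0.8]}           # assumed feature library
mapping = {"lip_open": ["ah", "start"], "lip_round": ["oh", "stop"]}  # assumed preset correspondence

print(lookup_vocabulary([0.8, 0.2], library, mapping))  # prints ['ah', 'start']
```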

The user's current state can take several forms, illustrated by example below. In an optional embodiment, it includes the user's lip movement, the user's throat vibration, the user's facial movement, and the user's hand gestures. The information listed here is illustrative only and not limiting. In real life, for example, what a speaker says can be recognized from lip movements alone; lip reading is therefore an important auxiliary cue for speech recognition.

In an optional embodiment, before the speech recognition information for the user's current speech and the corresponding auxiliary recognition information are acquired, it is determined that the accuracy of the final recognition result that would be obtained from the speech recognition information alone is below a predetermined threshold.
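This pre-check can be sketched as a simple confidence gate. The threshold value of 0.8 and the function names are illustrative assumptions; the patent only requires some predetermined threshold.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value; the patent only requires "a predetermined threshold"

def recognize_with_gate(audio_result, audio_confidence, auxiliary_recognizer):
    # If the audio-only result is confident enough, use it directly;
    # only otherwise is the auxiliary channel consulted.
    if audio_confidence >= CONFIDENCE_THRESHOLD:
        return audio_result
    return auxiliary_recognizer()

print(recognize_with_gate("hello", 0.95, lambda: "hollow"))  # prints "hello"
print(recognize_with_gate("hello", 0.40, lambda: "hollow"))  # prints "hollow"
```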

This embodiment also provides a speech recognition device, which implements the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" can be a combination of software and/or hardware that realizes a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 2 is a structural block diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 2, the device includes: an acquisition module 22, configured to acquire speech recognition information for the user's current speech and to acquire auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech; and a determination module 24, configured to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.
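The two-module layout of Fig. 2 could be mirrored in code roughly as follows. The class names, the stub candidate data, and the membership-based fusion rule are illustrative assumptions; only the decomposition into an acquisition module and a determination module follows the text.

```python
class AcquisitionModule:
    """Mirror of module 22: acquires speech info and auxiliary info."""
    def acquire(self, audio, state_image):
        speech_info = ["call", "cold"]   # stub acoustic candidates
        auxiliary_info = ["call"]        # stub auxiliary candidates
        return speech_info, auxiliary_info

class DeterminationModule:
    """Mirror of module 24: fuses both inputs into the final result."""
    def determine(self, speech_info, auxiliary_info):
        for word in speech_info:
            if word in auxiliary_info:
                return word
        return speech_info[0]

class SpeechRecognitionDevice:
    def __init__(self):
        self.acquisition = AcquisitionModule()
        self.determination = DeterminationModule()

    def recognize(self, audio, state_image):
        return self.determination.determine(*self.acquisition.acquire(audio, state_image))

print(SpeechRecognitionDevice().recognize(b"<audio>", b"<image>"))  # prints "call"
```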

Fig. 3 is a structural block diagram (1) of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 3, the determination module 24 includes: a first acquisition unit 242, configured to acquire one or more first candidate words corresponding to the user's current speech according to the speech recognition information; a second acquisition unit 244, configured to acquire, according to the auxiliary recognition information, a vocabulary category or one or more second candidate words corresponding to the current speech; and a determination unit 246, configured to determine the final recognition result either from the one or more first candidate words and the vocabulary category, or from the one or more first candidate words and the one or more second candidate words.

Optionally, the determination unit 246 is also configured to select, from the one or more first candidate words, a first specific word that matches the vocabulary category, and to take it as the final recognition result of the user's current speech.

Optionally, the determination unit 246 is also configured to select, from the one or more second candidate words, a second specific word with high similarity to the one or more first candidate words, and to take it as the final recognition result of the user's current speech.

Fig. 4 is a structural block diagram (2) of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 4, the acquisition module 22 also includes: a third acquisition unit 222, configured to acquire an image indicating the user's current state; a fourth acquisition unit 224, configured to extract image feature information from the image; and a fifth acquisition unit 226, configured to acquire, according to the image feature information, a vocabulary category and/or one or more candidate words corresponding to it, and to take them as the auxiliary recognition information.

Fig. 5 is a structural block diagram (3) of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 5, the fifth acquisition unit 226 also includes: a search subunit 2262, configured to search a predetermined image library for the specific image with the highest similarity to the image feature information; and an acquisition subunit 2264, configured to obtain, from a preset correspondence between images and vocabulary categories or candidate words, the vocabulary category or the one or more candidate words associated with that specific image.

Optionally, the user's current state includes at least one of the following: the user's lip movement, the user's throat vibration, the user's facial movement, and the user's hand gestures.

Fig. 6 is a structural block diagram (4) of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 6, the device also includes: a judgment module 26, configured to determine that the accuracy of the final recognition result that would be obtained from the speech recognition information alone is below a predetermined threshold.

According to another aspect of the present invention, a terminal is also provided, including a processor configured to obtain the speech recognition information of the user's current speech, to obtain auxiliary recognition information for that speech recognition information based on the user's current state corresponding to the current speech, and to determine the final recognition result of the user's current speech according to the speech recognition information and the auxiliary recognition information.

It should be noted that each of the above modules may be implemented in software or in hardware. In the latter case, this may be done in the following manner, though not limited to it: all of the above modules reside in the same processor; or the modules reside separately in a first processor, a second processor, a third processor, and so on.

In view of the above problems in the related art, a description is given below with reference to specific optional embodiments, which combine the optional embodiments described above and their optional implementations.

This optional embodiment provides a speech recognition processing method and device to solve the poor user experience caused by low speech recognition rates in the related art. To overcome the above shortcomings and deficiencies of the prior art, the purpose of this optional embodiment is to provide an intelligent speech recognition method and device based on auxiliary interaction modes: speech recognition provides the basic signal, and lip-shape recognition, face recognition, gesture recognition, throat vibration recognition, and the like are used alongside it as auxiliary signals. By exploiting the strengths of each technique in its own domain, the modules complement one another while remaining relatively independent, greatly improving the speech recognition rate. Preferably, whether auxiliary recognition is added can be decided by the speech recognition result itself: when the likelihood of the speech recognition result falls below a threshold, auxiliary data is brought in. This matches the human language cognition process, which is a multi-channel perception process: the terminal perceives the content of speech primarily through sound, and accurately understands what is said with the help of recognizing the speaker's mouth shape, facial changes, and so on.

According to one aspect of this optional embodiment, a speech recognition processing method is provided. On the basis of audio data acquired by an audio sensor as the basic signal for speech recognition, motion images of the human body are captured by the terminal device's camera or external sensors, including gesture motion, facial motion, throat vibration, and lip shape. These are parsed by integrated image algorithms and a motion processing chip and serve as auxiliary signals for speech recognition. The recognition results of the basic and auxiliary signals are processed comprehensively by the terminal, which then executes the corresponding operation. The auxiliary-signal recognition results are accumulated with the basic-signal recognition results to form a unified recognition result, assisting the speech recognition and improving the recognition rate.

Gesture motion, facial motion, throat vibration, and lip-shape recognition are combined; each modality is organically integrated through feature extraction, template training, template classification, and a decision process. A logical decision sequence is used in which the speech signal is first analyzed and confirmed as the basic signal, and the auxiliary signals then provide supplementary judgment, effectively reducing the probability of recognition errors caused by noise and external sound interference. In the auxiliary recognition process, feature data is collected by sensors and cameras, features are extracted and matched against preset template library data in a series of judgment and recognition steps, and the results are then compared with the corresponding recognition features to identify possible candidate words in the speech recognition model's lexicon.

Optionally, the above lip-shape recognition captures lip images of the speaker with a camera, performs image processing on them, dynamically extracts lip-shape features in real time, and then determines the speech content with a lip-shape pattern recognition algorithm. A judgment method combining lip shape and lip color is used to locate the lips accurately, and a suitable lip-shape matching algorithm is used for recognition.

Optionally, the above lip-shape recognition extracts lip-image features from the preprocessed video data and uses them to identify changes in the current user's mouth shape; detecting the user's mouth motion to recognize lip shapes improves recognition efficiency and accuracy. The mouth-motion feature maps are classified to obtain classification information, and each feature type of mouth-motion feature map corresponds to several vocabulary categories. The information acquired by lip-shape recognition, after a series of processing steps such as denoising and analog-to-digital (A/D) conversion, is compared against the template library data preset in the image/speech recognition processing module: its similarity to all pre-sampled mouth-motion feature maps is computed, and the vocabulary categories corresponding to the feature map with the highest similarity are read out.
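The highest-similarity lookup against the template library described above can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the matching metric over a toy template library; the feature vectors, category names, and metric are all hypothetical and not part of the patent's specification.

```python
import math

# Hypothetical template library: each entry pairs a pre-sampled mouth-motion
# feature vector with the vocabulary categories it maps to. Vectors, names,
# and the cosine-similarity metric are illustrative assumptions.
TEMPLATE_LIBRARY = [
    ([0.9, 0.1, 0.3], ["app names"]),
    ([0.2, 0.8, 0.5], ["call actions"]),
]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def lookup_vocab_categories(lip_features):
    """Return the vocabulary categories of the most similar stored template."""
    best_template, best_categories = max(
        TEMPLATE_LIBRARY,
        key=lambda entry: cosine_similarity(lip_features, entry[0]),
    )
    return best_categories
```

For a feature vector close to the first template, the lookup returns that template's vocabulary categories; in practice the library would hold the pre-sampled feature maps described in the text.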

Optionally, the above throat vibration recognition captures the speaker's throat vibration pattern with an external sensor, processes the vibration pattern, dynamically extracts vibration-pattern features in real time, and then determines the speech content with a vibration-pattern recognition algorithm.

Optionally, before throat vibration recognition is performed for a user, the user's throat vibration motion feature maps must first be sampled, and a separate throat-vibration profile is built for each user. When pre-sampling a user's throat vibration motion feature maps, the feature map for a single syllable or for a whole word may be sampled. Speech events with different pronunciations produce different throat vibration motions. Since the speech events a user utters are correlated with one another, after the throat vibration has been recognized, contextual error-correction techniques are used to verify the recognized vibration, reducing recognition errors among feature maps of the same category and further improving the accuracy of throat vibration recognition.

Optionally, the above throat vibration recognition extracts throat-vibration image features from the preprocessed vibration data and uses them to identify changes in the current user's throat vibration; detecting the user's throat vibration motion to recognize vibrations improves recognition efficiency and accuracy. The throat-vibration motion feature maps are classified to obtain classification information, and each feature type of throat-vibration motion feature corresponds to several vocabulary categories. The information acquired by throat vibration recognition is compared against the template library data preset in the image/speech recognition processing module: its similarity to all pre-sampled throat-vibration motion feature maps is computed, and the vocabulary categories corresponding to the feature map with the highest similarity are read out.

The above face recognition is used to extract the user's facial features from the video data and to determine the user's identity and position. The facial muscles also follow different motion patterns when speaking; by capturing the facial muscle movements, the corresponding muscle action patterns can be identified from the signal features, thereby assisting the recognition of the speech information.

According to one aspect of this optional embodiment, a speech recognition processing device is also provided, including a basic signal module, an auxiliary signal module, and a signal processing module.

The basic signal module is a traditional speech recognition module, which recognizes the preprocessed audio data acquired through an audio sensor. Its recognition targets include the recognition of isolated words and the recognition of continuous large-vocabulary speech; the former is mainly used to determine control commands, the latter mainly for text input. The present invention mainly takes the recognition of isolated words as an example; the recognition of continuous large-vocabulary speech uses the same processing.

Optionally, the audio sensor is a microphone array or a directional microphone. Because various forms of noise interference exist in the environment, and existing audio acquisition based on ordinary microphones is equally sensitive to the user's voice and to environmental noise, with no ability to distinguish the two, the accuracy of voice-command operation is easily degraded. A microphone array or directional microphone overcomes this: sound-source localization and speech enhancement algorithms track the operating user's voice and enhance its signal, suppress the influence of ambient noise and interfering voices, and improve the signal-to-noise ratio of the system's speech audio input, ensuring reliable data quality for the back-end algorithms.

The auxiliary signal module includes a front camera, an audio sensor, and a throat vibration sensor, and is used to acquire video data, audio data, and motion data.

Optionally, the throat vibration sensor is integrated into a wearable device, positioned in contact with the user's throat to detect the voice vibration produced by the user. One temperature sensor is placed on the inside of the wearable device and another on its outside; by comparing the temperatures detected by the two sensors, the microprocessor determines whether the device is being worn. When not worn, the device automatically enters a sleep mode to reduce its overall power consumption. The microprocessor monitors the vibration sensor's state to detect and recognize the voice command issued by the user, and sends the command via Bluetooth to the device to be controlled, which executes it.
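The wear-detection logic described above can be sketched as a simple comparison between the two temperature readings; the 2.0 degree margin below is an illustrative assumption, since the text does not give a value.

```python
# Wear detection for the wearable described above: the skin-side sensor reads
# noticeably warmer than the outward-facing one when the device is worn.
# The 2.0 degree-Celsius margin is an illustrative assumption.
WEAR_MARGIN_C = 2.0

def is_worn(inner_temp_c, outer_temp_c):
    return (inner_temp_c - outer_temp_c) >= WEAR_MARGIN_C

def power_mode(inner_temp_c, outer_temp_c):
    """Sleep automatically when the device is not being worn, saving power."""
    return "active" if is_worn(inner_temp_c, outer_temp_c) else "sleep"
```

For example, an inner reading of 34.0 against an outer reading of 26.0 indicates the device is worn, while near-equal readings put it to sleep.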

The signal processing unit includes a lip-shape recognition module, a face recognition module, a vibration recognition module, a gesture recognition module, a speech recognition module, and a score-adjustment module. It recognizes the basic signal (the speech signal) and the auxiliary signals, selecting the basic signal as the primary voice information and the auxiliary signals as supplementary voice information.

A logical decision sequence is used in which the basic signal (the speech signal) is first analyzed and confirmed, and the auxiliary signals then provide supplementary judgment. In the recognition process, the several words with the highest likelihood scores from speech-signal recognition are selected as candidate words, and for each candidate word a multi-level set of related words is generated from a predetermined vocabulary. The auxiliary voice information produced by the auxiliary signals is used to raise the scores, in the speech recognition model's lexicon, of the candidate words and of the related words in each related-word set. Once the basic signal and all auxiliary signals have been processed, the candidate word or related word with the highest score is selected as the recognition result.
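The score-adjustment step described above can be sketched as follows: candidate words from the speech recognizer receive a score boost when they fall into a vocabulary category returned by auxiliary recognition. The boost value, the category map, and the sample data are illustrative assumptions, not values from the text.

```python
# Hypothetical word-to-category map and boost value: a candidate whose
# vocabulary category matches one identified by the auxiliary signals gets
# its likelihood score raised.
CATEGORY_OF = {"Contacts": "app names", "Call": "call actions"}
BOOST = 0.1

def adjust_scores(candidates, aux_categories):
    """candidates: word -> likelihood score from speech recognition."""
    return {
        word: score + (BOOST if CATEGORY_OF.get(word) in aux_categories else 0.0)
        for word, score in candidates.items()
    }

def pick_result(candidates):
    """Select the highest-scoring candidate once all signals are processed."""
    return max(candidates, key=candidates.get)
```

With candidates `{"Contacts": 0.9, "Call": 0.9}` and the auxiliary category `"app names"`, Contacts is boosted to 1.0 and selected as the result.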

The above lip-shape recognition module is used to extract lip-image features from the preprocessed video data and to identify changes in the current user's mouth shape from the lip-shape information.

The above face recognition module is used to extract the user's facial features from the video data and to determine the user's identity and position. Identifying the identities of different registered users mainly benefits the customization of personalized operation of the whole device, such as the granting of different control rights. The user's position information can assist gesture recognition in determining the operating area of the user's hand and the user's direction during voice operation, so as to raise the microphone's audio input gain toward the user. When there are multiple possible users, this module can locate all faces, judge all user identities, and process them separately, determining which user in the camera's field of view is to be granted control.

The above gesture recognition module is used to extract gesture information from the preprocessed video data, determining the hand shape, the hand's motion trajectory, and the hand's coordinates in the image, then tracking any hand shape and analyzing the hand's contour in the image; the user obtains activation and control of the whole terminal through specific gestures or actions.

Through this optional embodiment, the existing forms of human-computer interaction technology, including gesture recognition, throat vibration recognition, speech recognition, face recognition, and lip-shape recognition, are fused together: speech recognition provides the basic signal, while lip-shape recognition, face recognition, gesture recognition, throat vibration recognition, and the like serve as auxiliary signals for score adjustment of the speech recognition candidate words. A logical decision sequence, in which the basic signal (the speech signal) is first analyzed and confirmed and the auxiliary signals then provide supplementary judgment, exploits the strengths of each technique in its own domain, with the modules complementing one another while remaining relatively independent. Lip-shape information identifies changes in the current user's mouth shape, and on that basis the misjudgment rate of voice recognition operations is lowered, ensuring that voice operation is still recognized correctly in noisy environments. The position information produced by the face recognition module can assist gesture recognition in determining the operating area of the user's hand and the user's direction during voice operation, raising the microphone's audio input gain toward the user. The influence of noise is thereby overcome, the speech recognition rate is significantly improved, and the results are converted into the corresponding commands, improving the stability of terminal speech recognition and the comfort of operation.

The steps shown in the flowcharts of the accompanying drawings may be executed in a user terminal such as a smartphone or tablet, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

This embodiment provides a speech recognition processing method. FIG. 7 is a flowchart of a speech recognition processing method according to an embodiment of the present invention. As shown in FIG. 7, the flow includes:

Step S702: the voice information acquired by the audio sensor is recognized as the basic signal;

Step S704: lip-shape recognition, face recognition, vibration recognition, and gesture recognition are processed as auxiliary signals, and the recognition result of the basic signal is score-adjusted.

Speech recognition targets include the recognition of isolated words and the recognition of continuous large-vocabulary speech; the former is mainly used to determine control commands, the latter mainly for text input. This embodiment takes the recognition of isolated words as an example; the recognition of continuous large-vocabulary speech uses the same processing. Through the above steps, a logical decision sequence is applied in which the basic signal (the speech signal) is first analyzed and confirmed and the auxiliary signals then provide supplementary judgment. The several words with the highest likelihood scores from speech-signal recognition are selected as candidate words, and for each candidate word a multi-level set of related words is generated from a predetermined vocabulary. The candidate-word category with the highest likelihood score produced by auxiliary-signal recognition serves as the auxiliary information: the candidate words recognized from the basic signal are judged in turn, and if a candidate matches the category recognized from the auxiliary signals, the scores of that candidate and of the related words in its related-word set are raised in the speech recognition model's lexicon. Once the basic signal and all auxiliary signals have been processed, the candidate word or related word with the highest score is selected as the recognition result.

In a specific implementation, lip-shape recognition, face recognition, vibration recognition, and gesture recognition are processed as auxiliary signals; the recognition modes are mutually independent, and one or more of them may be used simultaneously as auxiliary signal inputs.

This embodiment also provides a device corresponding to the method in the above embodiment; what has already been described is not repeated here. The modules or units in the device may be code stored in a memory or user terminal and executed by a processor, or may be implemented in other ways, which are not enumerated here one by one.

According to one aspect of the present invention, a speech recognition processing device is also provided. FIG. 8 is a structural block diagram of a speech recognition processing device according to an embodiment of the present invention. As shown in FIG. 8, the device includes:

a basic signal module, including an audio sensor, which is a traditional speech recognition module that recognizes the preprocessed audio data acquired through the audio sensor;

an auxiliary signal module, including a front camera and a throat vibration sensor, which acquires video data, audio data, and motion data, covering lip-shape recognition, face recognition, throat vibration recognition, gesture recognition, and the like;

a signal processing module, including a lip-shape recognition module, a face recognition module, a vibration recognition module, a gesture recognition module, a speech recognition module, and a score-adjustment module, which recognizes the basic signal (the speech signal) and the auxiliary signals, selecting the basic signal as the primary voice information and using the auxiliary signals as auxiliary information for score adjustment;

the above lip-shape recognition module is used to extract lip-image features from the preprocessed video data and to identify changes in the current user's mouth shape from the lip-shape information;

the above face recognition module is used to extract the user's facial features from the video data and to determine the user's identity and position; identifying the identities of different registered users mainly benefits the customization of personalized operation of the whole device, such as the granting of different control rights;

the above gesture recognition module is used to extract gesture information from the preprocessed video data, determining the hand shape, the hand's motion trajectory, and the hand's coordinates in the image, then tracking any hand shape and analyzing the hand's contour in the image; the user obtains activation and control of the whole terminal through specific gestures or actions.

FIG. 9 is a flowchart of a speech recognition processing method according to the present invention. As shown in FIG. 9, the speech recognition method of this embodiment is as follows:

Step S902: voice information is acquired from the audio sensor, and video data and motion data are acquired from the front camera and the throat vibration sensor, covering lip-shape, face, throat-vibration, gesture, and similar information;

Step S904: taking the speech recognition of isolated words as an example, the speech signal is recognized and confirmed as the basic signal, and recognizing the isolated word yields the several most likely words as candidate words;

Step S906: the terminal device's camera or external sensors capture motion images of the human body, including gesture motion, facial motion, throat vibration, and lip shape, as auxiliary signals; these are analyzed and confirmed to obtain the candidate-word category with the highest likelihood score;

Step S908: the candidate words recognized from the basic signal are judged in turn, and if a candidate matches the candidate-word category recognized from the auxiliary signals, its score in the speech recognition model's lexicon is raised;

Step S910: once the basic signal and the auxiliary signals have all been processed, the candidate word with the highest score is selected as the recognition result.

This optional embodiment is illustrated below with a specific example. Suppose that recognizing the owner's voice yields the following result:

"Please (0.6) Contacts (0.9) Call (0.9) Browser (0.7)", where the value in parentheses is the likelihood score: the larger the score, the greater the likelihood. The words with the highest likelihood scores are selected as candidate words; for example, Contacts (0.9) and Call (0.9) are selected as the speech recognition result.

At the same time, gesture motion, facial motion, throat vibration, and lip-shape recognition are used in combination, or only one or more of them are used, as auxiliary signals for recognition, yielding the candidate-word category with the highest likelihood score.

The Contacts (0.9) and Call (0.9) recognized from the speech signal are judged in turn against the candidate-word category recognized from the auxiliary signals. Suppose Contacts matches the category: its likelihood score is then raised, for example updated to Contacts (1.0) Call (0.9).

Once the basic speech signal and the auxiliary signals have all been processed, the candidate word with the highest score, Contacts (1.0), is selected as the recognition result.

As an optional variant of this embodiment, a logical decision sequence may be used in which auxiliary-signal recognition first determines the candidate-word category and the speech signal is then analyzed and confirmed as the basic signal. First, gesture motion, facial motion, throat vibration, and lip-shape recognition are used in combination, or only one or more of them are used, as auxiliary signals for recognition; when multiple modes are used, the results of each mode are accumulated to obtain the candidate-word category with the highest likelihood score. On this basis, combined with the speech recognition result, the word with the highest likelihood score is selected as the final recognition result. This scheme is illustrated below with a specific example. Suppose that recognizing the owner's voice yields the following result:

"Please (0.6) Contacts (0.9) Call (0.9) Browser (0.7)", where the value in parentheses is the likelihood score. The words with the highest likelihood scores are selected as candidate words; for example, Contacts (0.9) and Call (0.9) are selected as the speech recognition result.

Throat vibration recognition and lip-shape recognition, performed simultaneously, are combined as auxiliary signals. Suppose throat vibration recognition comes first: the Contacts (0.9) and Call (0.9) recognized from the basic signal are judged in turn against the candidate-word category recognized from throat vibration. Suppose Contacts matches that category: its likelihood score is raised, for example updated to Contacts (1.0) Call (0.9). Lip-shape recognition then continues from the previous result, judging Contacts (1.0) and Call (0.9) in turn against the candidate-word category from lip-shape recognition. Suppose Contacts matches that category as well: its likelihood score is raised again, for example updated to Contacts (1.1) Call (0.9). The recognition results of the two modes have thus been accumulated.
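The sequential accumulation in the example above can be sketched as follows: each auxiliary stage that agrees with a candidate adds to its score, so agreements accumulate (0.9 to 1.0 to 1.1). The boost value of 0.1 is inferred from the example's numbers; the stage representation is an illustrative assumption.

```python
# Apply several auxiliary signals in sequence, accumulating a boost for each
# stage whose recognized vocabulary category matches the candidate word.
BOOST = 0.1

def apply_auxiliary_stage(scores, matching_words):
    """matching_words: candidates that fall into this stage's recognized category."""
    return {w: s + (BOOST if w in matching_words else 0.0) for w, s in scores.items()}

def recognize(scores, stages):
    """stages: per-stage sets of matching candidates, e.g. throat then lips."""
    for matching_words in stages:
        scores = apply_auxiliary_stage(scores, matching_words)
    return max(scores, key=scores.get), scores
```

For the example above, `recognize({"Contacts": 0.9, "Call": 0.9}, [{"Contacts"}, {"Contacts"}])` selects Contacts with an accumulated score of 1.1.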

After the basic voice signal and the auxiliary signals have all been processed, the candidate with the highest score, contacts (1.1), is selected as the recognition result.
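The score-accumulation fusion in the worked example above can be sketched in Python. This is an illustrative sketch, not code from the patent: the function name and the fixed +0.1 boost per matching modality are assumptions that mirror the example's score updates, and "contacts" stands in for the business-card-holder candidate.

```python
# Illustrative sketch of auxiliary-signal score accumulation (not patent code).

def fuse(asr_candidates, auxiliary_matches, boost=0.1):
    """asr_candidates: {word: likelihood score from the speech recognizer}.
    auxiliary_matches: one set per auxiliary modality (throat vibration,
    lip shape, ...) holding the words that modality found plausible."""
    scores = dict(asr_candidates)
    for plausible in auxiliary_matches:       # apply each modality in turn
        for word in scores:
            if word in plausible:             # candidate fits this modality
                scores[word] += boost         # accumulate its score
    best = max(scores, key=scores.get)        # highest score wins at the end
    return best, round(scores[best], 1)

# The worked example: ASR keeps "contacts" (0.9) and "call" (0.9); both
# throat vibration and lip shape recognition agree with "contacts".
word, score = fuse({"contacts": 0.9, "call": 0.9},
                   [{"contacts"}, {"contacts"}])
print(word, score)  # contacts 1.1
```

Each modality contributes an independent boost, so a candidate confirmed by several auxiliary signals pulls ahead of one confirmed by none, exactly as in the contacts-vs-call example.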

As an optional implementation of this embodiment, the further screening is performed through score adjustment: the scores of candidate words that match the auxiliary-signal recognition may be increased, and the scores of candidate words that do not match may be decreased. After the basic signal and the auxiliary signals have all been processed, the candidate word with the highest score is selected as the recognition result.
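This bidirectional variant can be sketched as a single pass that rewards matching candidates and penalizes non-matching ones. The function name and the delta values below are illustrative assumptions, not values from the patent.

```python
# Illustrative sketch: reward candidates that match an auxiliary signal,
# penalize those that do not (deltas are assumptions, not from the patent).

def adjust(scores, plausible, reward=0.1, penalty=0.1):
    """scores: {word: current score}; plausible: words the auxiliary
    modality considers possible. Returns the adjusted scores."""
    return {word: round(s + (reward if word in plausible else -penalty), 2)
            for word, s in scores.items()}

adjusted = adjust({"contacts": 0.9, "call": 0.9}, {"contacts"})
print(adjusted)  # {'contacts': 1.0, 'call': 0.8}
```

Compared with pure accumulation, penalizing mismatches widens the gap between candidates, which can make the final argmax decision more robust when the raw ASR scores are close.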

As an optional implementation of this embodiment, the use of auxiliary information to confirm the recognition result, added in order to improve recognition accuracy, is optional for the user. The speech recognizer determines a recognition result from the input speech and computes a likelihood measure for it. If the likelihood measure is below a threshold, the user is prompted to input auxiliary data, or auxiliary-data recognition is enabled automatically. If the likelihood measure is above the threshold, the user is prompted to turn off auxiliary data, or auxiliary-data recognition is disabled automatically. The specific value of the threshold is not limited; it may be derived empirically or from user experience.
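The confidence gate described above amounts to a simple comparison against a tunable threshold. In this sketch the threshold value and the function name are assumptions; the text deliberately leaves the concrete value to empirical tuning or user experience.

```python
# Illustrative sketch of the optional confidence gate (threshold value is
# an assumption; the patent leaves it to empirical tuning).

AUX_THRESHOLD = 0.8

def auxiliary_needed(likelihood, threshold=AUX_THRESHOLD):
    """True when the speech-only likelihood is too low, i.e. the user
    should be prompted for auxiliary data (or it is enabled automatically);
    False when the result is trusted and auxiliary input can stay off."""
    return likelihood < threshold

print(auxiliary_needed(0.6))   # True  -> enable lip/throat/gesture input
print(auxiliary_needed(0.95))  # False -> auxiliary recognition can be off
```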

The speech recognition method improved on in the above embodiments fuses various existing forms of human-computer interaction technology, including gesture recognition, throat vibration recognition, speech recognition, face recognition and lip shape recognition. Speech recognition serves as the basic signal, while lip shape recognition, face recognition, gesture recognition, throat vibration recognition and the like serve as auxiliary signals for adjusting the scores of the speech recognition candidates. The logical judgment sequence of first analyzing and confirming the basic signal (the voice signal) and then making auxiliary judgments with the auxiliary signals improves both the stability of terminal speech recognition and the comfort of operation.

In summary, in the speech recognition processing method and device provided by the present invention, speech recognition is used as the basic signal, supplemented by lip shape recognition, face recognition, gesture recognition, throat vibration recognition and the like as auxiliary signals. This solves the problem in the related art of poor user experience caused by a low speech recognition rate. By exploiting the strengths of each technology in its own field of application, with the modules remaining relatively independent yet fused with one another, the speech recognition rate is greatly improved.

In another embodiment, software is also provided, which is used to execute the technical solutions described in the above embodiments and preferred implementations.

In another embodiment, a storage medium storing the above software is also provided. The storage medium includes, but is not limited to, an optical disc, a floppy disk, a hard disk, a rewritable memory, and the like.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above may be implemented with a general-purpose computing device; they may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that given here; alternatively, they may be fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (17)

1. A method for speech recognition, comprising: acquiring speech recognition information of a current voice of a user, and acquiring auxiliary recognition information for the speech recognition information based on a current state of the user corresponding to the current voice of the user; and determining a final recognition result of the current voice of the user according to the speech recognition information and the auxiliary recognition information.
2. The method according to claim 1, wherein determining the final recognition result of the current voice of the user according to the speech recognition information and the auxiliary recognition information comprises: acquiring, according to the speech recognition information, one or more first candidate words corresponding to the current voice of the user; acquiring, according to the auxiliary recognition information, a word category or one or more second candidate words corresponding to the current voice of the user; and determining the final recognition result of the current voice of the user according to the one or more first candidate words and the word category, or determining the final recognition result of the current voice of the user according to the one or more first candidate words and the one or more second candidate words.
3. The method according to claim 2, wherein determining the final recognition result of the current voice of the user according to the one or more first candidate words and the word category comprises: selecting, from the one or more first candidate words, a first specific word that matches the word category, and taking the first specific word as the final recognition result of the current voice of the user.
4. The method according to claim 2, wherein determining the final recognition result of the current voice of the user according to the one or more first candidate words and the one or more second candidate words comprises: selecting, from the one or more second candidate words, a second specific word having a high similarity to the one or more first candidate words, and taking the second specific word as the final recognition result of the current voice of the user.
5. The method according to claim 1, wherein acquiring the auxiliary recognition information for the speech recognition information based on the current state of the user corresponding to the current voice of the user comprises: acquiring an image indicating the current state of the user; acquiring image feature information from the image; and acquiring, according to the image feature information, a word category and/or one or more candidate words corresponding to the image feature information, and taking the word category and/or the one or more candidate words as the auxiliary recognition information.
6. The method according to claim 5, wherein acquiring, according to the image feature information, the word category and/or the one or more candidate words corresponding to the image feature information comprises: searching a predetermined image library for a specific image having the highest similarity to the image feature information; and acquiring, according to a preset correspondence between images and word categories or one or more candidate words, the word category or the one or more candidate words corresponding to the specific image.
7. The method according to any one of claims 1 to 6, wherein the current state of the user comprises at least one of the following: a lip movement state of the user, a throat vibration state of the user, a facial movement state of the user, and a gesture movement state of the user.
8. The method according to any one of claims 1 to 7, further comprising, before acquiring the speech recognition information of the current voice of the user and acquiring the auxiliary recognition information for the speech recognition information based on the current state of the user corresponding to the current voice of the user: determining that an accuracy rate of a final recognition result of the current voice of the user determined based on the speech recognition information is less than a predetermined threshold.
9. A device for speech recognition, comprising: an acquisition module, configured to acquire speech recognition information of a current voice of a user, and to acquire auxiliary recognition information for the speech recognition information based on a current state of the user corresponding to the current voice of the user; and a determination module, configured to determine a final recognition result of the current voice of the user according to the speech recognition information and the auxiliary recognition information.
10. The device according to claim 9, wherein the determination module comprises: a first acquisition unit, configured to acquire, according to the speech recognition information, one or more first candidate words corresponding to the current voice of the user; a second acquisition unit, configured to acquire, according to the auxiliary recognition information, a word category or one or more second candidate words corresponding to the current voice of the user; and a determination unit, configured to determine the final recognition result of the current voice of the user according to the one or more first candidate words and the word category, or to determine the final recognition result of the current voice of the user according to the one or more first candidate words and the one or more second candidate words.
11. The device according to claim 10, wherein the determination unit is further configured to select, from the one or more first candidate words, a first specific word that matches the word category, and to take the first specific word as the final recognition result of the current voice of the user.
12. The device according to claim 10, wherein the determination unit is further configured to select, from the one or more second candidate words, a second specific word having a high similarity to the one or more first candidate words, and to take the second specific word as the final recognition result of the current voice of the user.
13. The device according to claim 9, wherein the acquisition module further comprises: a third acquisition unit, configured to acquire an image indicating the current state of the user; a fourth acquisition unit, configured to acquire image feature information from the image; and a fifth acquisition unit, configured to acquire, according to the image feature information, a word category and/or one or more candidate words corresponding to the image feature information, and to take the word category and/or the one or more candidate words as the auxiliary recognition information.
14. The device according to claim 13, wherein the fifth acquisition unit further comprises: a search subunit, configured to search a predetermined image library for a specific image having the highest similarity to the image feature information; and an acquisition subunit, configured to acquire, according to a preset correspondence between images and word categories or one or more candidate words, the word category or the one or more candidate words corresponding to the specific image.
15. The device according to any one of claims 9 to 14, wherein the current state of the user comprises at least one of the following: a lip movement state of the user, a throat vibration state of the user, a facial movement state of the user, and a gesture movement state of the user.
16. The device according to any one of claims 9 to 15, further comprising: a judgment module, configured to determine that an accuracy rate of a final recognition result of the current voice of the user determined based on the speech recognition information is less than a predetermined threshold.
17. A terminal, comprising a processor, wherein the processor is configured to acquire speech recognition information of a current voice of a user, to acquire auxiliary recognition information for the speech recognition information based on a current state of the user corresponding to the current voice of the user, and to determine a final recognition result of the current voice of the user according to the speech recognition information and the auxiliary recognition information.
CN201510130636.2A 2015-03-24 2015-03-24 The method and device of speech recognition Pending CN106157956A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510130636.2A CN106157956A (en) 2015-03-24 2015-03-24 The method and device of speech recognition
PCT/CN2015/079317 WO2016150001A1 (en) 2015-03-24 2015-05-19 Speech recognition method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510130636.2A CN106157956A (en) 2015-03-24 2015-03-24 The method and device of speech recognition

Publications (1)

Publication Number Publication Date
CN106157956A true CN106157956A (en) 2016-11-23

Family

ID=56976870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510130636.2A Pending CN106157956A (en) 2015-03-24 2015-03-24 The method and device of speech recognition

Country Status (2)

Country Link
CN (1) CN106157956A (en)
WO (1) WO2016150001A1 (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875941A (en) * 2017-04-01 2017-06-20 彭楚奥 A kind of voice method for recognizing semantics of service robot
CN107945789A (en) * 2017-12-28 2018-04-20 努比亚技术有限公司 Audio recognition method, device and computer-readable recording medium
CN108010526A (en) * 2017-12-08 2018-05-08 北京奇虎科技有限公司 Method of speech processing and device
CN108074561A (en) * 2017-12-08 2018-05-25 北京奇虎科技有限公司 Method of speech processing and device
CN108346427A (en) * 2018-02-05 2018-07-31 广东小天才科技有限公司 Voice recognition method, device, equipment and storage medium
CN108449323A (en) * 2018-02-14 2018-08-24 深圳市声扬科技有限公司 Login authentication method, device, computer equipment and storage medium
CN108446641A (en) * 2018-03-22 2018-08-24 深圳市迪比科电子科技有限公司 Mouth shape image recognition system based on machine learning and method for recognizing and sounding through facial texture
CN108510988A (en) * 2018-03-22 2018-09-07 深圳市迪比科电子科技有限公司 Language identification system and method for deaf-mutes
CN108965621A (en) * 2018-10-09 2018-12-07 北京智合大方科技有限公司 Self study smart phone sells the assistant that attends a banquet
CN108986818A (en) * 2018-07-04 2018-12-11 百度在线网络技术(北京)有限公司 Video calling hangs up method, apparatus, equipment, server-side and storage medium
CN109213970A (en) * 2017-06-30 2019-01-15 北京国双科技有限公司 Put down generation method and device
CN109448711A (en) * 2018-10-23 2019-03-08 珠海格力电器股份有限公司 Voice recognition method and device and computer storage medium
CN109462694A (en) * 2018-11-19 2019-03-12 维沃移动通信有限公司 A kind of control method and mobile terminal of voice assistant
CN109583359A (en) * 2018-11-26 2019-04-05 北京小米移动软件有限公司 Presentation content recognition methods, device, electronic equipment, machine readable storage medium
CN109697976A (en) * 2018-12-14 2019-04-30 北京葡萄智学科技有限公司 A kind of pronunciation recognition methods and device
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN110415689A (en) * 2018-04-26 2019-11-05 富泰华工业(深圳)有限公司 Speech recognition device and method
CN110473570A (en) * 2018-05-09 2019-11-19 广达电脑股份有限公司 Integrated voice identification system and method
CN110830708A (en) * 2018-08-13 2020-02-21 深圳市冠旭电子股份有限公司 Tracking camera shooting method and device and terminal equipment
CN111445912A (en) * 2020-04-03 2020-07-24 深圳市阿尔垎智能科技有限公司 Voice processing method and system
CN111447325A (en) * 2020-04-03 2020-07-24 上海闻泰电子科技有限公司 Call auxiliary method, device, terminal and storage medium
CN111951629A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation correction system, method, medium and computing device
CN113138674A (en) * 2020-01-19 2021-07-20 北京搜狗科技发展有限公司 Input method and related device
CN113763941A (en) * 2020-06-01 2021-12-07 青岛海尔洗衣机有限公司 Speech recognition method, speech recognition system and electrical equipment
CN113823278A (en) * 2021-09-13 2021-12-21 北京声智科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN114387961A (en) * 2022-02-09 2022-04-22 Oppo广东移动通信有限公司 Speech recognition method, device, electronic device and readable storage medium
CN114402383A (en) * 2019-09-18 2022-04-26 三星电子株式会社 Electronic device and method for controlling voice recognition thereof
CN114663981A (en) * 2022-04-07 2022-06-24 贝壳找房网(北京)信息技术有限公司 Device control method and apparatus, storage medium and program product
CN114730563A (en) * 2019-11-18 2022-07-08 谷歌有限责任公司 Re-scoring automatic speech recognition hypotheses using audio-visual matching
CN120071950A (en) * 2025-04-27 2025-05-30 上海博知为电子软件有限公司 Rail transit hearing impairment voice auxiliary system based on multi-modal sensing
CN120853570A (en) * 2025-09-11 2025-10-28 深圳聚瑞云控科技有限公司 Voice interactive control method, device, helmet, storage medium and program product

Families Citing this family (5)

Publication number Priority date Publication date Assignee Title
CN107800860A (en) * 2016-09-07 2018-03-13 中兴通讯股份有限公司 Method of speech processing, device and terminal device
EP3618457B1 (en) 2018-09-02 2025-07-16 Oticon A/s A hearing device configured to utilize non-audio information to process audio signals
CN110865705B (en) * 2019-10-24 2023-09-19 中国人民解放军军事科学院国防科技创新研究院 Multimodal fusion communication methods, devices, headsets and storage media
CN112672021B (en) * 2020-12-25 2022-05-17 维沃移动通信有限公司 Language recognition method, device and electronic device
CN116434027A (en) * 2023-06-12 2023-07-14 深圳星寻科技有限公司 An artificial intelligence interactive system based on image recognition

Citations (6)

Publication number Priority date Publication date Assignee Title
US4769845A (en) * 1986-04-10 1988-09-06 Kabushiki Kaisha Carrylab Method of recognizing speech using a lip image
JPS6419399A (en) * 1987-07-15 1989-01-23 Mitsubishi Electric Corp Voice recognition equipment
CN102023703A (en) * 2009-09-22 2011-04-20 现代自动车株式会社 Combined lip reading and voice recognition multimodal interface system
CN102298443A (en) * 2011-06-24 2011-12-28 华南理工大学 Smart home voice control system combined with video channel and control method thereof
CN104409075A (en) * 2014-11-28 2015-03-11 深圳创维-Rgb电子有限公司 Speech Recognition Method and System
CN104423543A (en) * 2013-08-26 2015-03-18 联想(北京)有限公司 Information processing method and device

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP2002304194A (en) * 2001-02-05 2002-10-18 Masanobu Kujirada System, method and program for inputting voice and/or mouth shape information
US7587318B2 (en) * 2002-09-12 2009-09-08 Broadcom Corporation Correlating video images of lip movements with audio signals to improve speech recognition
CN101472066A (en) * 2007-12-27 2009-07-01 华晶科技股份有限公司 Near-end control method of image capturing device and image capturing device applying same
CN102324035A (en) * 2011-08-19 2012-01-18 广东好帮手电子科技股份有限公司 Method and system for applying mouth-shape assisted speech recognition technology in vehicle navigation
CN105096935B (en) * 2014-05-06 2019-08-09 阿里巴巴集团控股有限公司 A kind of pronunciation inputting method, device and system


Cited By (40)

Publication number Priority date Publication date Assignee Title
CN106875941A (en) * 2017-04-01 2017-06-20 彭楚奥 A kind of voice method for recognizing semantics of service robot
CN106875941B (en) * 2017-04-01 2020-02-18 彭楚奥 Voice semantic recognition method of service robot
CN109213970A (en) * 2017-06-30 2019-01-15 北京国双科技有限公司 Put down generation method and device
CN109213970B (en) * 2017-06-30 2022-07-29 北京国双科技有限公司 Method and device for generating notes
CN108010526A (en) * 2017-12-08 2018-05-08 北京奇虎科技有限公司 Method of speech processing and device
CN108074561A (en) * 2017-12-08 2018-05-25 北京奇虎科技有限公司 Method of speech processing and device
CN107945789A (en) * 2017-12-28 2018-04-20 努比亚技术有限公司 Audio recognition method, device and computer-readable recording medium
CN108346427A (en) * 2018-02-05 2018-07-31 广东小天才科技有限公司 Voice recognition method, device, equipment and storage medium
CN108449323A (en) * 2018-02-14 2018-08-24 深圳市声扬科技有限公司 Login authentication method, device, computer equipment and storage medium
CN108449323B (en) * 2018-02-14 2021-05-25 深圳市声扬科技有限公司 Login authentication method and device, computer equipment and storage medium
CN108510988A (en) * 2018-03-22 2018-09-07 深圳市迪比科电子科技有限公司 Language identification system and method for deaf-mutes
CN108446641A (en) * 2018-03-22 2018-08-24 深圳市迪比科电子科技有限公司 Mouth shape image recognition system based on machine learning and method for recognizing and sounding through facial texture
CN110415689B (en) * 2018-04-26 2022-02-15 富泰华工业(深圳)有限公司 Speech recognition device and method
CN110415689A (en) * 2018-04-26 2019-11-05 富泰华工业(深圳)有限公司 Speech recognition device and method
CN110473570A (en) * 2018-05-09 2019-11-19 广达电脑股份有限公司 Integrated voice identification system and method
CN108986818A (en) * 2018-07-04 2018-12-11 百度在线网络技术(北京)有限公司 Video calling hangs up method, apparatus, equipment, server-side and storage medium
CN110830708A (en) * 2018-08-13 2020-02-21 深圳市冠旭电子股份有限公司 Tracking camera shooting method and device and terminal equipment
CN108965621A (en) * 2018-10-09 2018-12-07 北京智合大方科技有限公司 Self study smart phone sells the assistant that attends a banquet
CN109448711A (en) * 2018-10-23 2019-03-08 珠海格力电器股份有限公司 Voice recognition method and device and computer storage medium
CN109462694A (en) * 2018-11-19 2019-03-12 维沃移动通信有限公司 A kind of control method and mobile terminal of voice assistant
CN109583359B (en) * 2018-11-26 2023-10-24 北京小米移动软件有限公司 Method, apparatus, electronic device, and machine-readable storage medium for recognizing expression content
CN109583359A (en) * 2018-11-26 2019-04-05 北京小米移动软件有限公司 Presentation content recognition methods, device, electronic equipment, machine readable storage medium
CN109697976A (en) * 2018-12-14 2019-04-30 北京葡萄智学科技有限公司 A kind of pronunciation recognition methods and device
CN109872714A (en) * 2019-01-25 2019-06-11 广州富港万嘉智能科技有限公司 A kind of method, electronic equipment and storage medium improving accuracy of speech recognition
CN111951629A (en) * 2019-05-16 2020-11-17 上海流利说信息技术有限公司 Pronunciation correction system, method, medium and computing device
CN114402383A (en) * 2019-09-18 2022-04-26 三星电子株式会社 Electronic device and method for controlling voice recognition thereof
CN114730563A (en) * 2019-11-18 2022-07-08 谷歌有限责任公司 Re-scoring automatic speech recognition hypotheses using audio-visual matching
CN113138674A (en) * 2020-01-19 2021-07-20 北京搜狗科技发展有限公司 Input method and related device
CN111445912A (en) * 2020-04-03 2020-07-24 深圳市阿尔垎智能科技有限公司 Voice processing method and system
CN111447325A (en) * 2020-04-03 2020-07-24 上海闻泰电子科技有限公司 Call auxiliary method, device, terminal and storage medium
CN113763941A (en) * 2020-06-01 2021-12-07 青岛海尔洗衣机有限公司 Speech recognition method, speech recognition system and electrical equipment
CN113763941B (en) * 2020-06-01 2025-06-24 青岛海尔洗衣机有限公司 Speech recognition method, speech recognition system and electrical equipment
CN113823278A (en) * 2021-09-13 2021-12-21 北京声智科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113823278B (en) * 2021-09-13 2023-12-08 北京声智科技有限公司 Speech recognition method, device, electronic equipment and storage medium
CN114387961A (en) * 2022-02-09 2022-04-22 Oppo广东移动通信有限公司 Speech recognition method, device, electronic device and readable storage medium
CN114387961B (en) * 2022-02-09 2025-12-19 Oppo广东移动通信有限公司 Speech recognition method, device, electronic device, and readable storage medium
CN114663981A (en) * 2022-04-07 2022-06-24 贝壳找房网(北京)信息技术有限公司 Device control method and apparatus, storage medium and program product
CN120071950A (en) * 2025-04-27 2025-05-30 上海博知为电子软件有限公司 Rail transit hearing impairment voice auxiliary system based on multi-modal sensing
CN120071950B (en) * 2025-04-27 2025-08-05 上海博知为电子软件有限公司 Rail transit hearing impairment voice auxiliary system based on multi-modal sensing
CN120853570A (en) * 2025-09-11 2025-10-28 深圳聚瑞云控科技有限公司 Voice interactive control method, device, helmet, storage medium and program product

Also Published As

Publication number Publication date
WO2016150001A1 (en) 2016-09-29

Similar Documents

Publication Publication Date Title
CN106157956A (en) The method and device of speech recognition
CN112037791B (en) Conference summary transcription method, apparatus and storage medium
KR102339594B1 (en) Object recognition method, computer device, and computer-readable storage medium
US12243531B2 (en) Determining input for speech processing engine
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
CN110853646B (en) Methods, devices, equipment and readable storage media for distinguishing conference speaking roles
CN110289000B (en) Voice recognition method and device
CN105512348A (en) Method and device for processing videos and related audios and retrieving method and device
AU2016277548A1 (en) A smart home control method based on emotion recognition and the system thereof
US20220084529A1 (en) Method and apparatus for awakening wearable device
CN104361276A (en) Multi-mode biometric authentication method and multi-mode biometric authentication system
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN109558788B (en) Silent speech input recognition method, computing device and computer-readable medium
WO2016173132A1 (en) Method and device for voice recognition, and user equipment
CN111326152A (en) Voice control method and device
CN115206306B (en) Voice interaction method, device, equipment and system
CN118247818A (en) Multimodal biometric fusion personnel identification method and identification platform based on deep learning
WO2014173325A1 (en) Gutturophony recognition method and device
WO2017219450A1 (en) Information processing method and device, and mobile terminal
CN114239610A (en) Multi-language speech recognition and translation method and related system
KR20210066774A (en) Method and Apparatus for Distinguishing User based on Multimodal
CN115775260A (en) Interaction right switching method and device, electronic equipment and storage medium
KR20110124568A (en) Robot system having speech and image recognition function and recognition method thereof
CN115620713A (en) Dialogue intent recognition method, device, equipment and storage medium
CN115609596B (en) User registration method and related equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161123
