
CN111241814B - Error correction method, device, electronic equipment and storage medium for speech recognition text - Google Patents


Info

Publication number
CN111241814B
CN111241814B (application CN201911410367.XA)
Authority
CN
China
Prior art keywords
user
text
voice information
data set
intention
Prior art date
Legal status
Active
Application number
CN201911410367.XA
Other languages
Chinese (zh)
Other versions
CN111241814A (en)
Inventor
章翔
孟越涛
张俊杰
罗红
荣玉军
Current Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN201911410367.XA
Publication of CN111241814A
Application granted
Publication of CN111241814B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention relate to the field of natural language processing and disclose an error correction method and device for speech-recognized text, an electronic device, and a storage medium. In the invention, voice information is received; at least one user intention corresponding to the voice information is identified; according to the identified at least one user intention, all data of that intention are selected from a cloud data set as a personalized fuzzy data set; and the personalized fuzzy data set is combined with a preset basic fuzzy data set to correct the text recognized from the voice information. By exploiting the user's personalized intention, the amount of data required for error correction is reduced while accuracy is preserved, improving the efficiency of error correction.

Description

Error Correction Method, Device, Electronic Device and Storage Medium for Speech Recognition Text

Technical Field

Embodiments of the present invention relate to the field of natural language processing, and in particular to an error correction method, device, electronic device and storage medium for speech-recognized text.

Background Art

With the development of artificial intelligence technology, users can control smart devices by voice. For a smart speaker or similar device to recognize the user's speech accurately, and therefore carry out the corresponding action accurately, the recognized voice information must be converted into text and that text must be corrected. Current text error correction typically relies on data in a cloud data set: a language model locates likely erroneous characters, which are then corrected using pinyin phonetic-similarity features, stroke and Wubi edit-distance features, and language-model perplexity features. A large amount of data usually has to be added to the cloud data set so that the language model can reliably identify erroneous characters in the text.
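The edit-distance feature mentioned above is, at its core, a Levenshtein distance between two strings (in practice between pinyin or Wubi encodings of characters; those encodings are not specified in the source, so a plain string version is sketched here):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions,
    deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dp[j] holds the distance between the current prefix of a and b[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1,                         # deletion
                      dp[j - 1] + 1,                     # insertion
                      prev + (a[i - 1] != b[j - 1]))     # substitution
            prev, dp[j] = dp[j], cur
    return dp[n]

print(edit_distance("kitten", "sitting"))  # → 3
```

Candidates whose encoding lies within a small edit distance of the erroneous word would be cheap to shortlist with exactly this routine.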

The inventors have found at least the following problem in the related art: when the cloud data set contains too much data, the language model has too many correction candidates to evaluate, which reduces the efficiency of error correction.

Summary of the Invention

The purpose of embodiments of the present invention is to provide an error correction method, device, electronic device and storage medium for speech-recognized text that use the user's personalized intention to reduce the amount of data required for error correction while preserving accuracy, thereby improving the efficiency of error correction.

To solve the above technical problem, an embodiment of the present invention provides an error correction method for speech-recognized text, comprising: receiving voice information; identifying at least one user intention corresponding to the voice information; according to the identified at least one user intention, selecting all data of that intention from a cloud data set as a personalized fuzzy data set; and correcting the text recognized from the voice information by combining the personalized fuzzy data set with a preset basic fuzzy data set.

An embodiment of the present invention also provides an error correction device for speech-recognized text, comprising a receiving module, a recognition module, a selection module and an error correction module. The receiving module receives voice information; the recognition module identifies at least one user intention corresponding to the voice information; the selection module selects, according to the identified at least one user intention, all data of that intention from the cloud data set as a personalized fuzzy data set; and the error correction module corrects the text recognized from the voice information by combining the personalized fuzzy data set with a preset basic fuzzy data set.

An embodiment of the present invention also provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the above error correction method for speech-recognized text.

An embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, implements the above error correction method for speech-recognized text.

Compared with the prior art, embodiments of the present invention receive voice information and determine the user intention it expresses, for example an intention to listen to music or to hear a weather report. That intention is then used to filter the cloud data set down to the data likely to be needed in the current correction pass: the data in the cloud data set sharing the identified intention are extracted as a personalized fuzzy data set, and this set is combined with a preset basic fuzzy data set to correct the text recognized from the voice information. Using the personalized user intention contained in the voice information thus reduces the amount of data consulted during error correction and improves its efficiency. In addition, because the basic fuzzy data set contains commonly confused words such as homophones and near-homophones, combining it with the personalized set preserves the accuracy of error correction.

Further, identifying the at least one user intention corresponding to the voice information includes: identifying the voiceprint features of the voice information; determining the user corresponding to the voice information from the voiceprint features; obtaining that user's historical voice information; and determining from the historical voice information at least one user intention the user habitually expresses, which is then taken as the at least one user intention corresponding to the voice information. In this way the user's habitual intentions can be inferred from the historical voice information, so the user's personalized behavior is determined more accurately.

Further, determining the user's habitual user intentions from the historical voice information includes: inputting features of the historical voice information into a pre-trained neural network model, where the model is trained on features of voice information for each user intention and identifies the user's usage rate of each intention. The features include at least any one of the following, or a combination thereof: the total voice-interaction time for each intention in the historical voice information, the intention of the most recent voice information in the history, and the user's age or gender. The user's habitual intentions are then determined from the model's output.

Further, before taking the user's habitual intentions as the intentions corresponding to the voice information, the method also includes converting the voice information into text information and performing intent recognition on that text to obtain a text intent. Taking the user's habitual intentions as the intentions corresponding to the voice information then includes: if the text intent is the same as any of the user's habitual intentions, taking the habitual intentions as the intentions corresponding to the voice information; if the text intent matches none of them, taking the habitual intentions together with the text intent as the intentions corresponding to the voice information.

Further, performing intent recognition on the text information includes: converting the text into a vector matrix with a word-embedding method; inputting the matrix into a pre-trained text classification model; and obtaining the text intent from the model's output. Determining the text intent from the text recognized from the currently received voice information ensures that the personalized fuzzy data set contains personalized data matching that voice information, making error correction of the recognized text more accurate.

Further, correcting the recognized text by combining the personalized fuzzy data set with the preset basic fuzzy data set includes: locating the positions of erroneous words in the recognized text using the personalized fuzzy data set and the preset basic fuzzy data set; selecting at least one replacement word for each erroneous word from the two data sets; computing a perplexity score for each replacement with a language model; and correcting the recognized text with the replacements whose perplexity score is below a first preset threshold.

Further, locating the positions of erroneous words in the recognized text includes: segmenting the recognized text into word segments; computing, from the correlations between the segments, the probability that each segment is erroneous; and treating segments whose probability exceeds a second preset threshold as erroneous words, with their positions in the text taken as the positions where the errors appear.

Brief Description of the Drawings

One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not limit the embodiments.

FIG. 1 is a flowchart of the error correction method for speech-recognized text according to the first embodiment of the present invention;

FIG. 2 is a flowchart of the error correction method for speech-recognized text according to the second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of the error correction device for speech-recognized text according to the third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of the electronic device according to the fourth embodiment of the present invention.

Detailed Description

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those skilled in the art will appreciate that many technical details are given in the embodiments to help the reader understand the application; the claimed technical solutions can nevertheless be implemented without these details, and with various changes and modifications based on the embodiments below.

The division into the following embodiments is for convenience of description and does not limit the specific implementation of the invention; the embodiments may be combined with and refer to one another where they do not contradict.

The first embodiment of the present invention relates to an error correction method for speech-recognized text, comprising: receiving voice information; identifying at least one user intention corresponding to the voice information; selecting, according to the identified at least one user intention, all data of that intention from a cloud data set as a personalized fuzzy data set; and correcting the text recognized from the voice information by combining the personalized fuzzy data set with a preset basic fuzzy data set. The user's personalized intention reduces the amount of data required for error correction while preserving accuracy, improving efficiency. Implementation details of the method are described below; they are provided only to aid understanding and are not all required to practice the solution.

The specific flow is shown in FIG. 1. The first embodiment relates to an error correction method for speech-recognized text, comprising:

Step 101: receive voice information. Specifically, when a user controls a smart device by voice, the user interacts with a device such as a smart speaker. The smart device receives the user's voice information through the speaker, for example "play music" or "play a novel".

Step 102: identify at least one user intention corresponding to the voice information.

Specifically, each piece of voice information has a corresponding user intention: the voice information "play music" corresponds to the intention of playing music, "play a novel" to the intention of playing a novel, and so on.

The user intention corresponding to the voice information can be identified in several ways: determine the user's habitual intentions from the user's historical voice information and take those as the identified intentions; determine a text intent from the text recognized from the voice information and take that as the identified intention; or combine the habitual intentions and the text intent as the intentions corresponding to the voice information.

Step 103: according to the identified at least one user intention, select all data of that intention from the cloud data set as a personalized fuzzy data set.

Specifically, the cloud data set contains a large amount of data, which can be partitioned by user intention; for example, data for the music-playing intention goes into one data list and data for the novel-playing intention into another. After at least one user intention has been identified, the data lists corresponding to those intentions are selected as the personalized fuzzy data set.
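The partition-and-select step described above can be pictured as a dictionary keyed by intention; the intent labels and entries below are invented for illustration:

```python
# Cloud data set, pre-partitioned into one data list per user intention
# (labels and vocabulary here are illustrative only).
cloud_data_set = {
    "play_music": ["周杰伦", "双截棍", "七里香"],
    "play_novel": ["三体", "凡人修仙传"],
    "weather":    ["今天", "明天", "气温"],
}

def build_personalized_set(intents):
    """Select all data belonging to the recognized intentions."""
    personalized = []
    for intent in intents:
        personalized.extend(cloud_data_set.get(intent, []))
    return personalized

personalized = build_personalized_set(["play_music"])
print(personalized)  # only music data, far fewer candidates than the whole set
```

Only the selected lists, merged with the preset basic fuzzy data set, are consulted during correction, which is what shrinks the candidate space.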

Step 104: correct the text recognized from the voice information by combining the personalized fuzzy data set with the preset basic fuzzy data set.

Specifically, the personalized fuzzy data set and the basic fuzzy data set are merged. When using them to correct the text recognized from the voice information, the text is first segmented with a Chinese word-segmentation tool such as jieba. For example, for the user utterance "来一首周杰伦的爽截棍" produced by automatic speech recognition (ASR), jieba segmentation yields "来" | "一首" | "周杰伦" | "的" | "爽" | "截棍". After segmentation, possible errors are localized, typically with a character-granularity language model. The language model is processed mainly with the N-gram algorithm, and positions whose score is worse than the average score are taken as possible error positions.

The N-gram mentioned above is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the text, forming a sequence of length-N fragments; each fragment is called a gram. The frequency of every gram is counted and filtered against a preset threshold to form a key-gram list, which constitutes the text's vector feature space, with each gram in the list being one feature dimension. The model rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no others, so the probability of a whole sentence is the product of the conditional probabilities of its words. These probabilities can be obtained by directly counting how often N words co-occur in a corpus. The binary Bi-gram and ternary Tri-gram algorithms are the most commonly used.
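A bigram instance of this idea can be sketched as follows: count adjacent-word pairs in a small corpus, score each position of a new sentence by its smoothed conditional bigram probability, and flag positions scoring below the sentence average as possible errors (the corpus and the add-one smoothing are illustrative choices, not the patent's actual model):

```python
from collections import Counter

# Tiny illustrative corpus of already-segmented utterances.
corpus = [
    ["来", "一首", "周杰伦", "的", "双截棍"],
    ["来", "一首", "周杰伦", "的", "七里香"],
    ["播放", "周杰伦", "的", "双截棍"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(p for sent in corpus for p in zip(sent, sent[1:]))

def bigram_prob(prev, word, alpha=1.0):
    """Add-one smoothed conditional probability P(word | prev)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(unigrams))

def suspect_positions(sent):
    """Indices whose bigram score falls below the sentence average."""
    scores = [bigram_prob(sent[i - 1], sent[i]) for i in range(1, len(sent))]
    avg = sum(scores) / len(scores)
    return [i for i, s in enumerate(scores, start=1) if s < avg]

print(suspect_positions(["来", "一首", "周杰伦", "的", "爽截棍"]))  # → [4]
```

The unseen pair ("的", "爽截棍") scores well below the sentence average, so position 4 is flagged as a likely error, matching the "worse than average score" rule above.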

After the possible error positions have been located, replacement words are selected from the personalized fuzzy data set and the preset basic fuzzy data set as error candidates; in the example above, possible candidates for "爽" include "双" and "霜". Every candidate substitution is evaluated, the resulting perplexity scores are ranked, and the most likely candidate is chosen. Perplexity is again computed as the perplexity score of the n-gram language model. Finally, the sentence with the lowest perplexity score is compared with a preset perplexity threshold: if the computed score is below the threshold, the corrected sentence is output; otherwise the original input text is output.
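This candidate-ranking step can be sketched as follows: try each confusion-set candidate at the flagged position, score the substituted sentence, keep the best, and fall back to the original text if even the best score misses the threshold. The scorer below is a deliberately trivial stand-in for the patent's n-gram perplexity model:

```python
def correct(tokens, pos, candidates, score, threshold):
    """Try each replacement at `pos`; return the corrected token list
    if the best candidate's perplexity score beats the threshold,
    otherwise return the original tokens unchanged."""
    best, best_score = None, float("inf")
    for cand in candidates:
        trial = tokens[:pos] + [cand] + tokens[pos + 1:]
        s = score(trial)
        if s < best_score:
            best, best_score = trial, s
    return best if best_score < threshold else tokens

# Stand-in scorer: pretends "双截棍" is the only fluent completion.
toy_score = lambda toks: 0.5 if "双截棍" in toks else 5.0

tokens = ["来", "一首", "周杰伦", "的", "爽截棍"]
print(correct(tokens, 4, ["双截棍", "霜截棍"], toy_score, threshold=1.0))
```

With the real n-gram scorer plugged in for `toy_score`, this is exactly the lowest-perplexity-below-threshold rule described above.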

Compared with the prior art, this embodiment receives voice information and determines the user intention it expresses, for example an intention to listen to music or to hear a weather report. That intention is used to filter the cloud data set down to the data likely to be needed in the current correction pass: the data in the cloud data set sharing the identified intention are extracted as a personalized fuzzy data set, and this set is combined with the preset basic fuzzy data set to correct the text recognized from the voice information. Using the personalized user intention contained in the voice information thus reduces the amount of data consulted during error correction and improves its efficiency. In addition, because the basic fuzzy data set contains commonly confused words such as homophones and near-homophones, combining it with the personalized set preserves the accuracy of error correction.

The second embodiment of the present invention relates to an error correction method for speech-recognized text. In the second embodiment, the user's habitual intentions are obtained from the user's historical voice information, a text intent is determined from the text recognized from the currently received voice information, and both are used to select data from the cloud data set as the personalized fuzzy data set. The specific flow is shown in FIG. 2 and comprises:

Step 201: receive voice information.

Step 202: identify the voiceprint features of the voice information.

Step 203: determine the user corresponding to the voice information from the voiceprint features.

Step 204: obtain the historical voice information of that user.

Step 205: determine at least one habitual user intention from the historical voice information.

Specifically, the voice information input by the user is analyzed to extract its voiceprint features. Since every user has unique voiceprint features, the user who produced the voice information can be determined from the voiceprint features of that information, and that user's historical voice information can then be retrieved.

When identifying user intention from text information, a text classification (textCNN) model can be used. The classification works as follows: first, a word-embedding method converts each character into a vector of the same length, so a sentence becomes a vector matrix; next, the matrix is processed by convolutional layers, followed by a pooling layer and a fully connected layer; finally, a softmax layer performs the classification.
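A forward pass of such a textCNN (embedding lookup, convolution over time, max-over-time pooling, fully connected layer, softmax) can be sketched with randomly initialised parameters; all dimensions below are illustrative and training is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, seq_len = 50, 8, 6
kernel_size, num_filters, num_classes = 3, 4, 2  # e.g. music vs. novel intent

# Randomly initialised parameters (a trained model would learn these).
E = rng.normal(size=(vocab_size, embed_dim))                # embedding table
K = rng.normal(size=(num_filters, kernel_size, embed_dim))  # conv kernels
W = rng.normal(size=(num_filters, num_classes))             # dense layer
b = np.zeros(num_classes)

def text_cnn_forward(token_ids):
    x = E[token_ids]                                  # (seq_len, embed_dim)
    # 1-D convolution over time: one activation per window per filter
    windows = np.stack([x[i:i + kernel_size]
                        for i in range(len(token_ids) - kernel_size + 1)])
    conv = np.maximum(0, np.einsum("wke,fke->wf", windows, K))  # ReLU
    pooled = conv.max(axis=0)                         # max-over-time pooling
    logits = pooled @ W + b                           # fully connected layer
    exp = np.exp(logits - logits.max())               # stable softmax
    return exp / exp.sum()

probs = text_cnn_forward([3, 17, 42, 8, 1, 25])
print(probs, probs.sum())
```

The output is a probability distribution over intent classes; training the parameters with cross-entropy loss is what the pre-training step in the text refers to.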

When determining the user's habitual intentions from historical voice information, a neural network model is built with the textCNN technique described above. The model is trained on features of the voice information for each user intention and identifies the user's usage rate of each intention. The features include at least any one of the following, or a combination thereof: the total voice-interaction time for each intention in the historical voice information, the intention of the most recent voice information in the history, and the user's age or gender. The feature values of the historical voice information are input into the model, and the habitual intentions are determined from its output, which may consist of each user intention together with its usage rate; the higher the usage rate, the more likely the user is to invoke that intention.

The principle of the neural network model is as follows. The model is implemented mainly as a BP (back-propagation) neural network, whose main formulas are given below. Since the BP network is a supervised learning algorithm, the training data set is specified as $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_q, y_q)\}$, and the derivative of the activation function $S$ is known; for the sigmoid activation commonly used in BP networks,

$$S'(x) = S(x)\,\bigl(1 - S(x)\bigr).$$

The training set contains the feature values corresponding to each piece of voice information. The input value of the $h$-th hidden-layer neuron is:

$$I_{hh} = \sum_i W_{ih}\, x_i$$

where $I_{hh}$ is the input value of the $h$-th hidden-layer neuron, $W_{ih}$ is the preset weight connecting input $i$ to hidden neuron $h$, and $x_i$ is a feature value from the training data set.

The output of the h-th hidden-layer neuron is O_hh = S(I_hh − H_h), where O_hh is the output value of the h-th hidden-layer neuron, H_h is the preset threshold corresponding to the h-th hidden-layer neuron, and S is the activation function described above.

The input value of the j-th neuron in the output layer is:

I_oj = Σ_h W_hj · O_hh

where I_oj is the input value of the j-th output-layer neuron and W_hj is the preset weight connecting the h-th hidden-layer neuron to the j-th output-layer neuron.

The output value of the j-th output-layer neuron is O_oj = S(I_oj − θ_j), where O_oj is the output value of the j-th output-layer neuron and θ_j is the preset threshold corresponding to the j-th output-layer neuron.

The mean square error of the neural network model on sample (x_k, y_k) is then:

E_k = (1/2) · Σ_j (O_oj − y_j)²

where y_j is the j-th component of the label y_k, i.e. the user intention corresponding to the feature values in the training set.

To minimize the total error E of the results, the model must update all parameters through continuous iteration. The BP neural network algorithm accomplishes this with the gradient descent algorithm, which adjusts each parameter in the direction of the negative gradient of the error. For a given learning rate α we obtain:

ΔW_hj = −α · ∂E_k/∂W_hj

which simplifies to:

ΔW_hj = α · O_oj · (1 − O_oj) · (y_j − O_oj) · O_hh

thereby reducing the error in each neuron's weight. The errors of the other parameters of the neural network model can be corrected in the same way. Writing g_j = O_oj · (1 − O_oj) · (y_j − O_oj), the preset thresholds of the output-layer neurons are adjusted by Δθ_j = −α · g_j; the preset weights of the hidden-layer neurons are adjusted by

ΔW_ih = α · e_h · x_i, where e_h = O_hh · (1 − O_hh) · Σ_j W_hj · g_j

and the preset thresholds of the hidden-layer neurons by ΔH_h = −α · e_h.
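The forward pass and the four update rules above can be sketched end to end as a small pure-Python BP network. Layer sizes, learning rate, and weight initialization are illustrative choices; the patent does not fix them:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyBP:
    """One-hidden-layer BP network using the update rules above.

    Notation follows the text: W_ih / gamma (H_h) are the hidden layer's
    weights and thresholds, W_hj / theta the output layer's; alpha is the
    learning rate."""

    def __init__(self, n_in, n_hid, n_out, alpha=0.5, seed=0):
        rnd = random.Random(seed)
        self.alpha = alpha
        self.W_ih = [[rnd.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_in)]
        self.gamma = [rnd.uniform(-1, 1) for _ in range(n_hid)]
        self.W_hj = [[rnd.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hid)]
        self.theta = [rnd.uniform(-1, 1) for _ in range(n_out)]

    def forward(self, x):
        # I_hh = sum_i W_ih * x_i ; O_hh = S(I_hh - H_h)
        O_h = [sigmoid(sum(x[i] * self.W_ih[i][h] for i in range(len(x))) - self.gamma[h])
               for h in range(len(self.gamma))]
        # I_oj = sum_h W_hj * O_hh ; O_oj = S(I_oj - theta_j)
        O_j = [sigmoid(sum(O_h[h] * self.W_hj[h][j] for h in range(len(O_h))) - self.theta[j])
               for j in range(len(self.theta))]
        return O_h, O_j

    def train_step(self, x, y):
        O_h, O_j = self.forward(x)
        # g_j = O_oj*(1-O_oj)*(y_j-O_oj): output-layer gradient term
        g = [O_j[j] * (1 - O_j[j]) * (y[j] - O_j[j]) for j in range(len(O_j))]
        # e_h = O_hh*(1-O_hh)*sum_j W_hj*g_j: hidden-layer gradient term
        e = [O_h[h] * (1 - O_h[h]) * sum(self.W_hj[h][j] * g[j] for j in range(len(g)))
             for h in range(len(O_h))]
        for h in range(len(O_h)):            # Delta W_hj = alpha * g_j * O_hh
            for j in range(len(g)):
                self.W_hj[h][j] += self.alpha * g[j] * O_h[h]
        for j in range(len(g)):              # Delta theta_j = -alpha * g_j
            self.theta[j] -= self.alpha * g[j]
        for i in range(len(x)):              # Delta W_ih = alpha * e_h * x_i
            for h in range(len(e)):
                self.W_ih[i][h] += self.alpha * e[h] * x[i]
        for h in range(len(e)):              # Delta H_h = -alpha * e_h
            self.gamma[h] -= self.alpha * e[h]
```

Repeated calls to `train_step` drive the squared error on a training sample toward zero, which is exactly the iterative correction the text describes.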

Step 206: convert the voice information into text information, and perform intent recognition on the text information to obtain the text intention. Specifically, after the user's habitual user intentions and the text intention have been obtained through the neural network model above: if the text intention is the same as any one of the at least one habitual user intention, the habitual user intentions alone are taken as the at least one user intention corresponding to the voice information; if the text intention differs from every habitual user intention, the habitual user intentions together with the text intention are taken as the user intentions corresponding to the voice information.
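The selection rule of step 206 reduces to a simple membership check; with illustrative intent names it might be sketched as:

```python
def merge_intents(habitual_intents, text_intent):
    """Step 206's rule: add the text intention only when it is not already
    among the habitual intentions (intent names here are illustrative)."""
    if text_intent in habitual_intents:
        return list(habitual_intents)
    return list(habitual_intents) + [text_intent]
```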

Step 207: according to the user's habitual user intentions and the text intention, select all the data for those intentions from the cloud data set as the personalized fuzzy data set. Specifically, after the habitual user intentions and the text intention are obtained, determine whether the intention set of the existing personalized fuzzy data set is the same as the newly obtained set of intentions. If it is the same, the data in the existing personalized fuzzy data set need not be updated; if it is not, delete the data of the redundant intentions from the existing personalized fuzzy data set and add the data of the missing intentions, thereby updating it. After the update, save the state of the current personalized fuzzy data set so that it can be compared again the next time the text recognized from voice information is corrected.
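The compare-delete-add update of step 207 is a set difference; a minimal sketch, assuming both data sets map an intention name to its entries:

```python
def update_personalized_set(current, target_intents, cloud_dataset):
    """Step 207's rule: drop entries for intentions no longer needed and pull
    entries for newly needed intentions from the cloud data set.
    `current` and `cloud_dataset` map intent -> data (illustrative shapes)."""
    existing = set(current)
    target = set(target_intents)
    if existing == target:
        return current                       # intention sets match: no update
    for intent in existing - target:         # delete redundant intentions
        del current[intent]
    for intent in target - existing:         # add missing intentions
        current[intent] = cloud_dataset[intent]
    return current
```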

Step 208: correct the text recognized from the voice information by combining the personalized fuzzy data set with the preset basic fuzzy data set.

The division of the above methods into steps is only for clarity of description. In implementation, steps may be merged into one step, or a single step may be split into multiple steps; as long as the same logical relationship is preserved, such variations fall within the scope of protection of this patent. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, likewise falls within the scope of protection of this patent.

The third embodiment of the present invention relates to an error correction apparatus for speech-recognized text. As shown in Figure 3, it comprises a receiving module 31, a recognition module 32, a selection module 33, and an error correction module 34. The receiving module 31 is configured to receive voice information; the recognition module 32 is configured to identify at least one user intention corresponding to the voice information; the selection module 33 is configured to select, according to the identified at least one user intention, all the data of those intentions from the cloud data set as a personalized fuzzy data set; and the error correction module 34 is configured to correct the text recognized from the voice information by combining the personalized fuzzy data set with a preset basic fuzzy data set.

It is readily apparent that this embodiment is the apparatus embodiment corresponding to the first embodiment, and the two can be implemented in cooperation with each other. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the first embodiment.

In addition, the recognition module 32 is configured to identify the voiceprint features of the voice information; determine, from the voiceprint features, the user information corresponding to the voice information; acquire the historical voice information corresponding to that user information; and determine from the historical voice information at least one user intention that the user habitually uses, taking it as the at least one user intention corresponding to the voice information.

In addition, the recognition module 32 is configured to input the feature values corresponding to the historical voice information into a pre-trained neural network model, where the neural network model is trained on the feature values of the voice information for each user intention and is used to identify the user's usage rate for each intention. The feature values include at least any one, or a combination, of the following: the total duration of voice interaction corresponding to each user intention in the historical voice information, the user intention of the most recent voice message in the historical voice information, the user's age, and the user's gender. At least one habitually used user intention is determined from the output of the neural network model.

In addition, the recognition module 32 is configured to convert the voice information into text information and perform intent recognition on the text information to obtain the text intention.

In addition, the recognition module 32 is configured to convert the text information into a vector matrix through a word embedding method, input the vector matrix into a pre-trained text classification model, and obtain the text intention from the output of the text classification model.
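As a rough illustration of that pipeline, a single textCNN-style filter over a toy embedding table might look like this. The vocabulary, embedding values, and kernel are made-up placeholders, not trained parameters:

```python
# Toy 4-dimensional embedding table; every value here is a made-up placeholder.
EMBEDDINGS = {
    "play":  [0.1, 0.3, -0.2, 0.5],
    "some":  [0.0, 0.1, 0.2, -0.1],
    "music": [0.4, -0.3, 0.1, 0.2],
    "<unk>": [0.0, 0.0, 0.0, 0.0],
}

def embed(tokens):
    """Word-embedding step: token sequence -> vector matrix (tokens x dims)."""
    return [EMBEDDINGS.get(t, EMBEDDINGS["<unk>"]) for t in tokens]

def conv_max(matrix, kernel):
    """One textCNN filter: slide a fixed-width window over the token rows,
    take the dot product at each position, then max-pool to one feature."""
    width = len(kernel) // len(matrix[0])            # window size in tokens
    feats = []
    for pos in range(len(matrix) - width + 1):
        window = [v for row in matrix[pos:pos + width] for v in row]
        feats.append(sum(k * v for k, v in zip(kernel, window)))
    return max(feats)
```

A real textCNN applies many such filters of several widths and feeds the pooled features to a softmax layer whose output is the text intention.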

In addition, the error correction module 34 is configured to locate the positions of erroneous words in the text recognized from the voice information according to the personalized fuzzy data set and the preset basic fuzzy data set; select at least one replacement word for each erroneous word from the personalized fuzzy data set and the preset basic fuzzy data set; calculate a confusion degree (perplexity) score for each replacement word through a language model; and correct the recognized text using the replacement words whose confusion degree score is below a first preset threshold.
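The last two operations of the error correction module can be sketched as follows. The language model is abstracted as a caller-supplied perplexity function `ppl`, since the patent does not name a specific model, and the function and variable names are illustrative:

```python
def correct_word(text, wrong_word, candidates, ppl, threshold):
    """Score each candidate replacement with the language model's perplexity
    and keep only candidates under the first preset threshold; the lowest
    scoring sentence wins."""
    scored = []
    for cand in candidates:
        fixed = text.replace(wrong_word, cand, 1)    # replace first occurrence
        score = ppl(fixed)
        if score < threshold:
            scored.append((score, fixed))
    return min(scored)[1] if scored else text        # fall back to the original
```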

It is worth mentioning that all the modules involved in this embodiment are logical modules. In practical applications, a logical unit may be one physical unit, part of a physical unit, or a combination of multiple physical units. In addition, to highlight the innovative part of the present invention, units not closely related to solving the technical problem addressed by the present invention are not introduced in this embodiment, but this does not mean that no other units exist in this embodiment.

The fourth embodiment of the present invention relates to an electronic device, as shown in Figure 4, comprising at least one processor 401 and a memory 402 communicatively connected to the at least one processor 401, where the memory 402 stores instructions executable by the at least one processor 401, and the instructions are executed by the at least one processor 401 so that the at least one processor 401 can perform the above error correction method for speech-recognized text.

The memory 402 and the processor 401 are connected by a bus, which may comprise any number of interconnected buses and bridges linking together the various circuits of the one or more processors 401 and the memory 402. The bus may also link various other circuits, such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and are therefore not described further here. A bus interface provides the interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing the means for communicating with various other apparatus over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna; further, the antenna also receives data and passes it to the processor 401.

The processor 401 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 402 may be used to store data used by the processor 401 when performing operations.

The fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the above method embodiments.

That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be accomplished by a program instructing the relevant hardware. The program is stored in a storage medium and includes a number of instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Those of ordinary skill in the art will understand that the above embodiments are specific embodiments for implementing the present invention, and that in practical applications various changes may be made to them in form and detail without departing from the spirit and scope of the present invention.

Claims (9)

1. A method for error correction of speech recognition text, comprising:
receiving voice information;
identifying at least one user intention corresponding to the voice information;
according to the identified at least one user intention, selecting all data of the user intention from a cloud data set as a personalized fuzzy data set;
combining the personalized fuzzy data set with a preset basic fuzzy data set to correct the text identified according to the voice information;
the identifying at least one user intention corresponding to the voice information comprises the following steps:
identifying voiceprint features of the voice information;
determining user information corresponding to the voice information according to the voiceprint characteristics;
acquiring historical voice information corresponding to the user information;
and determining at least one user intention of the user according to the historical voice information, and taking the at least one user intention of the user as the at least one user intention corresponding to the voice information.
2. The method of claim 1, wherein said determining at least one user intention of a user from said historical speech information comprises:
inputting the characteristic value corresponding to the historical voice information into a pre-trained neural network model; the neural network model is trained by utilizing characteristic values of voice information of each user intention and is used for identifying the utilization rate of the user on each user intention;
the characteristic value at least comprises any one or combination of the following characteristics: the total time of voice interaction corresponding to each user intention in the historical voice information, the user intention corresponding to the last voice information in the historical voice information, the age of the user or the gender of the user;
and determining at least one user intention used by the user according to the output result of the neural network model.
3. The error correction method for speech recognition text according to claim 1 or 2, further comprising, before taking the at least one user intention habitually used by the user as the at least one user intention corresponding to the voice information:
converting the voice information into text information, and carrying out intention recognition on the text information to obtain text intention;
the step of using the at least one user intention which is used by the user as the at least one user intention corresponding to the voice information comprises the following steps:
if the text intention is the same as any user intention of at least one user intention which is used by the user, taking the at least one user intention which is used by the user as at least one user intention corresponding to the voice information;
and if the text intention is different from any user intention of at least one user intention which is used by the user, using the at least one user intention which is used by the user and the text intention together as the user intention corresponding to the voice information.
4. The method for text error correction in speech recognition according to claim 3, wherein said performing intention recognition on the text information to obtain the text intention comprises:
converting the text information into a vector matrix by a word embedding method;
inputting the vector matrix into a pre-trained text classification model;
and obtaining the text intention according to the output result of the text classification model.
5. The method of claim 1, wherein said combining the personalized fuzzy data set with a pre-set basic fuzzy data set to correct the text recognized from the speech information comprises:
positioning the position of the error word in the text identified by the voice information according to the personalized fuzzy data set and a preset basic fuzzy data set;
selecting at least one replacement word of the wrong word from the personalized fuzzy data set and a preset basic fuzzy data set;
calculating confusion degree scores of the at least one replacement word through a language model respectively;
and correcting the error of the text identified by the voice information by using the replacement words with the confusion degree score smaller than a first preset threshold value.
6. The method for correcting errors in speech recognition text according to claim 5, wherein locating the position of the erroneous word in the text recognized by the speech information comprises:
dividing the text identified by the voice information into different word segments;
according to the correlation among the word segments in the text, respectively calculating the probability that each word segment is an error word segment;
and taking the error word segment with the probability larger than a second preset threshold value as the error word, and taking the position of the error word segment in the text as the position of the error word in the text.
7. An error correction device for speech recognition text, comprising: the device comprises a receiving module, an identification module, a selection module and an error correction module;
the receiving module is used for receiving voice information;
the recognition module is used for recognizing voiceprint features of the voice information; determining user information corresponding to the voice information according to the voiceprint characteristics; acquiring historical voice information corresponding to the user information; determining at least one user intention of the user according to the historical voice information, and taking the at least one user intention of the user as at least one user intention corresponding to the voice information;
the selecting module is used for selecting all data of the user intention from a cloud data set as a personalized fuzzy data set according to the identified at least one user intention;
the error correction module is used for correcting errors of texts identified according to the voice information by combining the personalized fuzzy data set and a preset basic fuzzy data set.
8. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the error correction method of speech recognition text as claimed in any one of claims 1 to 6.
9. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the error correction method of speech recognition text according to any one of claims 1 to 6.
CN201911410367.XA 2019-12-31 2019-12-31 Error correction method, device, electronic equipment and storage medium for speech recognition text Active CN111241814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911410367.XA CN111241814B (en) 2019-12-31 2019-12-31 Error correction method, device, electronic equipment and storage medium for speech recognition text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911410367.XA CN111241814B (en) 2019-12-31 2019-12-31 Error correction method, device, electronic equipment and storage medium for speech recognition text

Publications (2)

Publication Number Publication Date
CN111241814A CN111241814A (en) 2020-06-05
CN111241814B true CN111241814B (en) 2023-04-28

Family

ID=70874168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911410367.XA Active CN111241814B (en) 2019-12-31 2019-12-31 Error correction method, device, electronic equipment and storage medium for speech recognition text

Country Status (1)

Country Link
CN (1) CN111241814B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737962A (en) * 2020-06-24 2020-10-02 平安科技(深圳)有限公司 Entity revision method, entity revision device, computer equipment and readable storage medium
CN112115706B (en) * 2020-08-31 2022-05-03 北京字节跳动网络技术有限公司 Text processing method and device, electronic equipment and medium
CN112016303B (en) * 2020-09-07 2024-01-19 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium based on graph neural network
CN111985213B (en) * 2020-09-07 2024-05-28 科大讯飞华南人工智能研究院(广州)有限公司 Voice customer service text error correction method and device
CN112257437B (en) * 2020-10-20 2024-02-13 中国科学技术大学 Speech recognition error correction method, device, electronic equipment and storage medium
CN112911316A (en) * 2020-12-08 2021-06-04 泰州市朗嘉馨网络科技有限公司 Remote identification platform using information recording
CN112559719A (en) * 2020-12-23 2021-03-26 中移(杭州)信息技术有限公司 Intention recognition method and device, electronic equipment and storage medium
CN114678027B (en) * 2020-12-24 2024-12-03 深圳Tcl新技术有限公司 Speech recognition result error correction method, device, terminal equipment and storage medium
CN113051895A (en) * 2021-03-18 2021-06-29 中国工商银行股份有限公司 Method, apparatus, electronic device, medium, and program product for speech recognition
CN114203179B (en) * 2021-10-28 2025-06-06 山东浪潮科学研究院有限公司 Speech semantic understanding method and device
CN114328822B (en) * 2021-12-07 2025-04-04 福建新大陆软件工程有限公司 A contract text intelligent analysis method based on deep data mining
CN114566144B (en) * 2022-01-17 2025-09-02 海信视像科技股份有限公司 Speech recognition method, device, server and electronic equipment
CN115358222A (en) * 2022-08-30 2022-11-18 华润数字科技有限公司 A text error correction adversarial training and reasoning method, device and related media

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN107220235A (en) * 2017-05-23 2017-09-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and storage medium based on artificial intelligence
CN107977356A (en) * 2017-11-21 2018-05-01 新疆科大讯飞信息科技有限责任公司 Method and device for correcting recognized text
CN109508376A (en) * 2018-11-23 2019-03-22 四川长虹电器股份有限公司 It can online the error correction intension recognizing method and device that update
WO2019153996A1 (en) * 2018-02-09 2019-08-15 叶伟 Text error correction method and apparatus for voice recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
专业领域智能问答系统设计与实现;陶永芹;《计算机应用与软件》;全文 *

Also Published As

Publication number Publication date
CN111241814A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111241814B (en) Error correction method, device, electronic equipment and storage medium for speech recognition text
CN112148877B (en) Corpus text processing method and device and electronic equipment
US12308028B2 (en) Speech recognition method, apparatus and device, and storage medium
CN110245221B (en) Method and computer device for training dialog state tracking classifier
CN107836000B (en) Improved artificial neural network method and electronic device for language modeling and prediction
CN109948149B (en) Text classification method and device
WO2022121178A1 (en) Training method and apparatus and recognition method and apparatus for text error correction model, and computer device
CN108427665A (en) A kind of text automatic generation method based on LSTM type RNN models
CN113284499B (en) Voice instruction recognition method and electronic equipment
CN110866113B (en) Text classification method based on sparse self-attention mechanism fine-tuning burt model
US20230094730A1 (en) Model training method and method for human-machine interaction
US12079211B2 (en) Natural-language processing across multiple languages
CN117273004A (en) Model training method, device and computer readable storage medium
US12045700B1 (en) Systems and methods of generative machine-learning guided by modal classification
CN112687266A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN112487813A (en) Named entity recognition method and system, electronic equipment and storage medium
CN115774772A (en) Sensitive information identification method and device and network equipment
CN120600009A (en) User intention classification method, device, vehicle and medium based on multi-model hierarchy
WO2023130951A1 (en) Speech sentence segmentation method and apparatus, electronic device, and storage medium
CN114528387A (en) Deep learning conversation strategy model construction method and system based on conversation flow bootstrap
CN117633008A (en) Training methods, devices, equipment, storage media and products for sentence conversion models
CN110232108B (en) Man-machine conversation method and conversation system
CN114912006B (en) Data processing method and device
CN115496067A (en) An entity recognition model training method and device, and an entity recognition method and device
CN111383641B (en) Voice recognition method, device and controller

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant