CN111369978A - Data processing method, apparatus, and apparatus for data processing
- Publication number
- CN111369978A CN111369978A CN201811603538.6A CN201811603538A CN111369978A CN 111369978 A CN111369978 A CN 111369978A CN 201811603538 A CN201811603538 A CN 201811603538A CN 111369978 A CN111369978 A CN 111369978A
- Authority
- CN
- China
- Prior art keywords
- speech frame
- decoding
- language
- language type
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
All classifications fall under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING; G10L15/00—Speech recognition:
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a data processing method, a data processing apparatus, and an apparatus for data processing.
Background
Speech recognition technology, also known as ASR (Automatic Speech Recognition), aims to convert the lexical content of speech into computer-readable input such as keystrokes, binary codes, or character sequences.
In everyday language use, multiple languages are often mixed within a single expression. Taking mixed Chinese-English expression as an example, a user speaking Chinese may intersperse English words and phrases, for example "我买了最新款的iPhone" ("I bought the latest iPhone") or "来一首Yesterday once more" ("play Yesterday Once More").
However, current speech recognition technology is fairly accurate for single-language speech; when the speech contains multiple languages, recognition accuracy drops markedly.
Summary of the Invention
Embodiments of the present invention provide a data processing method, a data processing apparatus, and an apparatus for data processing, which can improve the accuracy of speech recognition when the speech contains multiple languages.
In order to solve the above problem, an embodiment of the present invention discloses a data processing method, the method comprising:
determining, according to a multilingual acoustic model, the language type of a speech frame in speech information, wherein the multilingual acoustic model is trained on acoustic data of at least two language types;
decoding the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and
determining, according to the first decoding result, a recognition result corresponding to the speech information.
In another aspect, an embodiment of the present invention discloses a data processing apparatus, the apparatus comprising:
a type determination module, configured to determine, according to a multilingual acoustic model, the language type of a speech frame in speech information, wherein the multilingual acoustic model is trained on acoustic data of at least two language types;
a first decoding module, configured to decode the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and
a result determination module, configured to determine, according to the first decoding result, a recognition result corresponding to the speech information.
In yet another aspect, an embodiment of the present invention discloses an apparatus for data processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
determining, according to a multilingual acoustic model, the language type of a speech frame in speech information, wherein the multilingual acoustic model is trained on acoustic data of at least two language types;
decoding the speech frame according to a decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and
determining, according to the first decoding result, a recognition result corresponding to the speech information.
In still another aspect, an embodiment of the present invention discloses a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform one or more of the data processing methods described above.
Embodiments of the present invention include the following advantages:
In embodiments of the present invention, a multilingual acoustic model can be trained on acoustic data of at least two language types, and the language type of each speech frame in the speech information can be determined with this model. Therefore, when the speech information contains multiple language types, embodiments of the present invention can accurately distinguish speech frames of different language types, and can decode each speech frame with the decoding network of the corresponding language type to obtain the frame's first decoding result. Because the first decoding result is produced by the decoding network matching the language type of the speech frame, decoding accuracy is ensured, which in turn improves the accuracy of speech recognition.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of the steps of an embodiment of a data processing method of the present invention;
Fig. 2 is a structural block diagram of an embodiment of a data processing apparatus of the present invention;
Fig. 3 is a block diagram of an apparatus 800 for data processing according to the present invention; and
Fig. 4 is a schematic structural diagram of a server in some embodiments of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
Method Embodiment
Referring to Fig. 1, a flowchart of the steps of an embodiment of a data processing method of the present invention is shown; the method may specifically include the following steps:
Step 101: determine, according to a multilingual acoustic model, the language type of a speech frame in speech information, wherein the multilingual acoustic model is trained on acoustic data of at least two language types.
Step 102: decode the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame.
Step 103: determine, according to the first decoding result, a recognition result corresponding to the speech information.
The data processing method of the embodiments of the present invention can be used in scenarios where speech information containing at least two language types is to be recognized. The method can be applied to an electronic device, including but not limited to: a server, a smartphone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, an in-vehicle computer, a desktop computer, a set-top box, a smart TV, a wearable device, and so on.
It can be understood that the embodiments of the present invention do not limit how the speech information to be recognized is acquired. For example, the electronic device may acquire it from a client or a network through a wired or wireless connection, may record it in real time, or may obtain it from instant messages received in an instant messaging application.
In the embodiments of the present invention, the speech information to be recognized may be divided into multiple speech frames according to a preset window length and frame shift, where each speech frame is one speech segment, so that the speech information can be decoded frame by frame. If the speech information to be recognized is analog (for example, a recording of a user's call), it must first be converted into digital speech information before segmentation.
Here, the window length represents the duration of each speech segment, and the frame shift represents the time offset between adjacent frames. For example, with a window length of 25 ms and a frame shift of 15 ms, the first speech frame covers 0-25 ms, the second covers 15-40 ms, and so on, thereby segmenting the speech information to be recognized. It can be understood that the specific window length and frame shift can be set according to actual requirements, which the embodiments of the present invention do not limit.
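The windowing scheme above can be sketched as follows. This is a minimal illustration of the 25 ms window / 15 ms shift example only; the function name and the millisecond-based interface are assumptions for illustration, not part of the patent.

```python
def frame_boundaries(total_ms, window_ms=25, shift_ms=15):
    """Split a signal of total_ms milliseconds into (start, end) frame spans,
    using a fixed window length and frame shift as described in the text."""
    frames = []
    start = 0
    while start + window_ms <= total_ms:
        frames.append((start, start + window_ms))
        start += shift_ms  # adjacent frames overlap by window_ms - shift_ms
    return frames

# With a 25 ms window and 15 ms shift, the first frame spans 0-25 ms and
# the second 15-40 ms, matching the example in the text.
print(frame_boundaries(70))  # [(0, 25), (15, 40), (30, 55), (45, 70)]
```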
Optionally, before segmenting the speech information to be recognized, the electronic device may also perform noise reduction on it, so as to improve the subsequent processing of the speech information.
In the embodiments of the present invention, the speech information may be input into a pre-trained multilingual acoustic model, and the speech recognition result is obtained based on the model's output. The multilingual acoustic model may be a classification model combining several neural networks, including but not limited to at least one of, or a combination, superposition, or nesting of at least two of: a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, an RNN (Recurrent Neural Network), an attention network, and so on.
To improve the accuracy of recognizing speech information containing multiple language types, the embodiments of the present invention train a multilingual acoustic model in advance on acoustic data of at least two language types. With this model, the language type of a speech frame in the speech information can be determined; the frame can then be decoded by the decoding network corresponding to that language type to obtain the frame's first decoding result, and the recognition result corresponding to the speech information can be determined from the first decoding results.
It can be understood that the embodiments of the present invention place no limit on the number or identity of the language types covered by the training data. For ease of description, speech information containing the two language types Chinese and English is used as the running example; that is, the multilingual acoustic model may be trained on collected Chinese and English acoustic data. Of course, acoustic data of more than two language types, such as Chinese, English, Japanese, and German, may also be collected to train the model. For application scenarios with more than two language types, the implementation is similar to the two-language case, and the descriptions can be referred to each other.
The decoding networks in the embodiments of the present invention may include decoding networks corresponding to at least two language types. For example, when recognizing mixed Chinese-English speech, a Chinese decoding network and an English decoding network may be constructed separately. Specifically, Chinese text corpora can be collected to train a Chinese language model, and a Chinese decoding network can be built from knowledge sources such as the Chinese language model and a Chinese pronunciation dictionary; likewise, English text corpora can be collected to train an English language model, and an English decoding network can be built from knowledge sources such as the English language model and an English pronunciation dictionary.
While the speech information is decoded frame by frame, if the multilingual acoustic model determines that a frame's language type is Chinese, the frame is decoded by the Chinese decoding network; if the model determines that the frame's language type is English, the frame is decoded by the English decoding network.
In an application example of the present invention, suppose the speech information to be recognized is "我喜欢apple" ("I like apple"). Specifically, the language type of the first speech frame is first determined according to the multilingual acoustic model; assuming it is determined to be Chinese, the first frame is decoded by the Chinese decoding network to obtain its first decoding result. The language type of the second frame is then determined by the multilingual acoustic model, and the second frame is fed into the decoding network corresponding to its language type to obtain its first decoding result. By analogy, suppose the multilingual acoustic model determines that the language type of the m-th frame is English; the m-th frame is then decoded by the English decoding network to obtain its first decoding result, and so on until the last speech frame has been decoded. Finally, the recognition result of the speech information is obtained from the first decoding results of the frames; for example, the recognition result may include the text "我喜欢apple".
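The frame-by-frame routing described above can be sketched as a simple dispatch loop. The classifier and the per-language decoders here are toy stand-ins for the multilingual acoustic model and the language-specific decoding networks; all names are illustrative assumptions, not APIs from the patent.

```python
def recognize(frames, classify_language, decoders):
    """Route each speech frame to the decoder of its predicted language
    and concatenate the per-frame first decoding results.

    classify_language: frame -> language label (stand-in for the
    multilingual acoustic model's per-frame decision).
    decoders: dict mapping a language label to a per-frame decode
    function (stand-in for the language-specific decoding networks).
    """
    results = []
    for frame in frames:
        lang = classify_language(frame)
        results.append(decoders[lang](frame))
    return "".join(results)

# Toy stand-ins: uppercase ASCII "frames" are treated as English.
decoders = {"zh": lambda f: f, "en": lambda f: f.lower()}
classify = lambda f: "en" if f.isupper() else "zh"
print(recognize(["我", "喜", "欢", "APPLE"], classify, decoders))  # 我喜欢apple
```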
It can be seen that, with the trained multilingual acoustic model, the embodiments of the present invention can determine the language type of each speech frame in the speech information and decode each frame with the decoding network of the corresponding language type, so as to obtain a more accurate recognition result.
In an optional embodiment of the present invention, determining the language type of a speech frame in the speech information according to the multilingual acoustic model may specifically include:
Step S11: determine, according to the multilingual acoustic model, the posterior probability of the speech frame for each state, wherein there is a correspondence between states and language types;
Step S12: determine, according to the posterior probabilities of the speech frame for the states and the language type corresponding to each state, the probability ratio of the frame's posterior probability across the states of the different language types; and
Step S13: determine the language type of the speech frame according to the probability ratio.
The multilingual acoustic model converts the input features of the speech frame into posterior probabilities over states. A state may specifically be an HMM (Hidden Markov Model) state: several states correspond to one phoneme, several phonemes correspond to one character, and several characters form a sentence.
For example, suppose the output layer of the multilingual acoustic model outputs posterior probabilities for (M1+M2) states, where the M1 states correspond to Chinese and the M2 states correspond to English.
Feeding a speech frame into the multilingual acoustic model yields the frame's posterior probability for each state. From these posteriors and the language type of each state, the probability ratio of the frame's posterior across the language-type states, such as the ratio between the Chinese-state and English-state probabilities, can be computed, and the frame's language type can then be determined from this ratio.
For example, let p1 be the sum of the posterior probabilities of the M1 Chinese states and p2 the sum of those of the M2 English states, with p1 + p2 = 1. If p1 is greater than p2, the Chinese states carry more of the frame's posterior probability, so the frame's language type can be determined to be Chinese; otherwise, it can be determined to be English.
However, in mixed Chinese-English speech, the English posterior probability is usually small and rarely exceeds 0.5. Therefore, to reduce misjudgment, the embodiments of the present invention may set a preset threshold and determine the language type of the speech frame by comparing the probability ratio against this threshold.
Taking the Chinese-English mixture as an example, suppose the ratio of the frame's English-state posterior to its Chinese-state posterior is p2/p1. If p2/p1 exceeds a preset threshold (e.g., 0.25), the frame's language type can be determined to be English; likewise, the ratio of the Chinese-state posterior to the English-state posterior is p1/p2, and if p1/p2 exceeds 4, the frame's language type can be determined to be Chinese. The preset threshold can be tuned experimentally; it can be understood that the embodiments of the present invention do not limit its specific value.
Of course, since p1 + p2 = 1, the condition p2/p1 > 0.25 is equivalent to p2 > 0.2, so the judgment can also be made simply from the value of p1 or p2 alone.
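The single-frame decision above can be sketched as follows, under the assumption (from the example in the text) that the first M1 output states are Chinese and the rest are English; the function name and the 0.25 default are taken from the example, everything else is illustrative.

```python
def frame_language(posteriors, num_zh_states, threshold=0.25):
    """Decide a frame's language from its state posteriors.

    posteriors: per-state posterior probabilities; the first
    num_zh_states entries belong to Chinese states, the rest to
    English states. The frame is labelled English when p2/p1
    exceeds the threshold (0.25 in the text's example)."""
    p1 = sum(posteriors[:num_zh_states])   # mass on Chinese states
    p2 = sum(posteriors[num_zh_states:])   # mass on English states
    # p2/p1 > threshold, written without dividing to avoid p1 == 0:
    return "en" if p2 > threshold * p1 else "zh"

# p1 = 0.85, p2 = 0.15 -> p2/p1 is below 0.25, so the frame is Chinese.
print(frame_language([0.5, 0.35, 0.1, 0.05], num_zh_states=2))  # zh
```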
In practical applications, if the user switches language types frequently, or the speech information is short, judging a frame's language type from that single frame alone may lead to errors.
To improve the accuracy of determining the language type, in an optional embodiment of the present invention, the language type of a speech frame may be determined from the average, over the consecutive speech frames within a preset window containing the frame, of the probability ratios of the posteriors across the language-type states.
It can be understood that the embodiments of the present invention do not limit the specific value of the preset window length; for example, it may be set to the duration of 10 consecutive speech frames. Specifically, 10 consecutive frames containing the speech frame can be taken, the ratio p2/p1 of the English-state to Chinese-state posteriors can be computed for each of the 10 frames, and the 10 values can be summed and averaged. If the average exceeds the preset threshold of 0.25, the language type of the speech frame can be determined to be English. This avoids the misjudgment that a single-frame decision may incur, and thus improves the accuracy of determining the language type of the speech frame.
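The windowed smoothing above can be sketched in a few lines; the function takes precomputed p2/p1 ratios for a window of consecutive frames, and the 10-frame window and 0.25 threshold follow the example in the text (the function name is an assumption).

```python
def windowed_language(ratios, threshold=0.25):
    """Average the p2/p1 ratios over a window of consecutive frames and
    label the window English when the mean exceeds the threshold."""
    mean = sum(ratios) / len(ratios)
    return "en" if mean > threshold else "zh"

# One noisy frame (ratio 0.9) is outvoted by its nine neighbours:
ratios = [0.9] + [0.1] * 9      # 10-frame window, mean = 0.18
print(windowed_language(ratios))  # zh
```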
在本发明的一种可选实施例中,在所述根据多语言声学模型,确定语音信息中语音帧的语言类型之前,所述方法还可以包括:In an optional embodiment of the present invention, before determining the language type of the speech frame in the speech information according to the multilingual acoustic model, the method may further include:
步骤S21、从所述至少两种语言类型中确定目标语言类型;Step S21, determining the target language type from the at least two language types;
步骤S22、根据所述目标语言类型对应的解码网络,对所述语音信息中的各语音帧进行解码,以得到所述各语音帧的第二解码结果;Step S22, decoding each speech frame in the speech information according to the decoding network corresponding to the target language type, to obtain the second decoding result of each speech frame;
After determining the language type of the speech frames in the speech information according to the multilingual acoustic model, the method may further include:
determining a target speech frame from the speech frames of the speech information, and determining the second decoding result of the target speech frame, wherein the language type of the target speech frame is a non-target language type.
Decoding the speech frame according to the decoding network corresponding to its language type to obtain the first decoding result of the speech frame may specifically include: decoding the target speech frame according to the decoding network corresponding to the language type of the target speech frame, to obtain the first decoding result of the target speech frame.
Determining the recognition result corresponding to the speech information according to the first decoding result may specifically include: replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and using the second decoding result after replacement as the recognition result corresponding to the speech information.
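The two-pass flow of steps S21/S22 and the subsequent replacement can be sketched as follows. This is a simplified illustration, not the actual implementation: the decoder callables, the per-frame granularity, and the `classify` function are hypothetical stand-ins for the decoding networks and the acoustic-model-based language detection.

```python
def recognize_mixed_speech(frames, target_decoder, other_decoders, classify):
    """Two-pass recognition sketch.

    Pass 1: decode every frame with the target-language network
    (second decoding result).
    Pass 2: re-decode frames whose detected language is not the
    target language (first decoding result) and substitute them.
    """
    second = [target_decoder(f) for f in frames]  # second decoding result
    for i, frame in enumerate(frames):
        lang = classify(frame)
        if lang in other_decoders:                # non-target language frame
            second[i] = other_decoders[lang](frame)
    return second
```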
In practical applications, users typically express themselves in a mixture of two language types: most of an utterance uses one language type, and only a small part is interspersed with the other. Moreover, when a span of speech is short — for example, when it contains only a single English word — the lack of context for that word during decoding may make the decoding result inaccurate.
Therefore, in this embodiment of the present invention, a target language type may be determined from the at least two language types; the target language type may be the main language of the mixed-language expression. For example, the target language type may be determined to be Chinese. While decoding the speech information, every speech frame is decoded with the Chinese decoding network to obtain a second decoding result (e.g. R1) for each frame, where R1 is a Chinese decoding result. Since the second decoding result is obtained by decoding a complete piece of speech information, each frame can draw on its context during decoding, which improves the accuracy of the second decoding result.
After the decoding network of the target language type has decoded all speech frames in the speech information, the target speech frames can be determined from the speech frames, where the language type of the target speech frames is a non-target language type. For example, for Chinese-English mixed speech with Chinese as the target language type, English is the non-target language type: the frames whose language type is English are taken as target speech frames, and their English first decoding result (e.g. R2) is determined, where R2 is obtained by decoding the target speech frames with the English decoding network. Finally, replacing the corresponding R1 with R2 yields the recognition result of the speech information.
In an application example of the present invention, suppose the speech to be recognized is "我喜欢apple" ("I like apple") and the target language type is Chinese. Specifically, the speech information is first fed into the multilingual acoustic model to obtain the state posterior probability sequence of each speech frame, and the posterior probabilities of the Chinese states of each frame are decoded with the Chinese decoding network to obtain the second decoding result of each frame. Suppose the second decoding result of the utterance is "我喜欢爱破" (where "爱破" is a Chinese mis-transcription of "apple"). Then, from the posterior probabilities of each frame over the states and the language type corresponding to each state, the language type of each frame is determined, and the frames whose language type is English are taken as target speech frames. The target speech frames are then decoded with the English decoding network to obtain their English first decoding result, assumed to be "apple". Finally, "爱破" in the second decoding result "我喜欢爱破", which corresponds to "apple", is replaced with "apple", yielding the replaced second decoding result "我喜欢apple".
It should be noted that, in the embodiments of the present invention, for a speech frame whose language type is the target language type, the first and second decoding results are the same. In the above example, the speech frames of "我喜欢" have Chinese as their language type, and the target language type is also Chinese, so both the first and second decoding results of those frames are the text "我喜欢".
In an optional embodiment of the present invention, the first decoding result and the second decoding result may include time boundary information of the corresponding speech frames.
Replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame may specifically include:
Step S31: determining, from the second decoding result of the target speech frame, the result to be replaced, wherein the time boundary of the result to be replaced coincides with that of the first decoding result of the language type corresponding to the target speech frame;
Step S32: replacing the result to be replaced with the first decoding result of the language type corresponding to the target speech frame.
To ensure that the second decoding result of the target speech frame is accurately replaced by the first decoding result of the corresponding language type, the first and second decoding results in this embodiment of the present invention may include the time boundary information of the corresponding speech frames.
For example, in the second decoding result "我喜欢爱破" above, each character carries the time boundary information of its corresponding speech frames. From this information, the result to be replaced can be determined so that its time boundary coincides with that of the first decoding result of the target speech frames. Per the example above, that first decoding result is "apple"; if the result in "我喜欢爱破" whose time boundary coincides with that of "apple" is determined to be "爱破", then "爱破" is replaced with "apple", giving the replaced decoding result "我喜欢apple".
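Steps S31 and S32 can be illustrated with a small sketch. The `(text, start, end)` token format is an assumption made here for illustration; the embodiment only requires that each unit of the second decoding result carry time boundary information.

```python
def splice_by_time(second_tokens, replacement, start, end):
    """Replace the tokens of the second decoding result whose time spans
    fall inside [start, end) -- the time boundary of the first decoding
    result -- with the replacement text (steps S31/S32).

    Each token is a (text, t_start, t_end) tuple.
    """
    out, inserted = [], False
    for text, t0, t1 in second_tokens:
        if t0 >= start and t1 <= end:          # token lies in the replaced span
            if not inserted:
                out.append((replacement, start, end))
                inserted = True
        else:
            out.append((text, t0, t1))
    return out
```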
In an optional embodiment of the present invention, the decoding networks may specifically include a general decoding network and a professional decoding network, where the general decoding network may include a language model trained on a general text corpus, and the professional decoding network may include a language model trained on the text corpus of a preset field.
Decoding the speech frame according to the decoding network corresponding to its language type to obtain the first decoding result of the speech frame may specifically include:
Step S41: decoding the speech frame with the general decoding network and with the professional decoding network respectively, to obtain a first score of the speech frame under the general decoding network and a second score of the speech frame under the professional decoding network;
Step S42: taking the decoding result with the higher of the first score and the second score as the first decoding result of the speech frame.
In practical applications, a decoding network usually decodes users' everyday conversational speech well. However, speech from specialized fields — for example, the medical field — usually contains many professional terms, such as "阿斯匹林" (aspirin) and "帕金森症" (Parkinson's disease), which degrades the decoding effect.
To solve this problem, the decoding networks in the embodiments of the present invention may include a general decoding network and a professional decoding network. The general decoding network may be the network used for users' everyday communication and may include a language model trained on a general text corpus, so it can recognize most users' everyday speech. The professional decoding network may be a network customized for a specialized field and may include a language model trained on the text corpus of a preset field; the preset field may be any field, such as medicine, law, or computing.
For example, at a medical seminar a speaker may use many Chinese-English mixed sentences together with a large amount of medical terminology; the embodiments of the present invention can recognize the speaker's speech as text in real time and display it on a large screen for the audience.
Specifically, the speaker's speech can be decoded frame by frame with the general decoding network and the professional decoding network respectively, to obtain the first score of each speech frame under the general decoding network and the second score under the professional decoding network, and the decoding result with the higher of the two scores is taken as the first decoding result of the speech frame.
It can be understood that the decoding networks in the embodiments of the present invention may include decoding networks for multiple language types, and the decoding network of each language type may in turn include a general decoding network and a professional decoding network for that language type. The professional decoding network can thus supplement or correct the results of the general decoding network, improving decoding accuracy when the speech information contains domain-specific vocabulary.
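Steps S41 and S42 amount to a simple score comparison, sketched below. The decoder callables returning `(text, score)` pairs are hypothetical simplifications; real decoders would return scored lattices or n-best lists.

```python
def decode_with_domain(frame, general_decoder, professional_decoder):
    """Decode one speech frame with both the general and the professional
    decoding network and keep the higher-scoring hypothesis (steps S41/S42).

    Each decoder is assumed to return a (text, score) pair.
    """
    general = general_decoder(frame)
    professional = professional_decoder(frame)
    return general if general[1] >= professional[1] else professional
```

For a frame containing a medical term, the professional network's language model would typically assign the higher score, so its hypothesis wins.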
It can be understood that the embodiments of the present invention do not limit the way in which the multilingual acoustic model is trained. In an optional embodiment of the present invention, each item of the acoustic data of the at least two language types corresponds to at least two language types.
Specifically, in this embodiment of the present invention, mixed acoustic data containing at least two language types can be collected to train the multilingual acoustic model, where mixed acoustic data means that every item of data corresponds to at least two language types. For example, the speech of "我喜欢apple" may be one item of mixed acoustic data.
Training a multilingual acoustic model on mixed acoustic data requires merging similar pronunciation units across the different languages to produce a pronunciation dictionary suited to the mixed language; this merging may introduce some error. In addition, mixed acoustic data covering at least two language types is usually sparse and hard to collect, which impairs the recognition accuracy of the multilingual acoustic model.
To solve this problem, in another optional embodiment of the present invention, each item of the acoustic data of the at least two language types corresponds to one language type.
Specifically, in this embodiment of the present invention, monolingual acoustic data can be collected separately for each of the at least two language types, and the multilingual acoustic model can be trained on a training data set composed of the monolingual data of each language type. For example, the speech of "今天天气很好" may be one item of monolingual acoustic data, and the speech of "What's the weather like today" may also be one item of monolingual acoustic data.
In an optional embodiment of the present invention, the training of the multilingual acoustic model may specifically include:
Step S51: training a monolingual acoustic model for each language type from the collected acoustic data of the at least two language types;
Step S52: labeling the states of the acoustic data of the at least two language types according to the monolingual acoustic models, wherein there is a correspondence between the states and the language types;
Step S53: training the multilingual acoustic model on the data set composed of the labeled acoustic data of the at least two language types.
Specifically, a Chinese monolingual acoustic model NN1 can be trained from the collected Chinese acoustic data L1, where the language type of every item in L1 is Chinese. The number of tied HMM states of Chinese speech can be set as the number of nodes in the NN1 output layer, e.g. M1. The output of a monolingual acoustic model may include the state probabilities of one language type; that is, the state probabilities of the M1 output-layer nodes all correspond to the Chinese language type.
Likewise, an English monolingual acoustic model NN2 can be trained from the collected English acoustic data L2, where the language type of every item in L2 is English. The number of tied HMM states of English speech can be set as the number of nodes in the NN2 output layer, e.g. M2, and the state probabilities of the M2 nodes all correspond to the English language type.
Then, the trained NN1 and NN2 are used to force-align the Chinese acoustic data L1 and the English acoustic data L2 respectively, so as to label the states of L1 and L2. Specifically, NN1 can determine the state corresponding to each speech frame of every item in L1, and NN2 can determine the state corresponding to each speech frame of every item in L2.
Finally, the labeled L1 and L2 are mixed together into a labeled data set (L1+L2), on which the multilingual acoustic model NN3 is trained. The output of the multilingual acoustic model may include the state probabilities of at least two language types. For example, the NN3 output layer may have M1+M2 nodes, of which the first M1 nodes may correspond to the Chinese HMM states and the last M2 nodes to the English HMM states.
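The layout of the merged output layer, and one way the monolingual alignment labels could be mapped into it, can be sketched as follows. The concrete values of M1 and M2 are hypothetical, and the offset-based relabeling is an assumption for illustration; the embodiment above only fixes the ordering (first M1 Chinese states, then M2 English states).

```python
M1, M2 = 9000, 6000  # hypothetical numbers of Chinese / English tied HMM states

def state_language(index):
    """In the merged model NN3, the first M1 output nodes carry the Chinese
    states and the next M2 nodes the English states, so the language type
    of a state follows directly from its index."""
    if 0 <= index < M1:
        return "zh"
    if M1 <= index < M1 + M2:
        return "en"
    raise ValueError("index outside the M1+M2 output layer")

def relabel_for_merged_model(alignment, language):
    """Shift the per-frame state labels produced by a monolingual aligner
    (NN1 or NN2) into that language's block of the merged output layer."""
    offset = M1 if language == "en" else 0
    return [state + offset for state in alignment]
```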
In training the multilingual acoustic model, the embodiments of the present invention can use the monolingual acoustic data of each language type, preserving the pronunciation characteristics of each language, so that at the acoustic level the model has a degree of discrimination between the different language types. In addition, collecting the acoustic data of each language type separately avoids the data shortage caused by collecting mixed acoustic data of multiple language types, and therefore can improve the recognition accuracy of the multilingual acoustic model.
In summary, the embodiments of the present invention can train a multilingual acoustic model from the acoustic data of at least two language types, and the model can determine the language type of the speech frames in speech information. When the speech information contains multiple language types, the embodiments of the present invention can therefore accurately distinguish the speech frames of different language types and decode each speech frame with the decoding network of the corresponding language type, to obtain the first decoding result of the speech frame. Since the first decoding result is produced by the decoding network matching the language type of the speech frame, decoding accuracy is ensured and the accuracy of speech recognition can be improved.
It should be noted that, for brevity, the method embodiments are described as a series of combined actions, but those skilled in the art should appreciate that the embodiments of the present invention are not limited by the described order of actions, since according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Moreover, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Device Embodiment
Referring to FIG. 2, a structural block diagram of an embodiment of a data processing device of the present invention is shown. The device may specifically include:
a type determination module 201, configured to determine the language type of the speech frames in speech information according to a multilingual acoustic model, wherein the multilingual acoustic model is trained from the acoustic data of at least two language types;
a first decoding module 202, configured to decode a speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain the first decoding result of the speech frame; and
a result determination module 203, configured to determine the recognition result corresponding to the speech information according to the first decoding result.
Optionally, the type determination module may specifically include:
a probability determination submodule, configured to determine the posterior probabilities of a speech frame over the states according to the multilingual acoustic model, wherein there is a correspondence between the states and the language types;
a ratio determination submodule, configured to determine, from the posterior probabilities of the speech frame over the states and the language type corresponding to each state, the probability ratio of the posterior probabilities over the states of each language type; and
a type determination submodule, configured to determine the language type of the speech frame according to the probability ratio.
Optionally, the device may further include:
a target language determination module, configured to determine a target language type from the at least two language types; and
a second decoding module, configured to decode each speech frame in the speech information according to the decoding network corresponding to the target language type, to obtain the second decoding result of each speech frame.
The device may further include:
a target frame determination module, configured to determine a target speech frame from the speech frames of the speech information and to determine the second decoding result of the target speech frame, wherein the language type of the target speech frame is a non-target language type.
The first decoding module may specifically include:
a first decoding submodule, configured to decode the target speech frame according to the decoding network corresponding to the language type of the target speech frame, to obtain the first decoding result of the target speech frame.
The result determination module may specifically include:
a first result determination submodule, configured to replace the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and to use the second decoding result after replacement as the recognition result corresponding to the speech information.
Optionally, the first decoding result and the second decoding result include time boundary information of the corresponding speech frames.
The first result determination submodule may specifically include:
a result determination unit, configured to determine, from the second decoding result of the target speech frame, the result to be replaced, wherein the time boundary of the result to be replaced coincides with that of the first decoding result of the language type corresponding to the target speech frame; and
a replacement unit, configured to replace the result to be replaced with the first decoding result of the language type corresponding to the target speech frame.
Optionally, the decoding networks may specifically include a general decoding network and a professional decoding network, where the general decoding network includes a language model trained on a general text corpus, and the professional decoding network includes a language model trained on the text corpus of a preset field.
The first decoding module may specifically include:
a score determination submodule, configured to decode the speech frame with the general decoding network and the professional decoding network respectively, to obtain the first score of the speech frame under the general decoding network and the second score of the speech frame under the professional decoding network; and
a second result determination submodule, configured to take the decoding result with the higher of the first score and the second score as the first decoding result of the speech frame.
Optionally, the device may further include a model training module, configured to train the multilingual acoustic model; the model training module may specifically include:
a first training submodule, configured to train a monolingual acoustic model for each language type from the collected acoustic data of the at least two language types;
a state labeling submodule, configured to label the states of the acoustic data of the at least two language types according to the monolingual acoustic models, wherein there is a correspondence between the states and the language types; and
a second training submodule, configured to train the multilingual acoustic model on the data set composed of the labeled acoustic data of the at least two language types.
Optionally, each item of the acoustic data of the at least two language types corresponds to at least two language types; or, each item of the acoustic data of the at least two language types corresponds to one language type.
Since the device embodiments are substantially similar to the method embodiments, they are described relatively simply; for related details, refer to the description of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may refer to one another.
As for the devices of the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
An embodiment of the present invention provides a device for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for: determining the language type of the speech frames in speech information according to a multilingual acoustic model, wherein the multilingual acoustic model is trained from the acoustic data of at least two language types; decoding a speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain the first decoding result of the speech frame; and determining the recognition result corresponding to the speech information according to the first decoding result.
FIG. 3 is a block diagram of a device 800 for data processing according to an exemplary embodiment. For example, the device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 3, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 to execute instructions, so as to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate interaction between the processing component 802 and the other components; for example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support the operation of the device 800. Examples of such data include instructions of any application or method operated on the device 800, contact data, phone-book data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 supplies power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the panel; a touch sensor may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera, which can receive external multimedia data when the device 800 is in an operating mode such as a shooting mode or a video mode. Each front or rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the device 800 is in an operating mode such as a call mode, a recording mode, or a speech processing mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the device 800. For example, the sensor component 814 may detect the open/closed state of the device 800 and the relative positioning of components (e.g., the display and keypad of the device 800), as well as a change in position of the device 800 or of one of its components, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and changes in its temperature. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, and a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication; for example, the NFC module may be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the method described above.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, executable by the processor 820 of the apparatus 800 to perform the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
FIG. 4 is a schematic structural diagram of a server in some embodiments of the present invention. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (for example, one or more processors), memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing application programs 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transient or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
A non-transitory computer-readable storage medium is provided. When the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform the data processing method shown in FIG. 1.
A non-transitory computer-readable storage medium is provided. When the instructions in the storage medium are executed by a processor of an apparatus (a server or a terminal), the apparatus is enabled to perform a data processing method, the method including: determining the language type of speech frames in speech information according to a multilingual acoustic model, where the multilingual acoustic model is trained on acoustic data of at least two language types; decoding a speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and determining, according to the first decoding result, the recognition result corresponding to the speech information.
An embodiment of the present invention discloses A1, a data processing method, including: determining the language type of speech frames in speech information according to a multilingual acoustic model, where the multilingual acoustic model is trained on acoustic data of at least two language types;
decoding the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and
determining, according to the first decoding result, the recognition result corresponding to the speech information.
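The three operations of A1 compose into a simple per-frame pipeline. The sketch below is only an illustration of that composition, not the patented implementation: the frame classifier and the per-language decoding networks are stubbed out, and all names (`recognize`, `classify_frame`, `decoders`) are assumptions.

```python
# Illustrative sketch of method A1: per-frame language identification,
# decoding with the matching language's network, then assembling the
# recognition result. Classifier and decoders are stand-in stubs.

def recognize(frames, classify_frame, decoders):
    """decoders: dict mapping language type -> decoding-network function."""
    pieces = []
    for frame in frames:
        lang = classify_frame(frame)           # step 1: language type of the frame
        pieces.append(decoders[lang](frame))   # step 2: first decoding result
    return " ".join(pieces)                    # step 3: recognition result

# Stub classifier and decoding networks for a Mandarin/English mix.
classify = lambda f: "en" if f.isascii() else "zh"
decoders = {"zh": lambda f: f + "(zh)", "en": lambda f: f + "(en)"}
print(recognize(["你好", "hello"], classify, decoders))  # -> "你好(zh) hello(en)"
```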
A2. The method according to A1, where determining the language type of speech frames in the speech information according to the multilingual acoustic model includes:
determining, according to the multilingual acoustic model, the posterior probabilities of a speech frame for the respective states, where there is a correspondence between states and language types;
determining, according to the posterior probabilities of the speech frame for the respective states and the language types corresponding to the states, the probability ratio of the speech frame's posterior probability across the states of each language type; and
determining the language type of the speech frame according to the probability ratio.
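A minimal sketch of the A2 decision, with the acoustic model stubbed out as a posterior vector: the posterior mass of each language's states is accumulated, the per-language ratio is computed, and the dominant language wins. The state-to-language mapping, the argmax policy, and all names (`language_of_frame`, `state_langs`) are illustrative assumptions, not the patented implementation.

```python
# Sketch of claim A2: decide a frame's language type from the acoustic
# model's state posteriors, using per-language probability ratios.

def language_of_frame(posteriors, state_langs):
    """posteriors: per-state posterior probabilities for one frame.
    state_langs: language type of each state (state <-> language mapping)."""
    mass = {}  # total posterior mass per language type
    for p, lang in zip(posteriors, state_langs):
        mass[lang] = mass.get(lang, 0.0) + p
    total = sum(mass.values()) or 1.0
    # Ratio of each language's states within the frame's posterior.
    ratios = {lang: m / total for lang, m in mass.items()}
    # The language whose states carry the largest share of the posterior wins.
    return max(ratios, key=ratios.get)

# Example: states 0-2 belong to Mandarin, states 3-4 to English.
langs = ["zh", "zh", "zh", "en", "en"]
print(language_of_frame([0.1, 0.2, 0.1, 0.4, 0.2], langs))  # -> "en"
```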
A3. The method according to A1, where before determining the language type of speech frames in the speech information according to the multilingual acoustic model, the method further includes:
determining a target language type from the at least two language types; and
decoding each speech frame in the speech information according to the decoding network corresponding to the target language type, to obtain a second decoding result for each speech frame;
after determining the language type of speech frames in the speech information according to the multilingual acoustic model, the method further includes:
determining, from the speech frames of the speech information, a target speech frame, and determining the second decoding result of the target speech frame, where the language type of the target speech frame is a non-target language type;
the decoding of the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain the first decoding result of the speech frame, includes:
decoding the target speech frame according to the decoding network corresponding to the language type of the target speech frame, to obtain a first decoding result of the target speech frame;
the determining, according to the first decoding result, of the recognition result corresponding to the speech information includes:
replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and using the replaced second decoding result as the recognition result corresponding to the speech information.
A4. The method according to A3, where the first decoding result and the second decoding result include time boundary information of the corresponding speech frames;
the replacing of the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame includes:
determining, from the second decoding result of the target speech frame, a result to be replaced, where the result to be replaced coincides with the time boundary of the first decoding result of the language type corresponding to the target speech frame; and
replacing the result to be replaced with the first decoding result of the language type corresponding to the target speech frame.
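The replacement step of A3/A4 can be sketched as a splice keyed on time boundaries: any segment of the target-language pass whose boundaries coincide with a re-decoded non-target segment is swapped for that segment's first decoding result. The `(start, end, text)` segment shape and the function name are illustrative assumptions; a real decoder would emit lattices or word alignments rather than tuples.

```python
# Sketch of claims A3/A4: splice first decoding results into the second
# decoding result at coinciding time boundaries.

def splice_results(second_pass, first_pass):
    """second_pass: segments from the target-language decoding network.
    first_pass: segments re-decoded with the non-target language's network.
    Segments whose time boundaries coincide are replaced."""
    replacements = {(s, e): text for s, e, text in first_pass}
    merged = []
    for start, end, text in second_pass:
        # Swap in the first-pass result whose boundary matches exactly.
        merged.append((start, end, replacements.get((start, end), text)))
    return " ".join(text for _, _, text in merged)

# Example: a Mandarin (target-language) pass mangled an English span,
# which is re-decoded by the English network and spliced back in.
second = [(0.0, 1.2, "今天天气"), (1.2, 2.0, "<garbled>"), (2.0, 2.8, "不错")]
first = [(1.2, 2.0, "sunny")]
print(splice_results(second, first))  # -> "今天天气 sunny 不错"
```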
A5. The method according to A1, where the decoding network includes a general decoding network and a specialized decoding network; the general decoding network includes a language model trained on a general text corpus, and the specialized decoding network includes a language model trained on a text corpus of a preset domain;
the decoding of the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain the first decoding result of the speech frame, includes:
decoding the speech frame according to the general decoding network and the specialized decoding network respectively, to obtain a first score of the speech frame for the general decoding network and a second score of the speech frame for the specialized decoding network; and
using the higher-scoring decoding result of the first score and the second score as the first decoding result of the speech frame.
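The A5 selection reduces to decoding with both networks and keeping the better hypothesis. In the sketch below the decoding networks are stand-in callables returning a `(text, score)` pair, and the higher-is-better score convention is an assumption; the patent does not fix a score scale.

```python
# Sketch of claim A5: decode with both a general and a specialized
# (domain) decoding network and keep the higher-scoring hypothesis.

def best_decoding(frame, general_net, special_net):
    general_text, general_score = general_net(frame)   # first score
    special_text, special_score = special_net(frame)   # second score
    # Keep whichever network scored the frame higher.
    if special_score > general_score:
        return special_text
    return general_text

# Stub decoders standing in for real decoding networks; a medical-domain
# network outscores the general one on in-domain speech.
general = lambda f: ("set an alarm", 0.61)
medical = lambda f: ("CT angiogram", 0.83)
print(best_decoding(None, general, medical))  # -> "CT angiogram"
```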
A6. The method according to A1, where the training of the multilingual acoustic model includes:
training, according to collected acoustic data of at least two language types, a monolingual acoustic model for each language type;
performing state labeling on the acoustic data of the at least two language types according to the respective monolingual acoustic models, where there is a correspondence between states and language types; and
training the multilingual acoustic model on a data set composed of the labeled acoustic data of the at least two language types.
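The A6 training steps can be sketched as a pipeline: per-language monolingual models produce state labels for their own data, and one multilingual model is then trained on the pooled labeled set. The trainer and aligner interfaces here are illustrative assumptions standing in for a real acoustic-model toolkit, not the patented procedure.

```python
# Sketch of claim A6: train monolingual models per language, state-label
# each language's data with its own model, then train one multilingual
# model on the pooled labeled data set.

def train_multilingual_model(corpora, train_mono, align_states, train_multi):
    """corpora: dict mapping language type -> list of utterances."""
    labeled = []
    for lang, utterances in corpora.items():
        # Step 1: train a monolingual acoustic model for this language type.
        mono = train_mono(lang, utterances)
        # Step 2: state-label this language's data with its own model,
        # preserving the state <-> language-type correspondence.
        labeled += [(utt, align_states(mono, utt), lang) for utt in utterances]
    # Step 3: train the multilingual model on the pooled labeled set.
    return train_multi(labeled)

# Stub trainers/aligner; train_multi just reports the pooled data size.
mono = lambda lang, utts: lang + "-model"
align = lambda model, utt: [0, 1]
multi = lambda data: len(data)
print(train_multilingual_model({"zh": ["a", "b"], "en": ["c"]}, mono, align, multi))  # -> 3
```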
A7. The method according to any one of A1 to A6, where each item of the acoustic data of the at least two language types corresponds to at least two language types; or each item of the acoustic data of the at least two language types corresponds to one language type.
An embodiment of the present invention discloses B8, a data processing apparatus, including:
a type determination module, configured to determine the language type of speech frames in speech information according to a multilingual acoustic model, where the multilingual acoustic model is trained on acoustic data of at least two language types;
a first decoding module, configured to decode the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and
a result determination module, configured to determine, according to the first decoding result, the recognition result corresponding to the speech information.
B9. The apparatus according to B8, where the type determination module includes:
a probability determination submodule, configured to determine, according to the multilingual acoustic model, the posterior probabilities of a speech frame for the respective states, where there is a correspondence between states and language types;
a ratio determination submodule, configured to determine, according to the posterior probabilities of the speech frame for the respective states and the language types corresponding to the states, the probability ratio of the speech frame's posterior probability across the states of each language type; and
a type determination submodule, configured to determine the language type of the speech frame according to the probability ratio.
B10. The apparatus according to B8, further including:
a target language determination module, configured to determine a target language type from the at least two language types; and
a second decoding module, configured to decode each speech frame in the speech information according to the decoding network corresponding to the target language type, to obtain a second decoding result for each speech frame;
the apparatus further includes:
a target frame determination module, configured to determine a target speech frame from the speech frames of the speech information, and to determine the second decoding result of the target speech frame, where the language type of the target speech frame is a non-target language type;
the first decoding module includes:
a first decoding submodule, configured to decode the target speech frame according to the decoding network corresponding to the language type of the target speech frame, to obtain a first decoding result of the target speech frame;
the result determination module includes:
a first result determination submodule, configured to replace the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and to use the replaced second decoding result as the recognition result corresponding to the speech information.
B11. The apparatus according to B10, where the first decoding result and the second decoding result include time boundary information of the corresponding speech frames;
the first result determination submodule includes:
a result determination unit, configured to determine, from the second decoding result of the target speech frame, a result to be replaced, where the result to be replaced coincides with the time boundary of the first decoding result of the language type corresponding to the target speech frame; and
a replacement unit, configured to replace the result to be replaced with the first decoding result of the language type corresponding to the target speech frame.
B12. The apparatus according to B8, where the decoding network includes a general decoding network and a specialized decoding network; the general decoding network includes a language model trained on a general text corpus, and the specialized decoding network includes a language model trained on a text corpus of a preset domain;
the first decoding module includes:
a score determination submodule, configured to decode the speech frame according to the general decoding network and the specialized decoding network respectively, to obtain a first score of the speech frame for the general decoding network and a second score of the speech frame for the specialized decoding network; and
a second result determination submodule, configured to use the higher-scoring decoding result of the first score and the second score as the first decoding result of the speech frame.
B13. The apparatus according to B8, further including a model training module configured to train the multilingual acoustic model; the model training module includes:
a first training submodule, configured to train, according to collected acoustic data of at least two language types, a monolingual acoustic model for each language type;
a state labeling submodule, configured to perform state labeling on the acoustic data of the at least two language types according to the respective monolingual acoustic models, where there is a correspondence between states and language types; and
a second training submodule, configured to train the multilingual acoustic model on a data set composed of the labeled acoustic data of the at least two language types.
B14. The apparatus according to any one of B8 to B13, where each item of the acoustic data of the at least two language types corresponds to at least two language types; or each item of the acoustic data of the at least two language types corresponds to one language type.
An embodiment of the present invention discloses C15, an apparatus for data processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
determining the language type of speech frames in speech information according to a multilingual acoustic model, where the multilingual acoustic model is trained on acoustic data of at least two language types;
decoding the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain a first decoding result of the speech frame; and
determining, according to the first decoding result, the recognition result corresponding to the speech information.
C16. The apparatus according to C15, where determining the language type of speech frames in the speech information according to the multilingual acoustic model includes:
determining, according to the multilingual acoustic model, the posterior probabilities of a speech frame for the respective states, where there is a correspondence between states and language types;
determining, according to the posterior probabilities of the speech frame for the respective states and the language types corresponding to the states, the probability ratio of the speech frame's posterior probability across the states of each language type; and
determining the language type of the speech frame according to the probability ratio.
C17. The apparatus according to C15, where the apparatus is further configured such that execution of the one or more programs by the one or more processors includes instructions for:
determining a target language type from the at least two language types; and
decoding each speech frame in the speech information according to the decoding network corresponding to the target language type, to obtain a second decoding result for each speech frame;
the apparatus is further configured such that execution of the one or more programs by the one or more processors includes instructions for:
determining, from the speech frames of the speech information, a target speech frame, and determining the second decoding result of the target speech frame, where the language type of the target speech frame is a non-target language type;
the decoding of the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain the first decoding result of the speech frame, includes:
decoding the target speech frame according to the decoding network corresponding to the language type of the target speech frame, to obtain a first decoding result of the target speech frame;
the determining, according to the first decoding result, of the recognition result corresponding to the speech information includes:
replacing the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame, and using the replaced second decoding result as the recognition result corresponding to the speech information.
C18. The apparatus according to C17, where the first decoding result and the second decoding result include time boundary information of the corresponding speech frames;
the replacing of the second decoding result of the target speech frame with the first decoding result of the language type corresponding to the target speech frame includes:
determining, from the second decoding result of the target speech frame, a result to be replaced, where the result to be replaced coincides with the time boundary of the first decoding result of the language type corresponding to the target speech frame; and
replacing the result to be replaced with the first decoding result of the language type corresponding to the target speech frame.
C19. The apparatus according to C15, where the decoding network includes a general decoding network and a specialized decoding network; the general decoding network includes a language model trained on a general text corpus, and the specialized decoding network includes a language model trained on a text corpus of a preset domain;
the decoding of the speech frame according to the decoding network corresponding to the language type of the speech frame, to obtain the first decoding result of the speech frame, includes:
decoding the speech frame according to the general decoding network and the specialized decoding network respectively, to obtain a first score of the speech frame for the general decoding network and a second score of the speech frame for the specialized decoding network; and
using the higher-scoring decoding result of the first score and the second score as the first decoding result of the speech frame.
C20. The apparatus according to C15, where the training of the multilingual acoustic model includes:
training, according to collected acoustic data of at least two language types, a monolingual acoustic model for each language type;
performing state labeling on the acoustic data of the at least two language types according to the respective monolingual acoustic models, where there is a correspondence between states and language types; and
training the multilingual acoustic model on a data set composed of the labeled acoustic data of the at least two language types.
C21. The apparatus according to any one of C15 to C20, where each item of the acoustic data of the at least two language types corresponds to at least two language types; or each item of the acoustic data of the at least two language types corresponds to one language type.
An embodiment of the present invention discloses D22, a machine-readable medium having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the data processing method according to one or more of A1 to A7.
Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art not disclosed by this disclosure. The specification and examples are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the present invention is not limited to the precise structures described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.
The data processing method, data processing apparatus, and apparatus for data processing provided by the present invention have been described above in detail. Specific examples have been used herein to illustrate the principles and implementations of the present invention; the description of the above embodiments is intended only to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in specific implementations and in the scope of application in accordance with the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201811603538.6A | 2018-12-26 | 2018-12-26 | A data processing method, a data processing device and a data processing device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN111369978A | 2020-07-03 |
| CN111369978B | 2024-05-17 |
Family
ID=71208723
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201811603538.6A (granted as CN111369978B, Active) | A data processing method, a data processing device and a data processing device | 2018-12-26 | 2018-12-26 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN111369978B |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883113A (en) * | 2020-07-30 | 2020-11-03 | 云知声智能科技股份有限公司 | Voice recognition method and device |
WO2021179701A1 (en) * | 2020-10-19 | 2021-09-16 | 平安科技(深圳)有限公司 | Multilingual speech recognition method and apparatus, and electronic device |
CN113593531A (en) * | 2021-07-30 | 2021-11-02 | 思必驰科技股份有限公司 | Speech recognition model training method and system |
CN113990351A (en) * | 2021-11-01 | 2022-01-28 | 苏州声通信息科技有限公司 | Sound correction method, sound correction device and non-transitory storage medium |
JP2022020061A (en) * | 2020-12-01 | 2022-01-31 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Mixed Chinese and English voice recognition methods, devices, electronic devices and storage media |
CN116364097A (en) * | 2021-12-27 | 2023-06-30 | 中国移动通信有限公司研究院 | Data processing method and device, equipment and storage medium |
Citations (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004049308A1 (en) * | 2002-11-22 | 2004-06-10 | Koninklijke Philips Electronics N.V. | Speech recognition device and method |
US20070130112A1 (en) * | 2005-06-30 | 2007-06-07 | Intelligentek Corp. | Multimedia conceptual search system and associated search method |
CN101201892A (en) * | 2005-12-20 | 2008-06-18 | 游旭 | Voice coding talking book and pickup main body |
EP2058799A1 (en) * | 2007-11-02 | 2009-05-13 | Harman/Becker Automotive Systems GmbH | Method for preparing data for speech recognition and speech recognition system |
CN101604522A (en) * | 2009-07-16 | 2009-12-16 | 北京森博克智能科技有限公司 | The embedded Chinese and English mixing voice recognition methods and the system of unspecified person |
CN104143329A (en) * | 2013-08-19 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and device for conducting voice keyword search |
CN104143328A (en) * | 2013-08-15 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and device for detecting keywords |
CN104681036A (en) * | 2014-11-20 | 2015-06-03 | 苏州驰声信息科技有限公司 | System and method for detecting language voice frequency |
CN105513589A (en) * | 2015-12-18 | 2016-04-20 | 百度在线网络技术(北京)有限公司 | Speech recognition method and speech recognition device |
US20170011735A1 (en) * | 2015-07-10 | 2017-01-12 | Electronics And Telecommunications Research Institute | Speech recognition system and method |
KR20170007107A (en) * | 2015-07-10 | 2017-01-18 | 한국전자통신연구원 | Speech recognition system and method |
EP3133595A1 (en) * | 2015-08-20 | 2017-02-22 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method |
CN107195295A (en) * | 2017-05-04 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Audio recognition method and device based on Chinese and English mixing dictionary |
CN107403620A (en) * | 2017-08-16 | 2017-11-28 | 广东海翔教育科技有限公司 | A kind of audio recognition method and device |
CN107408111A (en) * | 2015-11-25 | 2017-11-28 | 百度(美国)有限责任公司 | End-to-End Speech Recognition |
CN107526826A (en) * | 2017-08-31 | 2017-12-29 | 百度在线网络技术(北京)有限公司 | Phonetic search processing method, device and server |
US20180189259A1 (en) * | 2016-12-30 | 2018-07-05 | Facebook, Inc. | Identifying multiple languages in a content item |
WO2018153213A1 (en) * | 2017-02-24 | 2018-08-30 | 芋头科技(杭州)有限公司 | Multi-language hybrid speech recognition method |
CN108630192A (en) * | 2017-03-16 | 2018-10-09 | 清华大学 | A kind of non-methods for mandarin speech recognition, system and its building method |
CN108711420A (en) * | 2017-04-10 | 2018-10-26 | 北京猎户星空科技有限公司 | Multilingual hybrid model foundation, data capture method and device, electronic equipment |
CN108986791A (en) * | 2018-08-10 | 2018-12-11 | 南京航空航天大学 | For the Chinese and English languages audio recognition method and system in civil aviaton's land sky call field |
-
2018
- 2018-12-26 CN CN201811603538.6A patent/CN111369978B/en active Active
Patent Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1714390A (en) * | 2002-11-22 | 2005-12-28 | 皇家飞利浦电子股份有限公司 | Speech recognition device and method |
WO2004049308A1 (en) * | 2002-11-22 | 2004-06-10 | Koninklijke Philips Electronics N.V. | Speech recognition device and method |
US20070130112A1 (en) * | 2005-06-30 | 2007-06-07 | Intelligentek Corp. | Multimedia conceptual search system and associated search method |
CN101201892A (en) * | 2005-12-20 | 2008-06-18 | 游旭 | Voice coding talking book and pickup main body |
EP2058799A1 (en) * | 2007-11-02 | 2009-05-13 | Harman/Becker Automotive Systems GmbH | Method for preparing data for speech recognition and speech recognition system |
CN101604522A (en) * | 2009-07-16 | 2009-12-16 | 北京森博克智能科技有限公司 | Speaker-independent embedded Chinese-English mixed speech recognition method and system |
US20150095032A1 (en) * | 2013-08-15 | 2015-04-02 | Tencent Technology (Shenzhen) Company Limited | Keyword Detection For Speech Recognition |
CN104143328A (en) * | 2013-08-15 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and device for detecting keywords |
CN104143329A (en) * | 2013-08-19 | 2014-11-12 | 腾讯科技(深圳)有限公司 | Method and device for conducting voice keyword search |
CN104681036A (en) * | 2014-11-20 | 2015-06-03 | 苏州驰声信息科技有限公司 | System and method for detecting language voice frequency |
US20170011735A1 (en) * | 2015-07-10 | 2017-01-12 | Electronics And Telecommunications Research Institute | Speech recognition system and method |
KR20170007107A (en) * | 2015-07-10 | 2017-01-18 | 한국전자통신연구원 | Speech recognition system and method |
EP3133595A1 (en) * | 2015-08-20 | 2017-02-22 | Samsung Electronics Co., Ltd. | Speech recognition apparatus and method |
CN107408111A (en) * | 2015-11-25 | 2017-11-28 | 百度(美国)有限责任公司 | End-to-End Speech Recognition |
CN105513589A (en) * | 2015-12-18 | 2016-04-20 | 百度在线网络技术(北京)有限公司 | Speech recognition method and speech recognition device |
US20180189259A1 (en) * | 2016-12-30 | 2018-07-05 | Facebook, Inc. | Identifying multiple languages in a content item |
WO2018153213A1 (en) * | 2017-02-24 | 2018-08-30 | 芋头科技(杭州)有限公司 | Multi-language hybrid speech recognition method |
CN108630192A (en) * | 2017-03-16 | 2018-10-09 | 清华大学 | Non-Mandarin speech recognition method, system, and construction method thereof |
CN108711420A (en) * | 2017-04-10 | 2018-10-26 | 北京猎户星空科技有限公司 | Multilingual hybrid model establishment, data acquisition method and device, and electronic equipment |
CN107195295A (en) * | 2017-05-04 | 2017-09-22 | 百度在线网络技术(北京)有限公司 | Speech recognition method and device based on a mixed Chinese-English dictionary |
CN107403620A (en) * | 2017-08-16 | 2017-11-28 | 广东海翔教育科技有限公司 | Speech recognition method and device |
CN107526826A (en) * | 2017-08-31 | 2017-12-29 | 百度在线网络技术(北京)有限公司 | Voice search processing method, device and server |
CN108986791A (en) * | 2018-08-10 | 2018-12-11 | 南京航空航天大学 | Chinese-English speech recognition method and system for the civil aviation ground-air communication domain |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883113A (en) * | 2020-07-30 | 2020-11-03 | 云知声智能科技股份有限公司 | Voice recognition method and device |
CN111883113B (en) * | 2020-07-30 | 2024-01-30 | 云知声智能科技股份有限公司 | Voice recognition method and device |
WO2021179701A1 (en) * | 2020-10-19 | 2021-09-16 | 平安科技(深圳)有限公司 | Multilingual speech recognition method and apparatus, and electronic device |
JP2022020061A (en) * | 2020-12-01 | 2022-01-31 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Mixed Chinese and English voice recognition methods, devices, electronic devices and storage media |
JP7204861B2 (en) | 2020-12-01 | 2023-01-16 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Recognition method, device, electronic device and storage medium for mixed Chinese and English speech |
CN113593531A (en) * | 2021-07-30 | 2021-11-02 | 思必驰科技股份有限公司 | Speech recognition model training method and system |
CN113593531B (en) * | 2021-07-30 | 2024-05-03 | 思必驰科技股份有限公司 | Speech recognition model training method and system |
CN113990351A (en) * | 2021-11-01 | 2022-01-28 | 苏州声通信息科技有限公司 | Sound correction method, sound correction device and non-transitory storage medium |
CN116364097A (en) * | 2021-12-27 | 2023-06-30 | 中国移动通信有限公司研究院 | Data processing method and device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111369978B (en) | 2024-05-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107632980B (en) | Voice translation method and device for voice translation | |
CN111369978B (en) | A data processing method, device, and apparatus for data processing | |
CN107291690B (en) | Punctuation adding method and device and punctuation adding device | |
CN107221330B (en) | Punctuation adding method and device and punctuation adding device | |
CN111145756B (en) | Voice recognition method and device for voice recognition | |
CN106971723B (en) | Voice processing method and device for voice processing | |
CN110210310B (en) | Video processing method and device for video processing | |
US20160078020A1 (en) | Speech translation apparatus and method | |
CN111128183B (en) | Speech recognition method, apparatus and medium | |
CN107564526B (en) | Processing method, apparatus and machine-readable medium | |
CN107291704B (en) | Processing method and device for processing | |
CN111435595B (en) | Text regularization method and device | |
CN110992942B (en) | Voice recognition method and device for voice recognition | |
CN107274903B (en) | Text processing method and device for text processing | |
CN108073572B (en) | Information processing method and device, simultaneous interpretation system | |
CN111640452B (en) | Data processing method and device for data processing | |
CN108628819B (en) | Processing method and device for processing | |
CN111160047A (en) | Data processing method and device and data processing device | |
CN111079422A (en) | Keyword extraction method, device and storage medium | |
CN107424612B (en) | Processing method, apparatus and machine-readable medium | |
CN111090998A (en) | A sign language conversion method, device, and apparatus for sign language conversion | |
CN111381685B (en) | A sentence association method and device | |
CN112151072B (en) | Voice processing method, device and medium | |
CN111832297B (en) | Part-of-speech tagging method, device and computer-readable storage medium | |
CN113589954B (en) | Data processing method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TG01 | Patent term adjustment | ||