CN111816160A - Mandarin and Cantonese hybrid speech recognition model training method and system - Google Patents
- Publication number: CN111816160A (application CN202010737658.6A)
- Authority: CN (China)
- Prior art keywords: training, network layers, model, speech recognition, cantonese
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for training a hybrid Mandarin-Cantonese speech recognition model, comprising: training a multi-task model with mixed speech training samples of N languages, the multi-task model comprising a plurality of shared network layers and N task neural network layers, one per language, connected to the last of the shared network layers; and migrating the network parameters of the plurality of shared network layers to a speech recognition model to be trained, so as to complete the training of that model. Embodiments of the invention first train a multi-task model on mixed speech samples of multiple languages, then reuse the multi-task model's network parameters through transfer learning, and train a hybrid Mandarin-Cantonese speech recognition model based on joint Mandarin-Cantonese modeling. This solves the problem of recognizing mixed Mandarin and Cantonese speech without major modification of the existing recognition service; existing assets can be reused, reducing both model training cost and service development cost.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method and system for training a hybrid Mandarin-Cantonese speech recognition model.
Background Art
With the continuous development of mobile terminal devices and speech recognition technology, several mixed Mandarin-dialect speech recognition solutions have appeared. The iFLYTEK voice input method, Baidu voice input method, Sogou input method, and Alibaba's intelligent customer service, for example, all support mixed Mandarin and dialect speech recognition.
The existing solutions are all deep-learning-based algorithms; each uses acoustic modeling units suited to its own circumstances and supports simultaneous recognition of multiple languages through its own acoustic training pipeline and algorithms.
There are two common solutions. The first uses a language classifier to determine which language an utterance belongs to and then feeds the audio into the corresponding speech recognition system, as shown in Figure 1. This approach, however, requires a dedicated language identification module for language classification, so the recognition result depends on that module's classification performance; when the language identification module or classifier is unstable, recognition quality suffers. Overall accuracy compounds the language classification accuracy with the accuracy of the downstream recognition system, and is therefore lower than that of a single recognition system, making it difficult for this scheme to stay robust across scenarios. Moreover, multiple speech recognition systems must be deployed on the server, which carries a high engineering cost.
The second solution is hybrid speech recognition: the modeling units of multiple languages are merged, the audio and text data of the different languages are mixed, and the conventional training pipeline is reused; alternatively, the dictionaries, training data, and corpus texts of the languages are merged before reusing the conventional pipeline. This hybrid method is easy to implement and its engineering cost is relatively low, but because it pools the data of several languages for training, it is difficult in practice to keep the training data balanced across languages. Languages also differ in pronunciation, so unbalanced or poorly chosen data leads to an uneven distribution of the languages' phonemes in the training set, and the trained model is biased toward the language with more data. Such a hybrid system performs far worse than separate per-language recognizers, the overall recognition rate cannot cover every language well, and ASR (Automatic Speech Recognition) performance degrades considerably.
Summary of the Invention
Embodiments of the present invention provide a method and system for training a hybrid Mandarin-Cantonese speech recognition model, and a hybrid Mandarin-Cantonese speech recognition method and system, to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a method for training a hybrid Mandarin-Cantonese speech recognition model, comprising:
training a multi-task model with mixed speech training samples of N languages, the multi-task model comprising a plurality of shared network layers and N task neural network layers, one per language, connected to the last of the shared network layers;
migrating the network parameters of the plurality of shared network layers to a speech recognition model to be trained, so as to complete the training of that model.
In a second aspect, an embodiment of the present invention provides a hybrid Mandarin-Cantonese speech recognition method, comprising: inputting mixed Mandarin and dialect speech into a speech recognition model trained with the hybrid Mandarin-Cantonese speech recognition model training method described in the embodiments of the present invention, to perform hybrid speech recognition.
In a third aspect, an embodiment of the present invention provides a hybrid Mandarin-Cantonese speech recognition model training system, comprising:
a multi-task model training module, configured to train a multi-task model with mixed speech training samples of N languages, the multi-task model comprising a plurality of shared network layers and N task neural network layers, one per language, connected to the last of the shared network layers;
a speech recognition model training module, configured to migrate the network parameters of the plurality of shared network layers to a speech recognition model to be trained, so as to complete the training of that model.
In a fourth aspect, an embodiment of the present invention provides a hybrid Mandarin-Cantonese speech recognition system, comprising:
a speech recognition model, trained with the hybrid Mandarin-Cantonese speech recognition model training method described in the embodiments of the present invention;
a speech input module, configured to input mixed Mandarin and Cantonese speech into the speech recognition model to perform hybrid speech recognition.
In a fifth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform any of the above hybrid Mandarin-Cantonese speech recognition methods of the present invention.
In a sixth aspect, an embodiment of the present invention provides a storage medium storing one or more programs comprising execution instructions, which can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above hybrid Mandarin-Cantonese speech recognition methods of the present invention.
The beneficial effects of the embodiments of the present invention are as follows: a multi-task model is first trained with mixed speech training samples of multiple languages; the network parameters of the multi-task model are then reused through transfer learning to train a hybrid Mandarin-Cantonese speech recognition model based on joint Mandarin-Cantonese modeling. This solves the problem of recognizing mixed Mandarin and Cantonese speech without major modification of the existing recognition service; existing assets can be reused, reducing both model training cost and service development cost.
Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of a prior-art scheme that uses a language classifier for mixed speech recognition;
Figure 2 is a flowchart of an embodiment of the Mandarin-Cantonese hybrid speech recognition model training method of the present invention;
Figure 3 is a schematic block diagram of an embodiment of the Mandarin-Cantonese hybrid speech recognition model training system of the present invention;
Figure 4 is a schematic block diagram of an embodiment of the Mandarin-Cantonese hybrid speech recognition system of the present invention;
Figure 5 is a schematic diagram of an embodiment of the Mandarin-Cantonese hybrid speech recognition model training method of the present invention;
Figure 6 is a schematic diagram of another embodiment of the Mandarin-Cantonese hybrid speech recognition model training method of the present invention;
Figure 7 is a flowchart of an embodiment of the Mandarin-Cantonese hybrid speech recognition method of the present invention;
Figure 8 is a schematic structural diagram of an embodiment of an electronic device of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features of those embodiments may be combined with one another.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments, where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
In the present invention, "module", "device", "system", and the like refer to computer-related entities: hardware, a combination of hardware and software, software, or software in execution. In detail, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be a component. One or more components may reside within a process and/or thread of execution; a component may be localized on one computer and/or distributed between two or more computers, and may execute from various computer-readable media. Components may also communicate via local and/or remote processes by means of signals carrying one or more data packets, for example data from one component interacting with another component in a local or distributed system, or interacting with other systems across a network such as the Internet.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise" and "include" cover not only the listed elements but also other elements not expressly listed, as well as elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Figure 2 is a flowchart of an embodiment of the Mandarin-Cantonese hybrid speech recognition model training method of the present invention. In this embodiment, the method comprises:
S10. Train a multi-task model with mixed speech training samples of N languages, the multi-task model comprising a plurality of shared network layers and N task neural network layers, one per language, connected to the last of the shared network layers;
S20. Migrate the network parameters of the plurality of shared network layers to the speech recognition model to be trained, so as to complete the training of that model.
In this embodiment, a multi-task model is first trained with mixed speech training samples of multiple languages; the network parameters of the multi-task model are then reused through transfer learning to train a hybrid Mandarin-Cantonese speech recognition model based on joint Mandarin-Cantonese modeling. This solves the problem of recognizing mixed Mandarin and Cantonese speech without major modification of the existing recognition service; existing assets can be reused, reducing both model training cost and service development cost.
With the multi-language multi-task training of this embodiment, what the languages have in common is learned through the parameters of the shared network layers, while what is specific to each language is learned through the output of its task-specific layer (the task neural network layer). The trained network therefore covers the pronunciation of a wide range of phonemes, and the model is more robust. Through transfer learning, the already-trained model parameters are migrated to a new model to assist its training; the parameters already learned are thus shared with the new model, accelerating and improving its learning. Integrating the modeling units of Mandarin and Cantonese, and sharing coarser-grained modeling units, avoids under-training of model parameters for characters that appear too rarely in the training text.
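The parameter migration described above can be pictured with a minimal sketch (illustrative only; the dict layout, layer sizes, and all names are our assumptions, not the patent's): the shared-layer weights of the trained multi-task model are copied into a fresh model, while the new model's output head for the merged Mandarin-Cantonese unit set stays freshly initialized.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_model(layer_sizes, n_out, rng):
    """A model is a dict: 'shared' is a list of trunk weight matrices, 'head' one output matrix."""
    shared = [rng.standard_normal((m, n)) * 0.1
              for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
    head = rng.standard_normal((layer_sizes[-1], n_out)) * 0.1
    return {"shared": shared, "head": head}

# Multi-task model trained on N languages (the training itself is omitted here).
multitask = init_model([40, 256, 256], n_out=100, rng=rng)

# New Mandarin-Cantonese model: "migrate" the shared layers,
# keep a freshly initialized output head for the merged unit set.
new_model = init_model([40, 256, 256], n_out=150, rng=rng)
new_model["shared"] = [w.copy() for w in multitask["shared"]]

# The transferred layers are identical; only the task head differs.
assert all(np.array_equal(a, b)
           for a, b in zip(new_model["shared"], multitask["shared"]))
```

In a deep-learning framework this is the usual "load a checkpoint, keep only the trunk" pattern; the copied trunk can then be fine-tuned or frozen during the new model's training.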
In some embodiments, training the multi-task model with mixed speech training samples of N languages comprises:
training the network parameters of the N task neural network layers based on the N loss functions corresponding to those layers;
jointly training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers.
In some embodiments, the multi-task model further comprises a language classification network layer connected to the last of the plurality of shared network layers;
and jointly training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers comprises:
jointly training the network parameters of the plurality of shared network layers based on the N loss functions corresponding to the N task neural network layers and the loss function corresponding to the language classification network layer.
In some embodiments, this joint training comprises: training the network parameters of the plurality of shared network layers based on a weighted sum of the N loss functions corresponding to the N task neural network layers and the loss function corresponding to the language classification network layer.
In some embodiments, the present invention further provides a hybrid Mandarin-Cantonese speech recognition method, comprising: inputting mixed Mandarin and dialect speech into a speech recognition model trained with the hybrid Mandarin-Cantonese speech recognition model training method of any embodiment of the present invention, to perform hybrid speech recognition.
Figure 3 is a schematic block diagram of an embodiment of the Mandarin-Cantonese hybrid speech recognition model training system of the present invention. The system 300 comprises:
a multi-task model training module 310, configured to train a multi-task model with mixed speech training samples of N languages, the multi-task model comprising a plurality of shared network layers and N task neural network layers, one per language, connected to the last of the shared network layers;
a speech recognition model training module 320, configured to migrate the network parameters of the plurality of shared network layers to a speech recognition model to be trained, so as to complete the training of that model.
In this embodiment, a multi-task model is first trained with mixed speech training samples of multiple languages; the network parameters of the multi-task model are then reused through transfer learning to train a hybrid Mandarin-Cantonese speech recognition model based on joint Mandarin-Cantonese modeling. This solves the problem of recognizing mixed Mandarin and Cantonese speech without major modification of the existing recognition service; existing assets can be reused, reducing both model training cost and service development cost.
In some embodiments, training the multi-task model with mixed speech training samples of N languages comprises:
training the network parameters of the N task neural network layers based on the N loss functions corresponding to those layers;
jointly training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers.
In some embodiments, the multi-task model further comprises a language classification network layer connected to the last of the plurality of shared network layers;
and jointly training the network parameters of the plurality of shared network layers based at least on the N loss functions corresponding to the N task neural network layers comprises:
jointly training the network parameters of the plurality of shared network layers based on the N loss functions corresponding to the N task neural network layers and the loss function corresponding to the language classification network layer.
In some embodiments, this joint training comprises: training the network parameters of the plurality of shared network layers based on a weighted sum of the N loss functions corresponding to the N task neural network layers and the loss function corresponding to the language classification network layer.
Figure 4 is a schematic block diagram of an embodiment of the Mandarin-Cantonese hybrid speech recognition system of the present invention. The system 400 comprises:
a speech recognition model 410, trained with the hybrid Mandarin-Cantonese speech recognition model training method of any embodiment of the present invention;
a speech input module 420, configured to input mixed Mandarin and Cantonese speech into the speech recognition model to perform hybrid speech recognition.
To show more intuitively the technical contribution of the present invention relative to the prior art, a detailed description is given below with reference to specific embodiments.
The procedure mainly includes three steps:
(1) Data preparation
On the data side, the audio data and text data of Mandarin and Cantonese are mixed respectively, the dictionaries are merged, and the modeling units are compiled according to the modeling-unit organization method described above. Since dialect data are relatively scarce, the data are augmented through signal processing, real-device transcription, web crawling, and other means;
Features are extracted from the audio using FBANK features: the audio is split into frames with a window of 25 ms frame length and 10 ms frame shift, and a 40-dimensional FBANK feature is extracted from each frame to train the neural network;
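As a rough illustration of this step (not the patent's code; the helper names are ours, and a production system would normally use an established feature extractor), a 25 ms / 10 ms / 40-bin FBANK-style log-mel extraction can be sketched in NumPy:

```python
import numpy as np

def frame_signal(x, sr=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames: 25 ms window, 10 ms shift."""
    flen, fshift = int(sr * frame_ms / 1000), int(sr * shift_ms / 1000)
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    return np.stack([x[i * fshift: i * fshift + flen] for i in range(n_frames)])

def mel_filterbank(n_bins=40, n_fft=512, sr=16000):
    """Triangular mel filters mapping an FFT power spectrum to n_bins mel bands."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2), n_bins + 2)
    bins = np.floor((n_fft // 2 + 1) * mel2hz(mel_pts) / (sr / 2)).astype(int)
    fb = np.zeros((n_bins, n_fft // 2 + 1))
    for i in range(n_bins):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def fbank(x, sr=16000, n_bins=40, n_fft=512):
    frames = frame_signal(x, sr) * np.hamming(int(sr * 0.025))
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(power @ mel_filterbank(n_bins, n_fft, sr).T + 1e-10)

# One second of 16 kHz noise as a stand-in for real audio.
feats = fbank(np.random.default_rng(0).standard_normal(16000))
print(feats.shape)
```

For one second of 16 kHz audio this yields a (98, 40) matrix: one 40-dimensional log-mel vector per 10 ms frame, matching the frame-level features described in the text.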
Since N languages are used to train the neural network, the features and labeled text of each of the N languages are prepared separately; the features of the N languages are then merged and randomly shuffled to ensure randomness in the training model's input features.
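The merge-and-shuffle step might look as follows (a toy sketch with made-up sizes, not the patent's implementation); the essential point is that features and labels are permuted with the same index so the pairs stay aligned:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy per-language feature/label lists (three "languages" with differing frame counts).
feats_per_lang = [rng.standard_normal((n, 40)) for n in (5, 7, 4)]
labels_per_lang = [np.full(n, lang_id) for lang_id, n in enumerate((5, 7, 4), start=1)]

# Merge all languages, then shuffle with one shared permutation.
feats = np.concatenate(feats_per_lang)
labels = np.concatenate(labels_per_lang)
perm = rng.permutation(len(feats))
feats, labels = feats[perm], labels[perm]

# Pairs stay aligned and every language is still represented.
print(sorted(set(labels.tolist())))  # [1, 2, 3]
```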
(2)多语言多任务训练(2) Multi-language multi-task training
我们采用N种语言来进行多任务训练,采用有监督的多任务训练的方式,训练准则可以采用帧级别的CE(Cross Entropy)损失函数或序列级别的CTC(ConnectionistTemporalClassification)损失函数来训练模型参数,按照图5示例的网络结构,每一个output代表一种语言的专项训练任务,其输出对应的也是该语言的loss输出,所以当神经网络有N个语言训练任务时,同时就会有N个语言的loss输出和一个Language Id的loss输出。We use N languages for multi-task training and supervised multi-task training. The training criteria can use frame-level CE (Cross Entropy) loss function or sequence-level CTC (ConnectionistTemporalClassification) loss function to train model parameters, According to the network structure of the example in Figure 5, each output represents a special training task for a language, and its output corresponds to the loss output of the language, so when the neural network has N language training tasks, there will be N languages at the same time. The loss output and a Language Id loss output.
如图5所示为本发明的普通话和粤语混合语音识别模型训练方法的实施例的示意图。其中,NN(Nerual Network)layer表示一层神经网络层,可以是常用的DNN(DeepNeuralNetworks)、LSTM(Long Short-Term Memory)、FSMN(Feedforward SequentialMemory Networks)等。FIG. 5 is a schematic diagram of an embodiment of the method for training a Mandarin and Cantonese mixed speech recognition model of the present invention. Among them, NN (Nerual Network) layer represents a layer of neural network layer, which can be commonly used DNN (Deep Neural Networks), LSTM (Long Short-Term Memory), FSMN (Feedforward Sequential Memory Networks) and so on.
同时,我们会引入每一种语言的标签Language ID(以下简称LID)进来,采用CE准则进行训练,目标是最小化每一帧的语言种类判别错误,降低语言domain的分类错误,增加网络对多种不同语言的鲁棒性,这样可以减少各种语言数据不均衡带来的影响,同时可以加快神经网络的收敛速度。At the same time, we will introduce the label Language ID (hereinafter referred to as LID) of each language, and use the CE criterion for training. The goal is to minimize the language type discrimination error in each frame, reduce the language domain classification error, and increase the network to many The robustness of different languages can reduce the impact of imbalanced data in various languages, and at the same time, it can speed up the convergence speed of the neural network.
Multilingual multi-task training process:
(a) First, randomly shuffle the feature input data of the N languages. Depending on the training criterion and modeling unit, prepare a corresponding label for each language; here we model frame-level phonemes, obtaining frame-level phoneme labels for each of the N languages. All labels are then merged and arranged in the input order of the training features.
The LID labels are obtained as follows:
Using the prior knowledge of each sample's language category, each language is encoded with a distinct number. For example, every sample of language 1 is assigned the label 1, repeated to the length of its feature data; every sample of language 2 is assigned the label 2 in the same way; and so on, yielding the LID labels of every language, which are then arranged in the input order of the training features.
At this point, we have prepared the feature input data, the phoneme labels of each language, and the LID labels of each language.
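The LID label construction of step (a) can be sketched as follows. This is a minimal NumPy illustration; the array layout and the numeric language codes are assumptions for the sketch, not the patent's actual data format:

```python
import numpy as np

def make_lid_labels(feature_lengths, lang_ids):
    """Build frame-level Language ID labels.

    feature_lengths: number of feature frames in each utterance
    lang_ids: language code (1, 2, ...) of each utterance
    Returns one LID label per frame, in utterance order.
    """
    return np.concatenate(
        [np.full(n_frames, lid, dtype=np.int64)
         for n_frames, lid in zip(feature_lengths, lang_ids)]
    )

# Two utterances of language 1 (3 and 2 frames) and one of language 2 (4 frames).
labels = make_lid_labels([3, 2, 4], [1, 1, 2])
print(labels.tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 2]
```

Each utterance's LID is simply repeated to the length of its feature data, matching the per-language encoding described above.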
(b) We then multiply each loss by its own scaling factor (αi for the language tasks, β for the LID task), chosen according to the specific task, so that the final parameter updates are not biased towards a few individual tasks:
Loss = α1·loss1(output1) + α2·loss2(output2) + … + αN·lossN(outputN) + β·loss(Language ID)
The loss(output) of each language is used to update the parameters of its corresponding task NN layer, while all output losses together update the shared bottom-layer parameters. The shared bottom layers thereby learn the characteristics of every language, so that the trained network covers the pronunciations of a wider range of human phonemes. Moreover, because the method pools training data recorded under the varied conditions of the different languages, parameters learned from one language in a given environment also improve the recognition of the other languages in that environment; from the perspective of training-data coverage, the resulting network parameters are more robust.
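The weighted objective above can be written directly in code. A minimal sketch, where the individual loss values and scaling factors are placeholders, not values from the patent:

```python
def total_loss(task_losses, lid_loss, alphas, beta):
    """Weighted multi-task loss:
    Loss = sum_i alpha_i * loss_i(output_i) + beta * loss(Language ID).
    Each per-task loss updates its own task NN layer; this combined
    total updates the shared bottom layers."""
    assert len(task_losses) == len(alphas)
    return sum(a * l for a, l in zip(alphas, task_losses)) + beta * lid_loss

# Three language tasks plus the LID task (all values illustrative).
loss = total_loss([0.9, 1.2, 0.7], 0.4, alphas=[1.0, 0.5, 1.0], beta=0.2)
print(round(loss, 2))  # 2.28
```

Down-weighting a task (here α2 = 0.5) is one way to keep an over-represented language from dominating the shared-layer updates.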
(3) Transfer learning initialization training
FIG. 6 is a schematic diagram of another embodiment of the Mandarin and Cantonese hybrid speech recognition model training method of the present invention.
The shared-layer parameters of the multilingual multi-task network described above are taken as the initialization model for hybrid Mandarin and Cantonese recognition: by way of transfer learning, the already trained parameters are transferred to the new model to assist its training. The model parameters already learned from multiple languages are thus shared with the new Mandarin and Cantonese (dialect) model, accelerating and improving its learning.
Mandarin and Cantonese hybrid training process:
(a) Data preparation: different modeling units may be used depending on the task. Here we prepare the Mandarin and Cantonese character modeling units using the method of section 2.3 above, mix and randomly shuffle the feature data of the two languages, and arrange the prepared character labels of the two languages in the order of the feature data.
(b) We update the model parameters with the CTC training criterion. As in conventional training, there is only a single network output, so the final trained model supports simultaneous recognition of Mandarin and Cantonese.
We merge the dictionaries of the two languages and merge their corpus texts to train the language model, so that during decoding the acoustic model and the language model recognize both languages simultaneously, and the hybrid recognition system loses no performance compared with the previous single-language systems.
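The dictionary merge before language model training can be sketched as follows. This is an assumed lexicon format (word mapped to a list of pronunciation strings); the entries and the union-of-pronunciations policy are illustrative, not specified by the patent:

```python
def merge_lexicons(mandarin_lex, cantonese_lex):
    """Merge two pronunciation dictionaries. Where the same word appears
    in both, keep the union of its pronunciations."""
    merged = {w: set(p) for w, p in mandarin_lex.items()}
    for w, prons in cantonese_lex.items():
        merged.setdefault(w, set()).update(prons)
    return {w: sorted(p) for w, p in merged.items()}

# Illustrative (made-up) phone strings for a Mandarin and a Cantonese reading.
lex = merge_lexicons({"你好": ["n i3 h ao3"]},
                     {"你好": ["n ei5 h ou2"], "唔该": ["m4 g oi1"]})
print(sorted(lex))   # ['你好', '唔该']
print(lex["你好"])    # ['n ei5 h ou2', 'n i3 h ao3']
```

The merged corpus texts would then be used to train a single language model over this combined vocabulary.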
The present invention trains N languages (N ≥ 3; e.g. Mandarin, Sichuanese, Cantonese, Shanghainese, and foreign languages) by multilingual multi-task learning. The commonality between languages is learned by the multi-layer shared model parameters, while the individuality of each language is learned by its own output layer; the trained network therefore covers the pronunciations of a wide variety of phonemes, and the model is more robust.
By way of transfer learning, the trained model parameters are transferred to a new model to assist its training, so that the learned parameters are shared with the new model, accelerating and improving its learning.
Integrating the Mandarin and Cantonese modeling units: we model at the character level. From the Mandarin and Cantonese dictionaries respectively, we count the corpus frequency of each character under the same phonetic pronunciation. Characters whose frequency exceeds a threshold are each modeled as their own unit, while characters below the threshold are modeled jointly with a high-frequency character of the same pronunciation. This yields character-level modeling units for Mandarin and for Cantonese; identical character units of the two sets are then merged and shared, and differing ones are kept separately, producing a single set of mixed Mandarin and Cantonese modeling units. Sharing coarse-grained modeling units in this way avoids insufficient training of model parameters for characters that occur too rarely in the training text.
A model trained by this method satisfies the requirement of recognizing both languages simultaneously, requires no change to the core engineering framework, and its performance matches or even exceeds that of a single-language recognition system.
The present invention proposes a multilingual multi-task training procedure and a method for mixing Mandarin and Cantonese modeling units. It not only solves the existing hybrid Mandarin and Cantonese recognition problem, but can also realize hybrid recognition of arbitrary mixtures of multiple languages; no major modification of the existing recognition service is required, existing work can be reused, and both model training cost and service development cost are greatly reduced.
It should be noted that the present invention is not obvious or readily available to those skilled in the art; in fact, in the course of realizing the present invention, the inventors tried at least the following earlier schemes:
Decoding the same audio with both the Mandarin and the Cantonese recognition resources, then determining the final text result from the confidence or semantic analysis of the two recognition results. This method requires no complex model training and is simple to implement, needing only some post-processing at the recognition back end; however, it is not general enough, and the engineering cost and resource waste remain high.
In acoustic model training, directly merging the original modeling units of the two languages and then training. This is the most obvious and easiest approach, but because it does not share modeling units across languages, it leads to a considerable loss of final performance.
In fact, compared with the prior art and the various schemes the inventors tried during the invention process, the present invention has at least the following beneficial effects:
(1) The multilingual multi-task training scheme lets the commonality between languages be learned by the multi-layer shared model parameters and the individuality of each language be learned by its own output layer; the trained network covers the pronunciations of a wide variety of phonemes, and the model is more robust.
(2) By way of transfer learning, the trained model parameters are transferred to a new model to assist its training; the learned parameters are shared with the new model, accelerating and improving its learning.
(3) Integrating the Mandarin and Cantonese modeling units: by sharing coarse-grained modeling units, insufficient training of model parameters for characters that occur too rarely in the training text is avoided.
As shown in FIG. 7, an embodiment of the present invention provides a hybrid Mandarin and Cantonese speech recognition method, comprising:
S71. Training a multi-task model with mixed speech training samples of N languages to obtain the parameter values of the multi-task model. The multi-task model has several language-shared network layers (shared layers 1 to n) and N+1 parallel task-specific layers connected to the deepest language-shared layer; both the language-shared layers and the task-specific layers are neural network layers, and N ≥ 3.
S72. Migrating the parameter values of the language-shared network layers to a Mandarin-Cantonese speech recognition model. The Mandarin-Cantonese speech recognition model has the same language-shared network layers as the multi-task training model plus one Mandarin-Cantonese recognition task-specific layer, which is a neural network layer connected to the deepest language-shared layer.
S73. Modeling Mandarin and Cantonese jointly, and training the Mandarin-Cantonese speech recognition model.
S74. Recognizing mixed Mandarin and Cantonese speech with the trained Mandarin-Cantonese speech recognition model.
The multi-task training model has several language-shared network layers, whose depth (number of layers) can be set as needed. The first shared layer receives the input data, which comprises feature data extracted from multiple speech training samples of the N languages, the phoneme labels of each language, and the language label (Language ID, hereinafter LID) of each speech training sample.
In this embodiment, the N languages may be Mandarin, Sichuanese, Cantonese, Shanghainese, and foreign languages. Audio is obtained for each of the N languages, with multiple recordings per language, together with the corresponding annotation text. The audio is split into frames with a 25 ms window and a 10 ms shift, and 40-dimensional FBANK features are extracted from each frame. The features extracted from the audio of all N languages are merged and randomly shuffled to form the training features corresponding to the mixed training samples, ensuring the randomness of the input features during model training.
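The framing described here (25 ms window, 10 ms shift) can be sketched with NumPy. The 16 kHz sample rate is an assumption for the sketch, and the mel filterbank step that would turn each frame into 40-dimensional FBANK features is omitted for brevity:

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames (25 ms window, 10 ms shift).
    Assumes len(samples) >= one frame length. FBANK extraction (a mel
    filterbank applied to each windowed frame) would follow."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + (len(samples) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return samples[idx]

frames = frame_signal(np.zeros(16000))  # 1 second of audio
print(frames.shape)  # (98, 400)
```

One second of 16 kHz audio yields 98 overlapping frames of 400 samples each; after filterbank analysis, this would become a (98, 40) FBANK feature matrix.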
A phoneme is the smallest phonetic unit of a language, distinguished by the natural properties of speech. In this embodiment, for each language we use frame-level phonemes and model them with any conventional modeling function in the field, obtaining that language's frame-level phoneme labels, which are then merged. The phoneme labels of the N languages are arranged in the input order of the training features.
For each speech sample, the language it belongs to is known in advance, i.e. the prior information of each language category is available, so the samples are encoded with distinct numbers per language, with samples of the same language receiving the same LID. For example, every sample of language 1 has its LID set to 1, repeated to the length of its feature data; every sample of language 2 has its LID set to 2 in the same way; and so on, so that an LID can be set for every speech sample, arranged in the input order of the training features.
The multi-task training model has several language-shared network layers (shared layers 1 to n), whose depth can be set according to requirements and computing resources. The layers of the language-shared network may all use the same neural network structure, or different structures.
As shown in FIG. 5, the first shared layer (which may be a neural network layer) receives the input data; its computed output feeds the second shared layer, whose output in turn feeds the next shared layer, and so on, layer by layer, down to the deepest shared layer.
With the language-shared network layers, what is shared between different languages can be learned by those layers.
After the computation of the deepest language-shared layer, its output data is divided, according to the prior language-category information, into N subsets, each subset corresponding to the output data of one language.
N+1 parallel task-specific layers are connected to the deepest language-shared layer; each task-specific layer is a neural network layer. Each of the first N task-specific layers is trained on one subset and produces one output, denoted Output_i (1 ≤ i ≤ N); the (N+1)-th task-specific layer is trained on the output data of the deepest language-shared layer, and its output is the recognized LID.
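The topology of shared layers plus N+1 task-specific heads can be sketched as a tiny NumPy forward pass. All layer sizes, the number of languages, and the plain matrix-multiply layers are illustrative assumptions; the patent allows DNN, LSTM, or FSMN layers here:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(dim_in, dim_out):
    # A toy dense layer: just a weight matrix (biases omitted for brevity).
    return rng.standard_normal((dim_in, dim_out)) * 0.1

N = 3                                        # number of language tasks
shared = [layer(40, 64), layer(64, 64)]      # language-shared bottom layers
heads = [layer(64, 100) for _ in range(N)]   # one phoneme head per language
lid_head = layer(64, N)                      # the (N+1)-th head predicts the LID

x = rng.standard_normal((5, 40))             # 5 frames of 40-dim FBANK features
h = x
for w in shared:                             # forward through the shared stack
    h = np.tanh(h @ w)
outputs = [h @ w for w in heads]             # Output_1 ... Output_N
lid_out = h @ lid_head
print(outputs[0].shape, lid_out.shape)       # (5, 100) (5, 3)
```

Every frame flows through the same shared stack; only the head it is scored by differs per task, which is what lets the shared parameters absorb cross-language regularities.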
Each task-specific layer is a neural network layer; the task-specific layers may all use the same neural network structure or different structures, and may either match the structure of one of the language-shared layers or differ from all of them.
With the task-specific layers, the individuality of each language can be learned by its own dedicated layer, so the trained network covers the pronunciations of a wide variety of phonemes and is more robust.
Both the language-shared layers and the task-specific layers are trained in a supervised manner; the training criterion may be a frame-level CE (Cross Entropy) loss function or a sequence-level CTC (Connectionist Temporal Classification) loss function. Each output Output_i corresponds to the dedicated training task of one language, and the loss output for that language is computed from Output_i. In this embodiment there are N language-specific training tasks, producing N language loss outputs and one LID loss output, from which the total loss can be computed.
To illustrate the LID loss output: suppose the input mixed training samples comprise English, Sichuanese, and Mandarin, with LIDs 1, 2, and 3 respectively. Because the (N+1)-th task-specific layer is not perfectly accurate, it may recognize the LIDs of these samples as 1, 2, and 4; the LID loss output is then computed from this recognition result.
In this embodiment, the (N+1)-th task-specific layer is trained with the CE criterion. The goal is to minimize the per-frame language classification error, reducing language-domain confusion and increasing the network's robustness to multiple languages; this mitigates the impact of data imbalance across languages and speeds up the convergence of the neural network.
In this embodiment, scaling factors αi and β are applied to the losses according to the specific tasks, so that the parameter updates of the network layers are not biased towards a few individual tasks.
The total loss is computed as:
Loss = α1·loss1(output1) + α2·loss2(output2) + … + αN·lossN(outputN) + β·loss(LID)
The loss output of each language is used to iteratively update the parameters of that language's task-specific layer, while the total loss is used to iteratively update the parameters of the language-shared network layers. Learning in this way lets the language-shared layers acquire the characteristics of every language, so that they cover the pronunciations of a wider range of human phonemes. Moreover, because the training sample sets of the different languages are recorded under varied conditions, the parameters learned from one language in a given environment also improve the recognition of the other languages in that environment, making the network parameters more robust.
In S72, the parameter values of the language-shared network layers are migrated to the Mandarin-Cantonese speech recognition model. The Mandarin-Cantonese speech recognition model has the same language-shared network layers as the multi-task training model plus one Mandarin-Cantonese recognition task-specific layer, which is a neural network layer connected to the deepest language-shared layer.
The Mandarin-Cantonese speech recognition model recognizes mixed Mandarin and Cantonese speech. As shown in FIG. 4, the model has the same language-shared network layers as the multi-task training model and additionally one Mandarin-Cantonese recognition task-specific layer, a neural network layer connected to the deepest language-shared layer. The number of language-shared layers and the structure of each layer are identical to those of the multi-task training model.
The Mandarin-Cantonese task-specific layer may have the same structure as one of the language-shared layers, or a structure different from all of them.
The parameter values of the language-shared layers of the trained multi-task training model are migrated to the Mandarin-Cantonese speech recognition model as the initialization parameters of its language-shared layers. Transferring the already learned parameters in this way accelerates and improves the learning of the Mandarin-Cantonese speech recognition model.
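The parameter migration can be sketched as copying the shared-layer weights into a fresh model while initializing the single new task head from scratch. The dictionary keys, shapes, and initialization scale are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameters of the trained multi-task model: shared layers + a task head.
multitask_params = {
    "shared.0": rng.standard_normal((40, 64)),
    "shared.1": rng.standard_normal((64, 64)),
    "head.mandarin": rng.standard_normal((64, 100)),
}

def init_from_multitask(source, new_head_dim=200):
    """Initialize the Mandarin+Cantonese model: copy the shared-layer
    parameters from the multi-task model, discard its task heads, and
    start the single new task head from small random values."""
    params = {k: v.copy() for k, v in source.items() if k.startswith("shared.")}
    params["head.zh_yue"] = rng.standard_normal((64, new_head_dim)) * 0.01
    return params

new_model = init_from_multitask(multitask_params)
print(sorted(new_model))  # ['head.zh_yue', 'shared.0', 'shared.1']
```

After this initialization, all parameters (shared layers included) would continue training on the mixed Mandarin and Cantonese data.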
S73. Modeling Mandarin and Cantonese jointly and training the Mandarin-Cantonese speech recognition model, comprising:
a. Data preparation: obtaining Mandarin and Cantonese audio and the corresponding annotation text; splitting the audio into frames with a 25 ms window and a 10 ms shift, and extracting 40-dimensional FBANK features from each frame; mixing the Mandarin and Cantonese features and randomly shuffling their order. Further, if the dialect data among the training samples is scarce, the data can be augmented by means such as signal processing, real-device transcription, and web crawling.
b. Integrating the Mandarin and Cantonese modeling units, then modeling the integrated units to obtain the Mandarin and Cantonese character labels, comprising:
Counting, from the Mandarin and Cantonese dictionaries respectively, the corpus frequency of each character sharing the same pronunciation; taking characters whose frequency exceeds a preset threshold as high-frequency characters, each modeled as its own unit, and replacing characters below the threshold with a high-frequency character of the same pronunciation; thereby obtaining the Mandarin character modeling units and the Cantonese character modeling units.
For example, in the Mandarin dictionary, 谋 and 眸 have the same pronunciation; the frequency of 谋 exceeds the preset threshold while that of 眸 falls below it, so 眸 is replaced by 谋, and 谋 is taken as a modeling unit.
Merging those modeling units of the Mandarin set and the Cantonese set that have the same pronunciation, and keeping those with different pronunciations separately.
For example, given the Mandarin modeling unit 谋 and the Cantonese modeling unit 没, both pronounced móu, the two units are merged. This yields a single set of mixed Mandarin and Cantonese modeling units. The resulting coarser-grained units avoid insufficient parameter training for characters that occur too rarely in the training text.
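The frequency-threshold merging of homophones described in step b can be sketched in a few lines. The lexicon format, the toy corpus, and the tone-numbered pronunciation string are illustrative assumptions:

```python
from collections import Counter

def build_units(lexicon, text, threshold):
    """Map each character to its modeling unit: characters whose corpus
    frequency is >= threshold model themselves; rarer characters are
    re-mapped to the most frequent character sharing their pronunciation."""
    freq = Counter(text)
    by_pron = {}
    for ch, pron in lexicon.items():
        by_pron.setdefault(pron, []).append(ch)
    unit_of = {}
    for pron, chars in by_pron.items():
        top = max(chars, key=lambda c: freq[c])   # highest-frequency homophone
        for ch in chars:
            unit_of[ch] = ch if freq[ch] >= threshold else top
    return unit_of

# 谋 (frequent) and 眸 (rare) share a pronunciation, here written "mou2".
units = build_units({"谋": "mou2", "眸": "mou2"}, "谋谋谋眸", threshold=2)
print(units)  # {'谋': '谋', '眸': '谋'}
```

Running this per language, then merging the two resulting unit sets on shared pronunciations, would produce the single mixed Mandarin-Cantonese unit inventory described above.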
The integrated Mandarin and Cantonese modeling units are modeled with any conventional modeling function in the field to obtain the Mandarin and Cantonese character labels, which are then arranged in the input order of the training features.
c. Taking the randomly shuffled FBANK features and the Mandarin and Cantonese character labels as input, the Mandarin-Cantonese speech recognition model is trained under the CTC training criterion, yielding the trained Mandarin-Cantonese speech recognition model.
In this embodiment, the model parameters are updated iteratively under the CTC criterion. There is only one network output, and the final trained Mandarin-Cantonese speech recognition model supports the recognition of both Mandarin and Cantonese.
A model trained with this method satisfies the requirement of recognizing Mandarin and Cantonese, requires no change to the core engineering framework, and its performance matches or even exceeds that of a single-language recognition system.
S74. Recognizing mixed Mandarin and Cantonese speech with the trained Mandarin-Cantonese speech recognition model, comprising:
inputting mixed Mandarin and Cantonese audio, recognizing the mixed speech with the trained Mandarin-Cantonese speech recognition model, and outputting the corresponding text.
Further, the trained Mandarin-Cantonese speech recognition model is an acoustic model that recognizes sound. To improve the accuracy of hybrid recognition, a language model is additionally trained: the Mandarin and Cantonese dictionaries are merged, and the training corpus texts are merged, for language model training. When decoding mixed Mandarin and Cantonese audio, the acoustic model and the language model recognize both languages simultaneously, so the hybrid recognition system loses no performance compared with a single-language system.
Compared with the scheme of decoding the same audio with separate Mandarin and Cantonese recognition resources and then determining the final text from the confidence or semantic analysis of the two results, the scheme of this embodiment has lower engineering cost and wastes no resources.
Compared with the scheme of directly merging the original modeling units of the two languages and training to obtain an acoustic model, the scheme of this embodiment suffers less performance loss.
It should be noted that, for simplicity of description, the foregoing method embodiments are each expressed as a series of combined actions, but those skilled in the art will appreciate that the present invention is not limited by the described order of actions, since according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art will further appreciate that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In some embodiments, an embodiment of the present invention further provides a storage medium on which a computer program is stored, wherein, when executed by a processor, the program performs the steps of the hybrid Mandarin and Cantonese speech recognition method.
In some embodiments, an embodiment of the present invention further provides an electronic device comprising at least one processor and a memory communicatively connected to the at least one processor, the memory storing instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the hybrid Mandarin and Cantonese speech recognition method.
The hybrid Mandarin and Cantonese speech recognition apparatus of the above embodiments of the present invention can be used to perform the hybrid Mandarin and Cantonese speech recognition method of the embodiments of the present invention, and accordingly achieves the technical effects achieved by that method, which are not repeated here. In the embodiments of the present invention, the relevant functional modules may be implemented by a hardware processor.
FIG. 8 is a schematic diagram of the hardware structure of an electronic device for performing the Mandarin and Cantonese mixed speech recognition method provided by another embodiment of the present application. As shown in FIG. 8, the device includes:
one or more processors 810 and a memory 820; in FIG. 8, one processor 810 is taken as an example.
The device for performing the Mandarin and Cantonese mixed speech recognition method may further include: an input apparatus 830 and an output apparatus 840.
The processor 810, the memory 820, the input apparatus 830, and the output apparatus 840 may be connected by a bus or in other ways; in FIG. 8, connection by a bus is taken as an example.
As a non-volatile computer-readable storage medium, the memory 820 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the Mandarin and Cantonese mixed speech recognition method in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 820, the processor 810 executes the various functional applications and data processing of the server, that is, implements the voice service method of the above method embodiments.
The memory 820 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created according to the use of the voice service apparatus, and the like. In addition, the memory 820 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the memory 820 may optionally include memories remotely located relative to the processor 810, and these remote memories may be connected to the voice service apparatus through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input apparatus 830 may receive input numeric or character information, and generate signals related to user settings and function control of the voice service apparatus. The output apparatus 840 may include a display device such as a display screen.
The one or more modules are stored in the memory 820 and, when executed by the one or more processors 810, perform the Mandarin and Cantonese mixed speech recognition method of any of the above method embodiments.
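The deployment pattern just described — program modules persisted on a non-volatile medium, loaded into memory, and executed by a processor to carry out the recognition method — can be sketched as follows. This is an illustrative sketch only: the names `MixedRecognizer` and `load_from_medium` are hypothetical and not from the patent, and the recognizer body is a stub rather than the actual Mandarin/Cantonese model.

```python
import json
import os
import tempfile


class MixedRecognizer:
    """Hypothetical stand-in for the stored recognition module.

    A real module would hold the trained acoustic/language model; this
    stub only records which languages the model is configured for.
    """

    def __init__(self, config):
        self.languages = config["languages"]

    def transcribe(self, audio_frames):
        # A real implementation would run the trained model here; the stub
        # just reports the configured languages and the input length.
        return {"languages": self.languages, "n_frames": len(audio_frames)}


def load_from_medium(path):
    """Load a stored 'program module' (here: a JSON config) into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return MixedRecognizer(json.load(f))


if __name__ == "__main__":
    # Persist a module to a temporary non-volatile medium, then reload
    # and execute it, mirroring the memory-820 / processor-810 pattern.
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"languages": ["Mandarin", "Cantonese"]}, f)
    recognizer = load_from_medium(path)
    print(recognizer.transcribe([0.0] * 160))
    os.remove(path)
```

The same load-then-execute structure applies whether the medium is local flash storage or a remote memory reached over a network, as the embodiment above allows.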
The above product can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects for executing the method. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.
The electronic devices of the embodiments of the present application exist in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capability, with voice and data communication as their primary goal. Such terminals include smartphones (e.g., iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, as well as smart toys and portable in-vehicle navigation devices.
(4) Servers: devices that provide computing services. A server comprises a processor, a hard disk, memory, a system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, and manageability.
(5) Other electronic apparatuses with data interaction functions.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, or alternatively by hardware alone. Based on this understanding, the above technical solutions, in essence or in the part contributing to the related art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the various embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that modifications may still be made to the technical solutions recorded in the foregoing embodiments, or equivalent replacements may be made to some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (12)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010737658.6A CN111816160A (en) | 2020-07-28 | 2020-07-28 | Mandarin and Cantonese hybrid speech recognition model training method and system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010737658.6A CN111816160A (en) | 2020-07-28 | 2020-07-28 | Mandarin and Cantonese hybrid speech recognition model training method and system |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111816160A true CN111816160A (en) | 2020-10-23 |
Family
ID=72864240
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010737658.6A Withdrawn CN111816160A (en) | 2020-07-28 | 2020-07-28 | Mandarin and Cantonese hybrid speech recognition model training method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111816160A (en) |
Cited By (21)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112561056A (en) * | 2020-12-07 | 2021-03-26 | 北京百度网讯科技有限公司 | Neural network model training method and device, electronic equipment and storage medium |
| CN112614485A (en) * | 2020-12-30 | 2021-04-06 | 竹间智能科技(上海)有限公司 | Recognition model construction method, voice recognition method, electronic device, and storage medium |
| CN113012706A (en) * | 2021-02-18 | 2021-06-22 | 联想(北京)有限公司 | Data processing method and device and electronic equipment |
| CN113241064A (en) * | 2021-06-28 | 2021-08-10 | 科大讯飞股份有限公司 | Voice recognition method, voice recognition device, model training method, model training device, electronic equipment and storage medium |
| CN113327600A (en) * | 2021-06-30 | 2021-08-31 | 北京有竹居网络技术有限公司 | Training method, device and equipment of voice recognition model |
| CN113469338A (en) * | 2021-06-30 | 2021-10-01 | 平安科技(深圳)有限公司 | Model training method, model training device, terminal device, and storage medium |
| CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
| JP2022020056A (en) * | 2020-11-04 | 2022-01-31 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Speech recognition methods, devices, electronic devices and storage media |
| CN114548200A (en) * | 2020-11-10 | 2022-05-27 | 国际商业机器公司 | Multilingual intent recognition |
| CN114582329A (en) * | 2022-03-03 | 2022-06-03 | 北京有竹居网络技术有限公司 | Voice recognition method and device, computer readable medium and electronic equipment |
| CN114580393A (en) * | 2020-11-30 | 2022-06-03 | 华为技术有限公司 | Method for acquiring natural language model and related equipment |
| CN114694634A (en) * | 2020-12-25 | 2022-07-01 | 暗物智能科技(广州)有限公司 | Training method, recognition method and device for Hausa voiceprint recognition model |
| CN114694655A (en) * | 2022-03-28 | 2022-07-01 | 广东电力信息科技有限公司 | An extension method and speech recognition method for Cantonese audio |
| CN114913860A (en) * | 2022-04-27 | 2022-08-16 | 中国工商银行股份有限公司 | Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product |
| CN115810347A (en) * | 2021-09-13 | 2023-03-17 | 北京猿力未来科技有限公司 | Voice recognition method, device, storage medium and equipment |
| CN116229967A (en) * | 2023-02-21 | 2023-06-06 | 思必驰科技股份有限公司 | Speech recognition method, system, electronic device and storage medium |
| CN116486783A (en) * | 2022-01-17 | 2023-07-25 | 台湾中华电信股份有限公司 | Multilingual speech recognition system, method, and computer storage medium |
| CN116486784A (en) * | 2023-01-16 | 2023-07-25 | 科大讯飞股份有限公司 | Multilingual switching free interaction method, device and electronic equipment |
| CN116564314A (en) * | 2023-05-31 | 2023-08-08 | 平安科技(深圳)有限公司 | Chinese-Cantonese mixed speech recognition method, device, computer equipment and storage medium |
| CN117174111A (en) * | 2023-11-02 | 2023-12-05 | 浙江同花顺智能科技有限公司 | Overlapping voice detection method, device, electronic equipment and storage medium |
| CN118280372A (en) * | 2024-06-03 | 2024-07-02 | 中邮消费金融有限公司 | Dialogue assistance method, device, storage medium, and computer program product |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
| CN107481717A (en) * | 2017-08-01 | 2017-12-15 | 百度在线网络技术(北京)有限公司 | A kind of acoustic training model method and system |
| US20180053500A1 (en) * | 2016-08-22 | 2018-02-22 | Google Inc. | Multi-accent speech recognition |
| CN108682417A (en) * | 2018-05-14 | 2018-10-19 | 中国科学院自动化研究所 | Small data Speech acoustics modeling method in speech recognition |
| CN110428818A (en) * | 2019-08-09 | 2019-11-08 | 中国科学院自动化研究所 | The multilingual speech recognition modeling of low-resource, audio recognition method |
| CN110675865A (en) * | 2019-11-06 | 2020-01-10 | 百度在线网络技术(北京)有限公司 | Method and apparatus for training hybrid language recognition models |
- 2020-07-28 CN CN202010737658.6A patent/CN111816160A/en not_active Withdrawn
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017114201A1 (en) * | 2015-12-31 | 2017-07-06 | 阿里巴巴集团控股有限公司 | Method and device for executing setting operation |
| US20180053500A1 (en) * | 2016-08-22 | 2018-02-22 | Google Inc. | Multi-accent speech recognition |
| CN107481717A (en) * | 2017-08-01 | 2017-12-15 | 百度在线网络技术(北京)有限公司 | A kind of acoustic training model method and system |
| CN108682417A (en) * | 2018-05-14 | 2018-10-19 | 中国科学院自动化研究所 | Small data Speech acoustics modeling method in speech recognition |
| CN110428818A (en) * | 2019-08-09 | 2019-11-08 | 中国科学院自动化研究所 | The multilingual speech recognition modeling of low-resource, audio recognition method |
| CN110675865A (en) * | 2019-11-06 | 2020-01-10 | 百度在线网络技术(北京)有限公司 | Method and apparatus for training hybrid language recognition models |
Non-Patent Citations (1)
| Title |
|---|
| JIANGYAN YI et al.: "Language-Adversarial Transfer Learning for Low-Resource Speech Recognition", IEEE/ACM Transactions on Audio, Speech, and Language Processing * |
Cited By (30)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12033615B2 (en) | 2020-11-04 | 2024-07-09 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for recognizing speech, electronic device and storage medium |
| JP7268113B2 (en) | 2020-11-04 | 2023-05-02 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Speech recognition method, device, electronic device and storage medium |
| JP2022020056A (en) * | 2020-11-04 | 2022-01-31 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Speech recognition methods, devices, electronic devices and storage media |
| CN114548200A (en) * | 2020-11-10 | 2022-05-27 | 国际商业机器公司 | Multilingual intent recognition |
| CN114580393A (en) * | 2020-11-30 | 2022-06-03 | 华为技术有限公司 | Method for acquiring natural language model and related equipment |
| CN112561056A (en) * | 2020-12-07 | 2021-03-26 | 北京百度网讯科技有限公司 | Neural network model training method and device, electronic equipment and storage medium |
| CN114694634A (en) * | 2020-12-25 | 2022-07-01 | 暗物智能科技(广州)有限公司 | Training method, recognition method and device for Hausa voiceprint recognition model |
| CN112614485A (en) * | 2020-12-30 | 2021-04-06 | 竹间智能科技(上海)有限公司 | Recognition model construction method, voice recognition method, electronic device, and storage medium |
| CN113012706A (en) * | 2021-02-18 | 2021-06-22 | 联想(北京)有限公司 | Data processing method and device and electronic equipment |
| CN113571045B (en) * | 2021-06-02 | 2024-03-12 | 北京它思智能科技有限公司 | Method, system, equipment and medium for identifying Minnan language voice |
| CN113571045A (en) * | 2021-06-02 | 2021-10-29 | 北京它思智能科技有限公司 | Minnan language voice recognition method, system, equipment and medium |
| CN113241064B (en) * | 2021-06-28 | 2024-02-13 | 科大讯飞股份有限公司 | Speech recognition, model training method and device, electronic equipment and storage medium |
| CN113241064A (en) * | 2021-06-28 | 2021-08-10 | 科大讯飞股份有限公司 | Voice recognition method, voice recognition device, model training method, model training device, electronic equipment and storage medium |
| CN113469338A (en) * | 2021-06-30 | 2021-10-01 | 平安科技(深圳)有限公司 | Model training method, model training device, terminal device, and storage medium |
| CN113469338B (en) * | 2021-06-30 | 2023-10-31 | 平安科技(深圳)有限公司 | Model training method, model training device, terminal equipment and storage medium |
| CN113327600A (en) * | 2021-06-30 | 2021-08-31 | 北京有竹居网络技术有限公司 | Training method, device and equipment of voice recognition model |
| CN115810347A (en) * | 2021-09-13 | 2023-03-17 | 北京猿力未来科技有限公司 | Voice recognition method, device, storage medium and equipment |
| CN116486783A (en) * | 2022-01-17 | 2023-07-25 | 台湾中华电信股份有限公司 | Multilingual speech recognition system, method, and computer storage medium |
| CN114582329B (en) * | 2022-03-03 | 2026-01-02 | 北京有竹居网络技术有限公司 | Speech recognition methods, devices, computer-readable media and electronic devices |
| CN114582329A (en) * | 2022-03-03 | 2022-06-03 | 北京有竹居网络技术有限公司 | Voice recognition method and device, computer readable medium and electronic equipment |
| CN114694655A (en) * | 2022-03-28 | 2022-07-01 | 广东电力信息科技有限公司 | An extension method and speech recognition method for Cantonese audio |
| CN114913860A (en) * | 2022-04-27 | 2022-08-16 | 中国工商银行股份有限公司 | Voiceprint recognition method, voiceprint recognition device, computer equipment, storage medium and program product |
| CN116486784A (en) * | 2023-01-16 | 2023-07-25 | 科大讯飞股份有限公司 | Multilingual switching free interaction method, device and electronic equipment |
| CN116229967A (en) * | 2023-02-21 | 2023-06-06 | 思必驰科技股份有限公司 | Speech recognition method, system, electronic device and storage medium |
| CN116229967B (en) * | 2023-02-21 | 2025-07-11 | 思必驰科技股份有限公司 | Speech recognition method, system, electronic device and storage medium |
| CN116564314A (en) * | 2023-05-31 | 2023-08-08 | 平安科技(深圳)有限公司 | Chinese-Cantonese mixed speech recognition method, device, computer equipment and storage medium |
| CN117174111B (en) * | 2023-11-02 | 2024-01-30 | 浙江同花顺智能科技有限公司 | Overlapping voice detection method, device, electronic equipment and storage medium |
| CN117174111A (en) * | 2023-11-02 | 2023-12-05 | 浙江同花顺智能科技有限公司 | Overlapping voice detection method, device, electronic equipment and storage medium |
| CN118280372A (en) * | 2024-06-03 | 2024-07-02 | 中邮消费金融有限公司 | Dialogue assistance method, device, storage medium, and computer program product |
| CN118280372B (en) * | 2024-06-03 | 2024-08-06 | 中邮消费金融有限公司 | Dialogue assistance method, device, storage medium, and computer program product |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN111816160A (en) | Mandarin and Cantonese hybrid speech recognition model training method and system | |
| JP7247441B2 (en) | Semantic representation model processing method, device, electronic device, and storage medium | |
| CN108847241B (en) | Method for recognizing conference voice as text, electronic device and storage medium | |
| CN114970522B (en) | Pre-training methods, devices, equipment and storage media for language models | |
| CN107924483B (en) | Generation and application of generic hypothesis ranking model | |
| CN111090727B (en) | Language conversion processing method and device and dialect voice interaction system | |
| CN104143327B (en) | An acoustic model training method and device | |
| CN110930980B (en) | Acoustic recognition method and system for Chinese and English mixed voice | |
| CN108615525B (en) | Voice recognition method and device | |
| Xu et al. | Exploiting shared information for multi-intent natural language sentence classification. | |
| CN111382231B (en) | Intention recognition system and method | |
| CN114911932A (en) | Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement | |
| CN112786029B (en) | Method and apparatus for training VAD using weakly supervised data | |
| CN110516253A (en) | Method and system for semantic understanding of spoken Chinese | |
| CN107301170A (en) | The method and apparatus of cutting sentence based on artificial intelligence | |
| CN116615727A (en) | Keyword data augmentation tool for natural language processing | |
| CN111833844A (en) | Method and system for training hybrid model for speech recognition and language classification | |
| CN113095086B (en) | Method and system for predicting source meaning | |
| CN111161724B (en) | Chinese audio-visual combined speech recognition method, system, equipment and medium | |
| CN112116907A (en) | Speech recognition model establishment, speech recognition method, apparatus, equipment and medium | |
| CN115050351A (en) | Method and device for generating timestamp and computer equipment | |
| CN109299231B (en) | Dialog state tracking method, system, electronic device and storage medium | |
| Okur et al. | End-to-end evaluation of a spoken dialogue system for learning basic mathematics | |
| CN110597958A (en) | Text classification model training and use method and device | |
| CN114882880A (en) | Decoder-based voice wake-up method and related equipment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province; Applicant after: Sipic Technology Co.,Ltd.; Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province; Applicant before: AI SPEECH Co.,Ltd. |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20201023 |