TWI659411B - Multi-language hybrid speech recognition method - Google Patents
- Publication number: TWI659411B
- Application number: TW107106801A
- Authority: TW (Taiwan)
- Prior art keywords: multilingual, language, speech, mixed, indicate
- Prior art date: 2018-03-01
Abstract
The invention discloses a multilingual mixed speech recognition method, belonging to the technical field of speech recognition. The method includes: step S1, configuring a multilingual mixed dictionary covering several different languages; step S2, training an acoustic recognition model from the multilingual mixed dictionary and multilingual speech data covering several different languages; step S3, training a language recognition model from a multilingual text corpus covering several different languages; step S4, forming a speech recognition system from the multilingual mixed dictionary, the acoustic recognition model, and the language recognition model. The speech recognition system is then used to recognize mixed speech and to output the corresponding recognition result. The beneficial effects of this technical solution are that it supports the recognition of mixed speech in multiple languages and improves recognition accuracy and efficiency, thereby improving the performance of the speech recognition system.
Description
The invention relates to the technical field of speech recognition, and in particular to a multilingual mixed speech recognition method.
In everyday speech, people often unintentionally mix expressions from one or more other languages into a single utterance. For example, some English words, such as the proper nouns "ipad", "iphone", and "USB", are used directly in Chinese under their original names, producing mixed Chinese-English utterances. This phenomenon poses difficulties and challenges for speech recognition.
Early multilingual mixed speech recognition systems worked by building a separate recognition system for each language, segmenting the mixed speech, sending the speech segments of each language into the corresponding recognition system, and finally merging the per-segment recognition results into the recognition result for the mixed speech. On the one hand, this approach makes it difficult to guarantee the accuracy of segmenting the mixed speech by language; on the other hand, each segment produced by the segmentation carries too little context, which hurts recognition accuracy.
In recent years the approach to multilingual mixed speech recognition has begun to change: the dictionary of a single-language recognition system is expanded, that is, the phoneme set of one language is used to piece together words of another language. For example, the English word "iphone" would be given a pronunciation in a Chinese dictionary pieced together as「愛瘋」("ài fēng"). Although this method can recognize individual words from another language, it requires the user to pronounce them in a very unnatural way (for example, "iphone" must be pronounced exactly as「愛瘋」), and the accuracy of recognizing whole mixed-language sentences drops sharply.
In view of the above problems in the prior art, a technical solution for a multilingual mixed speech recognition method is now provided, which aims to support the recognition of multilingual mixed speech and to improve recognition accuracy and efficiency, thereby improving the performance of the speech recognition system.
The above technical solution specifically includes: a multilingual mixed speech recognition method, in which a speech recognition system for recognizing multilingual mixed speech is first formed. The method of forming the speech recognition system includes: Step S1, configuring a multilingual mixed dictionary covering several different languages; Step S2, training an acoustic recognition model from the multilingual mixed dictionary and multilingual speech data covering several different languages; Step S3, training a language recognition model from a multilingual text corpus covering several different languages; Step S4, forming the speech recognition system from the multilingual mixed dictionary, the acoustic recognition model, and the language recognition model. Subsequently, the speech recognition system is used to recognize mixed speech and to output the corresponding recognition result.
Preferably, in the multilingual mixed speech recognition method, in step S1 the multilingual mixed dictionary is configured by triphone modeling from the monolingual dictionaries respectively corresponding to each of the different languages.
Preferably, in the multilingual mixed speech recognition method, in step S1 the multilingual mixed dictionary is configured by triphone modeling; when the multilingual mixed dictionary is configured, a corresponding language tag is prepended to the phonemes of each language included in the dictionary, so that the phonemes of the different languages are distinguished.
Preferably, in the multilingual mixed speech recognition method, step S2 specifically includes: Step S21, training an acoustic model from multilingual speech data mixing several different languages and from the multilingual mixed dictionary; Step S22, extracting speech features from the multilingual speech data and performing a frame-alignment operation on the speech features with the acoustic model, so as to obtain the output label corresponding to each frame of speech features; Step S23, using the speech features as the input data of the acoustic recognition model and the output labels corresponding to the speech features as the output labels of the output layer of the acoustic recognition model, and training the acoustic recognition model.
Preferably, in the multilingual mixed speech recognition method, the acoustic model is a hidden Markov model-Gaussian mixture model.
Preferably, in the multilingual mixed speech recognition method, in step S23, after the acoustic recognition model is trained, its output layer is adjusted, which specifically includes: Step S231, computing the prior probability of each language, and computing the prior probability of silence, which is shared by all languages; Step S232, computing the posterior probability of each language, and computing the posterior probability of silence; Step S233, adjusting the output layer of the acoustic recognition model according to the prior and posterior probabilities of each language and the prior and posterior probabilities of silence.
Preferably, in the multilingual mixed speech recognition method, in step S231 the prior probability of each language is computed according to the following formula:

$$P(l_{ij}) = \frac{\mathrm{Count}(l_{ij})}{\sum_{k=1}^{N_j}\mathrm{Count}(l_{kj}) + \sum_{k=1}^{N_{sil}}\mathrm{Count}(sil_k)}$$

where $l_{ij}$ denotes the output label of the $i$-th state of the $j$-th language in the multilingual speech data; $P(l_{ij})$ denotes the prior probability that the output label in the multilingual speech data is $l_{ij}$; $\mathrm{Count}(l_{ij})$ denotes the total number of frames whose output label is $l_{ij}$; $sil_k$ denotes the output label of the $k$-th silence state in the multilingual speech data; $\mathrm{Count}(sil_k)$ denotes the total number of frames whose output label is $sil_k$; $N_j$ denotes the total number of states of the $j$-th language in the multilingual speech data; and $N_{sil}$ denotes the total number of silence states in the multilingual speech data.
Preferably, in the multilingual mixed speech recognition method, in step S231 the prior probability of silence is computed according to the following formula:

$$P(sil_i) = \frac{\mathrm{Count}(sil_i)}{\sum_{j \in L}\sum_{k=1}^{N_j}\mathrm{Count}(l_{kj}) + \sum_{k=1}^{N_{sil}}\mathrm{Count}(sil_k)}$$

where $sil_i$ denotes the output label of the $i$-th silence state in the multilingual speech data; $P(sil_i)$ denotes the prior probability that the output label is $sil_i$; $\mathrm{Count}(sil_i)$ denotes the total number of frames whose output label is $sil_i$; $l_{kj}$ denotes the output label of the $k$-th state of the $j$-th language; $\mathrm{Count}(l_{kj})$ denotes the total number of frames whose output label is $l_{kj}$; $N_j$ denotes the total number of states of the $j$-th language; $N_{sil}$ denotes the total number of silence states; and $L$ denotes the set of all languages in the multilingual speech data.
Preferably, in the multilingual mixed speech recognition method, in step S232 the posterior probability of each language is computed according to the following formula:

$$P(l_{ij} \mid x) = \frac{\exp(y_{ij})}{\sum_{k=1}^{N_j}\exp(y_{kj}) + \sum_{k=1}^{N_{sil}}\exp(y^{sil}_{k})}$$

where $l_{ij}$ denotes the output label of the $i$-th state of the $j$-th language in the multilingual speech data; $x$ denotes the speech features; $P(l_{ij} \mid x)$ denotes the posterior probability that the output label is $l_{ij}$; $y_{ij}$ denotes the input data of the $i$-th state of the $j$-th language; $y^{sil}_{k}$ denotes the input data of the $k$-th silence state; $N_j$ denotes the total number of states of the $j$-th language; $N_{sil}$ denotes the total number of silence states; and $\exp$ denotes the exponential function.
Preferably, in the multilingual mixed speech recognition method, in step S232 the posterior probability of silence is computed according to the following formula:

$$P(sil_i \mid x) = \frac{\exp(y^{sil}_{i})}{\sum_{j \in L}\sum_{k=1}^{N_j}\exp(y_{kj}) + \sum_{k=1}^{N_{sil}}\exp(y^{sil}_{k})}$$

where $sil_i$ denotes the output label of the $i$-th silence state in the multilingual speech data; $x$ denotes the speech features; $P(sil_i \mid x)$ denotes the posterior probability that the output label is $sil_i$; $y_{kj}$ denotes the input data of the $k$-th state of the $j$-th language; $y^{sil}_{k}$ denotes the input data of the $k$-th silence state; $N_j$ denotes the total number of states of the $j$-th language; $N_{sil}$ denotes the total number of silence states; $L$ denotes the set of all languages in the multilingual speech data; and $\exp$ denotes the exponential function.
Preferably, in the multilingual mixed speech recognition method, in step S2 the acoustic recognition model is a deep-neural-network acoustic model.
Preferably, in the multilingual mixed speech recognition method, in step S3 the language recognition model is formed by training an n-gram model, or by training a recurrent neural network.
Preferably, in the multilingual mixed speech recognition method, after the speech recognition system is formed, the weights of the different languages in the speech recognition system are first adjusted; the weight adjustment includes: Step A1, determining a posterior-probability weight for each language from real speech data; Step A2, adjusting the posterior probability of each language according to its posterior-probability weight, completing the weight adjustment.
Preferably, in the multilingual mixed speech recognition method, in step A2 the weight adjustment is performed according to the following formula:

$$\hat{P}(l_{ij} \mid x) = \alpha_j \, P(l_{ij} \mid x)$$

where $l_{ij}$ denotes the output label of the $i$-th state of the $j$-th language in the multilingual speech data; $x$ denotes the speech features; $P(l_{ij} \mid x)$ denotes the posterior probability that the output label is $l_{ij}$; $\alpha_j$ denotes the posterior-probability weight of the $j$-th language; and $\hat{P}(l_{ij} \mid x)$ denotes the weight-adjusted posterior probability that the output label is $l_{ij}$.
The beneficial effect of the above technical solution is to provide a multilingual mixed speech recognition method that supports the recognition of mixed speech in multiple languages and improves recognition accuracy and efficiency, thereby improving the performance of the speech recognition system.
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with each other.
The present invention is further described below with reference to the drawings and specific embodiments, which are not intended to limit the invention.
In view of the above problems in the prior art, the present invention provides a multilingual mixed speech recognition method. Here, mixed speech means speech data that mixes several different languages. For example, if a user speaks「我需要一個USB介面」("I need a USB interface"), the utterance contains both Chinese speech and the English proper noun "USB", so it is mixed speech. In other embodiments of the present invention, the mixed speech may also be a mixture of more than two languages, which is not limited here.
In the above multilingual mixed speech recognition method, a speech recognition system for recognizing the mixed speech must first be formed. As shown in Figure 1, the method of forming the speech recognition system includes: Step S1, configuring a multilingual mixed dictionary covering several different languages; Step S2, training an acoustic recognition model from the multilingual mixed dictionary and multilingual speech data covering several different languages; Step S3, training a language recognition model from a multilingual text corpus covering several different languages; Step S4, forming the speech recognition system from the multilingual mixed dictionary, the acoustic recognition model, and the language recognition model.
Once the speech recognition system has been formed, it can be used to recognize mixed speech and to output the corresponding recognition result.
Specifically, in this embodiment, the multilingual mixed dictionary is a mixed dictionary covering several different languages, configured down to the phoneme level. In a preferred embodiment of the present invention, the mixed dictionary is configured by triphone modeling, which yields a more stable dictionary model than character-level modeling. In addition, because dictionaries of different languages may contain phonemes written with the same symbols, a corresponding language tag must be prepended to the phonemes of each language when the mixed dictionary is configured, so that the phonemes of the different languages are distinguished.
For example, the Chinese and English phoneme sets both contain phonemes such as "b" and "d". To distinguish them, a language tag is added in front of all English phonemes (for example, the prefix "en"), separating the English phoneme set from the Chinese one, as shown in Figure 2.
The language tag may be empty: if the mixed dictionary contains two languages, tagging only one of them is enough to tell the two apart. Similarly, if the mixed dictionary contains three languages, tagging only two of them distinguishes all three, and so on.
In the mixed dictionary, language tags may also be added only between the phoneme sets of languages that could be confused. For example, if a mixed dictionary contains Chinese, English, and other languages, and only the Chinese and English phoneme sets could be confused, then only the English phoneme set needs a language tag.
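By way of illustration only, the tagging scheme described above can be sketched in a few lines of Python; the dictionary format, the function name, and the example pronunciations below are hypothetical, while the "en" prefix and the rule of tagging only the potentially confusable language follow the description above.

```python
def build_mixed_dictionary(dictionaries, tagged_languages):
    """Merge monolingual pronunciation dictionaries into one mixed dictionary.

    dictionaries: dict mapping language id -> {word: [phonemes]}
    tagged_languages: language ids whose phonemes receive a prefix tag,
                      e.g. {"en"}; untagged languages keep bare phonemes.
    """
    mixed = {}
    for lang, lexicon in dictionaries.items():
        for word, phones in lexicon.items():
            if lang in tagged_languages:
                # Prepend the language tag so that Chinese "b" and English
                # "en_b" become distinct phoneme symbols.
                phones = [f"{lang}_{p}" for p in phones]
            mixed[word] = phones
    return mixed

# Illustrative entries; real pronunciations come from the monolingual dictionaries.
dictionaries = {
    "zh": {"介面": ["j", "ie", "m", "ian"]},
    "en": {"USB": ["y", "uw", "eh", "s", "b", "iy"]},
}
mixed = build_mixed_dictionary(dictionaries, tagged_languages={"en"})
# mixed["USB"] -> ['en_y', 'en_uw', 'en_eh', 'en_s', 'en_b', 'en_iy']
```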
In this embodiment, after the multilingual mixed dictionary is formed, an acoustic recognition model is trained from the mixed dictionary and multilingual speech data covering several languages. Specifically, the multilingual speech data is mixed training speech, prepared in advance, that covers several different languages, and the mixed dictionary provides the phonemes of the different languages during the formation of the acoustic recognition model. Therefore, in order to obtain the triphone relations of the mixed-language phonemes while training the multilingual acoustic recognition model, the multilingual speech data mixing the several languages must be prepared, and the training must be carried out against the multilingual mixed dictionary formed above.
In this embodiment, a language recognition model is then trained from a multilingual mixed text corpus. Finally, the multilingual mixed dictionary, the acoustic recognition model, and the language recognition model are included in a speech recognition system, which recognizes the mixed multi-language speech input by the user and outputs the recognition result.
In this embodiment, after the above processing, recognizing the mixed speech proceeds much like recognizing monolingual speech in the prior art: the acoustic recognition model maps the speech features in a stretch of speech data to the corresponding phoneme or word sequence, and the language recognition model turns the word sequence into a complete sentence, completing the recognition of the mixed speech. This recognition process is not described further here.
In summary, in the technical solution of the present invention, a multilingual mixed dictionary covering several languages is first formed from several monolingual dictionaries, with the phonemes of the different languages marked with language tags so that they can be distinguished. An acoustic recognition model is then trained from multilingual mixed speech data and the multilingual mixed dictionary, and a language recognition model is trained from a multilingual mixed text corpus. A complete speech recognition system is then formed from the multilingual mixed dictionary, the acoustic recognition model, and the language recognition model, and used to recognize the multilingual mixed speech input by the user.
In a preferred embodiment of the present invention, as shown in Figure 3, step S2 specifically includes: Step S21, training an acoustic model from multilingual speech data mixing several different languages and from the multilingual mixed dictionary; Step S22, extracting speech features from the multilingual speech data and frame-aligning the speech features with the acoustic model, so as to obtain the output label corresponding to each frame of speech features; Step S23, using the speech features as the input data of the acoustic recognition model and the corresponding output labels as the output labels of its output layer, and training the acoustic recognition model.
Specifically, in this embodiment, before the acoustic recognition model is trained, an acoustic model is first trained from the multilingual speech data mixing several different languages. The acoustic model may be a Hidden Markov Model-Gaussian Mixture Model (HMM-GMM). To address the robustness problem of parameter re-estimation faced in triphone modeling, parameter-sharing techniques may be used during training to reduce the number of parameters. HMM-GMM acoustic modeling is by now quite mature and is not described further here.
In this embodiment, after the acoustic model is formed, it is used to frame-align the multilingual speech data, so that the speech features extracted from each frame of the multilingual speech data correspond to one output label. Specifically, after frame alignment, each frame of speech features corresponds to one GMM index. The output labels of the output layer of the acoustic recognition model are the labels of the individual frames of speech features, so the number of output labels in the output layer equals the number of GMMs in the HMM-GMM model, with each output node corresponding to one GMM.
In this embodiment, the speech features are used as the input data of the acoustic recognition model and the output labels corresponding to the speech features as the output labels of its output layer, and the acoustic recognition model is trained.
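A minimal sketch of this pairing is given below (illustrative code; it assumes the HMM-GMM alignment is exposed as a callable that returns one GMM index per frame):

```python
def make_training_pairs(utterances, aligner):
    """Pair each frame's feature vector with the GMM label from frame alignment.

    utterances: iterable of 2-D arrays of speech features (frames x feat_dim)
    aligner: callable returning one GMM index per frame (HMM-GMM alignment)
    """
    features, labels = [], []
    for feats in utterances:
        gmm_ids = aligner(feats)              # one output label per frame
        assert len(gmm_ids) == len(feats)     # alignment is frame-synchronous
        features.extend(feats)
        labels.extend(gmm_ids)
    return features, labels
```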
Figure 4 shows the general structure of the acoustic recognition model in one embodiment of the present invention. The model is a deep neural network built from fully connected layers: the network contains seven fully connected layers of 2048 nodes each, with a sigmoid nonlinearity between every two layers. The output layer is implemented with a softmax nonlinearity. In Figure 4, S51 denotes the output layer of the acoustic recognition model, and L1, L2, and L3 denote output labels on the output layer associated with different languages.
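The architecture just described can be rendered as the following sketch (a hypothetical PyTorch formulation, not the patent's actual implementation; the layer sizes follow the text above):

```python
import torch.nn as nn

def make_acoustic_dnn(feat_dim, num_gmm_labels, hidden=2048, depth=7):
    """Fully connected DNN acoustic model: seven 2048-node sigmoid layers
    and a softmax output layer with one node per HMM-GMM output label."""
    layers, in_dim = [], feat_dim
    for _ in range(depth):
        layers += [nn.Linear(in_dim, hidden), nn.Sigmoid()]
        in_dim = hidden
    # One output node per frame label (GMM index); Softmax turns the input
    # values y_i into posteriors P(q_i | x).
    layers += [nn.Linear(in_dim, num_gmm_labels), nn.Softmax(dim=-1)]
    return nn.Sequential(*layers)

# E.g. 40-dimensional features and 6000 tied-state (GMM) output labels.
model = make_acoustic_dnn(feat_dim=40, num_gmm_labels=6000)
```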
In a preferred embodiment of the present invention, in step S23, after the acoustic recognition model is trained, its output layer must be adjusted for the multiple languages and its priors computed. As shown in Figure 5, this specifically includes: Step S231, computing the prior probability of each language, and computing the prior probability of silence, which is shared by all languages; Step S232, computing the posterior probability of each language, and computing the posterior probability of silence; Step S233, adjusting the output layer of the acoustic recognition model according to the prior and posterior probabilities of each language and the prior and posterior probabilities of silence.
Specifically, in a preferred embodiment of the present invention, when the acoustic recognition model is used for speech recognition, the output word string for a given sequence of speech features is generally determined by the following formula:

$$\hat{w} = \arg\max_{w} P(w)\,P(x \mid w) \tag{1}$$

where $\hat{w}$ denotes the word string of the output result, $w$ denotes a candidate word string, $x$ denotes the input speech features, $P(w)$ denotes the probability given by the language recognition model, and $P(x \mid w)$ denotes the probability given by the acoustic recognition model.
$P(x \mid w)$ can be further expanded as:

$$P(x \mid w) = \sum_{q_0,\dots,q_T}\pi(q_0)\prod_{t=1}^{T}P(x_t \mid q_t) \tag{2}$$

where $x_t$ denotes the speech features input at time $t$, $q_t$ denotes the tied triphone state at time $t$, $\pi(q_0)$ denotes the probability distribution of the initial state $q_0$, and $P(x_t \mid q_t)$ denotes the probability of the features $x_t$ in state $q_t$.
$P(x_t \mid q_t)$ can be further expanded as:

$$P(x_t \mid q_t) = \frac{P(q_t \mid x_t)\,P(x_t)}{P(q_t)} \tag{3}$$

where $P(q_t \mid x_t)$ is the posterior probability of the output layer of the acoustic recognition model, $P(q_t)$ is the prior probability of the acoustic recognition model, and $P(x_t)$ denotes the probability of $x_t$. $P(x_t)$ does not depend on the word-string sequence and can therefore be ignored.
It follows from formula (3) that the word string of the output result can be adjusted by computing the prior and posterior probabilities of the output layer of the acoustic recognition model.
In a preferred embodiment of the present invention, the prior probability $P(q)$ of the neural network is usually computed by:

$$P(q_i) = \frac{\mathrm{Count}(q_i)}{N} \tag{4}$$

where $\mathrm{Count}(q_i)$ denotes the total number of frames in the multilingual speech data labeled $q_i$, and $N$ denotes the total count of all output labels.
In a preferred embodiment of the present invention, because the amounts of training speech data for the different languages may differ, this prior probability cannot be computed uniformly and must instead be computed separately for each language.
Thus, in a preferred embodiment of the present invention, in step S231 the prior probability of each language is first computed, together with the prior probability of the silence shared by all languages.
First, the prior probability of each language is computed according to the following formula:

$$P(l_{ij}) = \frac{\mathrm{Count}(l_{ij})}{\sum_{k=1}^{N_j}\mathrm{Count}(l_{kj}) + \sum_{k=1}^{N_{sil}}\mathrm{Count}(sil_k)} \tag{5}$$

where $l_{ij}$ denotes the output label of the $i$-th state of the $j$-th language in the multilingual speech data; $P(l_{ij})$ denotes the prior probability that the output label is $l_{ij}$; $\mathrm{Count}(l_{ij})$ denotes the total number of frames whose output label is $l_{ij}$; $sil_k$ denotes the output label of the $k$-th silence state in the multilingual speech data; $\mathrm{Count}(sil_k)$ denotes the total number of frames whose output label is $sil_k$; $N_j$ denotes the total number of states of the $j$-th language; and $N_{sil}$ denotes the total number of silence states.
Then, the prior probability of silence is computed according to the following formula:

$$P(sil_i) = \frac{\mathrm{Count}(sil_i)}{\sum_{j \in L}\sum_{k=1}^{N_j}\mathrm{Count}(l_{kj}) + \sum_{k=1}^{N_{sil}}\mathrm{Count}(sil_k)} \tag{6}$$

where $P(sil_i)$ denotes the prior probability that the output label is $sil_i$, and $L$ denotes the set of all languages in the multilingual speech data.
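Formulas (5) and (6) amount to the following counting procedure (an illustrative sketch; the label-count bookkeeping is assumed to come from the frame-alignment step):

```python
def language_priors(counts, lang_states, sil_states):
    """Per-language and silence priors, formulas (5) and (6).

    counts: dict mapping output label -> number of aligned frames
    lang_states: dict mapping language id -> list of that language's labels
    sil_states: list of silence labels shared by all languages
    """
    sil_total = sum(counts[s] for s in sil_states)
    all_lang_total = sum(counts[s] for states in lang_states.values()
                         for s in states)
    priors = {}
    for lang, states in lang_states.items():
        # Formula (5): normalize within this language's states plus silence.
        denom = sum(counts[s] for s in states) + sil_total
        for s in states:
            priors[s] = counts[s] / denom
    for s in sil_states:
        # Formula (6): silence normalizes over all languages plus silence.
        priors[s] = counts[s] / (all_lang_total + sil_total)
    return priors
```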
In a preferred embodiment of the present invention, after the prior probabilities of each language and of silence have been computed, the posterior probabilities of the acoustic recognition model are computed next. The posterior probability $P(q_i \mid x)$ output by the neural network is usually computed by the output layer; when the output layer is implemented with a softmax nonlinearity, the posterior probability is usually computed as:

$$P(q_i \mid x) = \frac{\exp(y_i)}{\sum_{k=1}^{N}\exp(y_k)} \tag{7}$$

where $y_i$ denotes the input value of the $i$-th state and $N$ is the number of all states.
Similarly, in the acoustic recognition model, unbalanced amounts of training data across languages skew the distribution of the computed state values for the different languages, so the posterior probability must also be computed separately for each language.
Thus, in a preferred embodiment of the present invention, in step S232 the posterior probability of each language is computed according to the following formula:

$$P(l_{ij} \mid x) = \frac{\exp(y_{ij})}{\sum_{k=1}^{N_j}\exp(y_{kj}) + \sum_{k=1}^{N_{sil}}\exp(y^{sil}_{k})} \tag{8}$$

where $x$ denotes the speech features; $P(l_{ij} \mid x)$ denotes the posterior probability that the output label is $l_{ij}$; $y_{ij}$ denotes the input data of the $i$-th state of the $j$-th language; $y^{sil}_{k}$ denotes the input data of the $k$-th silence state; and $\exp$ denotes the exponential function.
In a preferred embodiment of the present invention, in step S232 the posterior probability of silence is computed according to the following formula:

$$P(sil_i \mid x) = \frac{\exp(y^{sil}_{i})}{\sum_{j \in L}\sum_{k=1}^{N_j}\exp(y_{kj}) + \sum_{k=1}^{N_{sil}}\exp(y^{sil}_{k})} \tag{9}$$

where $P(sil_i \mid x)$ denotes the posterior probability that the output label is $sil_i$.
In the present invention, the improved formulas (5), (6), (8), and (9) give the prior and posterior probabilities for each language and for the silence state, so that the acoustic recognition model meets the output requirements of multilingual mixed modeling and describes each language and the silence state more accurately. It should be noted that, after the above adjustment, neither the prior probabilities nor the posterior probabilities sum to 1.
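The language-wise softmax of formulas (8) and (9) can be sketched analogously (illustrative NumPy code; `y` holds the input values of the output layer, and, as noted above, the returned scores no longer sum to 1 over the whole layer):

```python
import numpy as np

def language_posteriors(y, lang_index, sil_index):
    """Per-language softmax, formulas (8) and (9).

    y: 1-D array of output-layer input values, one per output label
    lang_index: dict mapping language id -> indices of its labels in y
    sil_index: list of indices of the silence labels in y
    """
    e = np.exp(np.asarray(y, dtype=float))
    sil_sum = e[sil_index].sum()
    all_lang_sum = sum(e[idx].sum() for idx in lang_index.values())
    post = np.empty_like(e)
    for lang, idx in lang_index.items():
        # Formula (8): each language normalizes over its own states + silence.
        post[idx] = e[idx] / (e[idx].sum() + sil_sum)
    # Formula (9): silence normalizes over all languages + silence.
    post[sil_index] = e[sil_index] / (all_lang_sum + sil_sum)
    return post
```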
In a preferred embodiment of the present invention, in step S3 the language recognition model may be formed by training an n-gram model, or by training a recurrent neural network. The multilingual text corpus must include both separate monolingual text corpora and mixed multilingual text data.
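As an illustration of the n-gram option only, a toy maximum-likelihood bigram estimator over a tokenized mixed corpus might look as follows (real systems would add smoothing and train on far larger monolingual and mixed text):

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Toy maximum-likelihood bigram LM over tokenized mixed-language text."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        tokens = ["<s>"] + tokens + ["</s>"]
        unigrams.update(tokens[:-1])            # bigram contexts
        bigrams.update(zip(tokens, tokens[1:]))
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return prob

# Mixed Chinese-English sentence, as in the "USB" example above.
lm = train_bigram_lm([["我", "需要", "一個", "USB", "介面"]])
# lm("一個", "USB") -> 1.0 in this toy corpus
```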
In a preferred embodiment of the present invention, after the speech recognition system is formed, the weights of the different languages in the speech recognition system are first adjusted. As shown in Figure 6, the weight adjustment includes: Step A1, determining a posterior-probability weight for each language from real speech data; Step A2, adjusting the posterior probability of each language according to its posterior-probability weight, completing the weight adjustment.
Specifically, in this embodiment, after the speech recognition system is formed, the amounts of training data for the different languages may be unbalanced: a language with more data obtains a relatively larger prior probability. Since the final recognition score is the posterior probability divided by the prior probability, the language with more training data actually ends up with a smaller recognition score. As a result, the recognition system may tend to recognize one language while failing to recognize another, biasing the recognition results.
To solve this problem, before the speech recognition system is put to practical use, it must be tested on real data used as a development set so that the weight of each language can be tuned. The weight adjustment is usually applied to the posterior probability output by the acoustic recognition model, hence the formula:

$$\hat{P}(l_{ij} \mid x) = \alpha_j \, P(l_{ij} \mid x) \tag{10}$$

where $l_{ij}$ denotes the output label of the $i$-th state of the $j$-th language in the multilingual speech data; $x$ denotes the speech features; $P(l_{ij} \mid x)$ denotes the posterior probability that the output label is $l_{ij}$; $\alpha_j$ denotes the posterior-probability weight of the $j$-th language, determined by testing the acoustic recognition model on the development set of real data; and $\hat{P}(l_{ij} \mid x)$ denotes the weight-adjusted posterior probability that the output label is $l_{ij}$.
With this weight adjustment, the speech recognition system achieves good recognition results across different application scenarios.
In a preferred embodiment of the present invention, for a mixed Chinese-English speech recognition system, after testing on real data the posterior-probability weight of Chinese may be set to 1.0, the posterior-probability weight of English to 0.3, and the posterior-probability weight of silence to 1.0.
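Combined with the posterior sketch above, formula (10) with these example weights reduces to a per-language rescaling (illustrative code; `language_posteriors` is the sketch given earlier):

```python
def adjust_posteriors(post, lang_index, sil_index, weights):
    """Formula (10): scale each language's posteriors by its tuned weight."""
    adjusted = post.copy()
    for lang, idx in lang_index.items():
        adjusted[idx] = weights[lang] * post[idx]
    adjusted[sil_index] = weights["sil"] * post[sil_index]
    return adjusted

# Development-set weights from the Chinese-English example above.
weights = {"zh": 1.0, "en": 0.3, "sil": 1.0}
```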
In other embodiments of the present invention, the posterior-probability weights can be tuned iteratively over several development sets composed of different real data, eventually settling on the best values.
The above are only preferred embodiments of the present invention and do not limit its implementation or protection scope. Those skilled in the art should appreciate that all solutions obtained by equivalent substitution and obvious variation of the specification and drawings of the present invention fall within the protection scope of the present invention.
S1‧‧‧Step S1
S2‧‧‧Step S2
S3‧‧‧Step S3
S4‧‧‧Step S4
S21‧‧‧Step S21
S22‧‧‧Step S22
S23‧‧‧Step S23
S51‧‧‧Output layer of the acoustic recognition model
L1‧‧‧Output label
L2‧‧‧Output label
L3‧‧‧Output label
S231‧‧‧Step S231
S232‧‧‧Step S232
S233‧‧‧Step S233
A1‧‧‧Step A1
A2‧‧‧Step A2
Figure 1 is a schematic diagram of the overall flow of forming the speech recognition system in a multilingual mixed speech recognition method in a preferred embodiment of the present invention;
Figure 2 is a schematic diagram of the multilingual mixed dictionary in a preferred embodiment of the present invention;
Figure 3 is a schematic flow diagram, building on Figure 1, of training the acoustic recognition model in a preferred embodiment of the present invention;
Figure 4 is a schematic diagram of the structure of the acoustic recognition model in a preferred embodiment of the present invention;
Figure 5 is a schematic flow diagram, building on Figure 3, of adjusting the output layer of the acoustic recognition model in a preferred embodiment of the present invention;
Figure 6 is a schematic flow diagram of the weight adjustment of the speech recognition system in a preferred embodiment of the present invention.
Claims (11)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW107106801A | 2018-03-01 | 2018-03-01 | Multi-language hybrid speech recognition method |
Publications (2)

| Publication number | Publication date |
|---|---|
| TWI659411B | 2019-05-11 |
| TW201937479A | 2019-09-16 |
Family ID: 67348147
Patent Citations (4)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW200945066A | 2007-11-26 | 2009-11-01 | Warren Daniel Child | Modular system and method for managing Chinese, Japanese, and Korean linguistic data in electronic form |
| WO2015184186A1 | 2014-05-30 | 2015-12-03 | Apple Inc. | Multi-command single utterance input method |
| US20170148433A1 | 2015-11-25 | 2017-05-25 | Baidu USA LLC | Deployed end-to-end speech recognition |
| US20170324866A1 | 2016-05-06 | 2017-11-09 | Genesys Telecommunications Laboratories, Inc. | System and method for chat automation |
Cited By (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110675865A | 2019-11-06 | 2020-01-10 | 百度在线网络技术(北京)有限公司 | Method and apparatus for training hybrid language recognition models |
| CN110675865B | 2019-11-06 | 2021-09-28 | 百度在线网络技术(北京)有限公司 | Method and apparatus for training hybrid language recognition models |
| CN111613208A | 2020-05-22 | 2020-09-01 | 云知声智能科技股份有限公司 | Language identification method and equipment |
| CN111613208B | 2020-05-22 | 2023-08-25 | 云知声智能科技股份有限公司 | Language identification method and equipment |
| CN112257407A | 2020-10-20 | 2021-01-22 | 网易(杭州)网络有限公司 | Method and device for aligning text in audio, electronic equipment and readable storage medium |
| CN112257407B | 2020-10-20 | 2024-05-14 | 网易(杭州)网络有限公司 | Text alignment method and device in audio, electronic equipment and readable storage medium |
Also Published As

| Publication number | Publication date |
|---|---|
| TW201937479A | 2019-09-16 |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | MM4A | Annulment or lapse of patent due to non-payment of fees | |