JP2005292770A

JP2005292770A - Acoustic model generation device and speech recognition device

Info

Publication number: JP2005292770A
Application number: JP2004286082A
Authority: JP
Inventors: Cincarek Tobias; トビアス・ツィンツァレク; Gruhn Rainer; ライナー・グルーン; Satoru Nakamura; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-03-10
Filing date: 2004-09-30
Publication date: 2005-10-20

Abstract

[Problem] To improve speech recognition accuracy for utterances of non-native speakers.
[Solution] A speech recognition device 38 performs speech recognition on an input utterance 36 using a plurality of acoustic models 34 created in advance, one for each non-native speaker group. The device includes a speaker group classification unit 80 that selects, based on the acoustic features of the input utterance 36, an acoustic model 82 matching those features, and a decoding unit 84 that performs speech recognition on the input utterance 36 using the selected acoustic model 82. Alternatively, the input utterance may be decoded in parallel with the plurality of acoustic models 34 and the hypothesis with the highest likelihood selected.
[Selected Drawing] FIG. 3

Description

The present invention relates to an acoustic model for speech recognition and to a speech recognition device using such an acoustic model. In particular, it relates to a device that generates acoustic models enabling highly accurate recognition of utterances by speakers who are not native speakers (non-native speakers), and to a speech recognition device that uses such acoustic models to recognize the utterances of non-native speakers with high accuracy.

Two main approaches to speech recognition for non-native speakers are known. The first is adaptation of the pronunciation model, and the second is adaptation of the acoustic model. Previous studies have shown that using a pronunciation model improves recognition performance for foreign-derived accents within the native language. In the following, when a speaker speaks in a language other than his or her native language, that foreign language is referred to as the "non-native language".

Such a method requires pronunciation variants to be added manually for each word in the dictionary. This work is not possible without deep knowledge of both the speaker's native language and the non-native language. Doing it automatically would require a large amount of labeled speech data, which is difficult to prepare. In addition, the method can only handle substitutions, deletions, and insertions expressed in terms of native-language phonemes.

Furthermore, according to Non-Patent Document 1, non-native speakers appear to produce their pronunciation by merging utterance features that are part of their native language with the utterance features of the non-native language. It is therefore difficult to analyze the speech of non-native speakers adequately merely by preparing acoustic models or pronunciation models for both the native and the non-native language.

Non-Patent Document 1: Flege, J. E., Schirru, C., and MacKay, I. R. A., "Interaction between the native and second language phonetic subsystems," Speech Communication, 40: 467-491, 2003.

Previous research leads to the conclusion that improving speech recognition performance for non-native speakers requires adapting each acoustic phoneme-unit model, and that pronunciation modeling alone is not sufficient.

Accordingly, an object of the present invention is to improve speech recognition performance for non-native speakers by adapting the acoustic model.

A further object of the present invention is to provide a speech recognition device that can cope with the acoustic features of non-native speakers that differ from those of native speakers.

An acoustic model generation device according to a first aspect of the present invention includes: speech data preparation means for preparing speech data of a plurality of speakers classified in advance into one of a plurality of groups according to predetermined speech features; and acoustic model group generation means for creating an acoustic model for each group based on the speech data prepared by the speech data preparation means.

Creating an acoustic model for each group classified by speech features yields an acoustic model adapted to the features common to the speakers in that group. Using that model for speech recognition of speakers belonging to the group improves recognition accuracy compared with using an acoustic model created from the speech data of all speakers.

Preferably, the speech data preparation means includes speaker clustering means for clustering the plurality of speakers into a plurality of groups based on utterance data from those speakers, and the acoustic model group generation means includes means for creating, for each of the groups obtained by the speaker clustering means, an acoustic model based on the utterance data of the speakers belonging to that group.

Clustering based on the utterance data allows the plurality of speakers to be classified automatically according to a predetermined criterion.

Preferably, the speaker clustering means includes: speaker-dependent acoustic model generation means for generating a speaker-dependent acoustic model from the utterance data of each of the plurality of speakers; representative vector generation means for generating a representative vector for each speaker based on that acoustic model; means for converting the plurality of representative vectors obtained for the speakers into lower-dimensional representative vectors by principal component analysis; and clustering means for clustering the speakers into a plurality of groups by applying a predetermined clustering process to the lower-dimensional representative vectors.

More preferably, the predetermined clustering process is K-means clustering.

K-means clustering can produce clusters of similar size. The number of speakers in each cluster can thus be equalized, so that every acoustic model can be built with a similar degree of robustness.

The acoustic model group generation means may include either of the following: means that creates an acoustic model for each of the groups obtained by the speaker clustering means using the utterance data of the speakers belonging to that group; or means that creates an acoustic model for each group by adapting a speaker-independent base acoustic model prepared in advance, through maximum a posteriori estimation using the utterance data of the speakers belonging to that group.

A speech recognition device according to a second aspect of the present invention performs speech recognition on an input utterance using a plurality of acoustic models, each generated from utterance data with mutually different acoustic features. The device includes acoustic model selection means for selecting, based on the acoustic features of the input utterance, the acoustic model among the plurality that matches those features, and speech recognition means for performing speech recognition on the input utterance using the acoustic model selected by the acoustic model selection means.

From among acoustic models built from utterance data classified by differing acoustic features, the model matching the acoustic features of the input utterance is selected. Using the model selected in this way makes recognition robust to the pronunciation variations contained in the input utterance, and recognition accuracy improves as a result.

A speech recognition device according to a third aspect of the present invention performs speech recognition on an input utterance using a plurality of acoustic models, each generated from utterance data with mutually different acoustic features. The device includes speech recognition means for performing speech recognition on the input utterance with each of the plurality of acoustic models and outputting a plurality of hypotheses, and hypothesis output means for outputting a single hypothesis based on the plurality of hypotheses output by the speech recognition means.

When recognition is performed in parallel with several kinds of acoustic models in this way, each model yields the hypothesis it considers most probable. Selecting one of these hypotheses makes it possible to recognize the input utterance without selecting an acoustic model in advance according to the acoustic features of the input utterance.

Preferably, the speech recognition means outputs a likelihood with each of the hypotheses, and the hypothesis output means includes means for selecting and outputting the hypothesis with the highest likelihood.

Selecting, from the plurality of hypotheses, the one whose likelihood obtained with the acoustic model at recognition time is highest makes it possible to recognize the input speech with high accuracy, again without selecting an acoustic model in advance according to the acoustic features of the input utterance.
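The selection rule described above (decode with every group model, then keep the hypothesis with the highest likelihood) can be sketched as follows. This is only a minimal illustration: the `Hypothesis` type, the group labels, and the log-likelihood values are hypothetical and do not come from the patent, and the decoders that would produce the hypotheses are not shown.

```python
from typing import NamedTuple, Sequence, Tuple


class Hypothesis(NamedTuple):
    """One decoder output (hypothetical container, not from the patent)."""
    group: str                 # label of the acoustic model that produced it
    words: Tuple[str, ...]     # recognized word sequence
    log_likelihood: float      # score reported by the decoder


def select_best_hypothesis(hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the highest log-likelihood."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    return max(hypotheses, key=lambda h: h.log_likelihood)


# Made-up scores for two group-specific models decoding the same utterance.
candidates = [
    Hypothesis("group-1", ("i", "would", "like", "a", "room"), -412.7),
    Hypothesis("group-2", ("i", "would", "like", "a", "loom"), -430.2),
]
best = select_best_hypothesis(candidates)
```

Because every model scores the same input utterance, comparing the scores directly is the simplest possible combination rule; nothing here attempts the word-level integration of hypotheses.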

Alternatively, the speech recognition means may output a likelihood for each word contained in the plurality of hypotheses, and the hypothesis output means may include hypothesis integration means that outputs, from among the word sequences that can be formed by integrating the plurality of hypotheses, the one with the highest likelihood calculated from the likelihoods of its words.

[First Embodiment]
<Configuration>
FIG. 1 is a block diagram showing the configuration of a speech recognition system 20 for English utterances according to an embodiment of the present invention. Referring to FIG. 1, the system 20 includes a group-specific acoustic model generation device 32, which clusters English utterance data 30 from a plurality of non-native speakers into a plurality of groups based on their acoustic features and generates a group-specific acoustic model group 34, and a non-native utterance speech recognition device 38, which uses the group-specific acoustic model group 34 to perform speech recognition on an input utterance 36 and output a hypothesis 40.

FIG. 2 is a block diagram showing the group-specific acoustic model generation device 32 in more detail. Referring to FIG. 2, the device 32 includes a speaker clustering processing unit 60, which clusters the non-native speaker utterance data 30 into a plurality of groups 62-1 to 62-n based on acoustic features, and an acoustic model training unit 64, which trains an acoustic model prepared in advance for each of the groups 62-1 to 62-n and generates the group-specific acoustic model group 34.

Clustering by the speaker clustering processing unit 60 is described in detail later. Acoustic model training by the acoustic model training unit 64 is the same as ordinary training, except that the utterance data used is that of one of the speaker groups 62-1 to 62-n clustered by the speaker clustering processing unit 60.

FIG. 3 is a block diagram of the non-native utterance speech recognition device 38, which uses the group-specific acoustic model group 34 generated in this way to perform speech recognition on an input utterance 36 from a non-native speaker and output a hypothesis 40. Referring to FIG. 3, the device 38 includes a speaker group classification unit 80, which receives the input utterance 36, determines which speaker group of the group-specific acoustic model group 34 its speaker belongs to, and selects the acoustic model 82 of that group from the group-specific acoustic model group 34, and a decoding unit 84, which decodes (recognizes) the input utterance 36 using the selected acoustic model 82 and outputs the hypothesis 40.

The speaker clustering processing unit 60 shown in FIG. 2 uses the K-means clustering algorithm to cluster the non-native speaker utterance data 30 into the groups 62-1 to 62-n based on the features of the data itself. The procedure is as follows.

First, a speaker-dependent acoustic model (hereinafter "SD-AM", Speaker Dependent Acoustic Model) is created for each speaker. Next, for each SD-AM, its mean vectors are concatenated to form a combined vector representing the speaker (hereinafter "representative vector"). In this embodiment, the SD-AM is a monophone model consisting of HMMs (hidden Markov models) with one Gaussian distribution per state. The SD-AM is obtained by MAP (maximum a posteriori) estimation applied to a speaker-independent base AM (hereinafter "SI-AM", Speaker Independent Acoustic Model).
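As a rough illustration of how a representative vector might be assembled, the sketch below concatenates the Gaussian mean vector of every state of every monophone model in a fixed order. The dictionary layout, the phone names, and the toy two-dimensional means are assumptions made for the example; the patent does not specify a data structure.

```python
from typing import Dict, List, Sequence


def representative_vector(state_means: Dict[str, List[List[float]]],
                          phone_order: Sequence[str]) -> List[float]:
    """Concatenate the mean vectors of all states of all phone models.

    `state_means[phone]` holds one mean vector per HMM state (one Gaussian
    per state). A fixed `phone_order` keeps the layout identical across
    speakers so that the resulting vectors are comparable.
    """
    vector: List[float] = []
    for phone in phone_order:
        for mean in state_means[phone]:
            vector.extend(mean)
    return vector


# Toy speaker model: 2 phones, 2 states each, 2-dimensional means.
toy_model = {
    "a": [[0.1, 0.2], [0.3, 0.4]],
    "b": [[0.5, 0.6], [0.7, 0.8]],
}
vec = representative_vector(toy_model, ["a", "b"])
```

The length of the result is (number of phones) x (states per phone) x (feature dimension), which is why a dimensionality reduction step is applied before clustering.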

Principal component analysis (PCA) is applied to the representative vectors obtained for the speakers in this way, yielding the basis of a 15-dimensional eigenspace. Here it is desirable to set the dimension of the basis so that the basis covers 95% of the variation (variance) of the samples. That is, denoting by r_k the ratio of the sum of the first k eigenvalues to the sum of all eigenvalues in the principal component analysis, the dimension k of the basis is set so that

$$ r_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{m} \lambda_i} \geq 0.95 $$

where \lambda_i is the i-th eigenvalue and m is the dimension of the original representative vectors.
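The 95% coverage criterion for choosing the dimension k can be sketched as below. The function name and the example eigenvalues are hypothetical; the eigenvalues are assumed to come from a PCA that has already been run on the representative vectors.

```python
from typing import Sequence


def choose_dimension(eigenvalues: Sequence[float], coverage: float = 0.95) -> int:
    """Smallest k whose k largest eigenvalues cover `coverage` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, lam in enumerate(ordered, start=1):
        cumulative += lam
        if cumulative / total >= coverage:
            return k
    return len(ordered)


# Made-up eigenvalues: the first three account for 97% of the variance.
k = choose_dimension([6.0, 3.0, 0.7, 0.2, 0.1])
```

With real data the same rule would be applied to the m eigenvalues of the covariance matrix of the representative vectors.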

Projecting each speaker's representative vector described above onto this eigenspace allows each speaker to be represented by a lower-dimensional vector.
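A minimal sketch of the projection step follows. It assumes the usual PCA convention of subtracting the mean of all representative vectors before taking inner products with the basis vectors; the patent does not state whether centering is applied, and the names and toy values are hypothetical.

```python
from typing import List, Sequence


def project(vector: Sequence[float],
            mean: Sequence[float],
            basis: Sequence[Sequence[float]]) -> List[float]:
    """Project `vector` onto the eigenspace spanned by the rows of `basis`.

    Assumption: the vector is centered with the mean of all representative
    vectors first, as is conventional for PCA.
    """
    centered = [x - m for x, m in zip(vector, mean)]
    return [sum(b_i * c_i for b_i, c_i in zip(b, centered)) for b in basis]


# Toy example: keep the first two coordinates of a 3-dimensional vector.
low_dim = project([3.0, 4.0, 5.0], [1.0, 1.0, 1.0],
                  [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```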

The K-means algorithm can produce clusters of similar size. In this embodiment, the number of clusters is set to the number of groups n mentioned above (for example, n = 5). The value of n may be chosen, for example, to equal the number of native languages expected among the non-native speakers. The number of clusters should be kept small enough that the clusters do not become sparse. Initializing each cluster with the speakers of one native-language group makes it easier to obtain clusters with balanced numbers of speakers.
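The following is a small, dependency-free sketch of K-means in which the initial cluster of each speaker is given by a label such as the native-language group, as suggested above. It is plain Lloyd-style K-means written for illustration, not the implementation used in the patent; the function names and the toy points are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(points: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Vector], init_labels: Sequence[int],
           n_clusters: int, max_iter: int = 100) -> List[int]:
    """K-means starting from an initial assignment (e.g. by native language)."""
    labels = list(init_labels)
    for _ in range(max_iter):
        # Update step: recompute the centroid of every non-empty cluster.
        centroids = []
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            centroids.append(mean_vector(members) if members else None)
        live = [c for c in range(n_clusters) if centroids[c] is not None]
        # Assignment step: move each speaker to the nearest centroid.
        new_labels = [min(live, key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
    return labels


# Toy 2-D speaker vectors; the third speaker starts in group 0 but sits
# acoustically with group 1.
points = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.0, 10.1), (10.2, 9.9)]
labels = kmeans(points, init_labels=[0, 0, 0, 1, 1], n_clusters=2)
```

In the example the misplaced speaker is reassigned after the first iteration, which is the behaviour that distinguishes cluster-based grouping from grouping purely by native language.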

Hierarchical clustering with various distance measures (min, max, average, mean) may be used instead. In that case, using min, average, or mean as the distance measure tends to produce one large cluster, whereas using max makes the clusters less likely to become sparse.
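The four distance measures are not defined in the text. The sketch below assumes one common reading: min and max are the smallest and largest pairwise distances between two clusters (single and complete linkage), average is the mean of all pairwise distances, and mean is the distance between the clusters' mean vectors. That mapping and all names are assumptions made for illustration.

```python
import math
from itertools import product
from typing import List, Sequence

Vector = Sequence[float]


def centroid(cluster: Sequence[Vector]) -> List[float]:
    """Component-wise mean vector of a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def cluster_distance(a: Sequence[Vector], b: Sequence[Vector],
                     measure: str) -> float:
    """Distance between clusters `a` and `b` under the assumed measures."""
    pairwise = [math.dist(p, q) for p, q in product(a, b)]
    if measure == "min":        # closest pair (single linkage)
        return min(pairwise)
    if measure == "max":        # farthest pair (complete linkage)
        return max(pairwise)
    if measure == "average":    # mean of all pairwise distances
        return sum(pairwise) / len(pairwise)
    if measure == "mean":       # distance between the cluster mean vectors
        return math.dist(centroid(a), centroid(b))
    raise ValueError(f"unknown measure: {measure}")


cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]
distances = {m: cluster_distance(cluster_a, cluster_b, m)
             for m in ("min", "max", "average", "mean")}
```

An agglomerative procedure would repeatedly merge the two clusters with the smallest such distance; only the distance computation is shown here.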

This clustering process clusters the utterance data of the non-native speakers, so that each speaker is assigned to one of the speaker groups 62-1 to 62-n. The acoustic model training unit 64 trains one acoustic model for each speaker group. There are two ways to train the acoustic models: applying MAP adaptation to the SI-AM, or training a monophone AM from scratch using only the non-native utterance data. Either may be used, but experiments have shown that training the AM from scratch for each set of non-native speaker data performs better than applying MAP adaptation to the base AM. In this embodiment, therefore, the acoustic model is trained from scratch for each clustered non-native speaker group.
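The text does not give the MAP update itself. For orientation, the form commonly used for adapting a Gaussian mean is shown below; the symbols, and in particular the prior weight \tau, are assumptions and are not taken from the patent.

```latex
\hat{\mu} \;=\; \frac{\tau\,\mu_{\mathrm{SI}} \;+\; \sum_{t=1}^{T} \gamma_t\, x_t}
                     {\tau \;+\; \sum_{t=1}^{T} \gamma_t}
```

Here \mu_{\mathrm{SI}} is the mean of the Gaussian in the speaker-independent model, x_t is the feature vector of adaptation frame t, \gamma_t is the occupation probability of that Gaussian at frame t, and \tau controls how strongly the prior mean is trusted. With little adaptation data the estimate stays close to \mu_{\mathrm{SI}}; with much data it approaches the sample mean of the adaptation frames.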

Referring to FIG. 3, the speaker group classification unit 80 classifies, based on the acoustic features of the input utterance 36, which speaker group of the group-specific acoustic model group 34 the speaker of the input utterance 36 belongs to. It then selects the acoustic model 82 corresponding to that speaker group from the group-specific acoustic model group 34.

Decoding of the input utterance 36 by the decoding unit 84 using the acoustic model 82 is the same as conventional decoding.

<Operation>
Referring to FIGS. 1 to 3, the speech recognition system 20 according to the first embodiment operates as follows. Its operation is divided into two phases. The first phase is the offline generation of the group-specific acoustic model group 34 by the group-specific acoustic model generation device 32. The second phase is speech recognition of the input utterance 36 by the non-native utterance speech recognition device 38 using the group-specific acoustic model group 34 generated in this way. The phases are described in order below.

In the first phase, the non-native utterance data 30 is collected first. Utterances of the same English sentences are collected from speakers with various native languages, preferably all of the same gender. It is preferable to have several speakers for each native language. The speech data collected from each speaker may nevertheless differ; in that case, it is essential to collect utterances of phonetically balanced sentences from each speaker.

Referring to FIG. 2, the speaker clustering processing unit 60 applies the clustering described above to the non-native utterance data 30 and clusters the speakers into the speaker groups 62-1 to 62-n. For each of the speaker groups 62-1 to 62-n, the acoustic model training unit 64 trains a monophone acoustic model using the utterance data of the speakers in that group, thereby generating the group-specific acoustic model group 34.

Once the group-specific acoustic model group 34 has been generated, speech recognition by the non-native utterance speech recognition device 38 shown in FIG. 1 becomes possible.

Referring to FIG. 3, suppose that an input utterance 36 is given to the non-native utterance speech recognition device 38. Normally, the language group to which the speaker of the input utterance 36 belongs is unknown. Based on the acoustic features of the input utterance 36, the speaker group classification unit 80 estimates which group of the group-specific acoustic model group 34 the utterance belongs to and selects the acoustic model 82 of that group.

The decoding unit 84 decodes the input utterance 36 using this acoustic model 82 and outputs the hypothesis 40.

If the speakers' native languages are known in advance, the speakers may instead be grouped by native language rather than clustered in the eigenspace as described above, and an acoustic model trained on the speech data of the speakers in each group; a similar effect is obtained. An acoustic model trained on speech data classified by the speakers' native language in this way is called an accent-dependent acoustic model. In contrast, an acoustic model trained on speech data clustered using the basis in the eigenspace, as described earlier, is called a cluster-dependent acoustic model.

<Experiment>
An experiment was conducted using the system 20 according to the first embodiment to confirm its effect. HTK (the Hidden Markov Model Toolkit) was used to train all acoustic and language models and to perform decoding. The non-native speakers were 15 speakers each from Japan, China, France, Germany, and Indonesia, 75 in total. In the experiments below, all speakers uttered the same sentences. The training and adaptation data each contain 88 utterances (about 10 minutes), the validation data set contains 10 utterances (about 1 minute), and the test data set contains 23 utterances (about 3 minutes).

First, for comparison, a baseline model was prepared as follows with six native English speakers. The sentences used were the same as those used for the non-native speakers.

-Acoustic model-
More than 60 hours (37,413 utterances) of speech data from the Wall Street Journal (registered trademark) corpus of the LDC (Linguistic Data Consortium) were used to build speaker-independent native English acoustic models. Acoustic models with the following three configurations were built:
(1) 44 monophone HMMs with 3 states and 16 mixture components
(2) a state-clustered biphone model with about 3,000 states and 10 mixture components
(3) a state-clustered cross-word triphone model with about 9,600 states and 12 mixture components
As features for model building, 39 acoustic features were extracted at 10 ms intervals: 12 MFCCs (mel-frequency cepstral coefficients) and energy, together with their first and second derivatives.
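The figure of 39 features per frame is consistent with 12 MFCCs plus one energy term, each taken together with its first and second derivatives. The short check below spells out that arithmetic; the constant names are illustrative only.

```python
N_MFCC = 12          # mel-frequency cepstral coefficients per frame
N_ENERGY = 1         # one energy term per frame
FRAME_SHIFT_MS = 10  # one feature vector every 10 ms

static_dim = N_MFCC + N_ENERGY        # 13 static features
feature_dim = static_dim * 3          # static + first + second derivatives
frames_per_second = 1000 // FRAME_SHIFT_MS

print(feature_dim, frames_per_second)
```

At this frame rate, one minute of speech corresponds to 6,000 feature vectors of 39 dimensions each.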

To examine the accuracy of these three acoustic models, the Hub2 5K evaluation task was run. The accuracy obtained was 80.8% for the monophone model, 86.8% for the biphone model, and 93.6% for the triphone model.

From the speech data, only the male utterances were used to build a speaker-independent baseline acoustic model by MAP adaptation.

-Language model-
N-gram probabilities were estimated from a database of 235 dialogues in the hotel reservation dialogue domain, containing 6,460 utterances (65,839 words). The dictionary contained about 8,800 entries for 7,300 words, including compound words. The perplexity measured on a 344-word evaluation task (two dialogues comprising 23 utterances) was 32.
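For readers unfamiliar with the measure, perplexity is the exponential of the average negative log-probability that the language model assigns to each word of the evaluation text. The sketch below uses the standard definition; the function name and the uniform toy probabilities are assumptions chosen so that the result is easy to interpret, and they are not the probabilities of the model described here.

```python
import math
from typing import Sequence


def perplexity(word_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log word probabilities of an evaluation text."""
    if not word_log_probs:
        raise ValueError("need at least one word probability")
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# Toy case: a model that gives every one of 344 words probability 1/32.
uniform_log_probs = [math.log(1 / 32)] * 344
ppl = perplexity(uniform_log_probs)
```

A perplexity of 32 therefore means that, on average, the model is as uncertain about the next word as if it were choosing uniformly among 32 words.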

-Results-
For evaluation, 75-fold leave-one-out cross-validation was applied to all experiments using speaker-group-dependent models, in order to obtain a realistic assessment of performance. The speaker-group-dependent models contain 42 HMMs with 10 mixture components. Recognition results were examined separately for each speaker group.
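With 75 speakers, a 75-fold leave-one-out scheme presumably holds out one speaker per fold and builds the group models from the remaining 74. The sketch below shows only that splitting logic, under that assumption; the speaker identifiers are invented, and model training and scoring are not shown.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_speaker_out(
        speakers: Sequence[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training speakers, held-out speaker) for every speaker."""
    for i, held_out in enumerate(speakers):
        train = list(speakers[:i]) + list(speakers[i + 1:])
        yield train, held_out


# Hypothetical speaker IDs: 15 speakers for each of 5 native languages.
languages = ["Japanese", "Chinese", "French", "German", "Indonesian"]
speakers = [f"{lang}-{n:02d}" for lang in languages for n in range(1, 16)]
folds = list(leave_one_speaker_out(speakers))
```

Holding the test speaker out of training keeps the evaluation from rewarding models that have already seen that speaker's voice.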

・Speaker-independent models
With the speaker-independent baseline models described above, the results varied greatly depending on the type of acoustic model used and the group to which the speaker belonged. The results are shown in Table 1.

As Table 1 shows, for native English speakers a similar level of accuracy was obtained with the monophone, biphone, and triphone models. For the other speaker groups, however, the interesting result was that the monophone model gave the highest accuracy, and accuracy fell with the biphone model and further with the triphone model. At least for English, this clearly indicates that a monophone acoustic model is the most suitable for non-native speakers. The likely reason is that non-native speakers show wider pronunciation variation than native speakers. Since the wider pronunciation variation of non-native speakers presumably also holds for languages other than English, it can be inferred that a monophone acoustic model is desirable for recognizing non-native speech in general, not only in English.

・話者クラスタリング
いくつかの距離尺度を用いて話者のクラスタリングについても実験を行なった。階層的クラスタリングを用いた場合、最長距離の点ではバランスのとれたクラスタが得られたが、重心距離および平均ベクトル間距離という点ではかなり疎なクラスタとなった。 -Speaker clustering
Speaker clustering was also tested with several distance measures. With hierarchical clustering, the furthest-distance measure produced well-balanced clusters, whereas the centroid distance and the distance between mean vectors produced rather sparse clusters.

また、前述した主成分分析における固有空間の次元ｋが大きくなるとともに、クラスタが疎になる傾向が高くなる。次元が大きくてもよい結果を得られるのは、階層的クラスタリングで距離尺度としてｍａｘを使用した場合と、Ｋ平均を用いた場合とである。なお、主成分分析の結果得られるクラスタがあまりに疎である場合、次元の数ｋを、前述した値より低い値に設定してもよい。 In addition, as the eigenspace dimension k in the principal component analysis described above increases, the clusters tend to become sparser. Good results are obtained even at high dimension when hierarchical clustering is used with the max distance measure and when K-means is used. If the clusters obtained after principal component analysis are too sparse, the number of dimensions k may be set lower than the value given above.
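As a rough illustration of hierarchical clustering with the max distance measure, the sketch below merges clusters by furthest-pair distance (commonly called complete linkage). It assumes the speakers are already represented by low-dimensional vectors; the function name and the toy data are hypothetical and not taken from the specification.

```python
import math
from typing import List, Sequence


def complete_linkage(points: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering; returns clusters as lists of point indices."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]

    def max_dist(a: List[int], b: List[int]) -> float:
        # Cluster distance = largest Euclidean distance between any two members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = max_dist(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    return clusters


# Toy 2-D "speaker vectors" (hypothetical values).
toy_vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (5.0, 50.0)]
toy_clusters = complete_linkage(toy_vectors, 3)
```

Because every merge is judged by the furthest pair of members, this measure tends to avoid long, chained clusters, which is consistent with the balanced clusters reported above.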

クラスタリングの結果を表２に示す。 Table 2 shows the results of clustering.

・最良認識精度
各テスト話者の母語、およびその属するクラスタについての知識を用い、選択したモデルが正しいものと想定した実験（オラクル実験）を行なって、最良の認識精度としてどのような値が得られるかを確認した。その結果を表３に示す。

-Best recognition accuracy
An oracle experiment was carried out in which the model was chosen using knowledge of each test speaker's native language and of the cluster to which the speaker belongs, so that the selected model could be assumed to be correct. This shows the best recognition accuracy that can be obtained. The results are shown in Table 3.

表３から明らかなように、話者依存の音響モデルを用いることにより、単語認識精度が大きく改善する。アクセント依存の音響モデルを用いた場合にも性能は高い。これは、共通の母語を持つ話者についてはアクセントの特徴も共通していることを示唆している。クラスタ依存のモデルを用いた場合にもよい結果が得られるが、アクセント依存のモデルを用いた場合と比較するとやや精度が低くなっている。

As is apparent from Table 3, the word recognition accuracy is greatly improved by using the speaker-dependent acoustic model. The performance is also high when an accent-dependent acoustic model is used. This suggests that speakers with a common mother tongue share the same accent characteristics. Good results are also obtained when using a cluster-dependent model, but the accuracy is slightly lower than when using an accent-dependent model.

以上のようにこの第１の実施の形態に係るシステム２０によれば、非ネイティブ話者ごとに、音声認識のために最適と思われる音響モデルを選択し、その音響モデルを用いて入力発話のデコードを行なう。したがって、話者に依存しない音響モデルを用いた場合と比較して音声認識の精度がより高くなる可能性が高い。出願人において実験したところ、特に日本人による英語の発話に関し、単語認識精度に関して４８％の相対的改善が見られた。 As described above, the system 20 according to the first embodiment selects, for each non-native speaker, the acoustic model considered best suited to speech recognition and decodes the input utterance with that model. Recognition accuracy is therefore likely to be higher than with a speaker-independent acoustic model. In experiments by the applicant, a relative improvement of 48% in word recognition accuracy was observed, particularly for English utterances by Japanese speakers.

［第２の実施の形態］
＜構成＞
上記した第１の実施の形態のシステムでは、話者グループ分類部８０が入力発話３６の属するグループを推定し、そのグループに対応する音響モデル８２を選択してデコードに用いた。しかし本発明はそのような実施の形態には限定されない。例えば、上記した複数のグループ別音響モデルをすべて用いて並列に入力発話に対するデコードを行ない、得られた複数の仮説のうち最も尤度の高いものを選択するようにしてもよい。図４にそのような非ネイティブ発話音声認識装置１００のブロック図を示す。 [Second Embodiment]
<Configuration>
In the system of the first embodiment described above, the speaker group classification unit 80 estimates a group to which the input utterance 36 belongs, and selects an acoustic model 82 corresponding to the group and uses it for decoding. However, the present invention is not limited to such an embodiment. For example, the input utterance may be decoded in parallel using all the plurality of group-specific acoustic models, and the one with the highest likelihood may be selected from the obtained hypotheses. FIG. 4 shows a block diagram of such a non-native utterance speech recognition apparatus 100.

図４を参照して、この非ネイティブ発話音声認識装置１００は、グループ別音響モデル群３４に含まれるグループ別の音響モデルを用い、入力発話３６に対するデコードを並列に行ない、複数の仮説１１２をそれらの尤度とともに出力するためのデコード部１１０と、複数の仮説１１２の中で最も尤度の高い仮説を選択し仮説４０として出力するための仮説選択部１１４とを含む。 Referring to FIG. 4, the non-native utterance speech recognition apparatus 100 includes a decoding unit 110 that decodes the input utterance 36 in parallel using the group-specific acoustic models contained in the group-specific acoustic model group 34 and outputs a plurality of hypotheses 112 together with their likelihoods, and a hypothesis selection unit 114 that selects the hypothesis with the highest likelihood from among the hypotheses 112 and outputs it as hypothesis 40.

音響モデルがｋ個あるものとし、得られるｋ個の仮説の尤度をそれぞれｐ_i（ｘ｜ｗ）（ｉ＝１〜ｋ）（ｘは入力音声の音響特徴ベクトル列、ｗは単語列）とすると、仮説選択部１１４は^~ｗ＝ａｒｇｍａｘ_i=1…kｌｏｇｐ_i（ｘ｜ｗ）となる仮説^~ｗ（本明細書では、符号の直前の「^~」は、直後の符号の直上に記載されるべき記号をあらわすものとする。）を最終候補として選択する。これは以下の理由による。特徴ベクトルシーケンスｘが観測されたときの、各音響モデルから得られる単語列ｗの事後確率がｌｏｇｐ_i（ｗ｜ｘ）であり、これを最大とする単語列^~ｗを求める問題は、次のように定式化される。 Assume that there are k acoustic models and that the likelihoods of the k resulting hypotheses are p_i(x|w) (i = 1 to k), where x is the acoustic feature vector sequence of the input speech and w is a word sequence. The hypothesis selection unit 114 then selects, as the final candidate, the hypothesis ^~w that satisfies ^~w = argmax_{i=1…k} log p_i(x|w). (In this specification, "^~" immediately before a symbol denotes a mark that should be written directly above that symbol.) The reason is as follows. When the feature vector sequence x is observed, the posterior of the word sequence w obtained from each acoustic model is log p_i(w|x), and the problem of finding the word sequence ^~w that maximizes it is formulated as follows.
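The equation itself did not survive in this text. A plausible reconstruction from the surrounding definitions is given below. It assumes Bayes' rule, writes the selected word sequence "^~w" as \tilde{w}, and assumes that the evidence term p(x) does not affect the maximization; none of these details is confirmed by the source.

```latex
\begin{align}
\tilde{w} &= \operatorname*{argmax}_{i=1,\dots,k} \log p_i(w \mid x) \\
          &= \operatorname*{argmax}_{i=1,\dots,k} \log \frac{p_i(x \mid w)\, p_i(w)}{p(x)} \\
          &= \operatorname*{argmax}_{i=1,\dots,k} \bigl[ \log p_i(x \mid w) + \log p_i(w) \bigr]
\end{align}
```

Under these assumptions the last line contains the acoustic likelihood p_i(x|w) and the language model prior p_i(w), which are the two terms the surrounding text refers to.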

最後の式のうちｐ_i（ｗ）は言語モデルにおける事前確率であって、どの音響モデルを用いた場合でも等しい。したがって結局^~ｗとしては、ｌｏｇｐ_i（ｘ｜ｗ）を最大とするようなものが選択される。

In the last equation, p_i(w) is the prior probability given by the language model, and it is the same regardless of which acoustic model is used. Consequently, the ^~w that is selected is the one that maximizes log p_i(x|w).
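The selection rule can be sketched in a few lines. This is an illustration under stated assumptions: the `Hypothesis` type, the model names, and the scores are hypothetical, and the decoder that produces the hypotheses is not shown.

```python
from typing import List, NamedTuple, Sequence


class Hypothesis(NamedTuple):
    model: str             # name of the group-specific acoustic model (hypothetical)
    words: List[str]       # decoded word sequence w
    log_likelihood: float  # log p_i(x|w) reported by the decoder


def select_hypothesis(hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the highest log-likelihood."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    return max(hypotheses, key=lambda h: h.log_likelihood)


# One hypothesis per acoustic model, as produced by parallel decoding.
candidates = [
    Hypothesis("accent_A", ["i", "would", "like", "a", "room"], -1250.0),
    Hypothesis("accent_B", ["i", "would", "like", "a", "room"], -1190.5),
    Hypothesis("accent_C", ["i", "wood", "like", "a", "broom"], -1302.25),
]
best = select_hypothesis(candidates)
```

Comparing log-likelihoods directly is meaningful here only because every decoder shares the same language model, as the preceding paragraph notes.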

＜動作＞
この非ネイティブ発話音声認識装置１００の動作については明らかであるので、ここではその詳細については述べない。なお、仮説全体として最も尤度の高いものを選択する代わりに、仮説を構成する単語ごと、または単語ネットワークの経路ごとに、最も高い尤度を選択することにより仮説４０を生成する、いわゆる仮説統合を行なうようにしてもよい。 <Operation>
Since the operation of the non-native utterance speech recognition apparatus 100 is self-evident, its details are not described here. Note that, instead of selecting the hypothesis with the highest overall likelihood, so-called hypothesis integration may be performed, in which hypothesis 40 is generated by selecting the candidate with the highest likelihood for each word making up the hypotheses, or for each path of the word network.

＜実験＞
この第２の実施の形態に係る非ネイティブ発話音声認識装置１００を用いてアクセント依存モデルおよびクラスタ依存モデルを用いて実験を行なった。その結果を表４に示す。 <Experiment>
Experiments were performed using the accent-dependent model and the cluster-dependent model using the non-native utterance speech recognition apparatus 100 according to the second embodiment. The results are shown in Table 4.

表４に示す精度のうち、アクセント依存モデルを用いた場合の精度は、表３に示す精度と比較してやや落ちている。しかしクラスタ依存モデルを用いた場合には精度の低下はない。さらに、いずれのモデルを用いた場合でも表３に示すベースラインモデルを用いた場合よりよい結果が得られている。また、いずれのモデルを用いた場合も、日本語話者を除きモデルによる精度の差異は有意なものではない。

Of the accuracies shown in Table 4, the accuracies when using the accent-dependent model are slightly lower than the accuracies shown in Table 3. However, there is no decrease in accuracy when a cluster-dependent model is used. Furthermore, a better result is obtained when any model is used than when the baseline model shown in Table 3 is used. Also, regardless of which model is used, the accuracy differences between the models are not significant except for Japanese speakers.

クラスタ分類精度（６４．６％）はアクセント分類精度（５２．５％）よりも高かったが、より多くの話者のデータが利用可能になれば、クラスタ依存モデルを用いた並列デコードの方がアクセント依存モデルを用いたものよりもよい性能を示すのではないかと考えられる。各話者グループに対して得た結果を図５に示す。 The cluster classification accuracy (64.6%) was higher than the accent classification accuracy (52.5%), and if data from more speakers becomes available, parallel decoding with cluster-dependent models may well outperform parallel decoding with accent-dependent models. The results obtained for each speaker group are shown in FIG. 5.

以上のように、この発明の実施の形態によれば、ある言語について非ネイティブの話者による発話データ３０に基づき、グループ別音響モデル群３４が生成される。これらグループ別音響モデル群３４を用い、入力発話３６のうちで最も適切と思われる音響モデルを用いたデコードが行なわれる。または、複数の音響モデルを用いてデコードした結果得られた仮説の中で、最も尤度の高いものが選択される。その結果、当該言語を母語としない、母語の影響を受けた独特のアクセントで当該言語の発話を行なう非ネイティブ話者の発話を高い精度で認識することができる。 As described above, according to the embodiments of the present invention, the group-specific acoustic model group 34 is generated from utterance data 30 of speakers who are non-native in a given language. From this group-specific acoustic model group 34, the acoustic model judged most appropriate for the input utterance 36 is used for decoding. Alternatively, the hypothesis with the highest likelihood is selected from among the hypotheses obtained by decoding with a plurality of acoustic models. As a result, utterances of non-native speakers, who speak the language with a distinctive accent influenced by their native language, can be recognized with high accuracy.

［単一の非ネイティブモデル］
今回考慮した５つのアクセントグループの全話者に対する発音のバリエーションを一つのモノフォン音響モデルを用いて的確に表すことができるかどうかを調べるため、各アクセントグループから１０名、合計５０名の非ネイティブ話者を用いて１６混合分布の非ネイティブモノフォンモデル（ＮＮ）をトレーニングし、その評価を行なった。評価では、残りの２５名の話者を用いて３重クロス検定を行なった。各トレーニングセットおよびテストセットのための話者はランダムに選択した。その際、各話者の母語が均一に分布するように配慮した。 [Single non-native model]
To investigate whether the pronunciation variation of all speakers in the five accent groups considered here can be represented adequately by a single monophone acoustic model, a non-native monophone model (NN) with 16 mixture components was trained on 50 non-native speakers, 10 from each accent group, and then evaluated. The evaluation used 3-fold cross-validation with the remaining 25 speakers as test speakers. The speakers for each training set and test set were selected at random, with care taken that the speakers' native languages were evenly distributed.
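One way to produce such splits is sketched below. It rests on assumptions not stated in the text: 15 speakers per accent group (75 in total), test sets that do not overlap between folds, and hypothetical group names and speaker IDs.

```python
import random
from typing import Dict, List, Tuple


def stratified_folds(
    groups: Dict[str, List[str]], n_folds: int, seed: int = 0
) -> List[Tuple[List[str], List[str]]]:
    """Return (train, test) speaker lists, sampling evenly from every group."""
    rng = random.Random(seed)
    folds: List[Tuple[List[str], List[str]]] = [([], []) for _ in range(n_folds)]
    for speakers in groups.values():
        shuffled = speakers[:]
        rng.shuffle(shuffled)  # random assignment within each accent group
        for f in range(n_folds):
            test = shuffled[f::n_folds]  # every n_folds-th speaker goes to test
            train = [s for s in shuffled if s not in test]
            folds[f][0].extend(train)
            folds[f][1].extend(test)
    return folds


# Hypothetical accent groups with 15 speakers each.
accent_groups = {
    f"accent_{g}": [f"{g}{n:02d}" for n in range(15)] for g in "ABCDE"
}
splits = stratified_folds(accent_groups, n_folds=3)
```

With these assumptions each fold trains on 50 speakers (10 per group) and tests on the remaining 25 (5 per group), matching the counts given above.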

このようにして作成した話者独立な非ネイティブモノフォンモデルを用いた音声認識の結果を、テスト話者の母語別に表５に示す。 The results of speech recognition using the speaker-independent non-native monophone model created in this way are shown in Table 5 for each test speaker's mother tongue.

表５より、表４に示すアクセント依存モデルまたはクラスタ依存モデルを用いた並列デコーディングを用いた結果に匹敵する結果が得られることが分かる。ただし、表３に示した結果のうち、アクセント依存モデルを用いた結果と比較すると、表３の方が高い。したがって、もしもアクセントによる話者の分類が高精度でできるのであれば、対応するアクセント依存モデルを用いて音声認識を行なうことが原則としては望ましいといえる。

Table 5 shows that the results are comparable to those of parallel decoding with the accent-dependent or cluster-dependent models shown in Table 4. However, the accuracy obtained with the accent-dependent models in Table 3 is higher. Therefore, if speakers can be classified by accent with high accuracy, it is in principle desirable to perform speech recognition with the corresponding accent-dependent model.

この話者独立な非ネイティブモノフォンモデルを用いると、その精度に限界がある。しかし、非ネイティブ話者の音声コーパスが利用できない場合には、文脈依存の頑健な音響モデルをトレーニングにより得ることは困難である。各アクセントグループ内での非ネイティブ話者の発音のバリエーションがほぼ一致した傾向を示すことを仮定すれば、アクセントおよび文脈依存のモデルを用いることで、より精度を高めることが可能と思われる。 When this speaker-independent non-native monophone model is used, its accuracy is limited. However, when a speech corpus of non-native speakers is not available, it is difficult to obtain a robust context-dependent acoustic model by training. Assuming that the pronunciation variations of non-native speakers within each accent group tend to be in close agreement, it may be possible to increase accuracy by using accent and context dependent models.

なお、上記した実施の形態では、英語発話に対する非ネイティブ話者の発話を音声認識する実施の形態を例にした。しかし本発明はそのような実施の形態に限定されない。任意の言語に対して、上記した非ネイティブ話者の音声認識を行なうようにしてもよい。 The embodiments described above take as an example the recognition of English utterances by non-native speakers. However, the present invention is not limited to such embodiments. The speech recognition for non-native speakers described above may be applied to any language.

今回開示された実施の形態は単に例示であって、本発明が上記した実施の形態のみに制限されるわけではない。本発明の範囲は、発明の詳細な説明の記載を参酌した上で、特許請求の範囲の各請求項によって示され、そこに記載された文言と均等の意味および範囲内でのすべての変更を含む。 The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

本発明の第１の実施の形態に係る音声認識システム２０のブロック図である。 FIG. 1 is a block diagram of a speech recognition system 20 according to a first embodiment of the present invention.
図１に示すグループ別音響モデル生成装置３２のブロック図である。 FIG. 2 is a block diagram of the group-specific acoustic model generation device 32 shown in FIG. 1.
図１に示す非ネイティブ発話音声認識装置３８のブロック図である。 FIG. 3 is a block diagram of the non-native utterance speech recognition apparatus 38 shown in FIG. 1.
本発明の第２の実施の形態に係る非ネイティブ発話音声認識装置１００のブロック図である。 FIG. 4 is a block diagram of the non-native utterance speech recognition apparatus 100 according to a second embodiment of the present invention.
各モデルを用いて５通りの方式で得た単語認識精度を話者グループごとに示すグラフである。 FIG. 5 is a graph showing, for each speaker group, the word recognition accuracy obtained by five methods using the respective models.

Explanation of symbols

２０音声認識システム、３０非ネイティブ話者発話データ、３２グループ別音響モデル生成装置、３４グループ別音響モデル群、３６入力発話、３８，１００非ネイティブ発話音声認識装置、４０，１１２仮説、６０発話者クラスタリング処理部、６４音響モデルトレーニング部、８０話者グループ分類部、８２音響モデル、８４，１１０デコード部、１１４仮説選択部 20 speech recognition system, 30 non-native speaker utterance data, 32 group acoustic model generation device, 34 group acoustic model group, 36 input utterance, 38,100 non-native utterance speech recognition device, 40, 112 hypothesis, 60 speaker Clustering processing unit, 64 acoustic model training unit, 80 speaker group classification unit, 82 acoustic model, 84, 110 decoding unit, 114 hypothesis selection unit

Claims

An acoustic model generation device comprising:
voice data preparation means for preparing voice data of a plurality of speakers, each classified in advance into one of a plurality of groups according to predetermined voice characteristics; and
acoustic model group generation means for generating an acoustic model for each group based on the voice data prepared by the voice data preparation means.

The acoustic model generation device according to claim 1, wherein the voice data preparation means includes speaker clustering means for clustering the plurality of speakers into a plurality of groups based on utterance data of the plurality of speakers, and
the acoustic model group generation means includes means for creating an acoustic model for each of the plurality of groups obtained by the speaker clustering means, based on the utterance data of the speakers belonging to that group.

The acoustic model generation device according to claim 2, wherein the speaker clustering means includes:
speaker-dependent acoustic model generation means for generating a speaker-dependent acoustic model based on the utterance data of each of the plurality of speakers;
representative vector generation means for generating a representative vector of each of the plurality of speakers based on the acoustic model;
means for converting the plurality of representative vectors obtained for the plurality of speakers into a plurality of lower-order representative vectors by performing principal component analysis on them; and
clustering means for clustering the plurality of speakers into a plurality of groups by executing a predetermined clustering process on the plurality of lower-order representative vectors.

A speech recognition apparatus that performs speech recognition on an input utterance using a plurality of acoustic models, wherein the plurality of acoustic models are generated from utterance data having mutually different acoustic characteristics, the apparatus comprising:
acoustic model selection means for selecting, from among the plurality of acoustic models and based on the acoustic features of the input utterance, an acoustic model that matches the acoustic features of the input utterance; and
speech recognition means for performing speech recognition on the input utterance using the acoustic model selected by the acoustic model selection means.

A speech recognition apparatus that performs speech recognition on an input utterance using a plurality of acoustic models, wherein the plurality of acoustic models are generated from utterance data having mutually different acoustic characteristics, the apparatus comprising:
speech recognition means for performing speech recognition on the input utterance using each of the plurality of acoustic models and outputting a plurality of hypotheses; and
hypothesis output means for outputting one hypothesis based on the plurality of hypotheses output by the speech recognition means.