JPH04347898A

JPH04347898A - Voice recognizing method

Info

Publication number: JPH04347898A
Application number: JP3120320A
Authority: JP
Inventors: Mizuhiro Hida; 飛田　瑞広
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1991-05-24
Filing date: 1991-05-24
Publication date: 1992-12-03

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed description of the invention]

[0001]

[Industrial Field of Application] The present invention relates to a speech recognition method suitable for recognizing speech uttered in a place with high ambient noise.

[0002]

[Prior Art] As a means of conveying information, speech is generally superior to manually operated devices such as typewriters and push buttons in terms of transmission speed and ease of operation. When speech is used to convey information to a machine, a speech recognition device is needed to correctly recognize the content of the utterance.

[0003] However, speech recognition devices are used not only in quiet rooms but also under noisy environmental conditions that degrade recognition performance. The usual approach to improving recognition performance is therefore to remove the influence of noise as far as possible and perform recognition on the cleaned speech. Techniques proposed for this purpose include using a sharply directional microphone, or configuring a two-input system from two directional microphones arranged with their directivity axes in opposite directions and removing the noise by the subtraction method.

[0004] However, even with the above techniques it is difficult to remove only the noise, with necessary and sufficient characteristics, from speech picked up in a place with a high noise level, so obtaining recognition performance sufficient for practical use remains very difficult. On the other hand, when recognizing speech with superimposed noise, it is known that recognition using standard patterns created from speech with some noise added gives better recognition performance than using noise-free speech for the standard patterns. An experiment was therefore first conducted to confirm the effect on recognition of using noise-added speech for the standard patterns, which is the premise on which the improvement in recognition performance of this invention rests. In the experiment, a word set consisting of 100 city names was uttered twice at each of three noise levels (high, medium, and low). FIG. 3 compares two cases. In the first (solid line in the figure), standard speech patterns were created from one utterance of the word set, and the other utterance made at the same noise level was recognized as test speech by DP matching. In the second (dotted line in the figure), standard speech patterns were created from the word set uttered at the low noise level, and the word sets uttered under each noise condition were recognized as test speech.

[0005] The figure shows that when the standard speech patterns are created from one of two utterances made at the same noise level and the other utterance is recognized as test speech, the recognition rate is close to 100% even at the high noise level. In contrast, when the standard speech patterns are created from speech uttered at the low noise level and speech uttered at the medium and high noise levels is recognized, the recognition rate falls as the noise level rises, dropping to about 30% for speech uttered at the high noise level.

[0006] This recognition experiment confirmed that, when recognizing speech with superimposed noise, using standard speech patterns created from speech uttered under the same noise conditions is effective in improving the recognition rate. To achieve this, however, speech for creating the standard patterns would have to be recorded in the actual environment every time the recognition device is used. That is, whenever the environmental conditions of the room change, or the type or level of noise changes even within the same room, the speech for creating the standard patterns would have to be uttered again. Using a recognition device under these conditions requires a great deal of effort and time and cannot be called a practical method.

[0007]

[Problems to be Solved by the Invention] In view of the above, the present invention aims to provide a speech recognition method that, when recognizing speech uttered under various noise conditions, achieves recognition performance sufficient for practical use without requiring utterances for creating standard speech patterns every time the recognition device is used.

[0008]

[Means for Solving the Problems] According to the present invention, the speech used to create the standard speech patterns is clean speech uttered at several utterance levels in an anechoic room, soundproof room, or similar place free of noise and reverberation. Only the speech portions are cut out, and the resulting word set sequences are stored in advance in a buffer in the speech recognition device as reference speech.

[0009] In the actual use environment, the noise data picked up immediately before the test speech uttered as a word to be recognized and its time-average level Pn are obtained, together with the noise-superimposed test speech to be recognized and its time-average level Pt, and the ratio Pt/Pn is obtained. The noise data picked up in the actual use environment at the time of the test utterance is then added to the reference speech uttered in advance at several levels, with the level of the reference speech adjusted so that the resulting ratio equals Pt/Pn. This addition is performed for all the reference speech uttered at each level. Standard templates for pattern matching are created from all the resulting standard speech (each reference speech plus the noise data), and the test speech uttered in noise is recognized.

[0010]

[Embodiments] An embodiment of the present invention is described in detail below. FIG. 1 shows one embodiment. The test speech data collection unit 11 collects speech data to be recognized, uttered in a noisy actual use environment and therefore carrying superimposed noise. In this test speech data, noise data with no speech superimposed appears alternately with speech data along the time axis. The speech section detection unit 12 identifies and detects the speech sections and noise sections in this test speech data. Speech sections are detected by, for example, a method based on the difference in acoustic power level between speech sections and noise sections, using as a cue the length of time for which the level stays at or above a set threshold. The test speech Si detected as speech by the speech section detection unit 12, the noise data Ni detected as noise, the acoustic power level Pt of the test speech Si, the acoustic power level Pn of the noise data Ni, and their ratio Pt/Pn are stored in the test data storage unit 13.
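The power-threshold detection and the levels Pt and Pn described above can be illustrated with a minimal Python sketch. It is not the detector of the embodiment: the function names, the use of non-overlapping frames, and the `threshold` and `min_frames` parameters are assumptions introduced here for illustration only.

```python
from typing import List, Tuple


def frame_powers(samples: List[float], frame_len: int) -> List[float]:
    """Mean-square power of each complete, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech(powers: List[float], threshold: float,
                  min_frames: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose power stays at or above
    `threshold` for at least `min_frames` consecutive frames."""
    sections: List[Tuple[int, int]] = []
    start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for k, p in enumerate(powers + [float("-inf")]):
        if p >= threshold and start is None:
            start = k
        elif p < threshold and start is not None:
            if k - start >= min_frames:
                sections.append((start, k))
            start = None
    return sections


def mean_power(powers: List[float], lo: int, hi: int) -> float:
    """Time-average power over frames lo..hi-1 (used for both Pt and Pn)."""
    return sum(powers[lo:hi]) / (hi - lo)
```

With a detected section `(a, b)`, `mean_power(powers, a, b)` plays the role of Pt and the mean power of the frames just before `a` plays the role of Pn; their quotient is the ratio Pt/Pn that is stored with the test data.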

[0011] Meanwhile, the clean speech data collection unit 14, used for creating the standard speech patterns, picks up in advance speech uttered at several utterance levels (for example, three levels: loud, medium, and soft) in an environment free of echo, reverberation, and noise. Recording at several utterance levels is necessary for the following reason. In FIG. 3, recognition performance for speech uttered at the high noise level is degraded when the standard patterns are created from speech uttered at the low noise level (the dotted-line characteristic in FIG. 3). One cause is the difference in spectral characteristics from clean speech that results from noise being superimposed on the speech. Another cause, considered to have a large influence, is the utterance deformation (Lombard effect) that arises because the utterance level of a speaker in a noisy environment varies with the noise level. Speech exhibiting utterance deformation, covering the case where the noise level changes over several steps, is therefore needed when creating the standard patterns.

[0012] It is also possible in principle to obtain deformed speech for creating the standard speech patterns by applying to normally uttered clean speech the amounts of change in physical feature components caused by utterance deformation, such as the amount of formant frequency shift. If the characteristics of speech obtained in this way are sufficiently similar to speech actually uttered at the several levels described above, the speech for creating the standard patterns needs to be uttered only once at a normal utterance level, which would make this a highly effective method.

[0013] The speech sections of the clean speech data with utterance deformation, obtained by either of these methods, are detected by the speech section detection unit 15. The detected speech data is stored in the reference speech storage unit 16, separately for each utterance level, as reference speech Csij together with its word utterance duration Tsij. In Csij and Tsij, the subscript s indicates reference speech for creating standard patterns, i indicates the utterance level (i = 1 to m, m ≥ 2), and j indicates the recognition word number (j = 1 to n). Such reference speech is stored in advance.

[0014] At recognition time, each reference speech Csij passes through the level control unit 17 and, in the adder 18, has the noise data Ni stored in the test data storage unit 13 superimposed on it. The duration T1 of the noise data Ni may be either longer or shorter than the duration Tsij of the clean speech data, and the noise is superimposed as follows according to which case applies.

[0015] This is explained with reference to FIG. 2. FIG. 2A shows how the level of the test speech data changes over time: the noise data Ni has duration T1 and time-average power level Pn, and is followed by the test speech Si to be recognized, whose utterance duration is T2 and whose time-average power level, including the noise, is Pt. FIGS. 2B and 2C show how the above noise data Ni (duration T1) is added to the reference speech Csij in the adder 18. Psij denotes the average power level over the duration Tsij of the speech section when noise with the above time-average level Pn and duration T1 is added to the reference speech Csij.

[0016] When superimposing the noise there are two cases: (1) T1 > Tsij and (2) T1 ≤ Tsij. In case (1), as shown in FIG. 2B, the noise Ni is superimposed so that the duration Tsij of the reference speech (shown in the figure for j = 1 and 2) fits within the noise duration T1. The time Tp from the start of the noise Ni to the start of the reference speech Csij may be set to any value that allows the speech section to be cut out reliably in the speech section detection performed later. The calculation unit 19 then calculates the value of Psij/Pn, and the level comparison unit 21 compares it with the value of Pt/Pn stored in the test data storage unit 13. The level control unit 17 adjusts the power level Psij of the noise-superimposed reference speech so that the two become equal, that is, so that Psij = Pt.
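For the case T1 > Tsij, the level adjustment can be sketched as follows. The embodiment describes a loop of calculation (unit 19), comparison (unit 21), and level control (unit 17); the sketch below instead solves for the speech gain in closed form, which is an assumption made here for brevity. Writing the mixture power over the speech span as g²·Ps + 2g·C + Pn' (Ps the reference speech power, Pn' the noise power over that span, C their mean cross product) and setting it equal to Pt gives a quadratic in the gain g. The function name and arguments are hypothetical.

```python
import math
from typing import List, Tuple


def mix_with_matched_level(speech: List[float], noise: List[float],
                           offset: int,
                           target_power: float) -> Tuple[List[float], float]:
    """Overlay `speech` on `noise` starting at sample `offset`, scaling the
    speech so that the mean power of the mixture over the speech span equals
    `target_power` (the role of Pt). Returns (mixture, gain)."""
    n = len(speech)
    if n == 0 or offset < 0 or offset + n > len(noise):
        raise ValueError("speech must be non-empty and fit inside the noise")
    seg = noise[offset:offset + n]
    ps = sum(s * s for s in speech) / n                      # speech power
    pn = sum(v * v for v in seg) / n                         # noise power in span
    cross = sum(s * v for s, v in zip(speech, seg)) / n      # cross term
    if ps == 0.0 or target_power <= pn:
        raise ValueError("target power must exceed the noise power")
    # Positive root of ps*g^2 + 2*cross*g + (pn - target_power) = 0.
    gain = (-cross + math.sqrt(cross * cross + ps * (target_power - pn))) / ps
    mixed = list(noise)
    for k, s in enumerate(speech):
        mixed[offset + k] += gain * s
    return mixed, gain
```

Here `offset` corresponds to the lead time Tp, and the noise-only samples before and after the speech are left unchanged so that a later speech section detection still sees a noise floor at level Pn.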

[0017] In case (2), as shown in FIG. 2C, the speech and noise must be superimposed so that the noise covers the duration Tsij of the reference speech (shown in the figure for j = 1) and noise-only sections exist before and after it. For example, if the noise duration T1 is only about 1/3 of the reference speech duration Tsi1, the same noise data of duration T1 is repeated four to five times and superimposed on the reference speech signal. As in case (1), the level control unit 17 adjusts the level Psij so that the value of Psij/Pn equals the value of Pt/Pn stored in the test data storage unit 13, that is, so that Psij = Pt.
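For the case T1 ≤ Tsij, the repetition of the noise data can be sketched as below. The function name and the `margin` parameter (the length of the noise-only section wanted on each side of the speech) are assumptions for illustration; the source only requires that noise-only sections exist before and after the speech.

```python
import math
from typing import List, Tuple


def tile_noise(noise: List[float], speech_len: int,
               margin: int) -> Tuple[List[float], int]:
    """Repeat `noise` until it covers the speech plus a noise-only margin on
    both sides. Returns (tiled noise trimmed to the needed length, repeats)."""
    if not noise:
        raise ValueError("noise must not be empty")
    needed = speech_len + 2 * margin
    repeats = math.ceil(needed / len(noise))
    return (noise * repeats)[:needed], repeats
```

When the noise is one third as long as the speech, any non-zero margin pushes the repeat count above three, which is consistent with the four to five repetitions mentioned in the text. Plain repetition can leave discontinuities at the joins; whether that matters for the later analysis is not addressed in the source.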

[0018] The standard speech data obtained by adjusting the reference speech level in this way and superimposing the noise data Ni on it is stored in the buffer 22. The switch 23, open until this point, is then closed, and the standard speech data is input to the speech section detection unit 24. The algorithm and threshold of the speech section detection unit 24 are set identically to those of the speech section detection unit 12 used for the test speech data, and the speech sections are detected under these settings. As a result, even if some speech section detection errors occur, they can be expected to occur in the same way for both the test speech and the standard speech, which contributes to improved recognition performance.

[0019] The speech signal whose speech sections have been detected by the speech section detection unit 24 undergoes speech spectrum analysis and standard pattern creation in the standard pattern creation unit 25. In this way, standard patterns are created from the noise-superimposed versions of the reference speech Csij for every recognition word at every utterance level. The pattern matching unit 27 matches all of these standard patterns against the test speech pattern created by the test pattern creation unit 26. A distance calculation is performed here to obtain the similarity between words, and the standard speech patterns are output to the recognition result output unit 28 as the recognition result for the test speech, in descending order of similarity to the test speech. Where necessary, the lower-ranked candidates are output along with the first-ranked result. This completes the sequence of operations of the recognition device.
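The matching and ranking step can be illustrated with a generic dynamic-programming (DTW) match. The source names DP matching in its experiment but gives no path constraints, local distance, or feature type, so the symmetric step pattern, the Euclidean frame distance, and the template keys `(word, level)` used below are all assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]


def dtw_distance(a: List[Frame], b: List[Frame]) -> float:
    """Accumulated Euclidean distance along the best warping path."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)          # row 0 of the cost table
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cur[j] = d + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def rank_templates(test: List[Frame],
                   templates: Dict[Tuple[str, int], List[Frame]],
                   top_n: int = 3) -> List[Tuple[Tuple[str, int], float]]:
    """Rank (word, level) templates by DTW distance, smallest first."""
    scored = sorted((dtw_distance(test, t), key) for key, t in templates.items())
    return [(key, dist) for dist, key in scored[:top_n]]
```

Because templates exist for every word at every utterance level, the same word can appear more than once in the ranking; taking the word of the first entry gives the first-ranked result, and the remaining entries supply the lower-ranked candidates.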

[0020] Although the test speech to be recognized has been described here as word speech, the configuration and technique of the present invention do not limit the speech to be recognized to word speech.

[0021]

[Effects of the Invention] As described above, the speech recognition method according to the present invention performs recognition processing that adapts to changes in the environmental conditions under which the recognition device is used, and can therefore secure highly accurate recognition performance. Speakers often behave unconsciously as follows. When the distance between speaker and microphone in actual use is short, the utterance level is low even if the noise level is somewhat high, so the amount of utterance deformation is small but the S/N drops. Conversely, when the distance between speaker and microphone is large, the speaker talks loudly not only when the noise level is high but even when it is low. To deal with this, several sets of reference speech with different amounts of utterance deformation are prepared in advance. The noise occurring in the actual use environment is superimposed on this reference speech, with the S/N at that time set equal to the S/N of the test speech. In addition, the speech sections used to create standard patterns from the reference speech are detected with the same algorithm and threshold as for the test speech. As a result, the test speech and the standard speech have almost the same characteristics as if they had been uttered under the same environmental conditions. Consequently, a speech recognition device that secures recognition accuracy sufficient for practical use can be realized while imposing only the minimum necessary burden on the user.

[Brief explanation of drawings]

[FIG. 1] A block diagram showing an example of the functional configuration of a speech recognition device to which the present invention is applied.

[FIG. 2] A diagram for explaining the method of superimposing noise data on clean speech data.

[FIG. 3] A diagram showing the relationship between noise level and recognition rate.

Claims

[Claims]

[Claim 1] A speech recognition method in which: reference speech, obtained by detecting only the speech sections from clean speech data in which the word set of the vocabulary to be recognized is uttered at a plurality of utterance levels under environmental conditions with little echo and reverberation and no noise, is stored together with its time-average level; a speech section and a noise section are detected from input test speech data to obtain the test speech, the noise data, the time-average level Pt of the test speech, the time-average level Pn of the noise data, and their ratio Pt/Pn; for every reference speech, the level of the reference speech is controlled so as to equal the ratio Pt/Pn and the reference speech is added to the noise data to create noise-superimposed speech; speech sections are detected from the noise-superimposed speech; a standard speech pattern is created for each word from these speech sections for each of the utterance levels; and the test speech is recognized by pattern matching between the standard speech patterns and the test speech.

[Claim 2] The speech recognition method according to claim 1, wherein the detection of speech sections from the test speech data and the detection of speech sections from the noise-superimposed speech are performed using the same algorithm and threshold.