JP2000163098A

JP2000163098A - Voice recognition device

Info

Publication number: JP2000163098A
Application number: JP10334201A
Authority: JP
Inventors: Eikichi Taira; 栄吉平良
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1998-11-25
Filing date: 1998-11-25
Publication date: 2000-06-16

Abstract

(57) [Abstract]
[Problem] Conventional voice recognition devices cannot correctly detect voice sections in high-noise environments, which lowers the recognition rate. They also require the user to operate a trigger button or the like when speaking, which is cumbersome.
[Solution] Noise components mixed into the input are removed using a noise spectrum estimated from non-voice sections, and voice sections are detected using only the power in the voice band as the voice power. In addition, the final recognition result is output based on the likelihood difference between the top-ranked word and the next candidate word in the likelihood calculation.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] This invention relates to a voice recognition device used to operate electronic equipment in environments with high noise and strong jolting, such as in vehicles used mainly on rough terrain, for example vehicles used at construction sites or in disaster areas, and defense vehicles.

[0002]

[Prior art] Vehicles used mainly on rough terrain, such as vehicles used at construction sites or in disaster areas and defense vehicles, generally have loud engine noise and other noise, and their interiors are less quiet than those of ordinary passenger cars. Use on rough terrain such as disaster areas and training grounds also involves strong jolting, and occupants often do not have both hands free because they need them to brace themselves. In such situations, operating equipment that has buttons, switches, and the like can be difficult. In recent years, car navigation systems have spread rapidly with advances in car electronics, and there is demand to install systems such as car navigation systems and information display devices in these vehicles as well. However, as noted above, complicated manual operation of buttons and switches is difficult in the harsh operating environments of vehicles dispatched to disaster areas, defense vehicles, and the like. It would therefore be convenient if various electronic devices could be operated by voice, without using the hands. Accordingly, there is demand for a voice recognition device that can also be used in vehicles operated in harsh environments with high noise and strong jolting. FIG. 10 shows how a voice recognition device is installed in an ordinary vehicle: 30 is the voice recognition device, 40 is a controller for a car navigation system or the like operated through the voice recognition device, and 50 is a monitor for the car navigation system or the like. FIG. 9 shows the voice recognition device 30 in detail. In FIG. 9, 1a is a microphone for voice input; 1b is a microphone that picks up environmental noise; 11 is a trigger button operated when voice is input; 2 is acoustic analysis means that calculates the input power and the like from the input signal; 4 is noise spectrum estimation means that estimates a noise spectrum from noise data; 5 is noise removal means that removes noise components from the input signal using the estimated noise spectrum; 3 is voice section detection means that detects a voice section using the input power from which noise components have been removed; 6 is word feature vector extraction means that extracts feature vectors from the noise-removed voice data; 7 is a registered word dictionary that stores the feature vectors of the registered words to be recognized; and 8 is voice recognition means that compares the extracted feature vectors with the registered words, calculates their likelihoods, and outputs the top-ranked word as the recognition result.

[0003] Conventionally, in voice recognition under high noise, the user presses the trigger button 11 when speaking so that the input voice can be distinguished from environmental noise. At this time, microphone 1a acquires the voice and microphone 1b acquires the noise, and the acoustic analysis means 2 calculates the input power, power spectrum, and so on for each. The noise spectrum estimation means 4 estimates the noise spectrum by predicting the power of each frequency component of the power spectrum. The noise removal means 5 uses the estimated noise spectrum to remove noise components from the power spectrum of the input signal acquired by microphone 1a, yielding the voice spectrum, and sums the power over the voice spectrum to obtain the voice power. The voice section detection means 3 detects the voice section using the calculated voice power. The word feature vector extraction means 6 extracts word feature vectors from the data of the detected voice section. The registered word dictionary 7 stores the feature vectors of the registered words, and once the voice section has been determined, the voice recognition means 8 calculates the likelihood between the extracted word feature vectors and the feature vectors of the registered words. The word with the highest calculated likelihood is output as the recognition result.

[0004]

[Problem to be solved by the invention] To remove noise components, a conventional voice recognition device requires two microphones: one that picks up only environmental noise and one that picks up the voice together with environmental noise. If the two microphones cannot pick up the environmental noise in phase, the processing instead adds noise to the voice signal and the recognition rate drops sharply, so microphone placement required great care. In addition, under noisy conditions the user had to operate a trigger button or the like when speaking so that voice could be distinguished from noise, which made operation cumbersome.

[0005] This invention was made to solve these problems, and its object is to obtain a voice recognition device capable of stable recognition even under noise.

[0006]

[Means for solving the problem] The voice recognition device according to the first invention comprises: a microphone for voice input; acoustic analysis means that calculates a power spectrum from the input signal and outputs the total power of the voice band; voice section detection means that detects a voice section using the voice band power; noise spectrum estimation means that estimates a noise spectrum from the data of non-voice sections; noise removal means that removes noise components from the data of the voice section; word feature vector extraction means that extracts feature vectors from the noise-removed voice data; a registered word dictionary that stores the feature vectors of the registered words to be recognized; and voice recognition means that compares the extracted feature vectors with the registered words, calculates their likelihoods, and outputs the top-ranked word as the recognition result when the likelihood difference between the top-ranked word and the next candidate is equal to or greater than a set threshold.

[0007] The voice recognition device according to the second invention comprises: a microphone for voice input; acoustic analysis means that calculates a power spectrum from the input signal and outputs the total power of the voice band; voice section detection means that detects a voice section using the voice band power; noise spectrum estimation means that estimates a noise spectrum from the data of non-voice sections; noise removal means that removes noise components from the data of the voice section; word feature vector extraction means that extracts feature vectors from the noise-removed voice data; a registered word dictionary that stores the feature vectors of the registered words to be recognized; voice recognition means that compares the extracted feature vectors with the registered word dictionary, calculates their likelihoods, and outputs the top several words as the recognition result; and keyword determination means that determines the keyword of a voice switch from the sequence of the output top several words.

[0008] The voice recognition device according to the third invention is the voice recognition device of the first or second invention further provided with SNR determination means that controls the recognition processing according to the SN ratio obtained from the average power of the voice section frames detected by the voice section detection means and the average power of the noise frames.

[0009] The voice recognition device according to the fourth invention is the voice recognition device of the first or second invention provided with means that, as the power used to detect the voice section, calculates the main frequency from the estimated noise spectrum and detects the voice section using the total power of the input data's power spectrum excluding the vicinity of the main frequency component of the noise.

[0010] The voice recognition device according to the fifth invention is the voice recognition device of the first or second invention provided with means that detects the voice section using the total voice band power of the power spectrum obtained from the difference between the power spectrum of the input data and the estimated power spectrum of the noise.

[0011]

[Embodiments of the invention] Embodiment 1. FIG. 1 shows Embodiment 1 of this invention. 1 is a microphone for voice input; 2 is acoustic analysis means that calculates a power spectrum from the input signal and outputs the total power of the voice band; 3 is voice section detection means that detects a voice section using the voice band power; 4 is noise spectrum estimation means that estimates a noise spectrum from the data of non-voice sections; 5 is noise removal means that removes noise components from the data of the voice section; 6 is word feature vector extraction means that extracts feature vectors from the noise-removed voice data; 7 is a registered word dictionary that stores the feature vectors of the registered words to be recognized; and 8 is voice recognition means that compares the extracted feature vectors with the registered words, calculates their likelihoods, and outputs the top-ranked word as the recognition result. FIG. 2 shows how the voice recognition device is installed: 1 is a microphone mounted on a headset, 30 is the voice recognition device according to this invention, 40 is a controller for a car navigation system or the like operated through the voice recognition device, and 50 is a monitor for the car navigation system or the like.

[0012] FIG. 3 shows the processing flow of the acoustic analysis means 2. Preprocessing 21 applies windowing and similar processing to the input signal, and power spectrum calculation 22 calculates the power spectrum of the input signal. The power of the calculated power spectrum is then summed over the voice band, which is divided into low, middle, and high sub-bands, to obtain the total voice band power.
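The flow of FIG. 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: the periodic Hann window, the naive DFT, the sub-band edges in `VOICE_BANDS`, and all function names are hypothetical choices, since the text specifies neither the window type nor the band limits.

```python
import cmath
import math


def hann_window(n):
    """Periodic Hann window (an assumed choice for preprocessing 21)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def power_spectrum(frame):
    """Power spectrum of one windowed frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hann_window(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def voice_band_power(spectrum, sample_rate, bands):
    """Sum the power inside each (low_hz, high_hz) sub-band.

    Returns (per-band powers, total voice band power).
    """
    n_fft = 2 * (len(spectrum) - 1)
    hz_per_bin = sample_rate / n_fft
    per_band = []
    for low_hz, high_hz in bands:
        per_band.append(sum(p for k, p in enumerate(spectrum)
                            if low_hz <= k * hz_per_bin < high_hz))
    return per_band, sum(per_band)


# Hypothetical low / middle / high sub-band edges in Hz (not given in the text).
VOICE_BANDS = [(300, 1000), (1000, 2500), (2500, 3400)]
SAMPLE_RATE = 8000
N = 64

# A 1000 Hz tone lies inside the voice band; a 125 Hz hum lies below it.
tone_in_band = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]
tone_below = [math.sin(2 * math.pi * 125 * t / SAMPLE_RATE) for t in range(N)]
_, p_in = voice_band_power(power_spectrum(tone_in_band), SAMPLE_RATE, VOICE_BANDS)
_, p_out = voice_band_power(power_spectrum(tone_below), SAMPLE_RATE, VOICE_BANDS)
print(p_in > p_out)
```

In this sketch the low-frequency hum contributes almost nothing to the voice band power, which is the property the voice section detection relies on.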

[0013] Because the acoustic analysis means 2 sums only the power in the voice band of the input signal, noise power is suppressed and the voice section becomes easier to detect even under high noise. FIG. 4 illustrates how the voice section detection means 3 determines the voice section, using the utterance "Sapporo" as an example. The voice band power calculated by the acoustic analysis means 2 is supplied to the voice section detection means 3 as the input power. In sections judged to be non-voice, the noise level is updated successively, and when the input power exceeds threshold 1 a provisional start point T1 is detected. Once T1 is detected, the noise level is fixed at its most recently updated value. If the input power then exceeds threshold 2 within several frames after T1, T1 is taken as the start point Ts. After the start point Ts has been detected, a provisional end point T2 is detected when the input power falls below threshold 3 and then below threshold 4. If the input power does not exceed threshold 4 for a fixed time (Tp) after T2, T2 is taken as the end point Te and the voice section is determined. If the input power exceeds threshold 4 again within Tp, the provisional end point T2 is reset and a provisional end point is detected again.
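The start and end point logic of FIG. 4 can be read as a small state machine. The sketch below is one possible interpretation, not the patented implementation: modeling thresholds 1 to 4 as multiples `k1`..`k4` of the frozen noise level, the smoothing factor `alpha`, the frame counts, and the rule that an unconfirmed T1 is dropped are all assumptions, because the text does not say how the thresholds are derived or what happens when T1 is not confirmed.

```python
def detect_voice_section(powers, k1=2.0, k2=4.0, k3=3.0, k4=2.0,
                         confirm_frames=3, tp_frames=4, alpha=0.9):
    """Return (Ts, Te) frame indices of the first voice section, or None.

    Thresholds 1-4 are modeled as noise_level * k1..k4 (an assumption).
    """
    noise = powers[0]        # initial noise level: first frame, assumed non-voice
    state = "NOISE"
    t1 = ts = t2 = None
    below_th3 = False
    for t, p in enumerate(powers):
        if state == "NOISE":
            if p > noise * k1:
                # Provisional start T1; the noise level stays frozen from here on.
                t1, state = t, "MAYBE_START"
            else:
                # Successive noise-level update in non-voice frames.
                noise = alpha * noise + (1 - alpha) * p
        elif state == "MAYBE_START":
            if p > noise * k2:
                ts, state, below_th3 = t1, "VOICE", False   # T1 becomes Ts
            elif t - t1 >= confirm_frames:
                state = "NOISE"       # T1 not confirmed in time (assumed behavior)
        elif state == "VOICE":
            if p < noise * k3:
                below_th3 = True
            if below_th3 and p < noise * k4:
                t2, state = t, "MAYBE_END"                  # provisional end T2
        elif state == "MAYBE_END":
            if p > noise * k4:
                state, below_th3 = "VOICE", False           # reset T2 within Tp
            elif t - t2 >= tp_frames:
                return ts, t2                               # T2 becomes Te
    return None


# Short dip at frames 8-9 is bridged; the section ends only after Tp of quiet.
powers = [1, 1, 1, 1, 3, 10, 12, 11, 2.5, 1, 6, 9, 1, 1, 1, 1, 1, 1]
section = detect_voice_section(powers)
print(section)  # (4, 12)
```

The brief drop in power in the middle of the utterance does not end the section, because the power rises above threshold 4 again before Tp elapses.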

[0014] When the voice section detection means 3 judges a section to be non-voice, the power spectrum of the input signal is passed to the noise spectrum estimation means 4, which estimates the power value of each frequency component and updates it successively. Linear prediction, least-squares linear prediction, and similar methods can be used for the estimation. When the voice section detection means 3 detects the provisional start point T1, the noise removal means 5 removes the noise spectrum estimated by the noise spectrum estimation means 4 from the power spectrum of the input signal, thereby suppressing the noise components contained in the input signal. Next, the word feature vector extraction means 6 extracts word feature vectors from the power spectrum data from which the noise removal means 5 has removed the noise spectrum. The registered word dictionary 7 stores the feature vectors of the registered words. Once the voice section has been determined, the voice recognition means 8 calculates the likelihood between the extracted word feature vectors and the feature vectors of the registered words. If the likelihood difference between the top-ranked word and the next word is less than threshold R1, a signal prompting the user to repeat the utterance is output as the recognition result. If the likelihood difference between the top-ranked word and the next word is equal to or greater than threshold R2, the top-ranked word is output as the recognition result.
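The noise handling and the likelihood-margin decision described above might look like the following minimal sketch. Note the substitutions: the text names linear prediction and least-squares linear prediction for the per-bin estimate, whereas this example swaps in a simpler exponential moving average. Clamping the subtracted spectrum at a floor and returning an "unspecified" result when the margin falls between R1 and R2 are also assumptions, since the text does not cover those cases. All names are hypothetical.

```python
def update_noise_spectrum(noise, frame_spectrum, alpha=0.9):
    """Per-bin noise update from a non-voice frame.

    Uses an exponential moving average, a stand-in for the prediction
    methods named in the text.
    """
    return [alpha * n + (1 - alpha) * p for n, p in zip(noise, frame_spectrum)]


def remove_noise(frame_spectrum, noise, floor=0.0):
    """Subtract the estimated noise spectrum bin by bin, clamped at `floor`."""
    return [max(p - n, floor) for p, n in zip(frame_spectrum, noise)]


def decide(likelihoods, r1, r2):
    """Apply the R1 / R2 margin rule to a {word: likelihood} mapping."""
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    (top_word, top), (_, second) = ranked[0], ranked[1]
    margin = top - second
    if margin < r1:
        return ("ask_again", None)      # prompt the user to repeat
    if margin >= r2:
        return ("accept", top_word)     # output the top-ranked word
    return ("unspecified", None)        # R1 <= margin < R2: not covered by the text


noise = update_noise_spectrum([2.0, 2.0, 2.0], [2.0, 2.0, 2.0])
clean = remove_noise([5.0, 1.0, 3.0], [2.0, 2.0, 2.0])
close_call = decide({"sapporo": -10.0, "saporo": -10.5, "tokyo": -30.0}, r1=1.0, r2=2.0)
clear_win = decide({"sapporo": -10.0, "tokyo": -30.0}, r1=1.0, r2=2.0)
print(clean, close_call, clear_win)
```

Requiring a margin between the two best candidates, rather than accepting the top word unconditionally, trades a few extra re-prompts for fewer misrecognitions.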

[0015] Embodiment 2. FIG. 5 shows Embodiment 2 of this invention. 2 is acoustic analysis means that calculates a power spectrum from the input signal and outputs the total power of the voice band; 3 is voice section detection means that detects a voice section using the voice band power; 4 is noise spectrum estimation means that estimates a noise spectrum from the data of non-voice sections; 5 is noise removal means that removes noise components from the data of the voice section; 6 is word feature vector extraction means that extracts feature vectors from the noise-removed voice data; 7 is a registered word dictionary that stores the feature vectors of the registered words to be recognized; 8 is voice recognition means that compares the extracted feature vectors with the registered words, calculates their likelihoods, and outputs the top several words as the recognition result; and 9 is keyword determination means that determines the keyword of the voice switch from the output top several words. A nickname such as "Navi" may be set as the keyword, for example.

[0016] The keyword determination means 9 implements the voice switch. When, among the top several words output by the voice recognition means 8, the top-ranked word is the keyword and the next word is a specific word, the input is judged to be the keyword. When the keyword is not among the top several words output by the voice recognition means 8, the input word is discarded. When the top-ranked word is the keyword but the next word is not the specific word, a signal prompting the user to repeat the utterance is output as the recognition result. Once the input is judged to be the keyword, the keyword determination means 9 enters the switch-ON state and outputs the keyword as the recognition result. It then outputs the recognition result for the voice input that follows. If there is no voice input for several seconds or more, it enters the switch-OFF state and performs keyword determination again. In other words, after determining the keyword the device recognizes the voice command input next, and once silence has lasted for a certain time it returns to keyword determination.
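The voice switch behavior can be illustrated with a small state holder. This is a sketch under assumptions: the class name `KeywordSwitch`, the set of accepted second-ranked words, the 3-second timeout, and the "unspecified" result for a keyword that appears in the list but is not ranked first are all hypothetical, as the text leaves those details open.

```python
class KeywordSwitch:
    """Voice switch driven by the top-N word list from the recognizer."""

    def __init__(self, keyword, expected_second_words, timeout_s=3.0):
        self.keyword = keyword
        self.expected = set(expected_second_words)   # the "specific word(s)"
        self.timeout_s = timeout_s                   # assumed "several seconds"
        self.on = False
        self.last_voice_time = None

    def handle(self, top_words, now):
        """Process one utterance; `top_words` is ranked best first."""
        # Silence longer than the timeout turns the switch OFF again.
        if self.on and now - self.last_voice_time >= self.timeout_s:
            self.on = False
        self.last_voice_time = now
        if self.on:
            return ("command", top_words[0])         # recognize the next command
        if self.keyword not in top_words:
            return ("discard", None)                 # keyword absent: drop input
        if top_words[0] == self.keyword:
            if len(top_words) > 1 and top_words[1] in self.expected:
                self.on = True
                return ("keyword", self.keyword)     # switch ON
            return ("ask_again", None)               # prompt the user to repeat
        return ("unspecified", None)                 # case not covered by the text


sw = KeywordSwitch("navi", {"nabi"}, timeout_s=3.0)
results = [
    sw.handle(["map", "zoom"], 0.0),
    sw.handle(["navi", "zoom"], 1.0),
    sw.handle(["navi", "nabi", "map"], 2.0),
    sw.handle(["zoom", "room"], 3.0),
    sw.handle(["zoom", "room"], 10.0),
]
print(results)
```

Checking the second-ranked word as well as the first makes it harder for noise that happens to resemble the keyword to turn the switch on.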

[0017] Embodiment 3. FIG. 6 shows Embodiment 3 of this invention. 2 is acoustic analysis means that calculates a power spectrum from the input signal and outputs the total power of the voice band; 3 is voice section detection means that detects a voice section using the voice band power; 4 is noise spectrum estimation means that estimates a noise spectrum from the data of non-voice sections; 5 is noise removal means that removes noise components from the data of the voice section; 6 is word feature vector extraction means that extracts feature vectors from the noise-removed voice data; 7 is a registered word dictionary that stores the feature vectors of the registered words to be recognized; 8 is voice recognition means that compares the extracted feature vectors with the registered words, calculates their likelihoods, and outputs the top several words as the recognition result; and 10 is SNR determination means that controls the recognition processing according to the SN ratio obtained from the average power of the voice section frames detected by the voice section detection means and the average power of the noise frames.

[0018] When the voice section detection means 3 detects a voice section, the SNR determination means 10 calculates the average power over all frames of the voice section. It then takes the ratio of this average to the noise power in the non-voice section estimated by the noise spectrum estimation means 4, and if the ratio is equal to or greater than a set threshold, the processing from noise removal onward is executed. If the ratio is below the set threshold, the SNR determination result reports that the input voice is unclear, and the input signal is discarded.
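The SNR gate reduces to a ratio test. In the sketch below, expressing the ratio in decibels and the 6 dB default threshold are assumptions made for illustration; the text only states that a ratio is compared with a set threshold. The function names are hypothetical.

```python
import math


def snr_db(voice_frame_powers, noise_power):
    """SN ratio in dB: mean voice-section frame power over the noise power."""
    mean_voice = sum(voice_frame_powers) / len(voice_frame_powers)
    return 10 * math.log10(mean_voice / noise_power)


def snr_gate(voice_frame_powers, noise_power, min_snr_db=6.0):
    """Decide whether to continue with noise removal and recognition."""
    snr = snr_db(voice_frame_powers, noise_power)
    if snr >= min_snr_db:
        return ("recognize", snr)   # run noise removal and later stages
    return ("unclear", snr)         # report unclear input and discard it


loud = snr_gate([100.0, 100.0], 10.0)   # mean power 10x the noise
faint = snr_gate([12.0, 8.0], 10.0)     # mean power equal to the noise
print(loud[0], faint[0])
```

Rejecting low-SNR input before recognition avoids spending effort on utterances that would most likely be misrecognized.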

[0019] The SNR determination means 10 can of course also be added to the voice recognition device shown in Embodiment 2.

[0020] Embodiment 4. FIG. 7 shows Embodiment 4 of this invention. Embodiments 1 and 2 above use the voice band power to detect the voice section. In Embodiment 4, main frequency detection 24 detects the high-power main frequency components of the noise spectrum obtained by the noise spectrum estimation means 4, and voice band power calculation 23 calculates the voice band power excluding the power near the detected main frequency of the noise.
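A minimal sketch of this exclusion, assuming a single dominant noise bin and a symmetric guard width of `guard` bins on each side (both assumptions; the text does not say how many main frequency components are detected or how wide "near" is). Band limits are given as bin indices for brevity.

```python
def main_noise_bin(noise_spectrum):
    """Index of the strongest bin of the estimated noise spectrum."""
    return max(range(len(noise_spectrum)), key=noise_spectrum.__getitem__)


def band_power_excluding_noise_peak(spectrum, noise_spectrum, band_bins, guard=1):
    """Voice band power with the bins near the noise peak left out.

    band_bins is an inclusive (low_bin, high_bin) pair.
    """
    peak = main_noise_bin(noise_spectrum)
    low, high = band_bins
    return sum(p for k, p in enumerate(spectrum)
               if low <= k <= high and abs(k - peak) > guard)


noise_est = [1.0, 1.0, 9.0, 1.0, 1.0, 1.0]     # strong noise line at bin 2
frame = [1.0, 2.0, 10.0, 2.0, 5.0, 1.0]
wide_guard = band_power_excluding_noise_peak(frame, noise_est, (1, 4), guard=1)
narrow_guard = band_power_excluding_noise_peak(frame, noise_est, (1, 4), guard=0)
print(wide_guard, narrow_guard)  # 5.0 9.0
```

The guard width trades noise rejection against loss of voice energy that happens to sit near the noise peak.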

[0021] Embodiment 5. FIG. 8 shows Embodiment 5 of this invention. In Embodiment 5, noise removal processing 25 calculates a power spectrum by subtracting the noise spectrum obtained by the noise spectrum estimation means 4 from the power spectrum of the input signal. The voice band power is then calculated from the resulting power spectrum.
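Embodiment 5 can be sketched as a subtraction followed by a band sum. Clamping negative differences to zero is an assumption (the text only says the noise spectrum is subtracted), and the function name and bin-index band limits are hypothetical.

```python
def denoised_band_power(spectrum, noise_spectrum, band_bins):
    """Voice band power after per-bin noise subtraction.

    Negative differences are clamped to zero (an assumed safeguard).
    band_bins is an inclusive (low_bin, high_bin) pair.
    """
    low, high = band_bins
    return sum(max(p - n, 0.0)
               for k, (p, n) in enumerate(zip(spectrum, noise_spectrum))
               if low <= k <= high)


frame = [4.0, 10.0, 3.0, 8.0]
noise_est = [1.0, 2.0, 5.0, 3.0]
raw_power = sum(frame[1:4])                                  # band power without subtraction
clean_power = denoised_band_power(frame, noise_est, (1, 3))
print(raw_power, clean_power)  # 21.0 13.0
```

Compared with Embodiment 4, this variant attenuates noise in every bin of the band instead of dropping only the bins around the dominant noise frequency.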

[0022]

[Effects of the invention] According to the first invention, only one microphone is used, the noise level of the voice band power serves as the reference for voice section detection, and the noise level is updated successively in non-voice sections, so the voice section can be detected correctly even when the environmental noise changes. In addition, the noise spectrum is estimated from the environmental noise signal picked up in non-voice sections, and noise components are removed by subtracting the estimated noise spectrum from the input signal of the detected voice section, so stable voice recognition is possible even in a low S/N environment with loud environmental noise.

[0023] According to the second invention, the keyword that acts as the voice switch is determined using the top several recognized words, which suppresses misrecognition and malfunction, and no switch operation is needed for voice input, which improves operability.

[0024] According to the third invention, when the SN ratio of the input voice is too low, the input is rejected without executing recognition processing, which suppresses misrecognition.

[0025] According to the fourth and fifth inventions, the device is configured to be less affected by environmental noise, which secures a high SN ratio for voice section detection and allows the voice section to be detected correctly.

[Brief description of the drawings]

[FIG. 1] A diagram showing Embodiment 1 of the voice recognition device according to this invention.

[FIG. 2] A diagram showing how the voice recognition device according to this invention is installed.

[FIG. 3] A diagram showing the acoustic analysis means of the voice recognition device according to this invention.

[FIG. 4] A diagram showing the voice section detection means of the voice recognition device according to this invention.

[FIG. 5] A diagram showing Embodiment 2 of the voice recognition device according to this invention.

[FIG. 6] A diagram showing Embodiment 3 of the voice recognition device according to this invention.

[FIG. 7] A diagram showing Embodiment 4 of the voice recognition device according to this invention.

[FIG. 8] A diagram showing Embodiment 5 of the voice recognition device according to this invention.

[FIG. 9] A diagram showing the configuration of a conventional voice recognition device.

[FIG. 10] A diagram showing how a conventional voice recognition device is installed.

[Explanation of symbols]

1 voice microphone, 1b noise microphone, 2 acoustic analysis means, 3 voice section detection means, 4 noise spectrum estimation means, 5 noise removal means, 6 word feature vector extraction means, 7 registered word dictionary, 8 voice recognition means, 9 keyword determination means, 10 SNR determination means, 11 trigger button, 21 preprocessing, 22 power spectrum calculation, 23 voice band power calculation, 24 main frequency detection, 25 noise removal processing, 30 voice recognition device, 40 controller for a car navigation system or the like, 50 monitor for a car navigation system or the like.

Claims

[Claims]

1. A voice recognition device comprising: a microphone for voice input; acoustic analysis means that calculates a power spectrum from the input signal and outputs the total power of the voice band; voice section detection means that detects a voice section using the voice band power; noise spectrum estimation means that estimates a noise spectrum from the data of non-voice sections; noise removal means that removes noise components from the data of the voice section; word feature vector extraction means that extracts feature vectors from the noise-removed voice data; a registered word dictionary that stores the feature vectors of the registered words to be recognized; and voice recognition means that compares the extracted feature vectors with the registered words, calculates their likelihoods, and outputs the top-ranked word as the recognition result when the likelihood difference between the top-ranked word and the next candidate is equal to or greater than a set threshold.

2. A voice recognition device comprising: a microphone for voice input; acoustic analysis means that calculates a power spectrum from the input signal and outputs the total power of the voice band; voice section detection means that detects a voice section using the voice band power; noise spectrum estimation means that estimates a noise spectrum from the data of non-voice sections; noise removal means that removes noise components from the data of the voice section; word feature vector extraction means that extracts feature vectors from the noise-removed voice data; a registered word dictionary that stores the feature vectors of the registered words to be recognized; voice recognition means that compares the extracted feature vectors with the registered word dictionary, calculates their likelihoods, and outputs the top several words as the recognition result; and keyword determination means that determines the keyword of a voice switch using the sequence of the top several words of the recognition result.

3. The voice recognition device according to claim 1, further comprising SNR determination means that controls the recognition processing according to the SN ratio obtained from the average power of the voice section frames detected by the voice section detection means and the average power of the noise frames.

4. The voice recognition device according to claim 1 or 2, wherein the voice section is detected by calculating the main frequency from the estimated noise spectrum and using the total power of the power spectrum of the input data excluding the vicinity of the main frequency component of the noise.

5. The voice recognition device according to claim 1, wherein the voice section is detected using the total voice band power of the power spectrum obtained from the difference between the power spectrum of the input data and the power spectrum of the estimated noise.