JP2001042886A

JP2001042886A - Speech input and output system and speech input and output method

Info

Publication number: JP2001042886A
Application number: JP11219518A
Authority: JP
Inventors: Masamitsu Muratani; 政充村谷; Katsuhiko Machida; 勝彦町田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1999-08-03
Filing date: 1999-08-03
Publication date: 2001-02-16

Abstract

PROBLEM TO BE SOLVED: To enable environment adaptation processing for speech recognition processing that targets at least human speech, by adopting a configuration in which, when a human speaks, recognition processing is executed on the basis of environment learning information that was learned beforehand and is held internally, the environment learning processing having already completed at least once. SOLUTION: The system has means for executing speech recognition processing, when a human speaks, on the basis of environment learning information learned beforehand and held internally, with the environment learning processing already completed at least once. For example, the one-input speech recognition engine 1 of the speech input/output system 10 executes environment learning processing, for speech recognition processing that targets at least human speech, either simultaneously with the speech recognition processing or as its post-processing. A signal switching unit 2 switches the signal input to the one-input speech recognition engine 1. A mixer unit 4 superimposes a previously prepared sample pattern of a human utterance on the speech signal input from outside. A speech input unit 5 converts external sound into a speech signal.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to speech input/output technology using a speech recognition engine that has an environment adaptation function, which aims to improve the recognition rate by learning background noise other than the human voice contained in the input speech signal, and that executes the environment learning processing simultaneously with the speech recognition processing or as its post-processing. In particular, the invention relates to a speech input/output system and a speech input/output method capable of executing environment adaptation processing for speech recognition processing that targets at least human speech.

[0002]

2. Description of the Related Art

FIG. 5 shows an example of a conventional system using a one-input speech recognition engine (first prior art). Referring to FIG. 5, the first prior art comprises: a speech recognition engine P1, which is a one-input engine having only a single input that expects a human voice as a speech signal, and which has an environment learning function that learns background noise from a signal obtained by removing the speech to be recognized from the externally captured speech signal; a switch unit P2; and speech input means P3 that converts external sound into a signal. When no human is speaking, switch unit P2 is opened so that no speech signal is input to the recognition engine, and recognition processing is therefore not executed. When a human speaks, switch unit P2 is closed, which selects the speech signal containing the human voice; the speech signal converted by speech input means P3 is taken into speech recognition engine P1, and recognition processing accompanied by the environment adaptation function is executed.

[0003]

FIG. 6 shows an example of a conventional system using a two-input speech recognition engine (second prior art). Referring to FIG. 6, the second prior art comprises: a speech recognition engine P1, which is a two-input engine having one input that expects a human voice as a speech signal and one input that expects background noise as a speech signal, which therefore requires two speech signal inputs, and which has an environment learning function that learns background noise from a signal obtained by removing the speech to be recognized from the externally captured speech signal; a switch unit P2; speech input means P3 that converts external sound into a signal; and environmental sound input means P4 that converts external environmental sound into a signal. When no human is speaking, switch A of switch unit P2 is opened and switch B is kept closed, so that no speech signal is input to speech recognition engine P1 and recognition processing is not executed. When a human speaks, switch A of switch unit P2 is closed and switch B is opened, which selects the speech signal containing the human voice; the speech signal converted by speech input means P3 is taken into speech recognition engine P1, and recognition processing accompanied by the environment adaptation function is executed.

Another conventional technique is described in, for example, Japanese Unexamined Patent Application Publication No. H8-211888 (third prior art). The third prior art aims to improve an environment adaptation method for raising the recognition rate in speech recognition using statistical models. It is an environment adaptation method that learns environmental noise in order to create a non-speech model, that is, a statistical model corresponding to non-speech intervals, for speech recognition using statistical models. Before speech recognition starts, training environmental noise data from the environment in which recognition is to be performed is superimposed on training speech data to obtain noise-superimposed training speech data; the non-speech model is created by learning from the noise-superimposed training speech data; and the non-speech model represents three states corresponding to the transition from non-speech to speech, the stationary non-speech portion, and the transition from speech to non-speech.

The publication discloses the following effects. Because the non-speech model is created by learning from training speech data on which the noise of the recognition environment has been superimposed, noises produced by the speaker, such as lip noise and breath, can be learned in addition to environmental noise; this improves matching in non-speech intervals and raises the speech recognition rate. When the non-speech model can represent the three states of the transition from non-speech to speech, the stationary non-speech portion, and the transition from speech to non-speech, these transition portions can also be learned in addition to the noise of the environment itself and the noises produced by the speaker, and the recognition rate rises further. Furthermore, when selection means is provided for choosing between a speech model created from the training speech data and a speech model created from the noise-superimposed training speech data, the speech model can be chosen according to the nature of the environmental noise, which improves matching in speech intervals and raises the speech recognition rate.

[0004]

Problems to Be Solved by the Invention

However, the speech recognition engines of the above prior art execute environment learning simultaneously with recognition processing or as its post-processing, so environment learning is not performed unless recognition processing is carried out while a human is speaking. Consequently, even though the speech recognition engine has an environment adaptation function, it does not execute environment learning until a human speaks. No environment information has been accumulated inside the engine at that point, and the environment adaptation function therefore cannot be fully exercised for the first utterance.
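
The first-utterance limitation can be illustrated with a toy model. The following Python sketch is not taken from the publication: the class `ToyEngine`, its fields, and the "quietest frame" heuristic for estimating background level are all hypothetical. It only shows that an engine which learns as post-processing of recognition has nothing learned when the first utterance arrives.

```python
from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Hypothetical engine that learns the environment only while recognizing."""
    noise_level: float = 0.0  # learned background level (0.0 means nothing learned)
    learn_count: int = 0      # number of completed environment-learning passes

    def recognize(self, frames):
        """Return the noise estimate that was available to this recognition pass."""
        if not frames:
            # Switch open: no input, so neither recognition nor learning runs.
            return None
        used_estimate = self.noise_level
        # Post-processing: treat the quietest frame as background (toy heuristic)
        # and fold it into a running average.
        quietest = min(abs(x) for x in frames)
        total = self.noise_level * self.learn_count + quietest
        self.learn_count += 1
        self.noise_level = total / self.learn_count
        return used_estimate


engine = ToyEngine()
engine.recognize([])                        # nobody speaks: the switch is open
first = engine.recognize([0.2, 0.9, 0.3])   # first utterance: no estimate yet
second = engine.recognize([0.2, 0.8, 0.2])  # second utterance: estimate exists
```

In this sketch `first` is 0.0 because no learning pass had completed before the first utterance, while `second` benefits from the estimate produced by the first pass.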

[0005]

SUMMARY OF THE INVENTION

The present invention has been made in view of this problem, and its object is to provide a speech input/output system and a speech input/output method capable of executing environment adaptation processing for speech recognition processing that targets at least human speech.

[0006]

Means for Solving the Problems

The gist of the invention according to claim 1 is a speech input/output system capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, comprising: a speech recognition engine that executes the environment learning processing simultaneously with the speech recognition processing or as its post-processing; means for learning environment information by superimposing, when no human is speaking, a speech pattern prepared in advance on an external input signal, inputting the result to the speech recognition engine, and executing recognition processing in advance; and means for executing recognition processing, when a human speaks, on the basis of the environment learning information learned in advance and held internally, the environment learning processing having completed at least once.

The gist of the invention according to claim 2 is a speech input/output system capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, comprising: a speech recognition engine that executes the environment learning processing simultaneously with the speech recognition processing or as its post-processing; means for learning environment information by superimposing, when no human is speaking, a speech pattern prepared in advance on background sound from outside, inputting the result to the speech recognition engine, and executing recognition processing in advance; and means for executing recognition processing, when a human speaks, on the basis of the environment learning information learned in advance and held internally, the environment learning processing having completed at least once.

The gist of the invention according to claim 3 is the speech input/output system according to claim 1 or 2, comprising: a speech input unit that converts external sound into a speech signal; the speech recognition engine, which has only one speech input channel for speech signal input and has an environment adaptation function aimed at improving the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching unit that switches the signal input to the one-input speech recognition engine; and a mixer unit that superimposes a sample pattern of a human utterance prepared in advance on a speech signal input from outside.

The gist of the invention according to claim 4 is the speech input/output system according to claim 1 or 2, comprising: a speech input unit that converts external sound into a speech signal; the speech recognition engine, which has only one speech signal input terminal and has an environment adaptation function aimed at improving the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching unit that switches the signal input to the one-input speech recognition engine; a sample speech pattern storage unit that stores a sample pattern of a human utterance prepared in advance; and a mixer unit that superimposes the sample speech pattern stored in the sample speech pattern storage unit on the speech signal input from the speech input unit and outputs the result.

The gist of the invention according to claim 5 is the speech input/output system according to claim 4, configured so that, after speech input has started and while no human is speaking, the external input is superimposed in the mixer unit on the sample pattern of human speech held in the sample speech pattern storage unit and input to the one-input speech recognition engine, and the signal switching unit is set to a predetermined side, whereby the one-input speech recognition engine executes one-input environment adaptation processing, in which it performs recognition processing although no speech input by human utterance is actually taking place and thereby learns the environment adaptation.

The gist of the invention according to claim 6 is the speech input/output system according to claim 4 or 5, configured so that, after speech input has started and when a human speaks, the signal switching unit is set to the other predetermined side and the external speech input is fed unchanged to the one-input speech recognition engine, which executes one-input speech recognition processing.

The gist of the invention according to claim 7 is the speech input/output system according to claim 1 or 2, comprising: a speech input unit that converts external sound into a speech signal; an environmental sound input unit that converts external environmental sound into an environmental sound signal; the two-input speech recognition engine, which has two channels, namely a speech input channel that expects the speech signal and a noise input channel that expects the environmental sound signal, and which has an environment adaptation function aimed at improving the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching unit that switches the signals input to the two-input speech recognition engine; and a mixer unit that superimposes the sample speech pattern stored in the sample speech pattern storage unit on the environmental sound signal input from the environmental sound input unit and outputs the result.

The gist of the invention according to claim 8 is the speech input/output system according to claim 7, configured so that, after speech input has started and while no human is speaking, a signal generated by superimposing the environmental sound signal output from the environmental sound input unit on the sample speech pattern held in the sample speech pattern storage unit is input to the speech input channel of the two-input speech recognition engine, the signal switching unit is set to a predetermined side, and the environmental sound signal input from the environmental sound input unit is input directly to the noise input channel of the two-input speech recognition engine, whereby the two-input speech recognition engine executes two-input environment adaptation processing, in which it performs recognition processing while no human is speaking and learns environment information.

The gist of the invention according to claim 9 is the speech input/output system according to claim 7 or 8, configured so that, after speech input has started and when a human speaks, the signal switching unit is set to the other predetermined side, the speech signal from the speech input unit is input to the speech input channel, and the environmental sound signal input from the environmental sound input unit is input to the noise input channel, whereby the two-input speech recognition engine executes two-input speech recognition processing, in which the recognition target is the speech signal containing the voice actually uttered by the human.

The gist of the invention according to claim 10 is a speech input/output method capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, comprising: a speech recognition step that executes the environment learning processing simultaneously with the speech recognition processing or as its post-processing; a step of learning environment information by superimposing, when no human is speaking, a speech pattern prepared in advance on an external input signal, inputting the result to the speech recognition step, and executing recognition processing in advance; and a step of executing recognition processing, when a human speaks, on the basis of the environment learning information learned in advance and held internally, the environment learning processing having completed at least once.

The gist of the invention according to claim 11 is a speech input/output method capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, comprising: a speech recognition step that executes the environment learning processing simultaneously with the speech recognition processing or as its post-processing; a step of learning environment information by superimposing, when no human is speaking, a speech pattern prepared in advance on background sound from outside, inputting the result to the speech recognition step, and executing recognition processing in advance; and a step of executing recognition processing, when a human speaks, on the basis of the environment learning information learned in advance and held internally, the environment learning processing having completed at least once.

The gist of the invention according to claim 12 is the speech input/output method according to claim 10 or 11, comprising: a speech input step of converting external sound into a speech signal; the speech recognition step, which has only one speech input channel for speech signal input and has an environment adaptation function aimed at improving the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching step of switching the signal input to the one-input speech recognition step; and a mixing step of superimposing a sample pattern of a human utterance prepared in advance on a speech signal input from outside.

The gist of the invention according to claim 13 is the speech input/output method according to claim 10 or 11, comprising: a speech input step of converting external sound into a speech signal; the speech recognition step, which has only one speech signal input terminal and has an environment adaptation function aimed at improving the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching step of switching the signal input to the one-input speech recognition step; a sample speech pattern storage step of storing a sample pattern of a human utterance prepared in advance; and a mixing step of superimposing the sample speech pattern stored in the sample speech pattern storage step on the speech signal input from the speech input step and outputting the result.

The gist of the invention according to claim 14 is the speech input/output method according to claim 13, including a step in which, after speech input has started and while no human is speaking, the external input is superimposed in the mixing step on the sample pattern of human speech held in the sample speech pattern storage step and input to the one-input speech recognition step, and the signal switching step is set to a predetermined side, whereby the one-input speech recognition step executes one-input environment adaptation processing, in which it performs recognition processing although no speech input by human utterance is actually taking place and thereby learns the environment adaptation.

The gist of the invention according to claim 15 is the speech input/output method according to claim 13 or 14, including a step in which, after speech input has started and when a human speaks, the signal switching step is set to the other predetermined side and the external speech input is fed unchanged to the one-input speech recognition step, which executes one-input speech recognition processing.

The gist of the invention according to claim 16 is the speech input/output method according to claim 10 or 11, comprising: a speech input step of converting external sound into a speech signal; an environmental sound input step of converting external environmental sound into an environmental sound signal; the two-input speech recognition step, which has two channels, namely a speech input channel that expects the speech signal and a noise input channel that expects the environmental sound signal, and which has an environment adaptation function aimed at improving the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching step of switching the signals input to the two-input speech recognition step; and a mixing step of superimposing the sample speech pattern stored in the sample speech pattern storage step on the environmental sound signal input from the environmental sound input step and outputting the result.

The gist of the invention according to claim 17 is the speech input/output method according to claim 16, including a step in which, after speech input has started and while no human is speaking, a signal generated by superimposing the environmental sound signal output from the environmental sound input step on the sample speech pattern held in the sample speech pattern storage step is input to the speech input channel of the two-input speech recognition step, the signal switching step is set to a predetermined side, and the environmental sound signal input from the environmental sound input step is input directly to the noise input channel of the two-input speech recognition step, whereby the two-input speech recognition step executes two-input environment adaptation processing, in which it performs recognition processing while no human is speaking and learns environment information.

The gist of the invention according to claim 18 is the speech input/output method according to claim 16 or 17, including a step in which, after speech input has started and when a human speaks, the signal switching step is set to the other predetermined side, the speech signal from the speech input step is input to the speech input channel, and the environmental sound signal input from the environmental sound input step is input to the noise input channel, whereby the two-input speech recognition step executes two-input speech recognition processing, in which the recognition target is the speech signal containing the voice actually uttered by the human.
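
The signal routing described for the two-input configuration (claims 7 to 9) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the publication's implementation: the function names `superimpose` and `route_two_input`, the list-of-floats signal representation, and sample-wise addition as the mixing operation are all hypothetical choices.

```python
def superimpose(pattern, ambient):
    """Toy mixer: sample-wise sum of the stored pattern and the ambient signal."""
    n = min(len(pattern), len(ambient))
    return [pattern[i] + ambient[i] for i in range(n)]


def route_two_input(human_speaking, mic_signal, ambient_signal, sample_pattern):
    """Return (speech_channel, noise_channel) for a two-input engine.

    human_speaking selects the side of the (hypothetical) switching unit.
    """
    if human_speaking:
        # Recognition mode: the microphone signal goes to the speech channel.
        speech_channel = mic_signal
    else:
        # Adaptation mode: stored sample pattern plus ambient sound goes to
        # the speech channel, standing in for an utterance.
        speech_channel = superimpose(sample_pattern, ambient_signal)
    # In both modes the ambient signal feeds the noise channel directly.
    return speech_channel, ambient_signal


# Adaptation mode (no human speaking): the microphone signal is ignored.
adapt_speech, adapt_noise = route_two_input(
    False, [9.0, 9.0], [0.5, 0.25], [1.0, 2.0]
)
# Recognition mode (human speaking): the microphone signal passes through.
recog_speech, recog_noise = route_two_input(
    True, [9.0, 9.0], [0.5, 0.25], [1.0, 2.0]
)
```

The sketch keeps the noise channel identical in both modes, which mirrors the text: only the source of the speech channel changes when the switching unit is moved.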

[0007]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The feature of each embodiment described below is a configuration in which, when no human is speaking, a speech pattern prepared in advance is superimposed on the external input (an input containing only background sound, with no human utterance), the result is input to the speech recognition engine, and recognition processing is executed in advance to learn environment information. As a result, when a human speaks, recognition processing can be executed effectively on the basis of environment learning information that was learned in advance and is held internally, the environment learning processing having completed at least once. Embodiments of the present invention are described in detail below with reference to the drawings.

[0008]

(First Embodiment) FIG. 1 is a block diagram for explaining the speech input/output system 10 according to the first embodiment of the present invention. In FIG. 1, 1 denotes a one-input speech recognition engine, 2 a signal switching unit, 4 a mixer unit, 5 a speech input unit, and 10 the speech input/output system. Referring to FIG. 1, the speech input/output system 10 is a speech input/output system capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, and comprises: a one-input speech recognition engine 1 that has one input for external speech, has an environment adaptation function aimed at improving the recognition rate by learning background noise other than the human voice contained in the speech signal, and executes the environment learning processing, for speech recognition processing that targets at least human speech, simultaneously with the speech recognition processing or as its post-processing; a signal switching unit 2 that switches the signal input to the one-input speech recognition engine 1; a mixer unit 4 that mixes (superimposes) a sample pattern of a human utterance prepared in advance with a speech signal input from outside; and a speech input unit 5 that converts external sound into a speech signal.

【０００９】[0009] FIG. 3 is a specific configuration diagram of the speech input/output system 10 of FIG. 1. In FIG. 3, reference numeral 1 denotes the one-input speech recognition engine, 2 the signal switching unit, 3 a sample speech pattern storage unit, 4 the mixer unit, 5 the speech input unit (microphone), and 10 the speech input/output system. Referring to FIG. 3, the speech input/output system 10 comprises: the one-input speech recognition engine 1, which has a single external speech input and executes the environmental learning process either simultaneously with the speech recognition processing or as post-processing of it; the signal switching unit 2, which switches the signal input to the one-input speech recognition engine 1; the sample speech pattern storage unit 3, which stores a sample pattern of a human utterance prepared in advance; the mixer unit 4, which mixes (superimposes) the sample speech pattern stored in the sample speech pattern storage unit 3 with the speech signal input from the speech input unit 5 and outputs the result; and the speech input unit 5, which converts external sound into a speech signal and outputs it.

【００１０】[0010] Next, the operation of the speech input/output system 10 using the one-input speech recognition engine 1 (the speech input/output method) is described with reference to FIGS. 2 and 3. FIG. 2 is a flowchart for explaining the speech input/output method according to the first embodiment of the present invention (the operation of the speech input/output system 10 of the first embodiment). Referring to FIGS. 2 and 3, after speech input starts (step S1), if no human is speaking (step S2: no utterance), the mixer unit 4 superimposes the external input on the sample pattern of human speech held in the sample speech pattern storage unit 3 (step S21), and the result is input to the one-input speech recognition engine 1 with the signal switching unit 2 set to side A (see FIG. 3) (step S3). The one-input speech recognition engine 1 thereby performs recognition processing (step S4) for the case in which no speech input by a human utterance is actually taking place, and learns the environment adaptation (one-input environment adaptation process).
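
The superposition performed by the mixer unit 4 in step S21 can be illustrated with a minimal Python sketch. The patent does not specify the sample format, any gain, or how length mismatches and overflow are handled, so the 16-bit PCM representation, the truncation to the shorter input, the saturation, and the function name `mix_int16` are all assumptions made for illustration.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit PCM range


def mix_int16(sample_pattern: Sequence[int], external_input: Sequence[int]) -> List[int]:
    """Superimpose the external input on the stored sample pattern.

    Hypothetical helper: adds the two signals sample by sample, truncates
    to the shorter input, and saturates each sum to the 16-bit range.
    """
    mixed = []
    for s, n in zip(sample_pattern, external_input):
        total = s + n
        # Saturate instead of wrapping around on overflow.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

For example, `mix_int16([1000, -2000, 30000], [50, -50, 5000, 7])` returns `[1050, -2050, 32767]`: the third sum saturates and the unmatched fourth input sample is dropped.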

【００１１】[0011] On the other hand, when a human utterance is made (step S2: utterance), the signal switching unit 2 is set to side B (see FIG. 3) (step S22), and the external speech input is passed unchanged to the one-input speech recognition engine 1, which performs recognition processing (step S4) (one-input speech recognition process).
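
The branch at step S2 of the first embodiment can be sketched as a small routing function. This is only an illustration under stated assumptions: how an utterance is detected is not described in this passage, so the `speaking` flag is assumed to be supplied by the caller, signals are modelled as lists of floats, and the name `route_one_input` is hypothetical.

```python
from typing import List, Sequence, Tuple


def route_one_input(
    speaking: bool,
    sample_pattern: Sequence[float],
    mic_signal: Sequence[float],
) -> Tuple[str, List[float]]:
    """Return (switch side, signal fed to the one-input engine) for one frame."""
    if speaking:
        # Step S22: side B, the microphone signal reaches the engine unchanged.
        return "B", list(mic_signal)
    # Steps S21 and S3: side A, the stored sample pattern is superimposed on
    # the background-only microphone signal before it reaches the engine.
    mixed = [s + m for s, m in zip(sample_pattern, mic_signal)]
    return "A", mixed
```

In both cases the returned signal goes to the same engine input, which reflects the single-path design: only the switch side differs between the learning case and the recognition case.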

【００１２】[0012] As described above, according to the first embodiment, when a human utterance is to be recognized, the one-input speech recognition process is executed in advance to accumulate learning information about the environment, so that the environmental learning process has been executed at least once; this has the effect that recognition processing can be performed effectively. Conversely, if the one-input speech recognition process is not executed in advance, no environmental information is accumulated inside the one-input speech recognition engine 1 and the environmental learning process is not executed, so it is considered that effective recognition processing by the environment adaptation function cannot be expected when speech input from a human utterance is recognized.

【００１３】[0013] (Second Embodiment) FIG. 4 is a block diagram for explaining a speech input/output system 10 according to a second embodiment of the present invention. In FIG. 4, reference numeral 2 denotes a signal switching unit, 3 a sample speech pattern storage unit, 4 a mixer unit, 5 a speech input unit, 6 an environmental sound input unit, 8 a two-input speech recognition engine, and 10 the speech input/output system. Referring to FIG. 4, the speech input/output system 10 of this embodiment is a speech input/output system capable of executing environment adaptation processing for speech recognition processing that targets at least human speech. It comprises: a two-input speech recognition engine 8 that has two external sound inputs, has an environment adaptation function intended to improve the recognition rate by learning background noise other than the human voice contained in the speech signal, and executes the environmental learning process either simultaneously with the speech recognition processing or as post-processing of it; a signal switching unit 2 that switches the signal input to the two-input speech recognition engine 8; a sample speech pattern storage unit 3 that stores a sample pattern of a human utterance prepared in advance; a mixer unit 4 that mixes (superimposes) the sample speech pattern stored in the sample speech pattern storage unit 3 with the environmental sound signal input from the environmental sound input unit 6 and outputs the result; a speech input unit 5 (microphone) that converts external sound into a speech signal; and an environmental sound input unit 6 (microphone) that converts external environmental sound into an environmental sound signal. The two-input speech recognition engine 8 of this embodiment executes the environmental learning process simultaneously with the speech recognition processing or as post-processing of it, and has a speech input on which a human voice is expected (referred to as speech input channel C) and an environmental sound input on which surrounding (background) noise is expected (referred to as noise input channel D). Its basic functions are substantially the same as those of the one-input speech recognition engine 1 described above.

【００１４】[0014] Next, the operation of the speech input/output system 10 using the two-input speech recognition engine 8 (the speech input/output method) is described with reference to FIGS. 2 and 4. FIG. 2 is a flowchart for explaining the speech input/output method according to the second embodiment of the present invention (the operation of the speech input/output system 10 of the second embodiment). Referring to FIGS. 2 and 4, after speech input starts (step S1), if no human is speaking (step S2: no utterance), the background noise (environmental sound signal) output from the environmental sound input unit 6 is superimposed on the sample speech pattern held in the sample speech pattern storage unit 3 (step S21), and the resulting signal is input to speech input channel C of the two-input speech recognition engine 8 with the signal switching unit 2 set to side A (see FIG. 4) (step S3). The environmental sound signal input from the environmental sound input unit 6 is also input directly to noise input channel D of the two-input speech recognition engine 8. The two-input speech recognition engine 8 thereby performs recognition processing (step S4) for the case in which no human is speaking and learns the environmental information; that is, environment adaptation processing is performed (two-input environment adaptation process).

【００１５】[0015] On the other hand, when a human utterance is made (step S2: utterance), the signal switching unit 2 is set to side B (see FIG. 4) (step S22), the speech signal from the speech input unit 5 is input to speech input channel C, and the environmental sound signal input from the environmental sound input unit 6 is input to noise input channel D. The two-input speech recognition engine 8 thereby performs recognition processing (step S4) on a speech signal that contains the voice actually uttered by the human (two-input speech recognition process).
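
The channel assignment of the second embodiment can be summarised in a short sketch: noise input channel D always receives the environmental sound signal, while speech input channel C receives either the mixed training signal (side A) or the real speech signal (side B). Signals are modelled as lists of floats, the `speaking` flag is assumed to come from an utterance detector that this passage does not describe, and the name `route_two_input` is hypothetical.

```python
from typing import Dict, List, Sequence


def route_two_input(
    speaking: bool,
    sample_pattern: Sequence[float],
    speech_signal: Sequence[float],
    environmental_signal: Sequence[float],
) -> Dict[str, List[float]]:
    """Return the signals fed to channels C and D of the two-input engine."""
    if speaking:
        # Step S22: side B, the speech input unit drives channel C directly.
        channel_c = list(speech_signal)
    else:
        # Steps S21 and S3: side A, channel C gets the sample pattern with
        # the environmental sound superimposed on it.
        channel_c = [s + n for s, n in zip(sample_pattern, environmental_signal)]
    # Channel D receives the environmental sound signal in both cases.
    return {"C": channel_c, "D": list(environmental_signal)}
```

Compared with the one-input case, only channel C is switched; channel D gives the engine a noise reference in both the learning and the recognition case.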

【００１６】[0016] As described above, according to the second embodiment, when a human utterance is to be recognized, the two-input speech recognition process is executed in advance to accumulate learning information about the environment, so that the environmental learning process has been executed at least once; this has the effect that recognition processing can be performed effectively. Conversely, if the two-input speech recognition process is not executed in advance, no environmental information is accumulated inside the two-input speech recognition engine 8 and the environmental learning process is not executed, so it is considered that effective recognition processing by the environment adaptation function cannot be expected when speech input from a human utterance is recognized.

【００１７】[0017] Finally, the present embodiments are compared with the prior art described above, and the technical differences and effects are explained. In the third prior art, described in the above-mentioned Japanese Unexamined Patent Application Publication No. H8-211888, the path that inputs pseudo data, obtained by superimposing training speech on environmental noise, to the environmental learning means is entirely separate from the path that performs ordinary recognition processing. It is therefore considered difficult to perform environmental learning during ordinary recognition processing.

【００１８】[0018] By contrast, the present invention is a speech recognition system that includes a speech recognition engine which has the above environment adaptation function for speech recognition processing that targets at least human speech and which executes the environmental learning process simultaneously with the speech recognition processing or as post-processing of it. The system is configured so that the environmental learning process can be executed in advance by performing speech recognition processing on pseudo-prepared speech data in place of an actual speaker's speech data. Because it switches between the training speech data and the speaker's utterance data on the same path, environmental learning can always be performed together with recognition processing, and the system can also be simplified. In these respects the invention is considered to differ from the prior art in both configuration and effect.

【００１９】[0019] As described above, according to each of the above embodiments, when a human utterance is to be recognized, the one-input speech recognition process or the two-input speech recognition process is executed in advance to accumulate learning information about the environment, so that the environmental learning process has been executed at least once; this has the effect that recognition processing can be performed effectively. Conversely, if the one-input speech recognition process or the two-input speech recognition process is not executed in advance, no environmental information is accumulated inside the one-input speech recognition engine 1 and the environmental learning process is not executed, so it is considered that effective recognition processing by the environment adaptation function cannot be expected when speech input from a human utterance is recognized.

【００２０】[0020] The present invention is not limited to the above embodiments, and it is clear that each embodiment can be modified as appropriate within the scope of the technical idea of the present invention. The number, position, shape, and the like of the constituent members are likewise not limited to those of the above embodiments and can be set to whatever number, position, shape, and the like are suitable for carrying out the present invention. In the drawings, identical components are denoted by identical reference numerals.

【００２１】[0021]

EFFECT OF THE INVENTION

Because the present invention is configured as described above, according to the above embodiments, when a human utterance is to be recognized, the one-input speech recognition process or the two-input speech recognition process is executed in advance to accumulate learning information about the environment, so that the environmental learning process has been executed at least once; this has the effect that recognition processing can be performed effectively. Conversely, if the one-input speech recognition process or the two-input speech recognition process is not executed in advance, no environmental information is accumulated inside the speech recognition engine and the environmental learning process is not executed, so it is considered that effective recognition processing by the environment adaptation function cannot be expected when speech input from a human utterance is recognized.

[Brief description of the drawings]

FIG. 1 is a block diagram for explaining the speech input/output system according to the first embodiment of the present invention.

FIG. 2 is a flowchart for explaining the speech input/output method (the operation of the speech input/output system) according to the first and second embodiments of the present invention.

FIG. 3 is a specific configuration diagram of the speech input/output system of FIG. 1.

FIG. 4 is a block diagram for explaining the speech input/output system according to the second embodiment of the present invention.

FIG. 5 is a block diagram showing the configuration of a conventional system.

FIG. 6 is a block diagram showing the configuration of a conventional system.

[Explanation of symbols]

1: one-input speech recognition engine
2: signal switching unit
3: sample speech pattern storage unit
4: mixer unit
5: speech input unit (microphone)
6: environmental sound input unit (microphone)
8: two-input speech recognition engine
10: speech input/output system
C: speech input channel
D: noise input channel

Claims

[Claims]

1. A speech input/output system capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, the system comprising: a speech recognition engine that executes an environmental learning process simultaneously with the speech recognition processing or as post-processing of the speech recognition processing; means for learning environmental information by superimposing, while no human is speaking, a speech pattern prepared in advance on an external input signal, inputting the result to the speech recognition engine, and executing recognition processing in advance; and means for executing recognition processing, when a human speaks, on the basis of environmental learning information that has been learned in advance and is held internally, in a state in which the environmental learning process has been completed at least once.

2. A speech input/output system capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, the system comprising: a speech recognition engine that executes an environmental learning process simultaneously with the speech recognition processing or as post-processing of the speech recognition processing; means for learning environmental information by superimposing, while no human is speaking, a speech pattern prepared in advance on an external background sound, inputting the result to the speech recognition engine, and executing recognition processing in advance; and means for executing recognition processing, when a human speaks, on the basis of environmental learning information that has been learned in advance and is held internally, in a state in which the environmental learning process has been completed at least once.

3. The speech input/output system according to claim 1, comprising: a speech input unit that converts external sound into a speech signal; the speech recognition engine, being a one-input engine that has only one speech input channel for inputting the speech signal and has an environment adaptation function intended to improve the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching unit that switches the signal input to the one-input speech recognition engine; and a mixer unit that superimposes a sample pattern of a human utterance prepared in advance on the speech signal input from outside.

4. The speech input/output system according to claim 1, comprising: a speech input unit that converts external sound into a speech signal; the speech recognition engine, being a one-input engine that has only one input terminal for the speech signal and has an environment adaptation function intended to improve the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching unit that switches the signal input to the one-input speech recognition engine; a sample speech pattern storage unit that stores a sample pattern of a human utterance prepared in advance; and a mixer unit that superimposes the sample speech pattern stored in the sample speech pattern storage unit on the speech signal input from the speech input unit and outputs the result.

5. The speech input/output system according to claim 4, configured to execute a one-input environment adaptation process in which, when no human is speaking after the start of speech input, the mixer unit superimposes the external input on the sample pattern of human speech held in the sample speech pattern storage unit, the result is input to the one-input speech recognition engine, and the signal switching unit is set to a predetermined side, whereby the one-input speech recognition engine performs recognition processing for the case in which no speech input by a human utterance is actually taking place and learns the environment adaptation.

6. The speech input/output system according to claim 4, configured to execute a one-input speech recognition process in which, when a human utterance is made after the start of speech input, the signal switching unit is set to the other predetermined side and the external speech input is input unchanged to the one-input speech recognition engine, which performs recognition processing.

7. The speech input/output system according to claim 1, comprising: a speech input unit that converts external sound into a speech signal; an environmental sound input unit that converts external environmental sound into an environmental sound signal; a two-input speech recognition engine that has two inputs, namely a speech input channel on which the speech signal is expected and a noise input channel on which the environmental sound signal is expected, and has an environment adaptation function intended to improve the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching unit that switches the signal input to the two-input speech recognition engine; and a mixer unit that superimposes the sample speech pattern stored in the sample speech pattern storage unit on the environmental sound signal input from the environmental sound input unit and outputs the result.

8. The speech input/output system according to claim 7, configured to execute a two-input environment adaptation process in which, when no human is speaking after the start of speech input, a signal generated by superimposing the environmental sound signal output from the environmental sound input unit on the sample speech pattern held in the sample speech pattern storage unit is input to the speech input channel of the two-input speech recognition engine, the signal switching unit is set to a predetermined side, and the environmental sound signal input from the environmental sound input unit is input directly to the noise input channel of the two-input speech recognition engine, whereby the two-input speech recognition engine performs recognition processing for the case in which no human is speaking and learns environmental information.

9. The speech input/output system according to claim 7, configured to execute a two-input speech recognition process in which, when a human utterance is made after the start of speech input, the signal switching unit is set to the other predetermined side, the speech signal from the speech input unit is input to the speech input channel, and the environmental sound signal input from the environmental sound input unit is input to the noise input channel, whereby the two-input speech recognition engine performs recognition processing on a speech signal that contains the voice actually uttered by the human.

10. A speech input/output method capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, the method comprising: a speech recognition step of executing an environmental learning process simultaneously with the speech recognition processing or as post-processing of the speech recognition processing; a step of learning environmental information by superimposing, while no human is speaking, a speech pattern prepared in advance on an external input signal, inputting the result to the speech recognition step, and executing recognition processing in advance; and a step of executing recognition processing, when a human speaks, on the basis of environmental learning information that has been learned in advance and is held internally, in a state in which the environmental learning process has been completed at least once.

11. A speech input/output method capable of executing environment adaptation processing for speech recognition processing that targets at least human speech, the method comprising: a speech recognition step of executing an environmental learning process simultaneously with the speech recognition processing or as post-processing of the speech recognition processing; a step of learning environmental information by superimposing, while no human is speaking, a speech pattern prepared in advance on an external background sound, inputting the result to the speech recognition step, and executing recognition processing in advance; and a step of executing recognition processing, when a human speaks, on the basis of environmental learning information that has been learned in advance and is held internally, in a state in which the environmental learning process has been completed at least once.

12. The speech input/output method according to claim 10, comprising: a speech input step of converting external sound into a speech signal; the speech recognition step, being a one-input step that has only one speech input channel for inputting the speech signal and has an environment adaptation function intended to improve the recognition rate by learning background noise other than the human voice contained in the speech signal; a signal switching step of switching the signal input to the one-input speech recognition step; and a mixing step of superimposing a sample pattern of a human utterance prepared in advance on the speech signal input from outside.

13. A speech inputting step of converting an external speech into a speech signal; and providing a speech signal input terminal only and learning a background noise other than a human voice included in the speech signal to reduce a recognition rate. The voice recognition step having an environment adaptation function for the purpose of improvement, a signal switching step of switching a signal input to the one-input voice recognition step, and a sample voice pattern storage step of storing a sample pattern of a human voice prepared in advance 12. A mixing step of superimposing and outputting a sample voice pattern stored in the sample voice pattern storage step and a voice signal input from the voice input step. Voice input / output method.

14. The speech input/output method according to claim 13, further comprising a step of executing a one-input environment adaptation process in which, when no human is speaking after the start of speech input, the mixing step superimposes the external input on the sample pattern of human speech held in the sample speech pattern storage step, the result is input to the one-input speech recognition step, and the signal switching step is set to a predetermined side, whereby the one-input speech recognition step performs recognition processing for the case in which no speech input by a human utterance is actually taking place and learns the environment adaptation.

15. The speech input/output method according to claim 13 or 14, further comprising a step of executing a one-input speech recognition process in which, when a human utterance is made after the start of speech input, the signal switching step is set to the other predetermined side and the external speech input is input unchanged to the one-input speech recognition step, which performs recognition processing.

16. An audio input step of converting an external sound into an audio signal, an environmental sound input step of converting an external environmental sound into an environmental sound signal, and a sound input for expecting the audio signal to be input. And two noise input channels for inputting environmental sound, which are expected to receive the environmental sound signal, and learning background noise other than the human voice included in the audio signal to reduce the recognition rate. A two-input speech recognition step having an environment adaptation function for the purpose of improvement, a signal switching step of switching a signal to be input to the two-input speech recognition step, and a sample speech pattern stored in the sample speech pattern storage step 11. A mixing step of superimposing and outputting the environmental sound signal input from the environmental sound input step and the environmental sound signal input from the environmental sound input step. Voice input and output method according to.

17. The speech input/output method according to claim 16, further comprising a step of executing a two-input environment adaptation process in which, when no human is speaking after the start of speech input, a signal generated by superimposing the environmental sound signal output from the environmental sound input step on the sample speech pattern held in the sample speech pattern storage step is input to the speech input channel of the two-input speech recognition step, the signal switching step is set to a predetermined side, and the environmental sound signal input from the environmental sound input step is input directly to the noise input channel of the two-input speech recognition step, whereby the two-input speech recognition step performs recognition processing for the case in which no human is speaking and learns environmental information.

18. The speech input/output method according to claim 16 or 17, further comprising a step of executing a two-input speech recognition process in which, when a human utterance is made after the start of speech input, the signal switching step is set to the other predetermined side, the speech signal from the speech input step is input to the speech input channel, and the environmental sound signal input from the environmental sound input step is input to the noise input channel, whereby the two-input speech recognition step performs recognition processing on a speech signal that contains the voice actually uttered by the human.