JP3982289B2

JP3982289B2 - Voice recognition device

Info

Publication number: JP3982289B2
Application number: JP2002068510A
Authority: JP
Inventors: 健大野
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2002-03-13
Filing date: 2002-03-13
Publication date: 2007-09-26
Anticipated expiration: 2022-03-13
Also published as: JP2003271192A

Description

【０００１】
【発明の属する技術分野】
本発明は、入力された音声を認識して、入力された音声に対する認識候補を提示して選択可能にする音声認識装置に関する。
【０００２】
【従来の技術】
従来の音声認識装置として、特開平１１−３５２９９１号公報に開示されたものがある。この音声認識装置は、単音節ごとに区切って発声された音声を認識して認識候補を表示するものである。所望の認識候補が音声入力者によって確定されるまで、次の認識候補を順次表示していくことができる。
【０００３】
【発明が解決しようとする課題】
しかし、従来の音声認識装置では、例えば音声入力時に大きいレベルの騒音が混入して誤認識が生じた場合、認識候補を順次表示させていっても所望の認識候補が表示されないことがある。従って、正しい認識候補の有無が分からないまま、認識候補を１つ１つ確認しながら選択操作を行わなければならなかった。
【０００４】
また、認識候補の順番を変え、誤認識されやすい認識候補を音声との一致度が上位の候補と並べる方策も考えられる。しかし、一致度第１位の認識候補が誤認識で一致度第２位の認識候補が正しい認識である場合であって、第１位の認識候補と他の語が誤認識されやすい場合には問題が生じる。この場合に上記方策を採った場合の出現順位は、第１位の認識候補、これと誤認識されやすい語、第２位の認識候補の順になり、一致度第２位の認識候補の出現順位が下がってしまう。認識候補はあくまでも一致度の順に並べることが望ましい。
【０００５】
本発明の目的は、操作装置を用いて認識候補の選択を行う際に、認識候補の中に誤認識されやすい認識対象語が存在するときは、音声との一致度に関わらず簡単な操作で認識対象語を選択することができる音声認識装置を提供することにある。
【０００６】
【課題を解決する手段】
（１）請求項１の発明による音声認識装置は、音声を入力する音声入力装置と、複数の認識対象語を記憶するとともに、個々の認識対象語と誤認識されやすい他の認識対象語との対応を記憶した記憶装置と、入力された音声と各認識対象語との一致度を演算し、一致度に基づいて複数の認識対象語を複数の認識候補とする候補抽出装置と、複数の認識候補を第１の方向に並べて表示し、認識候補に対して誤認識されやすい他の認識対象語を第２の方向に並べて表示する表示装置と、第１の方向と対応する第１の操作および第２の方向と対応する第２の操作を行う操作装置と、第１の操作により前記認識候補からの選択を行い、第２の操作により選択されている認識候補に対して誤認識されやすい他の認識対象語からの選択を行う選択装置と、誤認識されやすい認識対象語の存在する認識候補が前記第１の操作により選択された場合に、操作装置に対して第２の方向に力を加える力発生装置とを備えることを特徴とする。
（２）請求項２の発明は、音声を入力する音声入力装置と、複数の認識対象語を記憶するとともに、個々の認識対象語と誤認識されやすい他の認識対象語との対応を記憶した記憶装置と、入力された音声と各認識対象語との一致度を演算し、一致度に基づいて複数の認識対象語を複数の認識候補とする候補抽出装置と、複数の認識候補を第１の方向に並べて表示し、認識候補に対して誤認識されやすい他の認識対象語を第２の方向に並べて表示する表示装置と、第１の方向と対応する第１の操作および第２の方向と対応する第２の操作を行う操作装置と、第１の操作により前記認識候補からの選択を行い、第２の操作により選択されている認識候補に対して誤認識されやすい他の認識対象語からの選択を行う選択装置と、第１の操作により選択した認識候補に対して誤認識されやすい認識対象語が第２の操作により最後まで選択された場合に、操作装置に対して第１の方向に力を加える力発生装置とを備えることを特徴とする。
（３）請求項３の発明は、請求項１または２の音声認識装置において、力発生装置により力を加えた前後において、選択されている認識対象語が同一であることを特徴とする。
（４）請求項４の発明は、請求項１乃至３のいずれか一項の音声認識装置において、第１の方向及び第２の方向は互いにほぼ直交することを特徴とする。
（５）請求項５の発明は、請求項１の音声認識装置において、表示装置は、複数の認識候補と、複数の認識候補のうち選択されている認識候補に対して誤認識されやすい認識対象語とを表示するとともに、複数の認識候補のうち未選択の認識候補に対して誤認識されやすい認識対象語は非表示とすることを特徴とする。
【０００７】
【発明の効果】
本発明によれば、次のような効果を奏する。
（１）請求項１〜７の発明によれば、操作装置を用いて認識候補の選択を行う際に、認識候補の中に誤認識されやすい認識対象語が存在するときに、音声の一致度に基づく候補からの選択と誤認識されやすい候補からの選択とを独立の操作で行なうことができる。このため、簡単な操作で所望の入力をすることができる。
（２）請求項２の発明によれば、操作装置の操作方向と表示装置における選択の移動方向が一致するので、操作しやすい認識装置とすることができる。
（３）請求項３の発明によれば、操作装置に対して力を加えるので、第２の方向への操作により誤認識されやすい候補の選択が可能であることを容易に知ることができる。
（４）請求項４の発明によれば、操作装置に対して力を加えるので、第２の方向に操作しても、誤認識されやすい認識対象語がそれ以上にないことを容易に知ることができる。
（５）請求項５の発明によれば、力が加わっただけでは認識対象語の選択が変更されないため、最後に選択された認識対象語が入力したい語である場合に、そのまま入力確定することができる。
（６）請求項６の発明によれば、操作装置を操作した方向とその時に力が加わる方向がほぼ直交するため、力が加えられたことを操作者に認識しやすくすることができる。
（７）請求項７の発明によれば、誤認識されやすい認識対象語の表示を制限するので、表示を簡明にすることができる。
【０００８】
【発明の実施の形態】
＜１．全体構成＞
図１は、本発明の一実施形態による音声認識装置の全体構成を示す図である。
第１の実施の形態における音声認識装置は、音声入力装置であるマイク１０１と、スピーカ１０２と、信号処理ユニット１０３と、操作装置である入力装置１０４と、表示装置であるディスプレイ１０５とを備える。信号処理ユニット１０３は、Ａ／Ｄコンバータ１０３１と、Ｄ／Ａコンバータ１０３２と、出力アンプ１０３３と、制御装置である信号処理装置１０３４と、外部記憶装置１０３５とを有する。
【０００９】
マイク１０１を介して入力された音声は、音声信号として信号処理ユニット１０３のＡ／Ｄコンバータ１０３１に入力される。Ａ／Ｄコンバータ１０３１は、入力された音声信号をデジタル信号に変換して、信号処理装置１０３４に出力する。信号処理装置１０３４は、ＣＰＵ１０３４ａとメモリ１０３４ｂとを有している。この信号処理装置１０３４は、外部記憶装置１０３５に記憶されている認識対象語のデジタルデータと、入力された音声のデジタルデータとの一致度を演算する。外部記憶装置１０３５には、複数の認識対象語が記憶されている。これら認識対象語に、それぞれ誤認識されやすい認識対象語があるときは、これを対応付けて記憶している。
【００１０】
誤認識されやすい語の選定は以下のように行なう。例えば、過去の実験等のデータにおいて、音声との一致度の高い第１の認識候補が「神奈川県」である場合に、実際に入力確定された言葉が「長野県」である頻度が高かったとする。この場合、外部記憶装置１０３５に、「神奈川県」に対して誤認識されやすい認識対象語として、「長野県」を対応付けて記憶させる。あるいは、「神奈川県」と「長野県」とを相互に誤認識されやすいものとして、相互に対応づけて記憶させる。
【００１１】
Ｄ／Ａコンバータ１０３２は、認識対象語のデジタルデータをアナログ信号に変換して、出力アンプ１０３３に出力する。出力アンプ１０３３は、入力されたアナログ信号を増幅して、この信号をスピーカ１０２が音声として出力する。
【００１２】
ディスプレイ１０５は、入力された音声に対する認識候補等を表示するためのものである。入力装置１０４は、操作者の音声認識開始要求入力、入力の取り消し、認識候補選択操作等を検出して信号処理装置１０３４に出力する。入力装置１０４は、例えばジョイスティックを有している。ジョイスティックは、図１の矢印１０４Ａの方向（第１の方向）への回動操作と、矢印１０４Ｂの方向（第２の方向）への回動操作と、矢印１０４Ｃの方向への押し込み操作とが可能である。第１の方向及び第２の方向への回動操作は、ディスプレイ１０５に表示された認識対象語を選択するために行われる。矢印１０４Ｃの方向への押し込み操作は、上記回動操作により選択された認識対象語を確定するために行われる。
【００１３】
＜２．入力装置の構成＞
図２は、入力装置１０４の詳細な構成を示すブロック図である。入力装置１０４は、ジョイスティック５０１と、力発生装置であるジョイスティック駆動モータ５０２ａ、５０２ｂと、ジョイスティック位置センサ５０３ａ、５０３ｂと、ジョイスティック制御ＣＰＵ５０４と、通信デバイス５０５と、その他のスイッチ５０６を備えている。
【００１４】
ジョイスティック５０１は、第１の方向である縦方向の操作、第２の方向である横方向の操作、および押し込み操作が可能である。ここでは縦方向をＹ軸方向とし、Ｙ軸方向の操作は画面縦方向のアイコンの選択に対応するものとする。また横方向をＸ軸方向とし、Ｘ軸方向の操作は画面横方向のアイコンの選択に対応するものとする。
【００１５】
ジョイスティック駆動モータ５０２ａ、５０２ｂは、ジョイスティックの操作方向にトルクを発生する。ジョイスティック位置センサ５０３ａ、５０３ｂは、ジョイスティック５０１の操作方向を検出する。ジョイスティック制御ＣＰＵ５０４は、ジョイスティック駆動モータ５０２ａ、５０２ｂにトルク制御信号を出力する。またジョイスティック制御ＣＰＵ５０４は、ジョイスティック位置センサ５０３ａ、５０３ｂで検出された位置情報を取得する。通信デバイス５０５は、ジョイスティック制御ＣＰＵ５０４から入力されるジョイスティック位置情報を信号処理装置１０３４に出力する。また通信デバイス５０５は、信号処理装置１０３４から出力される発生トルクをジョイスティック制御ＣＰＵ５０４に出力する。
【００１６】
図３に、図２のジョイスティック、駆動モータ及び位置センサの具体的構造を示す。ジョイスティック操作部６０１はジョイスティック５０１の操作部であり、通常、使用者からはこの部分しか見えていない。ジョイスティック回転部６０２は球形状であり、ジョイスティック操作部６０１と一体となっている。ジョイスティック操作部６０１は、複数方向への操作が可能である。本実施例においては図１に示す通り、十字型の溝１０４ＤによりＸ軸、Ｙ軸の２方向のみの操作を可能としている。駆動モータ６０５、６０６は２方向にトルクを発生させるためのモータである。駆動モータ６０５、６０６で発生したトルクはそれぞれトルク伝達部６０３、６０４によってジョイスティック回転部６０２を経てジョイスティック操作部６０１に伝わる。駆動モータ６０５、６０６は、各々図２の駆動モータ５０２ａ、５０２ｂに相当する。
【００１７】
ジョイスティック回転部６０２には電気接点６０９が取り付けられている。この電気接点が外側電気接点６０７又は６０８への接触を検出することによって、どちらの方向に操作されているかが判別される。なお各々の外側電気接点に対しては、ジョイスティック回転部６０２を挟んだ反対側に、それぞれ対をなす外側電気接点（図示せず）が存在する。外側電気接点６０８、６０７は各々図２の位置センサ５０３ａ、５０３ｂに相当する。また、ジョイスティック回転部６０２の下方に位置する図示しない外側電気接点により、ジョイスティックの押し込み操作も検出する。
【００１８】
＜３．処理の概要＞
図４は、上記音声認識装置により行われる処理の手順を示すフローチャートである。この制御は、信号処理ユニット１０３の信号処理装置１０３４により行われる。
【００１９】
ステップＳ２０１において、操作者が入力装置１０４を操作して、音声入力を開始する旨の信号が信号処理装置１０３４に入力されると、その後の処理が開始される。この場合、入力装置１０４は発話スイッチとして機能し、信号処理装置１０３４は音声認識開始要求信号を入力する。
【００２０】
ステップＳ２０２では、音声認識処理を開始する旨を操作者に知らせるための告知音信号を外部記憶装置１０３５から読み込んで、Ｄ／Ａコンバータ１０３２に出力する。Ｄ／Ａコンバータ１０３２でアナログ変換された告知音信号は、出力アンプ１０３３を介してスピーカ１０２から告知音として出力される。操作者は、スピーカ１０２から発せられる告知音を聞いて、マイク１０１に音声入力を開始する。ここでは、本発明による音声認識装置をカーナビゲーション装置に適用した例について取りあげる。すなわち、操作者が目的地を音声入力するものである。説明を容易にするために、ここでは目的地の都道府県の名称を音声入力するものとし、外部記憶装置１０３５には、都道府県の名称が認識対象語として記憶されているものとする。
【００２１】
次のステップＳ２０３では、入力された音声の取り込みを開始する。操作者がマイク１０１に向かって発した音声は、Ａ／Ｄコンバータ１０３１でデジタル信号に変換された後、信号処理装置１０３４に入力される。マイク１０１は、不図示の電源から電力が供給されると、ステップＳ２０１で操作者が入力装置１０４を操作する前から、周辺の音を拾ってＡ／Ｄコンバータ１０３１に出力している。Ａ／Ｄコンバータ１０３１で変換されたデジタル信号は、信号処理装置１０３４に入力される。
【００２２】
信号処理装置１０３４は、ステップＳ２０１で操作者が入力装置１０４を操作して音声認識開始要求信号が入力するまでは、入力されるデジタル信号の平均パワーを演算している。入力装置１０４が操作されて音声認識開始要求信号が入力され、続いて音声が入力されると、演算していたデジタル信号の平均パワーより大きいパワーのデジタル信号が入力される。従って、信号処理装置１０３４は、演算していた平均パワーより所定値以上のパワーのデジタル信号が入力されたときに、操作者がマイク１０１に向かって音声入力を行ったと判断し、音声の取り込みを開始する。
【００２３】
音声の取り込みを開始すると、ステップＳ２０４に進む。ステップＳ２０４では、取り込んだ音声と、外部記憶装置１０３５に記憶されている認識対象語との一致度を演算する。信号処理装置１０３４は、取り込みを開始した音声のデジタル信号のうち、信号のパワーに基づいて、操作者が発した音声区間の開始を識別しておく。この音声区間の開始以降のデジタル信号と、外部記憶装置１０３５に記憶されている複数の認識対象語のデジタル信号とが、それぞれどれほど似ているか（一致度）を常時演算し、数値化していく。数値化された一致度の値が大きいほど、比較している両者が似ていることを意味する。なお、並列処理により、一致度の演算が行われている間も、音声の取り込みは継続して行われている。
【００２４】
取り込んでいる音声のデジタル信号のパワーが所定値以下となる時間が所定時間以上継続すると、操作者による音声入力が終了したと判断して、ステップＳ２０５にて音声の取り込みを終了する。
【００２５】
次のステップＳ２０６では、一致度の演算処理が終了した後に、一致度の大きい順に所定の数の認識対象語を抽出して認識候補とする。抽出する認識対象語の所定の数は、予め定めることができ、例えば１０である。
【００２６】
所定の数の認識候補が抽出されると、ステップＳ２０７に進む。ステップＳ２０７では、入力装置１０４の操作信号を受信して、ディスプレイ１０５に表示された認識候補の中から、所望の認識対象語の選択および決定処理を行なう。
【００２７】
すなわち、ジョイスティックの回動操作や押し込み操作は、ジョイスティック位置センサ５０３ａ、５０３ｂ等にて検出され、その信号がジョイスティック制御ＣＰＵ５０４に送られる。そして、その信号は通信デバイス５０５を介して信号処理装置１０３４に入力される。信号処理装置１０３４は、ジョイスティックの回動操作信号を受信すると認識対象語の選択を実行する。この選択は、更なる回動操作により変更することができる。所望の認識対象語が選択された状態でジョイスティックの押し込み操作信号を受信すると、認識対象語の確定処理を実行して本制御を終了する。
【００２８】
＜４．認識候補決定処理の詳細＞
図５は、図４のステップＳ２０７における処理の詳細を示すフローチャートである。まず、音声の一致度に従って抽出された認識候補を、ステップＳ３０１において、ディスプレイ１０５に表示させる。
【００２９】
図６は、図５のステップＳ３０１で出力される認識候補の画面表示の一例である。着色部７０１の表示は選択状態を示しており、この時点ではどの認識候補も選択されていない。図６では、一致度が高い順に３つの認識候補が、第１の方向である縦方向に並べられて表示されている。抽出した認識候補の所定の数を１０とした場合、一致度が佐賀県より小さい７つの認識候補がさらに存在する。なお、ディスプレイ１０５には、認識候補とともに一致度も表示することが望ましい。
【００３０】
図７に、認識候補の並びおよび認識候補と誤認識されやすい候補の並びの対応関係を示す。認識候補は、音声に対する音響的な近さの順番で縦方向に並べられ、ここでは神奈川県、滋賀県、佐賀県の順番である。ある認識候補に対して誤認識されやすい候補が存在する場合には、当該認識候補の横に、横方向に並べられる。図７の例では神奈川県に対して誤認識されやすい語が１つあり、それが長野県である。例えば、話者が「長野県」と発音した場合でも、語頭に雑音が入った結果、音声が最も一致する候補が「神奈川県」となってしまうことがある。この場合、音声の一致度に関わらず「長野県」を容易に選択できるようにするのである。
【００３１】
図６の状態から、使用者が下方向（図３の６１０方向）にジョイスティックを操作すると（ステップＳ３０２：ＹＥＳ）、図３の電気接点６０９が外側電気接点６０７の反対側にある外側電気接点と接触することで操作方向が検出される。この結果、図７の神奈川県８０１を選択しようとしたことが信号処理ユニット１０３に伝達される。これを受けて信号処理ユニット１０３は図８の９０１の通り、神奈川県が選択状態となるように表示を変更する（ステップＳ３０３）。
【００３２】
選択された「神奈川県」に対し、誤認識されやすい語（長野県）がある場合（Ｓ３０４：ＹＥＳ）、駆動モータ６０６がトルクを発生させる。このトルクにより、図３のジョイスティック操作部６０１が右方向（図３の６１１方向）の力を使用者に伝える。しかしこのトルクの継続時間は極めて短いため、電気接点６０９が新たに他の接点に接触することはない。同時に、「神奈川県」に対して第２の方向である横方向の位置に、誤認識候補である「長野県」８０３を表示する（ステップＳ３０５）。使用者は、このトルクと誤認識候補の表示とに基づき、右方向にジョイスティックを操作すれば、神奈川県に間違われやすい候補を選択できることを、知ることができる。また、このように認識候補の「神奈川県」が選択されてから、これと誤認識されやすい「長野県」を表示する。これにより、ステップＳ３０１における表示を簡明にすることができ、表示処理を高速にすることができる。
【００３３】
ここで使用者がジョイスティックを下方に操作すると（ステップＳ３０６：「下」）、次の認識候補（ここでは滋賀県）を選択し（ステップＳ３１１）、ステップＳ３０４に戻る。ジョイスティックを押し込み操作すると（ステップＳ３０６：「決定」）、入力語は「神奈川県」で確定される。
【００３４】
使用者がジョイスティックを右に操作、すなわち図３の６１１方向にジョイスティックを操作すると（ステップＳ３０６：「右」）、図３の電気接点６０９が外側電気接点６０８の反対側にある外側電気接点と接触することで操作方向が検出される。操作が検出されると、その信号は信号処理ユニット１０３に伝達される。これを受けて信号処理ユニット１０３は図９の１００１に示すように長野県が選択状態となるように表示を変更する（ステップＳ３０７）。
【００３５】
ここで、長野県以外に、「神奈川県」と誤認識されやすい語がある場合には（ステップＳ３０８：ＮＯ）、ステップＳ３０６に戻って次の操作を待つ。この場合、ステップＳ３０６においてジョイスティックを右に操作して更に右隣の誤認識候補を選択してもよく（ステップＳ３０７）、ジョイスティックを下に操作して次の認識候補（滋賀県）を選択しても良く（ステップＳ３１１）、ジョイスティックを押し込み操作して現在選択されている誤認識候補（長野県）に確定させてもよい。
【００３６】
長野県が、「神奈川県」と誤認識されやすい語のうち最後の候補であれば（ステップＳ３０８：ＹＥＳ）、図３の６１０に示す下方向のトルクを発生させるよう駆動モータ６０５に信号を送信する。これにより駆動モータ６０５はトルクを発生させ、図３のジョイスティック操作部６０１は６１０方向に動くようなトルクを使用者の手に伝える（ステップＳ３０９）。しかしこのトルクの継続時間は極めて短いため、それだけで電気接点６０９が新たに他の接点に接触することはない。使用者は、このトルクに基づき、これ以上誤認識されやすい候補は存在しないこと、未選択の認識対象語を選択するには次に下方向にジョイスティックを操作した方が良いことを、知ることができる。特に、ディスプレイ１０５の表示領域内に、誤認識されやすい語を数多く並べて同時に表示することができない場合でも、上記トルクに基づき、現在表示されている語の後に誤認識されやすい語があるか否かを容易に知ることができる。
【００３７】
そこで操作者がジョイスティックを下に操作、すなわち図３の６１０方向にジョイスティックを操作すると（ステップＳ３１０：「下」）、図１０に示すように、音声との一致度の順に抽出した次の認識候補である「滋賀県」を選択した状態となり（ステップＳ３１１）、ステップＳ３０４に戻る。なお、「滋賀県」を選択した場合、表示を簡明にするため、前の認識候補「神奈川県」に誤認識されやすい語「長野県」は、非表示にしてもよい。一方、ジョイスティックの押し込み操作をすると（ステップＳ３１０：「決定」）、入力語は現在選択されている「長野県」で確定される。
【００３８】
ステップＳ３０４において、例えば図７の「滋賀県」のように、選択されている認識候補に対して誤認識されやすい語が登録されていない場合（ステップＳ３０４：ＮＯ）、次の操作を待つ。ジョイスティックの押し込み操作がされれば（ステップＳ３１２：「決定」）、選択されている認識候補に確定される。ジョイスティックが下方に操作されれば（ステップＳ３１２：「下」）、次の認識候補を選択し（ステップＳ３１３）、ステップＳ３０４に戻る。なお、ステップＳ３１２において、「決定」又は「下」に限らず、上方への操作を許容してもよい。その場合、１つ前の認識候補を選択し、ステップＳ３０４に戻る。ステップＳ３１２において、右又は左へ操作された場合には、図示しないエラーメッセージを出力してもよいし、それぞれ下又は上への操作がされた場合と同様に処理しても良い。
【００３９】
なお、選択された認識候補又は誤認識されやすい語は、ディスプレイ１０５において反転表示されると同時に（図８〜図１０）、スピーカ１０２により合成音声で操作者に知らされる。これにより、操作者は選択した語が何であるかを正確に知ることができる。
【００４０】
以上の動作により、使用者は縦方向、横方向の操作により所望の候補を効率よく探すことが可能である。
【００４１】
上述した実施の形態では、誤認識されやすい認識候補が１番目にある場合について説明したが、誤認識されやすい認識候補の順番は何番目でもよい。
【００４２】
また、本実施形態では入力装置１０４としてジョイスティックを例にとって説明したが、２方向以上の操作が可能なものであれば他の如何なる入力装置でもよい。望ましくは、十字キー、トラックボールなど、単一の操作部で２方向以上の操作ができるものであれば、１つの方向に向かって操作しているときに、他の方向へのトルクを使用者の手に伝えることができる。
【００４３】
上述した実施の形態では、本発明による音声認識装置をカーナビゲーション装置に適用した例について説明したが、カーナビゲーション装置以外のものにも適用することができる。
【図面の簡単な説明】
【図１】本発明の一実施形態による音声認識装置の全体構成を示す図
【図２】入力装置の詳細な構成を示すブロック図
【図３】図２のジョイスティック、駆動モータ及び位置センサの具体的構造を示す図
【図４】上記音声認識装置により行われる処理の手順を示すフローチャート
【図５】図４のステップＳ２０７における処理の詳細を示すフローチャート
【図６】図５のステップＳ３０１で出力される認識候補の画面表示の一例を示す図
【図７】認識候補の並びおよび認識候補と誤認識されやすい候補の並びの対応関係を示す図
【図８】図５のステップＳ３０３で出力される認識候補の画面表示の一例を示す図
【図９】図５のステップＳ３０７で出力される認識候補の画面表示の一例を示す図
【図１０】図５のステップＳ３１１で出力される認識候補の画面表示の一例を示す図
【符号の説明】
１０１…マイク（音声入力装置）、
１０２…スピーカ、
１０３…信号処理ユニット、
１０３１…Ａ／Ｄコンバータ、
１０３２…Ｄ／Ａコンバータ、
１０３３…出力アンプ、
１０３４…信号処理装置（制御装置）、
１０３４ａ…ＣＰＵ、
１０３４ｂ…メモリ、
１０３５…外部記憶装置、
１０４…入力装置（操作装置）、
５０６…スイッチ、
５０２ａ、５０２ｂ…駆動モータ（力発生装置）、
５０４…ジョイスティック制御ＣＰＵ、
５０３ａ、５０３ｂ…ジョイスティック位置センサ、
５０５…通信デバイス、
１０５…ディスプレイ（表示装置）

[0001]
[Technical field of the invention]
The present invention relates to a speech recognition apparatus that recognizes input speech and presents recognition candidates for the input speech so that one of them can be selected.
[0002]
[Prior art]
A conventional speech recognition apparatus is disclosed in Japanese Patent Application Laid-Open No. 11-352991. This speech recognition apparatus recognizes speech uttered one syllable at a time and displays recognition candidates. The next recognition candidate can be displayed in sequence until the speaker confirms the desired recognition candidate.
[0003]
[Problems to be solved by the invention]
However, in the conventional speech recognition apparatus, when erroneous recognition occurs, for example because loud noise is mixed in during speech input, the desired recognition candidate may never be displayed even if the recognition candidates are displayed one after another. The user therefore had to perform the selection operation while checking the recognition candidates one by one, without knowing whether a correct recognition candidate existed at all.
[0004]
Another conceivable measure is to change the order of the recognition candidates so that words easily misrecognized as a candidate are listed alongside the candidates having a high degree of coincidence with the speech. However, a problem arises when the recognition candidate ranked first by degree of coincidence is a misrecognition, the candidate ranked second is the correct one, and the first-ranked candidate is easily confused with another word. With this measure, the order of appearance becomes the first-ranked candidate, the word easily confused with it, and then the second-ranked candidate, so the second-ranked candidate appears later than it otherwise would. It is desirable that the recognition candidates be arranged strictly in order of degree of coincidence.
[0005]
An object of the present invention is to provide a speech recognition apparatus in which, when a recognition candidate is selected using an operation device and a recognition target word that is easily misrecognized exists among the recognition candidates, that recognition target word can be selected with a simple operation regardless of its degree of coincidence with the speech.
[0006]
[Means for solving the problems]
(1) A speech recognition apparatus according to the invention of claim 1 comprises: a speech input device for inputting speech; a storage device that stores a plurality of recognition target words and also stores the correspondence between each recognition target word and other recognition target words that are easily misrecognized as it; a candidate extraction device that calculates the degree of coincidence between the input speech and each recognition target word and, based on the degree of coincidence, sets a plurality of recognition target words as a plurality of recognition candidates; a display device that displays the plurality of recognition candidates arranged in a first direction and displays the other recognition target words that are easily misrecognized with respect to a recognition candidate arranged in a second direction; an operation device for performing a first operation corresponding to the first direction and a second operation corresponding to the second direction; a selection device that selects from among the recognition candidates in response to the first operation and, in response to the second operation, selects from among the other recognition target words that are easily misrecognized with respect to the selected recognition candidate; and a force generation device that applies a force in the second direction to the operation device when a recognition candidate for which an easily misrecognized recognition target word exists is selected by the first operation.
(2) The invention of claim 2 is a speech recognition apparatus comprising: a speech input device for inputting speech; a storage device that stores a plurality of recognition target words and also stores the correspondence between each recognition target word and other recognition target words that are easily misrecognized as it; a candidate extraction device that calculates the degree of coincidence between the input speech and each recognition target word and, based on the degree of coincidence, sets a plurality of recognition target words as a plurality of recognition candidates; a display device that displays the plurality of recognition candidates arranged in a first direction and displays the other recognition target words that are easily misrecognized with respect to a recognition candidate arranged in a second direction; an operation device for performing a first operation corresponding to the first direction and a second operation corresponding to the second direction; a selection device that selects from among the recognition candidates in response to the first operation and, in response to the second operation, selects from among the other recognition target words that are easily misrecognized with respect to the selected recognition candidate; and a force generation device that applies a force in the first direction to the operation device when the recognition target words that are easily misrecognized with respect to the recognition candidate selected by the first operation have been selected through to the last one by the second operation.
(3) The invention of claim 3 is characterized in that, in the speech recognition apparatus of claim 1 or 2, the selected recognition target words are the same before and after the force is applied by the force generator.
(4) The invention of claim 4 is characterized in that, in the speech recognition apparatus of any one of claims 1 to 3, the first direction and the second direction are substantially orthogonal to each other.
(5) The invention of claim 5 is characterized in that, in the speech recognition apparatus of claim 1, the display device displays the plurality of recognition candidates together with the recognition target words that are easily misrecognized with respect to the selected recognition candidate among them, and does not display the recognition target words that are easily misrecognized with respect to the unselected recognition candidates.
[0007]
[Effects of the invention]
The present invention has the following effects.
(1) According to the inventions of claims 1 to 7, when a recognition candidate is selected using the operation device and a recognition target word that is easily misrecognized exists among the recognition candidates, selection from the candidates based on the degree of coincidence with the speech and selection from the easily misrecognized candidates can be performed by independent operations. Therefore, a desired input can be made with a simple operation.
(2) According to the invention of claim 2, since the operating direction of the operation device matches the direction in which the selection moves on the display device, an easy-to-operate recognition apparatus can be obtained.
(3) According to the invention of claim 3, since a force is applied to the operation device, the operator can easily know that easily misrecognized candidates can be selected by an operation in the second direction.
(4) According to the invention of claim 4, since a force is applied to the operation device, the operator can easily know that no further easily misrecognized recognition target words exist even if the operation device is operated in the second direction.
(5) According to the invention of claim 5, since the selection of the recognition target word is not changed merely by the application of force, the input can be confirmed as it is when the most recently selected recognition target word is the word to be input.
(6) According to the invention of claim 6, since the direction in which the operating device is operated and the direction in which the force is applied at that time are almost orthogonal to each other, the operator can easily recognize that the force has been applied.
(7) According to the invention of claim 7, since the display of the recognition target words that are easily misrecognized is limited, the display can be simplified.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
<1. Overall configuration>
FIG. 1 is a diagram showing an overall configuration of a speech recognition apparatus according to an embodiment of the present invention.
The speech recognition device according to the first embodiment includes a microphone 101 that is a speech input device, a speaker 102, a signal processing unit 103, an input device 104 that is an operation device, and a display 105 that is a display device. The signal processing unit 103 includes an A / D converter 1031, a D / A converter 1032, an output amplifier 1033, a signal processing device 1034 that is a control device, and an external storage device 1035.
[0009]
Speech input via the microphone 101 is input to the A / D converter 1031 of the signal processing unit 103 as an audio signal. The A / D converter 1031 converts the input audio signal into a digital signal and outputs the digital signal to the signal processing device 1034. The signal processing device 1034 includes a CPU 1034a and a memory 1034b. The signal processing device 1034 calculates the degree of coincidence between the digital data of the recognition target words stored in the external storage device 1035 and the digital data of the input speech. The external storage device 1035 stores a plurality of recognition target words. When any of these recognition target words has other recognition target words that are easily misrecognized as it, those words are stored in association with it.
[0010]
Words that are easily misrecognized are selected as follows. For example, suppose that data from past experiments and the like show that, when the first recognition candidate having a high degree of coincidence with the speech is “Kanagawa Prefecture”, the word actually confirmed as the input is frequently “Nagano Prefecture”. In this case, “Nagano Prefecture” is stored in the external storage device 1035 in association with “Kanagawa Prefecture” as a recognition target word that is easily misrecognized as it. Alternatively, “Kanagawa Prefecture” and “Nagano Prefecture” are stored in association with each other as words that are easily misrecognized as one another.
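The selection rule in this paragraph could be sketched as follows. This is only an illustration under stated assumptions: the log format, the function name `build_confusion_map`, and the `min_count` threshold are hypothetical, since the description says only that the confirmed word differed from the first candidate with "high frequency".

```python
from collections import Counter, defaultdict


def build_confusion_map(logs, min_count=2, mutual=False):
    """Build {first candidate: [easily misrecognized words]} from past sessions.

    logs      -- iterable of (first_candidate, confirmed_word) pairs (assumed format)
    min_count -- hypothetical threshold standing in for "high frequency"
    mutual    -- if True, also store the reverse association
    """
    # Count only the sessions in which the confirmed word differed
    # from the first recognition candidate.
    counts = Counter((top, confirmed) for top, confirmed in logs if top != confirmed)
    confusions = defaultdict(list)
    for (top, confirmed), n in counts.most_common():
        if n < min_count:
            continue
        if confirmed not in confusions[top]:
            confusions[top].append(confirmed)
        if mutual and top not in confusions[confirmed]:
            confusions[confirmed].append(top)
    return dict(confusions)


# Illustrative log data, not taken from the patent.
example_logs = (
    [("Kanagawa Prefecture", "Nagano Prefecture")] * 3
    + [("Kanagawa Prefecture", "Kanagawa Prefecture")] * 5
    + [("Shiga Prefecture", "Saga Prefecture")]
)
print(build_confusion_map(example_logs))
```

Passing `mutual=True` corresponds to the alternative in which the two words are associated with each other.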
[0011]
The D / A converter 1032 converts the digital data of the recognition target word into an analog signal and outputs it to the output amplifier 1033. The output amplifier 1033 amplifies the input analog signal, and the speaker 102 outputs this signal as sound.
[0012]
The display 105 displays the recognition candidates and the like for the input speech. The input device 104 detects the operator's voice recognition start request input, input cancellation, recognition candidate selection operations, and the like, and outputs them to the signal processing device 1034. The input device 104 has a joystick, for example. The joystick allows a rotation operation in the direction of the arrow 104A (first direction) in FIG. 1, a rotation operation in the direction of the arrow 104B (second direction), and a push-in operation in the direction of the arrow 104C. The rotation operations in the first direction and the second direction are performed to select a recognition target word displayed on the display 105. The push-in operation in the direction of the arrow 104C is performed to confirm the recognition target word selected by the rotation operation.
[0013]
<2. Configuration of input device>
FIG. 2 is a block diagram illustrating a detailed configuration of the input device 104. The input device 104 includes a joystick 501, joystick drive motors 502 a and 502 b that are force generation devices, joystick position sensors 503 a and 503 b, a joystick control CPU 504, a communication device 505, and other switches 506.
[0014]
The joystick 501 can be operated in the vertical direction as the first direction, in the horizontal direction as the second direction, and pushed in. Here, the vertical direction is the Y-axis direction, and the operation in the Y-axis direction corresponds to selection of an icon in the vertical direction of the screen. The horizontal direction is the X-axis direction, and an operation in the X-axis direction corresponds to selection of an icon in the horizontal direction of the screen.
[0015]
The joystick drive motors 502a and 502b generate torque in the operation directions of the joystick. The joystick position sensors 503a and 503b detect the operation direction of the joystick 501. The joystick control CPU 504 outputs a torque control signal to the joystick drive motors 502a and 502b. The joystick control CPU 504 also acquires the position information detected by the joystick position sensors 503a and 503b. The communication device 505 outputs the joystick position information input from the joystick control CPU 504 to the signal processing device 1034. The communication device 505 also passes the torque to be generated, as output from the signal processing device 1034, to the joystick control CPU 504.
[0016]
FIG. 3 shows a specific structure of the joystick, drive motor, and position sensor of FIG. The joystick operation unit 601 is an operation unit of the joystick 501, and usually only this portion is visible to the user. The joystick rotation unit 602 has a spherical shape and is integrated with the joystick operation unit 601. The joystick operation unit 601 can be operated in a plurality of directions. In this embodiment, as shown in FIG. 1, the operation in only two directions of the X axis and the Y axis is enabled by the cross-shaped groove 104D. Drive motors 605 and 606 are motors for generating torque in two directions. Torques generated by the drive motors 605 and 606 are transmitted to the joystick operation unit 601 through the joystick rotation unit 602 by the torque transmission units 603 and 604, respectively. The drive motors 605 and 606 correspond to the drive motors 502a and 502b in FIG. 2, respectively.
[0017]
An electrical contact 609 is attached to the joystick rotation unit 602. The direction in which the joystick is operated is determined by detecting contact of this electrical contact with the outer electrical contact 607 or 608. For each outer electrical contact, a paired outer electrical contact (not shown) exists on the opposite side across the joystick rotation unit 602. The outer electrical contacts 608 and 607 correspond to the position sensors 503a and 503b in FIG. 2, respectively. The joystick push-in operation is also detected by an outer electrical contact (not shown) located below the joystick rotation unit 602.
[0018]
<3. Outline of processing>
FIG. 4 is a flowchart showing a procedure of processing performed by the voice recognition apparatus. This control is performed by the signal processing device 1034 of the signal processing unit 103.
[0019]
In step S201, when the operator operates the input device 104 and a signal indicating the start of voice input is input to the signal processing device 1034, the subsequent processing is started. In this case, the input device 104 functions as a speech switch, and the signal processing device 1034 receives a voice recognition start request signal.
[0020]
In step S202, a notification sound signal for informing the operator that voice recognition processing is about to start is read from the external storage device 1035 and output to the D / A converter 1032. The notification sound signal converted to analog by the D / A converter 1032 is output as a notification sound from the speaker 102 via the output amplifier 1033. On hearing the notification sound from the speaker 102, the operator starts voice input to the microphone 101. Here, an example in which the speech recognition apparatus according to the present invention is applied to a car navigation apparatus is described, that is, the operator inputs a destination by voice. To simplify the explanation, it is assumed here that the name of the destination prefecture is input by voice and that the names of the prefectures are stored in the external storage device 1035 as recognition target words.
[0021]
In the next step S203, capture of the input speech is started. The speech uttered by the operator toward the microphone 101 is converted into a digital signal by the A / D converter 1031 and then input to the signal processing device 1034. Once power is supplied from a power source (not shown), the microphone 101 picks up ambient sound and outputs it to the A / D converter 1031 even before the operator operates the input device 104 in step S201. The digital signal converted by the A / D converter 1031 is input to the signal processing device 1034.
[0022]
Until the operator operates the input device 104 in step S201 and the voice recognition start request signal is input, the signal processing device 1034 calculates the average power of the incoming digital signal. When the input device 104 is operated, the voice recognition start request signal is input, and speech is then input, a digital signal with a power higher than the calculated average power arrives. Therefore, when a digital signal whose power exceeds the calculated average power by a predetermined value or more is input, the signal processing device 1034 determines that the operator has spoken into the microphone 101 and starts capturing the speech.
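A minimal sketch of this start-of-speech decision is given below. The division of the signal into frames, the mean-square definition of power, and the names `frame_power`, `find_speech_onset`, and `margin` are assumptions made for illustration; the description specifies only an average power and a predetermined value.

```python
def frame_power(frame):
    """Mean-square power of one frame of samples (assumed definition of power)."""
    return sum(sample * sample for sample in frame) / len(frame)


def find_speech_onset(background_frames, frames, margin):
    """Return the index of the first frame judged to contain speech, or None.

    background_frames -- frames received before the start request (step S201)
    frames            -- frames received after the start request
    margin            -- the "predetermined value" added to the average power
    """
    background = [frame_power(frame) for frame in background_frames]
    average_power = sum(background) / len(background)
    for index, frame in enumerate(frames):
        if frame_power(frame) >= average_power + margin:
            return index
    return None


# Illustrative samples: quiet background followed by a louder frame.
quiet = [1, -1, 1, -1]
loud = [10, -10, 10, -10]
print(find_speech_onset([quiet] * 4, [quiet, quiet, loud], margin=50))
```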
[0023]
When capture of the speech has started, the process proceeds to step S204. In step S204, the degree of coincidence between the captured speech and the recognition target words stored in the external storage device 1035 is calculated. Within the digital signal of the speech being captured, the signal processing device 1034 identifies the start of the speech section uttered by the operator based on the power of the signal. How closely the digital signal from the start of this speech section onward resembles the digital signal of each of the recognition target words stored in the external storage device 1035 (the degree of coincidence) is calculated continuously and expressed as a numerical value. The larger the value of the degree of coincidence, the more similar the two signals being compared. Note that, through parallel processing, the speech continues to be captured while the degree of coincidence is being calculated.
[0024]
When the power of the digital signal of the speech being captured remains at or below a predetermined value for a predetermined time or longer, it is determined that the operator has finished the voice input, and capture of the speech is terminated in step S205.
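The end-of-input decision might look like the sketch below. It assumes the signal has already been reduced to one power value per frame, so that the "predetermined time" becomes a number of consecutive frames; the names `find_speech_end`, `power_threshold`, and `min_quiet_frames` are hypothetical.

```python
def find_speech_end(frame_powers, power_threshold, min_quiet_frames):
    """Return the index of the frame at which input is judged finished, or None.

    Input is judged finished once the power has stayed at or below
    power_threshold for min_quiet_frames consecutive frames.
    """
    quiet_run = 0
    for index, power in enumerate(frame_powers):
        if power <= power_threshold:
            quiet_run += 1
            if quiet_run >= min_quiet_frames:
                return index
        else:
            # A louder frame interrupts the quiet stretch, so start counting again.
            quiet_run = 0
    return None


# Illustrative power values: a short pause inside the utterance is ignored.
print(find_speech_end([100, 90, 1, 80, 1, 1, 1, 70], power_threshold=5, min_quiet_frames=3))
```

Resetting the counter on every louder frame keeps a brief pause within a word from being mistaken for the end of the utterance.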
[0025]
In the next step S206, after the calculation of the degrees of coincidence has finished, a predetermined number of recognition target words are extracted in descending order of degree of coincidence and set as the recognition candidates. The number of recognition target words to be extracted is set in advance and is, for example, 10.
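Step S206 amounts to a top-N selection by score, which can be sketched as follows. The score values are made up for illustration, and the function name `extract_candidates` is hypothetical; the patent does not specify how the degree of coincidence is scaled.

```python
import heapq


def extract_candidates(scores, limit=10):
    """Return up to `limit` (word, score) pairs in descending order of score.

    scores -- mapping from recognition target word to its degree of coincidence
    """
    return heapq.nlargest(limit, scores.items(), key=lambda item: item[1])


# Hypothetical degrees of coincidence for a few recognition target words.
example_scores = {
    "Nagano Prefecture": 0.40,
    "Saga Prefecture": 0.72,
    "Kanagawa Prefecture": 0.91,
    "Shiga Prefecture": 0.75,
}
for word, score in extract_candidates(example_scores, limit=3):
    print(word, score)
```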
[0026]
When the predetermined number of recognition candidates have been extracted, the process proceeds to step S207. In step S207, operation signals from the input device 104 are received, and the desired recognition target word is selected and confirmed from among the recognition candidates displayed on the display 105.
[0027]
That is, a rotation operation or push-in operation of the joystick is detected by the joystick position sensors 503a and 503b and the like, and the resulting signal is sent to the joystick control CPU 504. The signal is then input to the signal processing device 1034 via the communication device 505. On receiving a joystick rotation operation signal, the signal processing device 1034 selects a recognition target word. This selection can be changed by a further rotation operation. When a joystick push-in operation signal is received while the desired recognition target word is selected, the signal processing device 1034 executes the process of confirming the recognition target word and ends this control.
[0028]
<4. Details of recognition candidate determination process>
FIG. 5 is a flowchart showing details of the processing in step S207 of FIG. First, recognition candidates extracted according to the degree of coincidence of voice are displayed on the display 105 in step S301.
[0029]
FIG. 6 is an example of the screen display of recognition candidates output in step S301 of FIG. 5. The shaded portion 701 indicates the selected state; at this point no recognition candidate is selected. In FIG. 6, the three recognition candidates with the highest degrees of coincidence are displayed arranged in the vertical direction, which is the first direction, in descending order of degree of coincidence. When the predetermined number of extracted recognition candidates is 10, there are seven further recognition candidates whose degree of coincidence is lower than that of Saga Prefecture. Note that it is desirable to display the degree of coincidence together with each recognition candidate on the display 105.
[0030]
FIG. 7 shows the arrangement of the recognition candidates and the correspondence between the recognition candidates and the candidates that are easily misrecognized. The recognition candidates are arranged in the vertical direction in order of acoustic closeness to the speech, here in the order of Kanagawa Prefecture, Shiga Prefecture, and Saga Prefecture. If a candidate is easily misrecognized with respect to a certain recognition candidate, it is arranged in the horizontal direction next to that recognition candidate. In the example of FIG. 7, there is one word that is easily misrecognized as Kanagawa Prefecture, namely Nagano Prefecture. For example, even if the speaker pronounces “Nagano Prefecture”, noise at the beginning of the word may cause “Kanagawa Prefecture” to become the candidate that best matches the voice. Even in this case, “Nagano Prefecture” can be easily selected regardless of its degree of coincidence with the speech.
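The arrangement of FIG. 7 can be pictured as a list of rows, one per recognition candidate, each carrying the words registered as easily misrecognized with respect to it. The sketch below is only an illustration: the names `Candidate`, `build_grid`, and `confusion_table` are hypothetical, and the assumption that Saga Prefecture has no registered words is made for the example, since the description states this only for Shiga Prefecture.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One row of the grid: a recognition candidate and its confusable words."""
    word: str
    confusables: list = field(default_factory=list)


def build_grid(ranked_words, confusion_table):
    """Attach the stored confusable words to each ranked recognition candidate.

    `ranked_words` is already in descending order of coincidence (first
    direction); each row's `confusables` extend in the second direction.
    """
    return [Candidate(w, list(confusion_table.get(w, []))) for w in ranked_words]


# Stored correspondence (hypothetical contents of the storage device).
confusion_table = {"Kanagawa Prefecture": ["Nagano Prefecture"]}

grid = build_grid(
    ["Kanagawa Prefecture", "Shiga Prefecture", "Saga Prefecture"],
    confusion_table,
)
```

Keeping the confusable words in a separate table leaves the vertical order untouched, so the candidates stay sorted strictly by degree of coincidence.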
[0031]
When the user operates the joystick downward (direction 610 in FIG. 3) from the state of FIG. 6 (step S302: YES), the electrical contact 609 in FIG. 3 touches an outer electrical contact, and the operation direction is detected from this contact. As a result, the intention to select Kanagawa Prefecture 801 in FIG. 7 is transmitted to the signal processing unit 103. In response, the signal processing unit 103 changes the display so that Kanagawa Prefecture is in the selected state, as indicated at 901 in FIG. 8 (step S303).
[0032]
When there is a word (Nagano Prefecture) that is easily misrecognized as the selected “Kanagawa Prefecture” (S304: YES), the drive motor 606 generates torque. With this torque, the joystick operation unit 601 in FIG. 3 transmits a force in the right direction (direction 611 in FIG. 3) to the user. However, since the duration of this torque is extremely short, the torque does not bring the electrical contact 609 into contact with another contact. At the same time, “Nagano Prefecture” 803, which is the misrecognition candidate, is displayed next to “Kanagawa Prefecture” in the horizontal direction, which is the second direction (step S305). From this torque and the display of the misrecognition candidate, the user can tell that a candidate easily mistaken for Kanagawa Prefecture can be selected by operating the joystick to the right. In addition, the easily misrecognized word “Nagano Prefecture” is displayed only after the recognition candidate “Kanagawa Prefecture” has been selected. The display in step S301 can thereby be simplified and the display process made faster.
[0033]
Here, when the user operates the joystick downward (step S306: “down”), the next recognition candidate (here, Shiga Prefecture) is selected (step S311), and the process returns to step S304. When the joystick is pushed in (step S306: “OK”), the input word is determined as “Kanagawa Prefecture”.
[0034]
When the user operates the joystick to the right, that is, in the direction 611 in FIG. 3 (step S306: “right”), the electrical contact 609 in FIG. 3 touches the outer electrical contact on the side opposite the outer electrical contact 608, whereby the operation direction is detected. When the operation is detected, the signal is transmitted to the signal processing unit 103. In response, the signal processing unit 103 changes the display so that Nagano Prefecture is selected, as indicated by 1001 in FIG. 9 (step S307).
[0035]
If there is another word, besides Nagano Prefecture, that is easily misrecognized as “Kanagawa Prefecture” (step S308: NO), the process returns to step S306 to wait for the next operation. In this case, in step S306 the user may operate the joystick to the right to select the next misrecognition candidate to the right (step S307), operate it downward to select the next recognition candidate, Shiga Prefecture (step S311), or push it in to confirm the currently selected misrecognition candidate (Nagano Prefecture).
[0036]
If Nagano Prefecture is the last of the words that are easily misrecognized as “Kanagawa Prefecture” (step S308: YES), a signal is transmitted to the drive motor 605 so as to generate a downward torque in the direction indicated by 610 in FIG. 3. As a result, the drive motor 605 generates torque, and the joystick operation unit 601 in FIG. 3 transmits to the user's hand a torque that urges it in the direction 610 (step S309). However, since the duration of this torque is extremely short, the torque alone does not bring the electrical contact 609 into contact with another contact. From this torque, the user can tell that there are no more misrecognition candidates and that the joystick should be operated downward to select a recognition target word that has not yet been selected. In particular, even when the easily misrecognized words are too numerous to be displayed side by side in the display area of the display 105 at once, the torque lets the user easily know whether any further easily misrecognized word follows the one currently displayed.
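The two force cues described so far (a rightward pulse in step S305 and a downward pulse in step S309) depend only on where the current selection sits within its row. The following sketch expresses that decision. It is an assumption-laden illustration: the function name `haptic_cue` and the column numbering are hypothetical, and the brief duration of the pulse, which the description relies on so that the torque does not itself change the selection, is not modeled.

```python
def haptic_cue(confusable_count, column):
    """Decide which torque pulse, if any, to emit for a newly made selection.

    `confusable_count` is the number of easily misrecognized words registered
    for the selected recognition candidate. `column` is 0 when the recognition
    candidate itself is selected and k (k >= 1) when its k-th confusable word
    is selected. Returns "right", "down", or None.
    """
    if confusable_count == 0:
        return None          # S304: NO, nothing to point the user toward
    if column == 0:
        return "right"       # S304: YES, confusable words exist (S305)
    if column == confusable_count:
        return "down"        # S308: YES, last confusable word reached (S309)
    return None              # S308: NO, more confusable words lie to the right
```

For the example of FIG. 7, selecting Kanagawa Prefecture gives `haptic_cue(1, 0)`, a rightward pulse, and then selecting Nagano Prefecture gives `haptic_cue(1, 1)`, a downward pulse.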
[0037]
Accordingly, when the operator operates the joystick downward, that is, in the direction 610 in FIG. 3 (step S310: “down”), “Shiga Prefecture”, the next recognition candidate in order of degree of coincidence with the speech, is selected as shown in FIG. 10 (step S311), and the process returns to step S304. When “Shiga Prefecture” is selected, the word “Nagano Prefecture”, which is easily misrecognized as the previous recognition candidate “Kanagawa Prefecture”, may be hidden to simplify the display. On the other hand, when the joystick is pushed in (step S310: “OK”), the input word is confirmed as the currently selected “Nagano Prefecture”.
[0038]
In step S304, when no easily misrecognized word is registered for the selected recognition candidate, as with “Shiga Prefecture” in FIG. 7 (step S304: NO), the next operation is awaited. If the joystick is pushed in (step S312: “OK”), the selected recognition candidate is confirmed. If the joystick is operated downward (step S312: “down”), the next recognition candidate is selected (step S313), and the process returns to step S304. Note that in step S312, an upward operation may be permitted in addition to “OK” and “down”. In that case, the previous recognition candidate is selected, and the process returns to step S304. When the joystick is operated to the right or left in step S312, an error message (not shown) may be output, or the operation may be handled in the same way as a downward or upward operation.
[0039]
The selected recognition candidate or easily misrecognized word is displayed in reverse video on the display 105 (FIGS. 8 to 10), and at the same time the speaker 102 announces it to the operator with synthesized speech. The operator can thereby know exactly which word is selected.
[0040]
With the above operation, the user can efficiently search for a desired candidate by operations in the vertical direction and the horizontal direction.
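The selection flow of FIG. 5 can be summarized as a small state machine over a row index (first direction) and a column index (second direction). The sketch below is a simplified illustration, not the flowchart itself: the class name `CandidateSelector` and the action names are hypothetical, force feedback and display updates are omitted, and operations that have nowhere to go (for example, "right" on a candidate with no registered words, or "down" on the last candidate) are simply ignored, which is only one of the behaviors the description allows.

```python
class CandidateSelector:
    """Tracks the selection while the user moves down, right, or pushes in."""

    def __init__(self, rows):
        # rows: list of (candidate, [confusable words]) pairs,
        # already in descending order of coincidence.
        self.rows = rows
        self.row = -1  # -1: nothing selected yet (state of FIG. 6)
        self.col = 0   # 0: the candidate itself; k >= 1: its k-th confusable word

    def selected(self):
        """Return the currently selected word, or None if nothing is selected."""
        if self.row < 0:
            return None
        word, confusables = self.rows[self.row]
        return word if self.col == 0 else confusables[self.col - 1]

    def operate(self, action):
        """Apply "down", "right", or "push"; return the confirmed word on push."""
        if action == "push":
            return self.selected()
        if action == "down" and self.row + 1 < len(self.rows):
            self.row += 1   # next recognition candidate (S303, S311, S313)
            self.col = 0
        elif action == "right" and self.row >= 0:
            if self.col < len(self.rows[self.row][1]):
                self.col += 1   # next confusable word (S307)
        return None


selector = CandidateSelector([
    ("Kanagawa Prefecture", ["Nagano Prefecture"]),
    ("Shiga Prefecture", []),
    ("Saga Prefecture", []),
])
selector.operate("down")              # selects Kanagawa Prefecture
selector.operate("right")             # selects Nagano Prefecture
confirmed = selector.operate("push")  # confirms the current selection
```

In this walk-through the user reaches “Nagano Prefecture” with two movements and one push, regardless of how low its degree of coincidence was.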
[0041]
In the above-described embodiment, the case where the first-ranked recognition candidate has an easily misrecognized word has been described. However, a recognition candidate that has easily misrecognized words may be at any rank.
[0042]
In the present embodiment, a joystick has been described as an example of the input device 104. However, any other input device may be used as long as operations in two or more directions are possible. Desirably, a single operation unit allows operation in two or more directions, as with a cross key or a trackball, because torque in the other direction can then be conveyed to the user while the user is operating in one direction.
[0043]
In the above-described embodiment, an example in which the speech recognition apparatus according to the present invention is applied to a car navigation apparatus has been described, but the present invention can also be applied to apparatuses other than a car navigation apparatus.
[Brief description of the drawings]
FIG. 1 is a diagram showing the overall configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the input device.
FIG. 3 is a diagram showing a specific example of the joystick, drive motors, and position sensors of FIG. 2.
FIG. 4 is a flowchart showing the procedure of processing performed by the speech recognition apparatus.
FIG. 5 is a flowchart showing details of the processing in step S207 of FIG. 4.
FIG. 6 is a diagram showing an example of the screen display of recognition candidates output in step S301 of FIG. 5.
FIG. 7 is a diagram showing the arrangement of recognition candidates and the correspondence between recognition candidates and misrecognition candidates.
FIG. 8 is a diagram showing an example of the screen display of recognition candidates output in step S303 of FIG. 5.
FIG. 9 is a diagram showing an example of the screen display of recognition candidates output in step S307 of FIG. 5.
FIG. 10 is a diagram showing an example of the screen display of recognition candidates output in step S311 of FIG. 5.
[Explanation of symbols]
101 ... Microphone (voice input device),
102 ... Speaker,
103 ... Signal processing unit,
1031 ... A/D converter,
1032 ... D/A converter,
1033 ... Output amplifier,
1034 ... Signal processing device (control device),
1034a ... CPU,
1034b ... Memory,
1035 ... External storage device,
104 ... Input device (operation device),
506 ... Switch,
502a, 502b ... Drive motors (force generation device),
504 ... Joystick control CPU,
503a, 503b ... Joystick position sensors,
505 ... Communication device,
105 ... Display (display device)

Claims

A voice recognition device comprising:
a voice input device for inputting voice;
a storage device that stores a plurality of recognition target words and stores correspondences between individual recognition target words and other recognition target words that are easily misrecognized as them;
a candidate extraction device that calculates a degree of coincidence between the input voice and each recognition target word, and sets a plurality of recognition target words as a plurality of recognition candidates based on the degree of coincidence;
a display device that displays the plurality of recognition candidates side by side in a first direction and displays other recognition target words that are easily misrecognized with respect to the recognition candidates side by side in a second direction;
an operation device for performing a first operation corresponding to the first direction and a second operation corresponding to the second direction;
a selection device that performs selection from the recognition candidates by the first operation, and performs selection, by the second operation, from the other recognition target words that are easily misrecognized with respect to the selected recognition candidate; and
a force generation device that applies a force to the operation device in the second direction when a recognition candidate for which an easily misrecognized recognition target word exists is selected by the first operation.

A voice recognition device comprising:
a voice input device for inputting voice;
a storage device that stores a plurality of recognition target words and stores correspondences between individual recognition target words and other recognition target words that are easily misrecognized as them;
a candidate extraction device that calculates a degree of coincidence between the input voice and each recognition target word, and sets a plurality of recognition target words as a plurality of recognition candidates based on the degree of coincidence;
a display device that displays the plurality of recognition candidates side by side in a first direction and displays other recognition target words that are easily misrecognized with respect to the recognition candidates side by side in a second direction;
an operation device for performing a first operation corresponding to the first direction and a second operation corresponding to the second direction;
a selection device that performs selection from the recognition candidates by the first operation, and performs selection, by the second operation, from the other recognition target words that are easily misrecognized with respect to the selected recognition candidate; and
a force generation device that applies a force to the operation device in the first direction when the recognition target words that are easily misrecognized with respect to the recognition candidate selected by the first operation have been selected through to the last one by the second operation.

The voice recognition device according to claim 1 or 2,
wherein the selected recognition target word is the same before and after the force is applied by the force generation device.

The voice recognition device according to any one of claims 1 to 3,
wherein the first direction and the second direction are substantially orthogonal to each other.

The voice recognition device according to any one of claims 1 to 4,
wherein the display device displays the plurality of recognition candidates and the recognition target words that are easily misrecognized with respect to the recognition candidate selected from among the plurality of recognition candidates, and does not display recognition target words that are easily misrecognized with respect to recognition candidates that are not selected.