JP2010128015A - Device and program for determining erroneous recognition in speech recognition

Info

Publication number: JP2010128015A
Application number: JP2008300021A
Authority: JP
Inventors: Iko Terasawa; 位好寺澤; Kinichi Wada; 錦一和田; Hiroaki Sekiyama; 博昭関山; Toshiyuki Nanba; 利行難波; Keisuke Okamoto; 圭介岡本
Original assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Current assignee: Toyota Motor Corp; Toyota Central R&D Labs Inc
Priority date: 2008-11-25
Filing date: 2008-11-25
Publication date: 2010-06-10

Abstract

[Problem] To provide a speech recognition misrecognition determination apparatus that determines whether a speech recognition result is a misrecognition.
[Solution] The speech recognition misrecognition determination apparatus includes: a speech recognition unit 12 that recognizes a voice command based on voice data and a voice command dictionary 21; a recognition result response unit that executes response processing for the recognition result of the speech recognition unit 12; a face image acquisition unit 15 that acquires face image data of the user within a fixed time after the response processing; an utterance data acquisition unit 16 that acquires utterance data; a face image recognition unit 17 that recognizes, by image recognition, predetermined facial expressions and head movements based on the face image data acquired by the face image acquisition unit 15; an unconscious utterance recognition unit 18 that recognizes unconscious utterances based on the utterance data acquired by the utterance data acquisition unit 16 and an unconscious utterance dictionary 22; and a misrecognition determination unit 19 that determines the recognition result to be a misrecognition when the face image recognition unit 17 recognizes a predetermined facial expression or head movement, or when the unconscious utterance recognition unit 18 recognizes an unconscious utterance.
[Selected drawing] FIG. 1

Description

The present invention relates to a speech recognition misrecognition determination apparatus and a speech recognition misrecognition determination program that determine whether the recognition of speech uttered by a user is a misrecognition.

Speech recognition devices that recognize speech uttered by a user and operate a device according to the recognition result have conventionally been used in various fields. With such a device, if speech recognition is not performed correctly, the device performs an operation the user did not intend, and the user may feel annoyed.

For such cases, a speech recognition device has been proposed (see, for example, Patent Document 1) that determines that speech recognition was not performed correctly from the user's response, such as a device operation that cancels the operation caused by the misrecognition, and then takes measures to avoid annoying the user.

A control device based on unconscious utterances has also been proposed (see, for example, Patent Document 2) that recognizes an erroneous operation by the user, such as a wrong command, from the user's unconscious utterance and returns an appropriate response.

Patent Document 1: JP 2001-228894 A
Patent Document 2: JP-A-5-165600

However, the speech recognition device of Patent Document 1 determines a speech recognition error from the device operation the user performs in response. Because the determination is made only after the user operates the device, the user is burdened with the cancel operation. In addition, the determination must wait until the user performs the cancel operation, which takes time.

The control device of Patent Document 2 targets errors made by the user and does not adequately handle errors on the device side. It also relies only on speech, namely unconscious utterances, so its error detection accuracy is limited.

Misrecognition thus cannot be avoided in speech recognition devices. When the output differs from the speech a user input (for example, the output “甘党 (amatō)” for the input “アバトン (abaton)”), the user usually shows some unconscious reaction. Specifically, the user may show facial expressions such as “laughter (wry smile)” or “surprise” when the result differs markedly from the input, or “anger”, “disappointment”, “exasperation”, or “sadness” when repeated misrecognition prevents the user from achieving the goal. The user may also make head gestures such as tilting the head, shaking the head, or leaning back, or unconsciously utter words such as “えっ” (“huh?”), “うそ” (“no way”), or “何で” (“why?”).

The present invention has been made to solve the above problems. Its object is to provide a speech recognition misrecognition determination apparatus and a speech recognition misrecognition determination program that focus on the user's reaction when the speech recognition result of a user utterance is wrong and accurately determine whether the result is a misrecognition.

To achieve the above object, the speech recognition misrecognition determination apparatus according to claim 1 comprises: voice command recognition means for recognizing the voice command corresponding to voice data input by a user, based on the input voice data and a voice command dictionary in which voice commands corresponding to voice data are registered; recognition result response means for executing response processing for the recognition result of the voice command recognition means; face image acquisition means for acquiring face image data of the user within a predetermined time after the recognition result response means executes the response processing; utterance data acquisition means for acquiring utterance data of the user within the predetermined time; image recognition means for recognizing, by image recognition, a predetermined facial expression and a predetermined head movement based on the face image data acquired by the face image acquisition means; unconscious utterance recognition means for recognizing the unconscious utterance corresponding to the acquired utterance data, based on the utterance data acquired by the utterance data acquisition means and an unconscious utterance dictionary in which unconscious utterances corresponding to utterance data are registered; and misrecognition determination means for determining that the recognition result of the voice command recognition means is a misrecognition when the image recognition means recognizes the predetermined facial expression or the predetermined head movement, or when the unconscious utterance recognition means recognizes an unconscious utterance.

According to the invention of claim 1, a misrecognition in the speech recognition result can be determined based on the facial expression, head movement, or unconscious utterance that the user shows in reaction to the response processing corresponding to the recognition result of the user utterance.

In the speech recognition misrecognition determination apparatus according to claim 2, which depends on claim 1, the recognition result response means is at least one of recognition result output means for outputting the recognition result of the voice command recognition means and device operation means for operating a device in accordance with the recognition result of the voice command recognition means.

According to the invention of claim 2, a misrecognition in the speech recognition result can be determined based on the user's reaction to the output of the recognition result of the user utterance or to the device operation corresponding to that result.

The speech recognition misrecognition determination apparatus according to claim 3 is the apparatus according to claim 1 or claim 2, further comprising device operation control means for stopping the operation of the device by the device operation means when the misrecognition determination means determines that the recognition result of the voice command recognition means is a misrecognition.

According to the invention of claim 3, when the recognition result of the user utterance is determined to be a misrecognition, the device operation based on the misrecognition can be stopped.

In the speech recognition misrecognition determination apparatus according to claim 4, which depends on any one of claims 1 to 3, the predetermined facial expression is an expression that the user shows when the voice command recognition means makes a misrecognition, such as laughter, surprise, anger, disappointment, exasperation, or sadness, and the predetermined head movement is a movement that the user makes when the voice command recognition means makes a misrecognition, such as tilting the head, shaking the head, or leaning back.

According to the invention of claim 4, the recognition result can be determined to be a misrecognition when, in reaction to the response processing corresponding to the recognition result of the user utterance, the user shows an expression such as laughter, surprise, anger, disappointment, exasperation, or sadness, or makes a head movement such as tilting the head, shaking the head, or leaning back.

The speech recognition misrecognition determination program according to claim 5 causes a computer to function as each of the means constituting the speech recognition misrecognition determination apparatus according to any one of claims 1 to 4.

According to the invention of claim 5, a misrecognition in the speech recognition result can be determined based on the facial expression, head movement, or unconscious utterance that the user shows in reaction to the response processing corresponding to the recognition result of the user utterance.

The speech recognition misrecognition determination program according to claim 6 causes a computer to function as: voice command recognition means for recognizing the voice command corresponding to voice data input by a user, based on the input voice data and a voice command dictionary in which voice commands corresponding to voice data are registered; recognition result response means for executing response processing for the recognition result of the voice command recognition means; face image acquisition means for acquiring face image data of the user within a predetermined time after the recognition result response means executes the response processing; utterance data acquisition means for acquiring utterance data of the user within the predetermined time; image recognition means for recognizing, by image recognition, a predetermined facial expression and a predetermined head movement based on the face image data acquired by the face image acquisition means; unconscious utterance recognition means for recognizing the unconscious utterance corresponding to the acquired utterance data, based on the utterance data acquired by the utterance data acquisition means and an unconscious utterance dictionary in which unconscious utterances corresponding to utterance data are registered; and misrecognition determination means for determining that the recognition result of the voice command recognition means is a misrecognition when the image recognition means recognizes the predetermined facial expression or the predetermined head movement, or when the unconscious utterance recognition means recognizes an unconscious utterance.

According to the invention of claim 6, a misrecognition in the speech recognition result can be determined based on the facial expression, head movement, or unconscious utterance that the user shows in reaction to the response processing corresponding to the recognition result of the user utterance.

As described above, the present invention has the effect that whether the speech recognition result of a user utterance is wrong can be determined accurately.

Embodiments of the present invention are described in detail below with reference to the drawings. This embodiment describes device operation when the speech recognition misrecognition determination apparatus according to the present invention is used in an in-vehicle car navigation system with a speech recognition function (hereinafter, “the navigation system”). The present invention is not limited to the embodiment described here and is also applicable to designs modified within the scope of the claims.

FIG. 1 is a block diagram showing the configuration of the speech recognition misrecognition determination apparatus according to the embodiment of the present invention. As shown in the figure, the apparatus includes a voice data input unit 11, a speech recognition unit 12, a command execution unit 13, a recognition result output unit 14, a face image acquisition unit 15, an utterance data acquisition unit 16, a face image recognition unit 17, an unconscious utterance recognition unit 18, a voice command dictionary 21, and an unconscious utterance dictionary 22.

The voice data input unit 11 includes a microphone and accepts the voice data that the user inputs to execute a voice command.

The speech recognition unit 12 performs speech recognition on the voice data input through the voice data input unit 11, using the voice command dictionary 21.

The command execution unit 13 executes the voice command recognized by the speech recognition unit 12 to operate the device.

The recognition result output unit 14 includes a speaker and outputs, as speech from the speaker, a message based on the voice command recognized by the speech recognition unit 12. Instead of outputting speech, the recognition result output unit 14 may display the message as text on the map display screen of the navigation system, or it may do both at the same time.

The face image acquisition unit 15 includes a CCD camera and acquires face image data of the user for a fixed time after the command execution unit 13 executes the voice command and the recognition result output unit 14 outputs the message.

The utterance data acquisition unit 16 includes a microphone and acquires voice data uttered by the user for a fixed time after the command execution unit 13 executes the voice command and the recognition result output unit 14 outputs the message.

The face image recognition unit 17 performs image recognition on the user's face image data acquired by the face image acquisition unit 15 and determines a misrecognition when it recognizes any of the facial expressions such as “laughter”, “surprise”, “anger”, “disappointment”, “exasperation”, and “sadness”, or any of the head gestures such as “tilting the head”, “shaking the head”, and “leaning back”.

The unconscious utterance recognition unit 18 performs speech recognition on the utterance data acquired by the utterance data acquisition unit 16, using the voice command dictionary 21 and the unconscious utterance dictionary 22, and determines a misrecognition when it recognizes one or more words registered in the unconscious utterance dictionary 22.

The misrecognition determination unit 19 determines whether the recognition result of the speech recognition unit 12 was a misrecognition, based on the determination results of the face image recognition unit 17 and the unconscious utterance recognition unit 18. In this embodiment, the misrecognition determination unit 19 determines a misrecognition when at least one of the face image recognition unit 17 and the unconscious utterance recognition unit 18 determines a misrecognition.

The voice command dictionary 21 is a dictionary in which the voice commands of the navigation system are registered in pairs with their readings. FIG. 2 shows an example of the voice command dictionary 21.

The unconscious utterance dictionary 22 is a dictionary in which utterances that a user is likely to produce unconsciously on realizing that a voice command was misrecognized are registered in pairs with their readings. The unconscious utterance dictionary 22 can be created in advance, for example by deliberately causing misrecognition of voice input in a separately conducted speech recognition experiment and collecting the users' utterances immediately after the misrecognition. FIG. 3 shows an example of the unconscious utterance dictionary 22.
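
As a rough illustration of the two dictionaries described above, the following Python sketch stores each entry as a pair of a surface form and its reading. The entries reuse commands and utterances mentioned elsewhere in this description; the romanized readings, the type name `DictionaryEntry`, and the helper `find_by_reading` are assumptions made for illustration and are not taken from FIG. 2 or FIG. 3.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DictionaryEntry:
    """One dictionary row: a surface form paired with its reading."""
    surface: str
    reading: str  # romanized reading (illustrative assumption)


# Stand-in for voice command dictionary 21 (entries are examples only).
VOICE_COMMAND_DICTIONARY: List[DictionaryEntry] = [
    DictionaryEntry("コンビニ表示", "konbini hyouji"),
    DictionaryEntry("2画面表示", "nigamen hyouji"),
    DictionaryEntry("拡大", "kakudai"),
]

# Stand-in for unconscious utterance dictionary 22 (entries are examples only).
UNCONSCIOUS_UTTERANCE_DICTIONARY: List[DictionaryEntry] = [
    DictionaryEntry("えっ", "e"),
    DictionaryEntry("うそ", "uso"),
    DictionaryEntry("何で", "nande"),
]


def find_by_reading(
    dictionary: List[DictionaryEntry], reading: str
) -> Optional[DictionaryEntry]:
    """Return the first entry whose reading matches exactly, or None."""
    for entry in dictionary:
        if entry.reading == reading:
            return entry
    return None
```

Keeping the two dictionaries as separate collections makes it straightforward to tell afterwards which dictionary a recognized word came from.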

The speech recognition misrecognition determination apparatus configured as described above determines whether a recognition result is a misrecognition based on the user's reaction to the recognition result of the voice command the user input. FIG. 4 is a flowchart showing the flow of operation of the apparatus.

In step 100, the voice data input unit 11 accepts a command utterance made by the user to operate the navigation system. In this example, the user is assumed to have input “コンビニ表示” (“convenience store display”) to display convenience store icons on the map screen of the navigation system.

In step 102, the speech recognition unit 12 performs speech recognition on the voice data accepted by the voice data input unit 11, using the voice command dictionary 21. In this example, the speech recognition unit 12 is assumed to have misrecognized the input voice data as “2画面表示” (“two-screen display”) from among the voice commands registered in the voice command dictionary 21.

In step 104, the command execution unit 13 executes the voice command recognized by the speech recognition unit 12, and the recognition result output unit 14 outputs a message based on the recognized voice command. In this example, based on the recognition result, the command execution unit 13 executes the device operation command that splits the map screen of the navigation system into two screens, and the recognition result output unit 14 outputs the speech “Switching to two-screen display” from the speaker. The output of the recognition result output unit 14 is not limited to speech and may be text on the navigation screen.

In step 106, for a fixed time after the command execution unit 13 executes the device operation command and the recognition result output unit 14 outputs the message based on the recognition result, the face image acquisition unit 15 acquires face image data of the user and the utterance data acquisition unit 16 acquires voice data uttered by the user. The acquisition time is the time needed to capture the user's reaction on realizing, from the command execution and message output, that the recognition result is a misrecognition; in this embodiment it is 5 seconds. In this example, the user is assumed to show an expression of “surprise” and to utter “何で” (“why?”).
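
The fixed acquisition window of step 106 can be sketched as below. This is only an illustration under stated assumptions: `read_face_frame` and `read_audio_chunk` are hypothetical callables standing in for the camera and microphone, and the injectable `clock` parameter is an assumption added so the loop can be exercised without waiting. Only the 5-second window comes from the description.

```python
import time
from typing import Callable, List, Optional, Tuple

REACTION_WINDOW_SECONDS = 5.0  # acquisition time used in this embodiment


def collect_reaction(
    read_face_frame: Callable[[], Optional[bytes]],
    read_audio_chunk: Callable[[], Optional[bytes]],
    window_seconds: float = REACTION_WINDOW_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[bytes], List[bytes]]:
    """Poll both (hypothetical) sources until the window has elapsed.

    Each reader returns None when it has nothing to deliver. Real readers
    would normally block on the hardware, which paces this loop.
    """
    frames: List[bytes] = []
    audio: List[bytes] = []
    deadline = clock() + window_seconds
    while clock() < deadline:
        frame = read_face_frame()
        if frame is not None:
            frames.append(frame)
        chunk = read_audio_chunk()
        if chunk is not None:
            audio.append(chunk)
    return frames, audio
```

An empty `audio` list corresponds to the case in which no utterance data is acquired during the window.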

In step 108, the face image recognition unit 17 performs image recognition on the face image data acquired by the face image acquisition unit 15 and determines a misrecognition when it recognizes one or more of the facial expressions such as “laughter”, “surprise”, “anger”, “disappointment”, “exasperation”, and “sadness”, or of the head gestures such as “tilting the head”, “shaking the head”, and “leaning back”. In this example, the expression of surprise is recognized, and the recognition result of the speech recognition unit 12 is determined to be a misrecognition.

Any known method may be used to recognize facial expressions; for example, the method described in Reference 1 (JP 2008-146318 A, “Emotion Estimation Device”) is used. Specifically, the feature quantities (expression map) of each expression to be recognized (laughter, surprise, anger, disappointment, exasperation, sadness, and the normal state) are learned in advance with a neural network. Next, the similarity between the expression maps and data obtained by processing the face image that the face image acquisition unit 15 acquired as user reaction data is calculated, and the expression with the highest similarity is adopted as the expression recognition result.
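
The highest-similarity selection described above can be sketched as follows. This is a minimal illustration and not the method of the cited emotion estimation reference: the expression maps are assumed to be plain feature vectors, cosine similarity is an assumed similarity measure, and the three-dimensional toy vectors and all function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_expression(
    features: Sequence[float], expression_maps: Dict[str, Sequence[float]]
) -> str:
    """Return the label whose map is most similar to the observed features."""
    return max(
        expression_maps,
        key=lambda label: cosine_similarity(features, expression_maps[label]),
    )


def indicates_misrecognition(label: str) -> bool:
    """Every recognized expression other than the normal state counts."""
    return label != "normal"


# Toy expression maps; real maps would be learned from training images.
TOY_EXPRESSION_MAPS: Dict[str, Sequence[float]] = {
    "normal": [1.0, 0.0, 0.0],
    "surprise": [0.0, 1.0, 0.0],
    "laughter": [0.0, 0.0, 1.0],
}
```

With these toy maps, the feature vector `[0.1, 0.9, 0.2]` is closest to `"surprise"`, which would be treated as a sign of misrecognition.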

Any known method may likewise be used to recognize head gestures; for example, the method described in Reference 2 (“Head gesture recognition robust to dialogue robot operation”, IEICE Transactions D, Vol. J89-D, No. 7, pp. 1514-1522) is used. Specifically, a large amount of face image data is collected in advance for each target head gesture (tilting the head, shaking the head, leaning back, and the normal state), and the facial feature points (such as the positions of the outer eye corners and the nose) for each gesture are modeled with an HMM (Hidden Markov Model). Next, the head gesture is determined from the degree of matching between the HMM models and the face image that the face image acquisition unit 15 acquired as user reaction data.
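
One common way to measure how well an observation sequence matches an HMM is the forward algorithm, sketched below for discrete observations. This is an assumption about how the matching could be scored, not the procedure of the cited paper: the observation symbols (0 = still, 1 = moved left, 2 = moved right), the two toy models, and all probabilities are invented for illustration.

```python
import math
from typing import Dict, List, NamedTuple, Sequence


class DiscreteHMM(NamedTuple):
    initial: List[float]           # initial[i] = P(first state is i)
    transition: List[List[float]]  # transition[i][j] = P(next is j | now i)
    emission: List[List[float]]    # emission[i][k] = P(symbol k | state i)


def log_likelihood(model: DiscreteHMM, observations: Sequence[int]) -> float:
    """Scaled forward algorithm: log P(observations | model)."""
    if not observations:
        return 0.0
    n = len(model.initial)
    alpha = [model.initial[i] * model.emission[i][observations[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(observations)):
        if t > 0:
            # Propagate one step, then weight by the emission probability.
            alpha = [
                sum(alpha[i] * model.transition[i][j] for i in range(n))
                * model.emission[j][observations[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob


def classify_gesture(models: Dict[str, DiscreteHMM], observations: Sequence[int]) -> str:
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], observations))


# Toy models (hypothetical parameters).
TOY_GESTURE_MODELS: Dict[str, DiscreteHMM] = {
    "normal": DiscreteHMM([1.0], [[1.0]], [[0.8, 0.1, 0.1]]),
    "head_shaking": DiscreteHMM(
        [0.5, 0.5],
        [[0.2, 0.8], [0.8, 0.2]],
        [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
    ),
}
```

With these toy models, an alternating left-right sequence such as `[1, 2, 1, 2, 1]` scores highest under `"head_shaking"`, while a still sequence such as `[0, 0, 0, 0]` scores highest under `"normal"`.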

In step 110, the unconscious utterance recognition unit 18 determines whether the utterance data acquisition unit 16 acquired utterance data from the user. If utterance data was acquired, the process proceeds to step 112; if not, the process proceeds to step 114.

In step 112, the unconscious utterance recognition unit 18 performs speech recognition on the user's utterance data acquired by the utterance data acquisition unit 16, using the voice command dictionary 21 and the unconscious utterance dictionary 22, and determines a misrecognition when it recognizes one or more words registered in the unconscious utterance dictionary 22. In this example, the unconscious utterance recognition unit 18 is assumed to have recognized the utterance “何で” described above as the voice command “拡大” (“enlarge”). In this case, no unconscious utterance was recognized, so the recognition result of the speech recognition unit 12 is determined to be correct.
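
The check in step 112 can be sketched as follows, assuming a recognizer (not shown) that returns words drawn from the combined vocabulary of both dictionaries. The set contents reuse words from this description, and the function names are hypothetical.

```python
from typing import Iterable, Set

# Example vocabulary taken from words mentioned in this description.
VOICE_COMMANDS: Set[str] = {"コンビニ表示", "2画面表示", "拡大"}
UNCONSCIOUS_UTTERANCES: Set[str] = {"えっ", "うそ", "何で"}


def recognition_vocabulary() -> Set[str]:
    """Vocabulary for step 112: both dictionaries are searched together."""
    return VOICE_COMMANDS | UNCONSCIOUS_UTTERANCES


def utterance_indicates_misrecognition(recognized_words: Iterable[str]) -> bool:
    """True when at least one recognized word is a registered unconscious utterance."""
    return any(word in UNCONSCIOUS_UTTERANCES for word in recognized_words)
```

For instance, `utterance_indicates_misrecognition(["何で"])` is true, whereas `utterance_indicates_misrecognition(["拡大"])` is false. The second call mirrors the example in this embodiment, where the reaction itself is recognized as a voice command and therefore is not counted as an unconscious utterance.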

In step 114, the misrecognition determination unit 19 determines whether the recognition result of the speech recognition unit 12 was a misrecognition, based on the determination results of the face image recognition unit 17 and the unconscious utterance recognition unit 18. In this example, the face image recognition unit 17 determined a misrecognition and the unconscious utterance recognition unit 18 determined that the result was correct, so the recognition result of the speech recognition unit 12 is determined to be a misrecognition.
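
The final decision of step 114 is a logical OR of the two partial determinations. A minimal sketch, in which the function name and the use of `None` for "no utterance data was acquired" are assumptions:

```python
from typing import Optional


def judge_misrecognition(face_flag: bool, utterance_flag: Optional[bool]) -> bool:
    """Combine the two partial determinations of steps 108 and 112.

    face_flag:      True if a predetermined expression or head gesture was recognized.
    utterance_flag: True if an unconscious utterance was recognized, False if not,
                    None if no utterance data was acquired (step 112 was skipped).
    """
    return face_flag or bool(utterance_flag)
```

In the example of this embodiment the face determination is positive and the utterance determination is negative, so `judge_misrecognition(True, False)` yields a misrecognition.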

As described above, the speech recognition misrecognition determination apparatus according to this embodiment can accurately determine whether a recognition result is a misrecognition from the user's reaction immediately after the speech recognition result for a voice input is output. When a misrecognition is determined, the subsequent dialogue processing can proceed smoothly.

The present invention is not limited to the embodiment described above and is also applicable to designs modified within the scope of the claims.

For example, this embodiment uses both the pair of the face image acquisition unit 15 and the face image recognition unit 17 and the pair of the utterance data acquisition unit 16 and the unconscious utterance recognition unit 18, but the determination may be made using only one of the pairs.

In addition, when the misrecognition determination unit 19 determines that the user's command utterance was misrecognized, the apparatus may be configured to control the device operation so that the command operation based on the misrecognition is stopped.

FIG. 1 is a block diagram showing the configuration of the speech recognition misrecognition determination apparatus according to the embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the voice command dictionary.
FIG. 3 is a diagram showing a configuration example of the unconscious utterance dictionary.
FIG. 4 is a flowchart showing the flow of operation of the speech recognition misrecognition determination apparatus according to the embodiment of the present invention.

Explanation of symbols

11 Voice data input unit
12 Speech recognition unit
13 Command execution unit
14 Recognition result output unit
15 Face image acquisition unit
16 Utterance data acquisition unit
17 Face image recognition unit
18 Unconscious utterance recognition unit
19 Misrecognition determination unit
21 Voice command dictionary
22 Unconscious utterance dictionary

Claims

A speech recognition misrecognition determination apparatus comprising:
voice command recognition means for recognizing a voice command corresponding to voice data input by a user, based on the input voice data and a voice command dictionary in which voice commands corresponding to voice data are registered;
recognition result response means for executing a response process for the recognition result obtained by the voice command recognition means;
face image acquisition means for acquiring face image data of the user within a predetermined time after the recognition result response means executes the response process;
utterance data acquisition means for acquiring utterance data of the user within the predetermined time;
image recognition means for recognizing a predetermined facial expression and a predetermined head movement based on the face image data acquired by the face image acquisition means;
unconscious utterance recognition means for recognizing an unconscious utterance corresponding to the acquired utterance data, based on the utterance data acquired by the utterance data acquisition means and an unconscious utterance dictionary in which unconscious utterances corresponding to utterance data are registered; and
misrecognition determination means for determining that the recognition result obtained by the voice command recognition means is a misrecognition when the predetermined facial expression or the predetermined head movement is recognized by the image recognition means, or when an unconscious utterance is recognized by the unconscious utterance recognition means.

The speech recognition misrecognition determination apparatus according to claim 1, wherein the recognition result response means is at least one of recognition result output means for outputting the recognition result obtained by the voice command recognition means and device operation means for operating a device in response to the recognition result obtained by the voice command recognition means.

The speech recognition misrecognition determination apparatus according to claim 2, further comprising device operation control means for stopping the operation of the device by the device operation means when the misrecognition determination means determines that the recognition result obtained by the voice command recognition means is a misrecognition.

The speech recognition misrecognition determination apparatus according to any one of claims 1 to 3, wherein the predetermined facial expression is a facial expression that the user shows when the voice command recognition means misrecognizes, such as laughter, surprise, anger, discouragement, or sadness, and the predetermined head movement is a movement that the user makes when the voice command recognition means misrecognizes, such as shaking or tilting the head.

A speech recognition misrecognition determination program for causing a computer to function as each means constituting the speech recognition misrecognition determination apparatus according to any one of claims 1 to 4.

A speech recognition misrecognition determination program for causing a computer to function as:
voice command recognition means for recognizing a voice command corresponding to voice data input by a user, based on the input voice data and a voice command dictionary in which voice commands corresponding to voice data are registered;
recognition result response means for executing a response process for the recognition result obtained by the voice command recognition means;
face image acquisition means for acquiring face image data of the user within a predetermined time after the recognition result response means executes the response process;
utterance data acquisition means for acquiring utterance data of the user within the predetermined time;
image recognition means for recognizing a predetermined facial expression and a predetermined head movement based on the face image data acquired by the face image acquisition means;
unconscious utterance recognition means for recognizing an unconscious utterance corresponding to the acquired utterance data, based on the utterance data acquired by the utterance data acquisition means and an unconscious utterance dictionary in which unconscious utterances corresponding to utterance data are registered; and
misrecognition determination means for determining that the recognition result obtained by the voice command recognition means is a misrecognition when the predetermined facial expression or the predetermined head movement is recognized by the image recognition means, or when an unconscious utterance is recognized by the unconscious utterance recognition means.