JPH09269889A

JPH09269889A - Dialogue device

Info

Publication number: JPH09269889A
Application number: JP8080023A
Authority: JP
Inventors: Keiko Watanuki; 啓子綿貫
Original assignee: GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Sharp Corp
Current assignee: GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Sharp Corp
Priority date: 1996-04-02
Filing date: 1996-04-02
Publication date: 1997-10-14

Abstract

(57) [Abstract]

[Problem] To provide a dialogue device between a user and a computer that integrates user inputs such as voice and facial expressions from multiple input channels to determine who holds the right to speak, and that controls information such as messages output from the computer according to that determination, so that the dialogue proceeds smoothly.

[Solution] Recognition means 2 recognizes the user's action state from input data 1 (a voice signal, head movement, gaze direction, and facial expression) together with time information from time-stamping means 6. The recognition result is passed to right-to-speak control means 3, which determines whether the right to speak lies with the computer or with the user. Dialogue management means 4 generates a response from the computer according to where the right to speak lies and conveys it to the user via output means 5, so that the dialogue proceeds smoothly.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a dialogue device that enables dialogue between a human and a computer, and more specifically to such a device that realizes smooth dialogue not only through voice but also through various non-voice expressions.

[0002]

[Prior Art] Conventionally, interfaces between humans and computers have mainly taken a form in which the computer takes the initiative in asking questions and the user proceeds with the task by answering passively. As a result, the order of the user's utterances is fixed, and no consideration is given to how easy it is for the user to speak or to the naturalness of the dialogue. To address this, Japanese Patent Laid-Open No. 6-110835 discloses a device in which the user can speak by interrupting the computer's audio output, and Japanese Patent Laid-Open No. 62-40577 discloses a device that allows the user to ask for a repetition in the middle of the computer's utterance.

[0003]

[Problems to Be Solved by the Invention] When a person interacts with a computer, the dialogue loses its naturalness if the computer's and the user's utterances are poorly timed. As countermeasures, Japanese Patent Laid-Open No. 6-110835 detects the user's utterance so that the computer's utterance can be interrupted, and Japanese Patent Laid-Open No. 62-40577 detects utterances such as "Eh, what was that?" so that the user can ask for a repetition. However, the cues that a person is about to start or finish speaking appear not only in the voice but simultaneously, or complementarily, in the voice, facial expressions, gestures, and so on, so the countermeasures in these conventional examples are not always satisfactory. One reason is that the computer's response is started only after the person has started or finished speaking, so the response is delayed and a smooth dialogue between the computer and the user cannot be achieved.

[0004] The present invention has been made in view of the above problems in the prior art. Its object is to provide a human-computer dialogue device that realizes natural dialogue by predicting the start and end of a user's utterance earlier from multiple signal features generated in response to the user's behavior, determining whether the initiative in speaking (the right to speak) lies with the computer or with the user, and controlling the audio information, image information, and other output from the computer according to where the right to speak lies.

[0005]

[Means for Solving the Problems] The invention of claim 1 is a dialogue device comprising: input means for inputting voice produced by a human or the like and expressions of the human or the like that can be captured as images; output means for outputting audible sound and visible signals to the human or the like; recognition means for recognizing the user's action state based on the signals obtained by the input means; and processing/control means for processing the output of the recognition means and controlling the output means. The recognition means outputs, as recognition results, the user's action states in a plurality of predetermined action modes. The processing/control means judges the state of the dialogue from the recognition results, performs processing to determine where the right to speak lies, and controls the output means to produce output corresponding to the result of that processing. By integrating the recognition results to judge the state of the dialogue, the device can determine the end or start of an utterance earlier, that is, decide whether the right to speak lies with the computer or with the user, and can control the output of audio and visible signals to the user according to where the right to speak lies so that the dialogue proceeds smoothly.

[0006] The invention of claim 2 is the invention of claim 1 further provided with output generation means for generating output corresponding to the result of the processing in the processing/control means. The audio and visible signals generated by the output generation means are output according to the state of the dialogue and the location of the right to speak. In addition, a message prompting dialogue can be generated according to the determined location of the right to speak and output from the computer to the user.

[0007] The invention of claim 3 is the invention of claim 1 or 2 further provided with time-stamping means that assigns start and end times to the user's action states recognized by the recognition means and is also used for time management of the operation of the processing/control means. Operating on the basis of time information (temporal correlation) allows the dialogue to proceed more smoothly.

[0008] The invention of claim 4 is the invention of any one of claims 1 to 3 in which the plurality of predetermined action modes are voice loudness, voice pitch, gaze direction, and vertical head movement, and the action state in each of these modes is recognized. The location of the right to speak is thus determined from the loudness of the user's voice (voice power), the pitch of the voice (voice pitch), the gaze direction, and vertical head movement (nodding), which enables operation with fewer errors.

[0009] The invention of claim 5 is the invention of any one of claims 1 to 4 in which the plurality of predetermined modes are voice loudness, voice pitch, and gaze direction. The action state in each of these modes is recognized, the user's emotion is determined from the recognition results, and the right to speak currently held by the user is re-assigned to either the user or the device. By determining the emotion, the device can determine that the user is having difficulty replying, so the computer can respond in a way that helps the user speak, the user's confusion can be avoided, and a more satisfying dialogue becomes possible.

[0010]

[Embodiments of the Invention] FIG. 1 is a block diagram showing a first embodiment of the dialogue device of the present invention. The dialogue device of this embodiment includes multi-channel recognition means 2 that recognizes input data 1, which includes time information for the voice signal, head movement, gaze direction, and facial expression. Connected to recognition means 2 are time-stamping means 6, which outputs time information, and right-to-speak control means 3, which integrates the recognition results output in parallel from the individual recognition means 2-1, 2-2, ... and determines where the right to speak lies. Connected to right-to-speak control means 3 are history storage means 7, which holds a history of the location of the right to speak, and dialogue management means 4, which advances the dialogue based on the location of the right to speak determined by right-to-speak control means 3. Output means 5, which outputs output data, is connected to dialogue management means 4. Each recognition means 2-1, 2-2, ... has a recognition algorithm suited to the data it recognizes and is configured to obtain the start time and end time of each recognition result from time-stamping means 6.

[0011] Next, the operation of this embodiment will be described. As input data, multiple signals generated in response to the user's behavior or actions can be captured. For example, from a camera, microphone, motion sensor, electrocardiograph, or the like, the following can be recognized: the loudness of the voice (hereinafter "voice power"), the pitch of the voice (hereinafter "voice pitch"), speaking rate, length of pauses, facial expression, face orientation, size and shape of the mouth, size of the eyes and blinking, gaze direction, body gestures, hand gestures, head movement, and heart rate. Possible output means include, for example, a speaker, a display (a visual display means), and a haptic device. In the following, the embodiment is described for a computer that uses voice power, voice pitch, gaze direction, and head movement as input data, and that has, as output means, audio output means for outputting synthesized speech and output means for displaying a CG pseudo-human.

[0012] The voice power of input data 1-1 is A/D converted, the voice power level is recognized for each predetermined processing unit (frame; one frame is 1/30 second), and "voice presence" is detected and sent to right-to-speak control means 3. The voice pitch of input data 1-2 is A/D converted, the voice pitch level is recognized for each frame, and a "rising pitch" of a predetermined level is detected and sent to right-to-speak control means 3. For the gaze direction of input data 1-3, eye contact (the user directing their gaze at the CG pseudo-human on the display) is recognized for each frame, and "eye contact" of a predetermined frame length is detected and sent to right-to-speak control means 3. For the head movement of input data 1-4, vertical head movement is recognized for each frame, and a "nod" of a predetermined frame length is detected and sent to right-to-speak control means 3.
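The step from per-frame recognition to an event "of a predetermined frame length" can be sketched as run-length detection over per-frame flags. This is a minimal sketch, assuming each frame has already been reduced to a true/false flag (for example, "gaze is on the CG figure"). The function name `detect_runs` and the threshold `min_frames` are hypothetical; only the 1/30 second frame duration comes from the text.

```python
from typing import List, Tuple

FRAME_SEC = 1.0 / 30.0  # frame duration stated in the text


def detect_runs(flags: List[bool], min_frames: int) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) for every run of True flags that lasts
    at least min_frames consecutive frames (min_frames is an assumed value)."""
    events = []
    run_start = None
    # A trailing False sentinel closes a run that reaches the last frame.
    for i, flag in enumerate(list(flags) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_frames:
                events.append((run_start * FRAME_SEC, i * FRAME_SEC))
            run_start = None
    return events
```

Shorter runs are dropped, which is one plausible way to keep a single-frame glance from being reported as eye contact.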

[0013] The information sent from each recognition means 2-1, 2-2, ... consists of a start time, an end time, a mode, and a recognition result. The start time and end time are values supplied by time-stamping means 6 and represent the start and end times of the input data from which the recognition result was obtained. In this embodiment, the mode is one of voice power, voice pitch, gaze direction, or head movement, and indicates which of the multiple outputs produced simultaneously by the user the result belongs to. The recognition result depends on the mode: "voice presence" for voice power, "rising pitch" for voice pitch, "eye contact" for gaze direction, and "nod" for head movement. Expressions that convey emotions such as laughter or confusion, and gestures, can also be added as modes.
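The four-field record described here (start time, end time, mode, recognition result) could be represented as below. This is an illustrative sketch only: the class and field names are hypothetical, times are assumed to be seconds, and the validation rules are an added assumption rather than anything the source specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    VOICE_POWER = "voice power"
    VOICE_PITCH = "voice pitch"
    GAZE = "gaze direction"
    HEAD = "head movement"


# Recognition result reported for each mode, as listed in the text.
RESULT_FOR_MODE = {
    Mode.VOICE_POWER: "voice presence",
    Mode.VOICE_PITCH: "rising pitch",
    Mode.GAZE: "eye contact",
    Mode.HEAD: "nod",
}


@dataclass(frozen=True)
class RecognitionResult:
    start: float  # seconds, supplied by the time-stamping means
    end: float    # seconds, supplied by the time-stamping means
    mode: Mode
    result: str

    def __post_init__(self):
        # Assumed sanity checks; the source does not describe validation.
        if self.end < self.start:
            raise ValueError("end time precedes start time")
        if self.result != RESULT_FOR_MODE[self.mode]:
            raise ValueError("result does not match mode")
```

Adding a mode for an emotional expression or a gesture, which the text allows, would mean extending `Mode` and `RESULT_FOR_MODE` together.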

[0014] Right-to-speak control means 3 determines, from the user's input voice power, voice pitch, gaze direction, and head movement, whether the user is about to speak (start of utterance) or about to stop speaking (end of utterance). Using the information supplied by history storage means 7 together with the time information, it judges whether the current initiative in speaking (the right to speak) lies with the computer or with the user. When the right to speak lies with the computer, an "interruption" by the user can be expected. To keep the dialogue smooth, it is desirable to judge whether the user is about to speak, or appears to want to speak, not only from the user's utterance but also from the actions that appear immediately before an utterance, and, if the user is about to speak, for the computer to break off its utterance and encourage the user to speak.

[0015] FIG. 2 shows the state of the outputs of the recognition means, to illustrate an example of the processing in right-to-speak control means 3 when the right to speak lies with the computer. When the computer holds the right to speak, the user may produce not only an "interruption" but also a back-channel response (aizuchi) to the computer's utterance. In this embodiment, whether the user's behavior is an interruption or a back-channel response is decided by Bayes classification of the user's "voice presence", "rising pitch", and "eye contact" occurring between time T1, at which the computer holds the right to speak, and T1 + WT (for example, 0.5 seconds). If Bayes classification determines that the user is about to speak (interruption), this is reported to dialogue management means 4, which breaks off the computer's utterance and, through output means 5, has the computer say "Is something wrong?" or has the CG pseudo-human show a "What's the matter?" expression. If, on the other hand, the behavior is judged to be a back-channel response, this is reported to dialogue management means 4, which continues the utterance as is through output means 5. Here, examples of an utterance of a certain length with a rising pitch are so-called filler or redundant words such as "Eh?", "Yes?", and "Huh?". If none of these reactions is observed during the system's utterance, "no interruption or back-channel response from the user" is reported to dialogue management means 4, which continues the utterance as is through output means 5.
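The window test in this paragraph can be sketched as follows: collect which cues occur between T1 and T1 + WT and encode them as one bit per cue. This is a sketch under stated assumptions. Events are assumed to be `(start_sec, end_sec, label)` tuples, and "occurring in the window" is taken to mean any overlap with it, which the source does not define. The function name `window_pattern` is hypothetical.

```python
from typing import Iterable, Sequence, Tuple

Event = Tuple[float, float, str]  # (start_sec, end_sec, label), an assumed layout


def window_pattern(events: Iterable[Event], t1: float, wt: float,
                   cues: Sequence[str]) -> Tuple[int, ...]:
    """Return one bit per cue: 1 if an event with that label overlaps
    the window [t1, t1 + wt], else 0."""
    present = {label for start, end, label in events
               if start <= t1 + wt and end >= t1}
    return tuple(1 if cue in present else 0 for cue in cues)
```

With the cue order ("voice presence", "rising pitch", "eye contact") and WT = 0.5, a window in which all three cues overlap gives `(1, 1, 1)`, the pattern that is then handed to the classification step.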

[0016] The Bayes classification method is now described. Bayes classification uses information from a multimodal dialogue database, which records human-human dialogues and computer-user dialogues captured in various modes. FIG. 3 shows an example of the data in the multimodal dialogue database. To perform Bayes classification that distinguishes back-channel responses from interruptions, the user's back-channel responses and interruptions are extracted from the database's data on users during computer utterances. For each one, the presence or absence of "voice", "rising pitch", and "eye contact" is examined, and the result becomes one item of training data.

[0017] For example, as an instance of an interruption by the user, suppose that "voice" of a certain length is observed at time T1 during the computer's utterance, that the voice has a "rising pitch", and that "eye contact" is observed between T1 and T1 + WT (FIG. 2(A)). This yields the training datum "W 1 1 1". The leading W denotes an interruption, and each following digit is 1 if "voice", "rising pitch", or "eye contact", respectively, is present and 0 if it is absent. As an instance of a back-channel response, suppose that "voice" of a certain length is observed at time T1 during the computer's utterance, but it does not have a "rising pitch" and no "eye contact" is observed between T1 and T1 + WT (FIG. 2(B)). This yields the training datum "A 1 0 0", where the leading A denotes a back-channel response (aizuchi). A large amount of such training data is prepared. When recognition data such as "1 1 1" (voice of a certain length is observed, it has a rising pitch, and eye contact is observed at the same time) is given, the numbers of "W 1 1 1" and "A 1 1 1" items in the training data are compared. If "W 1 1 1" is more frequent, the user is judged to be interrupting, that is, about to speak. If "A 1 1 1" is more frequent, the user is judged to be giving a back-channel response. If the counts are equal (including the case where both are 0), the result is "unknown"; this is reported to dialogue management means 4, which continues the utterance as is through output means 5.
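The count comparison described in this paragraph can be sketched as below. This is a minimal illustration, not the patent's implementation: the function name `classify`, the tuple layout of the training items, and the training data in the usage note are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

Pattern = Tuple[int, ...]


def classify(pattern: Pattern, training: Iterable[Tuple[str, Pattern]],
             label_a: str, label_b: str) -> str:
    """Compare how often `pattern` occurs under each label in the training
    data; return the more frequent label, or "unknown" on a tie
    (including the case where both counts are 0)."""
    counts = Counter(label for label, p in training if p == pattern)
    if counts[label_a] > counts[label_b]:
        return label_a
    if counts[label_b] > counts[label_a]:
        return label_b
    return "unknown"
```

For instance, with the made-up training list `[("W", (1, 1, 1)), ("W", (1, 1, 1)), ("A", (1, 1, 1)), ("A", (1, 0, 0))]`, the pattern `(1, 1, 1)` is classified as "W" (interruption), `(1, 0, 0)` as "A" (back-channel response, aizuchi), and an unseen pattern such as `(0, 1, 0)` as "unknown".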

[0018] Next, an example of the processing in right-to-speak control means 3 when the right to speak lies with the user is described with reference to FIG. 4. When the user holds the right to speak, it is desirable to judge quickly whether the user is about to finish speaking, not only from the user's utterance but also from the actions that appear just before the end of an utterance, so that the computer's response is well timed. When the user holds the right to speak, a pause in mid-utterance is easily mistaken for the end of the utterance. In this embodiment, the end of the user's utterance is decided by Bayes classification of the user's "nod", "rising pitch", "eye contact", and "voice presence" occurring between time T2, at which the user holds the right to speak, and T2 + WT (for example, 0.5 seconds). If Bayes classification determines a pause, dialogue management means 4, through output means 5, inserts back-channel speech accompanied by a smiling CG face image. If the user is judged to be about to finish speaking, "the user is about to finish speaking" is reported to dialogue management means 4, which, through output means 5, produces the computer's utterance as audio output while the face image lip-syncs (moves its mouth as when speaking). The loudness of the voice and the speaking rate may also be specified, and facial expressions and gestures may be added.

[0019] An example of Bayes classification for distinguishing a pause in mid-utterance from the end of an utterance is now described. In this case, data on pauses during user utterances and on the ends of utterances is extracted from the multimodal dialogue database. For each, the presence or absence of "nod", "rising pitch", "eye contact", and "voice" is examined, and the result becomes one item of training data. For example, as an instance of the "end" of a user utterance, suppose that a "nod" occurs at time T2 during the user's utterance, and that between T2 and T2 + WT "eye contact" is observed and "voice" is not (FIG. 3(A)). This yields the training datum "F 1 0 1 0". The leading F denotes the end of an utterance, and each following digit is 1 if the corresponding state is present and 0 if it is absent.

[0020] As another instance of the "end" of a user utterance, suppose that a "nod" and an utterance with a "rising pitch" are observed at time T2 during the user's utterance, and that between T2 and T2 + WT "eye contact" is observed and "voice" is not (FIG. 3(B)). This yields the training datum "F 1 1 1 0". As an instance of a "pause" in a user utterance, suppose that a "nod" occurs at time T2 during the user's utterance, and that between T2 and T2 + WT neither "eye contact" nor "voice" is observed (FIG. 3(C)). This yields the training datum "P 1 0 0 0". The leading P denotes a pause, and each following digit is 1 if the corresponding state is present and 0 if it is absent.

[0021] A large amount of such training data is prepared. When recognition data such as "1 0 1 0" (a "nod" occurs at time T2 during the user's utterance, and between T2 and T2 + WT "eye contact" is observed and "voice" is not) is given, the numbers of "F 1 0 1 0" and "P 1 0 1 0" items in the training data are compared. If "F 1 0 1 0" is more frequent, the user is judged to be about to finish speaking. If "P 1 0 1 0" is more frequent, the behavior is judged to be a pause. If the counts are equal (including the case where both are 0), the result is "unknown"; this is reported to dialogue management means 4, which, through output means 5, waits for the user to speak.
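One way to see why this count comparison can be called Bayes classification: if probabilities are estimated by relative frequencies in the training data, choosing the label with the larger count is the same as choosing the label with the larger estimated posterior. The notation below is introduced here for illustration and is not in the source: N(c, x) is the number of training items with label c (F for end, P for pause) and cue pattern x.

```latex
\hat{P}(c \mid x) = \frac{N(c, x)}{N(\mathrm{F}, x) + N(\mathrm{P}, x)}, \qquad c \in \{\mathrm{F}, \mathrm{P}\}

\hat{P}(\mathrm{F} \mid x) > \hat{P}(\mathrm{P} \mid x) \iff N(\mathrm{F}, x) > N(\mathrm{P}, x)
```

When the two counts are equal the estimated posteriors are equal, and when both are zero the estimate is undefined, which corresponds to the "unknown" outcome in the text.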

[0022] Because a "rising pitch" and a "nod" occur just before the end of an utterance, Bayes classification may first be performed on the user's "rising pitch" and "nod" to predict the end of the utterance. In this case, when Bayes classification predicts the end of the utterance, "the user's utterance may be ending" is reported to dialogue management means 4. However, because this may be a mere pause rather than the end of the utterance, dialogue management means 4, through output means 5, inserts back-channel speech accompanied by a smiling CG face image while preparing the computer's utterance. The user's "voice presence" and "eye contact" over a certain subsequent period are then examined, and Bayes classification of end of utterance versus pause is performed. If Bayes classification determines the end of the user's utterance, "the user is about to finish speaking" is reported to dialogue management means 4, which, through output means 5, produces the computer's utterance as audio output while the face image lip-syncs. Before starting or finishing speaking, people signal their intention through facial expressions, gestures, and the like. By recognizing these human actions that appear before an utterance starts or ends, the location of the right to speak can be determined earlier, and the dialogue between human and computer can proceed with good timing and more smoothly.
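The two-stage flow in this paragraph (predict the end from early cues, hedge with a back-channel response, then confirm from later cues) can be sketched as below. The function name, the action strings, and the idea of passing the two classifiers in as callables are all assumptions made for illustration; the source does not describe an interface.

```python
from typing import Callable, List, Tuple

Pattern = Tuple[int, ...]


def two_stage_turn_end(early: Pattern, late: Pattern,
                       predict_end: Callable[[Pattern], bool],
                       confirm: Callable[[Pattern], str]) -> List[str]:
    """early = (rising pitch, nod); late = (voice presence, eye contact).
    Returns the dialogue-manager actions in the order they would be issued."""
    actions: List[str] = []
    if not predict_end(early):
        return actions  # no end predicted; nothing to do yet
    # Stage 1: the utterance may be ending, so hedge with a back-channel
    # response and prepare the computer's utterance in the meantime.
    actions += ["back-channel with smiling face", "prepare utterance"]
    # Stage 2: confirm end versus pause from the later cues.
    if confirm(late) == "end":
        actions.append("speak with lip sync")
    return actions
```

Because the first stage only triggers a low-cost response, a wrong prediction (a pause mistaken for an end) costs little, while a correct one lets the computer's utterance be ready sooner.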

[0023] Next, a second embodiment of the dialogue device of the present invention is described with reference to the accompanying drawings. The user may hold the right to speak not only while the user is speaking but also when the computer has finished its utterance and the right to speak is expected to pass to the user. This embodiment determines whether the user is in difficulty when the computer has finished speaking and the right to speak has been passed to the user, yet the user does not speak. When the user does not speak in this situation, the possibilities are that the user is "troubled", having failed to hear or understand the computer's utterance, or that the user is "thinking" about how to respond.

【００２４】As shown in FIG. 5, the configuration of this embodiment adds, to the configuration of the first embodiment, emotion determination means 8 for determining whether the user is having difficulty replying. The operation for determining the user's emotion in this embodiment is described below. In the speaking right control means 3, the location of the speaking right is first determined, together with time information, from the information passed from the history storage means 7. The information from each of the recognition means 2-1, 2-2, ... for the period from time T3, at which the speaking right passed from the computer to the user, to T3 + WT (for example, 2 seconds) is then sent to the emotion determination means 8.
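The selection of recognition results for the window from T3 to T3 + WT can be sketched as follows. This is only an illustration: the `Event` record, the mode names, and the rule that an event is selected when it overlaps the window are assumptions, not details given in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical recognition result with start and end times in seconds."""
    mode: str     # e.g. "eye_contact", "voice", "rising_pitch" (assumed names)
    start: float
    end: float


def events_in_window(events, t3, wt=2.0):
    """Return the events that overlap the interval [t3, t3 + wt].

    Overlap is an assumed selection rule; the specification only says that
    information "between T3 and T3 + WT" is passed on.
    """
    lo, hi = t3, t3 + wt
    return [e for e in events if e.start <= hi and e.end >= lo]


events = [
    Event("voice", 9.0, 9.8),          # ends before T3, so it is dropped
    Event("eye_contact", 10.2, 11.0),  # inside the window
    Event("rising_pitch", 12.5, 12.9), # starts after T3 + WT, so it is dropped
]
selected = events_in_window(events, t3=10.0)
```

With T3 = 10.0 and WT = 2.0, only the eye-contact event would be forwarded to the emotion determination step.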

【００２５】FIG. 6 is a diagram showing the output states of the recognition means, for explaining an example of the processing in the emotion determination means 8. The emotion determination means 8 determines whether the user is in trouble or thinking by Bayes identification on "eye contact", "presence or absence of voice", and "rising pitch". If Bayes identification determines that the user is "in trouble", the fact that "the user is in trouble" is conveyed to the dialogue management means 4. The dialogue management means 4 then has the computer output, through the output means 5, for example the utterance "What's wrong?" or a CG face image with a "What's wrong?" expression. If, on the other hand, Bayes identification determines that the user is "thinking", the fact that "the user is thinking" is conveyed to the dialogue management means 4. The dialogue management means 4 then has the computer output, through the output means 5, for example a CG face image with its head tilted, so that it simply keeps waiting for the user's response.

【００２６】Next, an example of Bayes identification for distinguishing whether the user is in trouble or thinking will be described. In this case, for data in the multimodal dialogue database in which the user is in trouble or thinking during the period from time T3, at which the computer finished speaking, to T3 + WT, the presence or absence of "eye contact", "voice", and "rising pitch" is examined, and each result is treated as one item of learning data. For example, as a data example in which the user is "in trouble", if between T3 and T3 + WT there was "eye contact" and "voice" of a certain length, and that voice had a "rising pitch" (FIG. 6(A)), the learning data K 1 1 1 is obtained. The leading K means "in trouble", and each following digit is 1 if the corresponding state is present and 0 if it is absent. As another data example in which the user is "in trouble", if "eye contact" was observed but "voice" was not (FIG. 6(B)), the learning data K 1 0 0 is obtained. On the other hand, as a data example in which the user is "thinking", if neither "eye contact" nor "voice" was observed (FIG. 6(C)), the learning data C 0 0 0 is obtained. The leading C means "thinking".
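The learning-data notation described above (a label letter followed by three presence bits) can be illustrated with a short sketch. The function name `encode`, the mode names, and the compact string form without spaces are assumptions made for this example.

```python
# Order of the three states as listed in the text:
# eye contact, voice, rising pitch (mode names are assumed).
MODES = ("eye_contact", "voice", "rising_pitch")


def encode(label, observed):
    """Build a learning-data code such as "K111".

    label    -- "K" (in trouble) or "C" (thinking)
    observed -- set of mode names that were present in the window
    """
    if label not in ("K", "C"):
        raise ValueError("label must be 'K' or 'C'")
    bits = "".join("1" if mode in observed else "0" for mode in MODES)
    return label + bits


fig_6a = encode("K", {"eye_contact", "voice", "rising_pitch"})  # "K111"
fig_6b = encode("K", {"eye_contact"})                           # "K100"
fig_6c = encode("C", set())                                     # "C000"
```

The three calls reproduce the codes that the text gives for FIG. 6(A), (B), and (C).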

【００２７】A large amount of such learning data is prepared in advance. When recognition data such as "1 1 1" is given (there was "eye contact" and "voice" of a certain length, and that voice had a "rising pitch"), the numbers of "K 1 1 1" and "C 1 1 1" items in the learning data are compared. If "K 1 1 1" is more numerous, the user is judged to be "in trouble"; if "C 1 1 1" is more numerous, the user is judged to be "thinking". If the two counts are equal (including the case where both are 0), the result is "unknown"; this is conveyed to the dialogue management means 4, and the dialogue management means 4 outputs, through the output means 5, for example the utterance "What's wrong?" or a CG face image with a "What's wrong?" expression.
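The count comparison described above can be sketched as below. The training list is invented purely for illustration, and the function and result names are assumptions. Comparing the two counts is the same as choosing the label with the larger empirical joint frequency for the observed bit pattern.

```python
from collections import Counter


def classify(training, bits):
    """Decide "troubled", "thinking", or "unknown" for a bit pattern.

    training -- iterable of codes such as "K111" or "C000"
    bits     -- observed pattern such as "111"
    """
    counts = Counter(training)
    k = counts["K" + bits]  # a missing key counts as 0
    c = counts["C" + bits]
    if k > c:
        return "troubled"
    if c > k:
        return "thinking"
    return "unknown"        # equal counts, including both zero


# Made-up learning data, for illustration only.
training = ["K111", "K111", "C111", "K100", "C000", "C000"]

result_111 = classify(training, "111")  # 2 x K111 vs 1 x C111
result_000 = classify(training, "000")  # 0 x K000 vs 2 x C000
result_010 = classify(training, "010")  # no examples of either label
```

A pattern that never occurs in the learning data falls through to "unknown", which matches the tie rule in the text.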

【００２８】In this way, when the computer has finished speaking and the speaking right has been handed to the user, yet the user does not speak, it is determined whether the user is having difficulty replying. If the user is in trouble, the speaking right can be returned to the computer; if the user is thinking, control can be performed so that the speaking right stays with the user while the computer waits for the user's response. By understanding the user's emotion in this way, situations in which the user is stuck for a reply can be avoided, and a smooth dialogue with the computer can be achieved without interrupting the user's thinking. Although the computer responds here with "What's wrong?", it may of course instead repeat its utterance, or change the content of the utterance and address the user again.

【００２９】[0029]

[Effects of the Invention]

Effect of claim 1: From the features of a plurality of signals, such as audio and video, that arise in response to the user's behavior (operation state), it is possible to determine whether the speaking right lies with the computer or the user, and to control the output of voice and visible signals to the user according to the location of the speaking right. A dialogue device capable of carrying the dialogue smoothly can therefore be provided.

【００３０】Effect of claim 2: In addition to the effect of claim 1, a message prompting dialogue is generated on the computer side according to the location of the speaking right and output to the user, which allows the dialogue to proceed more smoothly.

【００３１】Effect of claim 3: In addition to the effect of claim 1 or 2, the further provision of time assigning means that outputs time information makes it possible to capture the temporal correlation among the features of the plurality of signals that arise in response to the user's behavior. As a result, the location of the speaking right can be determined with fewer errors.

【００３２】Effect of claim 4: In addition to the effects of claims 1 to 3, the location of the speaking right is determined from the volume of the user's voice (voice power), the pitch of the voice (voice pitch), the gaze direction, and the vertical swing of the head (nodding). As a result, the location of the speaking right can be determined with fewer errors.

【００３３】Effect of claim 5: In addition to the effects of claims 1 to 4, the volume of the user's voice (voice power), the pitch of the voice (voice pitch), and the gaze direction are recognized and the user's emotion is determined from the results. The computer can thereby respond in a way that helps the user speak, which avoids confusing the user and yields a more satisfying dialogue device.

[Brief description of drawings]

FIG. 1 is a block diagram showing a schematic configuration of an embodiment of the dialogue device of the present invention.

FIG. 2 is a diagram showing the output states of the recognition means, for explaining an example of the processing in the speaking right control means when the speaking right lies with the system (computer) in an embodiment of the present invention.

FIG. 3 is a diagram for explaining the Bayes identification method, showing data in the multimodal dialogue database.

FIG. 4 is a diagram showing an example of the processing in the speaking right control means when the speaking right lies with the user in an embodiment of the present invention.

FIG. 5 is a block diagram showing a schematic configuration of an embodiment of the dialogue device of the present invention.

FIG. 6 is a diagram showing the output states of the recognition means, for explaining an example of the processing of the emotion determination means in an embodiment of the present invention.

[Explanation of symbols]

1, 1-1, 1-2 ... input data; 2, 2-1, 2-2 ... recognition means; 3 ... speaking right control means; 4 ... dialogue management means; 5, 5-1, 5-2 ... output means; 6 ... time assigning means; 7 ... history storage means; 8 ... emotion determination means.

Claims

[Claims]

1. A dialogue device comprising: input means for inputting voice uttered by a human or the like and visualizable expressions of the human or the like; output means for outputting audible voice and visible signals to the human or the like; recognition means for recognizing a user's operation state based on signals from the input means; and processing/control means for processing the output of the recognition means and controlling the output means, characterized in that the recognition means outputs, as a recognition result, the user's operation state in each of a plurality of predetermined operation modes, and the processing/control means judges the situation of the dialogue from the recognition result, performs processing to determine the location of the speaking right, and controls the output means to produce output according to the result of that processing.

2. The dialogue device according to claim 1, further comprising output generation means for generating an output corresponding to the result of the processing in the processing/control means, wherein voice and a visible signal generated by the output generation means according to the situation of the dialogue and the location of the speaking right are output.

3. The dialogue device according to claim 1 or 2, further comprising time assigning means for assigning start and end times to the user's operation states recognized by the recognition means, wherein the time assigning means is used for time management of operations in the processing/control means.

4. The dialogue device according to any one of claims 1 to 3, wherein the plurality of predetermined operation modes are voice volume, voice pitch, gaze direction, and vertical swing of the head, and the operation state in each of these modes is recognized.

5. The dialogue device according to any one of claims 1 to 4, wherein the plurality of predetermined modes are voice volume, voice pitch, and gaze direction, the operation state in each of these modes is recognized, the user's emotion is judged from the recognition result, and the location of the speaking right is re-determined as either the user or the device.