JP2011175304A

JP2011175304A - Voice interactive device and method

Info

Publication number: JP2011175304A
Application number: JP2011131432A
Authority: JP
Inventors: Tomonori Irie; 友紀入江; Kunio Yokoi; 邦雄横井; Katsushi Asami; 克志浅見
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2011-06-13
Filing date: 2011-06-13
Publication date: 2011-09-08

Abstract

PROBLEM TO BE SOLVED: To provide a voice interactive device and method capable of responding at a suitable timing. SOLUTION: First, a word string is extracted from the input voice (S30), and the speaking speed of the input voice is calculated (S40). Next, the extracted word string is compared with an appearance probability list, which stores word strings predicted to follow the present input word string (hereunder, following predicted word strings) together with the appearance probability corresponding to each following predicted word string; from the following predicted word strings predicted to follow the extracted word string, the one with the highest appearance probability is extracted (S50). Using the calculated speaking speed, the time required for the following predicted word string to be input (hereunder, following input time) is calculated (S60). The appearance probability of the extracted following predicted word string is then taken as a confidence factor and assigned to the extracted following predicted word string (S70). When the assigned confidence factor is equal to or greater than a response determination value, the output timing prediction is confirmed (S80), and the response is output once the following input time has elapsed. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a voice interactive device and method that respond to speech uttered by a user.

Among conventional voice interactive devices, some are known to reduce the user's anxiety about voice input by indicating that speech is being received, and others prompt further input when a gap opens in the dialogue. As one such approach, techniques have been proposed in which an anthropomorphic artificial agent gives backchannel responses (aizuchi) or nods during the dialogue, thereby making clear to the human an intention such as "I am listening" or "please go on", with the aim of letting the dialogue between the human and the artificial agent proceed smoothly (see, for example, Patent Documents 1, 2, 3, 4, 5, and 6).

For example, the voice dialogue system disclosed in Patent Document 1 determines the response timing and the semantic-processing timing based on speech recognition results, time-series pitch information, time-series gaze information, dependency information, and the like. It is configured to utter a backchannel, or a keyword from the ongoing utterance, when it determines that the moment is a response timing but not a semantic-processing timing.

The backchannel robot disclosed in Patent Document 2 is configured to read out the corresponding registered backchannel data (such as "That's right") when the recognition result matches an entry in its voice registration table.

Patent Documents 3, 4, 5, and 6 disclose techniques that output a backchannel when a predetermined pause or keyword is detected.

Patent Document 1: JP-A-2005-196134
Patent Document 2: JP-A-2003-88686
Patent Document 3: JP-A-7-191687
Patent Document 4: JP-A-7-219961
Patent Document 5: JP-A-8-211986
Patent Document 6: JP-A-2004-86001

However, the techniques described in Patent Documents 1 to 6 make their determination from features observed immediately before the moment at which the backchannel or nod should occur, such as pauses or linguistic information such as keywords. Because processing takes time between that determination and the actual backchannel or nod, it is difficult to insert the backchannel or nod at an appropriate timing.

A backchannel at such an inappropriate timing may instead leave a bad impression, for example that the speaker is being interrupted (Miyake et al., "Realization of backchannel recognition/generation functions in a spoken dialogue system using linguistic and prosodic information", 2005 Autumn Meeting of the Acoustical Society of Japan, 1-P-20, pp. 191-192), or may halt the flow of the utterance (Takeuchi et al., "A backchannel generation system using prosodic information and its evaluation", 64th National Convention of the Information Processing Society of Japan, Vol. 2, pp. 101-102), and can thus disrupt the rhythm of the dialogue.

Moreover, in human-to-human dialogue many backchannels overlap the partner's utterance (see, for example, Nakazato, "Corpus-based temporal analysis and discussion of backchannels", Japanese Society for Artificial Intelligence SIG material, SIG-SLUD-A003-7 (3/2)). A method that decides when to give a backchannel or nod from a pause or an utterance-final expression cannot produce backchannels that overlap the utterance.

The present invention has been made in view of these problems, and its object is to provide a voice interactive device and method capable of responding at an appropriate timing.

In the voice interactive device according to claims 1 to 5, made to achieve the above object, input means receives speech uttered by a user, and prediction means acquires, based on the speech input to the input means, timing prediction information indicating the predicted result for the response timing at which a response to the input speech is to be made. Timing determination means then determines, based on the timing prediction information acquired by the prediction means, whether the response timing has arrived. Response means makes the response when the timing determination means determines that the response timing has arrived.

With the voice interactive device so configured, the prediction means can predict the response timing in advance, so the processing time needed for the response means to start responding can be secured. In other words, the situation in which the response timing arrives before the response means is able to respond can be suppressed, which yields the excellent effect that the device can respond at an appropriate timing.

In the voice interactive device according to claims 1 to 5, the prediction means acquires the timing prediction information at least one character before the user's utterance ends, or at least one character before a point at which a response can be interjected during the user's utterance.

With the voice interactive device so configured, a time corresponding to the utterance of at least one character can be secured between the moment the prediction means predicts the response timing and the arrival of the response timing.

In the voice interactive device according to any one of claims 1 to 5, the prediction means may specifically, as described in claim 6, acquire the timing prediction information by comparing the features of a prediction model, in which the response timing is determined in advance, with the features of the speech input by the input means.

Further, in the voice interactive device according to claim 6, the features are preferably, as described in claim 7, at least one of syntactic features, which indicate syntactic characteristics of the user's utterance, and prosodic features, which indicate its prosodic characteristics.

These syntactic and prosodic features can be obtained incrementally from the input speech. With the voice interactive device so configured, the response timing can therefore be predicted continuously during voice input by successively comparing the features with the prediction model.

In the voice interactive device according to claim 7, as described in claim 8, the syntactic features may be information including at least one of a preset keyword, a word string, a morpheme string, a part-of-speech string, and a phoneme string, and the prosodic features may be information including at least one of utterance length, time-series fundamental frequency information, time-series pitch information, time-series power information, and time-series speech rate information.

In the voice interactive device according to claim 4, the timing prediction information is at least one of the number of words, the number of morphemes, the number of parts of speech, and the number of phonemes that follow between the moment the prediction means acquires the timing prediction information and the arrival of the response timing. In the voice interactive device according to claim 5, it is at least one of the word string, morpheme string, part-of-speech string, and phoneme string that follow over the same interval.

The voice interactive device according to claim 1 further includes speech rate calculation means that calculates the current speech rate based on the speech input to the input means. The prediction means acquires at least one of the number of words, number of morphemes, number of parts of speech, number of phonemes, word string, morpheme string, part-of-speech string, and phoneme string that follow between the moment it acquires the timing prediction information and the arrival of the response timing. Based on these and the speech rate calculated by the speech rate calculation means, it calculates a response timing arrival time, that is, the time remaining until the response timing, and uses this response timing arrival time as the timing prediction information.

With the voice interactive device so configured, the response timing can be adjusted according to the speech rate.

In the voice interactive device according to claim 2, the prediction means predicts the time that will elapse between its acquisition of the timing prediction information and the arrival of the response timing, and uses this predicted time as the timing prediction information. In the voice interactive device according to claim 3, the prediction means predicts the number of frames that follow over the same interval, and uses this predicted number of frames as the timing prediction information.

In the voice interactive device according to any one of claims 1 to 8, the response timing is preferably, as described in claim 9, a timing at which the response overlaps the user's utterance.

With the voice interactive device so configured, the exchange can be brought closer to the way humans converse with one another, and the dialogue can proceed more smoothly.

In the voice interaction method according to claims 10 to 14, speech uttered by a user is first input in an input step. In the subsequent prediction step, timing prediction information indicating the predicted result for the response timing at which a response to the input speech is to be made is acquired based on the speech input in the input step. In a timing determination step, it is then determined, based on the timing prediction information acquired in the prediction step, whether the response timing has arrived. In a response step, the response is made when the timing determination step determines that the response timing has arrived.

This voice interaction method is the method executed by the voice interactive device according to claims 1 to 5, and executing it provides the same effects as that device.

In the voice interaction method according to claims 10 to 14, the prediction step acquires the timing prediction information at least one character before the user's utterance ends, or at least one character before a point at which a response can be interjected during the user's utterance.

In the voice interaction method according to any one of claims 10 to 14, the prediction step may specifically, as described in claim 15, acquire the timing prediction information by comparing the features of a prediction model, in which the response timing is determined in advance, with the features of the speech input in the input step.

Further, in the voice interaction method according to claim 15, the features are preferably, as described in claim 16, at least one of syntactic features, which indicate syntactic characteristics of the user's utterance, and prosodic features, which indicate its prosodic characteristics.

This voice interaction method is the method executed by the voice interactive device according to claim 7, and executing it provides the same effects as that device.

In the voice interaction method according to claim 16, as described in claim 17, the syntactic features may be information including at least one of a preset keyword, a word string, a morpheme string, a part-of-speech string, and a phoneme string, and the prosodic features may be information including at least one of utterance length, time-series fundamental frequency information, time-series pitch information, time-series power information, and time-series speech rate information.

In the voice interaction method according to claim 13, the timing prediction information is at least one of the number of words, the number of morphemes, the number of parts of speech, and the number of phonemes that follow between the acquisition of the timing prediction information in the prediction step and the arrival of the response timing. In the voice interaction method according to claim 14, it is at least one of the word string, morpheme string, part-of-speech string, and phoneme string that follow over the same interval.

The voice interaction method according to claim 10 further includes a speech rate calculation step that calculates the current speech rate based on the speech input in the input step. The prediction step acquires at least one of the number of words, number of morphemes, number of parts of speech, number of phonemes, word string, morpheme string, part-of-speech string, and phoneme string that follow between the acquisition of the timing prediction information in the prediction step and the arrival of the response timing. Based on these and the speech rate calculated in the speech rate calculation step, it calculates a response timing arrival time, that is, the time remaining until the response timing, and uses this response timing arrival time as the timing prediction information.

This voice interaction method is the method executed by the voice interactive device according to claim 1, and executing it provides the same effects as that device.

In the voice interaction method according to claim 11, the prediction step predicts the time that will elapse between the acquisition of the timing prediction information and the arrival of the response timing, and uses this predicted time as the timing prediction information. In the voice interaction method according to claim 12, the prediction step predicts the number of frames that follow over the same interval, and uses this predicted number of frames as the timing prediction information.

In the voice interaction method according to any one of claims 10 to 17, the response timing is preferably, as described in claim 18, a timing at which the response overlaps the user's utterance.

This voice interaction method is the method executed by the voice interactive device according to claim 9, and executing it provides the same effects as that device.

FIG. 1 is a block diagram showing the configuration of the voice interactive device 1.
FIG. 2 is a functional block diagram outlining the processing executed by the control unit 4.
FIG. 3 is a flowchart showing the voice dialogue process.
FIG. 4 is a diagram explaining the method of output timing prediction.
FIG. 5 is a diagram showing the contents of the volume parameter list 21.
FIG. 6 is a diagram explaining a conventional output timing determination method and the output timing determination method of the voice interactive device 1.

An embodiment of the present invention is described below with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of the voice interactive device 1 of this embodiment.
As shown in FIG. 1, the voice interactive device 1 includes a voice input unit 2 that receives speech uttered by the user, a voice output unit 3 that outputs speech, and a control unit 4 that executes various processes in response to input from the voice input unit 2 and controls the voice output unit 3.

Of these, the voice input unit 2 outputs to the control unit 4 an electric signal (voice signal) based on the input speech when the user inputs speech (speaks).
The control unit 4 is built around a well-known microcomputer comprising a CPU, ROM, RAM, I/O, and a bus line connecting these components, and executes various processes based on programs stored in the ROM and RAM.

FIG. 2 is a functional block diagram outlining the processing executed by the control unit 4.
As shown in FIG. 2, the control unit 4 includes: a speech recognition unit 11 that performs recognition processing on the speech input through the voice input unit 2; a response generation unit 12 that generates, based on the recognition result of the speech recognition unit 11, a response for advancing the dialogue (for example, a backchannel, or a reply such as "It will be sunny tomorrow" to the voice input "How will the weather be tomorrow?"); an output unit 13 that causes the voice output unit 3 to output the response generated by the response generation unit 12; an output timing prediction unit 14 that predicts, based on the recognition result of the speech recognition unit 11, the timing at which a response such as a backchannel is to be output; an output timing control unit 15 that causes the output unit 13 to have the voice output unit 3 output at the output timing predicted by the output timing prediction unit 14; a response changing unit 16 that causes the output unit 13 to change the response given by the voice output unit 3 based on the prediction result of the output timing prediction unit 14; and a model storage unit 17 that stores the prediction model used for prediction by the output timing prediction unit 14 (for example, a model created in advance from training data such as a corpus).

Of these, the model storage unit 17 stores, as the prediction model, an appearance probability list 17a that holds word strings predicted to follow the current input word string (hereinafter also called subsequent predicted word strings) together with the appearance probability corresponding to each subsequent predicted word string.
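One way to picture the appearance probability list 17a is as a lookup table keyed by the input word string. The sketch below is only an illustration: the dictionary layout and the names `APPEARANCE_PROBABILITY_LIST` and `subsequent_candidates` are assumptions, not taken from the patent. The entries reuse the example values that the description gives for FIG. 4, with the pause written as `<pause>`.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory form of the appearance probability list 17a.
# Key: input word string heard so far.
# Value: (subsequent predicted word string, appearance probability) pairs.
APPEARANCE_PROBABILITY_LIST: Dict[str, List[Tuple[str, float]]] = {
    "すごく": [
        ("うれしいね", 0.015),
        ("おもしろかった", 0.013),
        ("欲しいものです", 0.002),
    ],
    "すごくおもしろかっ": [
        ("た<pause>", 0.235),
        ("たよ", 0.186),
        ("たと思う", 0.008),
    ],
}


def subsequent_candidates(input_words: str) -> List[Tuple[str, float]]:
    """Return the candidates listed for an input word string, or [] if none."""
    return APPEARANCE_PROBABILITY_LIST.get(input_words, [])
```

A real model built from a corpus would hold far more entries and would probably need a more compact index; the flat dictionary is chosen here only for readability.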

In the voice interactive device 1 configured as described above, the control unit 4 executes a voice dialogue process that conducts a dialogue based on the input speech.
The procedure of the voice dialogue process executed by the control unit 4 of the voice interactive device 1 is described here with reference to FIG. 3 and FIG. 4. FIG. 3 is a flowchart showing the voice dialogue process, and FIG. 4 is a diagram explaining the method of output timing prediction.

The voice dialogue process is executed repeatedly while the control unit 4 is running (powered on).
When the voice dialogue process starts, the control unit 4 first determines in S10 whether speech has been input to the voice input unit 2. If no speech has been input (S10), the voice dialogue process ends for the time being. If speech has been input (S10), speech recognition is performed in S20 on the speech input to the voice input unit 2.

Then, in S30, a word string is extracted from the speech input to the voice input unit 2, based on the speech recognition result of S20. Further, in S40, the speech rate of the speech input to the voice input unit 2 is calculated, also based on the speech recognition result of S20.

In S50, the word string extracted in S30 is compared with the appearance probability list 17a, and from the word strings predicted to follow the word string extracted in S30 (the subsequent predicted word strings), the one with the highest appearance probability is extracted. In S60, the speech rate calculated in S40 is used to calculate the time it will take for the subsequent predicted word string extracted in S50 to be input (hereinafter also called the subsequent input time). Then, in S70, the appearance probability of the subsequent predicted word string extracted in S50 is assigned to that word string as its certainty factor.
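Steps S50 to S70 can be sketched as a single function. The patent does not state the unit in which the speech rate is measured, so this sketch assumes each candidate comes with a count of speech units (for example morae) and that the speech rate is given in those units per second. The names `Prediction` and `predict` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    subsequent_words: str         # S50: most probable subsequent predicted word string
    subsequent_input_time: float  # S60: seconds expected for it to be uttered
    certainty: float              # S70: set equal to the appearance probability


def predict(
    candidates: List[Tuple[str, int, float]],  # (word string, unit count, probability)
    speech_rate: float,                        # assumed: speech units per second
) -> Optional[Prediction]:
    """Pick the most probable candidate and estimate how long it takes to say."""
    if not candidates or speech_rate <= 0:
        return None
    words, n_units, probability = max(candidates, key=lambda c: c[2])  # S50
    subsequent_input_time = n_units / speech_rate                      # S60
    return Prediction(words, subsequent_input_time, probability)       # S70
```

For instance, with a hypothetical rate of 8 units per second, a one-unit candidate yields a subsequent input time of 0.125 s.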

Then, in S80, it is determined whether the certainty factor assigned to the subsequent predicted word string extracted in S50 is equal to or greater than a preset response determination value (for example, 0.1 in this embodiment); in other words, whether the output timing prediction can be confirmed. If the certainty factor is less than the response determination value, it is determined that the output timing prediction cannot be confirmed (S80), and the process returns to S10 and repeats the steps above. If the certainty factor is equal to or greater than the response determination value, it is determined that the output timing prediction can be confirmed (S80), and the process proceeds to S90.

A specific example of the processing of S50 to S80 is described here with reference to FIG. 4. First, as shown in FIG. 4, when the word string "すごく" (sugoku, "very") is input, candidates for its subsequent predicted word string are drawn from the appearance probability list 17a, such as "うれしいね" (ureshii ne; appearance probability 0.015), "おもしろかった" (omoshirokatta; appearance probability 0.013), and "欲しいものです" (hoshii mono desu; appearance probability 0.002). Of these, the one with the highest appearance probability, here "うれしいね" (0.015), is extracted as the subsequent predicted word string (S50). The subsequent predicted word string "うれしいね" is then assigned a certainty factor of 0.015, equal to its appearance probability (S70). At this point, however, the assigned certainty factor is less than the response determination value (0.1), so the output timing prediction is not confirmed (S80).

Next, when the word string "おもしろかっ" (omoshirokat-) is input, candidates for the subsequent predicted word string of "すごくおもしろかっ" are drawn from the appearance probability list 17a, such as "た<pause>" (ta followed by a pause; appearance probability 0.235), "たよ" (ta yo; appearance probability 0.186), and "たと思う" (ta to omou; appearance probability 0.008). Of these, the one with the highest appearance probability, here "た<pause>" (0.235), is extracted as the subsequent predicted word string (S50). The subsequent predicted word string "た<pause>" is then assigned a certainty factor of 0.235, equal to its appearance probability (S70). At this point the assigned certainty factor is equal to or greater than the response determination value (0.1), so the output timing prediction is confirmed (S80).
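The S80 decision in this example reduces to a threshold comparison. The following minimal sketch replays the two stages of the FIG. 4 example; the function name `timing_prediction_confirmed` is an assumption, while the threshold 0.1 and the certainty factors are the values stated in the description.

```python
RESPONSE_DETERMINATION_VALUE = 0.1  # example value given for this embodiment


def timing_prediction_confirmed(
    certainty: float, threshold: float = RESPONSE_DETERMINATION_VALUE
) -> bool:
    """S80: confirm the prediction when certainty is at or above the threshold."""
    return certainty >= threshold


# The two stages of the FIG. 4 example: (input so far, best candidate, certainty).
stages = [
    ("すごく", "うれしいね", 0.015),
    ("すごくおもしろかっ", "た<pause>", 0.235),
]
for heard, best, certainty in stages:
    state = "confirmed" if timing_prediction_confirmed(certainty) else "not confirmed"
    print(f"{heard} -> {best} ({certainty}): {state}")
```

Only the second stage passes the threshold, which matches the description: the device keeps listening after "すごく" and commits to a timing after "すごくおもしろかっ".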

Returning to FIG. 3, in S90 the manner of response is changed according to the certainty factor assigned in S70. Specifically, a volume parameter is set so that the response volume increases in proportion to the certainty factor assigned in S70.
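The text only says that the volume grows in proportion to the certainty factor; the actual mapping (the volume parameter list 21 of FIG. 5) is not reproduced in this text. The sketch below therefore assumes a simple linear mapping with a cap. The function name, the `gain` parameter, and the 0 to 100 volume scale are all hypothetical.

```python
def volume_parameter(
    certainty: float, gain: float = 100.0, max_volume: float = 100.0
) -> float:
    """S90 sketch: volume proportional to certainty, capped at max_volume."""
    if not 0.0 <= certainty <= 1.0:
        raise ValueError("certainty must be a probability between 0 and 1")
    return min(max_volume, gain * certainty)
```

The practical effect is that a weakly supported prediction produces a quieter backchannel, so a mistimed response is less intrusive.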

In S100, it is determined whether the subsequent input time calculated in S60 has elapsed since the output timing prediction was confirmed in S80; in other words, whether the output timing has arrived. If the subsequent input time has not yet elapsed (S100), the processing of S100 is repeated. If the subsequent input time has elapsed, it is determined that the output timing has arrived (S100); in S110, the voice output unit 3 is made to output the response generated by the response generation unit 12 at the volume corresponding to the volume parameter set in S90, and the voice dialogue process ends for the time being.
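The wait in S100 followed by the output in S110 can be sketched as a polling loop. The injectable `clock` and `sleep` parameters and the polling interval are assumptions added so the loop can be exercised without real delays; `respond` stands in for the output through the voice output unit 3.

```python
import time
from typing import Callable


def wait_and_respond(
    confirmed_at: float,            # clock value when S80 confirmed the prediction
    subsequent_input_time: float,   # seconds, as calculated in S60
    respond: Callable[[], None],    # stand-in for output via the voice output unit
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    poll_interval: float = 0.005,   # hypothetical polling period
) -> float:
    """Repeat S100 until the subsequent input time has elapsed, then run S110."""
    while clock() - confirmed_at < subsequent_input_time:  # S100
        sleep(poll_interval)
    respond()                                              # S110
    return clock()
```

A call such as `wait_and_respond(time.monotonic(), 0.125, play_backchannel)` would block for roughly 0.125 s and then trigger the response, where `play_backchannel` is whatever hypothetical function emits the audio.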

According to the voice interactive apparatus 1 configured in this way, the output timing can be predicted in advance by the processing of S50 to S80, so the processing time needed to start outputting the response in S110 (hereinafter also called the response processing time) can be secured. In other words, the situation in which the output timing arrives before output of the response can begin is suppressed, giving the excellent effect that the response can be made at an appropriate timing.

In addition, the output timing can be set to a timing at which the response overlaps the user's utterance. This brings the exchange closer to the way humans converse with one another and allows the dialogue to proceed more smoothly.

Specifically, in the conventional approach shown in FIG. 6(a), for an utterance such as 「すごくおもしろかったよ＜ポーズ＞」 ("It was really interesting <pause>"), whether the end of the sentence has been reached is judged at time HT1, when the sentence-final particle 「よ」 is uttered, or at time HT2, the time of the pause, and the response is output at time OT1 or OT2, when the response processing time SJ1 has elapsed after that judgment. It is therefore difficult to output a response immediately after the utterance ends or to output a response that overlaps the end of the sentence.

By contrast, as shown in FIG. 6(b), the voice interactive apparatus 1 predicts the time at which the utterance will end before it actually ends, for example at time HT3, when 「すごくおもしろかっ」 has been uttered. Consequently, at time OT3, when the response processing time SJ1 has elapsed after this prediction, the utterance has not yet ended. This makes it possible to output a response immediately after the utterance ends or to output a response that overlaps the end of the sentence.

In the processing of S50, the word string extracted in S30 is compared with the appearance probability list 17a, and the subsequent predicted word string is extracted at least one character before the user's utterance ends. A response processing time at least as long as the time taken to utter one character can therefore be secured.

Furthermore, because the word string extracted in S30 can be obtained incrementally from the input voice, comparing it incrementally with the appearance probability list 17a in S50 allows the output timing to be predicted continuously while voice is being input.

In the processing of S60, the speech speed calculated in S40 is used to calculate the time required for the subsequent predicted word string extracted in S50 to be input (the subsequent input time). The output timing can therefore be adjusted according to the speech speed.
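
One simple way to realize S60 is to divide the length of the subsequent predicted word string by the speech speed. The sketch below assumes, purely for illustration, that speech speed is measured in characters per second, that the pause token contributes no duration, and that written characters approximate spoken units (a simplification, since a kanji can span more than one mora); none of these choices is specified in the text.

```python
PAUSE_TOKEN = "＜ポーズ＞"  # pause marker as written in the appearance probability list


def subsequent_input_time(subsequent: str, speech_speed: float) -> float:
    """Estimate the time in seconds needed to utter the subsequent predicted word string (S60).

    speech_speed is assumed to be in characters per second (hypothetical unit).
    """
    if speech_speed <= 0:
        raise ValueError("speech speed must be positive")
    # Count only the spoken characters; the pause token is not an utterance.
    spoken_characters = len(subsequent.replace(PAUSE_TOKEN, ""))
    return spoken_characters / speech_speed
```

For instance, at an assumed 8 characters per second, 「たよ」 would take 0.25 seconds, so a faster speaker shortens the wait before the response.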

In the processing of S70, the appearance probability of the subsequent predicted word string extracted in S50 is assigned to that word string as its certainty factor, and in S80 the output timing prediction is confirmed when the certainty factor is equal to or greater than the response determination value. Because a highly reliable subsequent predicted word string can thus be selected on the basis of the certainty factor, the response can be made at a more appropriate timing.

In the processing of S90, the manner of responding is changed according to the certainty factor. Specifically, the volume parameter is set so that the volume of the response increases in proportion to the certainty factor. When the certainty factor is low, the response therefore acts on the user less strongly, which keeps the rhythm of the dialogue from being disrupted.

In the embodiment described above, the voice input unit 2 corresponds to the input means and input step of the present invention; the processing of S50 to S80 to the prediction means and prediction step; the processing of S100 to the timing determination means and timing determination step; the processing of S110 to the response means and response step; the subsequent predicted word string to the timing prediction information; and the appearance probability list 17a to the prediction model.

Although one embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can take various forms as long as they fall within the technical scope of the present invention.
For example, in the above embodiment the voice interactive apparatus 1 responds, for instance with a backchannel (aizuchi), by outputting voice from the voice output unit 3; however, the response is not limited to this and may be visual. For example, the apparatus may include an LED and respond by lighting it, may include a display and show a nodding motion on it, or may cause a robot to nod.

In the above embodiment, a word string is extracted in S30 and the output timing is predicted from it; however, the extracted information is not limited to this. A syntactic feature amount other than a word string, for example at least one of a predetermined keyword, a morpheme string, a part-of-speech string, and a phoneme string, may be extracted, or at least one prosodic feature amount, such as utterance length or time-series information on fundamental frequency, pitch, power, or speech speed, may be extracted.

In S50 of the above embodiment, the output timing is predicted by comparing the word string with the appearance probability list 17a; however, the prediction is not limited to this. It may instead use a technique that measures the distance to a model built from time-series data (for example, template matching) or another N-gram model (for example, a word N-gram, part-of-speech N-gram, or phoneme N-gram).

In S50 of the above embodiment, the subsequent word string is predicted; however, at least one of the subsequent morpheme string, part-of-speech string, and phoneme string may be predicted instead, or at least one of the subsequent number of words, number of morphemes, number of parts of speech, and number of phonemes may be predicted.

Alternatively, S50 may directly predict the time remaining until the output timing, or the number of frames remaining until the output timing. In this case, the processing for calculating the subsequent input time (S60) is unnecessary.

In S60 of the above embodiment, the speech speed is used to calculate the time required for the subsequent word string to be input (the subsequent input time); however, the calculation is not limited to this. For example, when the number of words has been predicted, the time required for the predicted number of words to be input may be calculated from the speech speed. The subsequent input time may also be calculated from the predicted number of words, number of phonemes, word string, or the like together with a predetermined time per word or a predetermined time required to input the word string.

In S70 of the above embodiment, the appearance probability of the subsequent predicted word string extracted in S50 is used as the certainty factor. Instead of or in addition to this, the certainty factor in S70 may be calculated using at least one of the following: the rate of agreement with the model, the certainty factor of the recognition result from the voice recognition unit 11, and a time constant set so that newer prediction results are treated as more reliable than older ones.

In S80 of the above embodiment, the output timing prediction is confirmed according to whether the certainty factor is equal to or greater than the response determination value; however, confirmation is not limited to this. It may be made by comparing the certainty factors attached to multiple candidates whose output timing was predicted within a certain range (for example, within a fixed time or a fixed number of words), or by combining the response determination value with such a comparison of certainty factors.
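
The alternative confirmation rule for S80 could look roughly like the following sketch, which compares the certainty factors of candidates predicted within a fixed time window and optionally applies the response determination value as well. The `Candidate` record, the one-second default window, and the function name are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    subsequent: str      # subsequent predicted word string
    certainty: float     # certainty factor assigned in S70
    predicted_at: float  # time of the prediction in seconds (hypothetical field)


def confirm_from_window(
    candidates: Iterable[Candidate],
    now: float,
    window: float = 1.0,
    threshold: Optional[float] = None,
) -> Optional[Candidate]:
    """Pick the candidate with the largest certainty factor among recent predictions."""
    # Keep only candidates predicted within the last `window` seconds.
    recent = [c for c in candidates if 0.0 <= now - c.predicted_at <= window]
    if not recent:
        return None
    best = max(recent, key=lambda c: c.certainty)
    # Optional combination with the response determination value.
    if threshold is not None and best.certainty < threshold:
        return None
    return best
```

A window expressed as a number of words rather than seconds would follow the same pattern, with the filter applied to a word index instead of a timestamp.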

In S90 of the above embodiment, the volume of the response is increased in proportion to the certainty factor; however, the setting is not limited to this. A volume parameter list 21 (see FIG. 5) that stores certainty factors together with volume parameters corresponding to subsequent predicted word strings may be provided in advance, and the volume of the response may be set by referring to this volume parameter list 21.
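
A table-driven variant of S90 might be sketched as below. The actual layout of the volume parameter list 21 in FIG. 5 is not reproduced here, so the structure (per word string, a list of minimum certainty factors and volume levels), all the values, and the fallback volume are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical contents of the volume parameter list 21: for each subsequent
# predicted word string, (minimum certainty factor, volume parameter) pairs
# ordered from the highest minimum to the lowest.
VOLUME_PARAMETER_LIST: Dict[str, List[Tuple[float, int]]] = {
    "た＜ポーズ＞": [(0.2, 3), (0.1, 2), (0.0, 1)],
    "たよ": [(0.15, 2), (0.0, 1)],
}

DEFAULT_VOLUME = 1  # assumed fallback for word strings not in the list


def lookup_volume(
    subsequent: str,
    certainty: float,
    table: Dict[str, List[Tuple[float, int]]] = VOLUME_PARAMETER_LIST,
) -> int:
    """Return the volume parameter for a word string and certainty factor."""
    for minimum_certainty, volume in table.get(subsequent, []):
        # The first row whose minimum is met gives the volume.
        if certainty >= minimum_certainty:
            return volume
    return DEFAULT_VOLUME
```

Compared with the proportional rule, a lookup of this kind allows the volume to depend on which word string was predicted as well as on how confident the prediction is.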

In S90 of the above embodiment, the manner of responding is changed according to the certainty factor; however, it may instead be changed according to the predicted word string, morpheme string, part-of-speech string, phoneme string, or the like.

Although the response in the above embodiment is given by voice, it may take the form of a nod, a blink, or the like, or may be a message or action that serves one of the typical functions of a backchannel: showing that the content of the utterance has been understood, showing that the apparatus is listening, clarifying turn-taking, expressing emotion, agreement, or disagreement, or prompting further speech. Conceivable examples include lighting an LED, changing the brightness of a display, changing the tilt of an object, changing the speed of a movement, changing the number of repetitions of a movement, changing a color or brightness, changing the voice used for the spoken response, changing the response message, and changing a CG animation.

DESCRIPTION OF SYMBOLS: 1 ... voice interactive apparatus, 2 ... voice input unit, 3 ... voice output unit, 4 ... control unit, 11 ... voice recognition unit, 12 ... response generation unit, 13 ... output unit, 14 ... output timing prediction unit, 15 ... output timing control unit, 16 ... response changing unit, 17 ... model storage unit, 17a ... appearance probability list, 21 ... volume parameter list

Claims

An input means for inputting voice uttered by the user;
Prediction means for acquiring timing prediction information indicating a prediction result of response timing for performing a response corresponding to the input voice based on the voice input to the input means;
Timing determination means for determining whether or not the response timing has been reached based on the timing prediction information acquired by the prediction means;
Response means for performing the response when it is determined by the timing determination means that the response timing has been reached, and
The prediction means includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
A speech speed calculating means for calculating a current speech speed based on the voice input to the input means;
The prediction means acquires at least one of the number of words, the number of morphemes, the number of parts of speech, the number of phonemes, the word string, the morpheme string, the part-of-speech string, and the phoneme string that follow from when the prediction means acquires the timing prediction information until the response timing is reached, calculates, based on these and the speech speed calculated by the speech speed calculating means, a response timing arrival time until the response timing is reached, and uses the response timing arrival time as the timing prediction information. A spoken dialogue apparatus characterized by the above.

An input means for inputting voice uttered by the user;
Prediction means for acquiring timing prediction information indicating a prediction result of response timing for performing a response corresponding to the input voice based on the voice input to the input means;
Timing determination means for determining whether or not the response timing has been reached based on the timing prediction information acquired by the prediction means;
Response means for performing the response when it is determined by the timing determination means that the response timing has been reached, and
The prediction means includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
A spoken dialogue apparatus characterized by predicting a time that elapses from when the prediction means acquires the timing prediction information until the response timing is reached, and using the predicted time as the timing prediction information.

An input means for inputting voice uttered by the user;
Prediction means for acquiring timing prediction information indicating a prediction result of response timing for performing a response corresponding to the input voice based on the voice input to the input means;
Timing determination means for determining whether or not the response timing has been reached based on the timing prediction information acquired by the prediction means;
Response means for performing the response when it is determined by the timing determination means that the response timing has been reached, and
The prediction means includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
A spoken dialogue apparatus characterized by predicting the number of frames that continue from when the prediction means acquires the timing prediction information until the response timing is reached, and using the predicted number of frames as the timing prediction information.

An input means for inputting voice uttered by the user;
Prediction means for acquiring timing prediction information indicating a prediction result of response timing for performing a response corresponding to the input voice based on the voice input to the input means;
Timing determination means for determining whether or not the response timing has been reached based on the timing prediction information acquired by the prediction means;
Response means for performing the response when it is determined by the timing determination means that the response timing has been reached, and
The prediction means includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
The spoken dialogue apparatus according to claim 1, wherein the timing prediction information is at least one of the number of words, the number of morphemes, the number of parts of speech, and the number of phonemes that follow from when the prediction means acquires the timing prediction information until the response timing is reached.

An input means for inputting voice uttered by the user;
Prediction means for acquiring timing prediction information indicating a prediction result of response timing for performing a response corresponding to the input voice based on the voice input to the input means;
Timing determination means for determining whether or not the response timing has been reached based on the timing prediction information acquired by the prediction means;
Response means for performing the response when it is determined by the timing determination means that the response timing has been reached, and
The prediction means includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
The timing prediction information is at least one of a word string, a morpheme string, a part-of-speech string, and a phoneme string that follow from when the prediction means acquires the timing prediction information until the response timing is reached. A spoken dialogue apparatus characterized by the above.

The spoken dialogue apparatus according to claim 1, wherein the prediction means acquires the timing prediction information by comparing, for a prediction model in which the response timing is determined in advance and the voice input by the input means, a feature amount of the prediction model with a feature amount of the input voice.

The spoken dialogue apparatus according to claim 6, wherein the feature amount is at least one of a syntactic feature amount indicating a syntactic feature and a prosodic feature amount indicating a prosodic feature of the user's utterance.

The spoken dialogue apparatus according to claim 7, wherein the syntactic feature amount is information including at least one of a preset keyword, a word string, a morpheme string, a part-of-speech string, and a phoneme string, and the prosodic feature amount is information including at least one of utterance length, time-series information on fundamental frequency, time-series information on pitch, time-series information on power, and time-series information on speech speed.

The spoken dialogue apparatus according to any one of claims 1 to 8, wherein the response timing is a timing at which the response is made overlapping the user's utterance.

An input step for inputting voice uttered by the user;
A prediction step of acquiring timing prediction information indicating a prediction result of a response timing for performing a response corresponding to the input voice based on the voice input in the input step;
A timing determination step for determining whether or not the response timing has been reached based on the timing prediction information acquired in the prediction step;
When it is determined that the response timing is reached by the timing determination step, the response step performs the response, and
The prediction step includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
A speech speed calculating step of calculating a current speech speed based on the voice input in the input step,
The prediction step acquires at least one of the number of words, the number of morphemes, the number of parts of speech, the number of phonemes, the word string, the morpheme string, the part-of-speech string, and the phoneme string that follow from when the timing prediction information is acquired in the prediction step until the response timing is reached, calculates, based on these and the speech speed calculated in the speech speed calculating step, a response timing arrival time until the response timing is reached, and uses the response timing arrival time as the timing prediction information. A voice interaction method characterized by the above.

An input step for inputting voice uttered by the user;
A prediction step of acquiring timing prediction information indicating a prediction result of a response timing for performing a response corresponding to the input voice based on the voice input in the input step;
A timing determination step for determining whether or not the response timing has been reached based on the timing prediction information acquired in the prediction step;
When it is determined that the response timing is reached by the timing determination step, the response step performs the response, and
The prediction step includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
A voice interaction method characterized by predicting a time that elapses after the timing prediction information is acquired by the prediction step until the response timing is reached, and the predicted time is used as the timing prediction information.

An input step for inputting voice uttered by the user;
A prediction step of acquiring timing prediction information indicating a prediction result of a response timing for performing a response corresponding to the input voice based on the voice input in the input step;
A timing determination step for determining whether or not the response timing has been reached based on the timing prediction information acquired in the prediction step;
When it is determined that the response timing is reached by the timing determination step, the response step performs the response, and
The prediction step includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
A voice interaction method characterized by predicting the number of frames that continue from when the timing prediction information is acquired by the prediction step until the response timing is reached, and using the predicted number of frames as the timing prediction information.

An input step for inputting voice uttered by the user;
A prediction step of acquiring timing prediction information indicating a prediction result of a response timing for performing a response corresponding to the input voice based on the voice input in the input step;
A timing determination step for determining whether or not the response timing has been reached based on the timing prediction information acquired in the prediction step;
When it is determined that the response timing is reached by the timing determination step, the response step performs the response, and
The prediction step includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
A voice interaction method characterized in that the timing prediction information is at least one of the number of words, the number of morphemes, the number of parts of speech, and the number of phonemes that follow from when the timing prediction information is acquired in the prediction step until the response timing is reached.

An input step for inputting voice uttered by the user;
A prediction step of acquiring timing prediction information indicating a prediction result of a response timing for performing a response corresponding to the input voice based on the voice input in the input step;
A timing determination step for determining whether or not the response timing has been reached based on the timing prediction information acquired in the prediction step;
When it is determined that the response timing is reached by the timing determination step, the response step performs the response, and
The prediction step includes
The timing prediction information is acquired at least one character before the end of the utterance by the user, or at least one character before the user can interrupt and perform the response.
The timing prediction information is at least one of a word string, a morpheme string, a part-of-speech string, and a phoneme string that follow from when the timing prediction information is acquired in the prediction step until the response timing is reached. A voice interaction method characterized by the above.

The voice interaction method according to claim 10, wherein the prediction step acquires the timing prediction information by comparing, for a prediction model in which the response timing is determined in advance and the voice input in the input step, a feature amount of the prediction model with a feature amount of the input voice.

The voice interaction method according to claim 15, wherein the feature amount is at least one of a syntactic feature amount indicating a syntactic feature and a prosodic feature amount indicating a prosodic feature of the user's utterance.

The voice interaction method according to claim 16, wherein the syntactic feature amount is information including at least one of a preset keyword, a word string, a morpheme string, a part-of-speech string, and a phoneme string, and the prosodic feature amount is information including at least one of utterance length, time-series information on fundamental frequency, time-series information on pitch, time-series information on power, and time-series information on speech speed.

The voice interaction method according to any one of claims 10 to 17, wherein the response timing is a timing at which the response is made overlapping the user's utterance.