JP7689787B1

JP7689787B1 - Information processing device, information processing method, and program

Info

Publication number: JP7689787B1
Application number: JP2025046733A
Authority: JP
Inventors: 寧杰白; 瑞貴土井
Original assignee: 株式会社Recho
Priority date: 2025-03-21
Filing date: 2025-03-21
Publication date: 2025-06-09
Anticipated expiration: 2045-03-21

Abstract

[Problem] To improve the efficiency of calls made by AI agents.
[Solution] An information processing device 2 includes: an acquisition unit 100 that acquires speech information regarding a target person's utterance, the speech information including first utterance information regarding a first utterance and second utterance information regarding a second utterance subsequent to the first utterance; a first response determination unit 102a that determines first response information regarding a response to at least a part of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model; a second response determination unit 102b that determines second response information regarding another response to at least a part of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into the large-scale language model before outputting the first response information to the target person when a specified condition regarding the utterance is satisfied; and an output unit 106 that outputs the first response information when the specified condition is not satisfied, and outputs at least one of the first response information and the second response information when the specified condition is satisfied.
[Selected Figure] Figure 1

Description

本開示は、情報処理装置、情報処理方法及びプログラムに関する。 The present disclosure relates to an information processing device, an information processing method, and a program.

従来、通話に音声認識を用いる技術が知られている。例えば、特許文献１には、未登録または非通知の電話番号からの着信時に、生成系ＡＩ（ＡｒｔｉｆｉｃｉａｌＩｎｔｅｌｌｉｇｅｎｃｅ）が応答し、通話終了後に用件を文書化して送信するシステムにおいて、家族や知人の確認手段や悪用歴のある番号の警察連携手段を提供する技術が記載されている。 Conventionally, technology that uses voice recognition for phone calls is known. For example, Patent Document 1 describes technology that provides a means of identifying family members or acquaintances and a means of coordinating with the police for numbers with a history of misuse in a system in which a generative AI (Artificial Intelligence) answers calls from unregistered or withheld phone numbers and documents and transmits the purpose of the call after the call ends.

特許第７５５０３３５号Patent No. 7550335

しかしながら、特許文献１に記載された技術では、ＡＩエージェントによる通話を十分に効率化することができない。例えば、対象者に対するレスポンスの遅延を抑制することに関して検討の余地がある。 However, the technology described in Patent Document 1 does not sufficiently improve the efficiency of calls made by AI agents. For example, there is room for further study regarding reducing delays in responses to the target person.

本開示は、ＡＩエージェントによる通話を効率化することができる情報処理装置、情報処理方法及びプログラムを提供する。 The present disclosure provides an information processing device, an information processing method, and a program that can improve the efficiency of calls made by AI agents.

本開示の一態様に係る情報処理装置は、対象者の発話に関する発話情報を取得する取得部であって、発話情報は、第１の発話に関する第１発話情報と、当該第１の発話より後の第２の発話に関する第２発話情報とを含む、取得部と、第１発話情報を含む第１応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する応答に関する第１応答情報を決定する第１応答決定部と、発話に関する所定の条件が満たされる場合に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する他の応答に関する第２応答情報を決定する第２応答決定部と、所定の条件が満たされない場合には第１応答情報を出力し、所定の条件が満たされる場合には第１応答情報及び第２応答情報の少なくとも一方を出力する出力部と、を備える。 An information processing device according to one aspect of the present disclosure includes an acquisition unit that acquires speech information related to an utterance of a target person, the speech information including first utterance information related to a first utterance and second utterance information related to a second utterance subsequent to the first utterance; a first response determination unit that determines first response information related to a response to at least a part of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model; a second response determination unit that determines second response information related to another response to at least a part of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into the large-scale language model before outputting the first response information to the target person when a predetermined condition related to the utterance is satisfied; and an output unit that outputs the first response information when the predetermined condition is not satisfied and outputs at least one of the first response information and the second response information when the predetermined condition is satisfied.

本開示の他の一態様に係る情報処理方法は、情報処理装置が、対象者の発話に関する発話情報を取得することであって、発話情報は、第１の発話に関する第１発話情報と、当該第１の発話より後の第２の発話に関する第２発話情報とを含む、発話情報を取得することと、第１発話情報を含む第１応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する応答に関する第１応答情報を決定することと、発話に関する所定の条件が満たされる場合に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する他の応答に関する第２応答情報を決定することと、所定の条件が満たされない場合には第１応答情報を出力し、所定の条件が満たされる場合には第１応答情報及び第２応答情報の少なくとも一方を出力することと、を実行する。 An information processing method according to another aspect of the present disclosure includes an information processing device acquiring speech information related to an utterance of a target person, the speech information including first utterance information related to a first utterance and second utterance information related to a second utterance subsequent to the first utterance, determining first response information related to a response to at least a part of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model, determining second response information related to another response to at least a part of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into the large-scale language model before outputting the first response information to the target when a predetermined condition related to the utterance is satisfied, and outputting the first response information when the predetermined condition is not satisfied, and outputting at least one of the first response information and the second response information when the predetermined condition is satisfied.

本開示の他の一態様に係るプログラムは、情報処理装置に、対象者の発話に関する発話情報を取得することであって、発話情報は、第１の発話に関する第１発話情報と、当該第１の発話より後の第２の発話に関する第２発話情報とを含む、発話情報を取得することと、第１発話情報を含む第１応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する応答に関する第１応答情報を決定することと、発話に関する所定の条件が満たされる場合に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する他の応答に関する第２応答情報を決定することと、所定の条件が満たされない場合には第１応答情報を出力し、所定の条件が満たされる場合には第１応答情報及び第２応答情報の少なくとも一方を出力することと、を実行させる。 A program according to another aspect of the present disclosure causes an information processing device to execute the following: acquire speech information regarding an utterance of a target person, the speech information including first utterance information regarding a first utterance and second utterance information regarding a second utterance subsequent to the first utterance; determine first response information regarding a response to at least a portion of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model; determine second response information regarding another response to at least a portion of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into the large-scale language model before outputting the first response information to the target when a predetermined condition regarding the utterance is satisfied; output the first response information when the predetermined condition is not satisfied, and output at least one of the first response information and the second response information when the predetermined condition is satisfied.

本開示によれば、ＡＩエージェントによる通話を効率化することができる。 This disclosure makes it possible to make calls using AI agents more efficient.

本実施形態に係るシステム１の概要を説明するための図である。FIG. 1 is a diagram for explaining an overview of a system 1 according to an embodiment of the present invention. 本実施形態に係るシステム１の機能構成例を説明するための図である。FIG. 2 is a diagram for explaining an example of a functional configuration of the system 1 according to the present embodiment. 本実施形態に係る情報処理装置２の動作例を説明するための図である。10A to 10C are diagrams for explaining an example of the operation of the information processing device 2 according to the present embodiment. 本実施形態に係る情報処理装置２の動作例を説明するための図である。10A to 10C are diagrams for explaining an example of the operation of the information processing device 2 according to the present embodiment. 本実施形態に係る情報処理装置２の動作例を説明するための図である。10A to 10C are diagrams for explaining an example of the operation of the information processing device 2 according to the present embodiment. 本実施形態に係る情報処理装置２の他の動作例を説明するための図である。FIG. 11 is a diagram for explaining another example of the operation of the information processing device 2 according to the present embodiment. 本実施形態に係る情報処理装置２の他の動作例を説明するための図である。FIG. 11 is a diagram for explaining another example of the operation of the information processing device 2 according to the present embodiment. 本実施形態に係る情報処理装置２の他の動作例を説明するための図である。FIG. 11 is a diagram for explaining another example of the operation of the information processing device 2 according to the present embodiment. 本実施形態に係るシステム１の各装置のハードウェア構成例を説明するための図である。FIG. 2 is a diagram for explaining an example of the hardware configuration of each device in the system 1 according to the present embodiment.

１概要
本実施形態では、対象者は、自身の端末装置を利用してＡＩエージェントと通話をする状況を想定する。すなわち、対象者の端末装置は、当該対象者の発話を音声入力装置（例えば、マイク等）を介して受け付けることによって音声情報を生成し、当該音声情報を情報処理装置（例えば、サーバ装置等）に対して送信し、ＡＩエージェントによる応答に関する応答情報を取得する。この応答情報は、端末装置において音声による出力される。これにより、対象者は、端末装置を介してあたかもＡＩエージェントと会話しているかのように体感することができる。本実施形態に係るシステム１（以下、単に「システム１」と称する）が解決しようとする課題の一つは、このような状況下におけるＡＩエージェントとの通話を効率化することである。 1. Overview In this embodiment, a situation is assumed in which a subject uses his/her own terminal device to talk to an AI agent. That is, the subject's terminal device generates voice information by receiving the subject's speech via a voice input device (e.g., a microphone, etc.), transmits the voice information to an information processing device (e.g., a server device, etc.), and acquires response information regarding a response by the AI agent. This response information is output by voice in the terminal device. This allows the subject to experience as if he/she is talking to the AI agent via the terminal device. One of the problems that the system 1 according to this embodiment (hereinafter simply referred to as "system 1") aims to solve is to make a call with an AI agent efficient under such circumstances.

図１を参照して、システム１の概要について説明する。システム１は、端末装置３、情報処理装置２、及びＬＬＭサーバ装置４を含む。端末装置３は、対象者が利用する装置である。情報処理装置２は、ＡＩエージェントによる通話を効率化することに関する処理の少なくとも一部を実行する装置である。ＬＬＭサーバ装置４は、大規模言語モデル（ＬＬＭ：ＬａｒｇｅＬａｎｇｕａｇｅＭｏｄｅｌ、以下「ＬＬＭ」と称する）に基づくサービスを提供する装置である。 An overview of system 1 will be described with reference to FIG. 1. System 1 includes a terminal device 3, an information processing device 2, and an LLM server device 4. The terminal device 3 is a device used by the target person. The information processing device 2 is a device that executes at least a part of the processing related to making calls by an AI agent more efficient. The LLM server device 4 is a device that provides services based on a large language model (LLM: Large Language Model, hereinafter referred to as "LLM").

端末装置３は、音声入力装置を介して対象者の発話を受け付けることによって第１音声情報を生成し、情報処理装置２に対して送信する（Ｓ１）。情報処理装置２は、第１音声情報に基づいて第１発話情報を取得する。第１発話情報は、第１音声情報を文字起こししたテキストデータを含んでよく、第１音声情報そのもの（例えば、音声データ等）を含んでもよい。 The terminal device 3 generates first voice information by receiving the speech of the target person via the voice input device, and transmits it to the information processing device 2 (S1). The information processing device 2 acquires the first speech information based on the first voice information. The first speech information may include text data that is a transcription of the first voice information, or may include the first voice information itself (e.g., voice data, etc.).

次に、情報処理装置２は、第１発話情報を含む第１応答決定指示をＬＬＭサーバ装置４に対して送信する（Ｓ２）。第１応答決定指示は、対象者の発話の少なくとも一部に対する応答に関する第１応答情報を決定するための指示を含み得る。ＬＬＭサーバ装置４は、第１応答決定指示に基づいて第１応答情報を生成し、当該第１応答情報を情報処理装置２に対して送信する（Ｓ３）。 Next, the information processing device 2 transmits a first response determination instruction including the first utterance information to the LLM server device 4 (S2). The first response determination instruction may include an instruction for determining first response information related to a response to at least a portion of the target person's utterance. The LLM server device 4 generates first response information based on the first response determination instruction and transmits the first response information to the information processing device 2 (S3).

なお、情報処理装置２は、ステップＳ３の直後の時点では第１応答情報を対象者に対して出力しない。情報処理装置２は、第１応答情報を暫定的な応答として保持したまま、端末装置３から追加の音声情報の入力を待ち受ける。そして、情報処理装置２は、対象者の発話に関する所定の条件が満たされない場合（一例では、追加の音声情報の入力がなく、発話が継続されないと判定される場合）には、第１応答情報を出力する（図示しない）。これに対して、対象者の発話に関する所定の条件が満たされる場合（一例では、追加の音声情報の入力があり、発話が継続されると判定される場合）に関して、以下で説明する。 The information processing device 2 does not output the first response information to the target person immediately after step S3. The information processing device 2 waits for the input of additional voice information from the terminal device 3 while retaining the first response information as a provisional response. Then, the information processing device 2 outputs the first response information (not shown) when a predetermined condition related to the target person's speech is not satisfied (in one example, when no additional voice information is input and it is determined that the speech will not continue). In contrast, the case where a predetermined condition related to the target person's speech is satisfied (in one example, when additional voice information is input and it is determined that the speech will continue) will be described below.

端末装置３は、音声入力装置を介して対象者の発話を受け付けることによって第２音声情報をさらに生成し、情報処理装置２に対して送信する（Ｓ４）。情報処理装置２は、第２音声情報に基づいて第２発話情報を取得する。第２発話情報は、第１発話情報と同様に、第２音声情報を文字起こししたテキストデータを含んでよく、第２音声情報そのもの（例えば、音声データ等）を含んでもよい。 The terminal device 3 further generates second voice information by receiving the target person's speech via the voice input device, and transmits it to the information processing device 2 (S4). The information processing device 2 acquires second speech information based on the second voice information. The second speech information may include text data transcribed from the second voice information, similar to the first speech information, or may include the second voice information itself (e.g., voice data, etc.).

次に、情報処理装置２は、第２発話情報と、既に取得済みの第１発話情報、及び／又は暫定的な応答として保持している第１応答情報とを含む第２応答決定指示をＬＬＭサーバ装置４に対して送信する（Ｓ５）。第２応答決定指示は、対象者の発話の少なくとも一部に対する応答に関する第２応答情報を決定するための指示を含み得る。ＬＬＭサーバ装置４は、第２応答決定指示に基づいて第２応答情報を生成し、当該第２応答情報を情報処理装置２に対して送信する（Ｓ６）。第２応答情報は、対象者による追加的な入力に相当する第２発話情報に基づいて、暫定的な応答の候補である第１応答情報を更新した情報であるということもできる。 Next, the information processing device 2 transmits a second response determination instruction to the LLM server device 4, the second response determination instruction including the second utterance information, the first utterance information already acquired, and/or the first response information held as a provisional response (S5). The second response determination instruction may include an instruction for determining second response information related to a response to at least a portion of the target person's utterance. The LLM server device 4 generates second response information based on the second response determination instruction and transmits the second response information to the information processing device 2 (S6). The second response information can also be said to be information obtained by updating the first response information, which is a candidate for a provisional response, based on the second utterance information corresponding to additional input by the target person.

次に、情報処理装置２は、第２応答情報を出力する（Ｓ７）。情報処理装置２は、対象者による追加の発話がないと判定したうえで第２応答情報を出力してもよい。 Next, the information processing device 2 outputs second response information (S7). The information processing device 2 may output the second response information after determining that there is no additional speech from the subject.

システム１によれば、対象者からすると応答のレイテンシとして感じられる時間間隔を抑制することができる。システム１は、対象者の発話に関する第１発話情報に基づいて暫定的な応答である第１応答情報を取得し（Ｓ３参照）、その後、発話に関する所定の条件が満たされない場合は当該第１応答情報を出力し、当該所定の条件が満たされる場合には第１発話情報及び／又は第１応答情報に基づいて第２応答情報を取得する（Ｓ６参照）。このような構成によれば、一例では、ユーザの発話が継続している間にも応答情報が継続的に更新され、発話が終了すると速やかに直近の更新後の応答情報が出力される。すなわち、例えば対象者による発話が終了した後に応答の生成を開始するような従来技術と比較して、対象者が感じるレイレンシを抑制することができる。その結果、ＡＩエージェントによる通話が効率化される。 According to the system 1, the time interval that the subject feels as latency of the response can be suppressed. The system 1 acquires first response information, which is a provisional response, based on first speech information related to the subject's speech (see S3), and then outputs the first response information if a predetermined condition related to the speech is not met, and acquires second response information based on the first speech information and/or the first response information if the predetermined condition is met (see S6). According to this configuration, in one example, the response information is continuously updated while the user's speech continues, and the most recently updated response information is output promptly when the speech ends. That is, compared to the conventional technology in which the generation of a response is started after the subject's speech ends, for example, the latency felt by the subject can be suppressed. As a result, the efficiency of calls by AI agents is improved.

なお、本実施形態において、説明の便宜上、例えば「更新」及び「暫定的」等の用語を用いる場合があるが、これらの用語はコンピュータによる実際の処理を限定するものではない。 In this embodiment, for the sake of convenience, terms such as "update" and "provisional" may be used, but these terms do not limit the actual processing performed by a computer.

また、以下では、第１音声情報及び第２音声情報を特に区別しない場合、又はこれらをまとめて称する場合、これらを「音声情報」と称する。同様に、第１発話情報及び第２発話情報を特に区別しない場合、又はこれらをまとめて称する場合、これらを「発話情報」と称する。同様に、第１応答情報及び第２応答情報を特に区別しない場合、又はこれらをまとめて称する場合、これらを「応答情報」と称する。同様に、第１応答決定指示及び第２応答決定指示を特に区別しない場合、又はこれらをまとめて称する場合、これらを「応答決定指示」と称する。 Furthermore, in the following, when there is no particular distinction between the first voice information and the second voice information, or when they are referred to collectively, they are referred to as "voice information". Similarly, when there is no particular distinction between the first utterance information and the second utterance information, or when they are referred to collectively, they are referred to as "utterance information". Similarly, when there is no particular distinction between the first response information and the second response information, or when they are referred to collectively, they are referred to as "response information". Similarly, when there is no particular distinction between the first response determination instruction and the second response determination instruction, or when they are referred to collectively, they are referred to as "response determination instructions".

以下、図２～９を参照して、システム１の詳細な態様について例示的に説明する。 Detailed aspects of system 1 are described below with reference to Figures 2 to 9.

２機能構成
図２を参照して、本実施形態のシステム１の機能構成について説明する。システム１は、情報処理装置２、端末装置３、ＬＬＭサーバ装置４及び通信ネットワーク５を含む。情報処理装置２、端末装置３及びＬＬＭサーバ装置４は、通信ネットワーク５を介して通信可能に構成されている。 2 Functional Configuration The functional configuration of the system 1 of this embodiment will be described with reference to Fig. 2. The system 1 includes an information processing device 2, a terminal device 3, an LLM server device 4, and a communication network 5. The information processing device 2, the terminal device 3, and the LLM server device 4 are configured to be able to communicate with each other via the communication network 5.

２．１情報処理装置２
情報処理装置２は、ＡＩエージェントによる通話を効率化することに関する処理の少なくとも一部を実行する。一実施形態において、情報処理装置２は、端末装置３をクライアント装置とした場合におけるサーバ装置である。一実施形態において、情報処理装置２は、クラウドサーバ装置である。なお、情報処理装置２は、例えば、仮想的又は物理的な一以上のｗｅｂサーバ装置と、仮想的又は物理的な一以上のデータベースサーバ装置とを含む装置であってよい。 2.1 Information processing device 2
The information processing device 2 executes at least a part of the processing related to improving the efficiency of calls made by an AI agent. In one embodiment, the information processing device 2 is a server device in the case where the terminal device 3 is a client device. In one embodiment, the information processing device 2 is a cloud server device. Note that the information processing device 2 may be, for example, a device including one or more virtual or physical web server devices and one or more virtual or physical database server devices.

情報処理装置２は、制御部１０、記憶部１２、ネットワークインタフェース部１４及びバス１６を備える。制御部１０、記憶部１２及びネットワークインタフェース部１４は、バス１６を介して電気的に接続されている。 The information processing device 2 includes a control unit 10, a memory unit 12, a network interface unit 14, and a bus 16. The control unit 10, the memory unit 12, and the network interface unit 14 are electrically connected via the bus 16.

２．１．１制御部１０
制御部１０は、後述する記憶部１２が記憶する各種プログラムを実行することにより、取得部１００、決定部１０２、判定部１０４及び出力部１０６として機能し得る。 2.1.1 Control unit 10
The control unit 10 can function as an acquisition unit 100, a determination unit 102, a judgment unit 104, and an output unit 106 by executing various programs stored in the storage unit 12, which will be described later.

２．１．１．１取得部１００
取得部１００は、対象者の発話に関する発話情報を取得する。発話情報は、第１の発話に関する第１発話情報と、当該第１の発話より後の第２の発話に関する第２発話情報とを含む。 2.1.1.1 Acquisition unit 100
The acquiring unit 100 acquires speech information related to an utterance of a target person. The speech information includes first utterance information related to a first utterance and second utterance information related to a second utterance subsequent to the first utterance.

一実施形態において、発話情報は、情報処理装置２が端末装置３から受信した音声情報に基づいて対象者の発話を文字起こししたテキストデータを含んでよい。すなわち、第１発話情報は、対象者の発話のうち、第１の発話の部分を文字起こししたテキストデータを含んでよく、第２発話情報は、対象者の発話のうち、第１の発話より後の第２の発話の部分を文字起こししたテキストデータを含んでよい。取得部１００は、当業者の知識に基づいて選択される音声認識プログラムによって、音声情報を文字起こしし得る。 In one embodiment, the speech information may include text data transcribed from the target person's speech based on the audio information received by the information processing device 2 from the terminal device 3. That is, the first speech information may include text data transcribed from a first portion of the target person's speech, and the second speech information may include text data transcribed from a second portion of the target person's speech that follows the first portion. The acquisition unit 100 may transcribe the audio information using a voice recognition program selected based on the knowledge of a person skilled in the art.

一例では、対象者の発話が「えーと、今日の夜７時に２名で予約したいんですけど…あ、名前は○○です。」の場合、音声情報はこの発話に対応する音声データを含むことができ、第１発話情報は、「えーと、今日の夜７時に２名で予約したいんですけど」というテキストデータを含むことができ、第２発話情報は「あ、名前は○○です。」というテキストデータを含むことができる。 In one example, if the subject's utterance is "Um, I'd like to make a reservation for two people at 7pm tonight...oh, my name is ____," the voice information can include voice data corresponding to this utterance, the first utterance information can include text data saying "Um, I'd like to make a reservation for two people at 7pm tonight," and the second utterance information can include text data saying "oh, my name is ____."

なお、対象者の発話のうち、どこまでを第１の発話とし、どこからを第２の発話とするかは、後述する判定部１０４の判定結果に基づいて決定され得る。判定部１０４は、対象者の発話に区切りが生じたか否かを判定する。取得部１００は、対象者の発話のうち、区切りが生じたと判定された時点までの少なくとも一部の発話を第１の発話とし、区切りが生じたと判定された時点以降の少なくとも一部の発話を第２の発話として、第１発話情報及び第２発話情報を取得し得る。上記では、「えーと、今日の夜７時に２名で予約したいんですけど」（第１発話情報に対応）と「あ、名前は○○です。」（第２発話情報に対応）との間に区切りが生じたと判定された場合について例示した。 It should be noted that the extent of the target person's speech to be the first utterance and the extent to be the second utterance may be determined based on the determination result of the determination unit 104 described later. The determination unit 104 determines whether or not a break has occurred in the target person's speech. The acquisition unit 100 may acquire first utterance information and second utterance information by treating at least a portion of the target person's speech up to the point at which it is determined that a break has occurred as the first utterance and at least a portion of the utterance after the point at which it is determined that a break has occurred as the second utterance. In the above, an example was given of a case in which it was determined that a break occurred between "Um, I'd like to make a reservation for two people at 7pm tonight" (corresponding to the first utterance information) and "Oh, my name is ____" (corresponding to the second utterance information).

一実施形態において、発話情報は、音声情報そのものを含んでよい。すなわち、第１発話情報は、対象者の発話のうち、第１の発話の部分の音声情報そのものを含んでよく、第２発話情報は、対象者の発話のうち、第１の発話より後の第２の発話の部分の音声情報そのものを含んでよい。 In one embodiment, the speech information may include the audio information itself. That is, the first speech information may include the audio information itself of the first utterance portion of the target person's speech, and the second speech information may include the audio information itself of the second utterance portion of the target person's speech that follows the first utterance.

２．１．１．２決定部１０２
決定部１０２は、第１応答決定部１０２ａ、第２応答決定部１０２ｂ及び相槌決定部１０２ｃを含む。 2.1.1.2 Determination unit 102
The determination unit 102 includes a first response determination unit 102a, a second response determination unit 102b, and a backchannel determination unit 102c.

２．１．１．２．１第１応答決定部１０２ａ
第１応答決定部１０２ａは、第１発話情報を含む第１応答決定指示をＬＬＭに入力することによって、対象者の発話の少なくとも一部に対する応答に関する第１応答情報を決定する。 2.1.1.2.1 First response determining unit 102a
The first response determination unit 102a determines first response information regarding a response to at least a part of the target person's utterance by inputting a first response determination instruction including first utterance information to the LLM.

一実施形態において、第１応答決定指示をＬＬＭに入力することは、第１応答決定指示を含むＨＴＴＰリクエストをＬＬＭサーバ装置４に対して送信することを含み得る。第１応答情報を決定することは、当該ＨＴＴＰリクエストに対するＨＴＴＰレスポンスを取得することを含み得る。 In one embodiment, inputting the first response determination instruction into the LLM may include transmitting an HTTP request including the first response determination instruction to the LLM server device 4. Determining the first response information may include obtaining an HTTP response to the HTTP request.

一実施形態において、第１応答決定指示は、第１発話情報の他にも、例えばシステムプロンプト、及び対象者とＡＩエージェントとの会話履歴（すなわち、第１応答情報を決定するより前の対象者の発話に関する情報、及びそれに対する情報処理装置２による応答に関する情報）等を含み得る。 In one embodiment, the first response determination instruction may include, in addition to the first utterance information, for example, a system prompt and a conversation history between the subject and the AI agent (i.e., information regarding the subject's utterance prior to determining the first response information, and information regarding the response thereto by the information processing device 2), etc.

一実施形態において、第１応答決定部１０２ａは、対象者の発話に第１の区切りが生じたと判定部１０４（後述）によって判定された場合に第１応答決定指示をＬＬＭに入力する。一例では、対象者の発話が「えーと、今日の夜７時に２名で予約したいんですけど…あ、名前は○○です。」の場合、判定部１０４は、「今日の夜７時に２名で予約したいんですけど」と「あ、名前は○○です」の間で第１の区切りが生じたと判定し得る。第１応答決定部１０２ａは、第１の区切りが生じたと判定された時点で第１応答決定指示をＬＬＭサーバ装置４に送信し、第１応答情報を決定し得る。 In one embodiment, the first response determination unit 102a inputs a first response determination instruction to the LLM when the determination unit 104 (described later) determines that a first division has occurred in the target person's speech. In one example, when the target person's speech is "Um, I'd like to make a reservation for two people at 7 p.m. tonight...oh, my name is ____," the determination unit 104 may determine that a first division has occurred between "I'd like to make a reservation for two people at 7 p.m. tonight" and "oh, my name is ____." The first response determination unit 102a may transmit a first response determination instruction to the LLM server device 4 at the time it is determined that a first division has occurred, and determine first response information.

一実施形態において、第１応答決定部１０２ａは、第１応答情報を決定する際に、当該第１応答情報を音声により出力するための情報をさらに決定してよい。応答情報を音声により出力するための情報は、当業者に任意に選択できる音声生成プログラムに基づいて生成され得る。なお、「応答情報を決定する際」は、応答情報を決定する直前であってよく、応答情報を決定する処理と並行であってよく、応答情報を決定する処理と一体的及び／又は連続的に行われることであってよく、応答情報を決定する処理の直後であってもよい。 In one embodiment, the first response determination unit 102a may further determine information for outputting the first response information by voice when determining the first response information. The information for outputting the response information by voice may be generated based on a voice generation program that can be arbitrarily selected by a person skilled in the art. Note that "when determining the response information" may be immediately before determining the response information, may be in parallel with the process of determining the response information, may be performed integrally and/or consecutively with the process of determining the response information, or may be immediately after the process of determining the response information.

一実施形態において、記憶部１２は一以上のテキストのそれぞれを音声により出力するための情報を記憶し、第１応答情報が当該一以上のテキストの少なくとも１つと整合する場合には、当該第１応答情報を音声により出力するための情報を決定することは、当該整合するテキストを音声により出力するための情報を記憶部１２から取得することを含み、第１応答情報が一以上のテキストのいずれとも整合しない場合には、当該第１応答情報を音声により出力するための情報を決定することは、所定の音声生成プログラムに基づいて当該第１応答情報を音声により出力するための情報を生成することを含む。すなわち、応答情報が典型的には”よくある"単語及び／又はフレーズを含む場合には、第１応答決定部１０２ａは、予め用意された当該単語及び／又はフレーズの音声を決定し得る。一例では、記憶部１２が「ご予約承りました」というテキストを音声により出力するための情報を記憶する場合において、第１応答情報に「ご予約承りました」というテキストが含まれる場合には、第１応答情報のその部分を音声により出力するための情報は、記憶部１２から取得される。他の一例では、記憶部１２が上記同様「ご予約承りました」というテキストを音声により出力するための情報を記憶する場合において、第１応答情報に「ご予約承りました」というテキストが含まれない場合（すなわち、記憶部１２からそれに対応する音声を出力するための情報を取得することができない場合）には、第１応答決定部１０２ａは、所定の音声生成プログラムにより「ご予約承りました」を音声により出力するための情報を生成する。この構成によれば、第１応答情報に含まれるテキストの一部が、予め用意された音声により出力され得るため、対象者が感じるレイテンシが抑制され得る。 In one embodiment, the storage unit 12 stores information for outputting each of one or more texts by voice, and when the first response information matches at least one of the one or more texts, determining information for outputting the first response information by voice includes acquiring information for outputting the matching text by voice from the storage unit 12, and when the first response information does not match any of the one or more texts, determining information for outputting the first response information by voice includes generating information for outputting the first response information by voice based on a predetermined voice generation program. That is, when the response information typically includes "common" words and/or phrases, the first response determination unit 102a may determine the voice of the words and/or phrases prepared in advance. In one example, when the storage unit 12 stores information for outputting the text "We have accepted your reservation" by voice, when the first response information includes the text "We have accepted your reservation", information for outputting that part of the first response information by voice is acquired from the storage unit 12. In another example, when the storage unit 12 stores information for outputting the text "Your reservation has been accepted" by voice as described above, if the first response information does not include the text "Your reservation has been accepted" (i.e., if information for outputting the corresponding voice cannot be obtained from the storage unit 12), the first response determination unit 102a generates information for outputting "Your reservation has been accepted" by voice using a predetermined voice generation program. With this configuration, a portion of the text included in the first response information can be output by a voice prepared in advance, thereby suppressing the latency felt by the subject.

２．１．１．２．２第２応答決定部１０２ｂ
第２応答決定部１０２ｂは、対象者の発話に関する所定の条件が満たされる場合（一例では、対象者の発話が継続されると判定される場合）に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示をＬＬＭに入力することによって、発話の少なくとも一部に対する他の応答に関する第２応答情報を決定する。 2.1.1.2.2 Second response determining unit 102b
When a predetermined condition regarding the target's speech is satisfied (in one example, when it is determined that the target's speech will continue), the second response determination unit 102b determines second response information regarding another response to at least a part of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information to the LLM before outputting the first response information to the target.

一実施形態において、第２応答決定指示をＬＬＭに入力することは、第１応答決定指示をＬＬＭに入力することと同様に、第２応答決定指示を含むＨＴＴＰリクエストをＬＬＭサーバ装置４に対して送信することを含み得る。第２応答情報を決定することは、当該ＨＴＴＰリクエストに対するＨＴＴＰレスポンスを取得することを含み得る。 In one embodiment, inputting the second response determination instruction to the LLM may include transmitting an HTTP request including the second response determination instruction to the LLM server device 4, similar to inputting the first response determination instruction to the LLM. Determining the second response information may include obtaining an HTTP response to the HTTP request.

一実施形態において、第２応答決定指示は、第１応答決定指示と同様に、第１応答情報及び第２発話情報の他にも、例えばシステムプロンプト、及び対象者とＡＩエージェントとの会話履歴（すなわち、第２応答情報を決定するより前の対象者の発話に関する情報、及びそれに対する情報処理装置２による応答に関する情報）等を含み得る。 In one embodiment, the second response determination instruction, like the first response determination instruction, may include, in addition to the first response information and the second utterance information, for example, a system prompt and a conversation history between the subject and the AI agent (i.e., information about the subject's utterance prior to determining the second response information, and information about the response thereto by the information processing device 2), etc.

図１を参照して例示したように、第１応答情報は暫定的な応答であり得る。第２応答決定部１０２ｂは、対象者の発話に関する所定の条件が満たされる場合には、第１応答情報を対象者に出力する前に（一例ではそのような暫定的な応答を出力することなく）、第２応答情報を決定し得る。一実施形態において、第１応答情報を対象者に出力する前の期間は、第１応答情報が端末装置３において音声等により出力される前の期間であり得る。 As illustrated with reference to FIG. 1, the first response information may be a tentative response. If a predetermined condition regarding the target person's speech is satisfied, the second response determination unit 102b may determine the second response information before outputting the first response information to the target person (in one example, without outputting such a tentative response). In one embodiment, the period before the first response information is output to the target person may be the period before the first response information is output by voice or the like on the terminal device 3.

一実施形態において、第２応答決定部１０２ｂは、第２応答情報を決定する際に、当該第２応答情報を音声により出力するための情報をさらに決定してよい。 In one embodiment, when determining the second response information, the second response determination unit 102b may further determine information for outputting the second response information by voice.

一実施形態において、記憶部１２は一以上のテキストのそれぞれを音声により出力するための情報を記憶し、第２応答情報が当該一以上のテキストの少なくとも１つと整合する場合には、当該第２応答情報を音声により出力するための情報を決定することは、当該整合するテキストを音声により出力するための情報を記憶部１２から取得することを含み、第２応答情報が一以上のテキストのいずれとも整合しない場合には、当該第２応答情報を音声により出力するための情報を決定することは、所定の音声生成プログラムに基づいて当該第２応答情報を音声により出力するための情報を生成することを含む。 In one embodiment, the storage unit 12 stores information for outputting each of the one or more texts by voice, and when the second response information is consistent with at least one of the one or more texts, determining information for outputting the second response information by voice includes obtaining information for outputting the matching text by voice from the storage unit 12, and when the second response information is not consistent with any of the one or more texts, determining information for outputting the second response information by voice includes generating information for outputting the second response information by voice based on a predetermined voice generation program.

一実施形態において、第２応答決定部１０２ｂは、発話に第２の区切りが生じたと判定部１０４によって判定された場合に第２応答決定指示をＬＬＭに入力する。一例では、対象者の発話が「えーと、今日の夜７時に２名で予約したいんですけど…あ、名前は○○です。」の場合、判定部１０４は、上述したとおり「今日の夜７時に２名で予約したいんですけど」と「あ、名前は○○です」の間で第１の区切りが生じたと判定し、さらに「あ、名前は○○です。」の後に第２の区切りが生じたと判定し得る。第２応答決定部１０２ｂは、第２の区切りが生じたと判定された時点で第２応答決定指示をＬＬＭサーバ装置４に送信し、第２応答情報を決定し得る。なお、第１発話情報を取得した後、対象者の発話に関する所定の条件が満たされない場合（一例では、対象者の発話が継続されないと判定される場合）には、第１応答情報が出力され得るため、第２応答決定部１０２ｂは第２応答情報を決定しなくてよい。 In one embodiment, the second response determination unit 102b inputs a second response determination instruction to the LLM when the determination unit 104 determines that a second division has occurred in the utterance. In one example, when the target person's utterance is "Um, I'd like to make a reservation for two people at 7 p.m. tonight...oh, my name is ____," the determination unit 104 may determine that a first division has occurred between "I'd like to make a reservation for two people at 7 p.m. tonight" and "oh, my name is ____," as described above, and may further determine that a second division has occurred after "oh, my name is ____." The second response determination unit 102b may transmit a second response determination instruction to the LLM server device 4 at the time when it is determined that a second division has occurred, and determine second response information. Note that if a predetermined condition regarding the target person's speech is not satisfied after the first speech information is acquired (in one example, if it is determined that the target person's speech will not continue), the first response information may be output, and the second response determination unit 102b does not need to determine the second response information.

２．１．１．２．３相槌決定部１０２ｃ
相槌決定部１０２ｃは、第１応答情報を決定する際に、第１発話情報に対応する相槌に関する相槌情報を決定する。なお、「第１応答情報を決定する際」は、第１応答情報を決定する前であってよく、第１応答情報を決定する処理の実行中であってよく、第１応答情報を決定した後であってもよい。 2.1.1.2.3 Backchannel determination unit 102c
The backchannel determining unit 102c determines backchannel information related to the backchannel corresponding to the first utterance information when determining the first response information. Note that "when determining the first response information" may be before the first response information is determined, may be during the process of determining the first response information, or may be after the first response information is determined.

一実施形態において、相槌決定部１０２ｃは、第１発話情報が対象者の肯定的な発話に関する情報を含むか、否定的な発話に関する情報を含むかに基づいて相槌情報を決定し得る。一例では、第１発話情報が「はい、可能です」というような肯定的な発話に関する情報を含む場合、相槌決定部１０２ｃは、「ありがとうございます」というテキストデータを含む相槌情報を決定し得る。他の一例では、第１発話情報が「申し訳ございません」というような否定的な発話に関する情報を含む場合、相槌決定部１０２ｃは、「かしこまりました」というテキストデータを含む相槌情報を決定し得る。この場合において、相槌決定部１０２ｃは、例えば、肯定的な発話に含まれる可能性のある単語のリスト、及び否定的な発話に含まれる単語のリスト等を参照し、いずれかのリストに対応する相槌情報を決定してよい。 In one embodiment, the backchannel determination unit 102c may determine the backchannel information based on whether the first utterance information includes information about a positive utterance by the target person or information about a negative utterance. In one example, if the first utterance information includes information about a positive utterance such as "Yes, it's possible," the backchannel determination unit 102c may determine the backchannel information including text data "Thank you." In another example, if the first utterance information includes information about a negative utterance such as "I'm sorry," the backchannel determination unit 102c may determine the backchannel information including text data "I understand." In this case, the backchannel determination unit 102c may refer to, for example, a list of words that may be included in a positive utterance and a list of words that may be included in a negative utterance, and determine the backchannel information corresponding to either list.

一実施形態において、相槌決定部１０２ｃは、第１発話情報に含まれる質問の形式に基づいて相槌情報を決定し得る。一例では、第１発話情報が「～してもいいですか？」というようなＹＥＳ／ＮＯで回答できる質問や、「ＡとＢ、どちらにしたらいいですか？」というようなクローズドな形式で回答できる質問を含む場合には、相槌決定部１０２ｃは「ご確認ありがとうございます」というテキストデータを含む相槌情報を決定し得る。他の一例では、第１発話情報が「もう一度言ってください」というようなオープンな形式で回答できる質問を含む場合には、相槌決定部１０２ｃは「かしこまりました」というテキストデータを含む相槌情報を決定し得る。この場合において、相槌決定部１０２ｃは、当業者の知識に基づいて選択される自然言語処理アルゴリズムに基づいて、第１発話情報に含まれる質問の形式を判定し、当該判定結果に基づいて相槌情報を決定してよい。 In one embodiment, the backchannel determination unit 102c may determine the backchannel information based on the format of the question included in the first utterance information. In one example, when the first utterance information includes a question that can be answered with YES/NO, such as "May I do ~?", or a question that can be answered in a closed format, such as "Which should I choose, A or B?", the backchannel determination unit 102c may determine the backchannel information including the text data "Thank you for checking." In another example, when the first utterance information includes a question that can be answered in an open format, such as "Please say it again," the backchannel determination unit 102c may determine the backchannel information including the text data "I understand." In this case, the backchannel determination unit 102c may determine the format of the question included in the first utterance information based on a natural language processing algorithm selected based on the knowledge of a person skilled in the art, and determine the backchannel information based on the determination result.

一実施形態において、相槌決定部１０２ｃは、相槌情報を決定する際に、当該相槌情報を音声により出力するための情報をさらに決定してよい。 In one embodiment, when determining the backchannel information, the backchannel determination unit 102c may further determine information for outputting the backchannel information by voice.

一実施形態において、記憶部１２は一以上のテキストのそれぞれを音声により出力するための情報を記憶し、相槌情報が当該一以上のテキストの少なくとも１つと整合する場合には、当該相槌情報を音声により出力するための情報を決定することは、当該整合するテキストを音声により出力するための情報を記憶部１２から取得することを含み、相槌情報が一以上のテキストのいずれとも整合しない場合には、当該相槌情報を音声により出力するための情報を決定することは、所定の音声生成プログラムに基づいて当該相槌情報を音声により出力するための情報を生成することを含む。 In one embodiment, the storage unit 12 stores information for outputting each of the one or more texts by voice, and when the backchannel information matches at least one of the one or more texts, determining information for outputting the backchannel information by voice includes obtaining information for outputting the matching text by voice from the storage unit 12, and when the backchannel information does not match any of the one or more texts, determining information for outputting the backchannel information by voice includes generating information for outputting the backchannel information by voice based on a predetermined voice generation program.

一実施形態において、相槌決定部１０２ｃは、第１発話情報、及び／又は当該第１発話情報に対応する発話の音声に基づいて、対象者の発話における感情を推定し、当該推定された感情にさらに基づいて相槌情報を決定する。一例として、第１発話情報に対応する発話が「これはなぜできないのですか？」という音声である場合を想定する。このとき、当該音声に基づいて対象者が怒っていることが推定される場合には、相槌決定部１０２ｃは、「申し訳ございません」という相槌を決定し得る。これに対して、当該音声に基づいて対象者が怒っていないと推定される場合には、相槌決定部１０２ｃは、「ご質問ありがとうございます」という相槌を決定し得る。この構成によれば、対象者にとってより自然な相槌が出力され得る。なお、第１発話情報、及び／又は当該第１発話情報に対応する発話の音声に基づく対象者の発話における感情の推定は、当業者が任意に選択し得る感情推定アルゴリズムに基づいて実行され得る。 In one embodiment, the backchannel determination unit 102c estimates the emotion of the target's speech based on the first speech information and/or the voice of the speech corresponding to the first speech information, and determines the backchannel information based on the estimated emotion. As an example, assume that the speech corresponding to the first speech information is a voice saying "Why can't you do this?" In this case, if it is estimated that the target is angry based on the voice, the backchannel determination unit 102c may determine the backchannel saying "I'm sorry." In contrast, if it is estimated that the target is not angry based on the voice, the backchannel determination unit 102c may determine the backchannel saying "Thank you for your question." With this configuration, a backchannel that is more natural for the target can be output. Note that the estimation of the emotion of the target's speech based on the first speech information and/or the voice of the speech corresponding to the first speech information can be performed based on an emotion estimation algorithm that can be arbitrarily selected by a person skilled in the art.

一実施形態において、相槌決定部１０２ｃは、第１発話情報を含む相槌決定指示をＬＬＭに入力することによって相槌情報を決定する。 In one embodiment, the backchannel determination unit 102c determines the backchannel information by inputting a backchannel determination instruction including the first speech information to the LLM.

一実施形態において、相槌決定指示をＬＬＭに入力することは、応答決定指示をＬＬＭに入力することと同様に、相槌決定指示を含むＨＴＴＰリクエストをＬＬＭサーバ装置４に対して送信することを含み得る。相槌情報を決定することは、当該ＨＴＴＰリクエストに対するＨＴＴＰレスポンスを取得することを含み得る。 In one embodiment, inputting a backchannel determination instruction to the LLM may include transmitting an HTTP request including the backchannel determination instruction to the LLM server device 4, similar to inputting a response determination instruction to the LLM. Determining backchannel information may include obtaining an HTTP response to the HTTP request.

一実施形態において、相槌決定指示は、応答決定指示と同様に、第１発話情報の他にも、例えばシステムプロンプト等を含み得る。 In one embodiment, the backchannel decision instruction, like the response decision instruction, may include, in addition to the first speech information, for example, a system prompt.

一実施形態において、相槌決定部１０２ｃは、第２応答情報を決定する際に、第２発話情報に対応する他の相槌に関する他の相槌情報を決定する。この際、相槌決定部１０２ｃは、第２発話情報を含む相槌決定指示をＬＬＭに入力することによって当該他の相槌情報を決定してよい。 In one embodiment, when determining the second response information, the backchannel determination unit 102c determines other backchannel information related to another backchannel corresponding to the second utterance information. At this time, the backchannel determination unit 102c may determine the other backchannel information by inputting a backchannel determination instruction including the second utterance information to the LLM.

２．１．１．３判定部１０４
判定部１０４は、発話に第１の区切りが生じた否かを判定する。第１発話情報は、対象者の発話のうち、第１の区切りまでの部分に関する情報を含む。 2.1.1.3 Determination unit 104
The determination unit 104 determines whether a first division has occurred in the utterance. The first utterance information includes information on the part of the utterance of the target person up to the first division.

一実施形態において、判定部１０４は、発話の第１の区切りの後に第２の区切りが生じたか否かをさらに判定する。第２発話情報は、対象者の発話のうち、第２の区切りまでの部分に関する情報を含む。 In one embodiment, the determination unit 104 further determines whether a second division occurs after the first division of the utterance. The second utterance information includes information about the portion of the target person's utterance up to the second division.

一実施形態において、判定部１０４は、対象者の発話の音声上の連続性に基づいて、発話に区切りが生じたか否かを判定する。一例では、判定部１０４は、音声情報に含まれる複数の音声チャンク（例えば、音声情報を０．１秒毎に分割したデータ）のそれぞれに基づいて対象者が発声中であるか否かを逐次的に判定し、連続した所定数の音声チャンクにおいて対象者が発声中ではないと判定された場合に、そこに区切りが生じたと判定し得る。 In one embodiment, the determination unit 104 determines whether a break has occurred in the speech based on the audio continuity of the target's speech. In one example, the determination unit 104 sequentially determines whether the target is speaking based on each of a plurality of audio chunks (e.g., data obtained by dividing the audio information every 0.1 seconds) included in the audio information, and may determine that a break has occurred when it is determined that the target is not speaking in a predetermined number of consecutive audio chunks.

一実施形態において、判定部１０４は、対象者の発話の意味上の連続性に基づいて、発話に区切りが生じたか否かを判定する。一例では、発話情報が音声情報を文字起こししたテキストデータを含む場合において、判定部１０４は、当該テキストデータを、第１のテーマに関する部分と、第２のテーマに関する部分に分割し、これらの間に区切りが生じたと判定し得る。例えば、発話情報が「明日の夜７時に、４人で予約したいんですけど、コースに飲み放題はつけられますか？」というテキストデータを含む場合において、判定部１０４は、「明日の夜７時に、４人で予約したいんですけど」という第１のテーマ（この例では、予約の可否の質問）に関する部分と、「コースに飲み放題はつけられますか？」という第２のテーマ（この例では、コースの内容の質問）に関する部分とに分割し、その間に区切りが生じたと判定し得る。 In one embodiment, the determination unit 104 determines whether a break has occurred in the speech based on the semantic continuity of the target person's speech. In one example, when the speech information includes text data obtained by transcribing audio information, the determination unit 104 may divide the text data into a portion related to a first theme and a portion related to a second theme, and determine that a break has occurred between them. For example, when the speech information includes text data such as "I would like to make a reservation for four people at 7 p.m. tomorrow night. Can I add all-you-can-drink to the course?", the determination unit 104 may divide the text data into a portion related to the first theme of "I would like to make a reservation for four people at 7 p.m. tomorrow night" (in this example, a question about whether a reservation can be made) and a portion related to the second theme of "Can I add all-you-can-drink to the course?" (in this example, a question about the content of the course), and determine that a break has occurred between them.

一実施形態において、判定部１０４は、発話の第１の区切り及び第２の区切りの少なくとも一方の後に、応答の出力に関する所定の時間が経過したか否かをさらに判定する。一実施形態において、所定の時間は、相槌情報に基づいて決定される。一例では、所定の時間は、相槌情報を音声により出力する場合における再生時間に基づいて決定される。例えば、相槌情報が「かしこまりました」というテキストデータを含み、これを音声により出力する場合には０．８秒かかるとした場合には、所定の時間は、０．８秒（あるいは、これに対してバッファとして０．１秒程度を加えた秒数）であり得る。 In one embodiment, the determination unit 104 further determines whether a predetermined time for outputting a response has elapsed after at least one of the first and second segments of the utterance. In one embodiment, the predetermined time is determined based on the backchannel information. In one example, the predetermined time is determined based on the playback time when the backchannel information is output by voice. For example, if the backchannel information includes text data such as "Understood," and it takes 0.8 seconds to output this by voice, the predetermined time may be 0.8 seconds (or a number of seconds obtained by adding about 0.1 seconds to this as a buffer).

２．１．１．４出力部１０６
出力部１０６は、対象者の発話に関する所定の条件が満たされない場合（例えば、発話が継続する場合）には第１応答情報を出力し、当該所定の条件が満たされる場合（例えば、発話が終了する場合）には第１応答情報及び第２応答情報の少なくとも一方を出力する。応答情報を出力することは、応答情報を出力するための情報を端末装置３に対して送信することを含む。 2.1.1.4 Output Unit 106
The output unit 106 outputs the first response information when a predetermined condition regarding the target person's speech is not satisfied (for example, when the speech continues), and outputs at least one of the first response information and the second response information when the predetermined condition is satisfied (for example, when the speech ends). Outputting the response information includes transmitting information for outputting the response information to the terminal device 3.

一実施形態において、出力部１０６は、対象者の端末装置３が応答情報を音声により出力するように制御する。この際、出力部１０６は、応答情報に対応する音声を出力するための情報を端末装置３に対して送信し得る。応答情報に対応する音声を出力するための情報は、当業者の知識に基づいて選択される音声生成プログラムに基づいて生成され得る。 In one embodiment, the output unit 106 controls the subject's terminal device 3 to output the response information by voice. At this time, the output unit 106 may transmit information for outputting a voice corresponding to the response information to the terminal device 3. The information for outputting a voice corresponding to the response information may be generated based on a voice generation program selected based on the knowledge of a person skilled in the art.

一実施形態において、出力部１０６は、相槌情報をさらに出力する。一実施形態において、出力部１０６は、対象者の端末装置３が相槌情報を音声により出力するように制御する。この際、出力部１０６は、相槌情報に対応する音声を出力するための情報を端末装置３に対して送信し得る。相槌情報に対応する音声を出力するための情報は、当業者の知識に基づいて選択される音声生成プログラムに基づいて生成され得る。 In one embodiment, the output unit 106 further outputs backchannel information. In one embodiment, the output unit 106 controls the terminal device 3 of the subject to output the backchannel information by voice. At this time, the output unit 106 may transmit information for outputting a voice corresponding to the backchannel information to the terminal device 3. The information for outputting a voice corresponding to the backchannel information may be generated based on a voice generation program selected based on the knowledge of a person skilled in the art.

一実施形態において、出力部１０６は、相槌情報を出力した後に第１応答情報及び第２応答情報の少なくとも一方を出力する。相槌情報を出力した後に第１応答情報及び第２応答情報の少なくとも一方を出力することは、相槌情報が端末装置３において音声により出力された後で当該応答情報が端末装置３において音声により出力されるように端末装置３を制御することを含む。 In one embodiment, the output unit 106 outputs at least one of the first response information and the second response information after outputting the backchannel information. Outputting at least one of the first response information and the second response information after outputting the backchannel information includes controlling the terminal device 3 so that the backchannel information is output by voice on the terminal device 3 after the response information is output by voice on the terminal device 3.

一実施形態において、出力部１０６は、第２の区切りの後に所定の時間が経過したと判定部１０４によって判定された場合に第２応答情報を出力する。 In one embodiment, the output unit 106 outputs the second response information when the determination unit 104 determines that a predetermined time has elapsed after the second division.

２．１．２記憶部１２
記憶部１２は、情報処理装置２が動作するための各種情報を記憶する。一実施形態において、記憶部１２は、制御部１０が実行するプログラムを記憶する。 2.1.2 Storage unit 12
The storage unit 12 stores various types of information for the operation of the information processing device 2. In one embodiment, the storage unit 12 stores a program executed by the control unit 10.

２．１．３ネットワークインタフェース部１４
ネットワークインタフェース部１４は、通信ネットワーク５を介した他の装置との通信を実現する。 2.1.3 Network interface unit 14
The network interface unit 14 realizes communication with other devices via the communication network 5 .

２．２端末装置
端末装置３は、対象者が使用する通信用装置である。端末装置３は、例えば、スマートフォン、パーソナルコンピュータ、タブレット端末及びウェアラブル端末等である。端末装置３は、入力インタフェース、出力インタフェース及び通信インタフェースを備える。 2.2 Terminal Device The terminal device 3 is a communication device used by the subject. The terminal device 3 is, for example, a smartphone, a personal computer, a tablet terminal, a wearable terminal, etc. The terminal device 3 includes an input interface, an output interface, and a communication interface.

入力インタフェースは、端末装置３が対象者からの入力を受け付けるためのインタフェースである。入力インタフェースは、タッチパネル、マイク、カメラ、キーボード及びマウス等であってよい。 The input interface is an interface through which the terminal device 3 receives input from the subject. The input interface may be a touch panel, a microphone, a camera, a keyboard, a mouse, etc.

出力インタフェースは、画像及び音声等により情報を対象者に対して伝達するためのインタフェースである。出力インタフェースは、ディスプレイ（タッチパネルを兼ねる場合がある）及びスピーカー等である。 The output interface is an interface for transmitting information to the target person through images, audio, etc. Output interfaces include a display (which may also serve as a touch panel) and a speaker, etc.

通信インタフェースは、通信ネットワーク５を介した他の装置との通信を実現するためのインタフェースである。通信インタフェースは、無線通信インタフェースであってよく、有線通信インタフェースであってもよい。 The communication interface is an interface for realizing communication with other devices via the communication network 5. The communication interface may be a wireless communication interface or a wired communication interface.

端末装置３は、例えばｗｅｂブラウザを介して情報処理装置２が提供するサービスにアクセスすることができてよく、専用のソフトウェアをインストールすることによって当該サービスにアクセスすることができてもよい。 The terminal device 3 may be able to access the services provided by the information processing device 2, for example, via a web browser, or may be able to access the services by installing dedicated software.

２．３ＬＬＭサーバ装置４
ＬＬＭサーバ装置４は、ＬＬＭによるサービスを提供する装置である。ＬＬＭは、数億以上のパラメータを有し、数百ＧＢ以上の自然言語に関するデータを学習した深層学習モデルであってよい。ＬＬＭは、例えば、ｇｐｔ－４о等である。一例では、ＬＬＭサーバ装置４は、ＡＰＩ（ＡｐｐｌｉｃａｔｉｏｎＰｒｏｇｒａｍｍｉｎｇＩｎｔｅｒｆａｃｅ）を介してＬＬＭを用いるサービスを提供する。 2.3 LLM server device 4
The LLM server device 4 is a device that provides services using the LLM. The LLM may be a deep learning model that has hundreds of millions of parameters and has learned data related to natural languages of hundreds of GB or more. The LLM is, for example, gpt-4o. In one example, the LLM server device 4 provides a service using the LLM via an API (Application Programming Interface).

一実施形態において、ＬＬＭサーバ装置４は、指示（プロンプトということもできる）の入力を他の装置から受け付け、当該指示に沿った応答を当該他の装置に返す。一例では、指示及び応答は、いずれもテキストである。 In one embodiment, the LLM server device 4 accepts input of instructions (which may also be called prompts) from other devices and returns a response to the other devices according to the instructions. In one example, both the instructions and the response are text.

２．４通信ネットワーク５
通信ネットワーク５は、システム１に含まれる各装置間の通信を実現する。通信ネットワーク５は、例えば、ＴＣＰ／ＩＰプロトコルに基づいて各装置間の通信を実現する。 2.4 Communication Network 5
The communication network 5 realizes communication between each device included in the system 1. The communication network 5 realizes communication between each device based on, for example, the TCP/IP protocol.

３動作
図３～８を参照して、システム１の動作の一例について説明する。 3. Operation An example of the operation of the system 1 will now be described with reference to FIGS.

３．１第１実施形態
図３～５を参照して、第１実施形態に係るシステム１の動作について説明する。第１実施形態では、システム１の基本的な態様について例示的に説明する。 3 to 5, the operation of the system 1 according to the first embodiment will be described. In the first embodiment, a basic aspect of the system 1 will be exemplarily described.

３．１．１フローチャート
図３は、第１実施形態に係る情報処理装置２の動作について説明するためのフローチャートである。図３のフローチャートにはステップＳ１００～Ｓ１１２が記載されているが、情報処理装置２は、これらの処理と並行して、音声チャンクの取得、当該音声チャンクに基づく発声中か否かの判定、及び音声情報の文字起こしを継続的・逐次的に実行する。この処理は、取得部１００が対象者の発話に関する発話情報を取得することの一例である。 3.1.1 Flowchart Fig. 3 is a flowchart for explaining the operation of the information processing device 2 according to the first embodiment. The flowchart in Fig. 3 describes steps S100 to S112, but in parallel with these processes, the information processing device 2 continuously and sequentially executes acquisition of voice chunks, determination of whether or not speech is being generated based on the voice chunks, and transcription of voice information. This process is an example of the acquisition unit 100 acquiring speech information related to the speech of the subject.

情報処理装置２は、その時点で対象者による発話に区切りが生じているか否かを判定する（Ｓ１００）。一例では、情報処理装置２は、発声なしと判定された音声チャンクがその時点において所定数連続しているか否かに基づいてこの判定を実行する。この処理は、判定部１０４が、対象者の発話に第１の区切りが生じた否かを判定することの一例である。 The information processing device 2 determines whether or not a break has occurred in the speech of the subject at that time (S100). In one example, the information processing device 2 performs this determination based on whether or not a predetermined number of audio chunks determined to be no speech occur consecutively at that time. This process is an example of the determination unit 104 determining whether or not a first break has occurred in the speech of the subject.

発話に区切りが生じていない場合（Ｓ１００ＮＯ）には、情報処理装置２は発話に区切りが生じるまで待機する（Ｓ１０２）。なお、上述したとおり、待機中にも情報処理装置２は音声チャンクの取得、当該音声チャンクに基づく発声中か否かの判定、及び音声情報の文字起こしを継続的・逐次的に実行する。 If there is no break in the speech (S100 NO), the information processing device 2 waits until there is a break in the speech (S102). As described above, even while waiting, the information processing device 2 continuously and sequentially acquires voice chunks, determines whether or not speech is being generated based on the voice chunks, and transcribes the voice information.

これに対して、発話に区切りが生じている場合（Ｓ１００ＹＥＳ）には、情報処理装置２は、応答決定処理を実行する。この時点での応答決定処理は、以下の（１）～（３）の処理を含み得る。
（１）第１発話情報（一例では、その時点までに得られた文字起こしのテキストデータ）を含む第１応答決定指示をＬＬＭサーバ装置４に送信すること。
（２）ＬＬＭサーバ装置４から第１応答情報を取得すること。
（３）第１応答情報に対応する音声を出力するための情報を生成すること。 On the other hand, if a break has occurred in the speech (S100 YES), the information processing device 2 executes a response determination process. The response determination process at this point may include the following processes (1) to (3).
(1) Sending a first response determination instruction to the LLM server device 4, including first speech information (in one example, the transcribed text data obtained up to that point).
(2) Obtaining first response information from the LLM server device 4.
(3) Generating information for outputting a voice corresponding to the first response information.

この時点までに得られた文字起こしのテキストデータは、対象者の発話のうち、第１の区切りまでの部分に関する情報の一例である。 The transcribed text data obtained up to this point is an example of information about the portion of the subject's speech up to the first segment.

また、この時点での応答決定処理は、第１応答決定部１０２ａが、第１発話情報を含む第１応答決定指示をＬＬＭに入力することによって、対象者の発話の少なくとも一部に対する応答に関する第１応答情報を決定することの一例である。 The response determination process at this point is also an example of the first response determination unit 102a determining first response information regarding a response to at least a portion of the target person's utterance by inputting a first response determination instruction including first utterance information to the LLM.

また、この時点での応答決定処理は、第１応答決定部１０２ａが、対象者の発話に第１の区切りが生じたと判定部１０４によって判定された場合に第１応答決定指示をＬＬＭに入力することの一例である。 The response determination process at this point is an example of the first response determination unit 102a inputting a first response determination instruction to the LLM when the determination unit 104 determines that a first break has occurred in the target person's speech.

次に、情報処理装置２は、発話が終了したか否かを判定する（Ｓ１１０）。一例では、情報処理装置２は、その時点まで発声なしと判定された音声チャンクが、ステップＳ１００の時点からさらに所定数連続しているか否かに基づいてこの判定を実行する。 Next, the information processing device 2 determines whether or not the speech has ended (S110). In one example, the information processing device 2 performs this determination based on whether or not a predetermined number of speech chunks that have been determined to be unspoken up until that point continue from the point in step S100.

発話が終了したと判定される場合（Ｓ１１０ＹＥＳ）、後述するように、情報処理装置２は、第１応答情報を出力し得る。これに対して、発話が終了したと判定されない場合、すなわち、対象者による発話が継続している場合（Ｓ１１０ＮＯ）には、情報処理装置２は次に発話に区切りが生じるまで待機する（Ｓ１０２）。そして、情報処理装置２は、以降のループで発話に区切りが生じた際に（Ｓ１００ＹＥＳ）、応答決定処理を再度実行する。この時点での応答決定処理は、以下の（１）～（３）の処理を含み得る。
（１）第１発話情報及び／又は第１応答情報と、第２発話情報（一例では、その時点までに得られた文字起こしのテキストデータ）を含む第２応答決定指示をＬＬＭサーバ装置４に送信すること。
（２）ＬＬＭサーバ装置４から第２応答情報を取得すること。
（３）第２応答情報に対応する音声を出力するための情報を生成すること。 When it is determined that the speech has ended (S110 YES), the information processing device 2 may output first response information, as described below. On the other hand, when it is determined that the speech has not ended, that is, when the target person's speech is continuing (S110 NO), the information processing device 2 waits until the next break in the speech occurs (S102). Then, when a break in the speech occurs in the subsequent loop (S100 YES), the information processing device 2 executes the response determination process again. The response determination process at this point may include the following processes (1) to (3).
(1) Sending a second response determination instruction to the LLM server device 4, the second response determination instruction including the first utterance information and/or the first response information and the second utterance information (in one example, the transcribed text data obtained up to that point).
(2) Obtaining second response information from the LLM server device 4.
(3) Generating information for outputting a voice corresponding to the second response information.

２回目のステップＳ１００は、判定部１０４が、対象者の発話の第１の区切りの後に第２の区切りが生じたか否かをさらに判定することの一例である。 The second step S100 is an example of the determination unit 104 further determining whether or not a second segment has occurred after a first segment of the target person's speech.

この時点までに得られた文字起こしのテキストデータは、対象者の発話のうち、第２の区切りまでの部分に関する情報の一例である。 The transcribed text data obtained up to this point is an example of information about the subject's speech up to the second segment.

また、この応答決定処理は、第２応答決定部１０２ｂが、対象者の発話に関する所定の条件が満たされる場合に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示をＬＬＭに入力することによって、対象者の発話の少なくとも一部に対する他の応答に関する第２応答情報を決定することの一例である。 This response determination process is also an example of the second response determination unit 102b determining second response information related to another response to at least a portion of the target's utterance when a predetermined condition related to the target's utterance is satisfied, by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information to the LLM before outputting the first response information to the target.

また、この応答決定処理は、第２応答決定部１０２ｂが、対象者の発話に第２の区切りが生じたと判定部１０４によって判定された場合に第２応答決定指示をＬＬＭに入力することの一例である。 This response determination process is also an example of the second response determination unit 102b inputting a second response determination instruction to the LLM when the determination unit 104 determines that a second break has occurred in the target person's speech.

情報処理装置２は、ステップＳ１００～ステップＳ１１０のループを、対象者の発話が継続している間繰り返し実行する。そして、ステップＳ１０４では、応答決定処理によりその都度応答情報が更新される。 The information processing device 2 repeatedly executes the loop of steps S100 to S110 while the target person continues speaking. Then, in step S104, the response information is updated each time by the response determination process.

２回目のステップＳ１００の直後のステップＳ１１０は、判定部１０４が、対象者の発話の第２の区切りの後に、応答の出力に関する所定の時間が経過したか否かをさらに判定することの一例である。 Step S110 immediately following step S100 for the second time is an example in which the determination unit 104 further determines whether or not a predetermined time for outputting a response has elapsed after the second division of the subject's speech.

その後、発話が終了したと判定される場合（Ｓ１１０ＹＥＳ）には、情報処理装置２は、直近のステップＳ１０４の応答決定処理において決定した応答情報に対応する音声を出力するための情報を端末装置３に送信する（Ｓ１１２）。「直近のステップＳ１０４の応答決定処理において決定した応答情報」とは、例えば、ステップＳ１１０において一度もＮＯと判定されなかった場合には初回の応答決定処理において決定された応答情報であってよく、ステップＳ１１０において一度だけＮＯと判定された場合には２度目の応答決定処理において決定された応答情報であってよい。 After that, when it is determined that the speech has ended (S110 YES), the information processing device 2 transmits information for outputting a voice corresponding to the response information determined in the most recent response determination process of step S104 to the terminal device 3 (S112). The "response information determined in the most recent response determination process of step S104" may be, for example, the response information determined in the first response determination process if NO has never been determined in step S110, or the response information determined in the second response determination process if NO has only been determined once in step S110.

応答情報に対応する音声を出力するための情報を端末装置３に送信することは、出力部１０６が第１応答情報及び第２応答情報の少なくとも一方を出力することの一例である。 Transmitting information to the terminal device 3 for outputting a voice corresponding to the response information is an example of the output unit 106 outputting at least one of the first response information and the second response information.

また、発話が終了したと判定される場合に応答情報を出力することは、出力部１０６が第２の区切りの後に所定の時間が経過したと判定部１０４によって判定された場合に第２応答情報を出力することの一例である。 In addition, outputting response information when it is determined that the speech has ended is an example of the output unit 106 outputting second response information when the determination unit 104 determines that a predetermined time has elapsed after the second division.

３．１．２テーブル（第１応答情報を出力する場合）
図４は、第１実施形態に係る情報処理装置２の動作を異なる観点から説明するための図である。以下の説明において、「Ｔｈ１」は、発話に区切りが生じたと判定されるための時間の長さに対応する値であり、「Ｔｈ２」は、発話が終了したと判定されるための時間の長さに対応する値であるとする。 3.1.2 Table (when outputting first response information)
4 is a diagram for explaining the operation of the information processing device 2 according to the first embodiment from a different perspective. In the following explanation, "Th1" is a value corresponding to the length of time for determining that a break has occurred in an utterance, and "Th2" is a value corresponding to the length of time for determining that an utterance has ended.

情報処理装置２は、時点ｔ₁～ｔ_N+Th1+Th2+1のそれぞれにおける音声チャンク（音声チャンクｖ₁～ｖ_N+Th1+Th2+1）を逐次的に取得しながら、それぞれの音声チャンクについて発声判定を実行する。 The information processing device 2 sequentially acquires the voice chunks (voice chunks v ₁ to v _N+Th1+Th2+1 ) at each of the times t ₁ to t _N+Th1+Th2+1 , and executes vocalization determination for each of the voice chunks.

情報処理装置２は、Ｎ個の音声チャンクｖ₁～ｖ_Nのそれぞれに基づいて、時点ｔ₁～ｔ_Nにおいて対象者は発声中であると判定し、文字起こしを実行するものとする。音声チャンクｖ₁～ｖ_Nを文字起こししたテキストデータは、第１の発話に関する第１発話情報の一例である。この処理は、図３のステップＳ１００ＮＯ～Ｓ１０２で説明した処理に対応する。 The information processing device 2 determines that the subject is speaking at time points t ₁ to t _N based on each of the N voice chunks v ₁ to v _N , and performs transcription. The text data obtained by transcribing the voice chunks v ₁ to v _N is an example of first utterance information related to the first utterance. This process corresponds to the process described in steps S100 NO to S102 in FIG. 3.

次に、情報処理装置２は、Ｔｈ１個の音声チャンクｖ_N+1～ｖ_N+Th1に基づいて、時点ｔ_N+1～ｔ_N+Th1において対象者は発声していないと判定する。そして、情報処理装置２は、この時間区間において対象者の発声がなかったことに基づいて、発話に区切りが生じていると判定、応答決定処理を実行するものとする。これにより、情報処理装置２は、第１応答情報を決定する。これらの処理は、図３のステップＳ１００ＹＥＳ～Ｓ１０４で説明した処理に対応する。応答決定処理は、Ｔｈ２個の時間区間である時点ｔ_N+Th1+1から時点ｔ_N+Th1+Th2まで継続するものとする。 Next, the information processing device 2 determines that the target person does not speak from time t _N+1 to t N+Th1 based on Th1 voice chunks v _N+1 to v _N+Th1 _. Then, the information processing device 2 determines that there is a break in the speech based on the fact that the target person did not speak in this time period, and executes a response determination process. In this way, the information processing device 2 determines the first response information. These processes correspond to the processes described in steps S100 YES to S104 in FIG. 3. The response determination process is continued from time t _N+Th1+1 to time t _N+Th1+Th2, which is a time period of Th2.

次に、情報処理装置２は、Ｔｈ２個の時間区間である時点ｔ_N+Th1+1から時点ｔ_N+Th1+Th2まで対象者の発声がなかったことに基づいて、発話が終了したと判定し、第１応答情報を出力する。これらの処理は、図３のステップＳ１１０ＹＥＳ～Ｓ１１２で説明した処理に対応する。 Next, the information processing device 2 determines that the target person has finished speaking based on the fact that the target person has not spoken during the Th2 time periods from time t _N+Th1+1 to time t _N+Th1+Th2 , and outputs the first response information. These processes correspond to the processes described in steps S110 YES to S112 in FIG. 3.

３．１．３テーブル（第２応答情報を出力する場合）
図５は、第１実施形態に係る情報処理装置２の動作を異なる観点からさらに説明するための図である。情報処理装置２は、時点ｔ₁～ｔ_M+Th1+Th2+1のそれぞれにおける音声チャンク（音声チャンクｖ₁～ｖ_M+Th1+Th2+1）を逐次的に取得しながら、それぞれの音声チャンクについて発声判定を実行する。なお、時点ｔ₁～ｔ_N+Th1までの動作は図４の例と共通するため、以下では説明を省略する。 3.1.3 Table (when outputting second response information)
5 is a diagram for further explaining the operation of the information processing device 2 according to the first embodiment from a different perspective. The information processing device 2 sequentially acquires voice chunks (voice chunks v ₁ to v _M+Th1+Th2+1 ) at each of the times t ₁ to t _M+Th1+Th2+1 , and executes utterance determination for each voice chunk. Note that the operation from time t ₁ to t _N+Th1 is the same as the example in FIG. 4, and therefore will not be described below.

図４では、時点ｔ_N+Th1+1～ｔ_N+Th1+Th2において対象者は発声していないと判定される例について説明した。図５では、時点ｔ_N+Th1+1～ｔ_N+Th1+Th2の間の時点ｔ_N+Th1+k+1において、対象者の発声が検知される例（すなわち、時点ｔ_N+Th1+1において情報処理装置２が発話に区切りが生じていると判定した後、時点ｔ_N+Th1+Th2で第１応答情報が出力される前に、時点ｔ_N+Th1+k+1において対象者が発声を再開したような場合の例）について説明する。 Fig. 4 describes an example in which it is determined that the target person is not speaking at time points tN+ _Th1 ₊₁ to tN _+Th1+Th2 . Fig. 5 describes an example in which the target person's speech is detected at time point _tN+Th1+ _{k+1 between time points tN+Th1+1} to tN _{+Th1+Th2 (i.e., an example in which the target person resumes speaking at time point tN+Th1+k+} _{1 after the information processing device 2 determines that a break has occurred in the speech at time point tN+Th1} _{+1 and before the first response information is output at time point tN+Th1+Th2} ).

情報処理装置２は、音声チャンクｖ_N+Th1+k+1～ｖ_Mに基づいて、時点ｔ_N+Th1+k+1～ｔ_Mにおいて対象者は発声中であると判定し、文字起こしを実行する。音声チャンクｖ_N+Th1+k+1～ｖ_Mを文字起こししたテキストデータは、第１の発話より後の第２の発話に関する第２発話情報の一例である。この処理は、図３のステップＳ１００ＮＯ～Ｓ１０２で説明した処理に対応する。 The information processing device 2 determines that the subject is speaking at time points tN _+Th1+k+1 to _tM based on the voice chunks vN _+Th1+k+1 to _vM , and performs transcription. The text data obtained by transcribing the voice chunks vN _+Th1+k+1 to _vM is an example of second utterance information related to a second utterance that follows the first utterance. This process corresponds to the process described in steps S100 NO to S102 in FIG. 3.

次に、情報処理装置２は、Ｔｈ１個の音声チャンクｖ_M+1～ｖ_M+Th1に基づいて、時点ｔ_M+1～ｔ_M+Th1において対象者は発声していないと判定するものとする。そして、情報処理装置２は、この時間区間において対象者の発声がなかったことに基づいて、発話に区切りが生じていると判定し、応答決定処理を実行するものとする。これにより、情報処理装置２は、第２応答情報を決定する。これらの処理は、図３のステップＳ１００ＹＥＳ～Ｓ１０４で説明した処理に対応する。応答決定処理は、Ｔｈ２個の時間区間である時点ｔ_M+Th1+1から時点ｔ_M+Th1+Th2まで継続するものとする。 Next, the information processing device 2 determines that the target person does not speak from time t _M+1 to t M+Th1 based on Th1 voice chunks v _M+1 to v _M+Th1 _. Then, the information processing device 2 determines that a break has occurred in the speech based on the fact that the target person did not speak in this time period, and executes a response determination process. In this way, the information processing device 2 determines the second response information. These processes correspond to the processes described in steps S100 YES to S104 in FIG. 3. The response determination process is continued from time t _M+Th1+1 to time t _M+Th1+Th2, which is a time period of Th2.

次に、情報処理装置２は、Ｔｈ２個の時間区間である時点ｔ_M+Th1+1から時点ｔ_M+Th1+Th2まで対象者の発声がなかったことに基づいて、発話が終了したと判定し、第２応答情報を出力する。これらの処理は、図３のステップＳ１１０ＹＥＳ～Ｓ１１２で説明した処理に対応する。 Next, the information processing device 2 determines that the target person has finished speaking based on the fact that the target person has not spoken during the Th2 time periods from time t _+Th1+1 to time t _+Th1+Th2 , and outputs the second response information. These processes correspond to the processes described in steps S110 YES to S112 in FIG. 3.

３．２第２実施形態
図６～８を参照して、第２実施形態に係るシステム１の動作について説明する。第２実施形態では、情報処理装置２が相槌情報を出力する場合のシステム１の態様について例示的に説明する。 6 to 8, the operation of the system 1 according to the second embodiment will be described. In the second embodiment, a mode of the system 1 when the information processing device 2 outputs backchannel information will be described as an example.

３．２．１フローチャート
図６は、第６実施形態に係る情報処理装置２の動作について説明するためのフローチャートである。図６のフローチャートにはステップＳ２００～Ｓ２１２が記載されているが、情報処理装置２は、これらの処理と並行して、音声チャンクの取得、当該音声チャンクに基づく発声中か否かの判定、及び音声情報の文字起こしを継続的・逐次的に実行する。 6 is a flowchart for explaining the operation of the information processing device 2 according to the sixth embodiment. The flowchart in Fig. 6 describes steps S200 to S212, but in parallel with these processes, the information processing device 2 continuously and sequentially acquires audio chunks, determines whether or not speech is being generated based on the audio chunks, and transcribes audio information.

なお、図６のステップＳ２００～Ｓ２０４及びステップＳ２１０～Ｓ２１２は、それぞれ図３のステップＳ１００～Ｓ１０４及びステップＳ１１０～Ｓ１１２に対応してよいため、以下では図６のステップＳ２０６～Ｓ２０８に関して説明する。 Note that steps S200 to S204 and steps S210 to S212 in FIG. 6 may correspond to steps S100 to S104 and steps S110 to S112 in FIG. 3, respectively, so the following description will focus on steps S206 to S208 in FIG. 6.

情報処理装置２は、応答決定処理を実行しながら（Ｓ２０４）、相槌決定処理をさらに実行し得る（Ｓ２０６）。相槌決定処理は、以下の（１）～（３）の処理を含み得る。
（１）第１発話情報（一例では、その時点までに得られた文字起こしのテキストデータ）を含む相槌決定指示をＬＬＭサーバ装置４に送信すること。
（２）ＬＬＭサーバ装置４から相槌情報を取得すること。
（３）相槌情報に対応する音声を出力するための情報を生成すること。
この相槌決定処理は、相槌決定部１０２ｃが、第１応答情報を決定する際に、第１発話情報に対応する相槌に関する相槌情報を決定することの一例である。なお、相槌決定処理は、第１応答情報を決定する際だけに限定されず、ステップＳ２００～ステップＳ２１０のループ毎に実行され得る。 While executing the response determination process (S204), the information processing device 2 may further execute a backchannel determination process (S206). The backchannel determination process may include the following processes (1) to (3).
(1) Transmitting a backchannel decision instruction to the LLM server device 4, including the first speech information (in one example, the transcribed text data obtained up to that point).
(2) Acquiring interjection information from the LLM server device 4.
(3) Generating information for outputting a voice corresponding to the backchannel information.
This backchannel determination process is an example of the backchannel determination unit 102c determining backchannel information related to the first utterance information when determining the first response information. Note that the backchannel determination process is not limited to only when determining the first response information, but may be executed for each loop of steps S200 to S210.

次に、情報処理装置２は、相槌情報を出力する（Ｓ２０８）。すなわち、情報処理装置２は、ステップＳ２００～ステップＳ２１０のループ（対象者が発話を継続している間繰り返される処理）を実行しながら、ステップＳ２１２で応答情報を出力する前に、相槌情報を出力し得る。これにより、対象者は、ＡＩエージェントとのより自然な会話を体感することができる。なお、ステップＳ２０８で相槌情報が出力された後にステップＳ２１２で応答情報が出力されることは、出力部１０６が、相槌情報を出力した後に第２応答情報を出力することの一例である。 Next, the information processing device 2 outputs backchannel information (S208). That is, while executing the loop of steps S200 to S210 (processing repeated while the subject continues speaking), the information processing device 2 may output the backchannel information before outputting the response information in step S212. This allows the subject to experience a more natural conversation with the AI agent. Note that outputting the response information in step S212 after outputting the backchannel information in step S208 is an example of the output unit 106 outputting second response information after outputting the backchannel information.

なお、図６では、ステップＳ２０４と、ステップＳ２０６～Ｓ２０８は便宜上直列に記載されているが、これらの処理は並列に実行されてもよい。すなわち、情報処理装置２は、応答決定処理を実行しながら相槌決定処理及び相槌情報の出力を実行してよい。この構成によれば、応答決定処理が実行されている間にまずは相槌情報が対象者に出力されるため、対象者が体感するレイテンシをさらに抑制し得る。 In FIG. 6, step S204 and steps S206 to S208 are shown in series for convenience, but these processes may be executed in parallel. That is, the information processing device 2 may execute the backchannel determination process and output the backchannel information while executing the response determination process. With this configuration, the backchannel information is first output to the subject while the response determination process is being executed, so that the latency experienced by the subject can be further reduced.

３．２．２テーブル（第１応答情報を出力する場合）
図７は、第２実施形態に係る情報処理装置２の動作を異なる観点から説明するための図である。図４では、時点ｔ_N+Th1+1～時点ｔ_N+Th1+Th2において情報処理装置２は応答決定処理を実行し、その間は情報を出力しない例について説明した。これに対して、図７の例では、情報処理装置２は、時点ｔ_N+1～ｔ_N+Th1において対象者の発声がなかったことに基づいて、発話に区切りが生じていると判定し、応答決定処理及び相槌決定処理を実行する。これにより、情報処理装置２は、第１応答情報及び相槌情報を決定するとともに、相槌情報を出力する。これらの処理は、図６のステップＳ２００ＹＥＳ～Ｓ２０８で説明した処理に対応する。 3.2.2 Table (when outputting first response information)
FIG. 7 is a diagram for explaining the operation of the information processing device 2 according to the second embodiment from a different perspective. In FIG. 4, an example was explained in which the information processing device 2 executes a response determination process from time t N+Th1 ₊₁ to time t _N+Th1+ Th2, and does not output information during that time. In contrast, in the example of FIG. 7, the information processing device 2 determines that a break has occurred in the speech based on the fact that the target person did not speak from time t _N+1 to t _N+Th1 , and executes a response determination process and a backchannel determination process. As a result, the information processing device 2 determines the first response information and backchannel information, and outputs the backchannel information. These processes correspond to the processes explained in steps S200 YES to S208 of FIG. 6.

３．２．３テーブル（第２応答情報を出力する場合）
図８は、第２実施形態に係る情報処理装置２の動作を異なる観点からさらに説明するための図である。図５は、時点ｔ_N+Th1+1～時点ｔ_N+Th1+k及び時点ｔ_M+Th1+1～時点ｔ_M+Th1+Th2のそれぞれにおいて情報処理装置２は応答決定処理を実行し、その間は情報を出力しない例について説明した。 3.2.3 Table (when outputting second response information)
Fig. 8 is a diagram for further explaining the operation of the information processing device 2 according to the second embodiment from a different perspective. Fig. 5 has explained an example in which the information processing device 2 executes a response determination process from time tN+ _Th1+1 to time tN _+Th1+k and from time tM _+Th1+1 to time tM _+Th1+Th2 , and does not output information during those periods.

これに対して、図８の例では、情報処理装置２は、時点ｔ_N+1～ｔ_N+Th1において対象者の発声がなかったことに基づいて、発話に区切りが生じていると判定し、応答決定処理及び相槌決定処理を実行する。情報処理装置２は、さらに、時点ｔ_N+1～ｔ_N+Th1+kにおいて相槌情報を出力する。 8, the information processing device 2 determines that a break has occurred in the speech based on the absence of speech from the target person at time points t _N+1 to t _N+Th1 , and executes a response determination process and a backchannel determination process. The information processing device 2 further outputs backchannel information at time points t _N+1 to t _N+Th1+k .

また、情報処理装置２は、時点ｔ_M+1～ｔ_M+Th1において対象者の発声がなかったことに基づいて、発話に区切りが生じていると判定し、応答決定処理及び相槌決定処理を実行する。情報処理装置２は、さらに、時点ｔ_M+1～ｔ_M+Th1+Th2において相槌情報を出力する。 Furthermore, the information processing device 2 determines that a break has occurred in the speech based on the fact that the target person did not speak from time t _M+1 to t _M+Th1 , and executes a response determination process and a backchannel determination process. The information processing device 2 further outputs backchannel information from time t _M+1 to t _M+Th1+Th2 .

３．３具体例
以下では、対象者による発話が「えーと、今日の夜７時に２名で予約したいんですけど…あ、名前は○○です」であると仮定した場合における、第１応答決定指示、第１応答情報、第２応答決定指示及び第２応答情報の具体例について説明する。なお、第１発話情報は「えーと、今日の夜７時に２名で予約したいんですけど」というテキストデータを含み、第２発話情報は「あ、名前は○○です」というテキストデータを含むものとする。 3.3 Specific Examples In the following, specific examples of the first response determination instruction, the first response information, the second response determination instruction, and the second response information will be described assuming that the utterance by the target person is "Um, I'd like to make a reservation for two people at 7pm tonight... Oh, my name is XX." Note that the first utterance information includes text data of "Um, I'd like to make a reservation for two people at 7pm tonight," and the second utterance information includes text data of "Oh, my name is XX."

一例では、第１応答決定指示は、以下の（１）～（３）のテキストデータを含み得る。
（１）システムプロンプト：「あなたは飲食店の予約受付を行う優秀なＡＩエージェントです。」
（２）会話履歴：「対象者『あ、もしもし？』→ＡＩエージェント『はい、〇〇（店名）でございます。』」
（３）第１発話情報を含む指示：「直近で、対象者から、『えーと、今日の夜７時に２名で予約したいんですけど』という発話がありました。これに対する応答を作成してください。」 In one example, the first response determination instruction may include the following text data (1) to (3).
(1) System prompt: "You are an intelligent AI agent who takes reservations at restaurants."
(2) Conversation history: "Subject: 'Hello?' → AI agent: 'Hello, this is XX (store name)'"
(3) Instructions including first utterance information: "The subject recently said, 'Um, I'd like to make a reservation for two people at 7pm tonight.' Please create a response to this."

ＬＬＭサーバ装置４は、この例の第１応答決定指示に対して、例えば「本日の夜７時に２名様でのご予約ですね。」というテキストデータを含む第１応答情報を生成し、情報処理装置２に対して送信し得る。 In response to the first response determination instruction in this example, the LLM server device 4 may generate first response information including text data such as "This is a reservation for two people at 7 p.m. tonight," and transmit this to the information processing device 2.

一例では、第２応答決定指示は、以下の（１）～（４）のテキストデータを含み得る。
（１）システムプロンプト：「あなたは飲食店の予約受付を行う優秀なＡＩエージェントです。」
（２）会話履歴：「対象者『あ、もしもし？』→ＡＩエージェント『はい、〇〇（店名）でございます。』→対象者『えーと、今日の夜７時に２名で予約したいんですけど』」
（３）第１応答情報：「本日の夜７時に２名様でのご予約ですね。」
（４）第２発話情報を含む指示：「対象者から、追加で『あ、名前は○○です』という発話がありました。第１応答情報を踏まえて、応答を作成してください。」 In one example, the second response determination instruction may include the following text data (1) to (4).
(1) System prompt: "You are an intelligent AI agent who takes reservations at restaurants."
(2) Conversation history: "Subject: 'Hello?' → AI agent: 'Hello, this is XX (store name)' → Subject: 'Um, I'd like to make a reservation for two people at 7pm tonight.'"
(3) First response: "Your reservation is for two people at 7 p.m. tonight."
(4) Instructions including second utterance information: "The subject additionally uttered, 'Oh, my name is ____.' Please create a response based on the first response information."

ＬＬＭサーバ装置４は、この例の第２応答決定指示に対して、例えば「本日の夜７時に２名様でのご予約ですね。お名前は○○様でお間違いないでしょうか？」というテキストデータを含む第２応答情報を生成し、情報処理装置２に対して送信し得る。なお「第１応答情報を踏まえて、応答を作成」することは、第１応答情報に含まれるテキストデータが第２応答情報においても含まれるようにすることであってよく、第１応答情報に含まれるテキストデータの続きに相当するテキストデータが第２応答情報に含まれるようにすることであってよく、第１応答情報を参照しつつこれに拘束されることなく第２応答情報を決定することでもよい。 In response to the second response determination instruction in this example, the LLM server device 4 may generate second response information including text data such as "This is a reservation for two people at 7pm tonight. Is your name Mr./Ms. XX correct?" and transmit it to the information processing device 2. Note that "creating a response based on the first response information" may mean that the text data included in the first response information is also included in the second response information, or that text data corresponding to the continuation of the text data included in the first response information is included in the second response information, or that the second response information is determined by referring to the first response information but not being bound by it.

４ハードウェア構成
図９を参照して、上述してきたシステム１に含まれる装置をコンピュータ７０により実現する場合のハードウェア構成の一例を説明する。なお、それぞれの装置の機能は、複数台の装置に分けて実現することもできる。 9, an example of a hardware configuration in which the devices included in the above-described system 1 are realized by a computer 70 will be described. Note that the functions of each device can also be realized by dividing them into multiple devices.

図９に示すように、コンピュータ７０は、プロセッサ７００と、記憶装置７０２と、入力Ｉ／Ｆ７０４と、データＩ／Ｆ７０６と、通信Ｉ／Ｆ７０８、及び表示装置７１０を含む。 As shown in FIG. 9, the computer 70 includes a processor 700, a storage device 702, an input I/F 704, a data I/F 706, a communication I/F 708, and a display device 710.

プロセッサ７００は、記憶装置７０２に記憶されているプログラムを実行することによりコンピュータ７０における様々な処理を制御する。例えば、情報処理装置２の制御部１０が備える各機能部等は、記憶装置７０２に記憶されたプログラムを、プロセッサ７００が実行することにより実現可能である。 The processor 700 controls various processes in the computer 70 by executing programs stored in the storage device 702. For example, each functional unit of the control unit 10 of the information processing device 2 can be realized by the processor 700 executing a program stored in the storage device 702.

記憶装置７０２は、例えばＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等の記憶媒体である。ＲＡＭは、プロセッサ７００によって実行されるプログラムのプログラムコードや、プログラムの実行時に必要となるデータを一時的に記憶する。 The storage device 702 is a storage medium such as a RAM (Random Access Memory). The RAM temporarily stores the program code of the program executed by the processor 700 and data required when the program is executed.

記憶装置７０２は、他にも、例えばハードディスクドライブ（ＨＤＤ）やフラッシュメモリ等の不揮発性の記憶媒体である。記憶装置７０２は、オペレーティングシステムや、上記各構成を実現するための各種プログラムを記憶する。当該各種プログラムを格納した記憶媒体は、コンピュータ読み取り可能な非一時的な記憶媒体（Ｎｏｎ－ｔｒａｎｓｉｔｏｒｙｃｏｍｐｕｔｅｒｒｅａｄａｂｌｅｍｅｄｉｕｍ）であってもよい。この他、記憶装置７０２は、各種情報を登録するテーブルと、当該テーブルを管理するＤＢを記憶することも可能である。このようなプログラムやデータは、必要に応じて記憶装置７０２にロードされることにより、プロセッサ７００から参照される。 The storage device 702 may also be a non-volatile storage medium such as a hard disk drive (HDD) or flash memory. The storage device 702 stores an operating system and various programs for implementing the above-mentioned configurations. The storage medium storing the various programs may be a non-transitory computer readable medium. In addition, the storage device 702 may store a table for registering various information and a DB for managing the table. Such programs and data are loaded into the storage device 702 as necessary and referenced by the processor 700.

入力Ｉ／Ｆ７０４は、対象者からの入力を受け付けるためのデバイスである。入力Ｉ／Ｆ７０４の具体例としては、カメラ、ボタン、マイク、キーボード、マウス、タッチパネル、各種センサ、ウェアラブル・デバイス等が挙げられる。入力Ｉ／Ｆ７０４は、例えばＵＳＢ（ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ）等のインタフェースを介してコンピュータ７０に接続されてもよい。 The input I/F 704 is a device for receiving input from the subject. Specific examples of the input I/F 704 include a camera, a button, a microphone, a keyboard, a mouse, a touch panel, various sensors, a wearable device, and the like. The input I/F 704 may be connected to the computer 70 via an interface such as a Universal Serial Bus (USB).

データＩ／Ｆ７０６は、コンピュータ７０の外部からデータを入力するためのデバイスである。データＩ／Ｆ７０６の具体例としては、各種記憶媒体に記憶されているデータを読み取るためのドライブ装置等がある。データＩ／Ｆ７０６は、コンピュータ７０の外部に設けられることも考えられる。その場合、データＩ／Ｆ７０６は、例えばＵＳＢ等のインタフェースを介してコンピュータ７０へと接続される。 The data I/F 706 is a device for inputting data from outside the computer 70. A specific example of the data I/F 706 is a drive device for reading data stored in various storage media. The data I/F 706 may be provided outside the computer 70. In that case, the data I/F 706 is connected to the computer 70 via an interface such as a USB.

通信Ｉ／Ｆ７０８は、コンピュータ７０の外部の装置と有線または無線により、通信ネットワーク５を介したデータ通信を行うためのデバイスである。通信Ｉ／Ｆ７０８は、コンピュータ７０の外部に設けられることも考えられる。その場合、通信Ｉ／Ｆ７０８は、例えばＵＳＢ等のインタフェースを介してコンピュータ７０に接続される。 The communication I/F 708 is a device for performing data communication via the communication network 5, either wired or wirelessly, with devices external to the computer 70. The communication I/F 708 may be provided external to the computer 70. In that case, the communication I/F 708 is connected to the computer 70 via an interface such as a USB.

表示装置７１０は、各種情報を表示するためのデバイスである。表示装置７１０の具体例としては、例えば液晶ディスプレイや有機ＥＬ（Ｅｌｅｃｔｒｏ－Ｌｕｍｉｎｅｓｃｅｎｃｅ）ディスプレイ、ウェアラブル・デバイスのディスプレイ等が挙げられる。表示装置７１０は、コンピュータ７０の外部に設けられてもよい。その場合、表示装置７１０は、例えばディスプレイケーブル等を介してコンピュータ７０に接続される。また、入力Ｉ／Ｆ７０４としてタッチパネルが採用される場合には、表示装置７１０は、入力Ｉ／Ｆ７０４と一体化して構成することが可能である。 The display device 710 is a device for displaying various types of information. Specific examples of the display device 710 include a liquid crystal display, an organic EL (Electro-Luminescence) display, a display of a wearable device, and the like. The display device 710 may be provided outside the computer 70. In that case, the display device 710 is connected to the computer 70 via, for example, a display cable. In addition, when a touch panel is used as the input I/F 704, the display device 710 can be configured as an integral part of the input I/F 704.

また、上記実施の形態で記載されたシステム１に含まれる装置が備える構成要素は、記憶装置７０２に格納されたプログラムがプロセッサ７００によって実行されることで、定められた処理が他のハードウェアと協働して実現されるものとする。また、言い換えれば、これらの構成要素は、ソフトウェアまたはファームウェアとしても、それと対応するハードウェアとしても想定され、その双方の概念において、「機能」、「手段」、「部」、「処理回路」、「ユニット」、または「モジュール」等とも記載され、またそれぞれに読み替えることができる。 The components of the device included in the system 1 described in the above embodiment are assumed to be implemented in cooperation with other hardware by the processor 700 executing a program stored in the storage device 702. In other words, these components are assumed to be software or firmware, and also corresponding hardware, and in both concepts, they are also described as "functions," "means," "parts," "processing circuits," "units," or "modules," etc., and can be interpreted as each of these terms.

５変形例
以上説明した実施形態は、本開示の理解を容易にするためのものであり、本開示を限定して解釈するためのものではない。実施形態が備え得る構成は、例示したものに限定されるわけではなく適宜変更することができる。また、異なる実施形態で示した構成同士を部分的に置換し又は組み合わせることが可能である。 5. Modifications The above-described embodiments are intended to facilitate understanding of the present disclosure, and are not intended to limit and interpret the present disclosure. The configurations that the embodiments may have are not limited to those exemplified, and may be modified as appropriate. In addition, the configurations shown in different embodiments may be partially substituted or combined with each other.

上記実施形態で「第１」及び「第２」の接頭辞をつけて説明した事項は、当業者の知識に基づいて「第１」～「第Ｎ」（ただし、Ｎは自然数）の関係に拡張して理解され得る。 The matters described in the above embodiment with the prefixes "first" and "second" can be understood as extending the relationship between "first" to "Nth" (where N is a natural number) based on the knowledge of those skilled in the art.

一例では、取得部１００は、対象者の発話に関する発話情報を取得することであって、発話情報は、第１の発話に関する第１発話情報～第Ｎの発話に関する第Ｎ発話情報を含む、発話情報を取得してよい。 In one example, the acquisition unit 100 acquires speech information related to the target person's speech, and the acquired speech information may include first speech information related to a first utterance through Nth speech information related to an Nth utterance.

一例では、自然数ｎ＝２～Ｎに関して、第ｎ応答決定部は、発話に関する所定の条件が満たされる場合に、第ｎ－１応答情報を対象者に出力する前に、第１発話情報～第ｎ－１発話情報及び第１応答情報～第ｎ－１応答情報の少なくとも１つと、第ｎ発話情報とを含む第ｎ応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する他の応答に関する第ｎ応答情報を決定してよい。 In one example, for natural numbers n = 2 to N, when a predetermined condition related to the utterance is satisfied, the nth response determination unit may determine the nth response information related to another response to at least a portion of the utterance by inputting an nth response determination instruction including at least one of the first utterance information to the nth-1st utterance information and the first response information to the nth-1st response information, and the nth utterance information, into the large-scale language model before outputting the nth-1st response information to the subject.

一例では、出力部１０４は、自然数ｎ＝２～Ｎに関して、第ｎ－１応答情報が決定された後、第ｎ応答情報が決定されるまでの間に対象者による発話が終了したと判定される場合に、第１応答情報～第ｎ－１応答情報の少なくとも１つを出力し、第ｎ応答情報が決定された後に対象者による発話が終了したと判定される場合に、第１応答情報～第ｎ応答情報の少なくとも１つを出力してよい。 In one example, for natural numbers n = 2 to N, the output unit 104 may output at least one of the first response information to the n-1th response information when it is determined that the subject's speech has ended after the n-1th response information has been determined and before the nth response information has been determined, and may output at least one of the first response information to the nth response information when it is determined that the subject's speech has ended after the nth response information has been determined.

上記実施形態において、ＬＬＭはＬＬＭサーバ装置４にホストされるものとして説明したが、これに限定されない。ＬＬＭは情報処理装置２にホストされてよく、端末装置３にホストされてよい。 In the above embodiment, the LLM has been described as being hosted on the LLM server device 4, but this is not limited to the above. The LLM may be hosted on the information processing device 2 or on the terminal device 3.

上記実施形態において、情報処理装置２が端末装置３から音声情報を取得し、当該音声情報に基づいてＡＩエージェントによる通話を効率化する処理を実行するものとして説明したが、これに限定されない。上記実施形態において情報処理装置２が備えるものとして説明した機能の少なくとも一部は端末装置３が備えてよい。 In the above embodiment, the information processing device 2 is described as acquiring voice information from the terminal device 3 and executing a process for making calls by an AI agent more efficient based on the voice information, but this is not limited to the above. At least some of the functions described as being provided by the information processing device 2 in the above embodiment may be provided by the terminal device 3.

上記実施形態において、情報処理装置２は端末装置３と通信するものとして説明したが、これに限定されない。情報処理装置２と端末装置３との間には所定の中間サーバがあってよい。 In the above embodiment, the information processing device 2 is described as communicating with the terminal device 3, but this is not limited to the above. There may be a predetermined intermediate server between the information processing device 2 and the terminal device 3.

上記実施形態では、相槌情報及び第２応答情報が決定される場合において、当該第２応答情報が出力される場合には、相槌情報もさらに出力される例について説明したが、これに限定されない。例えば、相槌決定処理及び第２応答情報の決定に係る応答決定処理が並列に実行される場合において、相槌情報を出力するか否かは、当該応答決定処理が完了するタイミングに基づいて決定され得る。一例では、相槌決定処理が完了する前に応答決定処理が完了した場合には、情報処理装置２は、相槌情報を出力することなく第２応答情報を出力し得る。これに対して、相槌決定処理が完了した後に応答決定処理が完了する場合には、情報処理装置２は、当該応答決定処理の実行中に相槌情報を出力し、その後第２応答情報を出力し得る。このような構成によれば、応答情報が既に決定されているにも関わらず相槌情報が先に出力されるような状況を回避することができ、結果として対象者に早期に応答情報を出力できる場合がある。 In the above embodiment, an example has been described in which, when the backchannel information and the second response information are determined, if the second response information is output, the backchannel information is also output, but this is not limiting. For example, when the backchannel determination process and the response determination process related to the determination of the second response information are executed in parallel, whether or not to output the backchannel information can be determined based on the timing at which the response determination process is completed. In one example, if the response determination process is completed before the backchannel determination process is completed, the information processing device 2 can output the second response information without outputting the backchannel information. In contrast, if the response determination process is completed after the backchannel determination process is completed, the information processing device 2 can output the backchannel information during the execution of the response determination process, and then output the second response information. With this configuration, it is possible to avoid a situation in which the backchannel information is output first even though the response information has already been determined, and as a result, it may be possible to output the response information to the target person early.

６補足
本実施形態における文言は、矛盾が生じない範囲において、以下のように理解され得る。 6 Supplementary Note The wording in this embodiment can be understood as follows to the extent that no contradiction occurs.

本実施形態において、「所定の情報に基づいて所定の処理を実行する」ことは、当該所定の情報の少なくとも一部に基づいて当該所定の処理を実行することと、少なくとも当該所定の情報に基づいて当該所定の処理を実行することと、当該所定の情報に基づいて確率的に当該所定の処理を実行することとのいずれかであってもよい。すなわち、「所定の情報に基づいて所定の処理を実行する」ことは、当該所定の情報のみに基づいて当該所定の処理を実行することに限定されない。 In this embodiment, "executing a predetermined process based on the specified information" may mean any one of executing the predetermined process based on at least a part of the specified information, executing the predetermined process based on at least the specified information, and executing the predetermined process probabilistically based on the specified information. In other words, "executing a predetermined process based on the specified information" is not limited to executing the predetermined process based only on the specified information.

本実施形態において、「所定の処理に基づいて他の処理を実行する」ことは、当該所定の処理が実行された後に当該他の処理を実行することと、当該所定の処理と当該他の処理を連続的に実行することと、当該所定の処理により決定された情報に基づいて当該他の処理を実行することと、当該所定の処理が実行されたことを条件に当該他の処理を実行することと、当該所定の処理という手段によって当該他の処理を実行することとのいずれかであってもよい。なお、「所定の処理によって他の処理を実行する」ことも、「所定の処理に基づいて他の処理を実行する」ことと同様に理解されてよい。 In this embodiment, "executing another process based on a specified process" may mean any of the following: executing the other process after the specified process is executed; executing the specified process and the other process consecutively; executing the other process based on information determined by the specified process; executing the other process on the condition that the specified process has been executed; and executing the other process by means of the specified process. Note that "executing another process by a specified process" may be understood in the same way as "executing another process based on a specified process".

本実施形態において、「所定の情報が他の情報を含む」ことは、当該所定の情報の少なくとも一部が当該他の情報であることと、当該所定の情報に基づいて当該他の情報を取得することができる状態であることとのいずれかであってもよい。 In this embodiment, "specific information includes other information" may mean either that at least a portion of the specific information is the other information, or that the other information can be obtained based on the specific information.

本実施形態において、「所定の処理が他の処理を含む」ことは、当該所定の処理の少なくとも一部が当該他の処理であること（すなわち、当該所定の処理の結果を得る過程において当該他の処理が行われること）と、当該所定の処理の一態様が当該他の処理であることとのいずれかであってもよい。 In this embodiment, "a specified process includes another process" may mean either that at least a part of the specified process is the other process (i.e., that the other process is performed in the process of obtaining the result of the specified process) or that one aspect of the specified process is the other process.

本実施形態において、「所定の対象と他の対象とが対応する」ことは、当該所定の対象と当該他の対象とが１対１の関係にあることと、当該所定の対象に基づいて特定される所定の集合に当該他の対象が含まれることと、当該所定の対象に基づいて当該他の対象を特定し得ることとのいずれかであってもよい。なお、「所定の対象と他の対象とが対応する」ことは、そのことが例えばデータベース上で管理されることに限定されない。また、「所定の対象と他の対象とが関連付けられる」ことも、「所定の対象と他の対象とが対応する」ことと同様に理解されてよい。 In this embodiment, "a specific target corresponds to another target" may mean that the specific target and the other target are in a one-to-one relationship, that the other target is included in a specific set identified based on the specific target, or that the other target can be identified based on the specific target. Note that "a specific target corresponds to another target" is not limited to being managed, for example, on a database. In addition, "a specific target is associated with another target" may be understood in the same way as "a specific target corresponds to another target".

本実施形態において、「情報を取得する」ことは、当該情報を制御部１０において処理可能にすることを含む。「情報を取得する」ことは、例えば、当該情報を他の装置から受信すること、所定の処理によりその情報を得ること、及びその情報を記憶部１２から読み出すことと等であってよい。 In this embodiment, "obtaining information" includes making the information processable in the control unit 10. "Obtaining information" may mean, for example, receiving the information from another device, obtaining the information through a specified process, reading the information from the storage unit 12, etc.

本実施形態において、「情報を生成する」ことは、所定の処理により得られる情報を制御部１０において処理可能にすることと、所定の処理により得られる情報を記憶部１２に記憶することとのいずれかであってもよい。 In this embodiment, "generating information" may mean either making the information obtained by the specified processing process processable in the control unit 10 or storing the information obtained by the specified processing in the storage unit 12.

本実施形態において、「情報を決定する」ことは、一以上の情報の中から少なくとも１つ選択することと、新たにその情報を生成することとのいずれかであってもよい。 In this embodiment, "determining information" may mean either selecting at least one piece of information from one or more pieces of information, or generating new information.

本実施形態において、「情報を出力する」ことは、情報を他の装置に対して送信することと、その情報を音声又は映像により出力することとのいずれかであってもよい。 In this embodiment, "outputting information" may mean either transmitting the information to another device or outputting the information as audio or video.

７構成例
本開示は、以下の技術を含む。 7 Configuration Examples The present disclosure includes the following techniques.

［付記１］
対象者の発話に関する発話情報を取得する取得部１００であって、発話情報は、第１の発話に関する第１発話情報と、当該第１の発話より後の第２の発話に関する第２発話情報とを含む、取得部１００と、第１発話情報を含む第１応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する応答に関する第１応答情報を決定する第１応答決定部１０２ａと、発話に関する所定の条件が満たされる場合に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する他の応答に関する第２応答情報を決定する第２応答決定部１０２ｂと、所定の条件が満たされない場合には第１応答情報を出力し、所定の条件が満たされる場合には第１応答情報及び第２応答情報の少なくとも一方を出力する出力部１０６と、を備える、情報処理装置２。 [Appendix 1]
an acquisition unit that acquires speech information regarding an utterance of a target person, the speech information including first utterance information regarding a first utterance and second utterance information regarding a second utterance subsequent to the first utterance; a first response determination unit that determines first response information regarding a response to at least a part of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model; a second response determination unit that determines second response information regarding another response to at least a part of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into the large-scale language model before outputting the first response information to the target person when a predetermined condition regarding the utterance is satisfied; and an output unit that outputs the first response information when the predetermined condition is not satisfied, and outputs at least one of the first response information and the second response information when the predetermined condition is satisfied.

［付記２］
第１応答情報を決定する際に、第１発話情報に対応する相槌に関する相槌情報を決定する相槌決定部１０２ｃ、をさらに備え、出力部１０６は、相槌情報をさらに出力する、
付記１に記載の情報処理装置２。 [Appendix 2]
a backchannel determining unit that determines backchannel information related to a backchannel corresponding to the first utterance information when determining the first response information, and the output unit further outputs the backchannel information.
2. The information processing device 2 according to claim 1.

［付記３］
出力部１０６が、所定の条件が満たされない場合には、相槌情報を出力した後に第１応答情報を出力し、所定の条件が満たされる場合には、相槌情報を出力した後に第１応答情報及び第２応答情報の少なくとも一方を出力する、付記２に記載の情報処理装置２。 [Appendix 3]
An information processing device 2 described in Appendix 2, in which the output unit 106 outputs interjection information and then first response information when a specified condition is not satisfied, and outputs the interjection information and then at least one of the first response information and the second response information when a specified condition is satisfied.

［付記４］
相槌決定部１０２ｃは、第１発話情報を含む相槌決定指示を大規模言語モデルに入力することによって相槌情報を決定する、付記２又は３に記載の情報処理装置２。 [Appendix 4]
The information processing device 2 according to claim 2 or 3, wherein the backchannel determining unit 102c determines the backchannel information by inputting a backchannel determination instruction including the first utterance information into a large-scale language model.

［付記５］
発話に第１の区切りが生じた否かを判定する判定部１０４、をさらに備え、第１発話情報は、発話のうち、第１の区切りまでの部分に関する情報を含み、第１応答決定部１０２ａは、発話に第１の区切りが生じたと判定部１０４によって判定された場合に第１応答決定指示を大規模言語モデルに入力する、付記１に記載の情報処理装置２。 [Appendix 5]
The information processing device 2 described in Appendix 1 further includes a judgment unit 104 that judges whether a first division has occurred in the utterance, the first utterance information includes information regarding the part of the utterance up to the first division, and the first response determination unit 102a inputs a first response determination instruction to the large-scale language model when the judgment unit 104 judges that the first division has occurred in the utterance.

［付記６］
判定部１０４は、発話の第１の区切りの後に第２の区切りが生じたか否かをさらに判定し、第２発話情報は、発話のうち、第２の区切りまでの部分に関する情報を含み、第２応答決定部１０２ｂは、発話に第２の区切りが生じたと判定部１０４によって判定された場合に第２応答決定指示を大規模言語モデルに入力する、付記５に記載の情報処理装置２。 [Appendix 6]
The information processing device 2 described in Appendix 5, wherein the determination unit 104 further determines whether or not a second division has occurred after the first division of the utterance, the second utterance information includes information relating to the part of the utterance up to the second division, and the second response determination unit 102b inputs a second response determination instruction to the large-scale language model when the determination unit 104 determines that a second division has occurred in the utterance.

［付記７］
判定部１０４は、発話の第２の区切りの後に、応答の出力に関する所定の時間が経過したか否かをさらに判定し、出力部１０６は、第２の区切りの後に所定の時間が経過したと判定された場合に第１応答情報及び第２応答情報の少なくとも一方を出力する、付記６に記載の情報処理装置２。 [Appendix 7]
The information processing device 2 described in Appendix 6, wherein the determination unit 104 further determines whether a predetermined time for outputting a response has elapsed after the second division of the utterance, and the output unit 106 outputs at least one of the first response information and the second response information when it is determined that the predetermined time has elapsed after the second division.

［付記８］
第１応答情報を決定する際に、第１発話情報に対応する相槌に関する相槌情報を決定する相槌決定部１０２ｃ、をさらに備え、出力部１０６は、相槌情報をさらに出力し、所定の時間は、相槌情報に基づいて決定される、付記７に記載の情報処理装置２。 [Appendix 8]
An information processing device 2 as described in Appendix 7, further comprising an interjection determination unit 102c that determines interjection information regarding an interjection corresponding to the first utterance information when determining the first response information, wherein the output unit 106 further outputs the interjection information, and the specified time is determined based on the interjection information.

［付記９］
出力部１０６は、所定の条件が満たされない場合には、対象者の端末装置が第１応答情報を音声により出力するように制御し、所定の条件が満たされる場合には、第１応答情報及び第２応答情報の少なくとも一方を音声により出力するように制御する、付記１～８のいずれか１つに記載の情報処理装置２。 [Appendix 9]
The information processing device 2 described in any one of Appendices 1 to 8, wherein the output unit 106 controls the subject's terminal device to output the first response information by voice when a specified condition is not met, and controls the subject's terminal device to output at least one of the first response information and the second response information by voice when a specified condition is met.

［付記１０］
第１応答決定部１０２ａは、第１応答情報を決定する際に、当該第１応答情報を音声により出力するための情報をさらに決定し、第２応答決定部１０２ｂは、第２応答情報を決定する際に、当該第２応答情報を音声により出力するための情報をさらに決定する、付記９に記載の情報処理装置２。 [Appendix 10]
The information processing device 2 described in Appendix 9, wherein the first response determination unit 102a, when determining the first response information, further determines information for outputting the first response information by voice, and the second response determination unit 102b, when determining the second response information, further determines information for outputting the second response information by voice.

［付記１１］
一以上のテキストのそれぞれを音声により出力するための情報を記憶する記憶部１２、をさらに備え、第１応答情報が一以上のテキストの少なくとも１つと整合する場合には、当該第１応答情報を音声により出力するための情報を決定することは、当該整合するテキストを音声により出力するための情報を記憶部から取得することを含み、第１応答情報が一以上のテキストのいずれとも整合しない場合には、当該第１応答情報を音声により出力するための情報を決定することは、所定の音声生成プログラムに基づいて当該第１応答情報を音声により出力するための情報を生成することを含み、第２応答情報が一以上のテキストの少なくとも１つと整合する場合には、当該第２応答情報を音声により出力するための情報を決定することは、当該整合するテキストを音声により出力するための情報を記憶部から取得することを含み、第２応答情報が一以上のテキストのいずれとも整合しない場合には、当該第２応答情報を音声により出力するための情報を決定することは、所定の音声生成プログラムに基づいて当該第２応答情報を音声により出力するための情報を生成することを含む、付記１０に記載の情報処理装置２。 [Appendix 11]
and a storage unit (12) configured to store information for audio output of each of the one or more texts, wherein, when the first response information matches at least one of the one or more texts, determining information for audio output of the first response information includes acquiring information for audio output of the matching text from the storage unit, and when the first response information does not match any of the one or more texts, determining information for audio output of the first response information includes generating information for audio output of the first response information based on a predetermined voice generation program, when the second response information matches at least one of the one or more texts, determining information for audio output of the second response information includes acquiring information for audio output of the matching text from the storage unit, and when the second response information does not match any of the one or more texts, determining information for audio output of the second response information includes generating information for audio output of the second response information based on a predetermined voice generation program.

［付記１２］
情報処理装置２が、対象者の発話に関する発話情報を取得することであって、発話情報は、第１の発話に関する第１発話情報と、当該第１の発話より後の第２の発話に関する第２発話情報とを含む、発話情報を取得することと、第１発話情報を含む第１応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する応答に関する第１応答情報を決定することと、発話に関する所定の条件が満たされる場合に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する他の応答に関する第２応答情報を決定することと、所定の条件が満たされない場合には第１応答情報を出力し、所定の条件が満たされる場合には第１応答情報及び第２応答情報の少なくとも一方を出力することと、を実行する、情報処理方法。 [Appendix 12]
an information processing method in which an information processing device 2 acquires speech information regarding an utterance of a target person, the speech information including first utterance information regarding a first utterance and second utterance information regarding a second utterance subsequent to the first utterance; determines first response information regarding a response to at least a portion of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model; determines second response information regarding another response to at least a portion of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into the large-scale language model before outputting the first response information to the target, when a predetermined condition regarding the utterance is satisfied; outputs the first response information when the predetermined condition is not satisfied, and outputs at least one of the first response information and the second response information when the predetermined condition is satisfied.

［付記１３］
情報処理装置２に、対象者の発話に関する発話情報を取得することであって、発話情報は、第１の発話に関する第１発話情報と、当該第１の発話より後の第２の発話に関する第２発話情報とを含む、発話情報を取得することと、第１発話情報を含む第１応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する応答に関する第１応答情報を決定することと、発話に関する所定の条件が満たされる場合に、第１応答情報を対象者に出力する前に、第１発話情報及び／又は第１応答情報と、第２発話情報とを含む第２応答決定指示を大規模言語モデルに入力することによって、発話の少なくとも一部に対する他の応答に関する第２応答情報を決定することと、所定の条件が満たされない場合には第１応答情報を出力し、所定の条件が満たされる場合には第１応答情報及び第２応答情報の少なくとも一方を出力することと、を実行させる、プログラム。 [Appendix 13]
A program that causes an information processing device (2) to execute the following steps: acquire speech information regarding an utterance of a target person, the speech information including first utterance information regarding a first utterance and second utterance information regarding a second utterance subsequent to the first utterance; determine first response information regarding a response to at least a portion of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model; determine second response information regarding another response to at least a portion of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into the large-scale language model before outputting the first response information to the target, when a predetermined condition regarding the utterance is satisfied; output the first response information when the predetermined condition is not satisfied, and output at least one of the first response information and the second response information when the predetermined condition is satisfied.

１…システム、２…情報処理装置、３…端末装置、４…ＬＬＭサーバ装置、１０…制御部、１２…記憶部、７０…コンピュータ、１００…取得部、１０２…決定部、１０２ａ…第１応答決定部、１０２ｂ…第２応答決定部、１０２ｃ…相槌決定部、１０４…判定部、１０６…出力部、７００…プロセッサ 1...system, 2...information processing device, 3...terminal device, 4...LLM server device, 10...control unit, 12...storage unit, 70...computer, 100...acquisition unit, 102...determination unit, 102a...first response determination unit, 102b...second response determination unit, 102c...backchannel determination unit, 104...determination unit, 106...output unit, 700...processor

Claims

an acquisition unit that acquires speech information related to an utterance of a target person, the speech information including first utterance information related to a first utterance and second utterance information related to a second utterance subsequent to the first utterance;
a first response determination unit that determines first response information related to a response to at least a part of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model;
a second response determination unit that determines second response information related to another response to at least a part of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into a large-scale language model before outputting the first response information to the target person when a predetermined condition related to the utterance is satisfied;
an output unit that outputs the first response information when the predetermined condition is not satisfied, and outputs at least one of the first response information and the second response information when the predetermined condition is satisfied;
An information processing device comprising:

A backchannel determination unit that determines backchannel information related to a backchannel corresponding to the first utterance information when determining the first response information,
The output unit further outputs the backchannel information.
The information processing device according to claim 1 .

When the predetermined condition is not satisfied, the output unit outputs the backchannel information and then outputs the first response information, and when the predetermined condition is satisfied, the output unit outputs the backchannel information and then outputs at least one of the first response information and the second response information.
The information processing device according to claim 2 .

The backchannel determination unit determines the backchannel information by inputting a backchannel determination instruction including the first utterance information into a large-scale language model.
The information processing device according to claim 2 .

a determination unit that determines whether a first division has occurred in the utterance,
the first utterance information includes information on a portion of the utterance up to the first break,
the first response determination unit inputs the first response determination instruction to a large-scale language model when the determination unit determines that the first break has occurred in the utterance.
The information processing device according to claim 1 .

The determination unit further determines whether a second segment of the utterance occurs after the first segment of the utterance;
the second utterance information includes information on a portion of the utterance up to the second break,
the second response determination unit inputs an instruction to determine the second response to a large-scale language model when the determination unit determines that the second break has occurred in the utterance.
The information processing device according to claim 5 .

The determination unit further determines whether or not a predetermined time for outputting a response has elapsed after the second division of the utterance;
the output unit outputs at least one of the first response information and the second response information when the determination unit determines that the predetermined time has elapsed after the second division.
The information processing device according to claim 6.

A backchannel determination unit that determines backchannel information related to a backchannel corresponding to the first utterance information when determining the first response information,
The output unit further outputs the backchannel information,
The predetermined time is determined based on the backchannel information.
The information processing device according to claim 7.

The output unit controls the terminal device of the subject to output the first response information by voice when the predetermined condition is not satisfied, and controls the terminal device of the subject to output at least one of the first response information and the second response information by voice when the predetermined condition is satisfied.
The information processing device according to claim 1 .

The first response determination unit further determines information for outputting the first response information by voice when determining the first response information,
The second response determination unit further determines information for outputting the second response information by voice when determining the second response information.
The information processing device according to claim 9.

a storage unit configured to store information for outputting each of the one or more texts by voice,
When the first response information is consistent with at least one of the one or more texts, determining information for outputting the first response information by voice includes obtaining information for outputting the consistent text by voice from the storage unit;
When the first response information does not match any of the one or more texts, determining information for outputting the first response information by voice includes generating information for outputting the first response information by voice based on a predetermined voice generation program;
When the second response information is consistent with at least one of the one or more texts, determining information for outputting the second response information by voice includes obtaining information for outputting the consistent text by voice from the storage unit;
When the second response information does not match any of the one or more texts, determining information for outputting the second response information by voice includes generating information for outputting the second response information by voice based on the predetermined voice generation program.
The information processing device according to claim 10.

An information processing device,
acquiring speech information relating to an utterance by a target person, the speech information including first utterance information relating to a first utterance and second utterance information relating to a second utterance subsequent to the first utterance;
determining first response information related to a response to at least a portion of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model;
determining second response information related to another response to at least a portion of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into a large-scale language model before outputting the first response information to the target person when a predetermined condition related to the utterance is satisfied;
outputting the first response information when the predetermined condition is not satisfied, and outputting at least one of the first response information and the second response information when the predetermined condition is satisfied;
An information processing method.

In the information processing device,
acquiring speech information relating to an utterance of a target person, the speech information including first utterance information relating to a first utterance and second utterance information relating to a second utterance subsequent to the first utterance;
determining first response information related to a response to at least a portion of the utterance by inputting a first response determination instruction including the first utterance information into a large-scale language model;
determining second response information related to another response to at least a portion of the utterance by inputting a second response determination instruction including the first utterance information and/or the first response information and the second utterance information into a large-scale language model before outputting the first response information to the target person when a predetermined condition related to the utterance is satisfied;
outputting the first response information when the predetermined condition is not satisfied, and outputting at least one of the first response information and the second response information when the predetermined condition is satisfied;
A program to execute.