JP2025058129A

JP2025058129A - system

Info

Publication number: JP2025058129A
Application number: JP2023168026A
Authority: JP
Inventors: 敏彦前島; Toshihiko Maejima
Original assignee: SoftBank Group Corp
Current assignee: SoftBank Group Corp
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2025-04-09

Abstract

To enable a person who is in a state in which it is difficult to output voice to reproduce past voice by combining gaze tracking and voice reproduction. SOLUTION: A system comprises: means for tracking a line of sight of a person in a state in which voice output is difficult; and means for reproducing past voice based on information on the tracked line of sight. SELECTED DRAWING: Figure 1

Description

The technology disclosed herein relates to a system.

Patent Document 1 discloses a persona chatbot control method performed by at least one processor. The method includes the steps of receiving a user utterance, adding the user utterance to a prompt that includes a description of the chatbot's character and an associated instruction sentence, encoding the prompt, and inputting the encoded prompt into a language model to generate a chatbot utterance in response to the user utterance.

JP 2022-180282 A

The present invention aims to enable a person who has difficulty outputting voice to reproduce past voice by combining gaze tracking and voice reproduction.

A system including a means for tracking the gaze of a person who is in a state where it is difficult to output voice, and a means for reproducing past voice based on the tracked gaze information.

FIG. 1 is a conceptual diagram showing an example of the configuration of a data processing system according to a first embodiment.
FIG. 2 is a conceptual diagram showing an example of the main functions of a data processing device and a smart device according to the first embodiment.
FIG. 3 is a conceptual diagram showing an example of the configuration of a data processing system according to a second embodiment.
FIG. 4 is a conceptual diagram showing an example of the main functions of a data processing device and smart glasses according to the second embodiment.
FIG. 5 is a conceptual diagram showing an example of the configuration of a data processing system according to a third embodiment.
FIG. 6 is a conceptual diagram showing an example of the main functions of a data processing device and a headset-type terminal according to the third embodiment.
FIG. 7 is a conceptual diagram showing an example of the configuration of a data processing system according to a fourth embodiment.
FIG. 8 is a conceptual diagram showing an example of the main functions of a data processing device and a robot according to the fourth embodiment.
FIG. 9 shows an emotion map onto which multiple emotions are mapped.
FIG. 10 shows an emotion map onto which multiple emotions are mapped.

An example of an embodiment of a system according to the technology of the present disclosure is described below with reference to the attached drawings.

First, the terms used in the following description are explained.

In the following embodiments, a processor denoted by a reference numeral (hereinafter simply referred to as a "processor") may be a single arithmetic device or a combination of multiple arithmetic devices. The processor may also be a single type of arithmetic device or a combination of multiple types of arithmetic devices. Examples of arithmetic devices include a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a GPGPU (General-Purpose computing on Graphics Processing Units), an APU (Accelerated Processing Unit), and a TPU (Tensor Processing Unit).

In the following embodiments, a RAM (Random Access Memory) denoted by a reference numeral is a memory in which information is temporarily stored and which is used by the processor as a working memory.

In the following embodiments, a storage denoted by a reference numeral is one or more non-volatile storage devices that store various programs, various parameters, and the like. Examples of non-volatile storage devices include flash memory (SSD (Solid State Drive)), magnetic disks (e.g., hard disks), and magnetic tapes.

In the following embodiments, a communication I/F (Interface) denoted by a reference numeral is an interface that includes a communication processor, an antenna, and the like. The communication I/F controls communication between multiple computers. Examples of communication standards applied to the communication I/F include wireless communication standards such as 5G (5th Generation Mobile Communication System), Wi-Fi (registered trademark), and Bluetooth (registered trademark).

In the following embodiments, "A and/or B" is synonymous with "at least one of A and B." In other words, "A and/or B" may mean only A, only B, or a combination of A and B. In this specification, the same interpretation applies when three or more items are connected by "and/or."

[First embodiment]
FIG. 1 shows an example of the configuration of a data processing system 10 according to the first embodiment.

As shown in FIG. 1, the data processing system 10 includes a data processing device 12 and a smart device 14. An example of the data processing device 12 is a server.

The data processing device 12 includes a computer 22, a database 24, and a communication I/F 26. The computer 22 is an example of a "computer" according to the technology of the present disclosure. The computer 22 includes a processor 28, a RAM 30, and a storage 32. The processor 28, the RAM 30, and the storage 32 are connected to a bus 34. The database 24 and the communication I/F 26 are also connected to the bus 34. The communication I/F 26 is connected to a network 54. Examples of the network 54 include a WAN (Wide Area Network) and/or a LAN (Local Area Network).

The smart device 14 includes a computer 36, a reception device 38, an output device 40, a camera 42, and a communication I/F 44. The computer 36 includes a processor 46, a RAM 48, and a storage 50. The processor 46, the RAM 48, and the storage 50 are connected to a bus 52. The reception device 38, the output device 40, and the camera 42 are also connected to the bus 52.

The reception device 38 includes a touch panel 38A, a microphone 38B, and the like, and receives user input. The touch panel 38A receives user input made by contact of an indicator (e.g., a pen or a finger) by detecting that contact. The microphone 38B receives user input made by voice by detecting the user's voice. The control unit 46A transmits data indicating the user input received by the touch panel 38A and the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 (see FIG. 2) acquires the data indicating the user input.

The output device 40 includes a display 40A, a speaker 40B, and the like, and presents data to the user by outputting it in a form that the user can perceive (e.g., voice and/or text). The display 40A displays visible information such as text and images according to instructions from the processor 46. The speaker 40B outputs voice according to instructions from the processor 46. The camera 42 is a small digital camera equipped with an optical system including a lens, an aperture, and a shutter, and an imaging element such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor.

The communication I/F 44 is connected to the network 54. The communication I/Fs 44 and 26 handle the exchange of various types of information between the processor 46 and the processor 28 via the network 54.

FIG. 2 shows an example of the main functions of the data processing device 12 and the smart device 14.

As shown in FIG. 2, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32. The specific processing program 56 is an example of a "program" according to the technology of the present disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes it on the RAM 30. The specific processing is realized by the processor 28 operating as a specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

The storage 32 stores a data generation model 58 and an emotion identification model 59. The data generation model 58 and the emotion identification model 59 are used by the specific processing unit 290.

In the smart device 14, reception output processing is performed by the processor 46. A reception output program 60 is stored in the storage 50. The reception output program 60 is used by the data processing system 10 in conjunction with the specific processing program 56. The processor 46 reads the reception output program 60 from the storage 50 and executes it on the RAM 48. The reception output processing is realized by the processor 46 operating as a control unit 46A in accordance with the reception output program 60 executed on the RAM 48.

Note that a device other than the data processing device 12 may have the data generation model 58. For example, a server device (e.g., a ChatGPT server) may have the data generation model 58. In this case, the data processing device 12 communicates with the server device that has the data generation model 58 to obtain a processing result (such as a prediction result) produced using the data generation model 58. The data processing device 12 may be a server device, or may be a terminal device owned by a user (e.g., a mobile phone, a robot, or a home appliance). Next, the specific processing performed by the specific processing unit 290 of the data processing device 12 is described.

(Example 1)
One mode for carrying out the present invention is a system including a sensor for tracking the gaze of a person who is in a state where it is difficult to output voice, and a neural network for reproducing past voice based on the tracked gaze information. Specifically, the sensor detects eye movements and inputs that information into the neural network, which combines it with past voice data to generate voice.
(Example 2)
One mode for carrying out the present invention is a system including a camera for tracking the gaze of a person who is in a state where it is difficult to output voice, and a voice synthesis module for reproducing past voice based on the tracked gaze information. Specifically, the camera captures eye movements and transmits that information to the voice synthesis module, which combines it with past voice data to synthesize voice.
(Example 3)
One mode for carrying out the present invention is a system including an infrared sensor for tracking the gaze of a person who is in a state where it is difficult to output voice, and a voice processing unit for reproducing past voice based on the tracked gaze information. Specifically, the infrared sensor detects eye movements and transmits that information to the voice processing unit, which processes voice in combination with past voice data.

The processing flow for each example is described below.

(Example 1)
Step 1: To detect the eye movements of a person who is in a state where it is difficult to output voice, a sensor measures the position and movement of the eyes.
Step 2: The eye movement information detected by the sensor is input into the neural network and combined with past voice data to generate voice.
Step 3: The voice generated by the neural network is output, so that the person who has difficulty outputting voice can reproduce past voice.
(Example 2)
Step 1: To capture the eye movements of a person who is in a state where it is difficult to output voice, a camera acquires the position and movement of the eyes as video.
Step 2: The eye movement video acquired by the camera is sent to the voice synthesis module, where it is combined with past voice data to synthesize voice.
Step 3: The voice synthesized by the voice synthesis module is output, so that the person who has difficulty outputting voice can reproduce past voice.
(Example 3)
Step 1: To detect the eye movements of a person who is in a state where it is difficult to output voice, an infrared sensor measures the position and movement of the eyes using infrared light.
Step 2: The eye movement information detected by the infrared sensor is sent to the voice processing unit, where it is combined with past voice data for voice processing.
Step 3: The voice processed by the voice processing unit is output, so that the person who has difficulty outputting voice can reproduce past voice.
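
The three-step flow above can be illustrated with a minimal sketch. The source does not specify how gaze information is mapped to past voice data (it mentions a neural network, a voice synthesis module, or a voice processing unit), so this sketch substitutes a simple dwell-time selection as an assumption: the gaze must rest on a screen region for a fixed time to select a previously recorded clip. The names `GazeSample`, `REGIONS`, `PAST_VOICE_CLIPS`, the file paths, and the 1-second dwell threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class GazeSample:
    t: float  # timestamp in seconds
    x: float  # normalized horizontal gaze position, 0.0 to 1.0
    y: float  # normalized vertical gaze position, 0.0 to 1.0


# Hypothetical on-screen regions: name -> (x_min, y_min, x_max, y_max).
REGIONS: Dict[str, Tuple[float, float, float, float]] = {
    "greeting": (0.0, 0.0, 0.5, 0.5),
    "thanks": (0.5, 0.0, 1.0, 0.5),
    "help": (0.0, 0.5, 1.0, 1.0),
}

# Hypothetical library of the person's previously recorded utterances.
PAST_VOICE_CLIPS: Dict[str, str] = {
    "greeting": "clips/greeting.wav",
    "thanks": "clips/thanks.wav",
    "help": "clips/help.wav",
}


def region_at(x: float, y: float) -> Optional[str]:
    """Step 1 helper: map a gaze position to the first region containing it."""
    for name, (x0, y0, x1, y1) in REGIONS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None


def select_by_dwell(samples: List[GazeSample], dwell_s: float = 1.0) -> Optional[str]:
    """Step 2 (assumed): return the first region gazed at continuously for dwell_s."""
    current: Optional[str] = None
    start_t = 0.0
    for s in samples:
        region = region_at(s.x, s.y)
        if region != current:
            # Gaze moved to a different region (or off-screen): restart the timer.
            current, start_t = region, s.t
        if current is not None and s.t - start_t >= dwell_s:
            return current
    return None


def reproduce_past_voice(samples: List[GazeSample]) -> Optional[str]:
    """Step 3: return the path of the past voice clip to play, if any."""
    region = select_by_dwell(samples)
    return PAST_VOICE_CLIPS.get(region) if region is not None else None


# Example: 0.3 s glance at "greeting", then 1.2 s fixation on "thanks".
samples = [GazeSample(i * 0.1, 0.2, 0.2) for i in range(3)]
samples += [GazeSample(0.3 + i * 0.1, 0.7, 0.2) for i in range(13)]
clip = reproduce_past_voice(samples)
```

Dwell-based selection is used here only because it is easy to follow; a learned model, as in Example 1, would replace `select_by_dwell` while leaving the surrounding flow unchanged.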

Furthermore, an emotion engine that estimates the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing that uses the user's emotion.

(Example 1)
One mode for carrying out the present invention is a system including a sensor for tracking the gaze of a person who is in a state where it is difficult to output voice, a neural network for reproducing past voice based on the tracked gaze information, and an emotion engine for recognizing the user's emotions. Specifically, the sensor detects eye movements and inputs that information into the neural network, which combines it with past voice data to generate voice. In addition, the emotion engine analyzes the gaze information and the voice data, estimates the emotion, and generates an appropriate emotional expression when the voice is played back.
(Example 2)
One mode for carrying out the present invention is a system including a camera for tracking the gaze of a person who is in a state where it is difficult to output voice, a voice synthesis module for reproducing past voice based on the tracked gaze information, and an emotion engine for recognizing the user's emotions. Specifically, the camera captures eye movements and transmits that information to the voice synthesis module, which combines it with past voice data to synthesize voice. In addition, the emotion engine adjusts the voice in real time based on the gaze information and the emotion estimation result so that it reflects an appropriate emotional expression.
(Example 3)
One mode for carrying out the present invention is a system including an infrared sensor for tracking the gaze of a person who is in a state where it is difficult to output voice, a voice processing unit for reproducing past voice based on the tracked gaze information, and an emotion engine for recognizing the user's emotions. Specifically, the infrared sensor measures the position and movement of the eyes using infrared light and transmits that information to the voice processing unit, which processes voice in combination with past voice data. In addition, the emotion engine adjusts voice synthesis parameters based on the gaze information and the emotion estimation result, and performs real-time voice processing in response to changes in emotion.

(Example 1)
Step 1: To detect the eye movements of a person who is in a state where it is difficult to output voice, a sensor measures the position and movement of the eyes.
Step 2: The eye movement information detected by the sensor is input into the neural network and combined with past voice data to generate voice.
Step 3: The emotion engine analyzes the gaze information and the voice data, estimates the emotion, and generates an appropriate emotional expression. By combining the generated voice with the emotional expression, the person who has difficulty outputting voice can reproduce past voice that conveys emotion.
(Example 2)
Step 1: To capture the eye movements of a person who is in a state where it is difficult to output voice, a camera acquires the position and movement of the eyes as video.
Step 2: The eye movement video acquired by the camera is sent to the voice synthesis module, where it is combined with past voice data to synthesize voice.
Step 3: The emotion engine analyzes the gaze information and the voice data, estimates the emotion, and generates an appropriate emotional expression. By combining the generated voice with the emotional expression, the person who has difficulty outputting voice can reproduce past voice that conveys emotion.
(Example 3)
Step 1: To detect the eye movements of a person who is in a state where it is difficult to output voice, an infrared sensor measures the position and movement of the eyes using infrared light.
Step 2: The eye movement information detected by the infrared sensor is sent to the voice processing unit, where it is combined with past voice data for voice processing.
Step 3: The emotion engine analyzes the gaze information and the voice data, estimates the emotion, and generates an appropriate emotional expression. By combining the generated voice with the emotional expression, the person who has difficulty outputting voice can reproduce past voice that conveys emotion.
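
As a rough illustration of Step 3, the sketch below shows one way an estimated emotion could adjust voice synthesis parameters before playback. The source does not define the parameters or the emotion labels, so `SynthesisParams`, the `EMOTION_TARGETS` table, its numeric values, and the linear blending by intensity are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SynthesisParams:
    pitch_scale: float = 1.0   # 1.0 = unchanged pitch
    rate_scale: float = 1.0    # 1.0 = unchanged speaking rate
    volume_scale: float = 1.0  # 1.0 = unchanged volume


# Hypothetical per-emotion targets (illustrative values, not from the source).
EMOTION_TARGETS: Dict[str, SynthesisParams] = {
    "joy": SynthesisParams(pitch_scale=1.10, rate_scale=1.05, volume_scale=1.00),
    "sadness": SynthesisParams(pitch_scale=0.92, rate_scale=0.90, volume_scale=0.90),
    "anger": SynthesisParams(pitch_scale=1.00, rate_scale=1.10, volume_scale=1.20),
}


def adjust_for_emotion(base: SynthesisParams, emotion: str, intensity: float) -> SynthesisParams:
    """Blend the base parameters toward the emotion's target.

    intensity is clamped to [0, 1]; 0 leaves the base unchanged and 1 applies
    the full target scaling. Unknown emotion labels leave the base unchanged.
    """
    target = EMOTION_TARGETS.get(emotion)
    if target is None:
        return base
    k = max(0.0, min(1.0, intensity))

    def blend(value: float, scale: float) -> float:
        return value * (1.0 + (scale - 1.0) * k)

    return SynthesisParams(
        pitch_scale=blend(base.pitch_scale, target.pitch_scale),
        rate_scale=blend(base.rate_scale, target.rate_scale),
        volume_scale=blend(base.volume_scale, target.volume_scale),
    )


# Example: the emotion engine (not modeled here) reports "joy" at half intensity.
params = adjust_for_emotion(SynthesisParams(), "joy", 0.5)
```

Calling `adjust_for_emotion` again whenever the estimate changes would approximate the real-time adjustment described in Examples 2 and 3.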

The specific processing unit 290 transmits the result of the specific processing to the smart device 14. In the smart device 14, the control unit 46A causes the output device 40 to output the result of the specific processing. The microphone 38B acquires voice indicating user input in response to the result of the specific processing. The control unit 46A transmits voice data indicating the user input acquired by the microphone 38B to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the voice data.
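
The exchange between the control unit 46A and the specific processing unit 290 can be pictured with the following sketch. It is only a schematic under stated assumptions: the network 54 is replaced by a direct method call, user input is represented as a plain string, and the class and method names (`SpecificProcessingUnit.process`, `ControlUnit.handle_user_input`) are hypothetical.

```python
from typing import Callable, List


class SpecificProcessingUnit:
    """Stands in for the specific processing unit 290 on the data processing device."""

    def __init__(self) -> None:
        self.received: List[str] = []

    def process(self, user_input: str) -> str:
        # Record the acquired input, then return a placeholder result.
        self.received.append(user_input)
        return f"result for: {user_input}"


class ControlUnit:
    """Stands in for the control unit 46A on the smart device."""

    def __init__(self, server: SpecificProcessingUnit, output: Callable[[str], None]) -> None:
        self.server = server
        self.output = output  # e.g., a function that drives a display or speaker

    def handle_user_input(self, user_input: str) -> str:
        # Send the input to the data processing device (a direct call here,
        # in place of transmission over the network).
        result = self.server.process(user_input)
        # Have the output device present the result to the user.
        self.output(result)
        return result


# Example: collect outputs in a list instead of driving real hardware.
outputs: List[str] = []
server = SpecificProcessingUnit()
control = ControlUnit(server, outputs.append)
control.handle_user_input("gaze:thanks")
```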

The data generation model 58 is a so-called generative AI (Artificial Intelligence). An example of the data generation model 58 is a generative AI such as ChatGPT (Internet search <URL: https://openai.com/blog/chatgpt>). The data generation model 58 is obtained by training a neural network with deep learning. The data generation model 58 receives a prompt containing an instruction, together with inference data such as voice data representing voice, text data representing text, and image data representing an image. The data generation model 58 performs inference on the input inference data according to the instruction given in the prompt, and outputs the inference result in a data format such as voice data or text data. Here, inference refers to, for example, analysis, classification, prediction, and/or summarization. The specific processing unit 290 performs the specific processing described above using the data generation model 58.
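
The input and output shape described above (a prompt with an instruction, plus inference data, producing text and/or voice data) might be modeled as in the sketch below. This is not the interface of ChatGPT or of any real service; `InferenceInput`, `InferenceResult`, the `DataGenerationModel` protocol, and `StubModel` are hypothetical names, and the stub only reports what it received rather than performing any inference.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class InferenceInput:
    """Inference data passed alongside the prompt; any field may be absent."""
    text: Optional[str] = None
    audio: Optional[bytes] = None
    image: Optional[bytes] = None


@dataclass
class InferenceResult:
    """Inference result in text form, optionally with voice data."""
    text: str
    audio: Optional[bytes] = None


class DataGenerationModel(Protocol):
    def infer(self, prompt: str, data: InferenceInput) -> InferenceResult:
        ...


class StubModel:
    """Stand-in for a generative model; it only echoes the request structure."""

    def infer(self, prompt: str, data: InferenceInput) -> InferenceResult:
        kinds = [k for k in ("text", "audio", "image") if getattr(data, k) is not None]
        return InferenceResult(text=f"instruction={prompt!r}; inputs={','.join(kinds)}")


def run_specific_processing(model: DataGenerationModel, gaze_summary: str) -> InferenceResult:
    # Hypothetical instruction; the source does not give the prompt wording.
    prompt = "Predict the phrase the user wants to say from the gaze information."
    return model.infer(prompt, InferenceInput(text=gaze_summary))


result = run_specific_processing(StubModel(), "dwell on 'thanks' for 1.2 s")
```

Keeping the model behind a small protocol reflects the point made elsewhere in the description that the model may live on a separate server device; only the `infer` implementation would change.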

In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of the present disclosure is not limited to this, and the specific processing may instead be performed by the smart device 14.

[Second embodiment]

FIG. 3 shows an example of the configuration of a data processing system 210 according to the second embodiment.

As shown in FIG. 3, the data processing system 210 includes a data processing device 12 and smart glasses 214. An example of the data processing device 12 is a server.

The smart glasses 214 include a computer 36, a microphone 238, a speaker 240, a camera 42, and a communication I/F 44. The computer 36 includes a processor 46, a RAM 48, and a storage 50. The processor 46, the RAM 48, and the storage 50 are connected to a bus 52. The microphone 238, the speaker 240, and the camera 42 are also connected to the bus 52.

The microphone 238 receives instructions and the like from the user by receiving voice uttered by the user. The microphone 238 captures the voice uttered by the user, converts the captured voice into voice data, and outputs the voice data to the processor 46. The speaker 240 outputs voice according to instructions from the processor 46.

The camera 42 is a small digital camera equipped with an optical system including a lens, an aperture, and a shutter, and an imaging element such as a CMOS (Complementary Metal-Oxide-Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. It captures images of the user's surroundings (e.g., an imaging range defined by an angle of view equivalent to the field of view of a typical able-bodied person).

The communication I/F 44 is connected to the network 54. The communication I/Fs 44 and 26 handle the exchange of various types of information between the processor 46 and the processor 28 via the network 54. This exchange of information between the processor 46 and the processor 28 using the communication I/Fs 44 and 26 is performed in a secure state.

FIG. 4 shows an example of the main functions of the data processing device 12 and the smart glasses 214. As shown in FIG. 4, in the data processing device 12, specific processing is performed by the processor 28. A specific processing program 56 is stored in the storage 32.

The specific processing program 56 is an example of a "program" according to the technology of the present disclosure. The processor 28 reads the specific processing program 56 from the storage 32 and executes it on the RAM 30. The specific processing is realized by the processor 28 operating as the specific processing unit 290 in accordance with the specific processing program 56 executed on the RAM 30.

In the smart glasses 214, reception output processing is performed by the processor 46. A reception output program 60 is stored in the storage 50. The processor 46 reads the reception output program 60 from the storage 50 and executes it on the RAM 48. The reception output processing is realized by the processor 46 operating as the control unit 46A in accordance with the reception output program 60 executed on the RAM 48.

Furthermore, an emotion engine that estimates the user's emotion may be combined. That is, the specific processing unit 290 may estimate the user's emotion using the emotion identification model 59 and perform specific processing that uses the user's emotion.

The specific processing unit 290 transmits the result of the specific processing to the smart glasses 214. In the smart glasses 214, the control unit 46A causes the speaker 240 to output the result of the specific processing. The microphone 238 acquires voice indicating user input in response to the result of the specific processing. The control unit 46A transmits voice data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the voice data.

In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology of the present disclosure is not limited to this, and the specific processing may instead be performed by the smart glasses 214.

［第３実施形態］ [Third embodiment]

図５には、第３実施形態に係るデータ処理システム３１０の構成の一例が示されている。 Figure 5 shows an example of the configuration of a data processing system 310 according to the third embodiment.

図５に示すように、データ処理システム３１０は、データ処理装置１２及びヘッドセット型端末３１４を備えている。データ処理装置１２の一例としては、サーバが挙げられる。 As shown in FIG. 5, the data processing system 310 includes a data processing device 12 and a headset terminal 314. An example of the data processing device 12 is a server.

ヘッドセット型端末３１４は、コンピュータ３６、マイクロフォン２３８、スピーカ２４０、カメラ４２、通信Ｉ／Ｆ４４、及びディスプレイ３４３を備えている。コンピュータ３６は、プロセッサ４６、ＲＡＭ４８、及びストレージ５０を備えている。プロセッサ４６、ＲＡＭ４８、及びストレージ５０は、バス５２に接続されている。また、マイクロフォン２３８、スピーカ２４０、カメラ４２、及びディスプレイ３４３も、バス５２に接続されている。 The headset type terminal 314 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication I/F 44, and a display 343. The computer 36 includes a processor 46, a RAM 48, and a storage 50. The processor 46, the RAM 48, and the storage 50 are connected to a bus 52. The microphone 238, the speaker 240, the camera 42, and the display 343 are also connected to the bus 52.

図６には、データ処理装置１２及びヘッドセット型端末３１４の要部機能の一例が示されている。図６に示すように、データ処理装置１２では、プロセッサ２８によって特定処理が行われる。ストレージ３２には、特定処理プログラム５６が格納されている。 Figure 6 shows an example of the main functions of the data processing device 12 and the headset type terminal 314. As shown in Figure 6, in the data processing device 12, the specific processing is performed by the processor 28. The specific processing program 56 is stored in the storage 32.

ヘッドセット型端末３１４では、プロセッサ４６によって受付出力処理が行われる。ストレージ５０には、受付出力プログラム６０が格納されている。プロセッサ４６は、ストレージ５０から受付出力プログラム６０を読み出し、読み出した受付出力プログラム６０をＲＡＭ４８上で実行する。受付出力処理は、プロセッサ４６がＲＡＭ４８上で実行する受付出力プログラム６０に従って、制御部４６Ａとして動作することによって実現される。 In the headset terminal 314, the reception output process is performed by the processor 46. A reception output program 60 is stored in the storage 50. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output process is realized by the processor 46 operating as the control unit 46A in accordance with the reception output program 60 executed on the RAM 48.

特定処理部２９０は、特定処理の結果をヘッドセット型端末３１４に送信する。ヘッドセット型端末３１４では、制御部４６Ａが、スピーカ２４０及びディスプレイ３４３に対して特定処理の結果を出力させる。マイクロフォン２３８は、特定処理の結果に対するユーザ入力を示す音声を取得する。制御部４６Ａは、マイクロフォン２３８によって取得されたユーザ入力を示す音声データをデータ処理装置１２に送信する。データ処理装置１２では、特定処理部２９０が音声データを取得する。 The specific processing unit 290 transmits the result of the specific processing to the headset type terminal 314. In the headset type terminal 314, the control unit 46A causes the speaker 240 and the display 343 to output the result of the specific processing. The microphone 238 acquires audio indicating a user input for the result of the specific processing. The control unit 46A transmits audio data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the audio data.

上記実施形態では、データ処理装置１２によって特定処理が行われる形態例を挙げたが、本開示の技術はこれに限定されず、ヘッドセット型端末３１４によって特定処理が行われるようにしてもよい。 In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology disclosed herein is not limited to this, and the specific processing may also be performed by the headset type terminal 314.

［第４実施形態］ [Fourth embodiment]

図７には、第４実施形態に係るデータ処理システム４１０の構成の一例が示されている。 Figure 7 shows an example of the configuration of a data processing system 410 according to the fourth embodiment.

図７に示すように、データ処理システム４１０は、データ処理装置１２及びロボット４１４を備えている。データ処理装置１２の一例としては、サーバが挙げられる。 As shown in FIG. 7, the data processing system 410 includes a data processing device 12 and a robot 414. An example of the data processing device 12 is a server.

ロボット４１４は、コンピュータ３６、マイクロフォン２３８、スピーカ２４０、カメラ４２、通信Ｉ／Ｆ４４、及び制御対象４４３を備えている。コンピュータ３６は、プロセッサ４６、ＲＡＭ４８、及びストレージ５０を備えている。プロセッサ４６、ＲＡＭ４８、及びストレージ５０は、バス５２に接続されている。また、マイクロフォン２３８、スピーカ２４０、カメラ４２、及び制御対象４４３も、バス５２に接続されている。 The robot 414 includes a computer 36, a microphone 238, a speaker 240, a camera 42, a communication I/F 44, and a control target 443. The computer 36 includes a processor 46, a RAM 48, and a storage 50. The processor 46, the RAM 48, and the storage 50 are connected to a bus 52. The microphone 238, the speaker 240, the camera 42, and the control target 443 are also connected to the bus 52.

制御対象４４３は、表示装置、目部のＬＥＤ、並びに、腕、手及び足等を駆動するモータ等を含む。ロボット４１４の姿勢や仕草は、腕、手及び足等のモータを制御することにより制御される。ロボット４１４の感情の一部は、これらのモータを制御することにより表現できる。また、ロボット４１４の目部のＬＥＤの発光状態を制御することによっても、ロボット４１４の表情を表現できる。 The controlled object 443 includes a display device, LEDs in the eyes, and motors that drive the arms, hands, and legs. The posture and gestures of the robot 414 are controlled by controlling the motors of the arms, hands, and legs. Some of the emotions of the robot 414 can be expressed by controlling these motors. In addition, the facial expressions of the robot 414 can also be expressed by controlling the light emission state of the LEDs in the eyes of the robot 414.
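
The way an emotion could be turned into commands for the control target 443 can be illustrated with a minimal Python sketch. The description does not specify any mapping, so everything here is an assumption made for illustration: the `ActuatorCommand` type, the `express_emotion` function, the `pleasure` and `intensity` inputs, the red/green colour coding of the eye LEDs, and the 45-degree arm range are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ActuatorCommand:
    """Hypothetical command for the eye LEDs and one arm motor."""
    led_rgb: Tuple[int, int, int]  # eye LED colour, each channel 0..255
    arm_angle_deg: float           # arm motor target angle, -45..+45


def express_emotion(pleasure: float, intensity: float) -> ActuatorCommand:
    """Map an emotion to actuator targets (illustrative mapping only).

    pleasure:  -1.0 (unpleasant) .. +1.0 (pleasant)
    intensity:  0.0 (not expressed) .. 1.0 (fully expressed)
    """
    # Clamp inputs so out-of-range values cannot produce invalid commands.
    pleasure = max(-1.0, min(1.0, pleasure))
    intensity = max(0.0, min(1.0, intensity))

    # Overall LED brightness follows intensity; the colour shifts from
    # red (unpleasant) to green (pleasant).
    level = round(255 * intensity)
    green = round(level * (pleasure + 1.0) / 2.0)
    red = level - green

    # Arms are raised for pleasant emotions and lowered for unpleasant ones.
    arm_angle = 45.0 * pleasure * intensity
    return ActuatorCommand((red, green, 0), arm_angle)
```

With this mapping, a fully expressed pleasant emotion lights the eyes green and raises the arm, while an unexpressed emotion leaves the LEDs off and the arm at its neutral angle.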

図８には、データ処理装置１２及びロボット４１４の要部機能の一例が示されている。図８に示すように、データ処理装置１２では、プロセッサ２８によって特定処理が行われる。ストレージ３２には、特定処理プログラム５６が格納されている。 Figure 8 shows an example of the main functions of the data processing device 12 and the robot 414. As shown in Figure 8, in the data processing device 12, the specific processing is performed by the processor 28. The specific processing program 56 is stored in the storage 32.

ロボット４１４では、プロセッサ４６によって受付出力処理が行われる。ストレージ５０には、受付出力プログラム６０が格納されている。プロセッサ４６は、ストレージ５０から受付出力プログラム６０を読み出し、読み出した受付出力プログラム６０をＲＡＭ４８上で実行する。受付出力処理は、プロセッサ４６がＲＡＭ４８上で実行する受付出力プログラム６０に従って、制御部４６Ａとして動作することによって実現される。 In the robot 414, the reception output process is performed by the processor 46. A reception output program 60 is stored in the storage 50. The processor 46 reads the reception output program 60 from the storage 50 and executes the read reception output program 60 on the RAM 48. The reception output process is realized by the processor 46 operating as the control unit 46A in accordance with the reception output program 60 executed on the RAM 48.

特定処理部２９０は、特定処理の結果をロボット４１４に送信する。ロボット４１４では、制御部４６Ａが、スピーカ２４０及び制御対象４４３に対して特定処理の結果を出力させる。マイクロフォン２３８は、特定処理の結果に対するユーザ入力を示す音声を取得する。制御部４６Ａは、マイクロフォン２３８によって取得されたユーザ入力を示す音声データをデータ処理装置１２に送信する。データ処理装置１２では、特定処理部２９０が音声データを取得する。 The specific processing unit 290 transmits the result of the specific processing to the robot 414. In the robot 414, the control unit 46A causes the speaker 240 and the control target 443 to output the result of the specific processing. The microphone 238 acquires voice indicating the user input for the result of the specific processing. The control unit 46A transmits voice data indicating the user input acquired by the microphone 238 to the data processing device 12. In the data processing device 12, the specific processing unit 290 acquires the voice data.
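
The round trip described above (output the result, capture the user's reply, return it to the data processing device 12) can be sketched in Python as follows. This is only an outline under stated assumptions: the `ControlUnit` class stands in for the control unit 46A, and the output devices, microphone, and network link are replaced by plain callables whose names (`outputs`, `record`, `send`) are hypothetical.

```python
from typing import Callable, List


class ControlUnit:
    """Stand-in for the control unit 46A (illustrative only)."""

    def __init__(
        self,
        outputs: List[Callable[[str], None]],  # e.g. speaker 240, control target 443
        record: Callable[[], bytes],           # e.g. microphone 238
        send: Callable[[bytes], None],         # link to the data processing device 12
    ) -> None:
        self.outputs = outputs
        self.record = record
        self.send = send

    def handle_result(self, result: str) -> None:
        """Present one result, then forward the user's spoken reply."""
        # 1. Every output device presents the result of the specific processing.
        for output in self.outputs:
            output(result)
        # 2. The microphone captures the user's input for that result.
        audio = self.record()
        # 3. The audio data is sent back to the data processing device.
        self.send(audio)
```

Passing the devices in as callables keeps the same loop usable for the smart glasses, the headset type terminal, and the robot, which differ only in the set of output devices.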

上記実施形態では、データ処理装置１２によって特定処理が行われる形態例を挙げたが、本開示の技術はこれに限定されず、ロボット４１４によって特定処理が行われるようにしてもよい。 In the above embodiment, an example was given in which the specific processing is performed by the data processing device 12, but the technology disclosed herein is not limited to this, and the specific processing may also be performed by the robot 414.

なお、感情エンジンとしての感情特定モデル５９は、特定のマッピングに従い、ユーザの感情を決定してよい。具体的には、感情特定モデル５９は、特定のマッピングである感情マップ（図９参照）に従い、ユーザの感情を決定してよい。また、感情特定モデル５９は、同様に、ロボットの感情を決定し、特定処理部２９０は、ロボットの感情を用いた特定処理を行うようにしてもよい。 The emotion identification model 59, which serves as an emotion engine, may determine the emotion of the user according to a specific mapping. Specifically, the emotion identification model 59 may determine the emotion of the user according to an emotion map (see FIG. 9), which is a specific mapping. Similarly, the emotion identification model 59 may determine the emotion of the robot, and the specific processing unit 290 may perform the specific processing using the emotion of the robot.

図９は、複数の感情がマッピングされる感情マップ４００を示す図である。感情マップ４００において、感情は、中心から放射状に同心円に配置されている。同心円の中心に近いほど、原始的状態の感情が配置されている。同心円のより外側には、心境から生まれる状態や行動を表す感情が配置されている。感情とは、情動や心的状態も含む概念である。同心円の左側には、概して脳内で起きる反応から生成される感情が配置されている。同心円の右側には概して、状況判断で誘導される感情が配置されている。同心円の上方向及び下方向には、概して脳内で起きる反応から生成され、かつ、状況判断で誘導される感情が配置されている。また、同心円の上側には、「快」の感情が配置され、下側には、「不快」の感情が配置されている。このように、感情マップ４００では、感情が生まれる構造に基づいて複数の感情がマッピングされており、同時に生じやすい感情が、近くにマッピングされている。 FIG. 9 is a diagram showing an emotion map 400 on which multiple emotions are mapped. In the emotion map 400, emotions are arranged in concentric circles radiating from the center. The closer an emotion is to the center of the concentric circles, the more primitive the state it represents. Emotions that represent states and actions arising from a state of mind are arranged toward the outside of the concentric circles. The term "emotion" here is a concept that also covers affect and mental states. On the left side of the concentric circles, emotions that are generally generated from reactions occurring in the brain are arranged. On the right side of the concentric circles, emotions that are generally induced by situational judgment are arranged. In the upward and downward directions of the concentric circles, emotions that are generally generated from reactions occurring in the brain and are also induced by situational judgment are arranged. In addition, "pleasant" emotions are arranged on the upper side of the concentric circles, and "unpleasant" emotions are arranged on the lower side. In this way, in the emotion map 400, multiple emotions are mapped based on the structure in which emotions are generated, and emotions that tend to occur simultaneously are mapped close to each other.
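
The layout of the emotion map 400 can be modelled as points in polar coordinates, as in the following Python sketch. The representation is an assumption made for illustration and is not taken from the figures: `MapPoint`, `describe`, and `distance` are hypothetical names, and angles are measured clockwise from the top so that 90 degrees corresponds to the three o'clock direction.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MapPoint:
    """Hypothetical position of one emotion on the concentric map."""
    radius: float     # 0 = centre (primitive state); larger = closer to behaviour
    angle_deg: float  # clockwise from the top: 0 = 12 o'clock, 90 = 3 o'clock

    def xy(self) -> Tuple[float, float]:
        """Convert to Cartesian coordinates (x to the right, y upward)."""
        a = math.radians(self.angle_deg)
        return (self.radius * math.sin(a), self.radius * math.cos(a))


def describe(point: MapPoint, eps: float = 1e-9) -> Tuple[str, str]:
    """Return (vertical, horizontal) labels following the map's layout."""
    x, y = point.xy()
    # Upper side: pleasant; lower side: unpleasant.
    vertical = "pleasant" if y > eps else "unpleasant" if y < -eps else "neutral"
    # Right: situational judgment; left: reactions in the brain;
    # on the vertical axis both origins apply.
    horizontal = "situation" if x > eps else "reaction" if x < -eps else "both"
    return vertical, horizontal


def distance(a: MapPoint, b: MapPoint) -> float:
    """Euclidean distance; emotions that tend to co-occur lie close together."""
    ax, ay = a.xy()
    bx, by = b.xy()
    return math.hypot(ax - bx, ay - by)
```

For example, a point at `MapPoint(2.0, 90.0)` sits on the right-hand side of the map, so `describe` labels it as induced by situational judgment and neither pleasant nor unpleasant.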

これらの感情は、感情マップ４００の３時の方向に分布しており、普段は安心と不安のあたりを行き来する。感情マップ４００の右半分では、内部的な感覚よりも状況認識の方が優位に立つため、落ち着いた印象になる。 These emotions are distributed in the three o'clock direction of emotion map 400, and usually fluctuate between relief and anxiety. In the right half of emotion map 400, situational awareness takes precedence over internal sensations, resulting in a sense of calm.

感情マップ４００の内側は心の中、感情マップ４００の外側は行動を表すため、感情マップ４００の外側に行くほど、感情が目に見える（行動に表れる）ようになる。 The inside of emotion map 400 represents what is going on inside the mind, and the outside of emotion map 400 represents behavior, so the further out you go on emotion map 400, the more visible (expressed in behavior) the emotions become.

ここで、人の感情は、姿勢や血糖値のような様々なバランスを基礎としており、それらのバランスが理想から遠ざかると不快、理想に近づくと快という状態を示す。ロボットや自動車やバイク等においても、姿勢やバッテリー残量のような様々なバランスを基礎として、それらのバランスが理想から遠ざかると不快、理想に近づくと快という状態を示すように感情を作ることができる。感情マップは、例えば、光吉博士の感情地図（音声感情認識及び情動の脳生理信号分析システムに関する研究、徳島大学、博士論文：https://ci.nii.ac.jp/naid/500000375379）に基づいて生成されてよい。感情地図の左半分には、感覚が優位にたつ「反応」と呼ばれる領域に属する感情が並ぶ。また、感情地図の右半分には、状況認識が優位にたつ「状況」と呼ばれる領域に属する感情が並ぶ。 Here, human emotions are based on various balances such as posture and blood sugar level, and when these balances are far from the ideal, it indicates an unpleasant state, and when they are close to the ideal, it indicates a pleasant state. Emotions can also be created for robots, cars, motorcycles, etc., based on various balances such as posture and remaining battery power, so that when these balances are far from the ideal, it indicates an unpleasant state, and when they are close to the ideal, it indicates a pleasant state. The emotion map may be generated, for example, based on the emotion map of Dr. Mitsuyoshi (Research on speech emotion recognition and emotion brain physiological signal analysis system, Tokushima University, doctoral dissertation: https://ci.nii.ac.jp/naid/500000375379). The left half of the emotion map is lined with emotions that belong to an area called "reaction" where sensation is dominant. The right half of the emotion map is lined with emotions that belong to an area called "situation" where situation recognition is dominant.
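
The idea that pleasure and discomfort follow from how far each balance is from its ideal can be written as a small scoring function. The following Python sketch is one possible formulation, not the method of the disclosure: the function name `pleasure_from_balances`, the linear scoring, the `tolerance` parameter, and the example quantities are all assumptions.

```python
from typing import Dict


def pleasure_from_balances(
    current: Dict[str, float],
    ideal: Dict[str, float],
    tolerance: Dict[str, float],
) -> float:
    """Return a score in [-1, 1] averaged over all balances.

    Each balance scores +1 when it equals its ideal value and falls linearly
    to -1 once the deviation reaches the given tolerance (an assumed scale).
    """
    scores = []
    for name, target in ideal.items():
        # Deviation from the ideal, normalised by the tolerance and capped at 1.
        deviation = abs(current[name] - target) / tolerance[name]
        scores.append(1.0 - 2.0 * min(deviation, 1.0))
    return sum(scores) / len(scores)


# Hypothetical balances for a robot: remaining battery (0..1) and body tilt.
IDEAL = {"battery": 1.0, "tilt_deg": 0.0}
TOLERANCE = {"battery": 1.0, "tilt_deg": 30.0}
```

A positive score corresponds to the "pleasant" upper side of the emotion map and a negative score to the "unpleasant" lower side.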

感情マップでは学習を促す感情が２つ定義される。１つは、状況側にあるネガティブな「懺悔」や「反省」の真ん中周辺の感情である。つまり、「もう２度とこんな想いはしたくない」「もう叱られたくない」というネガティブな感情がロボットに生じたときである。もう１つは、反応側にあるポジティブな「欲」のあたりの感情である。つまり、「もっと欲しい」「もっと知りたい」というポジティブな気持ちのときである。 The emotion map defines two emotions that encourage learning. The first is the negative emotion around the middle of "repentance" and "reflection" on the situation side. In other words, this is when the robot experiences negative emotions such as "I never want to feel this way again" or "I don't want to be scolded again." The other is the positive emotion around "desire" on the reaction side. In other words, this is when the robot has positive feelings such as "I want more" or "I want to know more."

感情特定モデル５９は、ユーザ入力を、予め学習されたニューラルネットワークに入力し、感情マップ４００に示す各感情を示す感情値を取得し、ユーザの感情を決定する。このニューラルネットワークは、ユーザ入力と、感情マップ４００に示す各感情を示す感情値との組み合わせである複数の学習データに基づいて予め学習されたものである。また、このニューラルネットワークは、図１０に示す感情マップ９００のように、近くに配置されている感情同士は、近い値を持つように学習される。図１０では、「安心」、「安穏」、「心強い」という複数の感情が、近い感情値となる例を示している。 The emotion identification model 59 inputs the user input to a pre-trained neural network, obtains an emotion value for each emotion shown in the emotion map 400, and determines the user's emotion. This neural network is trained in advance on multiple pieces of training data, each of which pairs a user input with the emotion values for the emotions shown in the emotion map 400. Furthermore, this neural network is trained so that emotions located close to each other have similar values, as in the emotion map 900 shown in Figure 10. Figure 10 shows an example in which multiple emotions, "peace of mind," "calm," and "reassuring," have similar emotion values.
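
The description does not say how the network is made to give neighbouring emotions similar values. One possible approach, shown here purely as an assumption, is to build the training targets so that the labelled emotion receives the value 1 and every other emotion receives a value that decays with its distance on the map. In the Python sketch below, the coordinates in `POSITIONS`, the Gaussian decay, the `sigma` parameter, and the function name `soft_targets` are all hypothetical and are not taken from the figures.

```python
import math
from typing import Dict, Tuple

# Hypothetical map coordinates; the first three emotions are placed close
# together, and "anxiety" is placed farther away for contrast.
POSITIONS: Dict[str, Tuple[float, float]] = {
    "peace_of_mind": (1.0, 0.2),
    "calm": (1.1, 0.3),
    "reassuring": (1.2, 0.1),
    "anxiety": (1.0, -0.8),
}


def soft_targets(
    labelled: str,
    positions: Dict[str, Tuple[float, float]],
    sigma: float = 0.3,
) -> Dict[str, float]:
    """Build target emotion values for one training example.

    The labelled emotion gets 1.0; other emotions get a Gaussian-decayed
    value based on their squared distance from it on the map.
    """
    lx, ly = positions[labelled]
    targets = {}
    for name, (x, y) in positions.items():
        squared_distance = (x - lx) ** 2 + (y - ly) ** 2
        targets[name] = math.exp(-squared_distance / (2.0 * sigma ** 2))
    return targets
```

A network fitted to targets of this kind would tend to output similar values for "peace of mind," "calm," and "reassuring," which matches the behaviour described for Figure 10; a smaller `sigma` makes the values fall off more sharply with distance.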

上記実施形態では、１台のコンピュータ２２によって特定処理が行われる形態例を挙げたが、本開示の技術はこれに限定されず、コンピュータ２２を含めた複数のコンピュータによる特定処理に対する分散処理が行われるようにしてもよい。 In the above embodiment, an example was given in which a specific process is performed by one computer 22, but the technology disclosed herein is not limited to this, and distributed processing of the specific process may be performed by multiple computers, including computer 22.

上記実施形態では、ストレージ３２に特定処理プログラム５６が格納されている形態例を挙げて説明したが、本開示の技術はこれに限定されない。例えば、特定処理プログラム５６がＵＳＢ（Universal Serial Bus）メモリなどの可搬型のコンピュータ読み取り可能な非一時的格納媒体に格納されていてもよい。非一時的格納媒体に格納されている特定処理プログラム５６は、データ処理装置１２のコンピュータ２２にインストールされる。プロセッサ２８は、特定処理プログラム５６に従って特定処理を実行する。 In the above embodiment, an example has been described in which the specific processing program 56 is stored in the storage 32, but the technology of the present disclosure is not limited to this. For example, the specific processing program 56 may be stored in a portable, computer-readable, non-transitory storage medium such as a Universal Serial Bus (USB) memory. The specific processing program 56 stored in the non-transitory storage medium is installed in the computer 22 of the data processing device 12. The processor 28 executes the specific processing in accordance with the specific processing program 56.

また、ネットワーク５４を介してデータ処理装置１２に接続されるサーバ等の格納装置に特定処理プログラム５６を格納させておき、データ処理装置１２の要求に応じて特定処理プログラム５６がダウンロードされ、コンピュータ２２にインストールされるようにしてもよい。 The specific processing program 56 may also be stored in a storage device such as a server connected to the data processing device 12 via the network 54, and the specific processing program 56 may be downloaded and installed in the computer 22 in response to a request from the data processing device 12.

なお、ネットワーク５４を介してデータ処理装置１２に接続されるサーバ等の格納装置に特定処理プログラム５６の全てを格納させておいたり、ストレージ３２に特定処理プログラム５６の全てを記憶させたりしておく必要はなく、特定処理プログラム５６の一部を格納させておいてもよい。 It is not necessary to store all of the specific processing program 56 in a storage device such as a server connected to the data processing device 12 via the network 54, or to store all of the specific processing program 56 in the storage 32; only a portion of the specific processing program 56 may be stored.

特定処理を実行するハードウェア資源としては、次に示す各種のプロセッサを用いることができる。プロセッサとしては、例えば、ソフトウェア、すなわち、プログラムを実行することで、特定処理を実行するハードウェア資源として機能する汎用的なプロセッサであるＣＰＵが挙げられる。また、プロセッサとしては、例えば、ＦＰＧＡ（Field-Programmable Gate Array）、ＰＬＤ（Programmable Logic Device）、又はＡＳＩＣ（Application Specific Integrated Circuit）などの特定の処理を実行させるために専用に設計された回路構成を有するプロセッサである専用電気回路が挙げられる。何れのプロセッサにもメモリが内蔵又は接続されており、何れのプロセッサもメモリを使用することで特定処理を実行する。 The various processors listed below can be used as hardware resources for executing specific processes. Examples of processors include a CPU, which is a general-purpose processor that functions as a hardware resource for executing specific processes by executing software, i.e., a program. Examples of processors include dedicated electrical circuits, such as FPGAs (Field-Programmable Gate Arrays), PLDs (Programmable Logic Devices), or ASICs (Application Specific Integrated Circuits), which are processors with a circuit configuration designed specifically to execute specific processes. All of these processors have built-in or connected memory, and all of these processors execute specific processes by using the memory.

特定処理を実行するハードウェア資源は、これらの各種のプロセッサのうちの１つで構成されてもよいし、同種又は異種の２つ以上のプロセッサの組み合わせ（例えば、複数のＦＰＧＡの組み合わせ、又はＣＰＵとＦＰＧＡとの組み合わせ）で構成されてもよい。また、特定処理を実行するハードウェア資源は１つのプロセッサであってもよい。 The hardware resource that executes the specific process may be composed of one of these various processors, or may be composed of a combination of two or more processors of the same or different types (e.g., a combination of multiple FPGAs, or a combination of a CPU and an FPGA). The hardware resource that executes the specific process may also be a single processor.

１つのプロセッサで構成する例としては、第１に、１つ以上のＣＰＵとソフトウェアの組み合わせで１つのプロセッサを構成し、このプロセッサが、特定処理を実行するハードウェア資源として機能する形態がある。第２に、ＳｏＣ（System-on-a-chip）などに代表されるように、特定処理を実行する複数のハードウェア資源を含むシステム全体の機能を１つのＩＣチップで実現するプロセッサを使用する形態がある。このように、特定処理は、ハードウェア資源として、上記各種のプロセッサの１つ以上を用いて実現される。 As an example of a configuration using a single processor, first, there is a configuration in which one processor is configured by combining one or more CPUs with software, and this processor functions as a hardware resource that executes a specific process. Secondly, there is a configuration in which a processor is used that realizes the functions of the entire system, including multiple hardware resources that execute a specific process, on a single IC chip, as typified by SoC (System-on-a-chip). In this way, a specific process is realized using one or more of the various processors mentioned above as hardware resources.

更に、これらの各種のプロセッサのハードウェア的な構造としては、より具体的には、半導体素子などの回路素子を組み合わせた電気回路を用いることができる。また、上記の特定処理はあくまでも一例である。従って、主旨を逸脱しない範囲内において不要なステップを削除したり、新たなステップを追加したり、処理順序を入れ替えたりしてもよいことは言うまでもない。 More specifically, the hardware structure of these various processors can be an electric circuit that combines circuit elements such as semiconductor elements. The specific processing described above is merely an example. It goes without saying that unnecessary steps can be deleted, new steps can be added, and the processing order can be changed without departing from the spirit of the invention.

以上に示した記載内容及び図示内容は、本開示の技術に係る部分についての詳細な説明であり、本開示の技術の一例に過ぎない。例えば、上記の構成、機能、作用、及び効果に関する説明は、本開示の技術に係る部分の構成、機能、作用、及び効果の一例に関する説明である。よって、本開示の技術の主旨を逸脱しない範囲内において、以上に示した記載内容及び図示内容に対して、不要な部分を削除したり、新たな要素を追加したり、置き換えたりしてもよいことは言うまでもない。また、錯綜を回避し、本開示の技術に係る部分の理解を容易にするために、以上に示した記載内容及び図示内容では、本開示の技術の実施を可能にする上で特に説明を要しない技術常識等に関する説明は省略されている。 The above description and illustrations are a detailed explanation of the parts related to the technology of the present disclosure, and are merely an example of the technology of the present disclosure. For example, the above explanation of the configuration, functions, actions, and effects is an explanation of an example of the configuration, functions, actions, and effects of the parts related to the technology of the present disclosure. Therefore, it goes without saying that unnecessary parts may be deleted, new elements may be added, or replacements may be made to the above description and illustrations, within the scope of the gist of the technology of the present disclosure. Also, in order to avoid confusion and to facilitate understanding of the parts related to the technology of the present disclosure, the above description and illustrations omit explanations of technical common knowledge that do not require particular explanation to enable the implementation of the technology of the present disclosure.

本明細書に記載された全ての文献、特許出願及び技術規格は、個々の文献、特許出願及び技術規格が参照により取り込まれることが具体的かつ個々に記された場合と同程度に、本明細書中に参照により取り込まれる。 All publications, patent applications, and technical standards described in this specification are incorporated by reference into this specification to the same extent as if each individual publication, patent application, and technical standard was specifically and individually indicated to be incorporated by reference.

以上の実施形態に関し、更に以下を開示する。 The following is further disclosed regarding the above embodiment.

（付記１） (Appendix 1)
音声出力が困難な状態にある人の視線を追跡する手段と、追跡した視線情報をもとに過去の音声を再現する手段と、を含むシステム。 A system including a means for tracking the gaze of a person who is in a state where it is difficult to output voice, and a means for reproducing past voice based on the tracked gaze information.
（付記２） (Appendix 2)
付記１に記載のシステムにおいて、視線追跡手段は、目の動きを検知するセンサーを使用することを特徴とする。 In the system described in Appendix 1, the gaze tracking means uses a sensor that detects eye movement.
（付記３） (Appendix 3)
付記１に記載のシステムにおいて、音声再現手段は、過去の音声データを学習し、視線情報と組み合わせて音声を生成するニューラルネットワークを使用することを特徴とする。 In the system described in Appendix 1, the voice reproduction means uses a neural network that learns past voice data and generates voice in combination with gaze information.

（付記４） (Appendix 4)
音声出力が困難な状態にある人の視線を追跡する手段と、追跡した視線情報をもとに過去の音声を再現する手段と、ユーザの感情を認識する感情エンジンとを含むシステム。 A system including a means for tracking the gaze of a person who is in a state where it is difficult to output voice, a means for reproducing past voice based on the tracked gaze information, and an emotion engine for recognizing the user's emotions.
（付記５） (Appendix 5)
付記４に記載のシステムにおいて、感情エンジンは、音声再現時に適切な感情表現を生成するために、視線情報と音声データを分析し、感情を推定することを特徴とする。 In the system described in Appendix 4, the emotion engine analyzes gaze information and voice data and estimates emotions in order to generate appropriate emotional expressions when reproducing voice.
（付記６） (Appendix 6)
付記４に記載のシステムにおいて、感情エンジンは、感情の変化に応じて音声のリアルタイムな調整を行うために、視線情報と感情推定結果をもとに音声合成パラメータを調整することを特徴とする。 In the system described in Appendix 4, the emotion engine adjusts voice synthesis parameters based on gaze information and emotion estimation results in order to adjust the voice in real time in response to changes in emotion.

１０、２１０、３１０、４１０データ処理システム 10, 210, 310, 410 Data processing system
１２データ処理装置 12 Data processing device
１４スマートデバイス 14 Smart device
２１４スマート眼鏡 214 Smart glasses
３１４ヘッドセット型端末 314 Headset type terminal
４１４ロボット 414 Robot

Claims

A system including a means for tracking the gaze of a person who is in a state where it is difficult to output speech, and a means for reproducing past speech based on the tracked gaze information.

In the system described in claim 1, the gaze tracking means is characterized by using a sensor that detects eye movement.

In the system described in claim 1, the voice reproduction means uses a neural network that learns past voice data and combines it with gaze information to generate voice.