JP2023117068A

JP2023117068A - Speech recognition device, speech recognition method, speech recognition program, speech recognition system

Info

Publication number: JP2023117068A
Application number: JP2022019554A
Authority: JP
Inventors: 悠斗後藤; Yuto Goto
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2022-02-10
Filing date: 2022-02-10
Publication date: 2023-08-23

Abstract

【課題】映像データに含まれる画像から特定した音源となる発話者の発話内容を表示させる音声認識装置、音声認識方法、音声認識プログラム及び音声認識システムを提供する。【解決手段】情報処理装置と、撮像装置と、表示装置とが、ネットワーク等を介して接続されている音声認識システムにおいて、情報処理装置である情報処理端末２００Ａの音声認識処理部２３０は、映像データに含まれる画像データが示す画像から検出された人物の顔画像に基づいて注目話者を決定する注目話者決定部２４７と、映像データに含まれる音声データのうち、注目話者の音声データから変換されたテキストデータを表示装置に表示させる発話内容認識結果出力部２３３と、を有する。【選択図】図５ [Problem] To provide a speech recognition device, a speech recognition method, a speech recognition program, and a speech recognition system that display the utterance content of a speaker who is the sound source identified from an image included in video data. [Solution] In a speech recognition system in which an information processing device, an imaging device, and a display device are connected via a network or the like, the speech recognition processing unit 230 of an information processing terminal 200A, which is the information processing device, has a speaker-of-interest determination unit 247 that determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data, and an utterance content recognition result output unit 233 that causes the display device to display text data converted from the voice data of the speaker of interest, among the audio data included in the video data. [Selected drawing] FIG. 5

Description

本発明は、音声認識装置、音声認識方法、音声認識プログラム、音声認識システムに関する。 The present invention relates to a speech recognition device, a speech recognition method, a speech recognition program, and a speech recognition system.

近年では、画像から音源となる発話者を特定し、特定した発話者が発している音声を文字画像に変換して表示部に表示させる技術が知られている。具体的には、例えば、画像において特定された人物が口を動かしている場合に、この人物の音声を文字に変換して表示させるシステムが知られている。 In recent years, a technique is known in which a speaker who is a sound source is specified from an image, the voice uttered by the specified speaker is converted into a character image, and the character image is displayed on a display unit. Specifically, for example, when a person specified in an image is moving his mouth, a system is known that converts the voice of the person into text and displays the text.

上述した従来の技術では、画像の中に、口を動かしている人物が複数存在する場合等には、注目すべき発話者を選択することができない。このため、従来の技術では、特定の人物に注目した場合に、注目した人物の発話内容が適切に表示されない可能性がある。 With the above-described conventional technology, when there are multiple persons moving their mouths in the image, it is not possible to select a speaker of interest. For this reason, when focusing on a specific person, there is a possibility that the utterance content of the focused person may not be displayed appropriately in the conventional technology.

開示の技術は、上記事情に鑑みたものであり、特定の発話者の発話内容を表示させることを目的とする。 The technology disclosed has been made in view of the above circumstances, and aims to display the utterance content of a specific speaker.

開示の技術は、映像データに含まれる画像データが示す画像から検出された人物の顔画像に基づき、注目話者を決定する注目話者決定部と、前記映像データに含まれる音声データのうち、前記注目話者の音声データから変換されたテキストデータを表示装置に表示させる発話内容認識結果出力部と、を有する音声認識装置である。 The disclosed technology is a speech recognition device having: a speaker-of-interest determination unit that determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data; and an utterance content recognition result output unit that causes a display device to display text data converted from the voice data of the speaker of interest, among the audio data included in the video data.

特定の発話者の発話内容を表示させることができる。 It is possible to display the utterance contents of a specific speaker.

第一の実施形態の音声認識システムの一例を示す図である。 A diagram showing an example of the speech recognition system of the first embodiment.
第一の実施形態の音声認識の概要について説明する第一の図である。 A first diagram explaining an overview of speech recognition in the first embodiment.
音声認識システムをスマートグラスに適用した場合について説明する図である。 A diagram explaining a case where the speech recognition system is applied to smart glasses.
第一の実施形態の音声認識の概要について説明する第二の図である。 A second diagram explaining an overview of speech recognition in the first embodiment.
スマートグラスの機能について説明する図である。 A diagram explaining the functions of the smart glasses.
第一の実施形態のスマートグラスの動作を説明する第一のフローチャートである。 A first flowchart explaining the operation of the smart glasses of the first embodiment.
第一の実施形態のスマートグラスの動作を説明する第二のフローチャートである。 A second flowchart explaining the operation of the smart glasses of the first embodiment.
第一の実施形態における注目話者の決定について説明する図である。 A diagram explaining determination of the speaker of interest in the first embodiment.
第一の実施形態のスマートグラスの動作を説明する第三のフローチャートである。 A third flowchart explaining the operation of the smart glasses of the first embodiment.
スマートグラスの動作の事例を説明する第一の図である。 A first diagram explaining an example of operation of the smart glasses.
スマートグラスの動作の事例を説明する第二の図である。 A second diagram explaining an example of operation of the smart glasses.
スマートグラスの動作の事例を説明する第三の図である。 A third diagram explaining an example of operation of the smart glasses.
スマートグラスの動作の事例を説明する第四の図である。 A fourth diagram explaining an example of operation of the smart glasses.
スマートグラスの動作の事例を説明する第五の図である。 A fifth diagram explaining an example of operation of the smart glasses.
スマートグラスの動作の事例を説明する第六の図である。 A sixth diagram explaining an example of operation of the smart glasses.
スマートグラスの動作の事例を説明する第七の図である。 A seventh diagram explaining an example of operation of the smart glasses.
第一の実施形態の発話内容の認識について説明する図である。 A diagram explaining recognition of utterance content in the first embodiment.
第二の実施形態の翻訳システムの一例を示す図である。 A diagram showing an example of the translation system of the second embodiment.
第三の実施形態の発話内容記録システムの一例を示す図である。 A diagram showing an example of the utterance content recording system of the third embodiment.

（第一の実施形態） (First embodiment)
以下に図面を参照して、第一の実施形態について説明する。図１は、第一の実施形態の音声認識システムの一例を示す図である。 A first embodiment will be described below with reference to the drawings. FIG. 1 is a diagram showing an example of the speech recognition system of the first embodiment.

本実施形態の音声認識システム１００は、情報処理装置２００と、撮像装置３００と、表示装置４００とを含み、情報処理装置２００と撮像装置３００と表示装置４００とは、ネットワーク等を介して接続されている。 The speech recognition system 100 of this embodiment includes an information processing device 200, an imaging device 300, and a display device 400. The information processing device 200, the imaging device 300, and the display device 400 are connected via a network or the like.

本実施形態の音声認識システム１００において、情報処理装置２００は、音声認識処理部２３０を有する。 In the speech recognition system 100 of this embodiment, the information processing device 200 has a speech recognition processing section 230 .

本実施形態の撮像装置３００は、映像データを取得し、情報処理装置２００へ送信する。映像データは、画像データ（動画データ）と音声データとを含む。 The imaging device 300 of this embodiment acquires video data and transmits it to the information processing device 200 . Video data includes image data (moving image data) and audio data.

本実施形態では、映像データは、音声認識システム１００の利用者が、撮像装置３００を用いて撮影し、情報処理装置２００に送信したものであってよい。したがって、映像データに含まれる画像データには、利用者自身が注目する発話者が含まれる。 In this embodiment, the video data may be captured by the user of the speech recognition system 100 using the imaging device 300 and transmitted to the information processing device 200 . Therefore, the image data included in the video data includes the speaker that the user pays attention to.

本実施形態の情報処理装置２００は、音声認識処理部２３０により、撮像装置３００から取得した映像データに含まれる画像データに基づき、音声認識システム１００の利用者が注目する発話者を特定する。そして、情報処理装置２００は、音声認識処理部２３０により、利用者が注目した発話者の音声データのみを、テキストデータに変換して表示装置４００に表示させる。なお、表示装置４００は、例えば、情報処理装置２００の有するディスプレイ等であってもよい。 The information processing apparatus 200 of the present embodiment uses the speech recognition processing unit 230 to identify a speaker that the user of the speech recognition system 100 pays attention to based on the image data included in the video data acquired from the imaging device 300 . Then, the information processing device 200 converts only the voice data of the utterer of interest to the user into text data by the voice recognition processing unit 230, and causes the display device 400 to display the text data. Note that the display device 400 may be, for example, a display included in the information processing device 200 .

このように、本実施形態では、画像データから、利用者が注目した発話者を特定し、特定された発話者の音声データのみをテキストデータに変換して出力する。したがって、本実施形態によれば、利用者が注目した特定の発話者の発話内容を表示させることができる。 As described above, in the present embodiment, the speaker the user is paying attention to is identified from the image data, and only the voice data of the identified speaker is converted into text data and output. Therefore, according to the present embodiment, the utterance content of the specific speaker the user is paying attention to can be displayed.

以下に、図２を参照して、本実施形態の情報処理装置２００による音声認識の概要について説明する。図２は、第一の実施形態の音声認識の概要について説明する第一の図である。 An outline of speech recognition by the information processing device 200 of the present embodiment will be described below with reference to FIG. 2. FIG. 2 is a first diagram explaining an overview of speech recognition in the first embodiment.

図２に示す画像２１は、撮像装置３００により取得された画像データが示す画像の一例である。 An image 21 shown in FIG. 2 is an example of an image represented by image data acquired by the imaging device 300 .

画像２１には、人物Ａの画像と、人物Ｂの画像とが含まれる。本実施形態の情報処理装置２００は、画像２１に含まれる人物の画像のうち、顔画像の位置が画像２１の中心に近い位置にある人物の画像を、注目すべき人物に特定する。 The image 21 includes an image of person A and an image of person B. Among the persons appearing in the image 21, the information processing device 200 of the present embodiment identifies the person whose face image is positioned close to the center of the image 21 as the person of interest.

図２の例では、人物Ａの顔画像は、人物Ｂの顔画像よりも、画像２１の中心に近い。したがって、図２では、人物Ａが注目すべき発話者に特定される。 In the example of FIG. 2, the facial image of person A is closer to the center of image 21 than the facial image of person B is. Therefore, in FIG. 2, person A is identified as the speaker of interest.

なお、画像の中心とは、画像が示す矩形の対角線が交わる位置であってよい。また、以下の説明では、注目すべき発話者を、注目話者と表現する場合がある。注目すべき発話者とは、言い換えれば、音声認識システム１００の利用者が注目している特定の発話者である。 Note that the center of the image may be the position where the diagonal lines of the rectangle indicated by the image intersect. Also, in the following description, a speaker to be noted may be referred to as a speaker of interest. A speaker of interest is, in other words, a specific speaker that the user of the speech recognition system 100 is paying attention to.
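The selection rule described above (take the face nearest the point where the image diagonals cross) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `FaceBox` type, the `pick_center_face` function, the use of Euclidean distance between box centers, and the pixel values are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional, Sequence, Tuple


@dataclass
class FaceBox:
    """Axis-aligned face region in pixel coordinates (hypothetical type)."""
    label: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def pick_center_face(faces: Sequence[FaceBox],
                     image_w: float, image_h: float) -> Optional[FaceBox]:
    """Return the face whose center is nearest the image center, or None."""
    if not faces:
        return None
    # The diagonals of a rectangle intersect at its midpoint.
    cx, cy = image_w / 2, image_h / 2
    return min(
        faces,
        key=lambda f: hypot(f.center()[0] - cx, f.center()[1] - cy),
    )


# Person A stands near the middle of a 640x480 frame, person B off to the side.
faces = [
    FaceBox("A", x=280, y=180, width=80, height=100),
    FaceBox("B", x=520, y=60, width=70, height=90),
]
chosen = pick_center_face(faces, image_w=640, image_h=480)
print(chosen.label)  # prints "A"
```

With these sample boxes, person A's face center lies 10 pixels from the image center, so A is chosen, matching the situation in FIG. 2.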

本実施形態では、人物Ａが注目話者に特定されると、人物Ａの口唇部分の動きを示す動画像と、撮像装置３００が取得した音声データとかを用いて、注目話者である人物Ａの音声データをテキストデータ２３に変換して、表示させる。したがって、本実施形態によれば、注目話者の発話内容を高い精度でテキストデータに変換することができる。 In the present embodiment, when person A is identified as the speaker of interest, the voice data of person A is converted into text data 23 and displayed, using a moving image showing the movement of person A's lips together with the audio data acquired by the imaging device 300. Therefore, according to this embodiment, the utterance content of the speaker of interest can be converted into text data with high accuracy.

なお、このとき、本実施形態では、画像２１と、テキストデータ２３とが重畳されて表示されてもよい。 At this time, in the present embodiment, the image 21 and the text data 23 may be superimposed and displayed.

次に、図３を参照して、本実施形態の音声認識システム１００を、スマートグラスに適用した場合について説明する。 Next, a case where the speech recognition system 100 of the present embodiment is applied to smart glasses will be described with reference to FIG. 3.

図３は、音声認識システムをスマートグラスに適用した場合について説明する図である。図３では、音声認識システム１００をスマートグラス１００Ａとして説明する。 FIG. 3 is a diagram illustrating a case where the speech recognition system is applied to smart glasses. In FIG. 3, the speech recognition system 100 is described as smart glasses 100A.

図３のスマートグラスのハードウェア構成の一例を示す図である。本実施形態のスマートグラス１００Ａは、眼鏡型表示装置３００Ａと、情報処理端末２００Ａと、ケーブル１５０とを含む眼鏡型ウェアラブル端末である。なお、図３の例では、眼鏡型表示装置３００Ａと、情報処理端末２００Ａとがケーブル１５０によって接続されるものとしたが、これに限定されない。眼鏡型表示装置３００Ａと、情報処理端末２００Ａとは、無線通信を行ってもよい。 This is a diagram showing an example of the hardware configuration of the smart glasses of FIG. 3. The smart glasses 100A of this embodiment are a glasses-type wearable terminal including a glasses-type display device 300A, an information processing terminal 200A, and a cable 150. In the example of FIG. 3, the glasses-type display device 300A and the information processing terminal 200A are connected by the cable 150, but the configuration is not limited to this. The glasses-type display device 300A and the information processing terminal 200A may instead communicate wirelessly.

眼鏡型表示装置３００Ａは、カメラ（撮像装置）１１０、マイク（集音装置）１２０、ディスプレイ（表示装置）１３０、操作部材１４０と、を含む。つまり、眼鏡型表示装置３００Ａは、撮像装置と表示装置とを含む。 The glasses-type display device 300A includes a camera (imaging device) 110 , a microphone (sound collecting device) 120 , a display (display device) 130 and an operation member 140 . That is, the glasses-type display device 300A includes an imaging device and a display device.

カメラ１１０は、スマートグラス１００Ａを装着した装着者の視線方向の画像データを取得する。マイク１２０は、スマートグラス１００Ａの周辺の音声データを取得する。ディスプレイ１３０は、情報処理端末２００Ａから出力されるテキストデータが表示される。なお、本実施形態のディスプレイ１３０は、光学シースルー型のディスプレイであってよい。操作部材１４０は、物理的なボタン等であってよく、眼鏡型表示装置３００Ａに対する各種の操作が行われる。 The camera 110 acquires image data of the line-of-sight direction of the wearer wearing the smart glasses 100A. The microphone 120 acquires audio data around the smart glasses 100A. Display 130 displays text data output from information processing terminal 200A. The display 130 of this embodiment may be an optical see-through display. The operation member 140 may be a physical button or the like, and various operations are performed on the glasses-type display device 300A.

また、本実施形態では、カメラ１１０とマイク１２０とが別々に設けられるものとしたが、これに限定されない。本実施形態のマイク１２０は、カメラ１１０に内蔵されていてもよい。この場合、カメラ１１０が、画像データと音声データとを含む映像データを取得することになる。 Also, in the present embodiment, the camera 110 and the microphone 120 are provided separately, but the present invention is not limited to this. The microphone 120 of this embodiment may be built into the camera 110 . In this case, the camera 110 acquires video data including image data and audio data.

ケーブル１５０は、カメラ１１０が取得した画像データと、マイク１２０が取得した音声データと、を情報処理端末２００Ａに送信する。また、ケーブル１５０は、情報処理端末２００Ａから眼鏡型表示装置３００Ａに対して各種の情報を送信する。 Cable 150 transmits image data acquired by camera 110 and audio data acquired by microphone 120 to information processing terminal 200A. Also, the cable 150 transmits various types of information from the information processing terminal 200A to the glasses-type display device 300A.

情報処理端末２００Ａは、情報入出力インターフェイス（Ｉ／Ｆ）２０１、メモリ２０２、操作装置２０３、ストレージ２０４、電源２０５、ＣＰＵ（Central Processing Unit）２０６、ネットワークインターフェイス（Ｉ／Ｆ）２０７を含む。 Information processing terminal 200A includes information input/output interface (I/F) 201 , memory 202 , operation device 203 , storage 204 , power supply 205 , CPU (Central Processing Unit) 206 and network interface (I/F) 207 .

情報入出力インターフェイス（Ｉ／Ｆ）２０１は、情報処理端末２００Ａと眼鏡型表示装置３００Ａとの間で各種データの送受信を行うためのインターフェイスである。メモリ２０２は、音声データや画像データ（動画データ）等の一時的な情報が格納される。操作装置２０３は、スマートグラス１００Ａの装着者によるアプリケーションの実行、電源のオン／オフ等の各種の操作が行われる。操作装置２０３は、例えば、タッチパネル等により実現されてよい。 An information input/output interface (I/F) 201 is an interface for transmitting and receiving various data between the information processing terminal 200A and the glasses-type display device 300A. A memory 202 stores temporary information such as audio data and image data (moving image data). The operation device 203 is used by the wearer of the smart glasses 100A to perform various operations such as execution of applications and power on/off. The operation device 203 may be realized by, for example, a touch panel or the like.

ストレージ２０４は、後述する各種のモデル等が格納される。電源２０５は、スマートグラス１００Ａの有する各装置に電力を供給する。ＣＰＵ２０６は、各種の処理を実行し、スマートグラス１００Ａ全体の動作を制御する。 The storage 204 stores various models and the like, which will be described later. The power supply 205 supplies power to each device of the smart glasses 100A. The CPU 206 executes various processes and controls the operation of the smart glasses 100A as a whole.

情報処理端末２００Ａは、ＣＰＵ２０６がストレージ２０４等に格納されたプログラムを読み出して実行することで、音声認識処理部２３０の機能を実現する。 The information processing terminal 200A implements the functions of the speech recognition processing unit 230 when the CPU 206 reads and executes a program stored in the storage 204 or the like.

ネットワークインターフェイス２０７は、通信ネットワークにアクセスするためのインターフェイスである。 A network interface 207 is an interface for accessing a communication network.

なお、図３に示すスマートグラス１００Ａは、眼鏡型表示装置３００Ａと情報処理端末２００Ａとを含むものとしたが、これに限定されない。スマートグラス１００Ａにおいて、眼鏡型表示装置３００Ａが、情報処理端末２００Ａの全ての構成を有していてもよい。 Note that the smart glasses 100A shown in FIG. 3 include the glasses-type display device 300A and the information processing terminal 200A, but are not limited to this. In the smart glasses 100A, the glasses-type display device 300A may have all the configurations of the information processing terminal 200A.

次に、図４を参照して、スマートグラス１００Ａによる音声認識の概要について説明する。図４は、第一の実施形態の音声認識の概要について説明する第二の図である。 Next, an outline of speech recognition by the smart glasses 100A will be described with reference to FIG. 4. FIG. 4 is a second diagram explaining an overview of speech recognition in the first embodiment.

本実施形態のスマートグラス１００Ａにおいて、情報処理端末２００Ａは、カメラ１１０から取得した画像データに基づき、注目話者を特定する。そして、情報処理端末２００Ａは、マイク１２０から取得した注目話者の音声データから変換したテキストデータをディスプレイ１３０に表示させる。 In the smart glasses 100A of this embodiment, the information processing terminal 200A identifies the speaker of interest based on the image data acquired from the camera 110 . Then, the information processing terminal 200A causes the display 130 to display the text data converted from the speech data of the speaker of interest acquired from the microphone 120 .

図４の例では、スマートグラス１００Ａの装着者Ｐは、人物Ａに注目している。また、図４の例では、装着者Ｐの視線方向に人物Ａと人物Ｂが存在する。この場合、スマートグラス１００Ａのカメラ１１０が撮像する画像は、装着者Ｐが注目する人物Ａの画像が中心部分に位置する画像となる。 In the example of FIG. 4, the wearer P of the smart glasses 100A is paying attention to person A. Also in the example of FIG. 4, person A and person B are both present in the line-of-sight direction of the wearer P. In this case, the image captured by the camera 110 of the smart glasses 100A is one in which the image of person A, to whom the wearer P is paying attention, is positioned at the center.

したがって、スマートグラス１００Ａでは、人物Ａを注目話者に特定し、注目話者の音声データのみをテキストデータに変換して、ディスプレイ１３０にテキストデータのみを表示させる。 Therefore, in the smart glasses 100A, the person A is identified as the speaker of interest, only the voice data of the speaker of interest is converted into text data, and only the text data is displayed on the display 130 .

本実施形態では、このように、音声認識システム１００をスマートグラス１００Ａに適用することで、スマートグラス１００Ａの装着者Ｐが注目する人物の方向を向くだけで、装着者Ｐが注目する人物が注目話者に特定される。 In the present embodiment, by applying the speech recognition system 100 to the smart glasses 100A in this way, the person the wearer P is paying attention to is identified as the speaker of interest simply by the wearer P facing that person.

また、本実施形態では、スマートグラス１００Ａのディスプレイ１３０を光学シースルー型としている。このため、本実施形態では、装着者Ｐの視界を妨げずに、テキストデータ２３を装着者Ｐに視認させることができる。 Further, in this embodiment, the display 130 of the smart glasses 100A is of an optical see-through type. Therefore, in the present embodiment, the text data 23 can be visually recognized by the wearer P without obstructing the wearer's P view.

なお、ディスプレイ１３０は、光学シースルー型でなくてもよく、カメラ１１０が取得した画像データが示す画像と、テキストデータ２３とが重畳されて表示されてもよい。 Note that the display 130 may not be of the optical see-through type, and may display an image indicated by the image data acquired by the camera 110 and the text data 23 in a superimposed manner.

また、スマートグラス１００Ａは、例えば、網膜走査型の眼鏡型投影装置であってよい。この場合には、ディスプレイ１３０が不要であり、装着者Ｐの網膜に、光学系により直接テキストデータ２３を投影させればよい。 Also, the smart glasses 100A may be, for example, a retinal scanning spectacles-type projection device. In this case, the display 130 is unnecessary, and the text data 23 may be directly projected onto the retina of the wearer P by the optical system.

次に、図５を参照して、本実施形態のスマートグラス１００Ａの機能について説明する。図５は、スマートグラスの機能について説明する図である。具体的には、図５は、スマートグラス１００Ａの有する情報処理端末２００Ａの機能を示す。 Next, with reference to FIG. 5, functions of the smart glasses 100A of the present embodiment will be described. FIG. 5 is a diagram illustrating functions of smart glasses. Specifically, FIG. 5 shows functions of an information processing terminal 200A included in the smart glasses 100A.

本実施形態の情報処理端末２００Ａは、音声認識処理部２３０を有する。音声認識処理部２３０は、映像入力部２３１、音声入力部２３２、注目話者特定部２４０、口唇特徴量取得部２５０、音響特徴量取得部２６０、人物識別部２７０、マルチモーダル認識部２８０（第一の発話認識部）、音声認識部２９０（第二の発話認識部）、発話内容認識結果出力部２３３を含む。 The information processing terminal 200A of this embodiment has a speech recognition processing unit 230. The speech recognition processing unit 230 includes a video input unit 231, a voice input unit 232, a speaker-of-interest identification unit 240, a lip feature amount acquisition unit 250, an acoustic feature amount acquisition unit 260, a person identification unit 270, a multimodal recognition unit 280 (first utterance recognition unit), a speech recognition unit 290 (second utterance recognition unit), and an utterance content recognition result output unit 233.

映像入力部２３１は、カメラ１１０が撮像した画像データ（動画データ）を取得する。音声入力部２３２は、マイク１２０により集音された音声データを取得する。このとき、音声入力部２３２は、音声データを、所定の条件でサンプリングしたモノラルの非圧縮データとして取得してもよい。 The video input unit 231 acquires image data (moving image data) captured by the camera 110 . The voice input unit 232 acquires voice data collected by the microphone 120 . At this time, the audio input unit 232 may acquire the audio data as monaural uncompressed data sampled under a predetermined condition.

発話内容認識結果出力部２３３は、マルチモーダル認識部２８０による発話内容の認識結果であるテキストデータや、音声認識部２９０による発話内容の認識結果であるテキストデータを、ディスプレイ１３０に表示させる。 The utterance content recognition result output unit 233 causes the display 130 to display the text data that is the recognition result of the utterance content by the multimodal recognition unit 280 and the text data that is the recognition result of the utterance content by the speech recognition unit 290 .

注目話者特定部２４０は、映像入力部２３１が取得した動画データから、注目話者を特定する。注目話者特定部２４０は、画像変換部２４１、顔領域認識部２４２、顔領域検出モデル２４３、顔位置判定部２４４、口唇領域抽出部２４５、顔特徴点推定モデル２４６、注目話者決定部２４７を有する。 The speaker-of-interest identification unit 240 identifies the speaker of interest from the moving image data acquired by the video input unit 231. The speaker-of-interest identification unit 240 has an image conversion unit 241, a face region recognition unit 242, a face region detection model 243, a face position determination unit 244, a lip region extraction unit 245, a facial feature point estimation model 246, and a speaker-of-interest determination unit 247.

画像変換部２４１は、動画データを時系列のフレーム画像に変換する。なお、画像変換部２４１は、処理の高速化のため、ＲＧＢの画像データをグレースケールの画像データに変換してもよいし、画素数を変換してもよい。 The image conversion unit 241 converts moving image data into time-series frame images. Note that the image conversion unit 241 may convert RGB image data into grayscale image data or may convert the number of pixels in order to speed up processing.
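The two speed-ups mentioned for the image conversion unit, grayscale conversion and a reduced pixel count, can be sketched in plain Python as follows. The luma weights (ITU-R BT.601) and the every-n-th-pixel decimation are assumptions chosen for illustration; the patent does not name a specific formula or resampling method, and `to_grayscale` and `downsample` are hypothetical helper names.

```python
def to_grayscale(frame):
    """Convert rows of (R, G, B) pixels into rows of 0-255 luminance values."""
    # BT.601 luma weights: an assumed, commonly used choice.
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in frame
    ]


def downsample(gray, step):
    """Reduce the pixel count by keeping every `step`-th pixel in each direction."""
    return [row[::step] for row in gray[::step]]


# A tiny 4x2 RGB frame standing in for one frame of the moving image data.
frame = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 0), (0, 255, 255)],
]
gray = to_grayscale(frame)    # one channel instead of three
small = downsample(gray, 2)   # a quarter of the pixels
print(gray)
print(small)
```

Both steps shrink the input that the downstream face detection has to process, which is the stated motivation for the conversion.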

顔領域認識部２４２は、顔領域検出モデル２４３を用いて、取得した時系列のフレーム画像において、顔画像を含む領域（顔領域）を認識する。顔領域検出モデル２４３は、画像から顔画像を検出するモデルであり、予め大量のデータを使用してニューラルネットワークを学習させたモデルである。なお、ここで検出された顔画像は、注目話者の候補となる人物の顔画像である。 The face region recognition unit 242 uses the face region detection model 243 to recognize regions (face regions) containing face images in the acquired time-series frame images. The face area detection model 243 is a model for detecting a face image from an image, and is a model in which a neural network is trained in advance using a large amount of data. The face image detected here is the face image of a person who is a candidate for the speaker of interest.

顔位置判定部２４４は、カメラ１１０が取得した画像データが示す画像における顔領域の位置を判定し、顔領域の位置を示す情報を取得する。 The face position determination unit 244 determines the position of the face area in the image indicated by the image data acquired by the camera 110, and acquires information indicating the position of the face area.

口唇領域抽出部２４５は、顔特徴点推定モデル２４６を用いて、顔画像のうち、口唇部分の画像を含む口唇領域を検出し、顔領域内の顔画像から、口唇領域内の画像を抽出する。 A lip region extraction unit 245 detects a lip region including an image of the lip portion in the face image using the facial feature point estimation model 246, and extracts an image within the lip region from the face image within the face region. .

顔特徴点推定モデル２４６は、顔画像から、目や鼻、口唇の輪郭の座標を取得し、口唇周辺の座標を検出するモデルである。 The face feature point estimation model 246 is a model that acquires the coordinates of the contours of the eyes, nose, and lips from the face image and detects the coordinates around the lips.

なお、本実施形態では、口唇領域抽出部２４５は、口唇領域の画像を抽出するものとしたが、これに限定されない。口唇領域抽出部２４５は、例えば、顔画像において、口唇部分の画像が、人物の手などによって隠されていた場合には、目や鼻等の顔のパーツと対応した領域の画像を抽出してもよい。目や鼻等の顔のパーツと対応した領域は、顔特徴点推定モデル２４６によって検出されてよい。 In this embodiment, the lip region extraction unit 245 extracts the image of the lip region, but it is not limited to this. For example, when the lip portion of the face image is hidden by the person's hand or the like, the lip region extraction unit 245 may instead extract an image of a region corresponding to facial parts such as the eyes or nose. The regions corresponding to facial parts such as the eyes and nose may be detected by the facial feature point estimation model 246.

注目話者決定部２４７は、画像における人物の顔領域の位置、及び、口唇領域内の画像（動画）が示す口の動きに基づき、注目話者を決定する。注目話者特定部２４０の処理の詳細は後述する。 The speaker-of-interest determination unit 247 determines the speaker of interest based on the position of the person's face region in the image and the movement of the mouth shown by the image (moving image) within the lip region. The details of the processing of the speaker-of-interest identification unit 240 will be described later.

本実施形態の口唇特徴量取得部２５０は、カメラ１１０が取得した画像における、口唇領域内の画像から、口の動きを示す口唇特徴量を取得する。 The lip feature quantity acquisition unit 250 of the present embodiment acquires a lip feature quantity indicating the movement of the mouth from the image within the lip region in the image acquired by the camera 110 .

口唇特徴量取得部２５０は、口唇画素数変換部２５１、口唇特徴量算出部２５２、口唇特徴量算出モデル２５３を有する。 The lip feature quantity acquisition unit 250 has a lip pixel number conversion unit 251 , a lip feature quantity calculation unit 252 , and a lip feature quantity calculation model 253 .

口唇画素数変換部２５１は、抽出された口唇領域内の画像を、所定の大きさの画像に変換する。言い換えれば、口唇画素数変換部２５１は、カメラ１１０と、撮影された人物との距離によって大きさが異なる口唇領域内の画像を、一律の大きさの画像となるように、拡大、または縮小する。 A lip pixel number conversion unit 251 converts the image in the extracted lip region into an image of a predetermined size. In other words, the lip pixel number conversion unit 251 enlarges or reduces the image in the lip region, which varies in size depending on the distance between the camera 110 and the person being photographed, so that the image has a uniform size. .
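The size normalization performed by the lip pixel number conversion unit can be illustrated with a nearest-neighbour resize. This is only a sketch: the patent says the crop is enlarged or reduced to a uniform size but does not specify the interpolation method or the target size, so `resize_nearest` and the 2x2 target are assumptions.

```python
def resize_nearest(image, out_w, out_h):
    """Scale a 2-D list of pixels to out_w x out_h by nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# A person close to the camera yields a large lip crop, a distant one a small crop.
near_crop = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
far_crop = [[7]]

# Both crops end up with the same fixed size (2x2 here, purely for illustration).
print(resize_nearest(near_crop, 2, 2))  # reduced
print(resize_nearest(far_crop, 2, 2))   # enlarged
```

Bringing every crop to one size means the lip feature amount calculation model always receives input of the same shape, regardless of how far the speaker is from the camera.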

口唇特徴量算出部２５２は、口唇特徴量算出モデル２５３を用いて、口唇特徴量を算出する。具体的には、口唇特徴量算出部２５２は、大きさが変更された時系列の口唇領域内の画像を示す動画データを、口唇特徴量算出モデル２５３に入力し、発話内容の認識を行う際に効果的な口唇特徴量を算出する。唇特徴量とは、口唇領域内の動画データを口唇特徴量算出モデル２５３に入力して、口唇特徴量算出モデル２５３から出力される多次元のベクトルである。 The lip feature amount calculation unit 252 uses the lip feature amount calculation model 253 to calculate the lip feature amount. Specifically, the lip feature amount calculation unit 252 inputs the moving image data showing the resized time-series images within the lip region to the lip feature amount calculation model 253, and calculates a lip feature amount that is effective for recognizing the utterance content. The lip feature amount is a multidimensional vector output by the lip feature amount calculation model 253 when moving image data within the lip region is input to it.

音響特徴量取得部２６０は、音声入力部２３２が取得した音声データから、人物による発話が行われている区間である発話区間を検出し、発話区間の音声データの音響特徴量を取得する。 The acoustic feature amount acquisition unit 260 detects an utterance period, which is a period in which a person speaks, from the audio data acquired by the audio input unit 232, and acquires an acoustic feature amount of the audio data in the utterance period.

音響特徴量取得部２６０は、音声発話区間検出部２６１、音声発話区間検出モデル２６２、音響特徴量算出部２６３を有する。 The acoustic feature quantity acquisition unit 260 has a voice utterance segment detection unit 261 , a voice utterance segment detection model 262 and an acoustic feature quantity calculation unit 263 .

音声発話区間検出部２６１は、音声発話区間検出モデル２６２を用いて、入力された音声データから、発話区間を検出する。 The voice utterance period detection unit 261 uses the voice utterance period detection model 262 to detect the utterance period from the input voice data.

音響特徴量算出部２６３は、発話区間として検出された区間の音声波形から、音響特徴量を算出する。音響特徴量は、例えば、メル周波数ケプストラム係数（ＭＦＣＣ）や、対数メルフィルタバンク特徴量（ＦＢＡＮＫ)や対数メルフィルタ等であってよい。 The acoustic feature amount calculation unit 263 calculates an acoustic feature amount from the speech waveform of the section detected as the utterance section. The acoustic features may be, for example, Mel frequency cepstrum coefficients (MFCC), logarithmic Mel filter bank features (FBANK), logarithmic Mel filters, and the like.
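MFCC and FBANK features are both built on a bank of filters spaced evenly on the mel scale. As a small illustration of that spacing, the sketch below uses the widely used conversion mel = 2595 * log10(1 + f / 700); the patent does not state which mel formula, filter count, or frequency range is used, so the function names and the values here are assumptions.

```python
import math


def hz_to_mel(freq_hz):
    """Map a frequency in Hz onto the mel scale (common 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Return filter center frequencies in Hz, equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# Five illustrative filters between 0 Hz and 8 kHz.
centers = mel_center_frequencies(num_filters=5, low_hz=0.0, high_hz=8000.0)
print([round(c) for c in centers])
```

The centers crowd together at low frequencies and spread out at high ones, which is the property that makes mel-based features a common input for speech recognition.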

人物識別部２７０は、音響特徴量から、発話した人物を識別するための情報を取得する。人物識別部２７０は、話者埋め込み情報算出部２７１、話者埋め込み情報算出モデル２７２、画面外話者推定部２７３を有する。 The person identification unit 270 acquires information for identifying the person who has spoken from the acoustic feature amount. The person identification unit 270 has a speaker embedded information calculation unit 271 , a speaker embedded information calculation model 272 and an off-screen speaker estimation unit 273 .

話者埋め込み情報算出部２７１は、話者埋め込み情報算出モデル２７２を用いて、発話者の声質をあらわす話者埋め込み情報（エンべディング）を算出する。話者埋め込み情報とは、発話者を特定するための情報であり、例えば、i-vectorやd-vector、x-vector等の方式によって抽出された一定次元数の特徴量であってよい。 The speaker-embedded information calculation unit 271 uses the speaker-embedded information calculation model 272 to calculate speaker-embedded information (embedding) representing the voice quality of the speaker. The speaker embedding information is information for specifying the speaker, and may be, for example, a feature quantity with a certain number of dimensions extracted by a method such as i-vector, d-vector, or x-vector.

画面外話者推定部２７３は、スマートグラス１００Ａの装着者Ｐの顔の向きが変化し、注目話者の画像がカメラ１１０が取得した画像に含まれなくなった場合に、話者埋め込み情報を用いて発話者を推定する。画面外話者推定部２７３の処理の詳細は後述する。 The off-screen speaker estimation unit 273 estimates the speaker using the speaker embedding information when the orientation of the face of the wearer P of the smart glasses 100A changes and the speaker of interest no longer appears in the image acquired by the camera 110. The details of the processing of the off-screen speaker estimation unit 273 will be described later.

マルチモーダル認識部２８０は、口唇特徴量と音響特徴量とを用いて、注目話者の発話内容を認識する。マルチモーダル認識部２８０は、特徴量統合部２８１、マルチモーダル発話内容認識部２８２、マルチモーダル発話内容認識モデル２８３を有する。 The multimodal recognition unit 280 recognizes the utterance content of the speaker of interest using the lip feature amount and the acoustic feature amount. The multimodal recognition unit 280 has a feature amount integration unit 281, a multimodal utterance content recognition unit 282, and a multimodal utterance content recognition model 283.

特徴量統合部２８１は、音響特徴量取得部２６０により取得された音響特徴量と、口唇特徴量取得部２５０により取得された口唇特徴量とを統合し、マルチモーダル特徴量とする。マルチモーダル特徴量とは、複数種類の特徴量を含む特徴量である。より具体的には、マルチモーダル特徴量とは、音響特徴量と口唇特徴量とを含む。 The feature amount integration unit 281 integrates the acoustic feature amount acquired by the acoustic feature amount acquisition unit 260 and the lip feature amount acquired by the lip feature amount acquisition unit 250 to obtain a multimodal feature amount. A multimodal feature amount is a feature amount including a plurality of types of feature amounts. More specifically, the multimodal features include acoustic features and lip features.
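One simple way to picture the integration step is frame-by-frame concatenation of the two feature vectors. The patent only states that the multimodal feature amount contains both the acoustic and the lip feature amounts; concatenation, the assumption that both streams are already aligned to the same number of frames, and the name `integrate_features` are all illustrative choices made here.

```python
def integrate_features(acoustic, lip):
    """Concatenate per-frame acoustic and lip vectors into multimodal vectors."""
    if len(acoustic) != len(lip):
        # Assumes both streams were resampled to a common frame rate beforehand.
        raise ValueError("feature streams must have the same number of frames")
    return [a + l for a, l in zip(acoustic, lip)]


# Two frames: 3-dimensional acoustic features and 2-dimensional lip features.
acoustic = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
lip = [[1.0, 2.0], [3.0, 4.0]]

multimodal = integrate_features(acoustic, lip)
print(multimodal)  # each frame now carries 5 values
```

In practice the camera frame rate and the audio feature rate usually differ, so an alignment step would be needed before a merge like this; that step is outside what this sketch shows.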

マルチモーダル発話内容認識部２８２は、マルチモーダル発話内容認識モデル２８３を用いて、発話内容を認識する。より具体的には、本実施形態のマルチモーダル発話内容認識部２８２は、音声データから抽出された音響特徴量と、動画データから抽出された口唇特徴量とを用いて発話内容の認識を行う。 A multimodal speech content recognition unit 282 recognizes speech content using a multimodal speech content recognition model 283 . More specifically, the multimodal utterance content recognition unit 282 of the present embodiment recognizes utterance content using acoustic features extracted from voice data and lip features extracted from video data.

音声認識部２９０は、音声入力部２３２が取得した音声データから、音響特徴量取得部２６０が取得した音響特徴量に基づき、発話内容を認識する。音声認識部２９０は、音声発話内容認識部２９１、音声発話内容認識モデル２９２を有する。 The speech recognition unit 290 recognizes the speech content from the speech data acquired by the speech input unit 232 based on the acoustic feature quantity acquired by the acoustic feature quantity acquisition unit 260 . The voice recognition unit 290 has a voice utterance content recognition unit 291 and a voice utterance content recognition model 292 .

音声発話内容認識部２９１は、注目話者とされた人物の口唇特徴量が取得されなかった場合に、音響特徴量を用いた発話内容の認識を行う。具体的には、音声発話内容認識部２９１は、音声発話内容認識モデル２９２を用いて、音声データに基づく発話内容の認識を行い、認識結果を発話内容認識結果出力部２３３に渡す。 The voice utterance content recognition unit 291 recognizes the utterance content using the acoustic feature amount when the lip feature amount of the person set as the speaker of interest is not acquired. Specifically, the voice utterance content recognition unit 291 uses the voice utterance content recognition model 292 to recognize the utterance content based on the voice data, and passes the recognition result to the utterance content recognition result output unit 233 .

なお、本実施形態では、口唇領域抽出部２４５により抽出された画像が、口以外の顔のパーツの画像と対応した領域である場合には、口唇特徴量が算出されなかったものとしてもよい。 In this embodiment, if the image extracted by the lip area extraction unit 245 is a region corresponding to an image of a facial part other than the mouth, the lip feature amount may be treated as not having been calculated.

なお、本実施形態において、顔領域検出モデル２４３、顔特徴点推定モデル２４６、音声発話区間検出モデル２６２、話者埋め込み情報算出モデル２７２は、公知技術を用いたモデルであってよい。 In the present embodiment, the face region detection model 243, face feature point estimation model 246, voice utterance section detection model 262, and speaker embedded information calculation model 272 may be models using known techniques.

次に、図６を参照して、本実施形態のスマートグラス１００Ａの動作について説明する。図６は、第一の実施形態のスマートグラスの動作を説明する第一のフローチャートである。 Next, the operation of the smart glasses 100A of the present embodiment will be described with reference to FIG. 6. FIG. 6 is a first flowchart explaining the operation of the smart glasses of the first embodiment.

図６の処理は、例えば、スマートグラス１００Ａの装着者Ｐにより、注目話者の発話内容の認識処理の開始を指示する操作が行われた場合に、実行される。 The process of FIG. 6 is executed, for example, when the wearer P of the smart glasses 100A performs an operation of instructing the start of the process of recognizing the utterance content of the speaker of interest.

本実施形態のスマートグラス１００Ａにおいて、情報処理端末２００Ａは、映像入力部２３１と音声入力部２３２とにより、画像データ（動画データ）と音声データとを取得する（ステップＳ６０１）。 In the smart glasses 100A of the present embodiment, the information processing terminal 200A acquires image data (moving image data) and audio data using the video input unit 231 and the audio input unit 232 (step S601).

続いて、情報処理端末２００Ａは、音声発話区間検出部２６１により、発話区間を検出する処理を行う（ステップＳ６０２）。 Subsequently, the information processing terminal 200A performs processing for detecting an utterance period using the voice utterance period detection unit 261 (step S602).

ステップＳ６０２において、発話区間が検出されない場合、情報処理端末２００Ａは、ステップＳ６０１へ戻る。 In step S602, when the speech period is not detected, the information processing terminal 200A returns to step S601.

ステップＳ６０２において、発話区間が検出されると、情報処理端末２００Ａは、ステップＳ６０５からステップＳ６０７までの処理を、顔画像が検出された人数分、繰り返す（ステップＳ６０４）。 When the speech period is detected in step S602, the information processing terminal 200A repeats the processes from step S605 to step S607 for the number of persons whose face images are detected (step S604).

情報処理端末２００Ａは、顔領域認識部２４２により、映像入力部２３１が取得した画像データが示す画像において、顔画像が含まれる顔領域を検出する（ステップＳ６０５）。 The information processing terminal 200A uses the face area recognition unit 242 to detect a face area containing a face image in the image indicated by the image data acquired by the video input unit 231 (step S605).

続いて、情報処理端末２００Ａは、口唇領域抽出部２４５により、顔領域の中から、口唇領域を検出する（ステップＳ６０６）。なお、口唇領域抽出部２４５は、顔領域において、口唇領域が検出されなかった場合には、口以外の顔のパーツ（目や鼻等）の画像と対応した領域を検出すればよい。つまり、口唇領域抽出部２４５は、顔領域から、顔の一部の画像と対応した領域を検出すればよい。 Subsequently, the information processing terminal 200A detects the lip area from the face area using the lip area extraction unit 245 (step S606). If the lip region is not detected in the face region, the lip region extraction unit 245 may detect regions corresponding to images of facial parts (eyes, nose, etc.) other than the mouth. In other words, the lip region extraction unit 245 should detect a region corresponding to a partial image of the face from the face region.

また、本実施形態において、口唇領域を検出することとは、顔領域内の顔画像から、口唇領域内の口唇画像を抽出することと同義であってよい。 Further, in the present embodiment, detecting the lip region may be synonymous with extracting the lip image within the lip region from the face image within the face region.

続いて、情報処理端末２００Ａは、注目話者決定部２４７により、注目話者を選定する（ステップＳ６０７）。ステップＳ６０７の処理の詳細は後述する。 Subsequently, the information processing terminal 200A selects the attention speaker by the attention speaker determination unit 247 (step S607). Details of the processing in step S607 will be described later.

情報処理端末２００Ａは、ステップＳ６０５からステップＳ６０７までの処理を人数分繰り返す（ステップＳ６０８）。本実施形態では、この処理を繰り返すことで、注目話者が決定される。 The information processing terminal 200A repeats the processing from step S605 to step S607 for each detected person (step S608). In this embodiment, the speaker of interest is determined by repeating this processing.

続いて、情報処理端末２００Ａは、音響特徴量取得部２６０により、注目話者に特定された人物の音声データから、音響特徴量を算出する（ステップＳ６０９）。 Next, the information processing terminal 200A uses the acoustic feature acquisition unit 260 to calculate acoustic features from the voice data of the person identified as the speaker of interest (step S609).

続いて、情報処理端末２００Ａは、話者埋め込み情報算出部２７１により、注目話者の話者埋め込み情報を算出する（ステップＳ６１０）。なお、話者埋め込み情報算出部２７１は、注目話者が決定された後は、注目話者の話者埋め込み情報を保持していてもよい。また、話者埋め込み情報算出部２７１は、注目話者が注目話者でなくなったときに、保持していた話者埋め込み情報を消去してもよい。 Subsequently, the information processing terminal 200A uses the speaker-embedded information calculation unit 271 to calculate the speaker-embedded information of the speaker of interest (step S610). Note that the speaker-embedded information calculation unit 271 may retain the speaker-embedded information of the speaker of interest after the speaker of interest is determined. Further, the speaker-embedded-information calculation unit 271 may delete the held speaker-embedded information when the speaker of interest is no longer the speaker of interest.

続いて、情報処理端末２００Ａは、口唇特徴量取得部２５０により、口唇領域が検出されているか否かを判定する（ステップＳ６１１）。言い換えれば、情報処理端末２００Ａは、口唇領域抽出部２４５により抽出された画像が、口唇領域内の画像であるか否かを判定する。 Subsequently, the information processing terminal 200A determines whether or not the lip region is detected by the lip feature amount acquisition unit 250 (step S611). In other words, information processing terminal 200A determines whether the image extracted by lip area extraction section 245 is an image within the lip area.

ステップＳ６１１において、口唇領域が検出されていない場合、情報処理端末２００Ａは、音声認識部２９０により、音声データによる発話内容の認識を行い（ステップＳ６１２）、後述するステップＳ６１５へ進む。 In step S611, if the lip region is not detected, the information processing terminal 200A uses the speech recognition unit 290 to recognize the utterance content based on the speech data (step S612), and proceeds to step S615, which will be described later.

ステップＳ６１１において、口唇領域が検出された場合、情報処理端末２００Ａは、口唇特徴量取得部２５０により、口唇領域内の画像から口唇特徴量を算出する（ステップＳ６１３）。 When the lip area is detected in step S611, the information processing terminal 200A uses the lip feature amount acquisition unit 250 to calculate the lip feature amount from the image in the lip area (step S613).

続いて、情報処理端末２００Ａは、マルチモーダル認識部２８０により、ステップＳ６０９で算出した音響特徴量と、ステップＳ６１３で算出した口唇特徴量とを用いて、発話内容の認識を行う（ステップＳ６１４）。 Subsequently, the information processing terminal 200A uses the multimodal recognition unit 280 to recognize the speech content using the acoustic feature amount calculated in step S609 and the lip feature amount calculated in step S613 (step S614).

続いて、情報処理端末２００Ａは、発話内容認識結果出力部２３３により、認識結果のテキストデータを出力し（ステップＳ６１５）、処理を終了する。言い換えれば、発話内容認識結果出力部２３３は、認識結果のテキストデータをディスプレイ１３０に表示させて、処理を終了する。 Subsequently, the information processing terminal 200A outputs the text data of the recognition result by the speech content recognition result output unit 233 (step S615), and ends the process. In other words, the utterance content recognition result output unit 233 causes the display 130 to display the text data of the recognition result, and ends the process.
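The branch in steps S611 to S615 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `recognize_utterance`, the `Features` layout (one vector per frame), and the two recognizer callables are hypothetical stand-ins for the multimodal recognition unit 280 and the speech recognition unit 290.

```python
from typing import Callable, Optional, Sequence

# Assumed layout: one feature vector per frame. The source does not specify this.
Features = Sequence[Sequence[float]]


def recognize_utterance(
    acoustic: Features,
    lip: Optional[Features],
    multimodal_recognizer: Callable[[Features, Features], str],
    audio_recognizer: Callable[[Features], str],
) -> str:
    """Return the recognized text for the speaker of interest.

    `lip` is None when no lip region was detected (the "no" branch of S611).
    """
    if lip is None:
        # S612: fall back to recognition from the acoustic features alone.
        return audio_recognizer(acoustic)
    # S613/S614: use both the acoustic and the lip features.
    return multimodal_recognizer(acoustic, lip)
```

In either branch the returned text would then be handed to the output step (S615) for display.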

このように、本実施形態では、注目話者とされた人物の音声データのみを、発話内容の認識を行う音声データとする。 As described above, in the present embodiment, only the speech data of the person who is the target speaker is used as the speech data for recognizing the utterance content.

次に、図７を参照して、本実施形態の注目話者決定部２４７の処理について説明する。図７は、第一の実施形態のスマートグラスの動作を説明する第二のフローチャートである。図７では、図６のステップＳ６０７の処理の詳細を示している。 Next, with reference to FIG. 7, the processing of the attention speaker determination unit 247 of this embodiment will be described. FIG. 7 is a second flowchart explaining the operation of the smart glasses of the first embodiment. FIG. 7 shows details of the processing in step S607 of FIG. 6.

本実施形態の情報処理端末２００Ａにおいて、注目話者決定部２４７は、ステップＳ６０５において、複数の顔領域が検出されたか否かを判定する（ステップＳ７０１）。 In the information processing terminal 200A of the present embodiment, the attention speaker determination unit 247 determines whether or not a plurality of face areas are detected in step S605 (step S701).

ステップＳ７０１において、複数の顔領域が検出されない場合、つまり、検出された顔領域が１つであった場合、注目話者決定部２４７は、後述するステップＳ７０４へ進む。 If a plurality of face areas are not detected in step S701, that is, if only one face area is detected, the speaker-of-interest determination unit 247 proceeds to step S704, which will be described later.

ステップＳ７０１において、複数の領域が検出された場合、注目話者決定部２４７は、１の顔画像における口唇領域の中心のｘ座標と、映像入力部２３１が取得した画像データが示す画像の中心点のｘ座標との間の距離を算出する（ステップＳ７０２）。 In step S701, when a plurality of face regions are detected, the speaker-of-interest determination unit 247 calculates the distance between the x-coordinate of the center of the lip region in one face image and the x-coordinate of the center point of the image indicated by the image data acquired by the video input unit 231 (step S702).

なお、口唇領域抽出部２４５により、口唇領域の代わりに、顔の一部の画像と対応する領域が抽出されている場合は、この領域の中心点のｘ座標を、口唇領域の中心のｘ座標の代わりに用いれば良い。 If the lip region extraction unit 245 has extracted a region corresponding to a partial image of the face instead of the lip region, the x-coordinate of the center point of that region may be used in place of the x-coordinate of the center of the lip region.
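The distance of step S702, including the substitute region, might be computed as in the sketch below. The bounding-box layout `(left, top, width, height)` and the names `reference_center_x` and `horizontal_distance` are assumptions made for illustration; the source only states that the x-coordinate of the region center is compared with the x-coordinate of the image center.

```python
from typing import Optional, Tuple

# Hypothetical box layout: (left, top, width, height) in pixels.
Box = Tuple[float, float, float, float]


def reference_center_x(lip_box: Optional[Box], fallback_box: Optional[Box]) -> Optional[float]:
    """x-coordinate of the lip-region center, or of a substitute facial region."""
    box = lip_box if lip_box is not None else fallback_box
    if box is None:
        return None  # neither region is available
    left, _top, width, _height = box
    return left + width / 2.0


def horizontal_distance(center_x: float, image_width: float) -> float:
    """Absolute horizontal distance between a region center and the image center."""
    return abs(center_x - image_width / 2.0)
```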

続いて、注目話者決定部２４７は、算出した距離が、複数の顔領域について算出した距離のうち、最小であるか否かを判定する（ステップＳ７０３）。言い換えれば、注目話者決定部２４７は、算出した距離が前回算出した距離よりも小さいか否かを判定している。つまり、ここでは、カメラ１１０が撮像した画像の中心に最も近い人物を検出している。 Subsequently, the attention speaker determining unit 247 determines whether or not the calculated distance is the smallest among the distances calculated for the plurality of face regions (step S703). In other words, the attention speaker determination unit 247 determines whether or not the calculated distance is smaller than the previously calculated distance. That is, here, the person closest to the center of the image captured by camera 110 is detected.

ステップＳ７０３において、距離が最小でない場合、注目話者決定部２４７は、この顔領域と対応する人物は、注目話者に該当しないものとし（ステップＳ７０５）、処理を終了する。 If the distance is not the minimum in step S703, the attention speaker determination unit 247 determines that the person corresponding to this face area does not correspond to the attention speaker (step S705), and terminates the process.

ステップＳ７０３において、距離が最小であった場合、注目話者決定部２４７は、口唇領域抽出部２４５により抽出された領域内の画像から、口唇が動いているか否かを判定する（ステップＳ７０４）。つまり、ここでは、注目話者決定部２４７は、顔領域と対応する人物が、発話をしているか否かを判定している。 If the distance is the minimum in step S703, the speaker-of-interest determination unit 247 determines whether or not the lips are moving from the image in the region extracted by the lip region extraction unit 245 (step S704). That is, here, the attention speaker determining unit 247 determines whether or not the person corresponding to the face area is speaking.

ステップＳ７０４において、口唇が動いていない場合、注目話者決定部２４７は、ステップＳ７０５へ進む。口唇が動いていない場合とは、発話していないことを示す。 In step S704, if the lips are not moving, the attention speaker determination unit 247 proceeds to step S705. A case where the lips are not moving indicates that the person is not speaking.

ステップＳ７０４において、口唇が動いている場合、注目話者決定部２４７は、この顔領域を注目話者の顔領域に選定し（ステップＳ７０６）、処理を終了する。 In step S704, if the lips are moving, the attention speaker determination unit 247 selects this face area as the attention speaker's face area (step S706), and ends the process.

以下に、図８を参照して、注目話者決定部２４７により注目話者の決定について、さらに説明する。図８は、第一の実施形態における注目話者の決定について説明する図である。 Determination of the attention speaker by the attention speaker determination unit 247 will be further described below with reference to FIG. 8. FIG. 8 is a diagram explaining determination of a speaker of interest in the first embodiment.

図８に示す画像８１は、映像入力部２３１が取得した画像データが示す画像である。また、画像８１における点ｏは、画像８１の中心点であり、中心点の座標は、（ｘ１，ｙ１）である。なお、本実施形態の中心点ｏの座標は、例えば、画像８１の左上の頂点を原点としたときの座標であってよい。 An image 81 shown in FIG. 8 is an image represented by image data acquired by the video input unit 231 . A point o in the image 81 is the center point of the image 81, and the coordinates of the center point are (x1, y1). Note that the coordinates of the center point o in the present embodiment may be coordinates when the upper left vertex of the image 81 is set as the origin, for example.

図８では、図６のステップＳ６０５において、人物Ａの顔領域と、人物Ｂの顔領域とが検出された場合を示している。この場合、情報処理端末２００Ａは、図６のステップＳ６０６において、各顔領域から口唇領域を検出する。図８の例では、人物Ａの顔領域から口唇領域Ｒａが抽出され、人物Ｂの顔領域から口唇領域Ｒｂが抽出されている。 FIG. 8 shows a case where the face area of person A and the face area of person B are detected in step S605 of FIG. 6. In this case, the information processing terminal 200A detects the lip area from each face area in step S606 of FIG. 6. In the example of FIG. 8, a lip area Ra is extracted from the face area of person A, and a lip area Rb is extracted from the face area of person B.

ここで、注目話者決定部２４７は、例えば、始めに人物Ｂの顔領域を選択し、口唇領域Ｒｂの中心点のｘ座標と、中心点ｏのｘ座標との距離Ｌｂを算出する。このとき、距離Ｌｂは、最小であるため、人物Ｂの口唇が動いている場合には、人物Ｂを注目話者に選定する。 Here, for example, the attention speaker determination unit 247 first selects the face area of person B and calculates the distance Lb between the x-coordinate of the center point of the lip area Rb and the x-coordinate of the center point o. At this point, Lb is the only distance calculated so far and is therefore the smallest; if the lips of person B are moving, person B is selected as the speaker of interest.

次に、注目話者決定部２４７は、人物Ａの顔領域を選択し、口唇領域Ｒａの中心点のｘ座標と、中心点ｏのｘ座標との距離Ｌａを算出する。このとき、距離Ｌａは、距離Ｌｂよりも小さい。したがって、注目話者決定部２４７は、人物Ｂを注目話者から除外し、人物Ａの口唇が動いている場合には、人物Ａを注目話者に決定する。 Next, the attention speaker determination unit 247 selects the face area of the person A, and calculates the distance La between the x-coordinate of the center point of the lip area Ra and the x-coordinate of the center point o. At this time, the distance La is smaller than the distance Lb. Therefore, the attention speaker determination unit 247 excludes person B from the attention speaker, and determines person A as the attention speaker when the lips of person A are moving.
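The walk-through above (person B first, then person A) can be traced with the sketch below, which follows one reading of steps S702 to S706: each face is compared with the smallest distance seen so far, and the closest face becomes the speaker of interest only if its lips are moving. The `Face` class, the function name, the tie handling, and the distance values are hypothetical; the source gives no numeric values for La and Lb.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Face:
    name: str
    distance: float    # |x of lip-region center - x of image center|
    lips_moving: bool


def select_speaker(faces: Iterable[Face]) -> Optional[str]:
    """Return the name of the speaker of interest, or None if there is none."""
    best_distance = float("inf")
    selected: Optional[str] = None
    for face in faces:
        if face.distance >= best_distance:
            # S703 "no" -> S705: this face is not the speaker of interest.
            # Ties keep the earlier face (an assumption).
            continue
        best_distance = face.distance
        # S704: the closest face so far replaces any earlier candidate,
        # and is selected only if its lips are moving (S706).
        selected = face.name if face.lips_moving else None
    return selected


# Hypothetical distances with La < Lb, as in FIG. 8.
faces = [Face("B", 180.0, True), Face("A", 40.0, True)]
print(select_speaker(faces))
```

With these values, person B is a candidate after the first iteration and is replaced by person A in the second.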

このように、本実施形態では、カメラ１１０が撮像した画像の中心と最も近い位置に顔画像が検出された人物を、注目話者に特定する。カメラ１１０が撮像した画像の中心とは、言い換えれば、スマートグラス１００Ａの装着者の視線方向である。つまり、本実施形態では、スマートグラス１００Ａの装着者の視線方向に最も近い人物を、注目話者に決定する。そして、本実施形態では、注目話者による発話のみをテキストデータに変換する。 As described above, in this embodiment, a person whose face image is detected at a position closest to the center of the image captured by the camera 110 is specified as the speaker of interest. The center of the image captured by the camera 110 is, in other words, the line-of-sight direction of the wearer of the smart glasses 100A. That is, in the present embodiment, the person closest to the line-of-sight direction of the wearer of the smart glasses 100A is determined as the speaker of interest. Then, in this embodiment, only the utterances by the speaker of interest are converted into text data.

したがって、本実施形態によれば、カメラ１１０による撮像された画像に複数の人物が含まれている場合であっても、スマートグラス１００Ａの利用者が注目している人物を特定し、特定された人物の発話内容のみをディスプレイ１３０に表示させることができる。言い換えれば、音声認識システム１００の利用者が注目した特定の発話者の発話内容を、利用者の視界を妨げることなく、適切に表示させることができる。 Therefore, according to the present embodiment, even if the image captured by the camera 110 includes a plurality of people, it is possible to identify the person to whom the user of the smart glasses 100A is paying attention and to display only the utterance content of the identified person on the display 130. In other words, the utterance content of the specific speaker to whom the user of the speech recognition system 100 is paying attention can be displayed appropriately without obstructing the user's field of vision.

また、本実施形態では、注目話者の発話内容のみをディスプレイ１３０に表示させるため、ディスプレイ１３０に表示される情報の情報量が過剰になることを抑制できる。 Further, in this embodiment, only the utterance content of the speaker of interest is displayed on the display 130, so that the amount of information displayed on the display 130 can be suppressed from becoming excessive.

また、本実施形態では、注目話者の口唇特徴量と音響特徴量との両方を用いて発話内容の認識を行うため、発話内容の認識の精度を向上させることができる。 In addition, in this embodiment, since the utterance content is recognized using both the lip feature amount and the acoustic feature amount of the speaker of interest, accuracy in recognizing the utterance content can be improved.

さらに、本実施形態では、口唇領域が抽出されない場合には、顔の一部の画像と対応する領域を代用するため、顔領域から口唇領域が抽出されない場合であっても、注目話者を特定することができる。 Furthermore, in the present embodiment, when the lip region is not extracted, a region corresponding to a partial image of the face is used as a substitute, so the speaker of interest can be identified even when the lip region cannot be extracted from the face region.

次に、図９を参照して、注目話者が決定された後のスマートグラス１００Ａの動作について説明する。図９は、第一の実施形態のスマートグラスの動作を説明する第三のフローチャートである。図９に示す処理は、図６の処理により、注目話者が決定された後に、定期的に実行される処理である。 Next, the operation of the smart glasses 100A after the speaker of interest has been determined will be described with reference to FIG. 9. FIG. 9 is a third flowchart explaining the operation of the smart glasses of the first embodiment. The process shown in FIG. 9 is executed periodically after the speaker of interest has been determined by the process of FIG. 6.

本実施形態のスマートグラス１００Ａにおいて、情報処理端末２００Ａは、音声発話区間検出部２６１により、発話区間を検出する処理を行う（ステップＳ９０１）。続いて、情報処理端末２００Ａは、音響特徴量算出部２６３により、発話区間において取得された音声データから、音響特徴量を算出する（ステップＳ９０２）。 In the smart glasses 100A of the present embodiment, the information processing terminal 200A performs a process of detecting an utterance period using the voice utterance period detection unit 261 (step S901). Subsequently, the information processing terminal 200A uses the acoustic feature quantity calculation unit 263 to calculate the acoustic feature quantity from the voice data acquired in the speech period (step S902).

続いて、情報処理端末２００Ａは、話者埋め込み情報算出部２７１により、発話区間に発話した人物の話者埋め込み情報を算出する（ステップＳ９０３）。 Next, the information processing terminal 200A uses the speaker-embedded information calculation unit 271 to calculate the speaker-embedded information of the person who spoke in the utterance period (step S903).

続いて、情報処理端末２００Ａは、画面外話者推定部２７３により、現在の注目話者の画像が、映像入力部２３１により取得された画像データが示す画像に含まれるか否かを判定する（ステップＳ９０４）。つまり、ここでは、注目話者が、装着者の視線方向に留まっているか否かを判定している。 Next, the information processing terminal 200A uses the off-screen speaker estimation unit 273 to determine whether or not the current image of the speaker of interest is included in the image indicated by the image data acquired by the video input unit 231 ( step S904). That is, here, it is determined whether or not the speaker of interest stays in the line-of-sight direction of the wearer.

なお、このとき、画面外話者推定部２７３は、例えば、映像入力部２３１により取得された画像データが示す画像に対して顔認識処理を行い、注目話者の顔画像が含まれるか否かを判定してもよい。 At this time, the off-screen speaker estimation unit 273 may, for example, perform face recognition processing on the image indicated by the image data acquired by the video input unit 231 and determine whether or not the face image of the speaker of interest is included.

ステップＳ９０４において、注目話者の画像が含まれない場合、情報処理端末２００Ａは、後述するステップＳ９０９へ進む。 If the image of the speaker of interest is not included in step S904, the information processing terminal 200A proceeds to step S909, which will be described later.

注目話者の画像が含まれない場合とは、注目話者が移動したり、装着者が頭の向きを変えることにより、注目話者がスマートグラス１００Ａの装着者の視界から消える、又は、視界の隅へ移動することを示す。 The case where the image of the speaker of interest is not included means that the speaker of interest has disappeared from the field of view of the wearer of the smart glasses 100A, or has moved to a corner of the field of view, because the speaker of interest moved or the wearer turned their head.

ステップＳ９０４において、注目話者の画像が含まれる場合、情報処理端末２００Ａは、口唇領域抽出部２４５により、注目話者の口唇領域が検出されたか否かを判定する（ステップＳ９０５）。ステップＳ９０５において、口唇領域が検出されない場合、情報処理端末２００Ａは、後述するステップＳ９１１へ進む。 If the image of the speaker of interest is included in step S904, the information processing terminal 200A determines whether or not the lip area of the speaker of interest has been detected by the lip region extraction unit 245 (step S905). In step S905, if the lip region is not detected, the information processing terminal 200A proceeds to step S911, which will be described later.

ステップＳ９０５において、口唇領域が検出された場合、情報処理端末２００Ａは、口唇特徴量取得部２５０により、口唇領域から抽出された画像から、口唇特徴量を算出し（ステップＳ９０６）、ステップＳ９０７へ進む。 In step S905, when the lip area is detected, the information processing terminal 200A uses the lip feature amount acquisition unit 250 to calculate the lip feature amount from the image extracted from the lip area (step S906), and proceeds to step S907. .

図９のステップＳ９０７とステップＳ９０８の処理は、図６のステップＳ６１４とステップＳ６１５の処理と同様であるから、説明を省略する。 The processing in steps S907 and S908 in FIG. 9 is the same as the processing in steps S614 and S615 in FIG. 6, so description thereof will be omitted.

ステップＳ９０４において、注目話者が画像に含まれない場合、情報処理端末２００Ａは、人物識別部２７０の画面外話者推定部２７３により、注目話者が画像に含まれなくなってから、１０秒未満であるか否かを判定する（ステップＳ９０９）。なお、１０秒は、予め設定される設定時間の一例であり、これに限定されるものではない。 In step S904, if the speaker of interest is not included in the image, the information processing terminal 200A uses the off-screen speaker estimation unit 273 of the person identification unit 270 to determine whether less than 10 seconds have elapsed since the speaker of interest ceased to be included in the image (step S909). Note that 10 seconds is one example of a preset time, and the set time is not limited to this value.

ステップＳ９０９において、１０秒未満である場合、画面外話者推定部２７３は、ステップＳ９０３で算出した話者埋め込み情報が、注目話者の話者埋め込み情報と一致するか否かを判定する（ステップＳ９１０）。注目話者の話者埋め込み情報とは、図６のステップＳ６１０で算出される話者埋め込み情報である。 If less than 10 seconds have elapsed in step S909, the off-screen speaker estimation unit 273 determines whether or not the speaker-embedded information calculated in step S903 matches the speaker-embedded information of the speaker of interest (step S910). The speaker-embedded information of the speaker of interest is the speaker-embedded information calculated in step S610 of FIG. 6.

ここで、画面外話者推定部２７３は、例えば、２つの話者埋め込み情報のコサイン類似度等を算出し、算出した値が所定の閾値以上である場合に、両者が一致するものとしてもよい。 Here, the off-screen speaker estimation unit 273 may, for example, calculate the cosine similarity of two speaker embedding information, etc., and determine that the two match when the calculated value is equal to or greater than a predetermined threshold. .
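The comparison in step S910 can be illustrated with a plain cosine-similarity check. This is a sketch only: the function names are hypothetical, the threshold of 0.7 is an arbitrary placeholder (the source says only "a predetermined threshold"), and treating a zero vector as a non-match is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def is_same_speaker(a: Sequence[float], b: Sequence[float], threshold: float = 0.7) -> bool:
    """True when the similarity is at or above the (placeholder) threshold."""
    return cosine_similarity(a, b) >= threshold
```

Cosine similarity depends only on the direction of the vectors, so two embeddings that differ only in scale are treated as a match.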

ステップＳ９１０において、両者が一致している場合、情報処理端末２００Ａは、音声認識部２９０により、ステップＳ９０２で算出された音響特徴量を用いた音声発話認識を行い（ステップＳ９１１）、ステップＳ９０８へ進む。 If both match in step S910, the information processing terminal 200A causes the speech recognition unit 290 to perform speech utterance recognition using the acoustic feature amount calculated in step S902 (step S911), and proceeds to step S908. .

ステップＳ９１０において、両者が一致していない場合、情報処理端末２００Ａは、後述するステップＳ９１２に進む。 If the two do not match in step S910, the information processing terminal 200A proceeds to step S912, which will be described later.

ステップＳ９０９において、１０秒未満でない場合、つまり、注目話者がスマートグラス１００Ａの装着者の視線方向から外れてから１０秒以上が経過した場合、情報処理端末２００Ａは、注目話者の決定を解除し（ステップＳ９１２）、処理を終了する。 In step S909, if it is not less than 10 seconds, that is, if 10 seconds or more have passed since the speaker of interest left the line-of-sight direction of the wearer of the smart glasses 100A, the information processing terminal 200A cancels the determination of the speaker of interest. (step S912), and the process ends.

言い換えれば、本実施形態では、映像入力部２３１により取得された画像データが示す画像から、注目話者の顔画像が検出されない状態が設定時間以上継続した場合に、注目話者の決定を解除する。 In other words, in this embodiment, when the face image of the target speaker is not detected from the image represented by the image data acquired by the video input unit 231 for a set time or longer, the determination of the target speaker is canceled. .

注目話者の決定を解除することとは、言い換えれば、注目話者が決定された状態から、注目話者が選択されていない初期状態に戻ることを示す。 Canceling the determination of the speaker of interest means, in other words, returning from the state in which the speaker of interest has been determined to the initial state in which the speaker of interest has not been selected.
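The decisions of FIG. 9 (steps S904, S905, S909, and S910) can be condensed into a single function, sketched below. The `Action` enum, the function name, and the parameter names are hypothetical; the default timeout of 10 seconds is the example value given in the source. Following the description, a voice mismatch while the speaker is off screen leads to release (S912).

```python
from enum import Enum


class Action(Enum):
    MULTIMODAL = "multimodal recognition (S906-S907)"
    AUDIO_ONLY = "audio-only recognition (S911)"
    RELEASE = "release the speaker of interest (S912)"


def next_action(
    face_visible: bool,
    lip_detected: bool,
    seconds_off_screen: float,
    voice_matches: bool,
    timeout_s: float = 10.0,
) -> Action:
    """Choose the next step once a speaker of interest has been determined."""
    if face_visible:
        # S905: with lips use both modalities, otherwise audio only.
        return Action.MULTIMODAL if lip_detected else Action.AUDIO_ONLY
    # S909/S910: off screen; keep recognizing only within the timeout
    # and only if the voice still matches the stored speaker embedding.
    if seconds_off_screen < timeout_s and voice_matches:
        return Action.AUDIO_ONLY
    return Action.RELEASE
```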

本実施形態では、このように、注目話者がスマートグラス１００Ａの装着者の視線方向から一時的に外れた場合であっても、音声から注目話者の発話であるか否かを判定し、発話内容の認識結果をディスプレイ１３０に表示させることができる。 In this embodiment, in this way, even if the speaker of interest temporarily deviates from the line-of-sight direction of the wearer of the smart glasses 100A, it is determined from the voice whether or not it is the utterance of the speaker of interest, The recognition result of the utterance content can be displayed on the display 130 .

以下に、図１０乃至図１７を参照し、スマートグラス１００Ａの動作の事例について説明する。 Examples of operations of the smart glasses 100A will be described below with reference to FIGS. 10 to 17 .

図１０は、スマートグラスの動作の事例を説明する第一の図である。図１０では、人物Ａがスマートグラス１００Ａの装着者Ｐの視線方向に位置しており、カメラ１１０が撮像した画像において人物Ａのみの顔領域が検出される状態を示す。 FIG. 10 is a first diagram illustrating an example of operation of smart glasses. FIG. 10 shows a state in which person A is positioned in the line-of-sight direction of person P wearing smart glasses 100A, and only the face area of person A is detected in the image captured by camera 110 .

この場合、スマートグラス１００Ａは、カメラ１１０が撮像した画像から１つの顔領域を検出し、この顔領域と対応する人物Ａを注目話者に特定する。そして、スマートグラス１００Ａは、口唇領域２２を検出し、音響特徴量と口唇特徴量とを用いたマルチモーダル発話認識処理を行い、認識結果のテキストデータ２３をディスプレイ１３０に表示させる。 In this case, the smart glasses 100A detect one face region from the image captured by the camera 110, and identify person A corresponding to this face region as the speaker of interest. The smart glasses 100A detect the lip region 22 , perform multimodal speech recognition processing using the acoustic feature amount and the lip feature amount, and display text data 23 of the recognition result on the display 130 .

図１１は、スマートグラスの動作の事例を説明する第二の図である。図１１では、注目話者とされた人物Ａがスマートグラス１００Ａの装着者Ｐの視線方向から外れてから、所定の設定時間内（例えば１０秒）である状態を示す。 FIG. 11 is a second diagram illustrating an example of operation of the smart glasses. FIG. 11 shows a state in which person A, who is the speaker of interest, has left the line-of-sight direction of the wearer P of the smart glasses 100A, and the predetermined set time (for example, 10 seconds) has not yet elapsed.

この場合、スマートグラス１００Ａは、人物Ａの音声データのみで、人物Ａを注目話者と判定し、音声データから算出した音響特徴量を用いた音声認識処理を行い、認識結果のテキストデータ２３ａをディスプレイ１３０に表示させる。 In this case, the smart glasses 100A determine that person A is the speaker of interest based only on the voice data of person A, perform speech recognition processing using the acoustic feature amount calculated from the voice data, and display the text data 23a of the recognition result on the display 130.

図１２は、スマートグラスの動作の事例を説明する第三の図である。図１２では、注目話者とされた人物Ａがスマートグラス１００Ａの装着者Ｐの視線方向から外れてから所定の設定時間以上が経過した状態を示す。 FIG. 12 is a third diagram illustrating an example of operation of smart glasses. FIG. 12 shows a state in which a predetermined set time or more has elapsed after person A, who is the speaker of interest, has left the line-of-sight direction of person P wearing smart glasses 100A.

この場合、スマートグラス１００Ａは、人物Ａに対する注目話者の決定を解除し、注目話者が決定されていない初期状態に戻る。したがって、ディスプレイ１３０には何も表示されない。 In this case, the smart glasses 100A cancel the determination of the target speaker for person A, and return to the initial state in which the target speaker is not determined. Therefore, nothing is displayed on the display 130 .

図１３は、スマートグラスの動作の事例を説明する第四の図である。図１３では、人物Ａがスマートグラス１００Ａの装着者Ｐの視線方向に位置しており、カメラ１１０が撮像した画像において人物Ａのみの顔領域が検出され、且つ、人物Ａの口唇領域が検出されない状態を示す。 FIG. 13 is a fourth diagram illustrating an example of operation of smart glasses. In FIG. 13, person A is positioned in the line-of-sight direction of person P wearing smart glasses 100A, and in the image captured by camera 110, only the face area of person A is detected, and the lip area of person A is not detected. Indicates status.

この場合、スマートグラス１００Ａは、カメラ１１０が撮像した画像から１つの顔領域を検出し、この顔領域と対応する人物Ａを注目話者に特定する。また、スマートグラス１００Ａは、人物Ａの口唇領域が検出されないため、音声データから算出した音響特徴量を用いた音声認識処理を行い、認識結果のテキストデータ２３ａをディスプレイ１３０に表示させる。 In this case, the smart glasses 100A detect one face region from the image captured by the camera 110, and identify person A corresponding to this face region as the speaker of interest. In addition, since the lip region of person A is not detected, the smart glasses 100A perform speech recognition processing using the acoustic feature amount calculated from the speech data, and display the text data 23a of the recognition result on the display 130.

図１４は、スマートグラスの動作の事例を説明する第五の図である。図１４では、カメラ１１０が撮像した画像において人物Ａと人物Ｂの顔領域が検出された状態を示す。 FIG. 14 is a fifth diagram explaining an example of operation of smart glasses. FIG. 14 shows a state in which face areas of person A and person B are detected in an image captured by camera 110 .

この場合、スマートグラス１００Ａは、人物Ａの口唇領域の中心点のｘ座標と、カメラ１１０が撮像した画像の中心点のｘ座標との距離と、人物Ｂの口唇領域の中心点のｘ座標と、カメラ１１０が撮像した画像の中心点のｘ座標との距離とを算出する。 In this case, the smart glasses 100A calculate two distances: the distance between the x-coordinate of the center point of the lip region of person A and the x-coordinate of the center point of the image captured by the camera 110, and the distance between the x-coordinate of the center point of the lip region of person B and the x-coordinate of the center point of the image captured by the camera 110.

次に、スマートグラス１００Ａは、距離が小さい方の人物を注目話者に決定する。図１４では、人物Ａを注目話者に決定する。そして、スマートグラス１００Ａは、人物Ａの口唇領域２２を検出し、音響特徴量と口唇特徴量とを用いたマルチモーダル発話認識処理を行い、認識結果のテキストデータ２３をディスプレイ１３０に表示させる。 Next, the smart glasses 100A determine the person with the shortest distance as the speaker of interest. In FIG. 14, person A is determined as the target speaker. Then, the smart glasses 100A detect the lip region 22 of the person A, perform multimodal speech recognition processing using the acoustic feature amount and the lip feature amount, and display the text data 23 of the recognition result on the display 130.

図１５は、スマートグラスの動作の事例を説明する第六の図である。図１５では、カメラ１１０が撮像した画像において人物Ａと人物Ｂのうち、カメラ１１０が撮像した画像の中心点に近い人物が、装着者Ｐの頭の動き等により、人物Ａから人物Ｂに変わった場合を示す。 FIG. 15 is a sixth diagram illustrating an example of operation of smart glasses. In FIG. 15, in the image captured by the camera 110, between the person A and the person B, the person closer to the center point of the image captured by the camera 110 changes from the person A to the person B due to the movement of the wearer P's head. indicates the case.

この場合、スマートグラス１００Ａは、人物Ｂの口唇領域２２Ｂを検出し、音響特徴量と口唇特徴量とを用いたマルチモーダル発話認識処理を行い、認識結果のテキストデータ２３Ｂをディスプレイ１３０に表示させる。 In this case, the smart glasses 100A detect the lip region 22B of the person B, perform multimodal speech recognition processing using the acoustic feature amount and the lip feature amount, and display text data 23B of the recognition result on the display 130.

図１６は、スマートグラスの動作の事例を説明する第七の図である。図１６では、カメラ１１０が撮像した画像において人物Ａと人物Ｂのうち、カメラ１１０が撮像した画像の中心点に近い人物Ａの口元が隠されている状態を示す。 FIG. 16 is a seventh diagram illustrating an example of operation of the smart glasses. FIG. 16 shows a state in which, of person A and person B in the image captured by the camera 110, the mouth of person A, who is the one closer to the center point of the image, is hidden.

この場合、スマートグラス１００Ａは、人物Ａの口唇領域の中心点のｘ座標の代わりに、人物Ａの顔画像の一部の領域の中心点のｘ座標を求め、このｘ座標と、カメラ１１０が撮像した画像の中心点のｘ座標との距離を算出する。次に、スマートグラス１００Ａは、この距離に基づき、人物Ａを注目話者に決定する。 In this case, the smart glasses 100A obtain the x-coordinate of the central point of the partial area of the facial image of the person A instead of the x-coordinate of the central point of the lip area of the person A, and the x-coordinate and the camera 110 A distance from the x-coordinate of the center point of the captured image is calculated. Next, the smart glasses 100A determine Person A as the speaker of interest based on this distance.

そして、スマートグラス１００Ａは、人物Ａの音声データから算出した音響特徴量を用いた音声認識処理を行い、認識結果のテキストデータ２３ａをディスプレイ１３０に表示させる。 Then, the smart glasses 100A perform speech recognition processing using the acoustic feature amount calculated from the speech data of the person A, and display the text data 23 a of the recognition result on the display 130 .

このように、本実施形態では、スマートグラス１００Ａの装着者の視線方向に複数の人物が存在する場合や、装着者が注目している人物の口元が隠れている場合等であっても、注目話者の発話内容を示すテキストデータをディスプレイ１３０に表示させることができる。 As described above, in the present embodiment, even when a plurality of people are present in the line-of-sight direction of the wearer of the smart glasses 100A, or when the mouth of the person to whom the wearer is paying attention is hidden, text data indicating the utterance content of the speaker of interest can be displayed on the display 130.

次に、図１７を参照して、本実施形態のスマートグラス１００Ａの有する発話内容の認識について説明する。図１７は、第一の実施形態の発話内容の認識について説明する図である。 Next, with reference to FIG. 17, the recognition of utterance content performed by the smart glasses 100A of the present embodiment will be described. FIG. 17 is a diagram explaining recognition of utterance content according to the first embodiment.

The speech recognition processing unit 230 of the smart glasses 100A of the present embodiment inputs the moving image extracted from the lip region of the person identified as the speaker of interest to the lip feature amount calculation unit 252 and acquires the lip feature amount 171. In the present embodiment, the speech waveform of the person identified as the speaker of interest is also input to the acoustic feature amount calculation unit 263 to acquire the acoustic feature amount 172.

The speech recognition processing unit 230 then combines the lip feature amount 171 and the acoustic feature amount 172 in the feature amount integration unit 281 to obtain the multimodal feature amount 173.

Next, the speech recognition processing unit 230 inputs the multimodal feature amount 173 to the multimodal utterance content recognition unit 282, generates text data indicating the utterance content using the multimodal utterance content recognition model 283, and outputs the text data to the utterance content recognition result output unit 233.

In the present embodiment, when the image within the lip region cannot be used, for example because the lips are hidden or because the image captured by the camera 110 does not include the speaker of interest, the acoustic feature amount 172 extracted by the acoustic feature amount calculation unit 263 is input to the speech utterance content recognition unit 291. The speech utterance content recognition unit 291 generates text data indicating the utterance content using the speech utterance content recognition model 292, and outputs the text data to the utterance content recognition result output unit 233.

Thus, in the present embodiment, the method of utterance content recognition is switched depending on whether the lip region of the speaker of interest can be detected, which can improve the accuracy of speech recognition.
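The switching between the two recognition paths can be sketched as follows. This is only an illustration under stated assumptions: the description says the two feature amounts are combined but does not specify how, so per-time-step concatenation of already aligned streams is assumed here, and the two models are represented by hypothetical callables rather than trained neural networks.

```python
from typing import Callable, List, Optional

# Hypothetical representation: one feature vector per time step.
Features = List[List[float]]
Recognizer = Callable[[Features], str]


def integrate_features(lip: Features, acoustic: Features) -> Features:
    """Concatenate lip and acoustic vectors per time step.

    Assumes both streams were already aligned to the same number of steps.
    """
    if len(lip) != len(acoustic):
        raise ValueError("lip and acoustic features must have the same length")
    return [l + a for l, a in zip(lip, acoustic)]


def recognize_utterance(
    acoustic: Features,
    lip: Optional[Features],
    multimodal_model: Recognizer,
    audio_model: Recognizer,
) -> str:
    """Use the multimodal path when lip features exist, else the audio-only path."""
    if lip is not None:
        return multimodal_model(integrate_features(lip, acoustic))
    return audio_model(acoustic)
```

Passing `lip=None` models the case where the lip region could not be detected; the caller does not need to know which model produced the text.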

In the present embodiment, the lip feature amount calculation model 253, the multimodal utterance content recognition model 283, and the speech utterance content recognition model 292 are trained models obtained by training neural networks using moving image data of the lip region, speech data, and correct-answer text data as training data.

In the present embodiment, speech data is acquired and utterance content is recognized for each utterance section, but the embodiment is not limited to this. For example, when speech data of a plurality of persons is acquired at the same time, only the speech data of the speaker of interest may be selected based on the face image of the person detected from the image data.

(Second embodiment)
A second embodiment will be described below with reference to the drawings. The second embodiment is a translation system to which the smart glasses 100A of the first embodiment are applied.

FIG. 18 is a diagram showing an example of the system configuration of the translation system of the second embodiment. The translation system 500 of the present embodiment includes the smart glasses 100A and an automatic translation device 700. The smart glasses 100A and the automatic translation device 700 are connected, for example, via a network.

When the automatic translation device 700 of the present embodiment receives text data in a first language and a language selection, it translates the text data in the first language into the selected language (second language) and outputs text data in the second language.

In the translation system 500 shown in FIG. 18, the smart glasses 100A transmit to the automatic translation device 700, as text data in the first language, the text data obtained by recognizing the utterance content of the speaker of interest based on the image data and the speech data. The smart glasses 100A may accept the selection of the second language in advance. In that case, the smart glasses 100A transmit information indicating the second language to the automatic translation device 700 together with the text data in the first language.

The automatic translation device 700 receives the text data in the first language and the information indicating the second language, converts the text data in the first language into text data in the second language, and transmits the result to the smart glasses 100A.

The smart glasses 100A display the text data in the second language received from the automatic translation device 700 on the display 130.
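The exchange between the smart glasses and the automatic translation device can be sketched as a simple request and response. Everything below is hypothetical: the class names, the message type, and the injected `translate_fn` are illustrative stand-ins, and the sketch replaces the network connection with a direct method call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TranslationRequest:
    text: str             # text data in the first language
    target_language: str  # information indicating the second language


class AutomaticTranslationDevice:
    """Stand-in for the translation device; the actual translator is injected."""

    def __init__(self, translate_fn: Callable[[str, str], str]) -> None:
        self._translate_fn = translate_fn

    def handle(self, request: TranslationRequest) -> str:
        return self._translate_fn(request.text, request.target_language)


class SmartGlasses:
    """Sends recognized text for translation and records what would be displayed."""

    def __init__(self, translator: AutomaticTranslationDevice, target_language: str) -> None:
        self._translator = translator
        self._target_language = target_language  # second language selected in advance
        self.displayed: List[str] = []

    def on_recognized(self, text: str) -> str:
        request = TranslationRequest(text, self._target_language)
        translated = self._translator.handle(request)
        self.displayed.append(translated)  # stands in for output to the display
        return translated
```

For experimentation, `translate_fn` can be a toy lookup table; a real deployment would call an actual translation service over the network.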

In the present embodiment, linking the smart glasses 100A and the automatic translation device 700 in this way makes it possible to display the utterance content of the speaker of interest to the wearer of the smart glasses 100A in a second language different from the first language used by the speaker of interest.

(Third embodiment)
A third embodiment will be described below with reference to the drawings. The third embodiment is a minutes creation system to which the smart glasses 100A of the first embodiment are applied.

FIG. 19 is a diagram showing an example of the system configuration of the utterance content recording system of the third embodiment. The utterance content recording system 600 of the present embodiment includes the smart glasses 100A and an utterance content recording device 700A. The smart glasses 100A and the utterance content recording device 700A are connected, for example, via a network.

In the present embodiment, the smart glasses 100A are used, for example, to retain the utterance content of a teacher as text data during a lecture at an educational institution. In this case, the wearer of the smart glasses 100A can store the teacher's utterance content as text data in the storage unit of the utterance content recording device 700A simply by directing his or her line of sight toward the teacher T giving the lecture.

Note that the present embodiment can also be used to save the utterance content of a specific person as text data when, for example, a plurality of persons are present on a stage set up in a space such as an auditorium.

In the present embodiment, linking the smart glasses 100A with the utterance content recording device 700A in this way makes it possible to save only the utterance content of the speaker of interest as text data, even in a scene where, for example, a plurality of persons speak in random order.
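The selective recording described above can be sketched as a filter in front of the storage unit. The types and names below are hypothetical, and the sketch assumes that each recognized utterance already carries an identifier of the person who spoke it.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class RecognizedUtterance:
    speaker_id: str  # hypothetical identifier of the person who spoke
    text: str        # recognized utterance content


@dataclass
class UtteranceContentRecorder:
    """Stand-in for the recording device; a list plays the role of the storage unit."""

    stored: List[str] = field(default_factory=list)

    def store(self, text: str) -> None:
        self.stored.append(text)


def record_speaker_of_interest(
    utterances: Iterable[RecognizedUtterance],
    speaker_of_interest: str,
    recorder: UtteranceContentRecorder,
) -> None:
    """Store only the utterances spoken by the speaker of interest, in order."""
    for utterance in utterances:
        if utterance.speaker_id == speaker_of_interest:
            recorder.store(utterance.text)
```

Even when several people speak in an arbitrary order, only the text attributed to the chosen speaker reaches the recorder.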

Note that the smart glasses 100A can also be applied to uses other than the embodiments described above. For example, the smart glasses 100A are useful when the wearer P has a hearing impairment.

Each function of the embodiments described above can be implemented by one or more processing circuits. Here, a "processing circuit" in this specification includes a processor programmed by software to execute each function, such as a processor implemented by an electronic circuit, and devices designed to execute each function described above, such as an ASIC (Application Specific Integrated Circuit), a DSP (digital signal processor), an FPGA (field programmable gate array), and conventional circuit modules.

The group of devices described in the embodiments merely represents one of a plurality of computing environments for implementing the embodiments disclosed in this specification.

In one embodiment, the information processing device 200 (information processing terminal 200A) includes a plurality of computing devices, such as a server cluster. The plurality of computing devices are configured to communicate with each other via any type of communication link, including a network or shared memory, and perform the processing disclosed in this specification. Similarly, the information processing device 200 can include a plurality of computing devices configured to communicate with each other.

Furthermore, the information processing device 200 can be configured to share the disclosed processing steps in various combinations. For example, a process executed by the information processing device 200 can be executed by another information processing device. Similarly, the functions of the information processing device 200 can be executed by another information processing device. The elements of the information processing device and the other information processing device may be combined into a single information processing device or divided among a plurality of devices.

The present invention has been described above based on the embodiments, but the present invention is not limited to the requirements shown in those embodiments. These points can be changed without departing from the gist of the present invention and can be determined appropriately according to the form of application.

100 speech recognition system
100A smart glasses
110 camera
120 microphone
130 display
200 information processing device
200A information processing terminal
230 speech recognition processing unit
231 video input unit
232 audio input unit
233 utterance content recognition result output unit
240 speaker-of-interest identification unit
250 lip feature amount acquisition unit
260 acoustic feature amount acquisition unit
270 person identification unit
280 multimodal recognition unit
290 speech recognition unit

Japanese Patent Application Laid-Open No. 2021-026050

Claims

1. A speech recognition device comprising:
a speaker-of-interest determination unit that determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data; and
an utterance content recognition result output unit that causes a display device to display text data converted from the speech data of the speaker of interest among the speech data included in the video data.

2. The speech recognition device according to claim 1, further comprising:
a lip region extraction unit that detects a lip region including an image of the lip portion of the person from the face image of the person identified as the speaker of interest; and
a first speech recognition unit that converts the speech data of the speaker of interest into text data using the image data representing the image within the lip region and the speech data of the speaker of interest.

3. The speech recognition device according to claim 2, further comprising a second speech recognition unit that, when the lip region is not detected from the face image, converts the speech data of the speaker of interest into text data using the speech data of the speaker of interest.

4. The speech recognition device according to any one of claims 1 to 3, wherein the display device is a glasses-type display device provided with an imaging device that captures an image in the line-of-sight direction of the wearer of the glasses-type display device, and
the video data is video data acquired by the imaging device.

5. The speech recognition device according to any one of claims 1 to 4, wherein, when a plurality of face images are detected from the image represented by the image data included in the video data, the speaker-of-interest determination unit sets, as the face image of the speaker of interest, the face image for which the distance between the center point of a partial region of the face image and the center point of the image represented by the image data included in the video data is smallest.

6. The speech recognition device according to any one of claims 1 to 5, further comprising:
a speaker embedding information calculation unit that calculates speaker embedding information for identifying the person who uttered the speech data; and
an off-screen speaker estimation unit that, when the face image of the speaker of interest is not detected from the image represented by the image data included in the video data, estimates the speaker of the speech data based on the speaker embedding information calculated from the speech data.

7. The speech recognition device according to claim 6, wherein the off-screen speaker estimation unit determines that the speech data is speech data of the speaker of interest when the period during which the face image of the speaker of interest is not detected from the image represented by the image data included in the video data is less than a predetermined set time, and the speaker embedding information calculated from the speech data acquired while the face image of the speaker of interest is not detected matches the speaker embedding information of the speaker of interest.

8. The speech recognition device according to claim 7, wherein the off-screen speaker estimation unit cancels the determination of the speaker of interest when the face image of the speaker of interest is not detected from the image represented by the image data included in the video data for the predetermined set time or longer.

9. A speech recognition system comprising an information processing device, an imaging device capable of communicating with the information processing device, and a display device capable of communicating with the information processing device,
wherein the information processing device includes:
a speaker-of-interest determination unit that determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data acquired by the imaging device; and
an utterance content recognition result output unit that causes the display device to display text data converted from the speech data of the speaker of interest among the speech data included in the video data.

10. A speech recognition method performed by a computer, wherein the computer:
determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data; and
causes a display device to display text data converted from the speech data of the speaker of interest among the speech data included in the video data.

11. A speech recognition program for causing a computer to execute processing of:
determining a speaker of interest based on a face image of a person detected from an image represented by image data included in video data; and
causing a display device to display text data converted from the speech data of the speaker of interest among the speech data included in the video data.

12. Smart glasses comprising an information processing terminal, an imaging device connected to the information processing terminal, and a display device connected to the information processing terminal,
wherein the information processing terminal includes:
a speaker-of-interest determination unit that determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data acquired by the imaging device; and
an utterance content recognition result output unit that causes the display device to display text data converted from the speech data of the speaker of interest among the speech data included in the video data.

13. A translation system comprising: smart glasses having an information processing terminal, an imaging device connected to the information processing terminal, and a display device connected to the information processing terminal; and a translation device capable of communicating with the smart glasses,
wherein the information processing terminal included in the smart glasses includes:
a speaker-of-interest determination unit that determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data acquired by the imaging device; and
an utterance content recognition result output unit that outputs, to the translation device, text data in a first language converted from the speech data of the speaker of interest among the speech data included in the video data, and
wherein text data in a second language, translated by the translation device from the text data in the first language, is displayed on the display device.

14. An utterance content recording system comprising: smart glasses having an information processing terminal, an imaging device connected to the information processing terminal, and a display device connected to the information processing terminal; and an utterance content recording device capable of communicating with the smart glasses,
wherein the information processing terminal included in the smart glasses includes:
a speaker-of-interest determination unit that determines a speaker of interest based on a face image of a person detected from an image represented by image data included in video data acquired by the imaging device; and
an utterance content recognition result output unit that causes the display device to display text data converted from the speech data of the speaker of interest among the speech data included in the video data, and outputs the text data to the utterance content recording device, and
wherein the utterance content recording device has a storage unit that stores the text data output from the information processing terminal.