WO2025182639A1

WO2025182639A1 - Information processing device, information processing method, and information processing system

Info

Publication number: WO2025182639A1
Application number: PCT/JP2025/005168
Authority: WO
Inventors: 菜琳岡▲崎▼; 光高鳥; 康夫川端; 裕高瀬
Original assignee: Sony Group Corp
Current assignee: Sony Group Corp
Priority date: 2024-02-27
Filing date: 2025-02-17
Publication date: 2025-09-04
Anticipated expiration: 2026-08-27

Abstract

Provided are an information processing device, an information processing method, and an information processing system which make it possible to promote a smoother dialogue. A human sensing unit determines, on the basis of images captured by a plurality of cameras of speakers present in a predetermined space, whether each speaker participates in a conversation. A clarity adjustment unit adjusts the clarity of speech voice of the speaker determined to participate in the conversation by the human sensing unit, the speech voice being included in voice in the predetermined space, which voice is collected by a microphone. The transmission unit transmits the voice in the predetermined space processed by the clarity adjustment unit.

Description

Information processing device, information processing method, and information processing system

　本開示は、情報処理装置、情報処理方法及び情報処理システムに関する。 This disclosure relates to an information processing device, an information processing method, and an information processing system.

　複数の人がいる遠隔エリア同士で遠隔コミュニケーションを行う遠隔コミュニケーションシステムが知られている。このような遠隔コミュニケーションシステムでは、ある遠隔エリアにおける人と、他の遠隔エリアにおける人とが、遠隔コミュニケーションシステムのマイク、スピーカー、カメラ等を使用して遠隔コミュニケーションを行う。 Remote communication systems are known that enable remote communication between multiple people in remote areas. In such remote communication systems, people in one remote area communicate with people in another remote area using the microphone, speaker, camera, etc. of the remote communication system.

特開２０１０－１９１５４４号公報JP 2010-191544 A

　遠隔コミュニケーションシステムは、ある遠隔エリアに配置されたマイクを用いて発話者の音声を収音し、収音した音声を他の遠隔エリアに配置されたスピーカーから再生する。 The remote communication system uses a microphone placed in one remote area to pick up the speaker's voice, and then plays the picked-up voice from a speaker placed in another remote area.

　しかしながら、単に人の音声を抽出して再生する方法では、会話に関係しない人の声まで強調されるため、対話相手との会話に集中することが困難になるという問題がある。さらに、会話を強調するために周囲の音は完全に消してしまうと、相手空間の状況が把握し難い場合や、会話が無い場合の雰囲気が不自然になるといった問題がある。 However, simply extracting and playing back human voices has the problem that even voices unrelated to the conversation are emphasized, making it difficult to concentrate on the conversation with the other person. Furthermore, completely muting surrounding sounds to emphasize the conversation can make it difficult to grasp the situation in the other person's space, and the atmosphere can become unnatural when there is no conversation.

　本開示は、上記実情に鑑みてなされたものであり、対話の円滑化を促進することができる情報処理装置、情報処理方法及び情報処理システムを提供する。 This disclosure has been made in consideration of the above-mentioned circumstances, and provides an information processing device, information processing method, and information processing system that can promote smooth dialogue.

　本開示の情報処理装置は、所定空間に存在する複数の発話者のカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシング部と、マイクにより収音された前記所定空間の音声に含まれる前記人センシング部により判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整部と、前記明瞭度調整部により処理された前記所定空間の音声を送信する送信部とを有する。 The information processing device disclosed herein includes a human sensing unit that determines whether each speaker is participating in a conversation based on video captured by a camera of multiple speakers present in a specified space; a clarity adjustment unit that adjusts the clarity of the speech of speakers who are participating in the conversation and that is included in the audio of the specified space picked up by a microphone and that is determined by the human sensing unit to be included; and a transmission unit that transmits the audio of the specified space processed by the clarity adjustment unit.

遠隔コミュニケーション装置のブロック図である。FIG. 1 is a block diagram of a remote communication device. 遠隔コミュニケーション装置が配置された環境の俯瞰図である。1 is a bird's-eye view of an environment in which a remote communication device is placed. 遠隔コミュニケーション装置が配置された環境の正面図である。FIG. 1 is a front view of an environment in which a remote communication device is placed. 人センシング部の処理を説明するための図である。FIG. 10 is a diagram for explaining processing by a human sensing unit. スクリーン座標から正規化座標への座標変換を示す図である。FIG. 10 is a diagram illustrating a coordinate transformation from screen coordinates to normalized coordinates. 個別音分離処理を示す図である。FIG. 10 is a diagram illustrating individual sound separation processing. 複数対向スピーカーバランシング処理を示す図である。FIG. 10 is a diagram illustrating a multiple facing speaker balancing process. 外野発話音声に対する正規分布関数を用いた処理を示す図である。FIG. 10 is a diagram illustrating processing using a normal distribution function for outfield speech. 第１の実施の形態に係る自分側の遠隔コミュニケーション装置間のデータの流れの概要を示す図である。FIG. 2 is a diagram showing an outline of data flow between the local remote communication devices according to the first embodiment. 送信側処理のフローチャートである。10 is a flowchart of a transmission side process. 人位置判定処理のフローチャートである。10 is a flowchart of a person position determination process. 内野外野判定処理のフローチャートである。10 is a flowchart of an infield/outfield determination process. 個別音分離処理のフローチャートである。10 is a flowchart of an individual sound separation process. 個別音に対する明瞭度調整処理のフローチャートである。10 is a flowchart of a process of adjusting clarity of individual sounds. 複数対向スピーカーバランシング処理のフローチャートである。10 is a flowchart of a multiple opposite speaker balancing process. 背景音調整の処理のフローチャートである。10 is a flowchart of a background sound adjustment process. 内野発話音声調整の処理のフローチャートである。10 is a flowchart of a process for adjusting infield speech sounds. 外野発話音声調整の処理のフローチャートである。10 is a flowchart of a process for adjusting outfield speech sounds. 定位中心スピーカーユニット導出の処理のフローチャートである。10 is a flowchart of a process for deriving a localization center speaker unit. 第２の実施の形態に係る遠隔コミュニケーション装置のブロック図である。FIG. 10 is a block diagram of a remote communication device according to a second embodiment. 第２の実施の形態に係る自分側の遠隔コミュニケーション装置間のデータの流れの概要を示す図である。FIG. 10 is a diagram showing an outline of data flow between the local remote communication devices according to the second embodiment. 第３の実施の形態に係る遠隔コミュニケーション装置のブロック図である。FIG. 10 is a block diagram of a remote communication device according to a third embodiment. 第３の実施の形態に係る自分側の遠隔コミュニケーション装置間のデータの流れの概要を示す図である。FIG. 13 is a diagram showing an outline of data flow between the local remote communication devices according to the third embodiment. 第５の実施の形態に係る遠隔コミュニケーション装置のブロック図である。FIG. 13 is a block diagram of a remote communication device according to a fifth embodiment. 第５の実施の形態に係る遠隔コミュニケーション装置による音声信号処理を説明するための図である。13A and 13B are diagrams for explaining audio signal processing by a remote communication device according to a fifth embodiment. 第６の実施の形態に係る遠隔コミュニケーション装置による音声再生を説明するための図である。FIG. 20 is a diagram for explaining audio reproduction by a remote communication device according to a sixth embodiment. 第７の実施の形態に係る遠隔コミュニケーション装置による音声再生を説明するための図である。FIG. 20 is a diagram for explaining audio reproduction by a remote communication device according to a seventh embodiment. 第１の実施の形態～第７の実施の形態に係る情報処理装置である遠隔コミュニケーション装置の演算装置を実現するコンピュータの一例を示すハードウェア構成図である。FIG. 10 is a hardware configuration diagram showing an example of a computer that realizes an arithmetic unit of a remote communication device that is an information processing device according to the first to seventh embodiments.

　以下、添付図面を参照しながら、本開示の好適な実施の形態について詳細に説明する。本明細書及び図面において、実質的に同一の機能構成を有する構成要素については、同一の符号を付することにより重複説明を省略する。説明は以下の順序で行うものとする。 Below, preferred embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In this specification and drawings, components having substantially the same functional configuration will be assigned the same reference numerals, and duplicate explanations will be omitted. The explanation will be given in the following order.

１．第１の実施の形態に係る遠隔コミュニケーション装置
　１．１．言葉の定義
　　１．１．１．リモート及びローカル
　　１．１．２．個別音
　　１．１．３．内野及び外野
　　１．１．４．複数対向スピーカー
　１．２．送信ユニット
　　１．２．１．人センシング部
　　１．２．２．音声分離部
　　１．２．３．音響信号処理部
　　１．２．４．出力統合部
　　１．２．５．送信部
　１．３．受信ユニット
　１．３．１．受信部
　１．３．２．音声出力制御部
２．第１の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ
３．遠隔コミュニケーション処理
　３．１．送信側処理
　　３．１．１．人位置判定処理
　　３．１．２．内野外野判定処理
　　３．１．３．個別音分離処理
　　３．１．４．明瞭度調整処理
　３．２．複数対向スピーカーバランシング処理（受信側処理）
　　３．２．１．背景音調整
　　３．２．２．内野発話音声調整
　　３．２．３．外野発話音声調整
　　３．２．４．定位中心スピーカーユニット導出
４．効果
５．第２の実施の形態に係る遠隔コミュニケーション装置
　５．１．音響信号処理部
　　５．１．１．背景音分離部
　　５．１．２．明瞭度調整部
　５．２．出力統合部
　５．３．音声出力制御部
６．第２の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ
７．効果
８．第３の実施の形態に係る遠隔コミュニケーション装置
　８．１．音響信号処理部
　　８．１．１．音声合成部
９．第３の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ
１０．効果
１１．第４の実施の形態に係る遠隔コミュニケーション装置
　１１．１．人センシング部
　１１．２．音響信号処理部
　１１．３．その他の音声識別
　１１．４．効果
１２．第５の実施の形態に係る遠隔コミュニケーション装置
　１２．１．人センシング部
　１２．２．音響信号処理部
　１２．３．効果
１３．第６の実施の形態に係る遠隔コミュニケーション装置
　１３．１．音声出力制御部
　１３．２．効果
１４．第７の実施の形態に係る遠隔コミュニケーション装置
　１４．１．人センシング部
　１４．２．音響信号処理部
　１４．３．音声出力制御部
　１４．４．効果
１５．ハードウェア構成 1. Remote communication device according to the first embodiment 1.1. Definition of terms 1.1.1. Remote and local 1.1.2. Individual sound 1.1.3. Infield and outfield 1.1.4. Multiple facing speakers 1.2. Transmission unit 1.2.1. Human sensing unit 1.2.2. Audio separation unit 1.2.3. Acoustic signal processing unit 1.2.4. Output integration unit 1.2.5. Transmission unit 1.3. Receiving unit 1.3.1. Receiving unit 1.3.2. Audio output control unit 2. Data flow between remote communication devices according to the first embodiment 3. Remote communication processing 3.1. Transmission-side processing 3.1.1. Human position determination processing 3.1.2. Infield/outfield determination processing 3.1.3. Individual sound separation processing 3.1.4. Clarity adjustment processing 3.2. Multiple opposing speaker balancing processing (receiving side processing)
3.2.1. Background sound adjustment 3.2.2. Infield speech sound adjustment 3.2.3. Outfield speech sound adjustment 3.2.4. Localization center speaker unit derivation 4. Effect 5. Remote communication device according to the second embodiment 5.1. Acoustic signal processing unit 5.1.1. Background sound separation unit 5.1.2. Clarity adjustment unit 5.2. Output integration unit 5.3. Audio output control unit 6. Data flow between remote communication devices according to the second embodiment 7. Effect 8. Remote communication device according to the third embodiment 8.1. Acoustic signal processing unit 8.1.1. Speech synthesis unit 9. Data flow between remote communication devices according to the third embodiment 10. Effect 11. Remote communication device according to the fourth embodiment 11.1. Human sensing unit 11.2. Acoustic signal processing unit 11.3. Other voice recognition 11.4. Effect 12. 12. Remote communication device according to the fifth embodiment 12.1. Human sensing unit 12.2. Acoustic signal processing unit 12.3. Effect 13. Remote communication device according to the sixth embodiment 13.1. Audio output control unit 13.2. Effect 14. Remote communication device according to the seventh embodiment 14.1. Human sensing unit 14.2. Acoustic signal processing unit 14.3. Audio output control unit 14.4. Effect 15. Hardware configuration

＜１．第１の実施の形態に係る遠隔コミュニケーション装置＞
　従来の遠隔コミュニケーションシステムでは、複数人が同時に発話すると、再生されるときに音声が混合することで聞き取りづらかったり、聞きたい人の声が勝手に抑圧されたりして不自然と感じられる場合があった。また、従来の遠隔コミュニケーションシステムでは、発話以外の音はカットされることが多かった。 1. Remote communication device according to the first embodiment
In conventional remote communication systems, when multiple people speak at the same time, the sounds can become mixed together during playback, making it difficult to hear, or the voice of the person you want to hear can be suppressed, creating an unnatural experience.In addition, conventional remote communication systems often cut out sounds other than the speech.

　図１は、遠隔コミュニケーション装置のブロック図である。実施の形態に係る遠隔コミュニケーション装置１は、背景音と発話の音声とを別系統で処理できるようにし、発話の音声をさらに話し相手に応じて２種類に分ける。そして、遠隔コミュニケーション装置１は、分けた音声のそれぞれに対して異なるパラメータで処理を施すことで、遠隔地とのつながり感を保持しながら声の聞こえやすさを調整する。ここで、つながり感とは、同じ空間において対面で対話相手と会話しているように感じられる臨場感にあたる。 Figure 1 is a block diagram of a remote communication device. The remote communication device 1 according to the embodiment is able to process background sounds and spoken voices in separate systems, and further divides spoken voices into two types depending on the person being spoken to. The remote communication device 1 then processes each of the divided voices using different parameters, adjusting the ease with which the voices can be heard while maintaining a sense of connection with the remote location. Here, the sense of connection corresponds to a sense of realism that makes it feel as if you are having a face-to-face conversation with your conversation partner in the same space.

　遠隔コミュニケーション装置１は、送信ユニット１０及び受信ユニット２０を有する。図１に示した遠隔コミュニケーション装置１は、相手方の遠隔コミュニケーション装置１と双方向通信を行う装置であり、相手方の遠隔コミュニケーション装置１も送信ユニット１０及び受信ユニット２０を有する。 The remote communication device 1 has a transmitting unit 10 and a receiving unit 20. The remote communication device 1 shown in FIG. 1 is a device that performs two-way communication with a remote communication device 1 of a partner, and the remote communication device 1 of the partner also has a transmitting unit 10 and a receiving unit 20.

　ただし、双方向送信を行わない場合には、送信側の遠隔コミュニケーション装置１は少なくとも送信ユニット１０を有し、受信側の遠隔コミュニケーション装置１は少なくとも受信ユニット２０を有すればよい。すなわち、実施例に係る遠隔コミュニケーションを実現する情報処理システムは、送信装置である遠隔コミュニケーション装置１と受信装置と遠隔コミュニケーション装置１とを有する。 However, if two-way transmission is not performed, the remote communication device 1 on the transmitting side only needs to have at least a transmission unit 10, and the remote communication device 1 on the receiving side only needs to have at least a receiving unit 20. In other words, the information processing system that realizes remote communication in this embodiment has a remote communication device 1 as a transmitting device, a receiving device, and a remote communication device 1.

　送信側の遠隔コミュニケーション装置１と受信側の遠隔コミュニケーション装置１とは、ネットワーク７を介して接続される。 The sending remote communication device 1 and the receiving remote communication device 1 are connected via a network 7.

　図２Ａは、遠隔コミュニケーション装置が配置された環境の俯瞰図である。また、図２Ｂは、遠隔コミュニケーション装置が配置された環境の正面図である。図２Ａ及び２Ｂにおける点線は信号線の配線を示す。 Figure 2A is an overhead view of the environment in which the remote communication device is placed. Figure 2B is a front view of the environment in which the remote communication device is placed. The dotted lines in Figures 2A and 2B indicate the wiring of signal lines.

　遠隔コミュニケーション装置１は、オーディオインタフェース５に接続される。オーディオインタフェース５は、複数のマイク３及び複数対向スピーカー４に接続される。オーディオインタフェース５は、複数のマイク３から収音した音を遠隔コミュニケーション装置１へ出力する。また、オーディオインタフェース５は、遠隔コミュニケーション装置１から出力された音を複数対向スピーカー４へ出力する。 The remote communication device 1 is connected to an audio interface 5. The audio interface 5 is connected to multiple microphones 3 and multiple opposed speakers 4. The audio interface 5 outputs sounds picked up from the multiple microphones 3 to the remote communication device 1. The audio interface 5 also outputs sounds output from the remote communication device 1 to the multiple opposed speakers 4.

　カメラ２は、配置された特定の空間の映像を撮影して、撮影した映像を遠隔コミュニケーション装置１へ出力する。マイク３は、全方位マイクである。複数のマイク３は、所定の配置で配列されてマイクアレイ３０を形成する。カメラ２とマイクアレイ３０とは、ほぼ同じ位置に並べられることが好ましい。特に、図２Ａにおける人が並ぶ位置からカメラ２及びマイクアレイ３０へ向かう水平方向に一列に並べられることが好ましい。 The camera 2 captures images of the specific space in which it is placed and outputs the captured images to the remote communication device 1. The microphone 3 is an omnidirectional microphone. Multiple microphones 3 are arranged in a predetermined configuration to form a microphone array 30. It is preferable that the camera 2 and microphone array 30 are aligned in approximately the same position. In particular, it is preferable that they are aligned in a horizontal line from the position where people are lined up in Figure 2A toward the camera 2 and microphone array 30.

　また、遠隔コミュニケーション装置１は、ディスプレイ６が接続される。ディスプレイ６は、遠隔コミュニケーション装置１が受信した映像を表示する。図２Ｂに示すように、ディスプレイ６の上下には複数対向スピーカー４のスピーカーアレイ４１及びスピーカーアレイ４２がそれぞれ配置される。 Furthermore, a display 6 is connected to the remote communication device 1. The display 6 displays the video received by the remote communication device 1. As shown in FIG. 2B , a speaker array 41 and a speaker array 42 of multiple opposing speakers 4 are arranged above and below the display 6, respectively.

　以下の説明では、遠隔コミュニケーション装置１を用いて対話を行う人は、ディスプレイ６から等距離にディスプレイ６と平行に一列に並ぶ場合で説明する。ディスプレイ６から人までの距離は、例えば２．０ｍ程度である。以下の説明では、ディスプレイ６の実際に映像を映し出す面を「スクリーン」とぶ。また、撮影された人の頭と足元とを結ぶ方向であり映像がスクリーンに映された場合の映像の上下方向を「縦方向」と呼ぶ。また、撮影された人が正面を向いた場合の左右方向であり映像がスクリーンに映された場合の左右方向を「横方向」と呼ぶ。特に、映像がスクリーンに映された状態での映像に向って左側を「左」と呼び、右側を「右」と呼ぶ。 In the following explanation, it is assumed that people who will be conversing using the remote communication device 1 are lined up in a row parallel to the display 6 at equal distances from the display 6. The distance from the display 6 to the people is, for example, approximately 2.0 m. In the following explanation, the surface of the display 6 that actually displays the image will be referred to as the "screen." Furthermore, the direction connecting the head and feet of the person being photographed, and the up and down direction of the image when projected onto the screen, will be referred to as the "vertical direction." Furthermore, the left and right directions when the person being photographed is facing forward, and the left and right directions when the image is projected onto the screen, will be referred to as the "horizontal direction." In particular, the left side of the image when projected onto the screen will be referred to as the "left," and the right side will be referred to as the "right."

　＜１．１．言葉の定義＞
　次に、実施の形態において使用される言葉の定義について説明する。 <1.1. Definition of terms>
Next, definitions of terms used in the embodiments will be explained.

　＜１．１．１．リモート及びローカル＞
　実施の形態において遠隔コミュニケーション装置１は、物理的に離れた２拠点を接続してそれぞれの空間にいる人同士でコミュニケーションを取り合う遠隔コミュニケーションを実行する。そこで、遠隔コミュニケーションを行う２拠点のうち、送信側の遠隔コミュニケーション装置１が配置された側を「リモート」と呼び、送信側の遠隔コミュニケーション装置１が配置された側を「ローカル」と呼ぶ。すなわち、以下の説明では、リモートの映像及び音声をローカルへ送る場合について説明する。また、ローカルの対話相手となるリモートにおける人を発話者と呼び、発話者がメッセージを送りたい対象である対話相手を受話者と呼ぶ。 1.1.1. Remote and Local
In the embodiment, the remote communication device 1 connects two physically separated locations and performs remote communication between people in each space. Of the two locations where remote communication is performed, the location where the transmitting remote communication device 1 is located is referred to as the "remote" location, and the location where the transmitting remote communication device 1 is located is referred to as the "local" location. In other words, the following description will focus on the case where video and audio from the remote location are sent to the local location. The remote person who is the conversation partner of the local location is referred to as the speaker, and the conversation partner to whom the speaker wants to send a message is referred to as the listener.

　＜１．１．２．個別音＞
　個別音とは、各発話者の個別音声のことである。リモートの多数人とローカルの多数人とが対話する多対多のコミュニケーションでは、一方の拠点（リモート）で複数人が同時に話す状況が想定される。そのような場面で、単純に無指向マイクで収録すると複数人の声が収録される。そこで、収録された複数人の声の中からそれぞれの話者の声を単体で抽出することを「個別音分離」の処理と呼ぶ。 <1.1.2. Individual sounds>
Individual sounds refer to the individual voices of each speaker. In many-to-many communication, where many remote people converse with many local people, it is expected that multiple people will be speaking simultaneously at one of the locations (remote). In such a situation, simply recording with an omnidirectional microphone will result in the recording of multiple voices. Therefore, the process of extracting each speaker's voice individually from the recorded voices of multiple people is called "individual sound separation."

　＜１．１．３．内野及び外野＞
　内野とは、ローカルの話者を基準として、ローカルの話者との間でその時点で会話をしている等のローカルに注目しているリモートの話者の、その時点でのローカルの話者から見た立ち位置を表す。例えば、ローカルの人とリモートの人とで会話が行われている場合、そのリモートの人が、ローカルの人にとっての「内野」として扱われる。内野の人は、その時点で会話に参加している人ともに言える。 1.1.3. Infield and outfield
The infield refers to the position of a remote speaker who is currently paying attention to the local speaker, such as someone who is currently having a conversation with the local speaker, as seen from the local speaker at that time. For example, if a conversation is taking place between a local person and a remote person, the remote person is considered to be the "infield" for the local person. The infield can also be anyone who is currently participating in the conversation.

　外野とは、ローカルの話者を基準として、リモートの人同士の間で会話をしているリモートの人等のローカルに注目しておらずその時点でローカルに注目していない人の、その時点でのローカルの話者から見た立ち位置を表す。リモートの発話者にとっての受話者が同じくリモートにいる、つまりリモート同士の話者の間で会話をしている場合、そのリモートの受話者はローカルの話者にとっての「外野」として扱われる。外野の人は、その時点で会話に参加していない人とも言える。ただし、リモートの発話者は、時間経過でのその人の行動の変化により、内野になったり外野になったりする。 The outfield refers to the position of people who are not paying attention to the local situation at that time, such as remote people having a conversation with other remote people, as seen from the perspective of the local speaker at that time. If the listener for the remote speaker is also remote, in other words, if the conversation is between two remote speakers, the remote listener is treated as the "outfield" for the local speaker. People in the outfield can also be said to be people who are not participating in the conversation at that time. However, remote speakers can become infield or outfield depending on changes in their behavior over time.

　＜１．１．４．複数対向スピーカー＞
　複数対向スピーカー４は、複数のスピーカーユニット４１１からなるスピーカーアレイ４１と複数のスピーカーユニット４１２からなるスピーカーアレイ４２とを対として有するスピーカシステムである。本実施例では、スピーカーアレイ４１とスピーカーアレイ４２とは、ディスプレイ６の画面を挟んで縦方向に並べて配置される。例えば、スピーカーアレイ４１は、スピーカーアレイ４２に対してスクリーンに映された映像における上側に設置される。以下では、スピーカーアレイ４１がスピーカーアレイ４２の上側に設置される場合を例に説明する。 <1.1.4. Multiple opposing speakers>
The multiple opposed speakers 4 are a speaker system having a pair of a speaker array 41 consisting of a plurality of speaker units 411 and a speaker array 42 consisting of a plurality of speaker units 412. In this embodiment, the speaker array 41 and the speaker array 42 are arranged vertically with the screen of the display 6 in between. For example, the speaker array 41 is installed above the speaker array 42 in the image projected on the screen. The following describes an example in which the speaker array 41 is installed above the speaker array 42.

　スピーカーアレイ４１とスピーカーアレイ４２とは、同数のスピーカーユニット４１１及び４１２を有する。スピーカーアレイ４１において、複数のスピーカーユニット４１１が横方向に並ぶ。また、スピーカーアレイ４２において複数のスピーカーユニット４１２も横方向に並ぶ。各スピーカーユニット４１１と各スピーカーユニット４１２とはそれぞれが、縦方向の対応する位置に配置される。以下では、縦方向の対応する位置に配置されたスピーカーユニット４１１とスピーカーユニット４１２の組みを、同じ列のスピーカーユニット４１１とスピーカーユニット４１２と呼ぶ。スピーカーユニット４１１及び４１２からは同じ音声を再生することで、スピーカーアレイ４１とスピーカーアレイ４２との間の中央に仮想的に音源が定位する。 Speaker array 41 and speaker array 42 have the same number of speaker units 411 and 412. In speaker array 41, multiple speaker units 411 are arranged horizontally. Similarly, in speaker array 42, multiple speaker units 412 are arranged horizontally. Each speaker unit 411 and each speaker unit 412 are arranged at corresponding positions in the vertical direction. Below, a pair of speaker units 411 and 412 arranged at corresponding positions in the vertical direction will be referred to as speaker units 411 and 412 in the same row. By playing the same sound from speaker units 411 and 412, a sound source is virtually positioned in the center between speaker array 41 and speaker array 42.

　複数対向スピーカー４は、単にスピーカーを並べたシステムに比べて以下のようなメリットを有する。複数対向スピーカー４は、分離した話者の音声を個別チャンネルで再生することで、１つのチャンネルに複数人の声が混ざって再生されたときよりも個々人の声が聞き取り安くできる。また、複数対向スピーカー４は、ディスプレイ６のスクリーンの縦方向の中央に音源を定位させることで、おおよそ人の顔の位置から声が出ているように聞かせることができる。 Multiple opposed speakers 4 have the following advantages over a system that simply consists of an array of speakers. By playing back the voices of separate speakers on individual channels, multiple opposed speakers 4 make it easier to hear each individual voice than when multiple voices are mixed together on a single channel. Additionally, by localizing the sound source in the vertical center of the display 6 screen, multiple opposed speakers 4 can make the voice sound as if it is coming from approximately the position of a person's face.

　ただし、第１の実施の形態では、スピーカーアレイ４１の出力方向とスピーカーアレイ４２の出力方向と対向させることで点定位させ臨場感を向上させる効果を強めているが、対向させず一般的なアレイスピーカーで面定位させてもよい。 However, in the first embodiment, the output direction of speaker array 41 and the output direction of speaker array 42 are opposed to each other to achieve point localization and enhance the sense of realism, but it is also possible to use a general array speaker to achieve surface localization without opposing them.

＜１．２．送信ユニット＞
　送信ユニット１０は、送信側の遠隔コミュニケーション装置１で用いられる機能である。以下の説明では、リモートからローカルに音を送る場合、すなわちリモートを送信側としローカルを受信側として説明する。送信ユニット１０は、図１に示すように、人センシング部１１、音声分離部１２、音響信号処理部１３、出力統合部１４及び送信部１５を有する。 <1.2. Transmitting unit>
The transmission unit 10 is a function used in the transmitting remote communication device 1. In the following explanation, the case where sound is sent from the remote to the local will be explained, i.e., the remote will be the transmitting side and the local will be the receiving side. As shown in FIG. 1 , the transmission unit 10 has a human sensing unit 11, a voice separation unit 12, an acoustic signal processing unit 13, an output integration unit 14, and a transmission unit 15.

＜１．２．１．人位置検出部＞
　人センシング部１１は、カメラ２が配置されたリモート側の特定の空間のカメラ２により撮影された映像の入力を受ける。そして、人センシング部１１は、以下に説明する、人位置判定処理及び内野外野判定処理を含む人センシング処理を実行する。以下では、カメラ２により撮影されるリモート側の空間を「リモート空間」と呼ぶ。 <1.2.1. Person position detection unit>
The human sensing unit 11 receives an input of an image captured by the camera 2 in a specific space on the remote side where the camera 2 is placed. The human sensing unit 11 then executes human sensing processing including human position determination processing and infield/outfield determination processing, which will be described below. Hereinafter, the space on the remote side captured by the camera 2 will be referred to as the "remote space."

　人センシング部１１は、カメラ２の映像からリモート空間における横方向の原点とする位置を検出する。図３は、人センシング部の処理を説明するための図である。ここでは、図３の紙面に向って上段の画像で示すように、ディスプレイ６に３人の発話者６１～６３が写る映像６０が表示される場合を例に説明する。人センシング部１１は、例えば、リモート空間におけるカメラ２の映像がディスプレイ６のスクリーンに映された場合の画面の横方向の中央６００に原点があるものとする。次に、人センシング部１１は、原点からディスプレイ６の画面の横方向の位置を示す正規化座標を設定する。例えば、人センシング部１１は、ディスプレイ６のスクリーンの横方向の端部から中央６００までの距離を正規化座標における０．５の距離とする。図３では、人センシング部１１は、ｘ座標を正規化座標として、ディスプレイ６のスクリーンの右側端部の正規化座標を０．５とし、左側端部の正規化座標を－０．５とする。すなわち、本実施の形態では、人センシング部１１は、スクリーンの横幅を１として正規化座標を設定する。 The human sensing unit 11 detects the position of the horizontal origin in the remote space from the image of camera 2. Figure 3 is a diagram for explaining the processing of the human sensing unit. Here, as shown in the image in the upper row of Figure 3, an example will be described in which an image 60 showing three speakers 61 to 63 is displayed on display 6. For example, the human sensing unit 11 assumes that the origin is located at the horizontal center 600 of the screen when the image of camera 2 in the remote space is projected onto the screen of display 6. Next, the human sensing unit 11 sets normalized coordinates that indicate the horizontal position from the origin on the screen of display 6. For example, the human sensing unit 11 sets the distance from the horizontal edge of the screen of display 6 to the center 600 as a distance of 0.5 in normalized coordinates. In Figure 3, the human sensing unit 11 uses the x coordinate as a normalized coordinate, and sets the normalized coordinate of the right edge of the screen of display 6 to 0.5 and the normalized coordinate of the left edge to -0.5. That is, in this embodiment, the human sensing unit 11 sets normalized coordinates with the screen width set to 1.

　次に、人センシング部１１は、以下の人位置判定処理を実行する。人センシング部１１は、カメラ２の映像に写っている各発話者の骨格認識を実行する。そして、人センシング部１１は、カメラ２の映像に写っている発話者それぞれの首の正規化座標を取得する。これにより、人センシング部１１は、各発話者の位置を示す正規化座標が取得でき、発話者同士の横方向の位置関係が判定できる。人センシング部１１は、既に商用リリースされているような一般的なツールを用いて骨格認識を実行することができる。そして、人センシング部１１は、正規化座標が取得できた発話者の人数分の要素からなる人別リストを生成する。本実施の形態では、人センシング部１１は、正規化座標の値が小さい順に人別リストに要素を並べる。 Next, the human sensing unit 11 performs the following human position determination process. The human sensing unit 11 performs skeletal recognition of each speaker who appears in the image captured by camera 2. Then, the human sensing unit 11 acquires the normalized coordinates of the neck of each speaker who appears in the image captured by camera 2. As a result, the human sensing unit 11 can acquire normalized coordinates that indicate the position of each speaker, and can determine the lateral positional relationships between speakers. The human sensing unit 11 can perform skeletal recognition using general tools that are already commercially released. Then, the human sensing unit 11 generates a person list consisting of elements equal to the number of speakers whose normalized coordinates have been acquired. In this embodiment, the human sensing unit 11 arranges the elements in the person list in ascending order of normalized coordinate values.

　図３の紙面に向って下段の画像は、人位置判定処理及び内野外野判定処理の詳細を示す。例えば、映像６０を取得した場合、人センシング部１１は、骨格認識を実行して発話者６１の首６０１の正規化座標、発話者６２の首６０２の正規化座標及び発話者６３の首６０３の正規化座標を取得する。そして、人センシング部１１は、取得した正規化座標から、発話者６１、発話者６２、発話者６３の順に映像６０の左から並ぶことを確認する。そして、人センシング部１１は、映像６０に合わせて発話者６１、発話者６２、発話者６３の順に人別リストに要素を設定する。 The images at the bottom of Figure 3 show details of the person position determination process and the infield/outfield determination process. For example, when video 60 is acquired, the person sensing unit 11 performs skeletal recognition to obtain the normalized coordinates of speaker 61's neck 601, the normalized coordinates of speaker 62's neck 602, and the normalized coordinates of speaker 63's neck 603. The person sensing unit 11 then confirms from the acquired normalized coordinates that speakers 61, 62, and 63 are lined up from the left of video 60 in this order. The person sensing unit 11 then sets elements in the person list in the order of speaker 61, speaker 62, and speaker 63, in accordance with video 60.

　ここで、人センシング部１１は、映像から骨格検出を行うが、影像から得られる座標は通常は正規化座標系ではなく、映像のピクセル数を用いたスクリーン座標により表される。そのため、人センシング部１１は、実際には、人位置判定処理において映像から発話者の位置のスクリーン座標を取得した後、そのスクリーン座標から正規化座標への座標変換を行う。図４は、スクリーン座標から正規化座標への座標変換を示す図である。ここでは、スクリーン座標６２１の各軸をｕ軸及びｖ軸とし、正規化座標６２２の各軸をｘ軸及びｙ軸とする。 Here, the human sensing unit 11 performs skeleton detection from the video, but the coordinates obtained from the image are usually not expressed in a normalized coordinate system but in screen coordinates using the number of pixels in the video. Therefore, in practice, the human sensing unit 11 obtains the screen coordinates of the speaker's position from the video in the human position determination process, and then performs coordinate conversion from those screen coordinates to normalized coordinates. Figure 4 is a diagram showing the coordinate conversion from screen coordinates to normalized coordinates. Here, the axes of the screen coordinates 621 are the u-axis and v-axis, and the axes of the normalized coordinates 622 are the x-axis and y-axis.

　人センシング部１１は、ｕ軸及びｖ軸を有するスクリーン座標６２１を用いてスクリーン６２０に映される映像から各発話者の位置を判定し、その後、判定した各発話者の位置をｘ軸及びｙ軸を有する正規化座標６２２に変換する。また、スクリーン６２０のスクリーン幅をＷ_Ｓｃｒとし、スクリーン高さをＨ_Ｓｃｒとすると、それぞれの座標系が取れる範囲は、０≦ｕ≦Ｗ_Ｓｃｒかつ０≦ｖ≦Ｈ_Ｓｃｒ、及び、－０．５≦ｘ≦０．５かつ－０．５≦ｙ≦０．５となる。ここでは、縦方向の正規化座標系を、影像の縦方向の中心を０として、－０．５から０．５の範囲で表すものとする。この場合、人センシング部１１は、次の数式（１）を用いて座標変換を行う。このように座標変換することで、カメラ画角やアスペクト比に左右されることなく、処理を行うことができる。 The human sensing unit 11 determines the position of each speaker from the image projected on the screen 620 using screen coordinates 621 having a u-axis and a v-axis, and then converts the determined position of each speaker into normalized coordinates 622 having an x-axis and a y-axis. Furthermore, if the screen width of the screen 620 is W _Scr and the screen height is H _Scr , the ranges that each coordinate system can take are 0≦u≦W _Scr and 0≦v≦H _Scr , and -0.5≦x≦0.5 and -0.5≦y≦0.5. Here, the vertical normalized coordinate system is expressed in the range from -0.5 to 0.5, with the vertical center of the image set as 0. In this case, the human sensing unit 11 performs coordinate conversion using the following equation (1). By performing coordinate conversion in this manner, processing can be performed regardless of the camera's angle of view or aspect ratio.

　次に、人センシング部１１は、以下の内野外野判定処理を実行する。人センシング部１１は、映像内の各人の頭部方向、すなわち、その発話者の向いている方向を推定する。例えば、人位置判定処理でｐ人の正規化座標が得られた場合、人センシング部１１は、ｐ人数分の正規化座標のそれぞれに位置する発話者の顔を検出して、画像解析等により頭部方向を推定する。より詳しくは、人センシング部１１は、正規化座標の正方向のベクトルの向きを０度として、その０度のベクトルからの角度を頭部方向ベクトルの角度Φを推定することで頭部方向の推定を行う。 Next, the human sensing unit 11 executes the following infield/outfield determination process. The human sensing unit 11 estimates the head direction of each person in the video, i.e., the direction in which the speaker is facing. For example, if the normalized coordinates of p people are obtained in the human position determination process, the human sensing unit 11 detects the faces of the speakers located at each of the normalized coordinates for each of the p people, and estimates the head direction using image analysis, etc. More specifically, the human sensing unit 11 estimates the head direction by assuming that the direction of the vector in the positive direction of the normalized coordinates is 0 degrees, and estimating the angle Φ of the head direction vector from this 0-degree vector.

　ここで、人センシング部１１は、各発話者が内野か外野かを判定するための内野範囲を予め有する。例えば、人センシング部１１は、カメラ２に向いた状態を９０度として、９０度の所定角度前後の範囲を内野範囲として有する。そして、人センシング部１１は、頭部方向が内野範囲に含まれる発話者を内野と判定し、頭部方向が内野範囲に含まれない発話者を外野の人と判定する。次に、人センシング部１１は、人別リストの各要素に内野フラグの項目を設けて、初期値としてＦａｌｓｅを設定する。そして、人センシング部１１は、内野と判定した発話者については、内野フラグをＴｒｕｅに設定する。また、人センシング部１１は、外野と判定した発話者については、内野フラグをＦａｌｕｓｅに設定する。これにより、人センシング部１１は、人別リストにおける各発話者の内野フラグについて、その発話者が内野であればＴｒｅを設定し、外野であればＦａｌｕｓｅを設定し、さらに頭部方向の検出漏れがあった場合にはＦａｌｓｅを設定することができる。 Here, the person sensing unit 11 has an infield range in advance for determining whether each speaker is in the infield or outfield. For example, the person sensing unit 11 defines the state in which the speaker is facing the camera 2 as 90 degrees, and defines a range around a predetermined angle of 90 degrees as the infield range. The person sensing unit 11 then determines that speakers whose head direction is within the infield range are in the infield, and determines that speakers whose head direction is not within the infield range are outfield people. Next, the person sensing unit 11 provides an infield flag item for each element in the person list and sets the initial value to False. The person sensing unit 11 then sets the infield flag to True for speakers determined to be in the infield. The person sensing unit 11 also sets the infield flag to False for speakers determined to be in the outfield. As a result, the human sensing unit 11 can set the infield flag of each speaker in the person list to Tre if the speaker is in the infield, to False if the speaker is in the outfield, and further to False if there is an oversight of head direction detection.

　例えば、図３の下段の画像に示すように、人センシング部１１は、発話者６１～６３について頭部方向６１１～６１３を推定する。さらに、ここでは内野範囲ψを９０度となる正面方向６１０から前後４５度として、人センシング部１１は、内野外野判定処理を行う。この場合、頭部方向６１１は内野範囲に含まれるため、人センシング部１１は、発話者６１を内野と判定する。また、頭部方向６１２及び６１３は内野範囲に含まれないため、人センシング部１１は、発話者６２及び６３を外野の人と判定する。そして、人センシング部１１は、人別リストにおける発話者６１の内野フラグをＴｕｒｅとし、発話者６２及び６３の内野フラグをＦａｌｓｅとする。 For example, as shown in the image at the bottom of Figure 3, the person sensing unit 11 estimates head directions 611-613 for speakers 61-63. Furthermore, here, the infield range ψ is set to 45 degrees forward or backward from the front direction 610, which is 90 degrees, and the person sensing unit 11 performs infield/outfield determination processing. In this case, because head direction 611 is included in the infield range, the person sensing unit 11 determines that speaker 61 is in the infield. Furthermore, because head directions 612 and 613 are not included in the infield range, the person sensing unit 11 determines that speakers 62 and 63 are people in the outfield. Then, the person sensing unit 11 sets the infield flag for speaker 61 in the person list to True, and the infield flags for speakers 62 and 63 to False.

　図１に戻って説明を続ける。その後、人センシング部１１は、作成した人別リストを音響信号処理部１３及び出力統合部１４へ出力する。 Referring back to Figure 1, the explanation continues. The human sensing unit 11 then outputs the created human list to the acoustic signal processing unit 13 and the output integration unit 14.

　ここで、リモート空間が「所定空間」の一例にあたり、発話者が内野であるか外野であるかの判定が「発話者が会話に参加しているか否かの判定」の一例にあたる。すなわち、人センシング部１１は、所定空間に存在する複数の発話者を撮影するカメラ２により撮影された映像を基に、各発話者が会話に参加しているか否かを判定する。また、人センシング部１１は、映像を基に発話者それぞれの位置を判定する。 Here, the remote space is an example of a "predetermined space," and determining whether a speaker is in the infield or outfield is an example of "determining whether a speaker is participating in a conversation." In other words, the human sensing unit 11 determines whether each speaker is participating in a conversation based on the video captured by the camera 2 that captures multiple speakers present in the predetermined space. The human sensing unit 11 also determines the position of each speaker based on the video.

　＜１．２．２．音声分離部＞
　音声分離部１２は、複数のマイク３で収音された音声の入力を受ける。マイクアレイ３０の入力には発話による発話音声以外にも、背景ノイズ、ＢＧＭ（Back　Ground　Music）及び物音等といったその環境で発生する背景音が含まれる。そこで、音声分離部１２は、後の処理で発話音声に処理を施して聞こえ易さを調整することができるように、マイクアレイ３０から入力された音の発話音声と背景音とを分離する音声分離処理を行う。音声分離処理は、発話抽出処理とも言える。例えば、音声分離部１２は、ボーカル抽出や歌声ありの音楽で歌声のみを抽出する技術を用いて音声分離を行う。 <1.2.2. Audio Separation Unit>
The audio separation unit 12 receives audio input collected by the multiple microphones 3. The input to the microphone array 30 includes not only speech sounds generated by speech but also background sounds generated in the environment, such as background noise, background music (BGM), and other sounds. Therefore, the audio separation unit 12 performs audio separation processing to separate the speech sounds input from the microphone array 30 from the background sounds so that the speech sounds can be processed later to adjust their audibility. The audio separation processing can also be considered speech extraction processing. For example, the audio separation unit 12 performs audio separation using vocal extraction or technology to extract only the singing voice from music with singing voices.

　このように、音声分離部１２は、リモート空間にあたる所定空間の音声から背景音を分離する。そして、発話音声と背景音とを分離することで、発話音声及び背景音に対する明瞭度調整等の処理が調整することが容易となる。 In this way, the audio separation unit 12 separates background sound from audio in a specified space that corresponds to the remote space. By separating the speech audio from the background sound, it becomes easier to adjust processes such as clarity adjustment for the speech audio and background sound.

　ここで、本実施例では、マイク３として全方位マイクを用いたが、他にも、マイク３として、各発話者にピンマイクを装着させ、且つ、無指向性マイクを用いて背景音を収音してもよい。その場合、音声分離部１２は、ピンマイクで収音した音声を各発話者の発話音声として、無指向性マイクで収音した音声を背景音としてもよい。 In this embodiment, an omnidirectional microphone is used as the microphone 3, but it is also possible to have each speaker wear a pin microphone as the microphone 3, and to use an omnidirectional microphone to pick up background sound. In this case, the audio separation unit 12 may use the audio picked up by the pin microphone as the speech of each speaker, and the audio picked up by the omnidirectional microphone as the background sound.

　＜１．２．３．音響信号処理部＞
　音響信号処理部１３は、個別音分離部１３１及び明瞭度調整部１３２を有する。 <1.2.3. Acoustic signal processing section>
The sound signal processing unit 13 includes an individual sound separation unit 131 and a clarity adjustment unit 132 .

　個別音分離部１３１は、音声分離部１２による音声分離処理により抽出された発話音声の入力を受ける。また、個別音分離部１３１は、人センシング部１１により作成された人別リストを取得する。そして、個別音分離部１３１は、人別リストを用いて映像に映る各人の発話音声を分離するための以下の個別音分離処理を実行する。 The individual sound separation unit 131 receives input of speech sounds extracted by the speech separation process performed by the speech separation unit 12. The individual sound separation unit 131 also acquires the person list created by the person sensing unit 11. The individual sound separation unit 131 then uses the person list to perform the following individual sound separation process to separate the speech sounds of each person appearing in the video.

　図５は、個別音分離処理を示す図である。ここでは、以下の条件が満たされている場合で説明する。カメラ２とマイクアレイ３０によるマイクアレイとはカメラ２の撮影方向に向かって直列に並べられる。すなわち、カメラ２により撮影される２次元の映像の法線方向にカメラ２とマイクアレイ３０とが並べられ、映像における縦方向及び横方向と奥行き方向とで形成される座標を考えた場合、カメラ２とマイクアレイ３０とは縦方向及び横方向の座標が一致する。また、リモート空間の発話者は、カメラ２により撮影される２次元の映像の縦方向が一致する位置で横方向一列に並ぶ。ここで、リモート空間の各発話者が並ぶ位置の縦方向及び横方向にあたる面を「仮想スクリーン」と呼ぶ。 Figure 5 is a diagram showing the individual sound separation process. Here, we will explain the case where the following conditions are met. The camera 2 and the microphone array made up of the microphone array 30 are arranged in series in the shooting direction of the camera 2. In other words, the camera 2 and the microphone array 30 are arranged in the normal direction of the two-dimensional image shot by the camera 2, and when considering the coordinates formed by the vertical, horizontal and depth directions in the image, the vertical and horizontal coordinates of the camera 2 and the microphone array 30 match. Furthermore, the speakers in the remote space line up horizontally at positions where the vertical direction of the two-dimensional image shot by the camera 2 matches. Here, the surface that corresponds vertically and horizontally to the positions where the speakers in the remote space are lined up is called the "virtual screen."

　図５における－０．５～０．５の範囲で示す正規化座標は、仮想スクリーンの横方向の座標と一致する。また、カメラ２及びマイク３との距離Ｌは既知である。さらに、図４におけるｘ_ｐｈは、仮想スクリーンにおける座標の原点からの物理的距離である。ｘ_ｐｈは、カメラ２の画角や距離Ｌの値から算出可能である。 The normalized coordinates shown in the range of -0.5 to 0.5 in Figure 5 match the horizontal coordinates of the virtual screen. The distance L between the camera 2 and the microphone 3 is known. Furthermore, x _ph in Figure 4 is the physical distance from the origin of the coordinates on the virtual screen. x _ph can be calculated from the angle of view of the camera 2 and the value of the distance L.

　この場合に、個別音分離部１３１は、取得した人別リストに登録された要素数分のスレッドを作成して、各スレッドに属性として各発話者を識別するためのＩＤ（Identifier）を設定する。例えば、個別音分離部１３１は、人別リストの左から並ぶ順に、スレッドＩＤを１，２，３，・・・と割り当てる。さらに、個別音分離部１３１は、各スレッドに人別リストに登録された正規化座標の値もスレッド固有の値として保持させる。そして、個別音分離部１３１は、人位置判定処理で得られた正規化座標を仮想スクリーンの座標として、各発話者の原点からの物理的距離であるｘ_ｐｈを算出する。そして、個別音分離部１３１は、算出した物理的距離ｘ_ｐｈ、仮想クリーンからカメラ２までの距離Ｌを用いて、次の数式（２）により特定の人までの角度θを求めることができる。 In this case, the individual sound separation unit 131 creates threads for the number of elements registered in the acquired person list and assigns an ID (identifier) to each thread as an attribute to identify each speaker. For example, the individual sound separation unit 131 assigns thread IDs of 1, 2, 3, ... in order from left to right in the person list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the person list in each thread as thread-specific values. Then, the individual sound separation unit 131 calculates x _ph , which is the physical distance from the origin of each speaker, using the normalized coordinates obtained in the person position determination process as coordinates on the virtual screen. Then, the individual sound separation unit 131 can calculate the angle θ to a specific person using the calculated physical distance x _ph and the distance L from the virtual screen to camera 2 using the following equation (2).

　例えば、図５の発話者１０１について、個別音分離部１３１は、正規化座標としてｘ_１を取得する。そして、個別音分離部１３１は、正規化座標であるｘ_１から原点から発話者１０１までの物理的距離を算出し、数式（２）に算出した物理的距離及び距離Ｌを用いて、発話者１０１のカメラ２及びマイク３を中心とした原点からの角度θ_１を算出する事ができる。個別音分離部１３１は、各スレッドについて角度θを求める計算を行うことで、各発話者の角度をそれぞれのスレッドで得ることができる。 For example, for the speaker 101 in Fig. 5, the individual sound separation unit 131 acquires _x1 as the normalized coordinate. Then, the individual sound separation unit 131 calculates the physical distance from the origin to the speaker 101 from the normalized coordinate _x1 , and can calculate the angle _θ1 from the origin centered on the camera 2 and microphone 3 of the speaker 101 using the physical distance calculated in equation (2) and the distance L. The individual sound separation unit 131 can obtain the angle of each speaker for each thread by performing a calculation to find the angle θ for each thread.

　次に、個別音分離部１３１は、音声分離部１２による音声分離により抽出された発話音声に対して、ビームフォーミングを用いて発話者毎の発話音声について、個別音分離処理を実行する。音声分離部１２による音声分離により抽出された発話音声は、マイクアレイ３０から得られた音であり、各発話者の方向の感度は確保されている。そこで、個別音分離部１３１は、発話者１０１の場合であれば、発話者１０１に対する角度θ_１をビーム方向とし、かつ、角度θ_１を中心として１５度の範囲１０２のビーム幅とする。そして、個別音分離部１３１は、ビーム方向及びビーム幅を指定して、取得した発話音声のうち指定した角度範囲以外の方向の音を抑圧することで、目的方向の感度を確保して発話者１０１の個別音を取得する。 Next, the individual sound separation unit 131 performs individual sound separation processing on the speech sound extracted by the speech separation by the speech separation unit 12, using beamforming, for the speech sound of each speaker. The speech sound extracted by the speech separation by the speech separation unit 12 is sound obtained from the microphone array 30, and sensitivity in the direction of each speaker is ensured. Therefore, in the case of speaker 101, the individual sound separation unit 131 sets the beam direction to angle _θ1 with respect to speaker 101 and sets the beam width to a range 102 of 15 degrees centered on angle _θ1 . Then, the individual sound separation unit 131 specifies the beam direction and beam width to suppress sounds from directions outside the specified angular range of the acquired speech sound, thereby ensuring sensitivity in the target direction and acquiring the individual sound of speaker 101.

　このように、個別音分離部１３１は、人センシング部１１により判定された発話者それぞれの位置を基に、リモート空間にあたる所定空間の音声から発話者毎の発話音声である個別音を分離する。 In this way, the individual sound separation unit 131 separates individual sounds, which are the speech sounds of each speaker, from the sound in the specified space that corresponds to the remote space, based on the position of each speaker determined by the human sensing unit 11.

　この点、従来は、映像上の人位置と対応付けながらマイク非装着で話者の個別音を取得することが困難であった。一方、本実施の形態に係る個別音分離部１３１は、映像上の人の位置が映像上の順番で登録された人別リストを用いて、その順番を保持しながらビームフォーミングの処理をすることで、映像上の各発話者の位置と分離後の個別音のペアが容易に作成できる。これにより、伝送後にローカルで発話を再生するときに、映像と出音位置を一致させることが容易となる。 In the past, it was difficult to acquire individual sounds from speakers without wearing a microphone while associating them with the position of the person on the video. However, the individual sound separation unit 131 of this embodiment uses a person-specific list in which the positions of people on the video are registered in the order they appear on the video, and performs beamforming processing while maintaining that order, making it easy to pair the position of each speaker on the video with the individual sound after separation. This makes it easy to match the video and sound output positions when playing back the speech locally after transmission.

　図１に戻って説明を続ける。明瞭度調整部１３２は、個別音分離部１３１により生成されたリモート空間に存在する各人の個別音の入力を受ける。また、明瞭度調整部１３２は、人別リストの入力を個別音分離部１３１から受ける。さらに、明瞭度調整部１３２は、音声分離部１２による音声分離により抽出された背景音の入力を受ける。 Returning to Figure 1, the explanation continues. The clarity adjustment unit 132 receives input of the individual sounds of each person present in the remote space generated by the individual sound separation unit 131. The clarity adjustment unit 132 also receives input of a person list from the individual sound separation unit 131. Furthermore, the clarity adjustment unit 132 receives input of background sounds extracted by audio separation by the audio separation unit 12.

　明瞭度調整部１３２は、個別音毎に各発話者が内野であるか外野であるかを人別リストの内野フラグを用いて判定する。そして、明瞭度調整部１３２は、判定結果に応じて個別音及び背景音に対して明瞭度調整処理を実施する。詳しくは、明瞭度調整部１３２は、内野の人の個別音に対して、より聞こえやすくするエンハンス（Enhance）処理を実施する。また、明瞭度調整部１３２は、外野の人の個別音及び背景音に対して、より聞こえ難くするデグレード（Degrade）処理を実施する。その後、明瞭度調整部１３２は、明瞭度調整処理を施した個別音及び背景音を出力統合部１４へ出力する。 The clarity adjustment unit 132 determines for each individual sound whether each speaker is in the infield or outfield using the infield flag in the person list. Then, the clarity adjustment unit 132 performs clarity adjustment processing on the individual sounds and background sounds according to the determination result. In detail, the clarity adjustment unit 132 performs enhancement processing on the individual sounds of people in the infield, making them easier to hear. The clarity adjustment unit 132 also performs degrade processing on the individual sounds and background sounds of people in the outfield, making them harder to hear. The clarity adjustment unit 132 then outputs the individual sounds and background sounds that have been subjected to clarity adjustment processing to the output integration unit 14.

　明瞭度調整部１３２は、明瞭度調整処理には、フォルマント重み付けや、イコライジング・フィルタや、残響フィルタを用いることができる。フォルマント重み付けの場合、明瞭度調整部１３２は、エンハンス処理として第２フォルマンド以上を補強し、デグレード処理として第２フォルマント以上を抑圧する。また、明瞭度調整部１３２は、イコライジング・フィルタを用いた処理として、ローパスフィルタによる背景音のデグレード処理や、残響音の付加によるデグレード処理を実施する。 The clarity adjustment unit 132 can use formant weighting, an equalizing filter, or a reverberation filter for clarity adjustment processing. In the case of formant weighting, the clarity adjustment unit 132 reinforces the second formant and above as an enhancement process, and suppresses the second formant and above as a degradation process. In addition, the clarity adjustment unit 132 performs equalizing filter processing, such as degrading background sounds using a low-pass filter and degrading sounds by adding reverberation.

　このように、明瞭度調整部１３２は、マイク３により収音されたリモート空間にあたる所定空間の音声に含まれる、人センシング部１１により判定された会話に参加している発話者の発話音声について明瞭度の調整を行う。また、明瞭度調整部１３２は、リモート空間にあたる所定空間の音声に含まれる会話に参加していない発話者の発話音声について、マイク３による収音状態よりも明瞭度を低下させる明瞭度低下調整を行う。また、明瞭度調整部１３２は、音声分離部１２により分離された背景音について、他の音声よりも明瞭度を低下させる明瞭度低下調整を行う。 In this way, the clarity adjustment unit 132 adjusts the clarity of the speech of speakers who are participating in the conversation as determined by the human sensing unit 11, which is included in the audio of the specified space corresponding to the remote space picked up by the microphone 3. The clarity adjustment unit 132 also performs a clarity reduction adjustment on the speech of speakers who are not participating in the conversation, which is included in the audio of the specified space corresponding to the remote space, to reduce the clarity compared to the audio pickup state by the microphone 3. The clarity adjustment unit 132 also performs a clarity reduction adjustment on the background sound separated by the audio separation unit 12 to reduce the clarity compared to other audio.

　この点、従来は、背景音の適切な取り扱いや、発話者の会話への参加不参加に応じた明瞭度調整が困難であった。これに対して、明瞭度調整部１３２は、伝送する音が背景音か発話音声か、また発話音声の中でも発話者が内野か外野かで会話への参加不参加を判定することによって、聞き取りやすさを変更することができ、対話の円滑化を促進することができる。 In the past, it was difficult to appropriately handle background sounds and adjust clarity depending on whether a speaker is participating in a conversation. In contrast, the clarity adjustment unit 132 determines whether the transmitted sound is background sound or speech, and even within speech, whether the speaker is in the infield or outfield, thereby making it possible to change the ease of audibility and promote smoother dialogue.

　＜１．２．４．出力統合部＞
　出力統合部１４は、明瞭度調整処理が施された個別音及び背景音の入力を明瞭度調整部１３２はから受ける。また、出力統合部１４は、人別リストの入力を人センシング部１１から受ける。そして、出力統合部１４は、以下に説明する明瞭度調整処理を実行する。 <1.2.4. Output Integration Unit>
The output integration unit 14 receives input of the individual sounds and background sounds that have been subjected to clarity adjustment processing from the clarity adjustment unit 132. The output integration unit 14 also receives input of the person list from the person sensing unit 11. Then, the output integration unit 14 executes the clarity adjustment processing described below.

　背景音や発話音声の音声処理は、それぞれの音声毎に対応させて１チャンネルずつ行われる。すなわち、出力統合部１４は、発話者の人数に背景音の１チャンネルを付加した数のデータの入力を受ける。つまり、背景音が１チャンネルのデータであり、発話者の人数がｐ人だとすると、出力統合部１４は、（１＋Ｐ）チャンネルのデータを取得する。 Audio processing of background sounds and speech sounds is performed one channel at a time, corresponding to each sound. In other words, the output integration unit 14 receives input data equal to the number of speakers plus one channel of background sound. In other words, if the background sound is one channel of data and the number of speakers is p, the output integration unit 14 obtains (1 + P) channels of data.

　そして、出力統合部１４は、ローカルに音声を伝送するために、すべてのチャンネルの音を統合して１つの音データにする。例えば、出力統合部１４は、フレーム毎に各チャンネルの要素を含ませるチャンネルインターリーブを用いて１つの音データを生成することができる。その後、出力統合部１４は、生成した１つの音データを送信部１５へ出力する。また、出力統合部１４は、人別リストの各要素及び背景音のそれぞれと音データに含まれる各チャンネルとを対応付ける情報とともに、人別リストを送信部１５へ出力する。ここで、出力統合部１４は、音データに人別リストのデータを付加して１つのデータとしてもよい。 Then, in order to transmit the audio locally, the output integration unit 14 integrates the sounds of all channels into a single piece of audio data. For example, the output integration unit 14 can generate a single piece of audio data using channel interleaving, which includes elements of each channel for each frame. The output integration unit 14 then outputs the generated single piece of audio data to the transmission unit 15. The output integration unit 14 also outputs the person list to the transmission unit 15, along with information that associates each element of the person list and background sound with each channel included in the audio data. Here, the output integration unit 14 may add the data of the person list to the audio data to create a single piece of data.

　＜１．２．５．送信部＞
　送信部１５は、各発話者の個別音及び背景音を含む１つの音データの入力を出力統合部１４から受ける。また、送信部１５は、人別リストの入力を出力統合部１４から受ける。また、送信部１５は、撮像された映像の入力をカメラ２から受ける。そして、送信部１５は、各発話者の個別音及び背景音を含む１つの音データ、人別リスト、並びに、カメラ２の映像を、ネットワーク７を介してローカル側の遠隔コミュニケーション装置１の受信ユニット２０へ送信する。ネットワーク７は、例えば、インターネット、ＬＡＮ（Ｌｏｃａｌ　Ａｒｅａ　Ｎｅｔｗｏｒｋ）等である。このように、送信部１５は、明瞭度調整部１３２により処理されたリモート空間にあたる所定空間の音声を送信する。 <1.2.5. Transmitter>
The transmitter 15 receives an input of a single piece of sound data including the individual sounds of each speaker and background sound from the output integration unit 14. The transmitter 15 also receives an input of a person list from the output integration unit 14. The transmitter 15 also receives an input of captured video from the camera 2. The transmitter 15 then transmits the single piece of sound data including the individual sounds of each speaker and background sound, the person list, and the video from the camera 2 to the receiving unit 20 of the remote communication device 1 on the local side via the network 7. The network 7 is, for example, the Internet or a LAN (Local Area Network). In this way, the transmitter 15 transmits the audio of a predetermined space corresponding to the remote space processed by the clarity adjustment unit 132.

　＜１．３．受信ユニット＞
　受信ユニット２０は、本実施の形態ではローカル側である受信側の遠隔コミュニケーション装置１で用いられる機能である。受信ユニット２０は、図１に示すように、受信部２１及び音声出力制御部２２を有する。 1.3. Receiving unit
The receiving unit 20 is a function used in the receiving-side remote communication device 1, which is the local side in this embodiment. The receiving unit 20 includes a receiving section 21 and an audio output control section 22, as shown in FIG.

　＜．３．１．受信部＞
　受信部２１は、リモート側の遠隔コミュニケーション装置１の送信部１５から送信されたリモート空間の各発話者の個別音及び背景音を含む１つの音データ、人別リスト、並びに、映像を受信する。受信部２１は、受信した映像をディスプレイ６に出力する。ディスプレイ６は、受信部２１から出力された映像をスクリーンに表示する。また、受信部２１は、音データ及び人別リストを音声出力制御部２２へ出力する。 <3.1. Receiving section>
The receiving unit 21 receives one piece of sound data including individual sounds of each speaker in the remote space and background sounds, a person list, and video transmitted from the transmitting unit 15 of the remote communication device 1 on the remote side. The receiving unit 21 outputs the received video to the display 6. The display 6 displays the video output from the receiving unit 21 on a screen. The receiving unit 21 also outputs the sound data and the person list to the audio output control unit 22.

　このように、受信部２１は、送信部１５により送信されたリモート空間にあたる所定空間の音声を受信する。また、受信部２１は、リモート空間にあたる所定空間の音声とともに映像を受信して、映像を写した際の映像の上下方向の両端部に１つずつスピーカーアレイ４１及び４２が配置されたスクリーンに映像を映す。 In this way, the receiving unit 21 receives the audio from the specified space corresponding to the remote space transmitted by the transmitting unit 15. The receiving unit 21 also receives the video along with the audio from the specified space corresponding to the remote space, and projects the video onto a screen on which speaker arrays 41 and 42 are arranged, one at each end of the video in the vertical direction when the video is displayed.

　＜１．３．２．音声出力制御部＞
　音声出力制御部２２は、以下に説明する複数対向スピーカーバランシング処理を実施して、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット４１１及び４１２を指定して再生させる。音声出力制御部２２は、各発話者の個別音及び背景音を含む音データ、並びに、人別リストの入力を受信部２１から受ける。以下では、内野の人の発話音声を「内野発話音声」と呼び、外野の人の発話音声を「外野発話音声」と呼ぶ。 <1.3.2. Audio output control unit>
The audio output control unit 22 performs a multiple opposing speaker balancing process described below to specify the speaker units 411 and 412 that will output the background sound, the infield speech sound, and the outfield speech sound, respectively, and plays them back. The audio output control unit 22 receives sound data including the individual sound and background sound of each speaker, as well as an input of a person list, from the receiving unit 21. Hereinafter, the speech sound of people in the infield will be referred to as "infield speech sound," and the speech sound of people in the outfield will be referred to as "outfield speech sound."

　図６は、複数対向スピーカーバランシング処理を示す図である。図６の紙面に向って上側の画像はスピーカーアレイ４１及び４２における各部の長さを示し、紙面に向って下側のグラフ２００は処理後の背景音、内野発話音声及び外野発話音声のスピーカーアレイ４１及び４２からの出音を示す。グラフ２００は、横軸でスピーカーアレイ４１及び４２に対応する位置を示し、縦軸で音量を示す。グラフ２００の縦軸は、内野発話音声のピークの音量を１として規格化した値を示す。 Figure 6 is a diagram showing the multiple opposing speaker balancing process. The image at the top of Figure 6 shows the length of each part of speaker arrays 41 and 42, and graph 200 at the bottom shows the post-processing output sounds from speaker arrays 41 and 42 for background sound, infield speech sounds, and outfield speech sounds. Graph 200 shows the positions corresponding to speaker arrays 41 and 42 on the horizontal axis, and the volume on the vertical axis. The vertical axis of graph 200 shows a value normalized with the peak volume of infield speech sounds set to 1.

　音声出力制御部２２は、物理的なスピーカーアレイ幅Ｗ_ｓｐｋ及びディスプレイ幅Ｗ_ｄｉｓｐの情報を予め有する。また、音声出力制御部２２は、スピーカーアレイ４１の中のスピーカーユニット４１１の数ｕの情報も予め有する。スピーカーアレイ４２の中のスピーカーユニット４１２の数もｕである。ここで、Ｗ_ｓｐｋは左端のスピーカーユニット４１１と右端のスピーカーユニット４１１と間の距離を表す。これは、Ｗ_ｓｐｋは、スピーカーユニット４１２についても同様である。また、スピーカーユニット４１１同士及びスピーカーユニット４１２同士の間のスピーカー間距離Ｗ_ｕはすべて等しく、ディスプレイ６とスピーカーアレイ４１とスピーカーアレイ４２とは左右中央揃えされている。 The audio output control unit 22 has information on the physical speaker array width W _spk and display width W _disp in advance. The audio output control unit 22 also has information on the number u of speaker units 411 in the speaker array 41 in advance. The number of speaker units 412 in the speaker array 42 is also u. Here, W _spk represents the distance between the leftmost speaker unit 411 and the rightmost speaker unit 411. The same applies to W _spk for the speaker units 412. Furthermore, the inter-speaker distances W _u between the speaker units 411 and between the speaker units 412 are all equal, and the display 6, speaker array 41, and speaker array 42 are aligned horizontally and centrally.

　背景音、内野発話音声及び外野発話音声に対する出力調整処理に共通する前段の処理として、音声出力制御部２２は、既知のＷ_ｓｐｋ及びｕを用いて、Ｗ_ｕ＝Ｗ_ｓｐｋ／ｕとしてスピーカーユニット４１１間のスピーカー間距離Ｗ_ｕを求める。ここで、Ｗ_ｕは物理的な実際の距離であるため、音声出力制御部２２は、映像上の発話者の位置と音声とを一致させるため、処理上の距離をＷ_ｕ＿ｎ＝Ｗ_ｕ／Ｗ_ｄｉｓｐとして正規化座標の大きさに揃える。また、音声出力制御部２２は、スピーカーアレイ幅Ｗ_ｓｐｋも同様にＷ_{ｓｐｋ＿ｎ}として正規化座標に揃える。また、音声出力制御部２２は、スピーカーユニット４１１及び４１２の正規化座標ｘｕを算出する。 As a preliminary process common to the output adjustment processes for background sound, infield speech sound, and outfield speech sound, the audio output control unit 22 calculates the inter-speaker distance _Wu between the speaker units 411 as _Wu = _Wspk /u using known _Wspk and u. Here, since _Wu is the actual physical distance, the audio output control unit 22 aligns the processed distance to the normalized coordinate size as _{Wu_n} = _Wu / _Wdisp in order to match the position of the speaker on the video with the audio. The audio output control unit 22 also aligns the speaker array width _Wspk to the normalized coordinate size as _{Wspk_n} . The audio output control unit 22 also calculates the normalized coordinate xu of the speaker units 411 and 412.

　ここで、振幅比をｘ、音圧レベルをｙ、ラウドネス比をｚとすると、ｙ＝２０ｌｏｇ_１０ｘ及びｚ＝２^ｙ／１０という関係式が成り立つ。これをｘについて解くと、ｘ＝ｚ^{１／２ｌｏｇ} _１０ ^２という関係が導かれる。これはすなわち、音量をｚ倍したい場合には、振幅をｚ^１．６６倍すればよいことを表す。 Here, if the amplitude ratio is x, the sound pressure level is y, and the loudness ratio is z, then the following relationship holds: y = 20 log ₁₀ x and z = 2 ^y/10 . Solving this for x yields the relationship x = z ^{1/2 log} ₁₀ ^2. This means that if you want to increase the volume by z times, you just need to multiply the amplitude by z ^1.66 .

　背景音に対する複数対向スピーカーバランシング処理について説明する。本実施の形態に係る遠隔コミュニケーション装置１では、空間のつながり感を維持しつつ、リモートにいる対話の相手とのコミュニケーションを円滑に行うため、背景音を完全に除去するのではなく敢えてローカルに伝送する。ただし、背景音の音量を大きくしすぎると内野発話が聞き取り難くなる。そこで、音声出力制御部２２は、デグレード処理で明瞭度が下げられた背景音に対して、再生のタイミングでさらに以下の音量調整を加えて背景音の定位感をぼかす。 The following describes the multiple opposing speaker balancing process for background sound. In the remote communication device 1 according to this embodiment, in order to facilitate communication with a remote conversation partner while maintaining a sense of spatial connection, background sound is transmitted locally rather than being completely removed. However, if the volume of the background sound is made too high, it becomes difficult to hear infield speech. Therefore, the audio output control unit 22 further applies the following volume adjustments to the background sound, whose clarity has been reduced by the degradation process, at the timing of playback to blur the sense of positioning of the background sound.

　具体的には、音声出力制御部２２は、音データから背景音を取得する。次に、音声出力制御部２２は、背景音のラウドネスが１／ｕとなるようにして、全てのスピーカーユニット４１１及び４１２から出力させる。すなわち、音声出力制御部２２は、背景音の振幅を（１／ｕ）^１．６６倍になるように処理して、スピーカーアレイ４１の全てのスピーカーユニット４１１及びスピーカーアレイ４２の全てのスピーカーユニット４１２から出音させる。 Specifically, the audio output control unit 22 acquires background sound from the sound data. Next, the audio output control unit 22 adjusts the loudness of the background sound to 1/u and outputs it from all speaker units 411 and 412. That is, the audio output control unit 22 processes the amplitude of the background sound to be ^1.66 times (1/u), and outputs the background sound from all speaker units 411 of the speaker array 41 and all speaker units 412 of the speaker array 42.

　以上の処理を加えることで、背景音は、グラフ２００の曲線２０１に示すように全てのスピーカーユニット４１１及び４１２から同じ抑えられた音量で出音される。ここでは、背景音とは、自然環境音、カフェのＢＧＭ及びオフィスの喧騒等の音源位置が明確でない音を対象とする。音声出力制御部２２は、背景音の定位感をぼかすことで、ある１つのスピーカーユニット４１１及び４１２から再生するよりもリモート空間の様子を臨場感高く再現できる。 By performing the above processing, background sound is output at the same reduced volume from all speaker units 411 and 412, as shown by curve 201 in graph 200. Here, background sound refers to sounds whose source location is unclear, such as natural environmental sounds, background music in a cafe, and office noise. By blurring the sense of positioning of the background sound, the audio output control unit 22 can reproduce the remote space with a greater sense of realism than if it were played from a single speaker unit 411 or 412.

　次に、内野発話音声に対する複数対向スピーカーバランシング処理について説明する。内野発話音声はローカルの人に向けた発話なので、なるべく聞こえやすくすることが好ましい。そこで、音声出力制御部２２は、リモートにおいてエンハンス処理が施され内野の個別音に対して以下のような処理を実施する。 Next, we will explain the multiple facing speaker balancing process for infield speech sounds. Since infield speech sounds are intended for local people, it is preferable to make them as easy to hear as possible. Therefore, the audio output control unit 22 performs the following process on individual infield sounds that have been remotely enhanced.

　音声出力制御部２２は、人別リストを用いて音データから内野発話音声を取得する。次に、音声出力制御部２２は、取得したその内野発話音声の定位感はぼかさず、音も大きさも調整せずそのままスピーカーアレイ４１及び４２に再生させる。ここで、音声出力制御部２２は、その内野発話音声の発話者の正規化座標に最も近いスピーカーユニット４１１及び４１２を特定する。すなわち、発話者の正規化座標をｘ_ｐとした場合、音声出力制御部２２は、以下の数式（３）で求まる正規化座標ｘ_ｕに存在するスピーカーユニット４１１及び４２１を特定する。 The audio output control unit 22 acquires infield speech from the sound data using the person list. Next, the audio output control unit 22 plays the acquired infield speech directly from the speaker arrays 41 and 42 without blurring the sense of localization of the speech and without adjusting the volume or volume. Here, the audio output control unit 22 identifies the speaker units 411 and 412 that are closest to the normalized coordinates of the speaker of the infield speech. In other words, if the normalized coordinates of the speaker are x _p , the audio output control unit 22 identifies the speaker units 411 and 421 that exist at the normalized coordinate x _u calculated by the following formula (3).

　そして、音声出力制御部２２は、特定したスピーカーユニット４１１及び４１２からその内野発話音声を出力させる。この場合、音声出力制御部２２は、その内野発話音声を他のスピーカーユニット４１１及び４１２からは出力させない。以上の処理を加えることで、内野発話音声は、グラフ２００の曲線２０２に示すように１つの特定のスピーカーユニット４１１及び４１２から他の音よりも大きな音量で出音される。 Then, the audio output control unit 22 outputs the infield speech sound from the identified speaker units 411 and 412. In this case, the audio output control unit 22 does not output the infield speech sound from the other speaker units 411 and 412. By performing the above processing, the infield speech sound is output from one specific speaker unit 411 and 412 at a louder volume than other sounds, as shown by curve 202 in graph 200.

　次に、外野発話音声に対する複数対向スピーカーバランシング処理について説明する。外野発話は、ローカルの人が直接会話している相手ではないので明瞭に聞こえなくてもよいが、背景音と同様に空間のつながり感の観点でも完全に除去することは望ましくない。また、明瞭度は下げつつも耳を傾ければ聞き取れるようにすることで、ローカルの人が外野の人に話しかけ、コミュニケーションが発展し内野に変化するというチャンスもある。そこで、相対的に内野発話を聞き取りやすく、外野発話を聞き取り難くするため、リモートにおいて外野の個別音に対してデグレード処理を施している。そのうえで、音声出力制御部２２は、取得したその外野発話音声の定位感をぼかす処理を行う。 Next, we will explain the multiple facing speaker balancing process for outfield speech sounds. Outfield speech does not need to be clearly audible because the local person is not directly speaking with them, but as with background sound, it is undesirable to completely remove it from the perspective of spatial connection. Also, by lowering the clarity but still being audible if you listen carefully, there is an opportunity for a local person to talk to someone in the outfield, and the communication develops and shifts to the infield. Therefore, to make infield speech relatively easier to hear and outfield speech relatively harder to hear, degrading processing is performed on individual outfield sounds remotely. Then, the audio output control unit 22 performs processing to blur the sense of positioning of the acquired outfield speech sounds.

　具体的には、外野発話音声は音源位置が明確であるため、音声出力制御部２２は、正規分布関数を用いる処理により外野発話音声の定位感をぼかす。ここで、正規分布関数は、μを平均値とし、σを標準偏差として次の数式（４）で表される。 Specifically, because the sound source position of outfield speech is clear, the audio output control unit 22 blurs the sense of localization of outfield speech by processing using a normal distribution function. Here, the normal distribution function is expressed by the following equation (4), where μ is the mean value and σ is the standard deviation.

　音声出力制御部２２は、正規分布関数のピークの部分を外野発話音声の発話者の位置と合致させるため、数式（３）を満たす発話者の正規化座標に最も近いスピーカーユニット４１１及び４１２の正規化座標ｘ_ｕの位置をμの値に設定する。また、音声出力制御部２２は、σをＷ_ｕとする。図７は、外野発話音声に対する正規分布関数を用いた処理を示す図である。音声出力制御部２２は、ピークを外野発話音声の発話者の位置としたグラフ２１１で表される関数ｆ（ｘ）を作成する。ここで、ｆ（ｘ）は、次の数式（５）で表される。 In order to match the peak of the normal distribution function with the position of the speaker of the outfield speech, the audio output control unit 22 sets the value of μ to the position of the normalized coordinate x _u of the speaker units 411 and 412 that is closest to the normalized coordinate of the speaker that satisfies Equation (3). Furthermore, the audio output control unit 22 sets σ to W _u . FIG. 7 is a diagram showing processing using a normal distribution function for outfield speech. The audio output control unit 22 creates a function f(x) represented by a graph 211 whose peak is the position of the speaker of the outfield speech. Here, f(x) is expressed by the following Equation (5).

　正規化分布関数のｘ座標の範囲は通常は－∞から∞までであるが、音声出力制御部２２は、スピーカーアレイ４１及び４２を再生する範囲として、例えば、信頼区間９５％までの区間にあたるスピーカーユニット４１１及び４１２で再生するといった制約を掛ける。これにより、音声出力制御部２２は、μ－２σからμ＋２σまでのレンジ＝２の範囲で出音させることができ、出音する範囲を有限にする。この場合、音声出力制御部２２は、１人の外野発話音声を５つずつのスピーカーユニット４１１及び４１２に再生させる。このように再生させるスピーカーユニット４１１及び４１２の範囲を狭めることで、音声出力制御部２２は、処理負荷を軽減することができる。 The x-coordinate range of the normalized distribution function is normally from -∞ to ∞, but the audio output control unit 22 imposes a restriction on the range of playback by the speaker arrays 41 and 42, for example, by playing back using speaker units 411 and 412 within the 95% confidence interval. This allows the audio output control unit 22 to output sound within a range of μ-2σ to μ+2σ (range = 2), fining the range of sound output. In this case, the audio output control unit 22 plays back one person's outfield speech from five speaker units 411 and 412 each. By narrowing the range of speaker units 411 and 412 to be played back in this way, the audio output control unit 22 can reduce the processing load.

　また、σの値によっては最大値が１より大きくなってしまうことが考えられる。その場合、ｆ（ｘ）を音量調整に用いると内野発話音声よりも外野発話音声が大きくなる可能性がある。そこで、音声出力制御部２２は、スケーリング因子ｓｆを用いて、図７のグラフ２１２で示すように、ｇ（ｘ）＝ｓｆ／ｆ（μ）×ｆ（ｘ）として音量を全体にわたって小さくすることで、内野発話音声よりも外野発話音声が大きく再生することを防止する。ｓｆは例えば、０．８とすることができる。 Furthermore, depending on the value of σ, it is possible that the maximum value will be greater than 1. In that case, using f(x) to adjust the volume may result in outfield speech sounds being louder than infield speech sounds. Therefore, the audio output control unit 22 uses a scaling factor sf to reduce the overall volume by g(x) = sf/f(μ) × f(x), as shown in graph 212 in Figure 7, thereby preventing outfield speech sounds from being played louder than infield speech sounds. sf can be set to 0.8, for example.

　そして、音声出力制御部２２は、正規化座標に合わせて、外野発話音声の音の振幅を｛ｇ（ｘ）｝^１．６６倍になるように処理して、外野発話音声の発話者の位置を中心として指定した数のスピーカーユニット４１１及び４１２で外野発話音声を再生させる。ここで、ピークの値を規定するスケーリング因子ｓｆの値（０≦ｓｆ≦１）やスピーカー再生範囲は、設定ファイル等を用いて利用者が変更できる。以上の処理を加えることで、外野発話音声は、図５のグラフ２００の曲線２０３に示すように発話者の位置を中心としての所定の範囲のスピーカーユニット４１１及び４１２から内野発話音声よりも抑えた音量で出音される。 The audio output control unit 22 then processes the amplitude of the outfield speech sound to be ^1.66 times {g(x)} according to the normalized coordinates, and reproduces the outfield speech sound from a specified number of speaker units 411 and 412 centered around the speaker position of the outfield speech sound. Here, the value of the scaling factor sf (0≦sf≦1) that defines the peak value and the speaker playback range can be changed by the user using a configuration file, etc. By performing the above processing, the outfield speech sound is output at a lower volume than the infield speech sound from the speaker units 411 and 412 within a specified range centered around the speaker position, as shown by curve 203 in graph 200 of FIG. 5.

　このように、音声出力制御部２２は、複数のスピーカーユニット４１１が一列に並ぶスピーカーアレイ４１と複数のスピーカーユニット４１２が一列に並ぶスピーカーアレイ４２とを２つ有する複数対向スピーカー４において、以下の処理を実行する。音声出力制御部２２は、発話者それぞれの位置を基に、受信部２１により受信されたリモート空間にあたる所定空間の音声のうち各発話者の発話音声を再生させるスピーカーユニット４１１及び４１２を選択する。そして、音声出力制御部２２は、選択した前記スピーカーユニット４１１及び４１２に発話者それぞれの発話音声を再生させる。また、音声出力制御部２２は、スクリーンに映された発話者毎に、スクリーン上の位置の近傍で発話音声が再生されるように、発話者の位置を基に再生させるスピーカーユニット４１１及び４１２を選択する。 In this way, the audio output control unit 22 performs the following processing on the multiple opposed speakers 4, which have two arrays: a speaker array 41 in which multiple speaker units 411 are arranged in a row, and a speaker array 42 in which multiple speaker units 412 are arranged in a row. Based on the position of each speaker, the audio output control unit 22 selects speaker units 411 and 412 to play back the speech of each speaker from the audio received by the receiving unit 21 in a specified space corresponding to the remote space. The audio output control unit 22 then causes the selected speaker units 411 and 412 to play back the speech of each speaker. Furthermore, for each speaker shown on the screen, the audio output control unit 22 selects speaker units 411 and 412 to play back based on the position of the speaker so that the speech is played back near the position on the screen.

　この点、従来は発話者が画面の向こうの相手と会話しているか否か、すなわち会話に参加しているか否かにより多チャンネルスピーカーの再生方法を制御することは困難であった。これに対して、音声出力制御部２２は、ローカルにとっての対話相手である内野の発話者の声が相対的に聞き取り易くすると同時に、背景音や外野発話を残すことでリモートの空間とのつながり感を保持する。また、音声出力制御部２２は、背景音や外野発話を複数のスピーカーユニット４１１及び４１２から再生することで音の定位感をぼかしてより聞き取り難くし、かつ、内野発話音声を相対的に聞き取り易くすることでリモートの相手との会話により集中できるようにする。 In the past, it was difficult to control the playback method of multi-channel speakers depending on whether the speaker was speaking with the other party on the other side of the screen, i.e., whether they were participating in the conversation. In contrast, the audio output control unit 22 makes the voice of the infield speaker, who is the local conversation partner, relatively easier to hear, while maintaining a sense of connection with the remote space by leaving background sounds and outfield speech. In addition, the audio output control unit 22 plays background sounds and outfield speech from multiple speaker units 411 and 412, blurring the sense of sound localization and making them more difficult to hear, while making infield speech relatively easier to hear, allowing you to better concentrate on the conversation with the remote party.

　＜２．第１の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ＞
　図８は、第１の実施の形態に係る自分側の遠隔コミュニケーション装置間のデータの流れの概要を示す図である。ここでも、リモート側の遠隔コミュニケーション装置１の送信ユニット１０からローカル側の遠隔コミュニケーション装置１の受信ユニット２０へとデータが送信される場合で説明する。ここでは、音声の送信について説明する。 2. Data flow between remote communication devices according to the first embodiment
8 is a diagram showing an outline of the data flow between the local remote communication devices according to the first embodiment. Here, too, the case where data is transmitted from the transmitting unit 10 of the remote remote communication device 1 to the receiving unit 20 of the local remote communication device 1 will be described. Here, the transmission of audio will be described.

　送信側処理２２１が、リモート側に存在する遠隔コミュニケーション装置１が実行する処理である。また、受信側処理２２２が、ローカル側に存在する遠隔コミュニケーション装置１が実行する処理である。 The sending-side process 221 is a process executed by the remote communication device 1 located on the remote side. The receiving-side process 222 is a process executed by the remote communication device 1 located on the local side.

　マイク３は、リモート空間で収音した音声をリモート側の遠隔コミュニケーション装置１へ入力する（ステップＳ１）。ここでは、例えば、マイクアレイ３０のデータとしてｓチャンネルのデータが存在する。また、カメラ２は、リモート空間の撮影を行って生成した映像をリモート側の遠隔コミュニケーション装置１へ入力する（ステップＳ２）。 The microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S1). Here, for example, data from the microphone array 30 is s channel data. The camera 2 also inputs the video generated by capturing images of the remote space to the remote communication device 1 (step S2).

　カメラ２から入力された映像は、人センシング部１１に送られる。人センシング部１１は、カメラ２により撮影された映像を用いて、リモート空間に存在する各人の位置の判定及び内野外野の判定を行う人センシング処理を実行する（ステップＳ３）。 The video input from camera 2 is sent to the human sensing unit 11. The human sensing unit 11 uses the video captured by camera 2 to perform human sensing processing to determine the position of each person in the remote space and to determine whether the field or outfield is located (step S3).

　詳しくは、人センシング部１１は、スクリーンに映される映像の横方向に正規化座標を設定する。そして、人センシング部１１は、映像内の発話者に対して骨格認識を行い、各発話者の位置の正規化座標を取得する。その後、人センシング部１１は、発話者の並びに対応させた要素を有する人別リストを生成して、各人の位置の正規化座標を登録することで、人位置判定処理を実行する（ステップＳ３１）。ここでは、人センシング部１１は、映像内に存在するｐ人の発話者を抽出する。すなわち、人別リストにはｐ個の要素が登録される。 In more detail, the human sensing unit 11 sets normalized coordinates in the horizontal direction of the image projected on the screen. Then, the human sensing unit 11 performs skeletal recognition on speakers in the image and obtains normalized coordinates of the position of each speaker. After that, the human sensing unit 11 generates a person list having elements corresponding to the order of speakers and registers the normalized coordinates of each person's position, thereby executing the human position determination process (step S31). Here, the human sensing unit 11 extracts p speakers present in the image. In other words, p elements are registered in the person list.

　また、人センシング部１１は、映像内の各発話者の頭部方向を推定する。そして、人センシング部１１は、推定した頭部方向が予め決められた内野範囲に含まれるか否かにより、各発話者が内野か外野かを判定し、判定結果を基に内野及び外野を示す内野フラグを人別リストに登録することで内野外野判定処理を実行する（ステップＳ３２）。 The human sensing unit 11 also estimates the head direction of each speaker in the video. Then, depending on whether the estimated head direction is within a predetermined infield range, the human sensing unit 11 determines whether each speaker is in the infield or outfield, and executes the infield/outfield determination process by registering an infield flag indicating the infield or outfield in the person list based on the determination result (step S32).

　マイク３から入力された音声は、音声分離部１２へ送られる。音声分離部１２は、マイク３から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う（ステップＳ４）。音声分離部１２は、マイクアレイ３０から入力されたｓチャンネルのデータに対して、背景音として１チャンネルのデータを生成し、発話音声としてｓチャンネルのデータを生成する。 The audio input from microphone 3 is sent to audio separation unit 12. Audio separation unit 12 performs audio separation processing to separate speech and background sounds contained in the audio input from microphone 3 (step S4). For the s-channel data input from microphone array 30, audio separation unit 12 generates one channel of data as background sound and s-channel data as speech.

　音声分離部１２により抽出された１チャンネルのデータである背景音は、音響信号処理部１３の明瞭度調整部１３２へ送られる。明瞭度調整部１３２は、明瞭度調整処理として背景音に対してより聞こえ難くするデグレード処理を実施する（ステップＳ５）。 The background sound, which is one-channel data extracted by the audio separation unit 12, is sent to the clarity adjustment unit 132 of the audio signal processing unit 13. The clarity adjustment unit 132 performs a clarity adjustment process, called a degradation process, to make the background sound less audible (step S5).

　また、音声分離部１２により抽出されたｓチャンネルのデータである発話音声は、音響信号処理部１３の個別音分離部１３１へ送られる。また、人別リストが、個別音分離部１３１へ送られる。発話音声に対しては、音響信号処理部１３により人別リストを用いた発話音声信号処理が実施される（ステップＳ６）。 Furthermore, the speech sound, which is the s-channel data extracted by the speech separation unit 12, is sent to the individual sound separation unit 131 of the acoustic signal processing unit 13. The person list is also sent to the individual sound separation unit 131. The acoustic signal processing unit 13 performs speech sound signal processing on the speech sound using the person list (step S6).

　詳しくは、個別音分離部１３１は、取得した人別リストに登録された要素数であるｐ個分のスレッドを作成する。さらに、個別音分離部１３１は、人別リストに登録された正規化座標の値を、スレッド毎のスレッド固有の値として保持させる。そして、個別音分離部１３１は、各発話者の正規化座標を仮想スクリーンの座標として、物理的距離を算出する。そして、個別音分離部１３１は、算出した物理的距離、仮想クリーンからカメラ２までの距離を用いて各発話者までの角度θを求める。次に、個別音分離部１３１は、音声分離部１２による音声分離により抽出された発話音声に対して、各発話者までの角度に応じたビームフォーミングを用いて個別音分離処理を行う（ステップＳ６１）。これにより、個別音分離部１３１は、発話者毎に１チャンネルのデータである個別音を生成する。すなわち、ｐ個のスレッド１つ１つで、１チャンネルの個別音のデータが生成される。 In more detail, the individual sound separation unit 131 creates p threads, which is the number of elements registered in the acquired personal list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the personal list as thread-specific values for each thread. The individual sound separation unit 131 then calculates the physical distance using the normalized coordinates of each speaker as coordinates on the virtual screen. The individual sound separation unit 131 then determines the angle θ to each speaker using the calculated physical distance and the distance from the virtual screen to camera 2. Next, the individual sound separation unit 131 performs individual sound separation processing on the speech sounds extracted by speech separation by the speech separation unit 12 using beamforming according to the angle to each speaker (step S61). As a result, the individual sound separation unit 131 generates individual sounds, which are one-channel data for each speaker. In other words, one-channel individual sound data is generated for each of the p threads.

　次に、明瞭度調整部１３２は、スレッド毎に個別音に対応する発話者が内野であるか外野であるかを人別リストの内野フラグを用いて判定する。そして、明瞭度調整部１３２は、明瞭度調整処理として、内野の発話者の個別音に対してより聞こえやすくするエンハンス処理を実施し、かつ、外野の発話者の個別音に対してより聞こえ難くするデグレード処理を実施する（ステップＳ６２）。 Next, the clarity adjustment unit 132 determines whether the speaker corresponding to the individual sounds for each thread is infield or outfield using the infield flag in the person list. Then, as a clarity adjustment process, the clarity adjustment unit 132 performs an enhancement process to make the individual sounds of infield speakers easier to hear, and a degradation process to make the individual sounds of outfield speakers harder to hear (step S62).

　明瞭度調整処理が施された背景音は、出力統合部１４へ入力される。また、発話音声信号処理が施された個別音も、出力統合部１４へ入力される。これにより、出力統合部１４は、（１＋Ｐ）チャンネルのデータを取得する。そして、出力統合部１４は、すべてのチャンネルの音を統合して１つの音データを生成する出力統合処理を実行する（ステップＳ７）。 The background sound that has undergone clarity adjustment processing is input to the output integrating unit 14. The individual sounds that have undergone speech signal processing are also input to the output integrating unit 14. As a result, the output integrating unit 14 obtains data for (1+P) channels. The output integrating unit 14 then executes output integration processing to integrate the sounds from all channels to generate a single sound data set (step S7).

　出力統合部１４で生成された音データ及び人別リストは、送信部１５によりネットワーク７を介してローカル側の遠隔コミュニケーション装置１へ送られる（ステップＳ８）。人別リストの送信により、人位置判定処理により判定された各発話者の正規化座標及び内野外野判定処理により判定された各発話者が内野であるか外野であるかの情報が、ローカル側の遠隔コミュニケーション装置１へ送られる。 The sound data and person list generated by the output integration unit 14 are sent by the transmission unit 15 via the network 7 to the local-side remote communication device 1 (step S8). By sending the person list, the normalized coordinates of each speaker determined by the person position determination process and information on whether each speaker is in the infield or outfield determined by the infield/outfield determination process are sent to the local-side remote communication device 1.

　ローカル側の遠隔コミュニケーション装置１は、音データ及び人別リストの送信を受ける。音データ及び人別リストは、受信部２１を介して音声出力制御部２２へ送られる。音声出力制御部２２は、人別リストを用いて音データに対して複数対向スピーカーバランシング処理を実施して、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット４１１及び４１２を指定して再生させる（ステップＳ９）。詳細には、音声出力制御部２２は、背景音のラウドネスをスピーカーユニット４１１及び４１２の組みの数で除算した大きさにして全てのスピーカーユニット４１１及び４１２に再生させる。また、音声出力制御部２２は、内野発話音声の発話者の位置に最も近いスピーカーユニット４１１及び４１２に内野発話音声をそのまま再生させる。また、音声出力制御部２２は、スケーリング因子を用いた正規分布関数を使用して外野発話音声に処理を施し、かつ、使用するスピーカーユニット４１１及び４１２の範囲を限定してスピーカーユニット４１１及び４１２に外野発話音声を再生させる。 The local remote communication device 1 receives the sound data and the person list. The sound data and person list are sent to the audio output control unit 22 via the receiving unit 21. The audio output control unit 22 uses the person list to perform a multiple facing speaker balancing process on the sound data, and specifies the speaker units 411 and 412 to output the background sound, infield speech sound, and outfield speech sound for playback (step S9). In detail, the audio output control unit 22 adjusts the loudness of the background sound by dividing it by the number of pairs of speaker units 411 and 412, and plays the adjusted sound on all speaker units 411 and 412. The audio output control unit 22 also plays the infield speech sound directly on the speaker units 411 and 412 closest to the position of the speaker of the infield speech sound. In addition, the audio output control unit 22 processes the outfield speech using a normal distribution function with a scaling factor, and limits the range of speaker units 411 and 412 to be used, causing the speaker units 411 and 412 to reproduce the outfield speech.

　スピーカーアレイ４１及び４２は、音声出力制御部２２から指定されたスピーカーユニット４１１及び４１２を用いて、背景音、内野発話音声及び外野発話音声を出音する（ステップＳ１０）。 The speaker arrays 41 and 42 output background sounds, infield speech sounds, and outfield speech sounds using the speaker units 411 and 412 specified by the audio output control unit 22 (step S10).

　＜３．遠隔コミュニケーション処理＞
　次に、遠隔コミュニケーション処理の流れについて説明する。ここでは、送信側の遠隔コミュニケーション装置１による送信側処理と、受信側の遠隔コミュニケーション装置１による複数対向スピーカーバランシング処理に分けて説明する。 <3. Remote communication processing>
Next, the flow of the remote communication process will be described, which will be divided into a transmitting process by the transmitting remote communication device 1 and a multiple opposing speaker balancing process by the receiving remote communication device 1.

　＜３．１．送信側処理＞
　図９は、送信側処理のフローチャートである。図９を参照して、送信側処理の全体的な流れを説明する。 <3.1. Transmission side processing>
9 is a flowchart of the transmission side process. The overall flow of the transmission side process will be described with reference to FIG.

　人センシング部１１は、カメラ２により撮影されたリモート空間の映像を取得する。また、音声分離部１２は、マイク３により収音されたリモート空間の音声を取得する（ステップＳ１１）。 The human sensing unit 11 acquires video of the remote space captured by the camera 2. The audio separation unit 12 also acquires audio of the remote space picked up by the microphone 3 (step S11).

　人センシング部１１は、カメラ２により撮影された映像を用いて、各発話者の位置の正規化座標を判定し、かつ、人数分の要素を有し各発話者の正規化座標が登録された人別リストを作成する人判定処理を実行する（ステップＳ１２）。 The person sensing unit 11 performs a person determination process using the video captured by the camera 2 to determine the normalized coordinates of each speaker's position and to create a person list with elements for the number of people and in which the normalized coordinates of each speaker are registered (step S12).

　次に、人センシング部１１は、映像及び人別リストを用いて、各発話者が内野であるか外野であるかを判定する内野外野判定処理を実行する（ステップＳ１３）。 Next, the human sensing unit 11 performs an infield/outfield determination process using the video and the person list to determine whether each speaker is in the infield or outfield (step S13).

　音声分離部１２は、マイク３から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う（ステップＳ１４）。 The audio separation unit 12 performs audio separation processing to separate the speech and background sounds contained in the audio input from the microphone 3 (step S14).

　明瞭度調整部１３２は、音声分離部１２により抽出された背景音の入力を受ける。そして、明瞭度調整部１３２は、背景音に対してより聞こえ難くするデグレード処理を明瞭度調整処理として実施する（ステップＳ１５）。 The clarity adjustment unit 132 receives the background sound extracted by the audio separation unit 12. The clarity adjustment unit 132 then performs a degrading process to make the background sound more difficult to hear as a clarity adjustment process (step S15).

　個別音分離部１３１は、音声分離部１２により抽出された発話音声の入力を受ける。また、個別音分離部１３１は、人別リストを人センシング部１１から取得する。次に、個別音分離部１３１は、取得した人別リストに登録された要素数分のスレッドを生成する（ステップＳ１６）。個別音分離部１３１は、各スレッドに１から連番で人別リストに登録された要素数分のＩＤを割り当てる。ここでは、人別リストに登録された要素数がｐである場合で詰め移する。 The individual sound separation unit 131 receives input of speech extracted by the audio separation unit 12. The individual sound separation unit 131 also acquires a person-specific list from the person sensing unit 11. Next, the individual sound separation unit 131 generates threads equal to the number of elements registered in the acquired person-specific list (step S16). The individual sound separation unit 131 assigns IDs to each thread, sequentially numbered from 1, equal to the number of elements registered in the person-specific list. Here, the process is performed when the number of elements registered in the person-specific list is p.

　次に、個別音分離部１３１は、発話者毎の個別音に発話音声を分離する個別音分離処理を実行する（ステップＳ１７）。 Next, the individual sound separation unit 131 performs individual sound separation processing to separate the speech into individual sounds for each speaker (step S17).

　明瞭度調整部１３２は、個別音の入力を個別音分離部１３１から受ける。次に、明瞭度調整部１３２は、ｉを初期化して０に設定する（ステップＳ１８）。 The clarity adjustment unit 132 receives input of individual sounds from the individual sound separation unit 131. Next, the clarity adjustment unit 132 initializes i to 0 (step S18).

　次に、明瞭度調整部１３２は、ＩＤがｉであるスレッドの個別音について明瞭度調整処理を実行する（ステップＳ１９）。 Next, the clarity adjustment unit 132 performs clarity adjustment processing on the individual sounds in the thread whose ID is i (step S19).

　次に、明瞭度調整部１３２は、ｉがｐ未満か否かを判定する（ステップＳ２０）。ｉがｐ未満の場合（ステップＳ２０：肯定）、明瞭度調整部１３２は、ｉを１つインクリメントする（ステップＳ２１）。その後、明瞭度調整部１３２は、ステップＳ１９へ戻る。 Next, the clarity adjustment unit 132 determines whether i is less than p (step S20). If i is less than p (step S20: Yes), the clarity adjustment unit 132 increments i by 1 (step S21). Then, the clarity adjustment unit 132 returns to step S19.

　これに対して、ｉがｐ以上の場合（ステップＳ２０：否定）、明瞭度調整部１３２は、明瞭度調整処理を施した背景音及び個別音を出力統合部１４へ出力する。また、出力統合部１４は、映像の入力をカメラ２から受ける。そして、出力統合部１４は、背景音声及び個別音を統合して１つの音データを生成する出力統合処理を実行する（ステップＳ２２）。 On the other hand, if i is greater than or equal to p (step S20: No), the clarity adjustment unit 132 outputs the background sound and individual sounds that have been subjected to clarity adjustment processing to the output integrator 14. The output integrator 14 also receives video input from the camera 2. The output integrator 14 then executes output integration processing to integrate the background sound and individual sounds to generate a single piece of sound data (step S22).

　その後、送信部１５は、出力統合部１４により生成された音データ及びカメラ２により撮影された映像をネットワーク７を介してローカル側の遠隔コミュニケーション装置１へ送信する（ステップ２３）。 Then, the transmission unit 15 transmits the sound data generated by the output integration unit 14 and the video captured by the camera 2 to the local remote communication device 1 via the network 7 (step 23).

　＜３．１．１．人位置判定処理＞
　図１０は、人位置判定処理のフローチャートである。図１０に示した処理は、図９のステップＳ１２で実行される処理の一例にあたる。次に、図１０を参照して、人位置判定処理の流れを説明する。 <3.1.1. Person position determination process>
Fig. 10 is a flowchart of the person position determination process. The process shown in Fig. 10 is an example of the process executed in step S12 in Fig. 9. Next, the flow of the person position determination process will be described with reference to Fig. 10.

　人センシング部１１は、スクリーンに映される映像の横方向に正規化座標を設定する。次に、人センシング部１１は、映像内の発話者に対して骨格認識を行って各発話者の首の位置を特定し、首のスクリーン座標を取得する（ステップＳ１０１）。 The human sensing unit 11 sets normalized coordinates in the horizontal direction of the image projected on the screen. Next, the human sensing unit 11 performs skeletal recognition on the speakers in the image to identify the position of each speaker's neck and obtain the screen coordinates of the neck (step S101).

　次に、人センシング部１１は、スクリーン座標を正規化座標に変換する（ステップＳ１０２）。 Next, the human sensing unit 11 converts the screen coordinates into normalized coordinates (step S102).

　その後、人センシング部１１は、人別リストに各発話者のスクリーン座標及び正規化座標を登録する（ステップＳ１０３）。ここで、人センシング部１１は、スクリーン座標及び正規化座標それぞれの映像をディスプレイ６のスクリーンに映した場合の横方向の座標を格納すればよい。 Then, the human sensing unit 11 registers the screen coordinates and normalized coordinates of each speaker in the person list (step S103). Here, the human sensing unit 11 stores the horizontal coordinates when the screen coordinates and normalized coordinates of each image are projected onto the screen of the display 6.

　＜３．１．２．内野外野判定処理＞
　図１１は、内野外野判定処理のフローチャートである。図１１に示した処理は、図９のステップＳ１３で実行される処理の一例にあたる。次に、図１１を参照して、内野外野判定処理の流れを説明する。 <3.1.2. Infield/Outfield Determination Processing>
Fig. 11 is a flowchart of the infield/outfield determination process. The process shown in Fig. 11 is an example of the process executed in step S13 in Fig. 9. Next, the flow of the infield/outfield determination process will be described with reference to Fig. 11.

　人センシング部１１は、人別リストに内野フラグの項目を設けて、それらを初期化してリモート空間に存在する人数分の項目全てにＦａｌｓｅを設定する（ステップＳ１１１）。 The human sensing unit 11 creates an infield flag item in the person list, initializes it, and sets all items equal to the number of people present in the remote space to False (step S111).

　次に、人センシング部１１は、　ｉを初期化して０とする（ステップＳ１１２）。 Next, the human sensing unit 11 initializes i to 0 (step S112).

　次に、人センシング部１１は、映像の左側から先頭を０番目としてｉ番目の発話者の頭部方向の検出を実行する（ステップＳ１１３）。 Next, the human sensing unit 11 detects the head direction of the i-th speaker, counting the first speaker from the left side of the video as 0 (step S113).

　人センシング部１１は、ｉ番目の発話者の頭部方向が検出可能か否かを判定する（ステップＳ１１４）。頭部方向が検出できない場合（ステップＳ１１４：否定）、人センシング部１１は、ステップＳ１１８へ進む。 The human sensing unit 11 determines whether the head direction of the i-th speaker can be detected (step S114). If the head direction cannot be detected (step S114: No), the human sensing unit 11 proceeds to step S118.

　これに対して、頭部方向が検出可能な場合（ステップＳ１１４：肯定）、人センシング部１１は、頭部方向が内野範囲内か否かを判定する（ステップＳ１１５）。 On the other hand, if the head direction can be detected (step S114: Yes), the human sensing unit 11 determines whether the head direction is within the infield (step S115).

　頭部方向が内野範囲内の場合（ステップＳ１１５：肯定）、人センシング部１１は、人別リストのｉ番目の発話者に対応する要素の内野フラグにＴｒｕｅを格納する（ステップＳ１１６）。 If the head direction is within the infield range (step S115: Yes), the person sensing unit 11 stores True in the infield flag of the element corresponding to the i-th speaker in the person list (step S116).

　これに対して、頭部方向が内野範囲以外の場合（ステップＳ１１５：否定）、人センシング部１１は、人別リストのｉ番目の発話者に対応する要素の内野フラグにＦａｌｓｅを格納する（ステップＳ１１７）。 On the other hand, if the head direction is outside the infield range (step S115: No), the person sensing unit 11 stores False in the infield flag of the element corresponding to the i-th speaker in the person list (step S117).

　その後、人センシング部１１は、ｉがｐ未満か否かを判定する（ステップＳ１１８）。ｉがｐ未満の場合（ステップＳ１１８：肯定）、人センシング部１１は、ｉを１つインクリメントする（ステップＳ１１９）。その後、人センシング部１１は、ステップＳ１１３へ戻る。 Then, the human sensing unit 11 determines whether i is less than p (step S118). If i is less than p (step S118: Yes), the human sensing unit 11 increments i by 1 (step S119). Then, the human sensing unit 11 returns to step S113.

　これに対して、ｉがｐ以上の場合（ステップＳ１１８：否定）、人センシング部１１は、内野外野判定処理を終了する。 On the other hand, if i is greater than or equal to p (step S118: No), the human sensing unit 11 ends the infield/outfield determination process.

　＜３．１．３．個別音分離処理＞
　図１２は、個別音分離処理のフローチャートである。図１２に示した処理は、図９のステップＳ１７で実行される処理の一例にあたる。次に、図１２を参照して、個別音分離処理の流れを説明する。 <3.1.3. Individual Sound Separation Processing>
Fig. 12 is a flowchart of the individual sound separation process. The process shown in Fig. 12 is an example of the process executed in step S17 in Fig. 9. Next, the flow of the individual sound separation process will be described with reference to Fig. 12.

　個別音分離部１３１は、人別リストに登録された人数分のスレッドを作成する。次に、個別音分離部１３１は、人別リストに登録された正規化座標の値を、スレッド毎のスレッド固有の値として保持させる。そして、個別音分離部１３１は、ｉを初期化して０に設定する（ステップＳ１２１）。 The individual sound separation unit 131 creates threads for the number of people registered in the personal list. Next, the individual sound separation unit 131 stores the normalized coordinate values registered in the personal list as thread-specific values for each thread. Then, the individual sound separation unit 131 initializes i to 0 (step S121).

　次に、個別音分離部１３１は、映像の左側から先頭を０番目としてｉ番目の発話者のビーム方向を計算する（ステップＳ１２２）。 Next, the individual sound separation unit 131 calculates the beam direction of the i-th speaker, counting the first speaker from the left side of the video as 0 (step S122).

　次に、個別音分離部１３１は、ビーム方向に対するビームフォーミング処理でｉ番目の発話者の個別音を取得する（ステップＳ１２３）。 Next, the individual sound separation unit 131 acquires the individual sound of the i-th speaker by beamforming processing in the beam direction (step S123).

　その後、個別音分離部１３１は、ｉがｐ未満か否かを判定する（ステップＳ１２４）。ｉがｐ未満の場合（ステップＳ１２４：肯定）、個別音分離部１３１は、ｉを１つインクリメントする（ステップＳ１２５）。その後、個別音分離部１３１は、ステップＳ１２２へ戻る。 Then, the individual sound separation unit 131 determines whether i is less than p (step S124). If i is less than p (step S124: Yes), the individual sound separation unit 131 increments i by 1 (step S125). Then, the individual sound separation unit 131 returns to step S122.

　これに対して、ｉがｐ以上の場合（ステップＳ１２４：否定）、個別音分離部１３１は、個別音分離処理を終了する。 On the other hand, if i is greater than or equal to p (step S124: No), the individual sound separation unit 131 terminates the individual sound separation process.

　＜３．１．４．明瞭度調整処理＞
　図１３は、個別音に対する明瞭度調整処理のフローチャートである。図１３に示した処理は、図９のステップＳ１９で実行される処理の一例にあたる。次に、図１３を参照して、明瞭度調整処理の流れを説明する。 3.1.4. Clarity Adjustment Processing
Fig. 13 is a flowchart of the clarity adjustment process for individual sounds. The process shown in Fig. 13 is an example of the process executed in step S19 in Fig. 9. Next, the flow of the clarity adjustment process will be described with reference to Fig. 13.

　明瞭度調整部１３２は、ｉを初期化して０に設定する（ステップＳ１３１）。 The clarity adjustment unit 132 initializes i to 0 (step S131).

　次に、明瞭度調整部１３２は、映像の左側から先頭を０番目としてｉ番目の発話者の人別リストの内野フラグがＴｒｕｅか否かを判定する（ステップＳ１３２）。 Next, the clarity adjustment unit 132 determines whether the infield flag in the person list for the ith speaker, with the first speaker from the left side of the video being number 0, is True (step S132).

　内野フラグがＴｒｕｅの場合（ステップＳ１３２：肯定）、明瞭度調整部１３２は、ｉ番目の発話者の個別音にエンハンス処理を実施する（ステップＳ１３３）。 If the infield flag is True (step S132: Yes), the clarity adjustment unit 132 performs enhancement processing on the individual sounds of the i-th speaker (step S133).

　内野フラグがＦａｌｓｅの場合（ステップＳ１３２：否定）、明瞭度調整部１３２は、ｉ番目の発話者の個別音にデグレード処理を実施する（ステップＳ１３４）。 If the infield flag is False (step S132: No), the clarity adjustment unit 132 performs degrading processing on the individual sound of the i-th speaker (step S134).

　その後、明瞭度調整部１３２は、ｉがｐ未満か否かを判定する（ステップＳ１３５）。ｉがｐ未満の場合（ステップＳ１３５：肯定）、明瞭度調整部１３２は、ｉを１つインクリメントする（ステップＳ１３６）。その後、明瞭度調整部１３２は、ステップＳ１３２へ戻る。 Then, the clarity adjustment unit 132 determines whether i is less than p (step S135). If i is less than p (step S135: Yes), the clarity adjustment unit 132 increments i by 1 (step S136). Then, the clarity adjustment unit 132 returns to step S132.

　これに対して、ｉがｐ以上の場合（ステップＳ１３５：否定）、明瞭度調整部１３２は、個別音分離処理を終了する。 On the other hand, if i is greater than or equal to p (step S135: No), the clarity adjustment unit 132 terminates the individual sound separation process.

　＜３．２．複数対向スピーカーバランシング処理（受信側処理）＞
　図１４は、複数対向スピーカーバランシング処理のフローチャートである。図１４を参照して、複数対向スピーカーバランシング処理の全体的な流れを説明する。ここでは、音声出力制御部２２は、スピーカーユニット４１１及び４１２の組合せ毎のリストを使用して再生を制御する場合で説明する。 <3.2. Multiple opposing speaker balancing process (receiving side process)>
14 is a flowchart of the multiple opposed speaker balancing process. The overall flow of the multiple opposed speaker balancing process will be described with reference to FIG. 14. Here, the case where the audio output control unit 22 controls playback using a list for each combination of speaker units 411 and 412 will be described.

　音声出力制御部２２は、音データ及び人別リストの入力を受信部２１から受ける。音声出力制御部２２は、スピーカーユニット４１１同士の間の距離を算出する（ステップＳ３２１）。例えば、双方の端部のスピーカーユニット４１１間の距離をＷｓｐｋであり、スピーカーユニット４１１及び４１２のいずれもｕ個存在する場合、音声出力制御部２２は、Ｗｕ＝Ｗｓｐｋ／ｕとして、スピーカーユニット４１１同士の間のスピーカー間距離Ｗｕを算出できる。 The audio output control unit 22 receives input of the sound data and the person list from the receiving unit 21. The audio output control unit 22 calculates the distance between the speaker units 411 (step S321). For example, if the distance between the speaker units 411 at both ends is Wspk and there are u speaker units 411 and 412, the audio output control unit 22 can calculate the inter-speaker distance Wu between the speaker units 411 as Wu = Wspk/u.

　次に、音声出力制御部２２は、各スピーカーユニット４１１及び４１２の組の正規化座標を算出して、スピーカーユニット４１１及び４１２の組の数の要素を有するスピーカーユニットリストに、昇順に正規化座標を格納する（ステップＳ３２２）。 Next, the audio output control unit 22 calculates the normalized coordinates for each pair of speaker units 411 and 412, and stores the normalized coordinates in ascending order in a speaker unit list having the same number of elements as the number of pairs of speaker units 411 and 412 (step S322).

　次に、音声出力制御部２２は、スピーカーユニットリストの要素毎に出力音格納用の領域を設けて、それぞれの領域を背景音及び個別音のチャンネル数で初期化する（ステップＳ３３）。例えば、背景音が１チャンネルのデータであり、個別音がｐ個ある場合、出力音格納用の領域には、（１＋Ｐ）個のチャンネルの音を格納する（１＋Ｐ）個の領域が設定される。 Next, the audio output control unit 22 provides an area for storing output sounds for each element in the speaker unit list, and initializes each area with the number of channels for the background sounds and individual sounds (step S33). For example, if the background sound is one-channel data and there are p individual sounds, (1 + P) areas for storing sounds of (1 + P) channels are set in the area for storing output sounds.

　次に、音声出力制御部２２は、ｉを初期化して０に設定する（ステップＳ３４）。 Next, the audio output control unit 22 initializes i to 0 (step S34).

　次に、音声出力制御部２２は、音データにおいて順番に並べられた（１＋Ｐ）個のチャンネルの先頭を０番目としてｉ番目のチャンネルの音声が発話音声の個別音であるか否かを判定する（ステップＳ３５）。 Next, the audio output control unit 22 determines whether the audio of the i-th channel, with the first of the (1+P) channels arranged in order in the audio data being numbered 0, is an individual sound of the speech (step S35).

　ｉ番目のチャンネルの音声が背景音の場合（ステップＳ３５：否定）、音声出力制御部２２は、背景音調整を実施する（ステップＳ３６）。その後、音声出力制御部２２は、ステップＳ４０へ進む。 If the audio on the i-th channel is background sound (step S35: No), the audio output control unit 22 performs background sound adjustment (step S36). The audio output control unit 22 then proceeds to step S40.

　これに対して、ｉ番目のチャンネルの音声が発話音声の個別音である場合（ステップＳ３５：肯定）、音声出力制御部２２は、その個別音の発話者が内野であるか否かを人別リストを用いて判定する（ステップＳ３７）。 On the other hand, if the audio on the i-th channel is an individual sound of speech (step S35: Yes), the audio output control unit 22 determines whether the speaker of that individual sound is Uchino using the person list (step S37).

　発話者が外野の場合（ステップＳ３７：否定）、音声出力制御部２２は、外野発話音声調整を実施する（ステップＳ３８）。その後、音声出力制御部２２は、ステップＳ４０へ進む。 If the speaker is in the outfield (step S37: No), the audio output control unit 22 performs outfield speech sound adjustment (step S38). The audio output control unit 22 then proceeds to step S40.

　発話者が内野の場合（ステップＳ３７：肯定）、音声出力制御部２２は、内野発話音声調整を実施する（ステップＳ３９）。 If the speaker is an infield player (step S37: Yes), the audio output control unit 22 performs infield speech sound adjustment (step S39).

　その後、音声出力制御部２２は、スピーカーユニットリストの出力音格納用の領域にｉ番目のチャンネルの音声を格納する（ステップＳ４０）。 Then, the audio output control unit 22 stores the audio of the i-th channel in the area for storing output audio in the speaker unit list (step S40).

　次に、音声出力制御部２２は、ｉがｐ未満か否かを判定する（ステップＳ４１）。ｉがｐ未満の場合（ステップＳ４１：肯定）、音声出力制御部２２は、ｉを１つインクリメントする（ステップＳ４２）。その後、音声出力制御部２２は、ステップＳ３５へ戻る。 Next, the audio output control unit 22 determines whether i is less than p (step S41). If i is less than p (step S41: Yes), the audio output control unit 22 increments i by 1 (step S42). Then, the audio output control unit 22 returns to step S35.

　これに対して、ｉがｐ以上の場合（ステップＳ４１：否定）、音声出力制御部２２は、スピーカーユニットリストの出力音格納用の領域に格納された音声をミキシングする。そして、音声出力制御部２２は、ミキシングした音をスピーカーユニットリストに対応するスピーカーユニット４１１及び４１２から出音させる（ステップＳ４３）。 On the other hand, if i is greater than or equal to p (step S41: No), the audio output control unit 22 mixes the audio stored in the area for storing output audio in the speaker unit list. Then, the audio output control unit 22 outputs the mixed sound from the speaker units 411 and 412 corresponding to the speaker unit list (step S43).

　＜３．２．１．背景音調整＞
　図１５は、背景音調整の処理のフローチャートである。図１５に示した処理は、図１４のステップＳ３６で実行される処理の一例にあたる。次に、図１５を参照して、背景音調整の処理の流れを説明する。 <3.2.1. Background sound adjustment>
Fig. 15 is a flowchart of the background sound adjustment process. The process shown in Fig. 15 is an example of the process executed in step S36 in Fig. 14. Next, the flow of the background sound adjustment process will be described with reference to Fig. 15.

　音声出力制御部２２は、背景音の振幅を（１／ｕ）倍に調整する（ステップＳ２０１）。ここで、ｕは、スピーカーユニット４１１及び４１２の組の数である。 The audio output control unit 22 adjusts the amplitude of the background sound by a factor of (1/u) (step S201), where u is the number of pairs of speaker units 411 and 412.

　そして、音声出力制御部２２は、全てのスピーカーユニット４１１及び４１２の組のチャンネルに波形を代入する（ステップＳ２０２）。 Then, the audio output control unit 22 assigns waveforms to the channels of all pairs of speaker units 411 and 412 (step S202).

　＜３．２．２．内野発話音声調整＞
　図１６は、内野発話音声調整の処理のフローチャートである。図１６に示した処理は、図１４のステップＳ３９で実行される処理の一例にあたる。次に、図１６を参照して、内野発話音声調整の処理の流れを説明する。 <3.2.2. Infield speech sound adjustment>
Fig. 16 is a flowchart of the infield speech sound adjustment process. The process shown in Fig. 16 is an example of the process executed in step S39 in Fig. 14. Next, the flow of the infield speech sound adjustment process will be described with reference to Fig. 16.

　音声出力制御部２２は、定位中心スピーカーユニット導出を実行して、発話者に最も近いスピーカーユニット４１１及び４１２の組を特定する（ステップＳ２１１）。この定位中心スピーカーユニット導出については、後で詳細に説明する。 The audio output control unit 22 performs localization center speaker unit derivation to identify the pair of speaker units 411 and 412 that are closest to the speaker (step S211). This localization center speaker unit derivation will be explained in detail later.

　次に、音声出力制御部２２は、スピーカーユニット４１１及び４１２の組のチャンネルのうち発話者の正規化座標と最も近くにある組のチャンネルに波形を代入する（ステップＳ２１２）。 Next, the audio output control unit 22 assigns the waveform to the pair of channels of speaker units 411 and 412 that is closest to the normalized coordinates of the speaker (step S212).

　＜３．２．３．外野発話音声調整＞
　図１７は、外野発話音声調整の処理のフローチャートである。図１７に示した処理は、図１４のステップＳ３８で実行される処理の一例にあたる。次に、図１７を参照して、外野発話音声調整の処理の流れを説明する。 3.2.3. Outfield speech adjustment
Fig. 17 is a flowchart of the outfield speech sound adjustment process. The process shown in Fig. 17 is an example of the process executed in step S38 of Fig. 14. Next, the flow of the outfield speech sound adjustment process will be described with reference to Fig. 17.

　音声出力制御部２２は、定位中心スピーカーユニット導出を実行して、発話者に最も近いスピーカーユニット４１１及び４１２の組を特定する（ステップＳ２２１）。この定位中心スピーカーユニット導出は内野発話音声調整における処理と同じ処理であり、後で詳細に説明する。ここでは、スピーカーユニット４１１及び４１２の組みの左から順に先頭を０として連番の番号でそれぞれの位置を表した場合の、発話者に最も近い組の位置を「最近接スピーカーユニット位置」とよぶ。ここでは、スピーカーユニット４１１及び４１２の組の個数がｕ個であり、最近接スピーカーユニット位置は、０～ｕのいずれかである。 The audio output control unit 22 performs localization center speaker unit derivation to identify the pair of speaker units 411 and 412 closest to the speaker (step S221). This localization center speaker unit derivation is the same process as the process used in infield speech sound adjustment, and will be described in detail later. Here, when the positions of the pair of speaker units 411 and 412 are represented by consecutive numbers starting from the left with 0 as the first, the position of the pair closest to the speaker is called the "closest speaker unit position." Here, the number of pairs of speaker units 411 and 412 is u, and the closest speaker unit position is any one of 0 to u.

　次に、音声出力制御部２２は、ｋを初期化して０に設定する（ステップＳ２２２）。ｋは、再生範囲を決定する処理の繰り返しを制御するためのパラメータである。 Next, the audio output control unit 22 initializes k to 0 (step S222). k is a parameter used to control the repetition of the process of determining the playback range.

　次に、音声出力制御部２２は、ｋ＝０か否かを判定する（ステップＳ２２３）。 Next, the audio output control unit 22 determines whether k = 0 (step S223).

　ｋ＝０の場合（ステップＳ２２３：肯定）、音声出力制御部２２は、ｘ_ｓｐｋを最近接スピーカーユニット位置にｋを加算した位置とする（ステップＳ２２４）。その後、音声出力制御部２２は、ステップＳ２２９へ進む。 If k=0 (step S223: Yes), the audio output control unit 22 sets x _spk to the position obtained by adding k to the position of the closest speaker unit (step S224). After that, the audio output control unit 22 proceeds to step S229.

　ｋが０以外の場合（ステップＳ２２３：否定）、音声出力制御部２２は、最近接スピーカーユニット位置にｋを加算した値がｕ以下かを判定する（ステップＳ２２５）。すなわち、音声出力制御部２２は、最近接スピーカーユニット位置にｋを加算した位置がスピーカーアレイ４１及び４２の右端からはみ出さないか否かを判定する。 If k is other than 0 (step S223: No), the audio output control unit 22 determines whether the value obtained by adding k to the nearest speaker unit position is less than or equal to u (step S225). In other words, the audio output control unit 22 determines whether the position obtained by adding k to the nearest speaker unit position does not extend beyond the right end of the speaker arrays 41 and 42.

　最近接スピーカーユニット位置にｋを加算した値がｕより大きい場合（ステップＳ２２５：否定）、音声出力制御部２２は、ステップＳ２２７へ進む。これに対して、最近接スピーカーユニット位置にｋを加算した値がｕ以下の場合（ステップＳ２２５：肯定）、音声出力制御部２２は、一方のｘ_ｓｐｋを最近接スピーカーユニットにｋを加算した位置とする（ステップＳ２２６）。その後、音声出力制御部２２は、ステップＳ２２７へ進む。 If the value obtained by adding k to the position of the nearest speaker unit is greater than u (step S225: No), the audio output control unit 22 proceeds to step S227. On the other hand, if the value obtained by adding k to the position of the nearest speaker unit is equal to or less than u (step S225: Yes), the audio output control unit 22 sets one x _spk to the position obtained by adding k to the nearest speaker unit (step S226). Thereafter, the audio output control unit 22 proceeds to step S227.

　次に、音声出力制御部２２は、最近接スピーカーユニット位置からｋを減算した値が０以上かを判定する（ステップＳ２２７）。すなわち、音声出力制御部２２は、最近接スピーカーユニット位置からｋを減算した位置がスピーカーアレイ４１及び４２の左端からはみ出さないか否かを判定する。 Next, the audio output control unit 22 determines whether the value obtained by subtracting k from the nearest speaker unit position is 0 or greater (step S227). In other words, the audio output control unit 22 determines whether the position obtained by subtracting k from the nearest speaker unit position does not extend beyond the left end of the speaker arrays 41 and 42.

　最近接スピーカーユニット位置からｋを減算した値が０未満の場合（ステップＳ２２７：否定）、音声出力制御部２２は、ステップＳ２２９へ進む。これに対して、最近接スピーカーユニット位置からｋを減算した値が０以上の場合（ステップＳ２２７：肯定）、音声出力制御部２２は、他方のｘ_ｓｐｋを最近接スピーカーユニットからｋを減算した位置とする（ステップＳ２２８）。その後、音声出力制御部２２は、ステップＳ２２９へ進む。 If the value obtained by subtracting k from the nearest speaker unit position is less than 0 (step S227: No), the audio output control unit 22 proceeds to step S229. On the other hand, if the value obtained by subtracting k from the nearest speaker unit position is 0 or greater (step S227: Yes), the audio output control unit 22 sets the other x _spk to the position obtained by subtracting k from the nearest speaker unit (step S228). Thereafter, the audio output control unit 22 proceeds to step S229.

　その後、音声出力制御部２２は、ｋの位置の音声の振幅を｛ｇ（ｘ_ｓｐｋ）｝^１．６６倍して調整を実施する（ステップＳ２２９）。ここで、一方のｘ_ｓｐｋ及び他方のｘ_ｓｐｋの双方が存在する場合は、音声出力制御部２２は、両方のｋの位置についての音声の振幅を調整する。 Thereafter, the audio output control unit 22 adjusts the amplitude of the audio at position k by multiplying it by {g(x _spk )} ^1.66 (step S229). Here, if both one x _spk and the other x _spk exist, the audio output control unit 22 adjusts the amplitude of the audio at both positions k.

　その後、音声出力制御部２２は、ｋが予め決められたレンジ幅を未満か否かを判定する（ステップＳ２３０）。 Then, the audio output control unit 22 determines whether k is less than a predetermined range width (step S230).

　ｋがレンジ幅以下の場合（ステップＳ２３０：肯定）、音声出力制御部２２は、ｋを１つインクリメントする（ステップＳ２３１）。その後、音声出力制御部２２は、ステップＳ２３３へ戻る。 If k is equal to or less than the range width (step S230: Yes), the audio output control unit 22 increments k by 1 (step S231). Then, the audio output control unit 22 returns to step S233.

　これに対して、ｋがレンジ幅より大きい場合（ステップＳ２３０：否定）、音声出力制御部２２は、ｕ個のチャンネルのうち、最近接スピーカーユニット位置を中心に±レンジ幅のチャンネルに波形を代入する（ステップＳ２３２）。 On the other hand, if k is greater than the range width (step S230: No), the audio output control unit 22 assigns the waveform to one of the u channels with a ±range width centered on the position of the closest speaker unit (step S232).

　＜３．２．４．定位中心スピーカーユニット導出＞
　図１８は、定位中心スピーカーユニット導出の処理のフローチャートである。図１８に示した処理は、図１６のステップＳ２１１及び図１７のステップＳ２２１で実行される処理の一例にあたる。次に、図１８を参照して、定位中心スピーカーユニット導出の処理の流れを説明する。 3.2.4. Derivation of the center speaker unit position
Fig. 18 is a flowchart of the process of deriving the localization center speaker unit. The process shown in Fig. 18 corresponds to an example of the process executed in step S211 of Fig. 16 and step S221 of Fig. 17. Next, the flow of the process of deriving the localization center speaker unit will be described with reference to Fig. 18.

　音声出力制御部２２は、スピーカーユニット４１１までの最短距離の初期値をスピーカーユニット４１１の間の距離と設定する（ステップＳ２４１）。 The audio output control unit 22 sets the initial value of the shortest distance to the speaker unit 411 as the distance between the speaker units 411 (step S241).

　次に、音声出力制御部２２は、ｊを初期化して０に設定する（ステップＳ２４２）。ここで、ｊは、スピーカーユニット４１１及び４１２の組み毎の発話者までの距離を最近接スピーカーユニット距離とするかの判定する処理の繰り返しを制御するためのパラメータである。 Next, the audio output control unit 22 initializes j to 0 (step S242). Here, j is a parameter used to control the repetition of the process of determining whether the distance to the speaker for each pair of speaker units 411 and 412 is the closest speaker unit distance.

　次に、音声出力制御部２２は、発話者の位置からスピーカーユニット４１１の組に左から先頭を０番として連番を振った場合のｊ番目のスピーカーユニット４１１の位置を減算する（ステップＳ２４３）。 Next, the audio output control unit 22 subtracts the position of the jth speaker unit 411 when the speaker unit 411 pairs are numbered consecutively from the left, with the first being number 0, from the speaker's position (step S243).

　次に、音声出力制御部２２は、減算結果が最短距離未満か否かを判定する（ステップＳ２４４）。減算結果が最短距離以上の場合（ステップＳ２４４：否定）、音声出力制御部２２は、ステップＳ２４６へ進む。 Next, the audio output control unit 22 determines whether the subtraction result is less than the shortest distance (step S244). If the subtraction result is equal to or greater than the shortest distance (step S244: No), the audio output control unit 22 proceeds to step S246.

　減算結果が最短距離未満の場合（ステップＳ２４４：肯定）、音声出力制御部２２は、最短距離を減算結果として更新する。さらに、音声出力制御部２２は、ｊを最近接スピーカーユニット位置とする（ステップＳ２４５）。 If the subtraction result is less than the shortest distance (step S244: Yes), the audio output control unit 22 updates the shortest distance as the subtraction result. Furthermore, the audio output control unit 22 sets j to the nearest speaker unit position (step S245).

　次に、音声出力制御部２２は、ｊがスピーカーユニット４１１の個数であるｕ未満か否かを判定する（ステップＳ２４６）。 Next, the audio output control unit 22 determines whether j is less than u, which is the number of speaker units 411 (step S246).

　ｊがｕ未満の場合（ステップＳ２４６：肯定）、音声出力制御部２２は、ｊを１つインクリメントする（ステップＳ２４７）。その後、音声出力制御部２２は、ステップＳ２４３へ戻る。 If j is less than u (step S246: Yes), the audio output control unit 22 increments j by 1 (step S247). Then, the audio output control unit 22 returns to step S243.

　これに対して、ｊがｕ以上の場合（ステップＳ２４６：否定）、音声出力制御部２２は、定位中心スピーカーユニット導出を終了する。 On the other hand, if j is greater than or equal to u (step S246: No), the audio output control unit 22 terminates the derivation of the localization center speaker unit.

　＜４．効果＞
　以上に説明したように、本実施の形態に係る遠隔コミュニケーション装置１は、映像から発話者の位置を特定し、特定した各発話者の頭部方向を推定して各発話者が内野であるか外野であるかを判定する。また、遠隔コミュニケーション装置１は、収集した音声から、背景音を分離し、さらに特定した位置を基に発話者の個別音を取得する。そして、遠隔コミュニケーション装置１は、背景音及び外野発話音声を小さくし、かつ、一定の範囲のスピーカーユニット４１１及び４１２の組みに出音させる。また、遠隔コミュニケーション装置１は、内野発話音声を発話者の位置に近いスピーカーユニット４１１及び４１２から出音させる。 <4. Effects>
As described above, the remote communication device 1 according to this embodiment identifies the positions of speakers from the video, estimates the head direction of each identified speaker, and determines whether each speaker is in the infield or outfield. The remote communication device 1 also separates background sound from the collected audio and acquires the individual sounds of the speakers based on the identified positions. The remote communication device 1 then reduces the volume of the background sound and outfield speech sounds and outputs them from a pair of speaker units 411 and 412 within a certain range. The remote communication device 1 also outputs the infield speech sounds from the speaker units 411 and 412 closest to the speaker's position.

　このように、発話音声を内野発話音声と外野発話音声に分類することで、内野発話音声と外野発話音声とに対してそれぞれ別の音響処理ができる。また、背景音や外野発話は、音の定位感をぼかしてより聞き取りづらくすることで、内野発話が相対的に聞き取りやすくなる。これらにより、存在感や空気感は残しつつ、対話相手の声が聞こえ易くなり、会話に集中できる。また、頭部方向推定による会話参加の判定により、コミュニケーション参加話者の自動判定ができる。また、背景音、内野発話音声及び外野発話音声毎の処理を行うことで、参加度に基づく信号処理が可能となり、さらに、背景音、内野発話音声及び外野発話音声に合わせた多チャンネルスピーカーの制御を行うことができる。したがって、対話の円滑化を促進することができる。 In this way, by classifying speech sounds into infield speech sounds and outfield speech sounds, different acoustic processing can be performed for each. Furthermore, by blurring the sense of sound localization for background sounds and outfield speech sounds, making them harder to hear, the infield speech sounds become relatively easier to hear. This makes it easier to hear the voice of the person you're talking to while still retaining a sense of presence and atmosphere, allowing you to concentrate on the conversation. Furthermore, determining conversation participation through head direction estimation enables automatic determination of speakers participating in communication. Furthermore, by processing background sounds, infield speech sounds, and outfield speech sounds separately, signal processing based on participation level becomes possible, and multi-channel speaker control can be performed in accordance with background sounds, infield speech sounds, and outfield speech sounds. Therefore, smoother dialogue can be promoted.

　＜５．第２の実施の形態に係る遠隔コミュニケーション装置＞
　次に、第２の実施の形態に係る遠隔コミュニケーション装置１について説明する。第２の実施の形態に係る遠隔コミュニケーション装置１は、背景音のうち突発的な背景音の定位感をぼかさずに出音する。 5. Remote communication device according to the second embodiment
Next, a remote communication device 1 according to a second embodiment will be described. The remote communication device 1 according to the second embodiment outputs background sounds without blurring the sense of localization of sudden background sounds.

　図１９は、第２の実施の形態に係る遠隔コミュニケーション装置のブロック図である。なお、図１９では図１と同一の各部には同一符号を付し、以下では、図１と異なる部分を中心に説明して、図１と同一の各部については説明を省略する場合がある。 Figure 19 is a block diagram of a remote communication device according to the second embodiment. Note that in Figure 19, the same components as in Figure 1 are designated by the same reference numerals, and the following explanation will focus on the components that differ from Figure 1, and may omit explanations of the components that are the same as in Figure 1.

　＜５．１．音響信号処理部＞
　本実施の形態に係る音響信号処理部１３は、個別音分離部１３１及び明瞭度調整部１３２に加えて、背景音分離部１３３を有する。 <5.1. Acoustic signal processing section>
The sound signal processing unit 13 according to this embodiment includes a background sound separation unit 133 in addition to an individual sound separation unit 131 and a clarity adjustment unit 132 .

　＜５．１．１．背景音分離部＞
　背景音分離部１３３は、音声分離部１２により分離された背景音の入力を受ける。そして、背景音分離部１３３は、背景音のピークを検出するピーク検出処理を実行する。次に、背景音分離部１３３は、突発音が発生した方向を推定する突発音のＤｏＡ（Direction　Of　Arrival）推定を実施して背景音の中から突発音を抽出する突発音抽出処理を実行する。ここで、背景音分離部１３３は、音声を可視化する技術やマイクアレイ３０から得られる音声信号の相互相関関数を用いる方法等を利用して突発音のＤｏＡ推定を実行する。 <5.1.1. Background sound separation section>
The background sound separation unit 133 receives input of the background sound separated by the audio separation unit 12. The background sound separation unit 133 then performs peak detection processing to detect peaks of the background sound. Next, the background sound separation unit 133 performs DoA (Direction Of Arrival) estimation of the sudden sound to estimate the direction from which the sudden sound occurred, and performs sudden sound extraction processing to extract the sudden sound from the background sound. Here, the background sound separation unit 133 performs DoA estimation of the sudden sound by using, for example, technology for visualizing audio or a method using a cross-correlation function of audio signals obtained from the microphone array 30.

　そして、背景音分離部１３３は、推定した突発音の位置の情報を出力統合部１４へ出力する。また、背景音分離部１３３は、抽出した突発音を出力統合部１４へ出力する。 The background sound separation unit 133 then outputs information about the estimated position of the sudden sound to the output integration unit 14. The background sound separation unit 133 also outputs the extracted sudden sound to the output integration unit 14.

　また、背景音分離部１３３は、背景音のうち突発音以外の音声を定常音として抽出する背景音抽出処理を実行する。そして、背景音分離部１３３は、抽出した定常音を明瞭度調整部１３２へ出力する。このように、背景音分離部１３３は、背景音を突発音と定常音とに分離する。 The background sound separation unit 133 also performs background sound extraction processing to extract sounds other than sudden sounds from the background sound as steady sounds. The background sound separation unit 133 then outputs the extracted steady sounds to the clarity adjustment unit 132. In this way, the background sound separation unit 133 separates the background sound into sudden sounds and steady sounds.

　＜５．１．２．明瞭度調整部＞
　明瞭度調整部１３２は、背景音のうちの定常音の入力を背景音分離部１３３から受ける。そして、明瞭度調整部１３２は、背景音のうちの定常音に対してより聞こえ難くするデグレード処理を実施する。その後、明瞭度調整部１３２は、明瞭度調整を施した背景音のうちの定常音を出力統合部１４へ出力する。 <5.1.2. Clarity adjustment section>
The clarity adjustment unit 132 receives an input of a stationary sound from the background sound separation unit 133. The clarity adjustment unit 132 then performs a degradation process on the stationary sound from the background sound to make it harder to hear. The clarity adjustment unit 132 then outputs the stationary sound from the background sound that has been subjected to clarity adjustment to the output integration unit 14.

　＜５．２．出力統合部＞
　出力統合部１４は、背景音のうちの突発音及び突発音の位置の情報の入力を背景音分離部１３３から受ける。また、出力統合部１４は、背景音のうちの定常音の入力を明瞭度調整部１３２から受ける。この場合、出力統合部１４は、リモート空間に存在する人数分のチャンネル数の個別音に加えて、背景音のうちの突発音と定常音とをそれぞれ１チャンネルのデータとして取得する。すなわち、リモート空間に存在する人数をｐとした場合、出力統合部１４は、（２＋ｐ）チャンネルのデータを取得する。 <5.2. Output Integration Unit>
The output integrating unit 14 receives input of information about sudden sounds and the positions of the sudden sounds in the background sound from the background sound separation unit 133. The output integrating unit 14 also receives input of steady sounds in the background sound from the clarity adjustment unit 132. In this case, the output integrating unit 14 acquires individual sounds of the same number of channels as the number of people present in the remote space, as well as one channel of data for each of the sudden sounds and steady sounds in the background sound. In other words, if the number of people present in the remote space is p, the output integrating unit 14 acquires (2+p) channels of data.

　そして、出力統合部１４は、各個別音、背景音のうちの突発音及び背景音のうちの定常音を統合して１つの音声データを生成する。そして、出力統合部１４は、各個別音に対応する人別リストにおける発話者の情報を付加する。また、出力統合部１４は、突発音と突発音の位置情報とを対応付ける。そして、出力統合部１４は、音データ、人別リスト及び突発音の位置情報を送信部１５へ出力してローカル側の遠隔コミュニケーション装置１へ送信させる。 The output integration unit 14 then integrates each individual sound, the sudden sound from the background sound, and the steady sound from the background sound to generate one piece of audio data. The output integration unit 14 then adds speaker information in the person list corresponding to each individual sound. The output integration unit 14 also associates the sudden sound with positional information of the sudden sound. The output integration unit 14 then outputs the sound data, the person list, and the positional information of the sudden sound to the transmission unit 15, causing it to be transmitted to the local-side remote communication device 1.

　＜５．３．音声出力制御部＞
　ローカル側の遠隔コミュニケーション装置１における音声出力制御部２２は、各個別音、背景音のうちの突発音及び背景音のうちの定常音を含む音声データの入力を受ける。また、音声出力制御部２２は、人別リスト及び突発音の位置情報の入力を受ける。 <5.3. Audio output control unit>
The audio output control unit 22 in the local remote communication device 1 receives input of audio data including each individual sound, a sudden sound from the background sound, and a steady sound from the background sound. The audio output control unit 22 also receives input of the person list and position information of the sudden sound.

　音声出力制御部２２は、音データから背景音のうちの突発音及び背景音のうちの定常音をそれぞれ取得する。次に、音声出力制御部２２は、背景音のうちの定常音については、振幅を（１／ｕ）^１．６６倍してラウドネスが１／ｕになるように処理する。そして、音声出力制御部２２は、スピーカーアレイ４１の全てのスピーカーユニット４１１及びスピーカーアレイ４２の全てのスピーカーユニット４１２からラウドネスを低減した定常音を出音させる。 The audio output control unit 22 acquires a sudden sound from the background sound and a steady sound from the background sound from the sound data. Next, the audio output control unit 22 processes the steady sound from the background sound by multiplying the amplitude by (1/u) ^1.66 so that the loudness becomes 1/u. Then, the audio output control unit 22 causes all speaker units 411 of the speaker array 41 and all speaker units 412 of the speaker array 42 to output the steady sounds with reduced loudness.

　これに対して、背景音のうちの突発音については、音声出力制御部２２は、突発音の位置に最も近いスピーカーユニット４１１及び４１２を特定する。そして、音声出力制御部２２は、特定したスピーカーユニット４１１及び４１２に背景音のうちの突発音をそのまま再生させる。 In contrast, for a sudden sound from the background sound, the audio output control unit 22 identifies the speaker units 411 and 412 that are closest to the position of the sudden sound. Then, the audio output control unit 22 causes the identified speaker units 411 and 412 to reproduce the sudden sound from the background sound as is.

　ここで、本実施例では、音声出力制御部２２は、第１の実施の形態における内野発話音声と同じ処理を突発音に施して出音させたが、この他にも、第１の実施の形態における外野発話音声と同じ処理を突発音に施して出音させてもよい。 In this example, the audio output control unit 22 applies the same processing to the sudden sound as to the infield speech sound in the first embodiment, and outputs the sound; however, it may also apply the same processing to the sudden sound as to the outfield speech sound in the first embodiment, and output the sound.

　＜６．第２の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ＞
　図２０は、第２の実施の形態に係る自分側の遠隔コミュニケーション装置間のデータの流れの概要を示す図である。ここでも、リモート側の遠隔コミュニケーション装置１の送信ユニット１０からローカル側の遠隔コミュニケーション装置１の受信ユニット２０へとデータが送信される場合で説明する。ここでは、音声の送信について説明する。 6. Data flow between remote communication devices according to the second embodiment
20 is a diagram showing an outline of the data flow between the local remote communication devices according to the second embodiment. Here, too, the case where data is transmitted from the transmitting unit 10 of the remote remote communication device 1 to the receiving unit 20 of the local remote communication device 1 will be described. Here, the transmission of audio will be described.

　マイク３は、リモート空間で収音した音声をリモート側の遠隔コミュニケーション装置１へ入力する（ステップＳ３０１）。ここでは、例えば、マイクアレイ３０のデータとしてｓチャンネルのデータが存在する。また、カメラ２は、リモート空間の撮影を行って生成した映像をリモート側の遠隔コミュニケーション装置１へ入力する（ステップＳ３０２）。 The microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S301). Here, for example, data from the microphone array 30 is s channel data. The camera 2 also inputs the video generated by capturing the remote space to the remote communication device 1 (step S302).

　カメラ２から入力された映像は、人センシング部１１に送られる。人センシング部１１は、カメラ２により撮影された映像を用いて、人位置判定処理（ステップＳ３３１）及び内野外野判定処理（ステップＳ３３２）を含むセンシング処理を実行する（ステップＳ３０３）。 The video input from camera 2 is sent to the human sensing unit 11. The human sensing unit 11 uses the video captured by camera 2 to perform sensing processing (step S303), including human position determination processing (step S331) and infield/outfield determination processing (step S332).

　マイク３から入力された音声は、音声分離部１２へ送られる。音声分離部１２は、マイク３から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う（ステップＳ３０４）。 The audio input from microphone 3 is sent to audio separation unit 12. Audio separation unit 12 performs audio separation processing to separate the speech audio and background sounds contained in the audio input from microphone 3 (step S304).

　音声分離部１２により抽出された背景音は、音響信号処理部１３の背景音分離部１３３へ送られる。背景音分離部１３３は、背景音のピークを検出するピーク検出処理を実行する（ステップＳ３０５）。 The background sound extracted by the audio separation unit 12 is sent to the background sound separation unit 133 of the audio signal processing unit 13. The background sound separation unit 133 executes peak detection processing to detect peaks in the background sound (step S305).

　次に、背景音分離部１３３は、突発音のＤｏＡ推定を実施して背景音の中から突発音を抽出する突発音抽出処理を実行する（ステップＳ３０６）。背景音のうちの突発音及び突発音の位置情報は、出力統合部１４へ送られる。 Next, the background sound separation unit 133 executes a sudden sound extraction process to estimate the DoA of the sudden sound and extract the sudden sound from the background sound (step S306). The sudden sound in the background sound and its position information are sent to the output integration unit 14.

　また、背景音分離部１３３は、背景音の中から定常音を抽出する定常音抽出処理を実行する（ステップＳ３０７）。背景音のうちの定常音は、明瞭度調整部１３２へ送られる。 The background sound separation unit 133 also executes a steady sound extraction process to extract steady sounds from the background sounds (step S307). The steady sounds from the background sounds are sent to the clarity adjustment unit 132.

　明瞭度調整部１３２は、背景音に対してより聞こえ難くするデグレード処理を明瞭度調整処理として実施する（ステップＳ３０８）。 The clarity adjustment unit 132 performs a degrading process to make the background sound less audible as clarity adjustment processing (step S308).

　また、音声分離部１２により抽出されたｓチャンネルのデータである個別音及び人別リストが、音響信号処理部１３へ送られる。音響信号処理部１３は、個別音分離部１３１による個別音分離処理（ステップＳ３９１）及び明瞭度調整部１３２による明瞭度調整処理（ステップＳ３９２）を含む発話音声信号処理を実施する（ステップＳ３０９）。 Furthermore, the individual sounds and person list, which are s-channel data extracted by the audio separation unit 12, are sent to the audio signal processing unit 13. The audio signal processing unit 13 performs speech signal processing (step S309), including individual sound separation processing by the individual sound separation unit 131 (step S391) and clarity adjustment processing by the clarity adjustment unit 132 (step S392).

　１チャンネルのデータである定常音、１チャンネルのデータである突発音及び突発音の位置情報は、出力統合部１４へ入力される。また、発話音声信号処理が施されたｐチャンネルのデータである個別音も、出力統合部１４へ入力される。これにより、出力統合部１４は、（２＋Ｐ）チャンネルのデータを取得する。そして、出力統合部１４は、すべてのチャンネルの音を統合して１つの音データを生成する出力統合処理を実行する（ステップＳ３１０）。 The steady sound, which is one channel of data, the sudden sound, and the position information of the sudden sound, which is one channel of data, are input to the output integration unit 14. In addition, the individual sounds, which are p channel of data that have been subjected to speech signal processing, are also input to the output integration unit 14. As a result, the output integration unit 14 obtains (2+P) channel data. The output integration unit 14 then executes output integration processing, which integrates the sounds of all channels to generate a single sound data (step S310).

　出力統合部１４で生成された音データ、人別リスト及び突発音の位置情報は、送信部１５によりネットワーク７を介してローカル側の遠隔コミュニケーション装置１へ送られる（ステップＳ３１１）。 The sound data, person list, and sudden sound location information generated by the output integration unit 14 are sent by the transmission unit 15 to the local remote communication device 1 via the network 7 (step S311).

　ローカル側の遠隔コミュニケーション装置１は、音データ、人別リスト及び突発音の位置情報の送信を受ける。音データ、人別リスト及び突発音の位置情報は、受信部２１を介して音声出力制御部２２へ送られる。音声出力制御部２２は、人別リスト及び突発音の位置情報を用いて音データに対して複数対向スピーカーバランシング処理を実施する。そして、音声出力制御部２２は、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット４１１及び４１２を指定して再生させる（ステップＳ３１２）。この際、音声出力制御部２２は、背景音のうちの定常音を、ラウドネスをスピーカーユニット４１１及び４１２の組みの数で除算した大きさにして全てのスピーカーユニット４１１及び４１２に再生させる。また、音声出力制御部２２は、突発音の位置に最も近いスピーカーユニット４１１及び４１２に定常音のうちの突発音をそのまま再生させる。 The local remote communication device 1 receives the sound data, the person list, and the positional information of the sudden sound. The sound data, the person list, and the positional information of the sudden sound are sent to the audio output control unit 22 via the receiving unit 21. The audio output control unit 22 performs a multiple facing speaker balancing process on the sound data using the person list and the positional information of the sudden sound. The audio output control unit 22 then specifies the speaker units 411 and 412 to output for each of the background sound, infield speech sound, and outfield speech sound, and plays them (step S312). At this time, the audio output control unit 22 plays the steady sound of the background sound in all speaker units 411 and 412, adjusting the loudness by dividing it by the number of pairs of speaker units 411 and 412. The audio output control unit 22 also plays the sudden sound of the steady sound as is in the speaker units 411 and 412 closest to the position of the sudden sound.

　スピーカーアレイ４１及び４２は、音声出力制御部２２から指定されたスピーカーユニット４１１及び４１２を用いて、背景音のうちの突発音、背景音のうちの定常音、内野発話音声及び外野発話音声を出音する（ステップＳ３１３）。 The speaker arrays 41 and 42 use the speaker units 411 and 412 specified by the audio output control unit 22 to output sudden sounds from the background sound, steady sounds from the background sound, infield speech sounds, and outfield speech sounds (step S313).

　＜７．効果＞
　以上に説明したように、本実施の形態に係る遠隔コミュニケーション装置１は、背景音を突発音と定常音の２種類に分けて、背景音は位置情報を伝送する。そして、遠隔コミュニケーション装置１は、背景音のうちの突発音の定位感はぼかさずに出音させ、背景音のうちの定常音について定位感をぼかして出音させる。 <7. Effects>
As described above, the remote communication device 1 according to the present embodiment divides background sounds into two types, sudden sounds and steady sounds, and transmits position information for the background sounds. The remote communication device 1 outputs the sudden sounds among the background sounds without blurring their localization, and outputs the steady sounds among the background sounds with blurring their localization.

　これにより、背景音は発生した位置付近から定常音よりも大きな音で再生することができ、突発的な背景音について定位感が無くなり違和感が発生することを抑制できる。対話を自然な状態に近づけることができ、対話の円滑化を促進することができる。 This allows background sounds to be played louder than steady sounds from near the location where they were generated, preventing sudden background sounds from losing their sense of position and causing discomfort. This makes conversations more natural and promotes smoother conversations.

　＜８．第３の実施の形態に係る遠隔コミュニケーション装置＞
　次に、第３の実施の形態に係る遠隔コミュニケーション装置１について説明する。第３の実施の形態に係る遠隔コミュニケーション装置１は、マイク３により収音された発話音声を機械的な音声に変更して聞き取り易くする。 8. Remote communication device according to the third embodiment
Next, a remote communication device 1 according to a third embodiment will be described. The remote communication device 1 according to the third embodiment converts the speech voice picked up by the microphone 3 into a mechanical voice to make it easier to hear.

　図２１は、第３の実施の形態に係る遠隔コミュニケーション装置のブロック図である。なお、図２１では図１と同一の各部には同一符号を付し、以下では、図１と異なる部分を中心に説明して、図１と同一の各部については説明を省略する場合がある。 FIG. 21 is a block diagram of a remote communication device according to the third embodiment. Note that in FIG. 21, the same components as those in FIG. 1 are designated by the same reference numerals, and the following description will focus on the components that differ from FIG. 1, and may omit a description of the components that are the same as those in FIG. 1.

　＜８．１．音響信号処理部＞
　本実施の形態に係る音響信号処理部１３は、個別音分離部１３１及び明瞭度調整部１３２に加えて、音声合成部１３４を有する。 8.1. Acoustic signal processing section
The acoustic signal processing unit 13 according to this embodiment includes a voice synthesis unit 134 in addition to an individual sound separation unit 131 and a clarity adjustment unit 132 .

　＜８．１．１．音声合成部＞
　音声合成部１３４は、個別音分離部１３１により分離された各発話者の個別音の入力を受ける。そして、音声合成部１３４は、音声認識をして一度発話内容を文字に起こす音声認識処理を実行する。次に、音声合成部１３４は、文字にした発話内容を音声合成して個別音を再生成する音声合成処理を実行する。その後、音声合成部１３４は、再生成した各発話者の個別音を明瞭度調整部１３２へ出力する。 <8.1.1. Speech synthesis unit>
The speech synthesis unit 134 receives input of the individual sounds of each speaker separated by the individual sound separation unit 131. The speech synthesis unit 134 then executes a speech recognition process to recognize speech and transcribe the speech into text. Next, the speech synthesis unit 134 executes a speech synthesis process to synthesize the textual speech into text and regenerate the individual sounds. The speech synthesis unit 134 then outputs the regenerated individual sounds of each speaker to the clarity adjustment unit 132.

　このように、音声合成部１３４は、個別音毎に、音声認識を実行して個別音の発話内容を示す文字を生成し、発話内容を示す文字を基に音声合成を実行して、発話者の発話音声を再生成する。この場合、明瞭度調整部１３２は、音声合成部１３４により再生成された個別音毎に、発話者が会話に参加しているか否かを基に調整を行う。 In this way, the speech synthesis unit 134 performs speech recognition for each individual sound to generate characters indicating the spoken content of the individual sound, and performs speech synthesis based on the characters indicating the spoken content to regenerate the speaker's speech. In this case, the clarity adjustment unit 132 makes adjustments for each individual sound regenerated by the speech synthesis unit 134 based on whether or not the speaker is participating in the conversation.

　＜９．第３の実施の形態に係る遠隔コミュニケーション装置間のデータの流れ＞
　図２２は、第３の実施の形態に係る自分側の遠隔コミュニケーション装置間のデータの流れの概要を示す図である。ここでも、リモート側の遠隔コミュニケーション装置１の送信ユニット１０からローカル側の遠隔コミュニケーション装置１の受信ユニット２０へとデータが送信される場合で説明する。ここでは、音声の送信について説明する。 9. Data Flow Between Remote Communication Devices According to the Third Embodiment
22 is a diagram showing an outline of the data flow between the local remote communication devices according to the third embodiment. Here, too, the case where data is transmitted from the transmitting unit 10 of the remote remote communication device 1 to the receiving unit 20 of the local remote communication device 1 will be described. Here, the transmission of audio will be described.

　マイク３は、リモート空間で収音した音声をリモート側の遠隔コミュニケーション装置１へ入力する（ステップＳ４０１）。ここでは、例えば、マイクアレイ３０のデータとしてｓチャンネルのデータが存在する。また、カメラ２は、リモート空間の撮影を行って生成した映像をリモート側の遠隔コミュニケーション装置１へ入力する（ステップＳ４０２）。 The microphone 3 inputs the audio picked up in the remote space to the remote communication device 1 (step S401). Here, for example, data from the microphone array 30 is s channel data. The camera 2 also inputs the video generated by capturing the remote space to the remote communication device 1 (step S402).

　カメラ２から入力された映像は、人センシング部１１に送られる。人センシング部１１は、カメラ２により撮影された映像を用いて、人位置判定処理（ステップＳ４３１）及び内野外野判定処理（ステップＳ４３２）を含むセンシング処理を実行する（ステップＳ４０３）。 The video input from camera 2 is sent to the human sensing unit 11. The human sensing unit 11 uses the video captured by camera 2 to perform sensing processing (step S403), including human position determination processing (step S431) and infield/outfield determination processing (step S432).

　マイク３から入力された音声は、音声分離部１２へ送られる。音声分離部１２は、マイク３から入力された音声に含まれる発話音声と背景音とを分離する音声分離処理を行う（ステップＳ４０４）。 The audio input from microphone 3 is sent to audio separation unit 12. Audio separation unit 12 performs audio separation processing to separate the speech audio and background sounds contained in the audio input from microphone 3 (step S404).

　音声分離部１２により抽出された背景音は、音響信号処理部１３の明瞭度調整部１３２へ送られる。明瞭度調整部１３２は、背景音に対してより聞こえ難くするデグレード処理を明瞭度調整処理として実施する（ステップＳ４０５）。 The background sound extracted by the audio separation unit 12 is sent to the clarity adjustment unit 132 of the audio signal processing unit 13. The clarity adjustment unit 132 performs a degrading process to make the background sound less audible as clarity adjustment processing (step S405).

　また、音声分離部１２により抽出されたｓチャンネルのデータである個別音及び人別リストが、音響信号処理部１３へ送られる。発話音声に対しては、音響信号処理部１３により人別リストを用いた発話音声信号処理が実施される（ステップＳ４０６）。 Furthermore, the individual sounds and person list, which are s-channel data extracted by the voice separation unit 12, are sent to the acoustic signal processing unit 13. The acoustic signal processing unit 13 performs speech signal processing on the speech using the person list (step S406).

　詳しくは、個別音分離部１３１は、取得した人別リストに登録された要素数であるｐ個分のスレッドを作成する。さらに、個別音分離部１３１は、人別リストに登録された正規化座標の値を、スレッド毎のスレッド固有の値として保持させる。次に、個別音分離部１３１は、各発話者までの角度θを求める。そして、個別音分離部１３１は、音声分離部１２による音声分離により抽出された発話音声に対して、各発話者までの角度に応じたビームフォーミングを用いて個別音分離処理を行い各発話者の個別音をスレッド毎に生成する（ステップＳ４６１）。 In detail, the individual sound separation unit 131 creates p threads, which is the number of elements registered in the acquired personal list. Furthermore, the individual sound separation unit 131 stores the normalized coordinate values registered in the personal list as thread-specific values for each thread. Next, the individual sound separation unit 131 calculates the angle θ to each speaker. Then, the individual sound separation unit 131 performs individual sound separation processing on the speech sounds extracted by speech separation by the speech separation unit 12 using beamforming according to the angle to each speaker, and generates individual sounds for each speaker for each thread (step S461).

　個別音分離部１３１により生成された個別音は、音声合成部１３４へ送られる。音声合成部１３４は、個別音について音声認識をして一度発話内容を文字に起こす音声認識処理を実行する（ステップＳ４６２）。 The individual sounds generated by the individual sound separation unit 131 are sent to the speech synthesis unit 134. The speech synthesis unit 134 performs speech recognition processing to recognize the individual sounds and transcribe the speech content into text (step S462).

　次に、音声合成部１３４は、文字にした発話内容を音声合成して個別音を再生成する音声合成処理を実行する（ステップＳ４６３）。音声合成部１３４により再生成された各発話者の個別音は、明瞭度調整部１３２へ送られる。 Next, the speech synthesis unit 134 executes a speech synthesis process to synthesize the text of the speech and regenerate the individual sounds (step S463). The individual sounds of each speaker regenerated by the speech synthesis unit 134 are sent to the clarity adjustment unit 132.

　明瞭度調整部１３２は、スレッド毎に個別音に対応する発話者が内野であるか外野であるかを人別リストの内野フラグを用いて判定する。そして、明瞭度調整部１３２は、内野の発話者の個別音に対してより聞こえやすくするエンハンス処理を実施し、外野の発話者の個別音に対してより聞こえにくくするデグレード処理を実施する明瞭度調整処理を実施する（ステップＳ４６４）。 The clarity adjustment unit 132 determines whether the speaker corresponding to the individual sounds for each thread is infield or outfield using the infield flag in the person list. The clarity adjustment unit 132 then performs clarity adjustment processing, which performs enhancement processing to make the individual sounds of infield speakers easier to hear, and degrade processing to make the individual sounds of outfield speakers harder to hear (step S464).

　１チャンネルのデータである背景音は、出力統合部１４へ入力される。また、発話音声信号処理が施されたｐチャンネルのデータである個別音も、出力統合部１４へ入力される。これにより、出力統合部１４は、（１＋Ｐ）チャンネルのデータを取得する。そして、出力統合部１４は、すべてのチャンネルの音を統合して１つの音データを生成する出力統合処理を実行する（ステップＳ４０７）。 The background sound, which is one channel of data, is input to the output integrating unit 14. The individual sounds, which are p channel data that have been subjected to speech signal processing, are also input to the output integrating unit 14. As a result, the output integrating unit 14 obtains (1+P) channel data. The output integrating unit 14 then executes output integration processing, which integrates the sounds of all channels to generate a single sound data (step S407).

　出力統合部１４で生成された音データ及び人別リストは、送信部１５によりネットワーク７を介してローカル側の遠隔コミュニケーション装置１へ送られる（ステップＳ４０８）。 The sound data and person list generated by the output integration unit 14 are sent by the transmission unit 15 to the local remote communication device 1 via the network 7 (step S408).

　ローカル側の遠隔コミュニケーション装置１は、音データ及び人別リストの送信を受ける。音データ及び人別リストは、受信部２１を介して音声出力制御部２２へ送られる。音声出力制御部２２は、人別リストを用いて音データに対して複数対向スピーカーバランシング処理を実施して、背景音、内野発話音声及び外野発話音声毎に出力するスピーカーユニット４１１及び４１２を指定して再生させる（ステップＳ４０９）。 The local remote communication device 1 receives the sound data and the person list. The sound data and the person list are sent to the audio output control unit 22 via the receiving unit 21. The audio output control unit 22 performs a multiple facing speaker balancing process on the sound data using the person list, and specifies the speaker units 411 and 412 to output the background sound, infield speech sound, and outfield speech sound, respectively, and plays them (step S409).

　スピーカーアレイ４１及び４２は、音声出力制御部２２から指定されたスピーカーユニット４１１及び４１２を用いて、背景音のうちの突発音、背景音のうちの定常音、内野発話音声及び外野発話音声を出音する（ステップＳ４１０）。 The speaker arrays 41 and 42 use the speaker units 411 and 412 specified by the audio output control unit 22 to output sudden sounds from the background sound, steady sounds from the background sound, infield speech sounds, and outfield speech sounds (step S410).

　＜１０．効果＞
　以上に説明したように、本実施の形態に係る遠隔コミュニケーション装置１は、音声認識をして一度発話内容を文字に起こし、それを音声合成したものを再生させる発話音声のデータとする。 <10. Effects>
As described above, the remote communication device 1 according to this embodiment performs speech recognition to transcribe the spoken content into text, and then synthesizes the text into speech data to be played back.

　マイク３で収録した発話音声をそのまま再生すると、人による声量や活舌の違いにより聞き取りづらくなることがある。これに対して、本実施の形態に係る遠隔コミュニケーション装置１は、発話の声量や活舌のばらつきを抑えて全体的に発話音声を聞き取り易くすることができる、したがって、コミュニケーションの円滑化を図ることができる。 If the speech recorded by the microphone 3 is played back as is, it may be difficult to hear due to differences in voice volume and articulation between people. In contrast, the remote communication device 1 of this embodiment can reduce variations in voice volume and articulation, making the speech easier to hear overall, thereby facilitating smoother communication.

　＜１１．第４の実施の形態に係る遠隔コミュニケーション装置＞
　次に、第４の実施の形態に係る遠隔コミュニケーション装置１について説明する。第４の実施の形態に係る遠隔コミュニケーション装置１は、内野外野判定等の音声識別を頭部方向推定以外の方法でも実行する。本実施例に係る遠隔コミュニケーション装置１も、図１のブロック図で表される。以下では、図１と異なる部分を中心に説明して、図１と同一の各部については説明を省略する場合がある。 11. Remote communication device according to the fourth embodiment
Next, a remote communication device 1 according to a fourth embodiment will be described. The remote communication device 1 according to the fourth embodiment performs voice recognition, such as infield/outfield determination, using a method other than head direction estimation. The remote communication device 1 according to this embodiment is also represented by the block diagram of FIG. 1. The following description will focus on the parts that are different from FIG. 1, and descriptions of parts that are the same as those in FIG. 1 may be omitted.

　＜１１．１．人センシング部＞
　人センシング部１１は、映像から各発話者の距離を検出するデプスセンシングを行う。そして、人センシング部１１は、カメラ２から遠くの位置にいる人がリモートの人と会話することはないという仮定の下、カメラ２から遠くの位置にいる人を外野の人と判定する。このように、人センシング部１１は、映像を基に発話者の距離を判定し、距離を基に発話者が会話に参加しているか否かを判定する。 <11.1. Human sensing unit>
The human sensing unit 11 performs depth sensing to detect the distance of each speaker from the video. Then, under the assumption that people far from the camera 2 will not converse with remote people, the human sensing unit 11 determines that people far from the camera 2 are people in the outfield. In this way, the human sensing unit 11 determines the distance of the speaker based on the video, and determines whether the speaker is participating in the conversation based on the distance.

　このように、カメラ２からの距離を用いて内野外野を判定することで、たまたま後ろを通りかかった人が画面の方を見ながら話しているときに、その人をローカルの人との会話の参加者として判定されることを回避できる。これにより、コミュニケーションの円滑化を図ることができる。 In this way, by determining infield and outfield using the distance from camera 2, it is possible to avoid someone passing behind you who happens to be looking at the screen while talking, being identified as a participant in a conversation with a local person. This facilitates smooth communication.

　＜１１．２．音響信号処理部＞
　音響信号処理部１３は、個別音分離部１３１により生成された各個別音に対して音声認識を実行して、発話内容を文字に起こす。そして、音響信号処理部１３は、文字にした発話内容のコンテキストに基づいて、その発話がローカル側の人に対する発話であるか否かを判定する。音響信号処理部１３は、例えば、感情、共感度及び傾聴度をＡＩ分析する等によりこの判定を行うことができる。 11.2. Acoustic signal processing section
The acoustic signal processing unit 13 performs speech recognition on each individual sound generated by the individual sound separation unit 131 and transcribes the speech content into text. Then, the acoustic signal processing unit 13 determines whether the speech is directed to a person on the local side based on the context of the transcribed speech content. The acoustic signal processing unit 13 can make this determination, for example, by performing AI analysis of emotions, empathy level, and listening level.

　そして、音響信号処理部１３は、その発話がローカル側の人に対する発話であると判定した場合、その発話者を内野の人とする。そして、音響信号処理部１３は、人別リストにおける内野の人と判定した発話者の内野フラグをＴｒｕｅに設定して人別リストを更新する。その後、音響信号処理部１３は、更新した人別リストを明瞭度調整部１３２へ出力する。 If the acoustic signal processing unit 13 determines that the speech is directed at a person on the local side, it determines that the speaker is an infield person. The acoustic signal processing unit 13 then sets the infield flag of the speaker determined to be an infield person in the person list to True, and updates the person list. The acoustic signal processing unit 13 then outputs the updated person list to the clarity adjustment unit 132.

　ここで、例えばスライドを見ながらプレゼンテーションをしている人の声や、メモを取りながら話を聞いている人の相槌も、リモートに対して発せられるメッセージであり、内野として処理されることが好ましい。そこで、コンテキストに基づいて内野外野を判定することで、頭部方向推定では外野として判定されてしまうが実際には会話に参加している発話者の発話が、内野の人の発話としてリモートに伝送して再生される。これにより、リモートの方向に頭が向いていないがリモートに話しかけているような状態でも、内野発話として聞き取りやすくなるように処理される。したがって、コミュニケーションの円滑化を図ることができる。 Here, for example, the voice of a person giving a presentation while looking at slides, or the interjections of a person listening while taking notes, are also messages sent to the remote, and are preferably processed as infield. Therefore, by determining infield/outfield based on the context, the speech of a speaker who would be determined to be infield in head direction estimation but is actually participating in the conversation is transmitted to the remote and played back as the speech of an infield person. This makes it possible to process speech that is easier to hear as infield speech, even in situations where the speaker is not facing the remote but is speaking to it. This can facilitate smooth communication.

　＜１１．３．その他の音声識別＞
　他にも、ローカル側の人センシング部１１は、ローカル側の人がディスプレイ６のスクリーン上の特定の位置を指さした場合に、モーションキャプチャ又は視線推定等を用いてその特定の位置を抽出する。そして、ローカル側の人センシング部１１は、送信部１５を介してリモート側の遠隔コミュニケーション装置１へ指さされた特定の位置の情報を送信する。 11.3. Other Voice Recognition
Additionally, when a person on the local side points to a specific position on the screen of the display 6, the local-side human sensing unit 11 extracts the specific position using motion capture, gaze estimation, or the like. Then, the local-side human sensing unit 11 transmits information about the specific position pointed to via the transmission unit 15 to the remote-side remote communication device 1.

　この場合、例えば、個別音分離部１３１が、ビームフォーミング等を用いてその特定の位置の音声を個別音として抽出する。明瞭度調整部１３２により、特定の位置の個別音に対して内野発話に対する処理と同じ処理等の適当な処理を加える。その後、出力統合部１４は、他の個別音とともに１つのデータにまとめて、特定の位置の情報とともにローカル側の遠隔コミュニケーション装置１へ送信部１５を介して送信する。 In this case, for example, the individual sound separation unit 131 extracts the sound at that specific position as an individual sound using beamforming or the like. The clarity adjustment unit 132 applies appropriate processing to the individual sound at the specific position, such as the same processing as that applied to infield speech. The output integration unit 14 then combines this with the other individual sounds into a single data item, and transmits it, along with information about the specific position, to the local remote communication device 1 via the transmission unit 15.

　ローカル側の遠隔コミュニケーション装置１の音声出力制御部２２は、音データから特定の位置の個別音を取り出して、特定の位置の情報を用いて内野発話に対する処理と同じ処理等の適当な複数対向スピーカーバランシング処理を行う。そして、音声出力制御部２２は、処理を施した特定の位置の個別音をスピーカーユニット４１１及び４１２に再生させる。 The audio output control unit 22 of the local remote communication device 1 extracts individual sounds at specific positions from the sound data and performs appropriate multiple opposing speaker balancing processing, such as the same processing as for infield speech, using the information about the specific positions. The audio output control unit 22 then causes the speaker units 411 and 412 to reproduce the processed individual sounds at the specific positions.

　＜１１．４．効果＞
　このように、指さされた特定の位置の音を背景音と比較して強調して再生することで、発話音声以外にも、ローカルの人が聞き取りたい音を聞き易くすることができ、コミュニケーションの円滑化を図ることができる。 <11.4. Effects>
In this way, by emphasizing and reproducing the sound at the specific location pointed at compared to the background sounds, it is possible to make it easier for local people to hear sounds that they want to hear, in addition to spoken voices, thereby facilitating smooth communication.

　＜１２．第５の実施の形態に係る遠隔コミュニケーション装置＞
　次に、第５の実施の形態に係る遠隔コミュニケーション装置１について説明する。第１の実施の形態では、カメラ２とマイク３とが仮想スクリーンの法線方向に並べられており、人センシング部１１が、カメラ２で取得した映像の各発話者の位置を示す正規化座標を基に、各発話者の方向の角度を求めた。これに対して、第５の実施の形態に係る遠隔コミュニケーション装置１は、マイク３により得られた音声により音源方向を特定する。本実施例に係る遠隔コミュニケーション装置１も、図１のブロック図で表される。以下では、図１と異なる部分を中心に説明して、図１と同一の各部については説明を省略する場合がある。 12. Remote communication device according to the fifth embodiment
Next, a remote communication device 1 according to a fifth embodiment will be described. In the first embodiment, the camera 2 and microphone 3 are aligned in the normal direction of the virtual screen, and the human sensing unit 11 determines the angle of the direction of each speaker based on normalized coordinates indicating the position of each speaker in the image captured by the camera 2. In contrast, the remote communication device 1 according to the fifth embodiment identifies the direction of a sound source using the sound captured by the microphone 3. The remote communication device 1 according to this example is also represented by the block diagram of FIG. 1. The following description will focus on the parts that are different from FIG. 1, and descriptions of parts that are the same as those in FIG. 1 may be omitted.

　図２３は、第５の実施の形態に係る遠隔コミュニケーション装置のブロック図である。また、図２４は、第５の実施の形態に係る遠隔コミュニケーション装置による音声信号処理を説明するための図である。 FIG. 23 is a block diagram of a remote communication device according to the fifth embodiment. Also, FIG. 24 is a diagram for explaining audio signal processing by the remote communication device according to the fifth embodiment.

　＜１２．１．音源方向推定部＞
　本実施例に係る遠隔コミュニケーション装置１は、図２３に示すように音源方向推定部１６を有する。音源方向推定部１６は、マイクアレイ３０により各方向に順番にビームを向けて得られる音声を取得する。ビームを向ける方向は、マイク３に予め設定しておいても良いし、音源方向推定部１６がマイク３に順番に指定してもよい。 <12.1. Sound source direction estimation section>
The remote communication device 1 according to this embodiment has a sound source direction estimating unit 16 as shown in Fig. 23. The sound source direction estimating unit 16 acquires sounds obtained by directing a beam in each direction in turn by the microphone array 30. The direction to direct the beam may be preset in the microphone 3, or the sound source direction estimating unit 16 may specify the direction to direct the beam in turn to the microphone 3.

　ここでは、ビームフォーミングでは発話および突発背景音のみを検出し、定常背景音は収音しないものとする。また、映像上に映っている発話者及びその範囲で発生した突発背景音の他に、音は発生しない、すなわち画角外からは音が鳴らないものとする。 Here, beamforming is assumed to detect only speech and sudden background sounds, and not pick up steady background sounds. Furthermore, it is assumed that no sounds are generated other than the speaker appearing on the screen and any sudden background sounds occurring within that range, i.e., no sound is heard from outside the field of view.

　次に、図２４に示すように、音源方向推定部１６は、各方向にビームを向けて得られるそれぞれの音声のうち、出力が閾値より大きい音声を選択して、選択した音声を取得したビームの方向に音源があるとして音源方向の推定を行う（ステップＳ５０１）。図２４のステップＳ５０１は、多数の音源方向が推定されたことを示している。そして、音源方向推定部１６は、音源方向を発話者が存在する位置と判定する。このように、音源方向推定部１６は、リモート空間にあたる所定空間の音声を基に発話者の位置を判定する。 Next, as shown in FIG. 24, the sound source direction estimation unit 16 selects, from the sounds obtained by directing the beam in each direction, a sound whose output is greater than a threshold, and estimates the sound source direction by assuming that the sound source is located in the direction of the beam that obtained the selected sound (step S501). Step S501 in FIG. 24 shows that a large number of sound source directions have been estimated. The sound source direction estimation unit 16 then determines the sound source direction as the position where the speaker is located. In this way, the sound source direction estimation unit 16 determines the position of the speaker based on the sound in a specified space that corresponds to the remote space.

　その後、音源方向推定部１６は、判定した発話者が存在する位置の正規化座標を人センシング部１１へ通知して、発話者が存在する位置の正規化座標を基に人別リストを作成する。この場合、人センシング部１１は、通知された正規化座標を用いて映像に映っている人の中から発話者を特定し、特定した各発話者の内野外野判定を実行して、人別リストに登録する。 The sound source direction estimation unit 16 then notifies the person sensing unit 11 of the normalized coordinates of the position where the determined speaker is located, and creates a person list based on the normalized coordinates of the position where the speaker is located. In this case, the person sensing unit 11 uses the notified normalized coordinates to identify speakers from among the people shown in the video, performs an infield/outfield determination for each identified speaker, and registers them in the person list.

　＜１２．２．音響信号処理部＞
　音響信号処理部１３は、音源方向推定部１６により音声から推定された各発話者の位置の正規化座標が登録された人別リストを人センシング部１１から取得する。そして、取得した人別リストを用いて発話音声信号処理を実行する（ステップＳ５０２）。 <12.2. Acoustic signal processing section>
The acoustic signal processing unit 13 acquires, from the person sensing unit 11, a person list in which normalized coordinates of the position of each speaker estimated from the voice by the sound source direction estimation unit 16 are registered. Then, the acoustic signal processing unit 13 executes speech signal processing using the acquired person list (step S502).

　詳しくは、個別音分離部１３１が、人別リストに登録された音声から推定された各発話者の人数のスレッドを生成する。そして、個別音分離部１３１は、スレッド毎に各発話者の位置の正規化座標を用いて個別音分離処理を実行して各発話者の個別音を生成する（ステップＳ５２１）。このように、個別音分離部１３１は、音源方向推定部１６により判定された発話者の位置を基に、リモート空間にあたる所定空間の音声から発話者毎の発話音声である個別音を分離する。 In more detail, the individual sound separation unit 131 generates threads for each speaker estimated from the audio registered in the person list. Then, the individual sound separation unit 131 performs individual sound separation processing using the normalized coordinates of the position of each speaker for each thread to generate individual sounds for each speaker (step S521). In this way, the individual sound separation unit 131 separates individual sounds, which are the speech sounds of each speaker, from the audio in a specified space corresponding to the remote space, based on the position of the speaker determined by the sound source direction estimation unit 16.

　個別音分離部１３１は、マイクアレイ３０を用いて検出した音源方向についても、ディスプレイ６に表示された状態での映像の左側の角度から順に個別音分離を行うことで、個別音と人別リストに登録された各発話者の位置を示す正規化座標とを対応させる。これにより、ローカル側に伝送後も画像と音声とを一致させることができる。 The individual sound separation unit 131 also separates the individual sounds from the sound source direction detected using the microphone array 30, starting from the angle on the left side of the image displayed on the display 6, thereby matching the individual sounds with the normalized coordinates indicating the position of each speaker registered in the person list. This allows the image and sound to match even after transmission to the local side.

　明瞭度調整部１３２は、スレッド毎に、人別リストに登録された各発話者の内野フラグを用いて、音声から推定された発話者の個別音の明瞭度調整処理を実行する（ステップＳ５２２）。このように、明瞭度調整部１３２は、個別音分離部１３１により分離された個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う。 For each thread, the clarity adjustment unit 132 performs clarity adjustment processing for the individual sounds of the speaker estimated from the audio using the infield flag of each speaker registered in the person list (step S522). In this way, the clarity adjustment unit 132 adjusts the clarity for each individual sound separated by the individual sound separation unit 131 based on whether the speaker is participating in the conversation.

　＜１２．３．効果＞
　これにより、カメラ２とマイク３とを仮想スクリーンの法線方向に並べて設置することが現実的に難しい場合であっても、カメラ２とマイク３との位置関係に制約なく個別音を生成することができる。 <12.3. Effects>
This allows individual sounds to be generated without any restrictions on the positional relationship between the camera 2 and the microphone 3, even if it is practically difficult to install the camera 2 and the microphone 3 side by side in the normal direction of the virtual screen.

　＜１３．第６の実施の形態に係る遠隔コミュニケーション装置＞
　次に、第６の実施の形態に係る遠隔コミュニケーション装置１について説明する。本実施の形態に係る遠隔コミュニケーション装置１は、スピーカーアレイ４１及び４２の幅がディスプレイ６のスクリーンの幅よりも短い場合に、ディスプレイ６の幅に合わせて音声の再生位置を再マッピングする。本実施例に係る遠隔コミュニケーション装置１も、図１のブロック図で表される。以下では、図１と異なる部分を中心に説明して、図１と同一の各部については説明を省略する場合がある。 13. Remote communication device according to the sixth embodiment
Next, a remote communication device 1 according to a sixth embodiment will be described. When the width of the speaker arrays 41 and 42 is shorter than the width of the screen of the display 6, the remote communication device 1 according to this embodiment remaps the audio playback position to match the width of the display 6. The remote communication device 1 according to this example is also represented by the block diagram of Figure 1. The following description will focus on the parts that are different from Figure 1, and descriptions of parts that are the same as those in Figure 1 may be omitted.

＜１３．１．音声出力制御部＞
　図２５は、第６の実施の形態に係る遠隔コミュニケーション装置による音声再生を説明するための図である。本実施の形態では、図２５に示すように、スピーカーアレイ４１及び４２のスピーカーアレイ幅Ｗ_ｓｐｋがディスプレイ６のスクリーンの幅であるディスプレイ幅Ｗ_ｄｉｓｐよりも短い。 <13.1. Audio output control unit>
25 is a diagram illustrating audio playback by a remote communication device according to a sixth embodiment. In this embodiment, as shown in FIG. 25, the speaker array width W _spk of the speaker arrays 41 and 42 is shorter than the display width W _disp, which is the width of the screen of the display 6.

　ここで、第１の実施の形態では、音声出力制御部２２は、正規化座標系で画像と音声とを一致させるために、スピーカーアレイ幅Ｗ_ｓｐｋ及びスピーカー間距離Ｗ_ｕをディスプレイ幅Ｗ_ｄｉｓｐで除して処理上の位置を決定した。しかし、この方法の複数対向スピーカーバランシング処理では、スピーカーアレイ幅Ｗ_ｓｐｋがディスプレイ幅Ｗ_ｄｉｓｐよりも短い場合、ディスプレイ６のスクリーンの端部に映る発話者の音声の再生が困難となる。 In the first embodiment, the audio output control unit 22 determines the processing position by dividing the speaker array width W _spk and the speaker distance W _u by the display width W _disp in order to match the image and audio in the normalized coordinate system. However, with this method of multiple opposed speaker balancing processing, if the speaker array width W _spk is shorter than the display width W _disp , it becomes difficult to reproduce the audio of a speaker who appears on the edge of the screen of the display 6.

　そこで、本実施の形態に係る音声出力制御部２２は、スピーカーアレイ幅Ｗ_ｓｐｋ及びスピーカー間距離Ｗ_ｕをスピーカーアレイ幅Ｗ_ｓｐｋで除して処理上の位置を決定する。この場合、スピーカーアレイ４１及び４２の幅は正規化座標における長さが１となり、スピーカーユニット４１１及び４１２の組みの間の幅は、スピーカー間距離Ｗ_ｕをスピーカーアレイ幅Ｗ_ｓｐｋで除した値となる。すなわち、音声出力制御部２２は、処理上のスピーカーアレイ幅が正規化座標の－０．５～０．５に設定する。そしえ、音声出力制御部２２は、この処理上の距離を用いて複数対向スピーカーバランシング処理を実行して、各音声を出力するスピーカーユニット４１１及び４１２の組みを決定する。 Therefore, the audio output control unit 22 according to the present embodiment determines the position in processing by dividing the speaker array width W _spk and the speaker distance W _u by the speaker array width W _spk . In this case, the width of the speaker arrays 41 and 42 has a length of 1 in normalized coordinates, and the width between the pair of speaker units 411 and 412 is the value obtained by dividing the speaker distance W _u by the speaker array width W _spk . In other words, the audio output control unit 22 sets the speaker array width in processing to -0.5 to 0.5 in normalized coordinates. Then, the audio output control unit 22 executes a multiple facing speaker balancing process using this distance in processing to determine the pair of speaker units 411 and 412 that will output each sound.

　このように、音声出力制御部２２は、スピーカーアレイ４１及び４２における発話音声の再生位置が、発話者それぞれのスクリーン上の位置の距離関係を維持するように、発話者の位置を基に再生させるスピーカーユニット４１１及び４１２を選択する。 In this way, the audio output control unit 22 selects the speaker units 411 and 412 to play back based on the speaker's position so that the playback positions of the spoken voices in the speaker arrays 41 and 42 maintain the distance relationship between the speaker's position on the screen.

　＜１３．２．効果＞
　このように、本実施の形態に係る遠隔コミュニケーション装置１は、スピーカーアレイ４１及び４２の幅に合わせて音声の再生位置を再マッピングする。これにより、スクリーンに映されている発話者の声を全て再生することができる。したがって、スピーカーアレイ４１及び４２の幅がディスプレイ６のスクリーンよりも極端に短い場合であっても、画面の端で話している発話者の声が再生されなくなることを回避でき、映像上の全ての話者をスピーカーアレイ４１及び４２から再生することができる。 <13.2. Effects>
In this way, the remote communication device 1 according to the present embodiment remaps the audio playback position to match the width of the speaker arrays 41 and 42. This makes it possible to play back the voices of all speakers appearing on the screen. Therefore, even if the width of the speaker arrays 41 and 42 is significantly shorter than the screen of the display 6, it is possible to prevent the voices of speakers speaking at the edges of the screen from being cut off, and all speakers appearing on the screen can be played back from the speaker arrays 41 and 42.

　＜１４．第７の実施の形態に係る遠隔コミュニケーション装置＞
　次に、第７の実施の形態に係る遠隔コミュニケーション装置１について説明する。複数対向スピーカー４は、上下ともモノラルで同じ音声を再生することで映像の縦方向の中央に仮想的に音源を定位させる。ただし、子供等の低身長の発話者や高身長の発話者の発話音声を再生する場合には不自然になる可能性がある。 14. Remote communication device according to the seventh embodiment
Next, a remote communication device 1 according to a seventh embodiment will be described. The multiple opposing speakers 4 virtually localize the sound source in the vertical center of the image by reproducing the same monaural sound from both the top and bottom. However, this may sound unnatural when reproducing the speech of a short speaker such as a child or a tall speaker.

　そこで、本実施の形態に係る遠隔コミュニケーション装置１は、音源の高さに合わせて縦方向の音源位置を変更して音声を再生する。本実施例に係る遠隔コミュニケーション装置１は、図２３のブロック図で表される。以下では、図２３と異なる部分を中心に説明して、図１と同一の各部については説明を省略する場合がある。ただし、本実施の形態では、音源方向推定部１６は、各発話者の正規化座標方向の位置の推定は行わなくてもよい。図２６は、第７の実施の形態に係る遠隔コミュニケーション装置による音声再生を説明するための図である。 Therefore, the remote communication device 1 according to this embodiment plays back audio by changing the vertical position of the sound source according to the height of the sound source. The remote communication device 1 according to this embodiment is represented by the block diagram in Figure 23. The following explanation will focus on the parts that differ from Figure 23, and explanations of parts that are the same as those in Figure 1 may be omitted. However, in this embodiment, the sound source direction estimation unit 16 does not need to estimate the position of each speaker in the normalized coordinate direction. Figure 26 is a diagram for explaining audio playback by a remote communication device according to the seventh embodiment.

　＜１４．１．人センシング部＞
　人センシング部１１は、映像に映った各発話者の口の縦方向の位置を取得する。例えば、人センシング部１１は、映像解析により口の位置を推定することができる。そして、人センシング部１１は、人別リストに各発話者についての縦方向の図２６における正規化座標Ｙを格納する。この場合、人センシング部１１は、図２６に示すように、スクリーンの縦方向の中央を原点とし、且つ、縦方向の座標を－０．５～０．５として縦方向の正規化座標Ｙを設定する。 <14.1. Human sensing unit>
The human sensing unit 11 acquires the vertical position of the mouth of each speaker shown in the video. For example, the human sensing unit 11 can estimate the position of the mouth through video analysis. Then, the human sensing unit 11 stores the normalized vertical coordinate Y in FIG. 26 for each speaker in the person list. In this case, as shown in FIG. 26, the human sensing unit 11 sets the vertical normalized coordinate Y with the center of the screen as the origin and the vertical coordinate between -0.5 and 0.5.

　＜１４．２．個別音分離部＞
　また、背景音のうちの突発音等の発話以外の音は必ずしも映像上の中央に映されている位置が音源とは限らない。これについては、第２の実施の形態のように突発音を特定する場合に、背景音分離部１３３は、突発音の縦方向の位置を特定する。この場合、例えば、背景音分離部１３３は、個別音分離部１３１及びオーディオインタフェース５を介してマイクアレイ３０に繋がる信号線を用いて信号を送受信する。背景音分離部１３３は、マイクアレイ３０に縦方向にビームを走査させて、縦方向の音声の強さから突発音の音源の縦方向の位置を推定する事ができる。そして、背景音分離部１３３は、背景音における突発音の位置を正規化座標Ｙで表して、その情報を出力統合部１４及び送信部１５を介してリモート側の遠隔コミュニケーション装置１へ送信する。 <14.2. Individual sound separation section>
Furthermore, the source of background sounds other than speech, such as sudden sounds, is not necessarily located at the center of the image. Regarding this, when identifying a sudden sound as in the second embodiment, the background sound separation unit 133 identifies the vertical position of the sudden sound. In this case, for example, the background sound separation unit 133 transmits and receives signals using a signal line connected to the microphone array 30 via the individual sound separation unit 131 and the audio interface 5. The background sound separation unit 133 can estimate the vertical position of the sound source of the sudden sound from the vertical sound intensity by causing the microphone array 30 to scan a beam. The background sound separation unit 133 then represents the position of the sudden sound in the background sound using normalized coordinates Y and transmits this information to the remote communication device 1 via the output integration unit 14 and the transmission unit 15.

　また、ここでは、背景音分離部１３３がマイクアレイ３０に操作させる構成で説明したが、他にも、第２の実施の形態の遠隔コミュニケーション装置１に、図２３に示した音源方向推定部１６を搭載させてもよい。その場合、音源方向推定部１６がマイクアレイ３０を用いて推定した音源の縦方向の位置は人センシング部１１に送られて、人別リストに登録される。 Furthermore, while the configuration has been described here in which the background sound separation unit 133 operates the microphone array 30, the remote communication device 1 of the second embodiment may also be equipped with the sound source direction estimation unit 16 shown in FIG. 23. In this case, the vertical position of the sound source estimated by the sound source direction estimation unit 16 using the microphone array 30 is sent to the person sensing unit 11 and registered in the person list.

　＜１４．３．音声出力制御部＞
　音声出力制御部２２は、例えば、スピーカーアレイ４１をＬチャンネルとし、かつ、スピーカーアレイ４２をＲチャンネルとして音声信号をステレオ化する。そして、音声出力制御部２２は、人別リストに登録された縦方向の正規化座標ＹにしたがってＬチャンネル及びＲチャンネルを用いてステレオ化した音声信号をパニングして、指定された縦方向の位置で発話音声が再生されるように音源位置を変更する。 <14.3. Audio output control unit>
The audio output control unit 22 stereoizes the audio signal, for example, by using the speaker array 41 as the L channel and the speaker array 42 as the R channel. Then, the audio output control unit 22 pans the stereoized audio signal using the L channel and the R channel in accordance with the vertical normalized coordinate Y registered in the person list, and changes the sound source position so that the speech sound is reproduced at a specified vertical position.

　例えば、音声出力制御部２２は、図２６における発話者Ｐ１の発話音声は、縦方向の中央よりも下の位置を音源として再生させる。また、音声出力制御部２２は、発話者Ｐ１の発話音声は、縦方向の中央よりも上の位置を音源として再生させる。 For example, the audio output control unit 22 plays back the speech of speaker P1 in FIG. 26 using a sound source located below the center in the vertical direction. Furthermore, the audio output control unit 22 plays back the speech of speaker P1 using a sound source located above the center in the vertical direction.

　また、第２の実施の形態のように背景音における突発音を定常音とは別に再生させる場合に、音声出力制御部２２は、以下の処理を行ってもよい。すなわち、音声出力制御部２２は、突発音の縦方向の位置についても、個別音分離部１３１から通知された突発音の縦方向の位置の正規化座標Ｙに合わせて、ステレオ化した音声信号をパニングする。これにより、音声出力制御部２２は、指定された縦方向の位置で突発音が再生されるように音源位置を変更する。このように、音声出力制御部２２は、リモート空間にあたる所定空間の音声をステレオ化して、スピーカーアレイ４１とスピーカーアレイ４２との間の音源の位置を調整する。 Furthermore, when a sudden sound in the background sound is reproduced separately from the steady sound as in the second embodiment, the audio output control unit 22 may perform the following process. That is, the audio output control unit 22 also pans the stereo audio signal for the vertical position of the sudden sound, according to the normalized coordinate Y of the vertical position of the sudden sound notified by the individual sound separation unit 131. As a result, the audio output control unit 22 changes the sound source position so that the sudden sound is reproduced at the specified vertical position. In this way, the audio output control unit 22 stereophonizes the sound in a predetermined space that corresponds to the remote space, and adjusts the position of the sound source between the speaker array 41 and the speaker array 42.

　＜１４．４．効果＞
　モノラル再生ではスクリーンの縦方向の中央に仮想的な音源が定位される。これに対して、本実施の形態に係る遠隔コミュニケーション装置１は、発話者の身長に差がある場合や鳴らす音が縦方向の中央から離れている場合に、縦方向についても適切な位置に音源位置を動かすことができる。したがって、より対話を自然な状態に近づけることができ、コミュニケーションの円滑化を図ることができる。 <14.4. Effects>
In monaural playback, a virtual sound source is positioned in the vertical center of the screen. In contrast, the remote communication device 1 according to this embodiment can move the sound source position vertically to an appropriate position when there is a difference in the height of speakers or when the sound being emitted is far from the vertical center. This allows for a more natural conversation and facilitates smooth communication.

　以上、本開示の実施の形態について説明したが、本開示の技術的範囲は、上述の実施の形態そのままに限定されるものではなく、本開示の要旨を逸脱しない範囲において種々の変更が可能である。また、異なる実施の形態及び変形例にわたる構成要素を適宜組み合わせてもよい。 Although the embodiments of the present disclosure have been described above, the technical scope of the present disclosure is not limited to the above-described embodiments, and various modifications are possible without departing from the gist of the present disclosure. Furthermore, components from different embodiments and modifications may be combined as appropriate.

　＜１５．ハードウェア構成＞
　図２７は、第１の実施の形態～第７の実施の形態に係る情報処理装置である遠隔コミュニケーション装置の演算装置を実現するコンピュータの一例を示すハードウェア構成図である。 15. Hardware Configuration
FIG. 27 is a hardware configuration diagram showing an example of a computer that realizes the arithmetic unit of the remote communication device that is the information processing device according to the first to seventh embodiments.

　コンピュータ１０００は、ＣＰＵ１１００、ＲＡＭ１２００、ＲＯＭ（Read　Only　Memory）１３００、ＨＤＤ（Hard　Disk　Drive）１４００、通信インターフェイス１５００、及び入出力インターフェイス１６００を有する。コンピュータ１０００の各部は、バス１０５０によって接続される。 Computer 1000 has a CPU 1100, RAM 1200, ROM (Read Only Memory) 1300, HDD (Hard Disk Drive) 1400, communication interface 1500, and input/output interface 1600. Each part of computer 1000 is connected by bus 1050.

　ＣＰＵ１１００は、ＲＯＭ１３００又はＨＤＤ１４００に格納されたプログラムに基づいて動作し、各部の制御を行う。例えば、ＣＰＵ１１００は、ＲＯＭ１３００又はＨＤＤ１４００に格納されたプログラムをＲＡＭ１２００に展開し、各種プログラムに対応した処理を実行する。 The CPU 1100 operates based on programs stored in the ROM 1300 or HDD 1400 and controls each component. For example, the CPU 1100 loads programs stored in the ROM 1300 or HDD 1400 into the RAM 1200 and executes processing corresponding to the various programs.

　ＲＯＭ１３００は、コンピュータ１０００の起動時にＣＰＵ１１００によって実行されるＢＩＯＳ（Basic　Input　Output　System）等のブートプログラムや、コンピュータ１０００のハードウェアに依存するプログラム等を格納する。 ROM 1300 stores boot programs such as the BIOS (Basic Input Output System) executed by the CPU 1100 when the computer 1000 starts up, as well as programs that depend on the computer 1000's hardware.

　ＨＤＤ１４００は、ＣＰＵ１１００によって実行されるプログラム、及び、かかるプログラムによって使用されるデータ等を非一時的に記録する、コンピュータが読み取り可能な記録媒体である。具体的には、ＨＤＤ１４００は、プログラムデータ１４５０の一例である本開示に係るアプリケーションプログラムを記録する記録媒体である。 HDD 1400 is a computer-readable recording medium that non-temporarily records programs executed by CPU 1100 and data used by such programs. Specifically, HDD 1400 is a recording medium that records application programs related to the present disclosure, which are an example of program data 1450.

　通信インターフェイス１５００は、コンピュータ１０００が外部ネットワーク１５５０（例えばインターネット）と接続するためのインターフェイスである。例えば、ＣＰＵ１１００は、通信インターフェイス１５００を介して、他の機器からデータを受信したり、ＣＰＵ１１００が生成したデータを他の機器へ送信したりする。 The communication interface 1500 is an interface that allows the computer 1000 to connect to an external network 1550 (e.g., the Internet). For example, the CPU 1100 receives data from other devices and transmits data generated by the CPU 1100 to other devices via the communication interface 1500.

　入出力インターフェイス１６００は、入出力デバイス１６５０とコンピュータ１０００とを接続するためのインターフェイスである。例えば、ＣＰＵ１１００は、入出力インターフェイス１６００を介して、キーボードやマウス等の入力デバイスからデータを受信する。また、ＣＰＵ１１００は、入出力インターフェイス１６００を介して、ディスプレイ６やオーディオインタフェース５やプリンタ等の出力デバイスにデータを送信する。また、入出力インターフェイス１６００は、所定の記録媒体（メディア）に記録されたプログラム等を読み取るメディアインターフェイスとして機能してもよい。メディアとは、例えばＤＶＤ（Digital　Versatile　Disc）、ＰＤ（Phase　change　rewritable　Disk）等の光学記録媒体、ＭＯ（Magneto－Optical　disk）等の光磁気記録媒体、テープ媒体、磁気記録媒体、又は半導体メモリ等である。 The input/output interface 1600 is an interface for connecting the input/output device 1650 and the computer 1000. For example, the CPU 1100 receives data from input devices such as a keyboard or mouse via the input/output interface 1600. The CPU 1100 also transmits data to output devices such as the display 6, audio interface 5, or printer via the input/output interface 1600. The input/output interface 1600 may also function as a media interface that reads programs and the like recorded on a specified recording medium. Examples of media include optical recording media such as DVDs (Digital Versatile Discs) and PDs (Phase Change Rewritable Disks), magneto-optical recording media such as MOs (Magneto-Optical Disks), tape media, magnetic recording media, or semiconductor memory.

　なお、ＣＰＵ１１００は、プログラムデータ１４５０をＨＤＤ１４００から読み取って実行するが、他の例として、外部ネットワーク１５５０を介して、他の装置からこれらのプログラムを取得してもよい。 Note that the CPU 1100 reads and executes the program data 1450 from the HDD 1400, but as another example, it may also obtain these programs from another device via the external network 1550.

　以上、添付図面を参照しながら本開示の好適な実施の形態について詳細に説明したが、本開示の技術的範囲はかかる例に限定されない。本開示の技術分野における通常の知識を有する者であれば、請求の範囲に記載された技術的思想の範疇内において、各種の変更例又は修正例に想到し得ることは明らかであり、これらについても、当然に本開示の技術的範囲に属するものと了解される。 The above describes in detail preferred embodiments of the present disclosure with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to such examples. It is clear that a person with ordinary skill in the technical field of the present disclosure would be able to conceive of various modified or revised examples within the scope of the technical ideas set forth in the claims, and it is understood that these also naturally fall within the technical scope of the present disclosure.

　また、本明細書に記載された効果は、あくまで説明的又は例示的なものであって限定的ではない。つまり、本開示に係る技術は、上記の効果とともに、又は上記の効果に代えて、本明細書の記載から当業者には明らかな他の効果を奏しうる。 Furthermore, the effects described in this specification are merely descriptive or exemplary and are not limiting. In other words, the technology disclosed herein may achieve other effects in addition to or in place of the above-mentioned effects that would be apparent to those skilled in the art from the description herein.

　なお、本技術は以下のような構成を取ることもできる。 This technology can also be configured as follows:

（１）
　所定空間に存在する複数の発話者のカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシング部と、
　マイクにより収音された前記所定空間の音声に含まれる前記人センシング部により判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整部と、
　前記明瞭度調整部により処理された前記所定空間の音声を送信する送信部と
　を有する情報処理装置。
（２）
　前記人センシング部は、前記映像を基に発話者それぞれの位置を判定し、
　前記人センシング部により判定された発話者それぞれの前記位置を基に、前記所定空間の音声から発話者毎の発話音声である個別音を分離する個別音分離部とさらに有し、
　前記明瞭度調整部は、前記個別音分離部により分離された前記個別音毎に、各発話者が会話に参加しているか否かを基に明瞭度の調整を行う
　前記（１）に記載の情報処理装置。
（３）
　前記送信部により送信された前記所定空間の音声を受信する受信部と、
　複数のスピーカーユニットが一列に並ぶスピーカーアレイを２つ有するスピーカーにおいて、発話者それぞれの前記位置を基に、前記受信部により受信された前記所定空間の音声のうち各発話者の発話音声を再生させる前記スピーカーユニットを選択して、選択した前記スピーカーユニットに発話者それぞれの発話音声を再生させる音声出力制御部と
　をさらに有する前記（２）に記載の情報処理装置。
（４）
　前記受信部は、前記所定空間の音声とともに前記映像を受信して、前記映像を写した際の前記映像の上下方向の両端部に１つずつ前記スピーカーアレイが配置されたスクリーンに前記映像を映し、
　前記音声出力制御部は、前記スクリーンに映された発話者毎に、前記スクリーン上の位置の近傍で発話音声が再生されるように、前記発話者の位置を基に再生させる前記スピーカーユニットを選択する
　前記（３）に記載の情報処理装置。
（５）
　前記受信部は、前記所定空間の音声とともに前記映像を受信して、前記映像を写した際の前記映像の上下方向の両端部に１つずつ前記スピーカーアレイが配置されたスクリーンに前記映像を映し、
　前記音声出力制御部は、前記スピーカーアレイにおける発話音声の再生位置が、発話者それぞれの前記スクリーン上の位置の距離関係を維持するように、前記発話者の位置を基に再生させる前記スピーカーユニットを選択する
　前記（３）に記載の情報処理装置。
（６）
　前記明瞭度調整部は、前記所定空間の音声に含まれる会話に参加していない発話者の発話音声について、前記マイクによる収音状態よりも明瞭度を下げる明瞭度低下調整を行う前記（１）～（５）のいずれか一つに記載の情報処理装置。
（７）
　前記所定空間の音声から背景音を分離する音声分離部をさらに有し、
　前記明瞭度調整部は、前記音声分離部により分離された前記背景音について、他の音声よりも明瞭度を下げる明瞭度低下調整を行う
　前記（１）～（６）のいずれか一つに記載の情報処理装置。
（８）
　前記背景音を突発音と定常音とに分離する背景音分離部を備え、
　前記明瞭度調整部は、前記背景音のうち前記定常音について、前記明瞭度低下調整を行う
　前記（７）に記載の情報処理装置。
（９）
　前記個別音毎に、音声認識を実行して前記個別音の発話内容を示す文字を生成し、前記発話内容を示す文字を基に音声合成を実行して、発話者の発話音声を再生成する音声合成部をさらに有し、
　前記明瞭度調整部は、前記音声合成部により再生成された前記個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う
　前記（２）～（５）のいずれか一つに記載の情報処理装置。
（１０）
　前記人センシング部は、前記映像を基に発話者の距離を判定し、前記距離を基に発話者が会話に参加しているか否かを判定する前記（１）～（９）のいずれか一つに記載の情報処理装置。
（１１）
　前記所定空間の音声を基に発話者の位置を判定する音源方向推定部をさらに有し、
　個別音分離部は、前記音源方向推定部により判定された発話者の位置を基に、前記所定空間の音声から発話者毎の発話音声である個別音を分離し、
　前記明瞭度調整部は、前記個別音分離部により分離された前記個別音毎に、発話者が会話に参加しているか否かを基に明瞭度の調整を行う
　前記（１）～（１０）のいずれか一つに記載の情報処理装置。
（１２）
　前記音声出力制御部は、前記所定空間の音声をステレオ化して、前記スピーカーアレイ間の音源の位置を調整する前記（３）に記載の情報処理装置。
（１３）
　情報処理装置が、
　所定空間に存在する複数の発話者を撮影するカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシングステップと、
　前記所定空間の音声を収音するマイクにより収音された前記所定空間の音声に含まれる、前記人センシングステップにおいて判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整ステップと、
　前記明瞭度調整ステップにおいて処理された前記所定空間の音声を送信する送信ステップと
　を実行する情報処理方法。
（１４）
　所定空間に存在する複数の発話者を撮影するカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定し
　前記所定空間の音声を収音するマイクにより収音された前記所定空間の音声に含まれる、会話に参加している発話者の発話音声について、明瞭度の調整を行い、
　会話に参加している発話者の発話音声を調整した前記所定空間の音声を送信する
　処理をコンピュータに実行させる情報処理プログラム。
（１５）
　送信装置は、
　所定空間に存在する複数の発話者のカメラにより撮影された映像を基に、各発話者が会話に参加しているか否かを判定する人センシング部と、
　マイクにより収音された前記所定空間の音声に含まれる前記人センシング部により判定された会話に参加している発話者の発話音声について、明瞭度の調整を行う明瞭度調整部と、
　前記明瞭度調整部により処理された前記所定空間の音声を送信する送信部とを有し、
　受信装置は、
　前記送信部により送信された前記所定空間の音声を受信する受信部と、
　複数のスピーカーユニットが一列に並ぶスピーカーアレイを２つ有するスピーカーにおいて、発話者それぞれの前記位置を基に、前記受信部により受信された前記所定空間の音声のうち各発話者の発話音声を再生させる前記スピーカーユニットを選択して、選択した前記スピーカーユニットに発話者それぞれの発話音声を再生させる音声出力制御部とを有する
　情報処理システム。 (1)
a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
an articulation adjustment unit that adjusts the articulation of speech of speakers participating in the conversation determined by the human sensing unit and included in the speech in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space that has been processed by the clarity adjusting unit.
(2)
the human sensing unit determines the position of each speaker based on the video;
an individual sound separation unit that separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of each speaker determined by the human sensing unit;
The information processing device according to (1), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not each speaker is participating in a conversation.
(3)
a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
The information processing device described in (2) further includes an audio output control unit that selects, based on the position of each speaker, a speaker unit that will play the speech of each speaker from the audio in the specified space received by the receiving unit, in a speaker having two speaker arrays in which multiple speaker units are arranged in a row, and causes the selected speaker unit to play the speech of each speaker.
(4)
the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device described in (3), wherein the audio output control unit selects the speaker unit to be played based on the position of each speaker displayed on the screen so that the spoken audio is played near the position on the screen.
(5)
the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device described in (3), wherein the audio output control unit selects the speaker unit to be played back based on the position of the speaker so that the playback position of the spoken audio in the speaker array maintains the distance relationship between the positions of each speaker on the screen.
(6)
The information processing device described in any one of (1) to (5), wherein the clarity adjustment unit performs a clarity reduction adjustment to reduce the clarity of the speech of a speaker who is not participating in the conversation and is included in the audio of the specified space compared to the state of sound collection by the microphone.
(7)
further comprising an audio separation unit that separates background sounds from the audio in the predetermined space;
The information processing device according to any one of (1) to (6), wherein the clarity adjustment unit performs clarity reduction adjustment to reduce clarity of the background sound separated by the sound separation unit compared to other sounds.
(8)
a background sound separation unit that separates the background sound into a sudden sound and a steady sound,
The information processing device according to (7), wherein the clarity adjustment unit performs the clarity reduction adjustment on the stationary sound among the background sounds.
(9)
a speech synthesis unit that performs speech recognition for each of the individual sounds to generate characters indicating the speech content of the individual sounds, and performs speech synthesis based on the characters indicating the speech content to reproduce the speech of the speaker;
The information processing device according to any one of (2) to (5), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds regenerated by the speech synthesis unit based on whether or not a speaker is participating in a conversation.
(10)
The information processing device according to any one of (1) to (9), wherein the human sensing unit determines the distance of the speaker based on the video and determines whether the speaker is participating in the conversation based on the distance.
(11)
a sound source direction estimation unit that determines the position of a speaker based on the sound in the predetermined space;
an individual sound separation unit separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of the speaker determined by the sound source direction estimation unit;
The information processing device according to any one of (1) to (10), wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not a speaker is participating in a conversation.
(12)
The information processing device according to (3), wherein the audio output control unit stereophonically converts audio in the predetermined space and adjusts the position of a sound source among the speaker arrays.
(13)
The information processing device
a human sensing step of determining whether each speaker is participating in a conversation based on an image captured by a camera capturing multiple speakers present in a predetermined space;
an articulation adjustment step of adjusting the articulation of speech of a speaker participating in the conversation determined in the person sensing step, the speech being included in the audio of the predetermined space collected by a microphone that collects audio of the predetermined space;
a transmitting step of transmitting the sound in the predetermined space processed in the clarity adjusting step.
(14)
Based on the video captured by a camera capturing multiple speakers in a predetermined space, it is determined whether each speaker is participating in the conversation, and the clarity of the speech of the speakers participating in the conversation, which is included in the audio of the predetermined space collected by a microphone that collects the audio of the predetermined space, is adjusted.
An information processing program that causes a computer to execute a process of adjusting the speech of speakers participating in the conversation and transmitting the speech in the predetermined space.
(15)
The transmitting device
a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
an articulation adjustment unit that adjusts the articulation of speech of speakers participating in the conversation determined by the human sensing unit and included in the speech in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space processed by the clarity adjustment unit,
The receiving device
a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
An information processing system comprising: a speaker having two speaker arrays in which a plurality of speaker units are arranged in a row; and an audio output control unit that selects a speaker unit to play back the speech of each speaker from the audio in the specified space received by the receiving unit based on the position of each speaker, and causes the selected speaker unit to play back the speech of each speaker.

　１　遠隔コミュニケーション装置
　２　カメラ
　３　マイク
　４　複数対向スピーカー
　５　オーディオインタフェース
　６　ディスプレイ
　７　ネットワーク
　１０　送信ユニット
　１１　人センシング部
　１２　音声分離部
　１３　音響信号処理部
　１４　出力統合部
　１５　送信部
　１６　音源方向推定部
　２０　受信ユニット
　２１　受信部
　２２　音声出力制御部
　３０　マイクアレイ
　４１，４２　スピーカーアレイ
　１３１　個別音分離部
　１３２　明瞭度調整部
　１３３　背景音分離部
　１３４　音声合成部
　４１１，４１２　スピーカーユニット REFERENCE SIGNS LIST 1 Remote communication device 2 Camera 3 Microphone 4 Multiple opposing speakers 5 Audio interface 6 Display 7 Network 10 Transmission unit 11 Human sensing unit 12 Voice separation unit 13 Acoustic signal processing unit 14 Output integration unit 15 Transmission unit 16 Sound source direction estimation unit 20 Receiving unit 21 Receiving unit 22 Voice output control unit 30 Microphone array 41, 42 Speaker array 131 Individual sound separation unit 132 Clarity adjustment unit 133 Background sound separation unit 134 Voice synthesis unit 411, 412 Speaker unit

Claims

a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
an articulation adjustment unit that adjusts the articulation of speech of speakers participating in the conversation determined by the human sensing unit and included in the speech in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space that has been processed by the clarity adjusting unit.

the human sensing unit determines the position of each speaker based on the video;
an individual sound separation unit that separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of each speaker determined by the human sensing unit;
The information processing device according to claim 1 , wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not each speaker is participating in a conversation.

a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
3. The information processing device according to claim 2, further comprising: an audio output control unit that selects, based on the position of each speaker, a speaker unit that will play back the speech of each speaker from the audio in the specified space received by the receiving unit, in a speaker having two speaker arrays in which a plurality of speaker units are arranged in a row, and causes the selected speaker unit to play back the speech of each speaker.

the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device according to claim 3 , wherein the audio output control unit selects the speaker unit to be played back based on the position of each speaker displayed on the screen so that the spoken audio is played back near the position on the screen.

the receiving unit receives the video together with the audio from the predetermined space, and projects the video on a screen on which the speaker arrays are arranged, one at each end of the video in a vertical direction when the video is projected;
The information processing device according to claim 3 , wherein the audio output control unit selects the speaker unit to be played back based on the position of the speaker so that the playback position of the speech voice in the speaker array maintains a distance relationship between the positions of the speakers on the screen.

The information processing device according to claim 1, wherein the clarity adjustment unit performs clarity reduction adjustment to reduce the clarity of speech of speakers not participating in the conversation included in the audio from the specified space compared to the state of sound pickup by the microphone.

further comprising an audio separation unit that separates background sounds from the audio in the predetermined space;
The information processing device according to claim 1 , wherein the clarity adjustment unit performs clarity reduction adjustment on the background sound separated by the sound separation unit to reduce clarity more than other sounds.

a background sound separation unit that separates the background sound into a sudden sound and a steady sound,
The information processing device according to claim 7 , wherein the clarity adjustment unit performs the clarity reduction adjustment on the stationary sound of the background sound.

a speech synthesis unit that performs speech recognition for each of the individual sounds to generate characters indicating the speech content of the individual sounds, and performs speech synthesis based on the characters indicating the speech content to reproduce the speech of the speaker;
The information processing device according to claim 2 , wherein the clarity adjustment unit adjusts clarity for each of the individual sounds regenerated by the speech synthesis unit based on whether or not a speaker is participating in a conversation.

The information processing device described in claim 1, wherein the human sensing unit determines the distance of the speaker based on the video and determines whether the speaker is participating in the conversation based on the distance.

a sound source direction estimation unit that determines the position of a speaker based on the sound in the predetermined space;
an individual sound separation unit separates individual sounds, which are speech sounds of each speaker, from the sound in the predetermined space based on the position of the speaker determined by the sound source direction estimation unit;
The information processing device according to claim 1 , wherein the clarity adjustment unit adjusts clarity for each of the individual sounds separated by the individual sound separation unit based on whether or not a speaker is participating in a conversation.

The information processing device described in claim 3, wherein the audio output control unit converts the audio from the specified space into stereo and adjusts the position of the sound source between the speaker arrays.

The information processing device
a human sensing step of determining whether each speaker is participating in a conversation based on an image captured by a camera capturing multiple speakers present in a predetermined space;
an articulation adjustment step of adjusting the articulation of speech of a speaker participating in the conversation determined in the person sensing step, the speech being included in the audio of the predetermined space collected by a microphone that collects audio of the predetermined space;
a transmitting step of transmitting the sound in the predetermined space processed in the clarity adjusting step.

The transmitting device
a human sensing unit that determines whether each speaker is participating in a conversation based on images of multiple speakers present in a predetermined space captured by a camera;
an articulation adjustment unit that adjusts the articulation of speech of speakers participating in a conversation determined by the human sensing unit and included in the speech in the predetermined space collected by a microphone;
a transmitting unit that transmits the sound in the predetermined space processed by the clarity adjustment unit,
The receiving device
a receiving unit that receives the sound in the predetermined space transmitted by the transmitting unit;
An information processing system comprising: a speaker having two speaker arrays in which a plurality of speaker units are arranged in a row; and an audio output control unit that selects a speaker unit to play back the speech of each speaker from the audio in the specified space received by the receiving unit based on the position of each speaker, and causes the selected speaker unit to play back the speech of each speaker.