JP2016010039A

JP2016010039A - Remote conference system, video processing method, video controller, conference terminal, and program thereof

Info

Publication number: JP2016010039A
Application number: JP2014130379A
Authority: JP
Inventors: 亮人相場; Akihito Aiba
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2014-06-25
Filing date: 2014-06-25
Publication date: 2016-01-18

Abstract

PROBLEM TO BE SOLVED: To allow the right to speak to be set, or an emphasis target to be selected, by a natural movement of a conference participant in a remote conference system. SOLUTION: Video and audio 10-1, 10-2, ..., 10-N transmitted from respective terminals 1-1, 1-2, ..., 1-N are received by a video and audio synthesis control server 2. Composite video and audio 20-1, 20-2, ..., 20-N transmitted from the video and audio synthesis control server 2 are received by a first terminal 1-1, a second terminal 1-2, ..., an N-th terminal 1-N. When a conference participant at the first terminal 1-1, which is the moderator (chairman) terminal, makes a movement designating a desired position on the displayed composite video, the first terminal 1-1 transmits a movement detection result 11 representing the designated position to the video and audio synthesis control server 2. The video and audio synthesis control server 2 then generates composite video and audio 20-1, 20-2, ..., 20-N in which the video and audio displayed at the designated position are emphasized.

Description

The present invention relates to a remote conference system, a video processing method thereof, a video control device, a conference terminal, and a program.

In recent years, with the development of information and communication technology, remote conference systems have been developed that allow people at mutually remote locations to take part in a conference over a network such as the Internet. A remote conference, such as a video conference or a web conference, is not a conference in which the participants actually gather in one place and talk face to face. Instead, the participants remain at locations (sites) separated from one another and carry on the discussion by exchanging moving images and audio through interconnected conference terminals.

In a remote conference using such a system, the conference is not limited to participants at two sites; participants at three or more sites often take part. When many participants hold a remote conference, confusion results if several participants speak at once as the conference proceeds.

One remote conference system that addresses this problem is described in Patent Document 1. In that system, one of the conference terminals serves as a moderator terminal. So that the moderator can manage the progress of the conference, the right to speak is granted to a conference terminal in response to a notification from that terminal of an intention to speak, so that only a limited number of conference terminals can transmit video and audio.

A remote conference system is also known that directs the other participants' attention to a specific participant so that the conference can proceed smoothly (Patent Document 2). In this system, any conference terminal can send the other conference terminals a request signal asking them to output conference information, such as its own participant's video and audio, with greater emphasis. On receiving an emphasis request signal from another conference terminal, each conference terminal reproduces the conference information of the participant specified by the signal with emphasis.

However, the methods used in these conventional remote conference systems for setting the right to speak and for selecting an emphasis target, such as pressing buttons on the conference terminal, using a keyboard, or operating a menu screen, are cumbersome and unnatural.

The present invention has been made to solve this problem, and its object is to make it possible, in a remote conference system, to set the right to speak and to select an emphasis target through a natural movement of a conference participant.

The present invention is a remote conference system in which information including video is communicated between conference terminals installed at a plurality of sites, the system comprising: position detection means for detecting, when a conference participant using a predetermined one of the conference terminals designates, by a predetermined first movement, a desired position on the video displayed on that conference terminal, the designated position; site detection means for detecting the site displayed at the position designated by the conference participant using the predetermined one conference terminal, from the position detected by the position detection means and information representing the positions of the videos of the sites of the other conference terminals within the video displayed on the predetermined one conference terminal; and video emphasis means for emphasizing the video of the site detected by the site detection means.

According to the present invention, in a remote conference system, the right to speak can be set and an emphasis target can be selected through a natural movement of a conference participant.

FIG. 1 is a diagram for explaining the overall configuration of the remote conference system according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of a terminal in the remote conference system according to the first embodiment of the present invention.
FIG. 3 is a block diagram showing the hardware configuration of the video/audio synthesis and control server in the remote conference system according to the first embodiment of the present invention.
FIG. 4 is a functional block diagram of the moderator terminal in the remote conference system according to the first embodiment of the present invention.
FIG. 5 is a functional block diagram of the video/audio synthesis and control server in the remote conference system according to the first embodiment of the present invention.
FIG. 6 is a diagram for explaining the output of the motion detection means in FIG. 4.
FIG. 7 is a diagram for explaining the method of detecting the designated coordinates in FIG. 6.
FIG. 8 is a diagram for explaining the video arrangement information and the synthesized video in FIG. 5.
FIG. 9 is a diagram for explaining the operation of cancelling the emphasis processing in the remote conference system according to the first embodiment of the present invention.
FIG. 10 is a flowchart showing the operation of the remote conference system according to the first embodiment of the present invention.
FIG. 11 is a diagram for explaining the overall configuration of the remote conference system according to the second embodiment of the present invention.
FIG. 12 is a functional block diagram of the moderator terminal in the remote conference system according to the second embodiment of the present invention.
FIG. 13 is a functional block diagram of a terminal other than the moderator terminal in the remote conference system according to the second embodiment of the present invention.
FIG. 14 is a flowchart showing the transmission operation of the moderator terminal and the reception operation of a terminal other than the moderator terminal in the remote conference system according to the second embodiment of the present invention.
FIG. 15 is a flowchart showing the transmission operation of a terminal other than the moderator terminal and the reception operation of the moderator terminal in the remote conference system according to the second embodiment of the present invention.

Embodiments of the present invention will be described below with reference to the drawings.
[First Embodiment]
<Overall configuration of remote conference system>
FIG. 1 is a diagram for explaining the overall configuration of the remote conference system according to the first embodiment of the present invention.

This remote conference system is configured by connecting N conference terminals installed at N sites (N is an integer of 3 or more), namely a first terminal 1-1, a second terminal 1-2, ..., and an N-th terminal 1-N, together with a video/audio synthesis and control server 2, to a network 3. It supports a remote conference by communicating video and audio between the terminals.

Each terminal transmits video and audio (hereinafter, "video/audio") to the video/audio synthesis and control server 2.
The video/audio synthesis and control server 2 controls and synthesizes the received video/audio and then transmits the synthesized video/audio to each terminal. Each terminal receives the synthesized video/audio from the video/audio synthesis and control server 2.

That is, the video/audio 10-1 transmitted from the first terminal 1-1, the video/audio 10-2 transmitted from the second terminal 1-2, ..., and the video/audio 10-N transmitted from the N-th terminal 1-N each pass through the network 3 and are received by the video/audio synthesis and control server 2. The synthesized video/audio 20-1, 20-2, ..., 20-N transmitted from the video/audio synthesis and control server 2 pass through the network 3 and are received by the first terminal 1-1, the second terminal 1-2, ..., and the N-th terminal 1-N, respectively.

The synthesized video/audio 20-1, 20-2, ..., 20-N may all be identical, but in the present embodiment they differ according to the destination terminal. For example, the synthesized video/audio 20-1 transmitted to the first terminal 1-1 does not include the video/audio 10-1 from the first terminal 1-1; only the video/audio from the other terminals is synthesized.
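The per-destination composition described above can be sketched as follows. This is only a minimal illustration in Python, not the implementation of the embodiment; the function name `sources_for_each_destination` and the use of plain integer terminal numbers are assumptions made for the example.

```python
from typing import Dict, List


def sources_for_each_destination(terminal_ids: List[int]) -> Dict[int, List[int]]:
    """For each destination terminal, list the terminals whose video/audio
    would be combined into the stream sent to it (every terminal but itself)."""
    return {
        dest: [src for src in terminal_ids if src != dest]
        for dest in terminal_ids
    }


# Hypothetical three-terminal conference.
plan = sources_for_each_destination([1, 2, 3])
print(plan)
```

With this arrangement a participant never sees or hears their own site in the stream they receive, which is the behavior described for the synthesized video/audio 20-1.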

The first terminal 1-1 is the moderator terminal, that is, the terminal used by the moderator, who is one of the conference participants. The moderator terminal can be determined, for example, by registering which terminal it is through a menu operation before or after the conference starts, or by registering the face or voice of the person acting as moderator in advance and designating the terminal at which that person is recognized during the conference.

The moderator terminal detects a predetermined movement that the moderator makes in order to decide who has the right to speak, emphasize video/audio, and so on, so that the conference proceeds smoothly, and transmits a motion detection result 11 to the video/audio synthesis and control server 2. As described in detail later, the motion detection result includes information indicating where on the synthesized video received and displayed by the first terminal 1-1 the moderator pointed. The video/audio synthesis and control server 2 controls the video and audio on the basis of the motion detection result 11. In the following description, the first terminal 1-1, the second terminal 1-2, ..., and the N-th terminal 1-N are referred to as terminal 1 when they need not be distinguished.

<Hardware configuration of terminal>
FIG. 2 is a block diagram showing the hardware configuration of a terminal in the remote conference system according to the first embodiment of the present invention.

The terminal 1 is composed of a computer and its peripheral devices, and has a structure in which a CPU 511, a memory (ROM, RAM) 512, a storage medium mounting unit 513, a network device 514, a monitor control unit 515, an input device 516, an HDD (hard disk drive) 517, a camera 521, a speaker 522, and a microphone 523 are connected by a bus line.

The CPU 511 reads the program 520 from the HDD 517, executes it, and controls the terminal 1 as a whole. The RAM of the memory 512 is volatile memory such as DRAM and serves as the work area when the CPU 511 executes a program or the OS. The ROM of the memory 512 is non-volatile memory that stores a boot program and the like.

The storage medium mounting unit 513 is an interface to which various storage media 519 can be detachably connected, and is used to read data from and write data to the storage medium 519. The program 520 is distributed stored on the storage medium 519, read from the storage medium 519, and installed on the HDD 517. The storage medium 519 is, for example, a USB memory, an SD card, a Memory Stick (registered trademark), a MultiMediaCard, a CD-ROM (R/W), or a DVD-ROM (RAM, R/W).

The network device 514 is an interface (for example, an Ethernet (registered trademark) card) for connecting the terminal 1 to the network 3. The program 520 may also be installed on the HDD 517 by having the network device 514 download it from a server (not shown).

The HDD 517 stores the terminal-side program 520 of the remote conference system, the OS, and various data described later. The program 520 configures the functional blocks described later (FIG. 4) in the terminal 1 and causes the terminal 1 to execute the flowchart described later (FIG. 10).

The input device 516 is a device, such as a mouse, keyboard, or touch panel, with which a conference participant operates the terminal 1. The monitor control unit 515 draws the screen to be shown on the display 518 at the resolution and color depth specified by the OS and the program 520. Under the control of the monitor control unit 515, the display 518 serves as a user interface that displays a GUI (Graphical User Interface) screen. The display 518 is an LCD or a projector, and a touch panel may be integrated into the LCD.

The camera 521 is a device that captures video of the site where the terminal 1 is installed. The video of the site consists of video of the terminal 1 and its surroundings and video of the conference participants using the terminal 1. The microphone 523 is a device that picks up the speech of the conference participants using the terminal 1. The video captured by the camera 521 and the audio picked up by the microphone 523 are transmitted to the other terminals in real time. At each of those terminals, the video is shown on the display 518 and the audio is output from the speaker 522.

<Video/audio synthesis and control server>
FIG. 3 is a block diagram showing the hardware configuration of the video/audio synthesis and control server in the remote conference system according to the first embodiment of the present invention.

The video/audio synthesis and control server 2, which serves as the video control device according to the present invention, is composed of a computer and its peripheral devices, and has a structure in which a CPU 611, a memory (ROM, RAM) 612, a storage medium mounting unit 613, a network device 614, an input device 616, and an HDD 617 are connected by a bus line.

The CPU 611 reads the program 620 from the HDD 617, executes it, and controls the video/audio synthesis and control server 2 as a whole. The RAM of the memory 612 is volatile memory such as DRAM and serves as the work area when the CPU 611 executes a program or the OS. The storage medium mounting unit 613 is an interface to which various storage media 619 can be detachably connected, and is used to read data from and write data to the storage medium 619. The program 620 is distributed stored on the storage medium 619, read from the storage medium 619, and installed on the HDD 617.

The network device 614 is an interface for connecting the video/audio synthesis and control server 2 to the network 3. The input device 616 is a device, such as a mouse, keyboard, or touch panel, for operating the video/audio synthesis and control server 2.

<Functional block diagram of moderator terminal>
FIG. 4 is a functional block diagram of the moderator terminal in the remote conference system according to the first embodiment of the present invention.

The first terminal 1-1, which is the moderator terminal, includes video acquisition means 101, video transmission means 102, audio acquisition means 103, audio transmission means 104, sensor information acquisition means 105, motion detection means 106, detection result transmission means 107, video receiving means 108, video output means 109, audio receiving means 110, and audio output means 111.

The video acquisition means 101 acquires the video of the moderator captured by the camera 521. The video transmission means 102 transmits the video acquired by the video acquisition means 101, by way of the network device 514, to the video/audio synthesis and control server 2 over the network 3. The audio acquisition means 103 acquires the moderator's speech picked up by the microphone 523. The audio transmission means 104 transmits the audio acquired by the audio acquisition means 103, by way of the network device 514, to the video/audio synthesis and control server 2 over the network 3. The video acquired by the video acquisition means 101 and the audio acquired by the audio acquisition means 103 together constitute the video/audio 10-1.

The video receiving means 108 receives the video transmitted from the video/audio synthesis and control server 2, and the video output means 109 shows the video received by the video receiving means 108 on the display 518. The audio receiving means 110 receives the audio transmitted from the video/audio synthesis and control server 2, and the audio output means 111 reproduces (outputs) the audio received by the audio receiving means 110 through the speaker 522. The video received by the video receiving means 108 and the audio received by the audio receiving means 110 together constitute the synthesized video/audio 20-1.

The sensor information acquisition means 105 acquires information on the distance to a body part of the moderator, for example a hand, from a distance sensor or the like placed, for example, on the front of the display 518 of the first terminal 1-1. From the sensor information acquired by the sensor information acquisition means 105 and the video acquired by the video acquisition means 101, the motion detection means 106 detects whether the moderator has made the predetermined movement of pointing to (designating) a desired position on the video shown on the display 518, and outputs a motion detection result. The motion detection result also includes the coordinates of the indicated position. In other words, the motion detection means 106 functions as the position detection means according to the present invention. The configuration of the motion detection means 106 and the motion detection result are described in detail later.

Terminals other than the moderator terminal need not include the sensor information acquisition means 105, the motion detection means 106, or the detection result transmission means 107, and may therefore be terminals configured by omitting these means from the moderator terminal. Even a terminal that includes these means need not operate them when it is not serving as the moderator terminal.

<Video/audio synthesis and control server>
FIG. 5 is a functional block diagram of the video/audio synthesis and control server in the remote conference system according to the first embodiment of the present invention.

The video/audio synthesis and control server 2 includes video receiving means 201, video synthesis means 202, video transmission means 203, audio receiving means 204, audio synthesis means 205, audio transmission means 206, detection result receiving means 207, emphasis target selection means 208, video control means 209, and audio control means 210.

The video receiving means 201 receives videos 1, 2, ..., N transmitted from the first terminal 1-1, the second terminal 1-2, ..., and the N-th terminal 1-N. The video synthesis means 202 combines the videos 1, 2, ..., N (pre-synthesis videos) received by the video receiving means 201 to generate synthesized videos 1, 2, ..., N. The operation of the video synthesis means 202 and the content of the synthesized videos are described in detail later. The video transmission means 203 transmits the synthesized videos 1, 2, ..., N to the first terminal 1-1, the second terminal 1-2, ..., and the N-th terminal 1-N.

The audio receiving means 204 receives audio 1, 2, ..., N transmitted from the first terminal 1-1, the second terminal 1-2, ..., and the N-th terminal 1-N. The audio synthesis means 205 combines the audio 1, 2, ..., N (pre-synthesis audio) received by the audio receiving means 204 to generate synthesized audio 1, 2, ..., N. The operation of the audio synthesis means 205 and the content of the synthesized audio are described in detail later. The audio transmission means 206 transmits the synthesized audio 1, 2, ..., N to the first terminal 1-1, the second terminal 1-2, ..., and the N-th terminal 1-N.

Here, videos 1, 2, ..., N and audio 1, 2, ..., N constitute the video/audio 10-1, 10-2, ..., 10-N in FIG. 1, and synthesized videos 1, 2, ..., N and synthesized audio 1, 2, ..., N constitute the synthesized video/audio 20-1, 20-2, ..., 20-N in FIG. 1.

The detection result receiving means 207 receives the motion detection result transmitted from the first terminal 1-1. From the received motion detection result and the video arrangement information supplied by the video synthesis means 202, the emphasis target selection means 208 selects, from among videos 1, 2, ..., N, the terminal to be emphasized, and sends emphasis target information, which is the result of the selection, to the video control means 209 and the audio control means 210. In other words, the emphasis target selection means 208 functions as the site detection means according to the present invention. The video control means 209 and the audio control means 210 control the operation of the video synthesis means 202 and the audio synthesis means 205, respectively, on the basis of the emphasis target information they receive. The video arrangement information and the emphasis target information are described in detail later.

<Details of the configuration of the motion detection means and of the motion detection result>
FIG. 6 is a diagram for explaining the output of the motion detection means 106 in FIG. 4, and FIG. 7 is a diagram for explaining the method of detecting the designated coordinates in FIG. 6. FIGS. 7A and 7B are, respectively, a plan view and a side view of the relationship between the moderator's hand and the screen of the display 518 of the moderator terminal (first terminal 1-1).

As shown in FIG. 6, the output of the motion detection means 106, that is, the motion detection result, consists of data in which information indicating whether the predetermined movement was made is associated with designated coordinates. Here, the "predetermined movement" is the movement in which the moderator of the conference points to a position on the synthesized video displayed on the screen 518a of the display 518 of the first terminal 1-1, and the "designated coordinates" are the two-dimensional coordinates (x, y) representing that position on the screen. The origin of the two-dimensional coordinates is, for example, the lower-left corner of the screen 518a.

The motion detection result is output at regular intervals. In the example of FIG. 6, "FALSE", indicating that the predetermined movement was not made, is output the first and third times, and "TRUE", indicating that it was made, is output the second and fourth times. For the second and fourth outputs, "x = 100, y = 150" and "x = 300, y = 120", respectively, are output as the designated coordinates.
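The record structure of the motion detection result can be sketched in Python as follows. The class name `MotionDetectionResult`, its field names, and the helper `designated_points` are hypothetical names chosen for illustration; the four sample records reproduce the values described for FIG. 6.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MotionDetectionResult:
    """One periodic output: a flag plus coordinates when the flag is set."""
    detected: bool                            # TRUE/FALSE in FIG. 6
    coords: Optional[Tuple[int, int]] = None  # (x, y) on the screen, if detected


def designated_points(results: List[MotionDetectionResult]) -> List[Tuple[int, int]]:
    """Keep only the coordinates of outputs in which the movement was detected."""
    return [r.coords for r in results if r.detected and r.coords is not None]


# The four outputs described for FIG. 6.
samples = [
    MotionDetectionResult(False),
    MotionDetectionResult(True, (100, 150)),
    MotionDetectionResult(False),
    MotionDetectionResult(True, (300, 120)),
]
points = designated_points(samples)
print(points)
```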

Here, the "designated coordinates" can be calculated from the positional relationship between the orientation of the moderator's hand and the screen of the display 518. That is, when the moderator's hand 40 and the screen 518a of the display 518 are in the relationship shown in FIGS. 7A and 7B, the point (x, y) on the screen 518a is obtained by the following Equations [1] and [2].
$x = z_h \tan\theta_h + x_h$ … Equation [1]
$y = z_h \tan\phi_h + y_h$ … Equation [2]

Here, the three-dimensional coordinates (x_h, y_h, z_h) and the orientation (θ_h, φ_h) of the hand 40 are obtained from, for example, the video acquired by the video acquisition means 101 and information from a distance sensor.
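
Equations (1) and (2) can be evaluated as in the minimal sketch below. Two things are assumptions made here rather than statements from the text: that Z_h in the equations is the hand's z-coordinate z_h (its distance from the screen plane), and that the angles are given in radians. The function name is hypothetical.

```python
import math
from typing import Tuple


def designated_coordinates(x_h: float, y_h: float, z_h: float,
                           theta_h: float, phi_h: float) -> Tuple[float, float]:
    """Evaluate equations (1) and (2) for one hand pose.

    Assumptions: z_h stands in for Z_h, and theta_h / phi_h are in radians.
    """
    x = z_h * math.tan(theta_h) + x_h  # equation (1)
    y = z_h * math.tan(phi_h) + y_h    # equation (2)
    return x, y
```

With both angles zero the hand points straight at the screen, so the result is simply (x_h, y_h).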

When participants other than the moderator are present at the site where the first terminal 1-1 is installed, the motion detection means 106 needs to distinguish the moderator from the other participants and treat only the moderator as the detection target. Methods of distinguishing include determining the moderator's position in advance (for example, treating the person closest to the center of the camera 521 as the moderator), specifying the moderator's position before the conference starts, and identifying the moderator by face recognition.

<Details of the operation of the video synthesis means and of the content of the composite video>
FIG. 8 is a diagram for explaining the video layout information and the composite video in FIG. 5. FIG. 8A shows the video layout information, and FIG. 8B shows the on-screen arrangement of the composite video corresponding to the video layout information shown in FIG. 8A. In these figures, terminal numbers 1, 2, and 3 denote the numbers of any three terminals in FIG. 1.

The video layout information is information (layout information) indicating how the video synthesis means 202 combines the videos of the terminals. As shown in FIG. 8A, it consists of a terminal number, information indicating the placement position (the coordinates of the upper left corner and of the lower right corner), and information indicating the stacking order. The composite video displayed on the screen 518a of the display 518 based on the video layout information shown in FIG. 8A is as shown in FIG. 8B.

By comparing the motion detection result with the video layout information of the composite video transmitted to the moderator terminal 1-1, the emphasis target selection means 208 can determine which terminal's video was displayed at the designated coordinates, and it selects the terminal to be emphasized based on that determination. For example, if the designated coordinates of the motion detection result are (100, 100) for the video layout information shown in FIG. 8A, the second terminal 1-2, whose display area contains those coordinates, is selected as the emphasis target. When the motion detection result indicates that the predetermined motion was not detected, the emphasis target selection means 208 selects no emphasis target, so information indicating that no emphasis target is selected is transmitted as the “emphasis target information”.
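
The lookup described above amounts to a hit test of the designated coordinates against the layout rectangles. The sketch below is one possible reading, not the patented implementation. The actual values of FIG. 8A are not reproduced in the text, so `EXAMPLE_LAYOUT` is a hypothetical 640 × 480 layout chosen only so that (100, 100) falls inside terminal 2. It is also assumed that the layout shares the lower-left origin of the designated coordinates and that a larger stacking-order value means a video drawn closer to the front.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Placement:
    terminal: int
    top_left: Point      # (x, y) of the upper left corner, origin at lower left
    bottom_right: Point  # (x, y) of the lower right corner
    order: int           # stacking order; assumed: larger value is nearer the front


# Hypothetical layout (not the values of FIG. 8A).
EXAMPLE_LAYOUT = [
    Placement(1, (0, 480), (320, 240), 1),
    Placement(2, (0, 240), (320, 0), 2),
    Placement(3, (320, 480), (640, 0), 3),
]


def select_emphasis_target(layout: List[Placement], detected: bool,
                           coords: Optional[Point]) -> Optional[int]:
    """Return the terminal shown at coords, or None if nothing is selected."""
    if not detected or coords is None:
        return None
    x, y = coords
    hits = [p for p in layout
            if p.top_left[0] <= x <= p.bottom_right[0]
            and p.bottom_right[1] <= y <= p.top_left[1]]
    if not hits:
        return None
    # Where videos overlap, take the one assumed to be drawn on top.
    return max(hits, key=lambda p: p.order).terminal
```

Returning `None` plays the role of the emphasis target information that indicates no target is selected.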

The video control means 209 and the audio control means 210 send a video control signal for emphasizing the video, and an audio control signal for emphasizing the audio, of the terminal selected as the emphasis target to the video synthesis means 202 and the audio synthesis means 205, respectively. That is, the video control means 209 and the audio control means 210 function as the video enhancement means and the audio enhancement means according to the present invention, respectively. One way to emphasize the video is, for example, to enlarge its display area. One way to emphasize the audio is, for example, to mute the audio from terminals other than the emphasis target terminal. When this audio muting is performed, the right to speak has in effect been given to the conference participant using the emphasis target terminal. When no emphasis target is selected, the video control signal and the audio control signal are left at their current values.
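
The muting example above can be sketched as a per-terminal gain table that the audio synthesis step applies when mixing. The function names and the gain values 1.0 and 0.0 are assumptions for illustration. The sketch also reads "left at their current values" as meaning that the previous gains are kept when no target is selected.

```python
from typing import Dict, Optional


def update_audio_gains(current: Dict[int, float],
                       target: Optional[int]) -> Dict[int, float]:
    """Return per-terminal mixing gains after one emphasis target update."""
    if target is None:
        # No emphasis target selected: keep the parameters as they are.
        return dict(current)
    # Mute every terminal except the emphasis target.
    return {t: (1.0 if t == target else 0.0) for t in current}


def mix_sample(samples: Dict[int, float], gains: Dict[int, float]) -> float:
    """Mix one audio sample per terminal using the given gains."""
    return sum(gains[t] * s for t, s in samples.items())
```

Enlarging the display area of the emphasized video could be driven by a similar table of per-terminal scale factors.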

The emphasis target can also be selected per person rather than per terminal. In that case, the video layout information obtained from the video synthesis means 202 needs to include position information on the persons in the video. Methods of obtaining a person's position information include, for example, fixing the positions in advance and using face recognition or person recognition. The emphasis target selection means 208 selects, as the emphasis target, the person displayed closest to the designated coordinates of the motion detection result. Methods of emphasizing video and audio per person include, for example, cropping and enlarging the display area of the target person and controlling directivity by microphone array processing.
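
Per-person selection reduces to a nearest-neighbor search over the person positions. In the sketch below, the person labels and positions are made up for illustration, and Euclidean distance on screen coordinates is an assumed choice of "closest".

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def nearest_person(people: Dict[str, Point], coords: Point) -> str:
    """Return the label of the person displayed closest to coords.

    people maps a hypothetical person label to that person's position in the
    composite video, in the same coordinate system as the designated coordinates.
    """
    return min(people, key=lambda name: math.dist(people[name], coords))


# Hypothetical person positions in the composite video.
PEOPLE = {"A": (80.0, 200.0), "B": (250.0, 130.0), "C": (500.0, 300.0)}
```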

<Cancellation of the emphasis process>
FIG. 9 is a diagram for explaining the operation of canceling the emphasis process in the remote conference system according to the first embodiment of the present invention.

To cancel the emphasis process, a separate cancellation motion may be newly defined and used, for example clapping the hands. In that case, the motion detection result is, for example, as shown in FIG. 9. A motion number is added to the motion detection result shown in FIG. 6, and the video control means 209 and the audio control means 210 perform the control corresponding to the designated emphasis target and the motion number.

In this figure, the second motion detection result represents a motion designating, as the emphasis target, the terminal at the designated coordinates “x = 100, y = 150” (the terminal with terminal number 2 in the case of FIG. 8B). That is, it indicates that the moderator was detected pointing to the terminal with terminal number 2 in the composite video displayed on the display 518 of the first terminal 1-1. The fourth motion detection result represents the motion canceling the emphasis process. That is, it indicates that the moderator was detected performing, for example, a hand clap.
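
The sequence described for FIG. 9 can be traced with a small state update, sketched below. The numeric motion numbers (`DESIGNATE = 1`, `RELEASE = 2`) are assumptions, since the text does not give the values used in FIG. 9. Keeping the current target when nothing is detected, or when the designated point hits no terminal, is likewise an assumed policy. `locate` stands in for the lookup against the video layout information.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]

# Hypothetical motion numbers; the values in FIG. 9 are not given in the text.
DESIGNATE = 1
RELEASE = 2


def next_target(current: Optional[int], detected: bool, motion: Optional[int],
                coords: Optional[Point],
                locate: Callable[[Point], Optional[int]]) -> Optional[int]:
    """Return the emphasized terminal after one motion detection result."""
    if not detected:
        return current                  # nothing detected: keep the current state
    if motion == RELEASE:
        return None                     # cancellation motion (e.g. a hand clap)
    if motion == DESIGNATE and coords is not None:
        hit = locate(coords)
        return hit if hit is not None else current
    return current
```

Feeding in the four results of FIG. 9 in order (not detected, designate at (100, 150), not detected, release) would emphasize terminal 2 from the second result until the fourth.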

<Operation of the remote conference system>
FIG. 10 is a flowchart showing the operation of the remote conference system according to the first embodiment of the present invention.

First, in the moderator terminal (first terminal 1-1), the video acquisition means 101, the audio acquisition means 103, and the sensor information acquisition means 105 acquire video, audio, and sensor information, respectively (step S1). Next, the motion detection means 106 determines, from the video and sensor information acquired in step S1, whether the predetermined motion has been detected (step S2).

When the motion detection means 106 detects the predetermined motion (step S2: Yes), it calculates the coordinates pointed to by the predetermined motion (the designated coordinates) from the video and sensor information acquired in step S1 and adds them to the motion detection result (step S3).

Next, the detection result transmission means 107 transmits the video and audio acquired in step S1 and the motion detection result obtained in steps S2 and S3 to the video and audio synthesis control server 2 (step S4).

When the motion detection means 106 does not detect the predetermined motion (step S2: No), the video and audio acquired in step S1 and a motion detection result containing information indicating that the predetermined motion was not performed (“FALSE” in FIG. 6) are transmitted to the video and audio synthesis control server 2 (step S4).

The video and audio synthesis control server 2 receives the video, audio, and motion detection result from the first terminal 1-1 (step S5). Next, the emphasis target selection means 208 determines whether the motion detection result indicates that the predetermined motion was detected, that is, whether it contains the information indicating detection of the predetermined motion (“TRUE” in FIG. 6) (step S6).

If the predetermined motion was detected (step S6: Yes), the emphasis target selection means 208 selects the terminal to be emphasized from the motion detection result and the video layout information from the video synthesis means 202 (step S7). At this time, emphasis target information representing the result of the selection is supplied to the video control means 209 and the audio control means 210.

Next, the video control means 209 and the audio control means 210 adjust their respective synthesis parameters so as to emphasize the video and audio of the terminal selected in step S7 (step S8).

The video synthesis means 202 and the audio synthesis means 205 synthesize the video and the audio to be transmitted to each terminal based on the parameters adjusted by the video control means 209 and the audio control means 210 (step S9).

If the emphasis target selection means 208 finds that the predetermined motion was not detected (step S6: No), the video control means 209 and the audio control means 210 leave the control signal parameters at their current values, and the video synthesis means 202 and the audio synthesis means 205 use those unchanged parameters to synthesize the video and the audio to be transmitted to each terminal (step S9).

The video transmission means 203 and the audio transmission means 206 transmit the video and audio synthesized in step S9 to each terminal (step S10). In each terminal, the video reception means 108 and the audio reception means 110 receive the video and audio, respectively, from the video and audio synthesis control server 2 (step S11). Then the video output means 109 displays the video, and the audio output means 111 plays back the audio (step S12).

Terminals other than the moderator terminal (second terminal 1-2, ..., Nth terminal 1-N) operate as follows: “acquire video and audio” (step S1 with “video, audio, and sensor information” replaced by “video and audio”) → “transmit video and audio to the server” (step S4 with “video, audio, and motion detection result” replaced by “video and audio”) → “receive video and audio from the server” (same as step S11) → “display and play back the video and audio” (same as step S12).

As described above, the remote conference system according to the first embodiment of the present invention has the following features (1) to (5).
(1) The right to speak can be set, and video and audio can be emphasized, by a natural motion of the moderator.
(2) By using, as the video layout information, layout information indicating how the videos from the terminals are combined, the emphasis target can be selected per site.
(3) By including in the video layout information person position information indicating where the participants are located in the video from each terminal, the emphasis target can be selected per person.
(4) The emphasis process can also be canceled by a natural motion of the moderator.
(5) It becomes easier for the moderator to manage the progress of the conference.

[Second Embodiment]
<Overall configuration of the remote conference system>
FIG. 11 is a diagram for explaining the overall configuration of the remote conference system according to the second embodiment of the present invention. In this figure, parts that are the same as in FIG. 1 (the overall configuration of the remote conference system according to the first embodiment) are given the same reference numerals as in FIG. 1, and their description is omitted unless particularly necessary.

This remote conference system is configured by connecting N terminals, namely a first terminal 4-1, a second terminal 4-2, ..., and an Nth terminal 4-N, to the network 3. The first terminal 4-1 is the moderator terminal.

The main difference from the first embodiment is that the terminals transmit and receive video and audio to and from one another, and the synthesis processing performed by the video and audio synthesis control server 2 in the first embodiment is performed within each terminal.

That is, the first terminal 4-1 transmits video and audio 10-1 to the second terminal 4-2, ..., and the Nth terminal 4-N, and receives video and audio 10-2, ..., 10-N from the second terminal 4-2, ..., and the Nth terminal 4-N. The second terminal 4-2 transmits video and audio 10-2 to the first terminal 4-1, the third terminal 4-3, ..., and the Nth terminal 4-N, and receives video and audio 10-1, 10-3, ..., 10-N from the first terminal 4-1, the third terminal 4-3, ..., and the Nth terminal 4-N. Likewise, the Nth terminal 4-N transmits video and audio 10-N to the first terminal 4-1, ..., and the (N−1)th terminal 4-(N−1), and receives video and audio 10-1, ..., 10-(N−1) from the first terminal 4-1, ..., and the (N−1)th terminal 4-(N−1).
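
The exchange pattern described above is a full mesh: every terminal sends its own video and audio to all of the others. A minimal sketch follows, with terminals numbered 1 to N as in the text. The function names are introduced here for illustration only.

```python
from typing import List


def recipients(sender: int, n_terminals: int) -> List[int]:
    """Terminals that receive video and audio 10-<sender> in the full mesh."""
    return [t for t in range(1, n_terminals + 1) if t != sender]


def total_streams(n_terminals: int) -> int:
    """Number of one-way video and audio streams across the whole mesh."""
    return sum(len(recipients(s, n_terminals)) for s in range(1, n_terminals + 1))
```

Each of the N terminals sends to N − 1 peers, so the stream count grows as N(N − 1). This is the cost of doing without the synthesis server of the first embodiment.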

The first terminal 4-1, which is the moderator terminal, also transmits emphasis target information 13 to the second terminal 4-2, ..., and the Nth terminal 4-N. The emphasis target information 13 is the same as the emphasis target information generated by the emphasis target selection means 208 of the video and audio synthesis control server 2 in the first embodiment.

<Hardware configuration of the terminals>
The hardware configuration of each terminal is the same as that of the terminals in the first embodiment (FIG. 2).

<Functional block diagram of the moderator terminal>
FIG. 12 is a functional block diagram of the moderator terminal in the remote conference system according to the second embodiment of the present invention.

The first terminal 4-1, which is the moderator terminal, includes video acquisition means 401, video transmission means 402, audio acquisition means 403, audio transmission means 404, sensor information acquisition means 405, motion detection means 406, emphasis target selection means 407, emphasis target transmission means 408, video reception means 409, video synthesis means 410, video output means 411, audio reception means 412, audio synthesis means 413, audio output means 414, video control means 415, and audio control means 416.

Here, the video acquisition means 401, video transmission means 402, audio acquisition means 403, audio transmission means 404, sensor information acquisition means 405, motion detection means 406, video reception means 409, video synthesis means 410, video output means 411, audio reception means 412, and audio output means 414 have the same configuration and functions as the means of the same names in FIG. 4 (the functional block diagram of the moderator terminal in the first embodiment), except that the video reception means 409 receives videos 2, ..., N from the other terminals and the audio reception means 412 receives audios 2, ..., N from the other terminals. Video 2 and audio 2 constitute video and audio 10-2, and video N and audio N constitute video and audio 10-N.

In FIG. 12, the emphasis target selection means 407, video synthesis means 410, audio synthesis means 413, video control means 415, and audio control means 416 have the same configuration and functions as the means of the same names in FIG. 5 (the functional block diagram of the video and audio synthesis control server 2 in the first embodiment).

The emphasis target transmission means 408 is means for transmitting the emphasis target information generated by the emphasis target selection means 407 to the second terminal 4-2, ..., and the Nth terminal 4-N via the network 3.

<Functional block diagram of terminals other than the moderator terminal>
FIG. 13 is a functional block diagram of the second terminal 4-2, which is one of the terminals other than the moderator terminal, in the remote conference system according to the second embodiment of the present invention. In this figure, means that are the same as in FIG. 12 are given the same reference numerals as in FIG. 12, and their description is omitted unless particularly necessary. Although the functional block diagram of the second terminal 4-2 is shown here for convenience, the functional blocks of the third terminal 4-3, ..., and the Nth terminal 4-N are the same as those of the second terminal 4-2.

The second terminal 4-2 has a configuration obtained by removing the sensor information acquisition means 405, motion detection means 406, emphasis target selection means 407, and emphasis target transmission means 408 from the first terminal 4-1 and adding emphasis target reception means 417. However, as in the first embodiment, all terminals may be given the same configuration and controlled so that the unused means do not operate.

<Operation of the remote conference system>
<<Transmission operation of the moderator terminal and reception operation of the other terminals>>
FIG. 14 is a flowchart showing the transmission operation of the moderator terminal and the reception operation of the terminals other than the moderator terminal in the remote conference system according to the second embodiment of the present invention.

First, in the moderator terminal (first terminal 4-1), the video acquisition means 401, the audio acquisition means 403, and the sensor information acquisition means 405 acquire video, audio, and sensor information, respectively (step S21). Next, the motion detection means 406 determines, from the video and sensor information acquired in step S21, whether the predetermined motion has been detected (step S22).

When the motion detection means 406 detects the predetermined motion (step S22: Yes), it calculates the coordinates pointed to by the predetermined motion (the designated coordinates) from the video and sensor information acquired in step S21 and adds them to the motion detection result (step S23).

That is, steps S21 to S23 are the same as steps S1 to S3 in FIG. 10 (the flowchart showing the operation of the first embodiment).

In the next step S24, the emphasis target selection means 407 selects the terminal to be emphasized from the motion detection result of step S23 and the video layout information supplied from the video synthesis means 410, and generates emphasis target information as the selection result. Next, the emphasis target transmission means 408 transmits the audio and video acquired in step S21 and the emphasis target information generated in step S24 to each of the other terminals (step S25). When the process proceeds directly from step S22 to step S25, that is, when the predetermined motion was not detected, the emphasis target information indicates that no emphasis target is selected.

In each terminal other than the moderator terminal (the other terminals), the video reception means 409 and the audio reception means 412 receive the video and audio, respectively, transmitted from all terminals other than the terminal itself, and the emphasis target reception means 417 receives the emphasis target information transmitted from the moderator terminal (step S26).

Next, the video control means 415 and the audio control means 416 determine whether an emphasis target is selected, that is, whether the emphasis target information contains information identifying a selected terminal (step S27).

If an emphasis target is selected (step S27: Yes), the video control means 415 and the audio control means 416 adjust their respective synthesis parameters so as to emphasize the video and audio of the selected terminal (step S28). Next, the video synthesis means 410 and the audio synthesis means 413 synthesize the video and the audio from the terminals based on the parameters adjusted by the video control means 415 and the audio control means 416 (step S29).

If no emphasis target is selected (step S27: No), the video control means 415 and the audio control means 416 leave the control signal parameters at their current values, and the video synthesis means 410 and the audio synthesis means 413 use those unchanged parameters to synthesize the video and the audio from the terminals (step S29).

Next, the video output means 411 displays the composite video, and the audio output means 414 plays back the composite audio (step S30).

<<Transmission operation of the terminals other than the moderator terminal and reception operation of the moderator terminal>>
FIG. 15 is a flowchart showing the transmission operation of the terminals other than the moderator terminal and the reception operation of the moderator terminal in the remote conference system according to the second embodiment of the present invention.

In each terminal other than the moderator terminal (the other terminals), the video acquisition means 401 and the audio acquisition means 403 acquire video and audio, respectively (step S31), and the video transmission means 402 and the audio transmission means 404 transmit the video and audio acquired by the video acquisition means 401 and the audio acquisition means 403 to each terminal (step S32).

In the moderator terminal, the video reception means 409 and the audio reception means 412 receive the video and audio, respectively, transmitted from each terminal (step S33). Next, the video control means 415 and the audio control means 416 determine whether an emphasis target is selected, that is, whether the emphasis target information contains information identifying a selected terminal (step S34). This emphasis target information is the same as that transmitted in step S25 of FIG. 14. The subsequent steps S35 to S37 are the same as steps S28 to S30 in FIG. 14.

In each of the above embodiments, the motion detection means 106 and 406 detect the moderator's predetermined motion without contact, using the video acquired by the camera 521 and information from the distance sensor. However, the predetermined motion can also be detected by using a display equipped with a touch panel and detecting the position that the moderator touches with a finger, a stylus, or the like. The predetermined motion can also be detected by detecting the moderator's line of sight and detecting (calculating) the position on the video at which that line of sight is directed.

DESCRIPTION OF SYMBOLS 1 ... terminal; 1-1, 4-1 ... first terminal; 1-2, 4-2, ..., 1-N, 4-N ... second, ..., Nth terminals; 106, 406 ... motion detection means; 208, 407 ... emphasis target selection means; 209, 415 ... video control means; 210, 416 ... audio control means

JP 2012-217068 A
JP H06-121309 A

Claims

A remote conference system that communicates information including video between conference terminals installed at a plurality of sites, the remote conference system comprising:
position detection means for detecting, when a conference participant using a predetermined one of the conference terminals designates a desired position in the video displayed on that conference terminal by a predetermined first operation, the designated position;
site detection means for detecting the site displayed at the position designated by the conference participant using the predetermined one conference terminal, from the position detected by the position detection means and information indicating the positions, within the video displayed on the predetermined one conference terminal, of the videos of the sites of the other conference terminals; and
video enhancement means for enhancing the video of the site detected by the site detection means.

The remote conference system according to claim 1,
wherein the information indicating the positions of the videos of the sites includes information indicating the positions of the videos of the conference participants at the sites.

The remote conference system according to claim 1,
further comprising audio enhancement means for enhancing the audio of the site detected by the site detection means.

The remote conference system according to claim 1,
wherein the first operation is a non-contact operation on the video, and the position detection means detects the designated position without contact.

The remote conference system according to claim 1,
further comprising video enhancement cancellation means for canceling the enhancement of the video when the conference participant using the predetermined one conference terminal performs a second operation different from the first operation.

A video processing method in a remote conference system that communicates information including video between conference terminals installed at a plurality of bases, the method comprising:
a position detection step of detecting, when a conference participant using a predetermined one of the conference terminals indicates a desired position in the video displayed on that conference terminal by a predetermined operation, the indicated position;
a base detection step of detecting the base displayed at the position indicated by the conference participant using the predetermined one conference terminal, from the position detected in the position detection step and information indicating the positions of the videos of the bases of the other conference terminals within the video displayed on the predetermined one conference terminal; and
a video enhancement step of enhancing the video of the base detected in the base detection step.
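
The three steps of the method can be illustrated with a minimal sketch. It assumes that the information indicating the positions of the videos of the bases is a list of rectangles within the composite video, and that enhancement is expressed as a display scale factor; neither assumption is stated in the claims. The names Region, detect_base, and enhancement_scales are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    """Rectangle (in pixels) that one base's video occupies in the composite video.

    Hypothetical layout record; the claims only require "information
    indicating the position of the video of the bases".
    """
    base_id: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        # Half-open intervals so that adjacent tiles never both match.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def detect_base(position: Tuple[int, int], layout: List[Region]) -> Optional[str]:
    """Base detection step: map the indicated position to the base shown there.

    `position` is the output of the position detection step (how it is
    obtained, e.g. by a non-contact gesture sensor, is outside this sketch).
    Returns None when the position falls outside every base's video.
    """
    px, py = position
    for region in layout:
        if region.contains(px, py):
            return region.base_id
    return None


def enhancement_scales(layout: List[Region],
                       selected: Optional[str],
                       scale: float = 1.5) -> Dict[str, float]:
    """Video enhancement step, modeled here as a per-base scale factor.

    Assumption: the selected base is enlarged by `scale` and all others
    are left at 1.0. Passing selected=None leaves every base unenhanced.
    """
    return {r.base_id: (scale if r.base_id == selected else 1.0) for r in layout}
```

With a 2 x 2 grid of 320 x 240 tiles, a position indicated inside the upper-right tile resolves to the base shown in that tile, and only that base receives a scale factor other than 1.0. Calling enhancement_scales with None as the selected base corresponds to a state in which no video is enhanced.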

A video control device used in a remote conference system that communicates information including video between conference terminals installed at a plurality of bases, the device comprising:
video transmission means for transmitting, to each conference terminal, the videos of the bases of the other conference terminals;
detection result receiving means for receiving, from a predetermined one of the plurality of conference terminals, a detection result of a designated position when a conference participant using that conference terminal designates, by a predetermined operation, a desired position in the video transmitted from the video transmission means and displayed on that conference terminal;
base detection means for detecting the base displayed at the position designated by the conference participant using the predetermined one conference terminal, from the detection result received by the detection result receiving means and information indicating the positions of the videos of the bases of the other conference terminals within the video displayed on the predetermined one conference terminal; and
video enhancement means for enhancing the video of the base detected by the base detection means.

A program for causing a computer to function as each means of the video control device according to claim 7.

A conference terminal used in a remote conference system that communicates information including video between conference terminals installed at a plurality of bases, the conference terminal comprising:
video output means for displaying the videos of the bases of the other conference terminals;
position detection means for detecting, when a conference participant indicates a desired position in the video displayed by the video output means by a predetermined operation, the indicated position;
base detection means for detecting the base displayed at the position indicated by the conference participant, from the position detected by the position detection means and information indicating the positions of the videos of the bases of the other conference terminals displayed by the video output means; and
video enhancement means for enhancing the video of the base detected by the base detection means.

A program for causing a computer to function as the position detection means, the base detection means, and the video enhancement means of the conference terminal according to claim 9.