JP2009049734A - Camera-mounted microphone and control program thereof, and video conference system

Info

Publication number: JP2009049734A
Application number: JP2007214284A
Authority: JP
Inventors: Yasuhiro Kodama; 康広小玉; Yasuhiko Kato; 靖彦加藤; Jo Matsui; 丈松井; Nobuyuki Kihara; 信之木原; Hideki Kishi; 秀樹岸; Yohei Sakuraba; 洋平櫻庭; Takayoshi Kawaguchi; 貴義川口
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2007-08-21
Filing date: 2007-08-21
Publication date: 2009-03-05

Abstract

PROBLEM TO BE SOLVED: To allow a speaker to accurately confirm that he or she is located within the directivity range of a microphone.

SOLUTION: A camera-mounted microphone 1 has a microphone 11 that is attached to a body housing 10 and is unidirectional, and a camera 12 that is attached to the body housing 10 and has an angle of view substantially corresponding to the unidirectional range of the microphone 11. Because the speaker is captured by the camera 12, the speaker can confirm that he or she is located within the directivity range of the microphone.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a camera-equipped microphone in which a microphone and a camera are provided in a housing, a control program for the camera-equipped microphone, and a video conference system.

A video conference system provides a realistic conference by exchanging video and audio bidirectionally between remote conference rooms and presenting the video and audio of the other party's speaker on a monitor. For pointing a camera at multiple participants in a conference room, a technique has been disclosed that controls the direction of the camera according to the position from which a microphone is picking up sound (see, for example, Patent Document 1).

Patent Document 1: JP 2002-171499 A

However, when a directional microphone is used in a video conference system or the like, it is difficult for a speaker to know whether he or she is actually within the directivity range. In addition, when multiple directional microphones are used and a noise source lies within the directivity range of one of them, the sound input from that microphone makes the wanted sound harder to hear.

Furthermore, when a video conference is held with a main camera, equipped with multiple microphones for sound source direction estimation, pointed toward the speaker, one conceivable method is to determine the approximate direction by audio processing, for example sound source direction estimation using the correlation function between the input sounds of the estimation microphones (for example, Toshiyuki Morita, "Examination of an automatic surveillance camera using first-wavefront sound source direction detection"), and then to point the main camera toward the speaker by face detection using pattern recognition in image processing (for example, Shigeru Akamatsu, "A survey of face recognition by computer"). The accuracy of such methods, however, is not yet sufficient.

Moreover, when the speaker is to be visually emphasized using only the video from the main camera, the portion of the video in which the speaker appears must be extracted accurately, which is a further problem.

The present invention has been made to solve these problems. Specifically, the present invention is a camera-equipped microphone comprising a unidirectional microphone provided in a main body housing, and a camera provided in the same main body housing and having an angle of view substantially equal to the unidirectional range of the microphone.

In this invention, a unidirectional microphone and a camera are provided in the same main body housing, and the camera has an angle of view substantially equal to the unidirectional range of the microphone. A speaker can therefore tell that he or she is within the directivity range of the microphone from the fact that his or her image is being captured by the camera.

Here, the unidirectionality of the microphone used in the present invention means the property of having a sound pickup gain at or above a certain level, mainly at frequencies corresponding to a typical speaker's voice, only in a region spanning a specific angle from the main body housing. A camera angle of view substantially equal to the unidirectionality of the microphone covers not only the case where it coincides with the unidirectional range of the microphone but also the case where the microphone can pick up the sound whenever the camera is capturing the image.

The present invention is also a program by which a computer controls a camera-equipped microphone comprising a unidirectional microphone provided in a main body housing and a camera provided in the main body housing and having an angle of view within the unidirectional range of the microphone. The program causes the computer to execute a step of recognizing a face image in the video captured by the camera and changing the center position of the unidirectionality of the microphone based on the position of the recognized face.

In this invention, a face image is recognized in the video captured by the camera of the camera-equipped microphone, and the center position of the unidirectionality of the microphone is moved to the position of that face. Sound can therefore be picked up reliably by the microphone even when the speaker is displaced from the imaging center of the camera.

If no face image can be recognized in the video captured by the camera, sound pickup by the microphone can be suppressed, which avoids picking up unwanted sound when no speaker is within the angle of view of the camera.

The present invention is also a video conference system in which a plurality of camera-equipped microphones, each having in its main body housing a unidirectional microphone and a camera with an angle of view within the unidirectional range of that microphone, are arranged in a predetermined layout, and in which a main camera captures video of the participants using the camera-equipped microphones. The system has control means that identifies the camera-equipped microphone that is picking up sound, moves the shooting direction of the high-image-quality main camera toward its position, recognizes a face image in the video captured by the camera of that camera-equipped microphone, and adjusts the shooting direction of the main camera based on the recognition result.

In this invention, in a video conference system having a plurality of camera-equipped microphones and a main camera, the shooting direction of the main camera is moved to the position of whichever camera-equipped microphone is picking up sound, and a face image is recognized in the video captured by the camera of that camera-equipped microphone. The shooting direction of the main camera can thereby be aligned accurately with the position of that face.

In this video conference system, each of the camera-equipped microphones successively captures images of its participant's face with its camera and sends them to the control means. The control means stores the face images sent from each camera-equipped microphone, overwriting the previous ones, and uses the stored face images for the face recognition performed when the shooting direction of the main camera is adjusted. Because recognition always uses a recent face image, the main camera can be pointed at the correct position.

The present invention is also a video conference system in which a plurality of camera-equipped microphones, each having in its main body housing a unidirectional microphone and a camera with an angle of view within the unidirectional range of that microphone, are arranged in a predetermined layout, and in which video captured by the cameras of the camera-equipped microphones is displayed on a monitor. The system has control means that, when the videos captured by those cameras are displayed side by side on the monitor, displays the video captured by the camera of the camera-equipped microphone with the highest sound pickup level larger than the videos captured by the cameras of the other camera-equipped microphones.

In this invention, when the videos captured by the cameras of the camera-equipped microphones are displayed side by side on the monitor, the video captured by the camera of the camera-equipped microphone that is picking up sound is displayed larger than the other videos, which makes the speaker's video stand out.

The speaker's video can likewise be made to stand out by displaying the video captured by the camera of the camera-equipped microphone that is picking up sound brighter than the other videos.

When the camera-equipped microphones are cascade-connected, displaying the videos captured by their cameras in a single horizontal row on the monitor, in the order of the cascade connection, makes it possible to show the videos of multiple participants as a panorama.

Therefore, using the camera-equipped microphone of the present invention, a speaker can accurately recognize that he or she is within the directivity range of the microphone, and the position of the speaker can be captured accurately from the video captured by the camera of the camera-equipped microphone so that the shooting direction of the main camera can be set accurately. Moreover, because the position of the speaker can be identified accurately, the video to be emphasized can be identified accurately when the speaker's display on the monitor is emphasized, and a realistic video conference can be realized.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

<Configuration of the camera-equipped microphone>
FIG. 1 is a schematic diagram illustrating the camera-equipped microphone according to the present embodiment, where (a) is a plan view, (b) is a side view, and (c) is a top view. The camera-equipped microphone 1 is used with its main body housing 10 placed on a table, and has a microphone 11 provided in the main body housing 10 and a camera 12 provided in the same main body housing 10.

In the present embodiment, the microphone 11 is unidirectional and has a strong sound pickup gain on one side of the main body housing 10. That is, the range in which it has a certain sound pickup gain, mainly at frequencies corresponding to a typical speaker's voice, lies on one side of the main body housing 10. Its unidirectionality therefore spans a certain angular range.

The camera 12 of the camera-equipped microphone 1 of the present embodiment has an angle of view substantially equal to the unidirectional range of the microphone 11. Consequently, anything within the range captured by the camera 12 is always within the directivity range of the microphone 11.

In the example shown in FIG. 1, the microphone 11 is disposed at the center of the upper surface of the main body housing 10 and has the directivity indicated by the one-dot chain line in the figure. The camera 12 is disposed in the upper part of the upper surface of the main body housing 10 and has the angle of view indicated by the two-dot chain line in the figure. As in this example, the microphone 11 and the camera 12 are arranged adjacent to each other, and the directivity angle and the angle of view are set substantially parallel in both the vertical and horizontal directions, which gives the camera 12 an angle of view substantially equal to the unidirectional range of the microphone 11.

Here, an angle of view of the camera 12 substantially equal to the unidirectionality of the microphone 11 covers not only the case where it coincides with the unidirectional range of the microphone 11 but also the case where the microphone 11 can pick up the speaker's voice whenever the camera 12 is capturing the speaker's face. The angle of view of the camera 12 may also be made slightly narrower than the directivity of the microphone 11, so that it lies entirely inside the directivity range of the microphone 11; this makes it more certain that the speaker is within the directivity range of the microphone 11.

If the video captured by the camera of this camera-equipped microphone 1 is displayed on a monitor, a speaker whose image is being captured by the camera can recognize that he or she is within the directivity range of the microphone 11. Conversely, a speaker whose image does not appear can easily recognize that he or she is outside the directivity range of the microphone 11.

<Control program for accurately capturing a speaker's voice using the camera-equipped microphone of the present embodiment>
The control program for the camera-equipped microphone of the present embodiment configured as described above is realized by program processing executed in a control unit to which the camera-equipped microphone is connected. The control unit is provided in the system main body of the video conference system, but it may also be built into the main body housing of the camera-equipped microphone.

(First control program)
The first control program changes the direction of the microphone's directivity based on the video from the camera-equipped microphone in order to make the speaker's voice clearer. This control program is executed repeatedly at regular time intervals.

First, face detection is performed by image processing such as pattern recognition on the video captured by the camera of the camera-equipped microphone. If an odd number of faces is detected, the directivity of the microphone is adjusted so that the middle face becomes the center of directivity. If an even number of faces is detected, the directivity is adjusted so that the midpoint between the two middle faces becomes the center of directivity.

The center of directivity of the microphone can be changed either by rotating the microphone mechanically or by changing the direction of directivity electrically.

FIG. 2 is a flowchart illustrating the flow of the first control program. First, video is captured by the camera of the camera-equipped microphone, and face detection is performed by image processing (step S101). Let m be the number of faces detected, and let F0, F1, ..., Fm be the center coordinates on the screen of the detected faces.

Next, it is determined whether m > 0 (step S102), that is, whether at least one face has been detected. If m > 0 does not hold (no face has been detected), the processing ends. If m > 0 (one or more faces have been detected), it is determined whether m is odd or even (step S103).

If m is odd, the center of directivity of the microphone is steered to the center coordinates of the detected face closest to the origin (0, 0) on the screen (step S104).

If m is even, the center of directivity is steered to the point midway between the center coordinates of the detected face closest to the origin (0, 0) on the screen and those of the next closest face (step S105). The processing for even m (step S105) may, however, be made the same as that for odd m (step S104). In that case the odd/even determination (step S103) is unnecessary, and the process always proceeds to step S104 whenever m > 0 (step S102).
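The selection in steps S102 to S105 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `directivity_target` and the representation of face centers as `(x, y)` tuples measured from the screen origin (0, 0) are assumptions, and the face detection of step S101 is taken as already done.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # assumed: face center (x, y) relative to the screen origin (0, 0)


def directivity_target(face_centers: List[Point]) -> Optional[Point]:
    """Return the screen coordinate to steer the directivity center to,
    or None when no face was detected (the m > 0 check of step S102 fails)."""
    m = len(face_centers)
    if m == 0:
        return None
    # Rank the detected faces by distance from the origin (0, 0).
    ranked = sorted(face_centers, key=lambda p: math.hypot(p[0], p[1]))
    if m % 2 == 1:
        # Odd count (step S104): the face closest to the origin.
        return ranked[0]
    # Even count (step S105): midpoint of the closest and next-closest faces.
    (x0, y0), (x1, y1) = ranked[0], ranked[1]
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
```

How the returned coordinate is converted into a mechanical rotation or an electrical change of directivity is outside this sketch.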

(Second control program)
The second control program performs face detection on the video captured by the camera of the camera-equipped microphone and, if no face is present in the video, lowers the gain of that microphone or mutes it so that unwanted sound is not picked up. Like the first control program, it is executed repeatedly at regular time intervals.

First, face detection is performed by image processing such as pattern recognition on the video captured by the camera of the camera-equipped microphone. If a face is detected, the first control program is executed. If no face is detected, the sound pickup gain of that camera-equipped microphone is lowered or the microphone is muted.

FIG. 3 is a flowchart illustrating the flow of the second control program. First, video is captured by the camera of the camera-equipped microphone, and face detection is performed by image processing (step S201). Let m be the number of faces detected, and let F0, F1, ..., Fm be the center coordinates on the screen of the detected faces.

Next, it is determined whether m > 0 (step S202), that is, whether at least one face has been detected. If m > 0 (one or more faces have been detected), the gain of the microphone is set to its initial setting (step S203). If m > 0 does not hold (no face has been detected), the gain of the microphone is lowered or the output from that microphone is set to 0 (step S204).
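The gain switching of steps S202 to S204 can be sketched as follows. The class name `MicGain`, its field names, and the numeric gain values are hypothetical and chosen only for illustration; the embodiment does not specify concrete gain values.

```python
from dataclasses import dataclass


@dataclass
class MicGain:
    """Hypothetical gain state for one camera-equipped microphone."""
    default_gain: float = 1.0  # assumed initial setting restored in step S203
    reduced_gain: float = 0.0  # 0.0 mutes; a small positive value only attenuates
    gain: float = 1.0          # gain currently applied

    def update(self, face_count: int) -> float:
        """Apply the m > 0 decision of step S202 and return the new gain."""
        if face_count > 0:
            self.gain = self.default_gain  # step S203: restore the initial gain
        else:
            self.gain = self.reduced_gain  # step S204: lower the gain or mute
        return self.gain
```

Calling `update` once per detection cycle matches the statement that the control program runs repeatedly at regular time intervals.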

FIG. 4 is a flowchart illustrating processing that realizes both the first control program and the second control program. First, video is captured by the camera of the camera-equipped microphone, and face detection is performed by image processing (step S301). Let m be the number of faces detected, and let F0, F1, ..., Fm be the center coordinates on the screen of the detected faces.

Next, it is determined whether m > 0 (step S302), that is, whether at least one face has been detected. If m > 0 does not hold (no face has been detected), the gain of the microphone is lowered or the output from that microphone is set to 0 (step S303).

If m > 0 (one or more faces have been detected), it is determined whether m is odd or even (step S304). If m is odd, the center of directivity of the microphone is steered to the center coordinates of the detected face closest to the origin (0, 0) on the screen (step S305). If m is even, the center of directivity is steered to the point midway between the center coordinates of the detected face closest to the origin (0, 0) on the screen and those of the next closest face (step S306).

<Configuration of a video conference system using the camera-equipped microphone>
FIG. 5 is a schematic diagram illustrating the configuration of a video conference system to which the camera-equipped microphone of the present embodiment is applied. The video conference system 100 is built around a system main body (control unit) 101 and comprises a monitor 102 that displays video, a main camera 103 that captures high-quality video of the conference participants, and a plurality of camera-equipped microphones 1 arranged according to the table layout.

The monitor 102 displays video captured by the main camera 103 or by the cameras of the camera-equipped microphones 1, as well as video of the other party's participants sent from a conference room at a remote location. These videos can be switched freely.

The main camera 103 can pan, tilt, and zoom; it can capture all participants in the conference room at a wide angle or narrow the shooting range to a specific participant. The main camera 103 is also provided with microphones for sound source direction estimation, so that the direction of a sound source can be estimated from the sound they pick up and the shooting direction aligned with it.

The camera-equipped microphones 1 are arranged according to the table layout and the seating positions of the participants. They are cascade-connected starting from the system main body 101.

The monitor 102, the main camera 103, and the camera-equipped microphones 1 are connected to the system main body 101 and are controlled by various programs executed by the system main body 101.

<Control of the main camera using the camera-equipped microphone>
In the video conference system 100 shown in FIG. 5, using the camera-equipped microphones 1 of the present embodiment allows the main camera 103 to be panned and tilted accurately toward the speaker. It is assumed here that one camera-equipped microphone 1 is used per participant.

(First main camera control method)
In the first main camera control method, when there is a camera-equipped microphone whose echo canceller output (here, the sound obtained in the echo canceller by subtracting the estimated echo component from the microphone input signal) has an average power over the past few seconds (for example, one second) that exceeds a threshold, an utterance is deemed to have occurred. The camera-equipped microphone considered closest to the speaker is then determined, for example by selecting, among those that exceeded the threshold, the one with the largest average power.
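The selection described above can be sketched as follows. The function name `select_active_microphone` and the use of a plain sequence of per-microphone average powers are assumptions made for illustration; the echo cancellation itself is not modeled.

```python
from typing import Optional, Sequence


def select_active_microphone(power_averages: Sequence[float],
                             threshold: float) -> Optional[int]:
    """Return the index of the camera-equipped microphone with the largest
    average power among those exceeding the threshold, or None when no
    microphone exceeds it (no utterance is deemed to have occurred)."""
    best_index: Optional[int] = None
    best_power = threshold
    for index, power in enumerate(power_averages):
        # Only values strictly above the threshold (and above the best so far) qualify.
        if power > best_power:
            best_index, best_power = index, power
    return best_index
```

The index corresponds to the number x assigned to each camera-equipped microphone, so `power_averages[x]` would hold the value stored for microphone x.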

At this time, the main camera is first roughly panned and tilted toward the speaker by sound source direction estimation, and face detection is then performed by image processing. Each detected face is matched against the image from the camera-equipped microphone deemed closest to the speaker to perform personal identification, the most similar face is found, and the main camera is panned and tilted in that direction.

Each camera-equipped microphone stores an image at regular time intervals as an image for personal identification, and sends the stored image to the main camera when it is deemed closest to the speaker.

With this control, a rough shooting direction is set by sound source direction estimation at the main camera. When the faces of multiple participants appear in the captured video, the face resembling the face image captured by the camera of the camera-equipped microphone with the highest sound pickup is detected among them, and the shooting direction of the main camera is finely adjusted to the position of that face. The main camera can thereby accurately capture the speaker, who could not be located precisely by sound source direction estimation alone.
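The matching step can be sketched as follows. The embodiment does not specify how similarity between faces is measured; this sketch assumes, purely for illustration, that each face has already been reduced to a numeric feature vector and that cosine similarity is used. The names `cosine_similarity` and `most_similar_face` are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Position = Tuple[float, float]  # assumed: face position in the main camera image


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_face(reference: Sequence[float],
                      candidates: Sequence[Tuple[Position, Sequence[float]]]
                      ) -> Optional[Position]:
    """Return the position of the detected face whose features best match the
    reference image from the selected camera-equipped microphone, or None if
    the main camera detected no faces."""
    best_position: Optional[Position] = None
    best_score = -math.inf
    for position, features in candidates:
        score = cosine_similarity(reference, features)
        if score > best_score:
            best_position, best_score = position, score
    return best_position
```

The returned position would then be the target for the fine pan and tilt adjustment of the main camera.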

FIG. 6 is a flowchart illustrating the flow of processing on the camera-equipped microphone side in the first main camera control method. This processing must be performed for every sample, so with an audio sampling frequency of 32000 Hz, for example, the processing from S401 or S403 must start every 1/32000 second. One pass of the loop (S401 to S408, or S403 to S408) must therefore finish within 1/32000 second. First, the value of the memory Mx, which stores the average sound pickup power, is initialized to 0 (step S401). Here, x in Mx is a number assigned to each camera-equipped microphone in order starting from 0.

Next, the counter c is initialized to 0 (step S402). The power of the captured audio is then calculated over a predetermined sampling period (step S403). For example, for a microphone input signal sampled at 32 kHz, an array P with w × 32000 elements is prepared in order to obtain the average power Pave over w seconds (w = 1 for one second), and the instantaneous power P[c] = Input × Input is calculated at every sample, where Input is the microphone input value. If the sampling frequency is not 32 kHz, 32000 above is replaced with the actual sampling frequency.

Next, the average Pave of all elements of the array P is calculated (step S404). It is then determined whether Pave exceeds a predetermined threshold (step S405). If it does not, Mx = 0 is set (step S406); if it does, Mx = Pave is set (step S407).

Then c = (c + 1) % (w × 32000) is performed (step S408); that is, the remainder of dividing c + 1 by w × 32000 becomes the new c. After the new c is set, the process returns to step S403 and the subsequent processing is repeated.
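The loop of steps S401 to S408 can be sketched as a ring buffer of instantaneous power values. This is a minimal illustration, not the patented implementation: the class name `PowerAverager`, the `push` method, and the running sum (used so that each sample costs constant time instead of re-summing the whole array P) are assumptions introduced here.

```python
class PowerAverager:
    """Average power over the last w seconds of samples (steps S401 to S408)."""

    def __init__(self, sample_rate=32000, w=1.0, threshold=0.0):
        self.size = int(w * sample_rate)   # number of elements in array P
        self.P = [0.0] * self.size         # instantaneous power of each sample
        self.c = 0                         # counter c (S402)
        self.total = 0.0                   # running sum of P (assumption, see lead-in)
        self.threshold = threshold
        self.Mx = 0.0                      # memory Mx (S401)

    def push(self, sample):
        power = sample * sample                    # P[c] = Input x Input (S403)
        self.total += power - self.P[self.c]       # replace the oldest entry in the sum
        self.P[self.c] = power
        p_ave = self.total / self.size             # Pave (S404)
        # S405 to S407: report Pave only when it exceeds the threshold
        self.Mx = p_ave if p_ave > self.threshold else 0.0
        self.c = (self.c + 1) % self.size          # S408
        return self.Mx
```

With floating-point input the running sum can drift over long sessions, so a real implementation might periodically recompute it from P.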

FIG. 7 is a flowchart illustrating the processing on the main camera side in the first main camera control method. First, Told is initialized to the current time read from the timer (step S501), and the current time read from the timer is assigned to Tnow (step S502).

Next, it is determined whether Tnow − Told ≥ w (seconds) (step S503). Here, w is the waiting time (in seconds) between one pan/tilt operation of the main camera and the next.

If the determination in step S503 is No, the process returns to step S502 and the current time from the timer is assigned to Tnow again. If the determination in step S503 is Yes, Tnow is assigned to Told (step S504). Then Xmax = 0 and Mmax = 0 are set (step S505), and the process proceeds to step S506.

In step S506, assuming that n camera-equipped microphones are connected, the following is performed n times, once for each microphone i: if Mmax < Mi, then Xmax = i and Mmax = Mi are set. Here, Mi corresponds to Mx shown in FIG. 6. As a result of this processing, Xmax holds the number of the camera-equipped microphone with the largest average power, and Mmax holds that largest average power.
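Steps S505 and S506 amount to a linear scan for the maximum. The sketch below assumes the values M0, M1, … have already been collected into a Python list; the function name `loudest_microphone` is hypothetical.

```python
def loudest_microphone(powers):
    """Return (Xmax, Mmax) for a list of average powers M0, M1, ..."""
    x_max, m_max = 0, 0.0                 # S505
    for i, m_i in enumerate(powers):      # S506, repeated once per microphone
        if m_max < m_i:
            x_max, m_max = i, m_i
    return x_max, m_max
```

Because every Mi is forced to 0 when it is below the threshold, `m_max` stays at 0 when no microphone is picking up speech, so a caller can use `m_max > 0` as an indication that some microphone is active.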

Next, it is determined whether Xmax > 0 (step S507). If No, the process returns to step S502. If Yes, the main camera is panned and tilted roughly by sound source direction detection (step S508). In addition, the image V is acquired from the camera of the camera-equipped microphone whose number is stored in Xmax (step S509).

Next, face detection is performed on the image captured by the main camera (step S510). Here, m is the number of detected faces, and F0, F1, …, Fm are the center coordinates of the detected faces. Then Fmax = 0 and Smax = 0 are set (step S511).

Next, the similarity between each face image detected in the image captured by the main camera and the image V from the camera of the camera-equipped microphone whose number is stored in Xmax is calculated (step S512). Here, S0 to Sm are the similarities (values from 0 to 1) between the respective face images and the image V; that is, S0, S1, …, Sm are the similarities to V of the face images centered at F0, F1, …, Fm, respectively. The following is then performed m times: if Smax < Sj, then Fmax = j and Smax = Sj are set. As a result, Fmax holds the number of the face image most similar to the image V.

After that, the pan and tilt of the main camera are finely adjusted so that the camera points toward the center coordinates of the face image corresponding to Fmax (step S513).
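Steps S511 to S513 can be illustrated as picking the best-matching face and converting its center coordinates into a pan/tilt correction. The conversion below is an assumption introduced for illustration: it uses a simple pinhole camera model with known horizontal and vertical angles of view, which the source does not specify. The function names and parameters are hypothetical.

```python
import math

def select_face(similarities):
    """Index Fmax of the face with the highest similarity (ties go to the lowest index).

    Raises ValueError on an empty list, i.e. when no face was detected.
    """
    return max(range(len(similarities)), key=lambda j: similarities[j])

def pan_tilt_correction(face_center, image_size, hfov_deg, vfov_deg):
    """Angles in degrees that would bring face_center to the image center.

    Assumes a pinhole model; positive pan is to the right, positive tilt is up.
    """
    x, y = face_center
    width, height = image_size
    # Focal lengths in pixels derived from the angles of view
    fx = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    fy = (height / 2) / math.tan(math.radians(vfov_deg) / 2)
    pan = math.degrees(math.atan((x - width / 2) / fx))
    tilt = math.degrees(math.atan((height / 2 - y) / fy))  # image y grows downward
    return pan, tilt
```

For example, with a 60 degree horizontal angle of view, a face centered on the right edge of the frame would call for a pan of about 30 degrees.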

(Second main camera control method)
In the second main camera control method, the camera of each camera-equipped microphone stores an image taken when the speaker is "facing the camera more directly", and the main camera performs personal identification based on that image. This improves accuracy when the main camera is panned and tilted toward the speaker. The reason is that most speakers tend to face the television while speaking, so in an environment such as that of FIG. 5 they are usually facing almost directly toward the main camera.

The camera-equipped microphone uses, for example, pattern recognition with sample images of eyes and mouths to obtain information such as the area of the triangle connecting the two eye points and the one mouth point (the larger the area, the more likely the face is turned toward the front). When an image captured at the regular time interval is judged to face the front more directly than the stored image, it overwrites the stored image. The number of stored face images is not limited to one; when several are stored, those more likely to be facing the front are kept preferentially.
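The triangle area used as a frontality measure can be computed from the three landmark coordinates with the standard cross-product (shoelace) formula. The sketch below assumes the eye and mouth positions are already available as pixel coordinates from some detector; the function name `triangle_area` is hypothetical.

```python
def triangle_area(left_eye, right_eye, mouth):
    """Area of the triangle spanned by two eye points and one mouth point."""
    (x1, y1), (x2, y2), (x3, y3) = left_eye, right_eye, mouth
    # Half the absolute value of the 2D cross product of two edge vectors
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2.0
```

When the head turns sideways, the projected distance between the eyes shrinks, so the area drops: eyes 4 units apart with the mouth 3 units below give an area of 6, while eyes 2 units apart give 3.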

Another conceivable method is to output sound in a frequency band inaudible to the human ear from the camera-equipped microphone judged to be closest to the speaker, first pan and tilt the main camera roughly by sound source direction estimation using the sound in that band, and then perform fine pan and tilt control by face detection, personal identification, and the like.

FIG. 8 is a flowchart illustrating the processing on the camera-equipped microphone side in the second main camera control method when one image is stored. First, Told is initialized to the current time read from the timer, the variable A, which stores the area of the triangle formed by the eyes and mouth, is initialized to 0, and the stored image V is initialized to a black image (or a white image) (step S601).

Next, the current time from the timer is assigned to Tnow (step S602), and it is determined whether Tnow − Told ≥ w (step S603). Here, w is the time interval (in seconds) until the next image is captured. If the determination is No, the process returns to step S602 and the current time is assigned to Tnow again. If it is Yes, the process proceeds to step S604.

In step S604, the value of Tnow is assigned to Told. Then, in step S605, the face image Vnow is captured by the camera of the camera-equipped microphone, and the area of the triangle formed by the eyes and mouth is taken as Anow. Next, it is determined whether A < Anow (step S606). If Yes, A = Anow and V = Vnow are set (step S607); if No, the process returns to step S602.

As a result, A holds the largest eye-mouth triangle area found among the face images captured by the camera of the camera-equipped microphone, and V holds the face image captured at that time.
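The single-image case of FIG. 8 reduces to keeping the candidate with the largest area seen so far. In the sketch below the timer loop (S602 to S604) is left to the caller, `None` stands in for the initial black or white image, and the class name `FrontalImageStore` and its `offer` method are hypothetical.

```python
class FrontalImageStore:
    """Keeps the most frontal face image seen so far (steps S601, S606, S607)."""

    def __init__(self):
        self.A = 0.0     # largest eye-mouth triangle area so far (S601)
        self.V = None    # stored image; None stands in for the blank initial image

    def offer(self, v_now, a_now):
        """Store v_now if it is more frontal than the stored image; return True if stored."""
        if self.A < a_now:                  # S606
            self.A, self.V = a_now, v_now   # S607
            return True
        return False
```

A caller would invoke `offer` once every w seconds with the newly captured image and its triangle area.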

FIG. 9 is a flowchart illustrating the processing on the camera-equipped microphone side in the second main camera control method when several images are stored. First, Told is initialized to the current time read from the timer, the array A[0] to A[o], which stores the areas of the triangles formed by the eyes and mouth, is initialized to 0, and the array of stored images V[0] to V[o] is initialized to black images (or white images) (step S701).

Next, the current time from the timer is assigned to Tnow (step S702), and it is determined whether Tnow − Told ≥ w (step S703). Here, w is the time interval (in seconds) until the next image is captured. If the determination is No, the process returns to step S702 and the current time is assigned to Tnow again. If it is Yes, the process proceeds to step S704.

In step S704, the value of Tnow is assigned to Told. Then, in step S705, the face image Vnow is captured by the camera of the camera-equipped microphone, and the area of the triangle formed by the eyes and mouth is taken as Anow. Next, Omin = −1 and Amin = Anow are set (step S706), and the following is performed o times: if A[i] < Amin, then Omin = i and Amin = A[i] are set (step S707).

Next, it is determined whether Omin ≥ 0 (step S708). If No, the process returns to step S702 and the subsequent steps are repeated. If Yes, A[Omin] = Anow and V[Omin] = Vnow are set (step S709), and the process then returns to step S702.

As a result, the array A[0] to A[o] holds several eye-mouth triangle areas from face images, and the images V[0] to V[o] hold the corresponding face images.
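Steps S706 to S709 replace the least frontal stored image whenever the new image is more frontal than it. The sketch below operates on two parallel Python lists that the caller initializes (areas to 0, images to a blank placeholder); the function name `offer_image` is hypothetical.

```python
def offer_image(areas, images, a_now, v_now):
    """Replace the stored entry with the smallest area if a_now exceeds it.

    Returns the replaced index Omin, or -1 if nothing was replaced.
    """
    o_min, a_min = -1, a_now              # S706
    for i, a_i in enumerate(areas):       # S707
        if a_i < a_min:
            o_min, a_min = i, a_i
    if o_min >= 0:                        # S708
        areas[o_min] = a_now              # S709
        images[o_min] = v_now
    return o_min
```

Starting Amin at Anow means an entry is selected only if it is strictly smaller than the new area, so a new image that is less frontal than everything stored leaves the arrays unchanged.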

FIG. 10 is a flowchart illustrating the processing on the main camera side in the second main camera control method. First, Told is initialized to the current time read from the timer (step S801), and the current time from the timer is assigned to Tnow (step S802).

Next, it is determined whether Tnow − Told ≥ w (step S803). Here, w is the waiting time (in seconds) between one pan/tilt operation of the main camera and the next.

If the determination in step S803 is No, the process returns to step S802 and the current time from the timer is assigned to Tnow again. If the determination in step S803 is Yes, the value of Tnow is assigned to Told (step S804). Then Xmax = 0 and Mmax = 0 are set (step S805), and the process proceeds to step S806.

In step S806, assuming that n camera-equipped microphones are connected, the following is performed n times, once for each microphone i: if Mmax < Mi, then Xmax = i and Mmax = Mi are set. Here, Mi corresponds to Mx shown in FIG. 6. As a result of this processing, Xmax holds the number of the camera-equipped microphone with the largest average power, and Mmax holds that largest average power.

Next, it is determined whether Xmax > 0 (step S807). If No, the process returns to step S802. If Yes, the main camera is panned and tilted roughly by sound source direction detection (step S808). In addition, the images V[0] to V[o] are acquired from the camera of the camera-equipped microphone whose number is stored in Xmax (step S809).

Next, face detection is performed on the image captured by the main camera (step S810). Here, m is the number of detected faces, and F0, F1, …, Fm are the center coordinates of the detected faces. Then Fmax = 0 and Smax = 0 are set (step S811).

Next, the similarity between each face image detected in the image captured by the main camera and the images V[0] to V[o] from the camera of the camera-equipped microphone whose number is stored in Xmax is calculated (step S812). Here, S0 to Sm are the averages of the similarities (values from 0 to 1) between the respective face images and the images V[0] to V[o]; that is, S0, S1, …, Sm are the average similarities to V[0] to V[o] of the face images centered at F0, F1, …, Fm, respectively. The following is then performed m times: if Smax < Sj, then Fmax = j and Smax = Sj are set. As a result, Fmax holds the number of the face image most similar to the images V[0] to V[o].

After that, the pan and tilt of the main camera are finely adjusted so that the camera points toward the center coordinates of the face image corresponding to Fmax (step S813).
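The difference from the first method is that each detected face is scored by its average similarity over all stored images. The sketch below takes the similarity measure as a caller-supplied function returning a value from 0 to 1, since the source does not name a particular measure; `average_similarity` and `best_face` are hypothetical names.

```python
def average_similarity(face, stored_images, similarity):
    """Mean similarity between one detected face and the stored images V[0]..V[o]."""
    scores = [similarity(face, v) for v in stored_images]
    return sum(scores) / len(scores)

def best_face(faces, stored_images, similarity):
    """Return (Fmax, Smax) following steps S811 and S812."""
    f_max, s_max = 0, 0.0                                        # S811
    for j, face in enumerate(faces):                             # S812
        s_j = average_similarity(face, stored_images, similarity)
        if s_max < s_j:
            f_max, s_max = j, s_j
    return f_max, s_max
```

Averaging over several stored images makes the score less sensitive to any single unrepresentative capture, at the cost of more comparisons per detected face.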

<Monitor output presentation methods using the camera-equipped microphone>
In a video conference system 100 such as that shown in FIG. 5, using the camera-equipped microphone 1 of this embodiment makes it possible to apply various presentation effects to the output screen of the monitor 102. Several examples are described here.

(First presentation method)
FIG. 11 is a schematic diagram illustrating the first presentation method. In the first presentation method, when the videos captured by the cameras of the camera-equipped microphones are arranged side by side and output to the monitor in the video conference system, the larger the average power of the echo canceller output of a camera-equipped microphone, the larger the display area given to its video.

In the example shown in FIG. 11, the videos are displayed in two rows of three. The video h2 in the center of the upper row is the largest, followed by the videos h1 and h3, and then the videos h4, h5, and h6. That is, audio capture is greatest at the camera-equipped microphone that is capturing the video h2, so that video is given the largest display size. In this way, the closer a camera-equipped microphone is judged to be to the speaker, the larger the video captured by its camera is displayed.
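One way to turn average power into a display size is a linear scale factor between a minimum size and full size. The source states only that larger power gives a larger display area; the linear mapping, the `min_scale` parameter, and the function name `tile_scales` below are assumptions for illustration.

```python
def tile_scales(powers, min_scale=0.5):
    """Scale factor per video tile: 1.0 for the loudest microphone, min_scale for silence."""
    p_max = max(powers)
    if p_max <= 0:
        # No microphone is active: show every tile at the minimum size
        return [min_scale] * len(powers)
    return [min_scale + (1.0 - min_scale) * p / p_max for p in powers]
```

The same factor could equally drive a brightness multiplier for the second presentation method.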

(Second presentation method)
FIG. 12 is a schematic diagram illustrating the second presentation method. In the second presentation method, when the videos captured by the cameras of the camera-equipped microphones are arranged side by side and output to the monitor in the video conference system, the larger the average power of the echo canceller output of a camera-equipped microphone, the brighter its video is displayed.

In the example shown in FIG. 12, the videos are displayed in two rows of three. The video h2 in the center of the upper row is the brightest, followed by the videos h1 and h3, and then the videos h4, h5, and h6. In this way, the closer a camera-equipped microphone is judged to be to the speaker, the brighter its video is displayed.

(Third presentation method)
In the third presentation method, the background music (BGM) in the video conference system is changed according to the average power of the echo canceller output of the camera-equipped microphone judged to be closest to the speaker.

(Fourth presentation method)
FIG. 13 is a schematic diagram illustrating the fourth presentation method. In the fourth presentation method, when the camera-equipped microphones are cascade-connected from the system main body in the video conference system, the videos captured by their cameras are displayed in a single horizontal row in order of proximity to the system main body, which creates a pseudo-panoramic image.

In the example shown in FIG. 13(a), the camera-equipped microphones corresponding to the videos h1 to h5 are cascade-connected in that order starting from the system main body, and the videos captured by their cameras are displayed in a single horizontal row.

In the example shown in FIG. 13(b), as above, the videos captured by the cameras of the camera-equipped microphones cascade-connected from the system main body are displayed in a single horizontal row in order of proximity to the system main body. In addition, the larger the average power of the echo canceller output of a camera-equipped microphone, the larger the display area given to its video.

In the example shown in FIG. 13(b), the five videos h1 to h5 are displayed in a horizontal row in cascade-connection order. The video h3 is the largest, followed by the videos h2 and h4, and then the videos h1 and h5. That is, audio capture is greatest at the camera-equipped microphone that is capturing the video h3, so that video is given the largest display size. In this way, the closer a camera-equipped microphone is judged to be to the speaker, the larger its video is displayed, while the single-row arrangement creates a pseudo-panoramic image.
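The layout of FIG. 13(b) can be sketched as tiles placed left to right in cascade order, each with a width derived from its microphone's average power. The linear width mapping, the pixel values, and the function name `panorama_layout` are assumptions for illustration; when no microphone is active the sketch falls back to equal widths, as in FIG. 13(a).

```python
def panorama_layout(powers, base_width=320, min_scale=0.5):
    """Return a list of (x_offset, width) per tile, in cascade-connection order."""
    p_max = max(powers)
    layout, x = [], 0
    for p in powers:
        if p_max <= 0:
            scale = 1.0   # no active microphone: equal widths
        else:
            scale = min_scale + (1.0 - min_scale) * p / p_max
        width = round(base_width * scale)
        layout.append((x, width))
        x += width        # the next tile starts where this one ends
    return layout
```

For powers `[1, 2, 4, 2, 1]` the middle tile is widest and its neighbors step down symmetrically, matching the ordering h3, then h2 and h4, then h1 and h5.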

<Effects of the embodiment>
By applying the camera-equipped microphone of this embodiment to a video conference system, a speaker can easily tell whether he or she is within the directivity range of the microphone. In addition, when several camera-equipped microphones are used, a noise source within the directivity range of one of them makes the input from that directional microphone a cause of the needed sound being hard to hear. Steering the directivity toward a person by face detection makes the needed sound clearer, and when no one is nearby, lowering the gain or muting the microphone avoids picking up unnecessary noise as far as possible.

Furthermore, when a video conference is held with the main camera pointed toward the speaker, one conceivable approach is to determine the approximate shooting direction by sound source direction estimation through audio processing and then point the main camera toward the speaker by face detection and personal identification through image processing. The accuracy of this approach can be raised further by using the image captured by the camera of the camera-equipped microphone judged to be closest to the speaker.

In addition, visually emphasizing the speaker using only the video from the main camera requires reliably extracting the part of the video where the speaker is located. Using several camera-equipped microphones makes it possible to emphasize the speaker visually without any such extraction processing.

FIG. 1 is a schematic diagram illustrating the camera-equipped microphone according to this embodiment.
FIG. 2 is a flowchart illustrating the flow of the first control program.
FIG. 3 is a flowchart illustrating the flow of the second control program.
FIG. 4 is a flowchart illustrating processing that implements both the first control program and the second control program.
FIG. 5 is a schematic diagram illustrating the configuration of a video conference system to which the camera-equipped microphone of this embodiment is applied.
FIG. 6 is a flowchart illustrating the processing on the camera-equipped microphone side in the first main camera control method.
FIG. 7 is a flowchart illustrating the processing on the main camera side in the first main camera control method.
FIG. 8 is a flowchart illustrating the processing on the camera-equipped microphone side in the second main camera control method (when one image is stored).
FIG. 9 is a flowchart illustrating the processing on the camera-equipped microphone side in the second main camera control method (when several images are stored).
FIG. 10 is a flowchart illustrating the processing on the main camera side in the second main camera control method.
FIG. 11 is a schematic diagram illustrating the first presentation method.
FIG. 12 is a schematic diagram illustrating the second presentation method.
FIG. 13 is a schematic diagram illustrating the fourth presentation method.

Explanation of symbols

1 … camera-equipped microphone, 10 … main body housing, 11 … microphone, 12 … camera, 100 … video conference system, 101 … system main body (control unit), 102 … monitor, 103 … main camera

Claims

A camera-equipped microphone comprising:
a unidirectional microphone provided in a main body housing; and a camera provided in the main body housing and having an angle of view substantially equal to the range of unidirectionality of the microphone.

A control program for a camera-equipped microphone, the program controlling by computer a camera-equipped microphone that has a unidirectional microphone provided in a main body housing and a camera provided in the main body housing and having an angle of view substantially equal to the range of unidirectionality of the microphone,
the program causing the computer to execute a step of recognizing a face image from video captured by the camera and changing the center position of the unidirectionality of the microphone based on the position of the recognized face.

The control program for a camera-equipped microphone according to claim 2, wherein, when no face image can be recognized from the video captured by the camera, the computer executes a step of preventing audio from being captured by the microphone.

A video conference system in which a plurality of camera-equipped microphones, each including in a main body housing a microphone having unidirectionality and a camera having an angle of view substantially equal to the range of unidirectionality of the microphone, are arranged in a predetermined layout, and in which video of the participants using the camera-equipped microphones is captured by a main camera,
the video conference system comprising control means for identifying the camera-equipped microphone that is capturing audio, moving the shooting direction of the main camera toward its position, recognizing a face image from the video captured by the camera of the camera-equipped microphone that is capturing the audio, and adjusting the shooting direction of the main camera based on the recognition result.

The video conference system according to claim 4, wherein the plurality of camera-equipped microphones sequentially capture images of the participants' faces with their respective cameras and send them to the control means, and
the control means sequentially overwrites and saves the face images of the participants sent from the camera-equipped microphones, and uses the saved face images to recognize the face image when adjusting the main camera.

A video conference system in which a plurality of camera-equipped microphones, each including in a main body housing a microphone having unidirectionality and a camera having an angle of view substantially equal to the range of unidirectionality of the microphone, are arranged in a predetermined layout, and in which video captured by the cameras of the camera-equipped microphones is displayed on a monitor,
the video conference system comprising control means for, when displaying on the monitor the plurality of videos captured by the plurality of camera-equipped microphones, displaying the video captured by the camera of the camera-equipped microphone with the highest audio capture level larger than the videos captured by the cameras of the other camera-equipped microphones.

A video conference system in which a plurality of camera-equipped microphones, each including in a main body housing a microphone having unidirectionality and a camera having an angle of view substantially equal to the range of unidirectionality of the microphone, are arranged in a predetermined layout, and in which video captured by the cameras of the camera-equipped microphones is displayed on a monitor,
the video conference system comprising control means for, when displaying on the monitor the plurality of videos captured by the plurality of camera-equipped microphones, displaying the video captured by the camera of the camera-equipped microphone with the highest audio capture level brighter than the videos captured by the cameras of the other camera-equipped microphones.

A video conference system in which a plurality of camera-equipped microphones, each including in a main body housing a microphone having unidirectionality and a camera having an angle of view substantially equal to the range of unidirectionality of the microphone, are arranged in a predetermined layout, and in which video captured by the cameras of the camera-equipped microphones is displayed on a monitor,
wherein the plurality of camera-equipped microphones are cascade-connected, the video conference system comprising control means for displaying the plurality of videos captured by the cameras of the plurality of camera-equipped microphones in a horizontal row on the monitor in the order of the cascade connection.