JP2000295700A

JP2000295700A - Sound source localization method and apparatus using image information and storage medium storing a program for implementing the method

Info

Publication number: JP2000295700A
Application number: JP11095702A
Authority: JP
Inventors: Kazu Miyagawa; 和宮川; Haruhiko Kojima; 治彦児島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1999-04-02
Filing date: 1999-04-02
Publication date: 2000-10-20

Abstract

(57) [Abstract]

[Problem] To provide a sound source localization method and apparatus that can automatically convert an image with monaural audio into an image with stereo audio, and that do not produce a perceived mismatch between the positions of the image and the sound source in panoramic images and similar displays.

[Solution] A dividing unit 20 divides an input image with sound information into sound information and an image. An image analysis unit 40 uses the information in an image knowledge database 30 to analyze the divided image for the objects in the image, the movement (position) of those objects, the camera operation, and so on, and thereby obtains image information. A sound source separation unit 60 uses the obtained image information and the information in a sound knowledge database unit 50 to separate, from the divided sound information, the sound sources that the objects are considered to emit. A sound source localization unit 70 rearranges the separated sound sources in a sound field space suited to the video, taking into account the obtained image information and the video display method used by a reproduction unit 90, such as panoramic image display. A synthesis unit 80 combines the rearranged sound sources with the image, and the reproduction unit 90 displays and plays back the combined image with sound information using the predetermined video display method.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a method and an apparatus for localizing sound sources using image information.

[0002]

[Prior Art] Methods of generating pseudo-stereo audio from monaural audio have been studied since around 1985. Research has also been conducted on freely generating a pseudo sound field space by using an operating device for sound localization to move the position of a sound source up and down, left and right, and forward and backward. For example, in the document "Sound image localization editing system for headphone sound contents" (Manabu Okamoto et al., AES Tokyo Convention '97 Preprints, pp. 154-157, June 1997), the operator uses a mouse to manipulate the relative positions of icons representing the sound source and the listener, and the system calculates the angle and distance from that positional relationship to set the localization position of the sound source in the horizontal plane. By using this kind of typical GUI, the system lets the operator grasp the positional relationship between the sound source and the listener intuitively and generate the sound field space the operator intends.
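The angle-and-distance calculation described for the cited GUI system can be illustrated with a short sketch. This is not taken from the cited document; the function name `source_polar`, the coordinate convention (listener facing the +y direction, positive azimuth to the right), and the use of plain 2-D icon coordinates are all assumptions made for illustration.

```python
import math

def source_polar(listener_xy, source_xy):
    """Return (azimuth_deg, distance) of a source icon relative to a listener icon.

    Assumed convention (hypothetical): the listener faces +y, azimuth 0 is
    straight ahead, and positive azimuth is to the listener's right (+x).
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    # atan2(dx, dy) measures the angle from the +y axis toward the +x axis.
    azimuth_deg = math.degrees(math.atan2(dx, dy))
    return azimuth_deg, distance
```

Dragging the source icon in such a GUI would simply re-run this calculation with the new icon coordinates and pass the result to whatever localization filter is in use.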

[0003] In recent years, many new video display methods have been studied that are not confined to an image display area such as the 4:3 rectangular area of a TV. For example, the document "Analysis method of video using projection method and application to video handling" (Akihito Akutsu et al., IEICE Transactions D-2, Vol. J79-D-2, No. 5, pp. 675-686, May 1996) proposes, as shown in FIG. 10, a technique that analyzes the movement of the camera and converts, for example, video of a townscape shot while moving the camera sideways into a single image as a panoramic video. By converting images shot in a conventional rectangular area such as 4:3 into a panoramic image, this technique makes it possible to represent a video space that is not confined to the rectangular area.

[0004]

[Problem to Be Solved by the Invention] The former prior art described above is a method that lets an operator freely place sound sources in space when creating a sound field space. Therefore, when the audio of, for example, an old image recorded with monaural audio is to be converted into pseudo-stereo audio, each sound source must be separated, and the operator must use the operating device to move each sound source to the position matching the image while referring to the image at every step.

[0005] In the conventional moving image display method, the range in which the image is displayed is fixed to a rectangular area such as the 4:3 area of a TV. Consider, for example, a scene in which the camera moves sideways to follow a car moving from right to left: as long as the distance between the camera and the target does not change, the position of the sound source is treated as constant. Consequently, when a video display method different from the conventional one is used, such as the latter method above, for example by expanding this moving image into a panorama, an unnatural state arises in which the car moves from right to left on the screen but the sound the car emits does not move. This defect arises because only the image display method is converted when the image is converted into a panoramic video, and no conversion at all is applied to the audio.

[0006] The object of the present invention is to present a sound source localization method and apparatus that use image information to overcome two drawbacks: first, that the conventional technique, in which sound sources are localized manually with an operating device, takes too much effort when converting an image with monaural audio into pseudo-stereo audio; and second, that a discrepancy arises between the positions of the image and the sound source when a new video display method different from the conventional one is used.

[0007]

[Means for Solving the Problem] To solve the above problem, the sound source localization method of the present invention is characterized by comprising: a step of inputting an image with sound information; a step of dividing the input image with sound information into sound information and an image; a step of analyzing the divided image on the basis of the information in an image knowledge database, which stores the knowledge needed to analyze images, and thereby obtaining image information; a step of separating, from the divided sound information, the sound sources related to the image information on the basis of a sound knowledge database, which stores knowledge associating image information with sound information; a step of arranging the separated sound sources in a sound field space on the basis of the image information and the reproduction method used at image playback; a step of combining the arranged sound sources with the divided image; and a step of playing back and displaying the combined sound sources and image.

[0008] Alternatively, the method is characterized in that, in the step of analyzing the image and obtaining image information, characteristic information on the objects shown in the input image, position information on the objects, and camera motion information are obtained as the image information; and in the step of arranging the separated sound sources in the sound field space, the obtained characteristic information on the objects, position information on the objects, and camera motion information are used, and the position of each object relative to the reproduced image, which differs with the reproduction method used at image playback, is also taken into account, so that the separated sound sources are arranged in the sound field space corresponding to the reproduced image.

[0009] To solve the above problem, the sound source localization apparatus of the present invention is characterized by comprising: an image input unit that inputs an image with sound information; a dividing unit that divides the input image with sound information into sound information and an image; an image knowledge database that stores the knowledge needed to analyze images; an image analysis unit that analyzes the divided image using the information in the image knowledge database and obtains image information; a sound knowledge database that stores knowledge associating image information with sound information; a sound source separation unit that uses the information in the sound knowledge database to separate, from the divided sound information, the sound sources related to the image information; a sound source localization unit that arranges the separated sound sources in a sound field space on the basis of the image information and the reproduction method used at image playback; a synthesis unit that combines the arranged sound sources with the divided image; and a reproduction unit that plays back and displays the combined sound sources and image.

[0010] Alternatively, the apparatus is characterized in that the image analysis unit obtains, as the image information, characteristic information on the objects shown in the input image, position information on the objects, and camera motion information; and the sound source localization unit uses the obtained characteristic information on the objects, position information on the objects, and camera motion information, also takes into account the position of each object relative to the reproduced image, which differs with the reproduction method used at image playback, and arranges the separated sound sources in the sound field space corresponding to the reproduced image.

[0011] A program for causing a computer to execute the steps of the above sound source localization method using image information can be recorded on a storage medium readable by that computer. This makes it possible to distribute and store the sound source localization method of the present invention in the form of a storage medium, and to implement the method of the present invention using a computer.

[0012] In the present invention, image information is obtained by analyzing the image, and that information is used to determine the positions of the sound sources. Because the sound sources are separated and arranged automatically from the image information, no manual work is needed for tasks such as converting monaural audio into pseudo-stereo audio, unlike the conventional method that relies on an input device. In addition, when a video display technique different from the conventional one is used, the sound source localization unit determines how the image will be displayed from the image information and the information supplied by the reproduction unit, and places each sound source at a position suited to that display technique. As a result, the positional relationship between the image and the sound sources does not feel unnatural even when an unconventional image display technique is used.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0014] FIG. 1 is a block diagram showing an embodiment of an apparatus to which the present invention is applied. The apparatus comprises: an image input unit 10 that inputs an image with sound information; a dividing unit 20 that divides the input image with sound information into sound information and an image; an image knowledge database 30 that stores the knowledge needed to analyze an image and obtain the information in it; an image analysis unit 40 that uses the information in the image knowledge database 30 to analyze, from the divided image, what objects are in the image, how they are moving, how the camera is being operated, and so on, and thereby obtains image information; a sound knowledge database 50 that stores, as knowledge, image information together with the sound information related to it; a sound source separation unit 60 that uses the image information obtained by the image analysis unit 40 and the information in the sound knowledge database 50 to separate, from the sound information divided by the dividing unit 20, the sound sources that the objects are considered to emit; a sound source localization unit 70 that rearranges the separated sound sources in a sound field space suited to the video, taking into account the image information obtained from the image analysis unit 40 and the video display method, that is, how the reproduction unit 90 will display the image; a synthesis unit 80 that recombines the rearranged sound sources with the image; and a reproduction unit 90 that displays and plays back the combined image with sound information, either in a fixed rectangular area like that of a conventional TV or using a different video display method.
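The data flow among the units of FIG. 1 can be sketched as a small pipeline. This is only an illustrative skeleton under stated assumptions: the `Clip` container, the function name `localize_clip`, and the callable parameters `analyze`, `separate`, and `localize` are hypothetical stand-ins for the image analysis unit 40, the sound source separation unit 60, and the sound source localization unit 70; the publication does not specify any programming interface.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """Hypothetical container for an image with sound information."""
    frames: list  # image part
    audio: list   # monaural sound part

def localize_clip(clip, analyze, separate, localize, display_method):
    """Run the FIG. 1 data flow with caller-supplied stage functions.

    analyze(frames) -> image_info                      (unit 40, using database 30)
    separate(audio, image_info) -> (sources, residual) (unit 60, using database 50)
    localize(sources, image_info, display_method) -> placed sources (unit 70)
    """
    frames, audio = clip.frames, clip.audio            # dividing unit 20
    image_info = analyze(frames)
    sources, residual = separate(audio, image_info)
    placed = localize(sources, image_info, display_method)
    # The results would be handed to the synthesis unit 80 and reproduction unit 90.
    return frames, placed, residual
```

Passing the stages in as functions keeps the sketch independent of any particular recognition or separation technique, which the publication leaves to conventional methods.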

[0015] The present invention relates in particular to the functions and processing of the image analysis unit 40, the sound source separation unit 60, and the sound source localization unit 70.

[0016] Specifically, the image analysis unit 40 analyzes the image and obtains image information using the information in the image knowledge database 30. Taking the image shown in FIG. 3 as an example, the image information here refers to information about the objects (persons) shown in the image (male or female), the positions of the objects, shape information on bodies and faces, the movement of the camera, and so on. This information is obtained by referring to the feature values and shape information for objects, the camera motion feature values, and similar data stored in the image knowledge database 30.

[0017] The sound source separation unit 60 obtains from the sound knowledge database 50 the sound information for each object identified by the image analysis unit 40, and separates the sound source belonging to that object from the sound information that accompanied the image. The sound information here refers to information, obtained from the sound knowledge database 50, about what characteristic sound an object emits (for a man, a male voice model). By using the information on an object obtained by the image analysis unit 40 to retrieve the sound information for that object from the sound knowledge database 50, the sound source specific to the object can be separated from the sound information that accompanied the image.

[0018] The sound source localization unit 70 uses information on which video display method the reproduction unit 90 will use, together with the object position information, camera motion information, and other information obtained by the image analysis unit 40, to rearrange each object-specific sound source separated by the sound source separation unit 60 at a position in the sound field space that, from the viewer's point of view, coincides with the displayed position of the object. Because the reproduction unit 90 sends information on whether the image will be displayed in a fixed rectangular area or with a different video display method, the sound source localization unit 70 can generate a sound field space that takes into account not only image information such as object positions but also the video display method.

[0019] In this way, sound sources are separated automatically on the basis of image information obtained automatically by image analysis, and are rearranged in a sound field space that also takes the image display method into account. This avoids the effort required by the manual sound source placement of the conventional technique, and makes it possible to place sound sources at appropriate positions even when a new video display technique is used that is not confined to a fixed image display area such as that of a TV.

[0020] Embodiments of the sound source localization method using image information according to the present invention are described in detail below.

[0021] <Embodiment 1> In this embodiment, an image with monaural audio (FIG. 3), shot by a fixed camera capturing a man and a woman in conversation, is converted into an image with stereo audio using the present technique. In FIG. 3, although the man and the woman are speaking on the left and right sides of the image, the audio is monaural, so the viewer perceives both voices as coming from the center of the image or from the image as a whole. The present method converts such monaural audio into pseudo-stereo audio using image information.

[0022] FIG. 2 is an example of a flowchart showing the processing procedure of this embodiment.

[0023] First, in step 100, the image input unit 10 receives the input image. In step 110, the input image is sent to the dividing unit 20 and divided into audio and an image. The dividing unit 20 sends the image to the image analysis unit 40 and the audio to the sound source separation unit 60.

[0024] Next, in step 120, the image analysis unit 40 analyzes the image divided by the dividing unit 20. The analysis performs object recognition, that is, it determines what objects are present in the image, and uses the image knowledge database 30 for this purpose. The image knowledge database 30 holds, for example, shape information for humans and position information for the mouth, and this information is used to recognize the objects in the image. The image analysis unit 40 also analyzes and obtains other image information, such as object motion and camera motion. Object recognition, motion recognition, and camera motion recognition in images can be achieved with conventional techniques, so the details are omitted. In this embodiment, as shown in FIG. 4, the image analysis unit 40 uses the knowledge in the image knowledge database 30 to recognize that there are two persons in the image, that the one on the left is a man and the one on the right is a woman, and how their mouths are moving.

[0025] Next, in step 130, the sound source separation unit 60 uses the image information obtained by the image analysis unit 40, in particular the object information, to separate the sound sources related to the objects from the audio supplied by the dividing unit 20. The separation uses the sound knowledge database 50, which holds knowledge associating objects with sounds, such as what sound each object emits. The sound knowledge database 50 holds associations such as a human male with a male voice model and a car with a car engine sound model, and the sound model retrieved from the object information is used to separate the corresponding sound source from the sound information. In this embodiment, step 120 has established that the person on the left is a man, so a typical male voice model in the sound knowledge database 50 is used for sound source separation. In addition, because step 120 recognizes mouth movement, the sound source (voice) that falls within the time during which the mouth is moving and that matches the male voice model is extracted from the sound information of the image. The woman's sound source (voice) is extracted in the same way.
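The extraction rule described for step 130 (the mouth is moving and the audio matches the voice model) can be sketched as a per-frame labeling. This is a deliberately simplified illustration, not the separation technique of the publication: the per-frame model scores, the `threshold` parameter, and the function name `assign_frames` are hypothetical, and real separation of overlapping voices would need considerably more than frame labeling.

```python
def assign_frames(model_scores, mouth_moving, threshold=0.5):
    """Label each audio frame with the speaker it is attributed to, or None.

    model_scores: dict speaker -> list of per-frame scores in [0, 1] saying how
                  well the audio matches that speaker's voice model (assumed to
                  be produced elsewhere from the sound knowledge database).
    mouth_moving: dict speaker -> list of per-frame booleans from image analysis.
    A frame goes to the best-scoring speaker whose mouth is moving and whose
    score reaches the threshold; otherwise it stays unassigned (None).
    """
    n_frames = len(next(iter(model_scores.values())))
    labels = []
    for t in range(n_frames):
        best, best_score = None, threshold
        for speaker, scores in model_scores.items():
            if mouth_moving[speaker][t] and scores[t] >= best_score:
                best, best_score = speaker, scores[t]
        labels.append(best)
    return labels
```

Frames left as `None` correspond to audio that is not attributed to any recognized object and would remain in the unseparated remainder.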

[0026] Next, in step 140, the sound source localization unit 70 rearranges the separated sound sources and the remaining unseparated audio in the sound field space, taking into account the image information obtained by the image analysis unit 40 and the video display method of the reproduction unit 90. Sound source localization is performed by placing each separated sound source in the sound field space at a position appropriate to its object, on the basis of where in the image the object is located, how it is moving, and which part of it is emitting the sound. The audio that remains unseparated is treated as background sound that fills the entire sound field space. The video display method of the reproduction unit 90 is also considered when placing the sound sources. Consider, for example, an image captured by a camera that moves sideways to follow a car moving from right to left. When the image is displayed in a rectangular area like that of an ordinary TV, the position of the sound source does not change as long as the positional relationship between the car and the camera does not change. When the image is displayed as a panorama, however, the car moves from right to left within the wide panoramic image. In such a case, the sound source localization unit 70 moves the sound source separated on the basis of the object information "car" from right to left so that it suits the video display method, here the panoramic image. In this embodiment, as shown in FIG. 5, the camera is fixed, so the sound sources are arranged so that each is located at the position of the corresponding person's mouth.
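One simple way to place a separated monaural source at a horizontal image position in a two-channel output is constant-power panning. The publication does not name a particular localization technique, so the following is a sketch under that assumption; the function names `pan_gains` and `render_stereo`, the normalized position `x_norm`, and the equal split of the background sound between channels are illustrative choices.

```python
import math

def pan_gains(x_norm):
    """Constant-power pan gains (left, right) for a position in [0, 1].

    x_norm = 0 is the left edge of the displayed image, 1 is the right edge.
    """
    x = min(1.0, max(0.0, x_norm))
    theta = x * math.pi / 2.0
    return math.cos(theta), math.sin(theta)

def render_stereo(sources, background):
    """Mix separated sources and the unseparated remainder into (L, R) samples.

    sources:    list of (samples, x_norm) pairs, one per separated source.
    background: samples of the remaining audio, spread equally over both
                channels so that it is not localized at any one position.
    """
    bg_gain = math.sqrt(0.5)
    out = [[bg_gain * s, bg_gain * s] for s in background]
    for samples, x_norm in sources:
        left, right = pan_gains(x_norm)
        for i in range(min(len(out), len(samples))):
            out[i][0] += left * samples[i]
            out[i][1] += right * samples[i]
    return [tuple(pair) for pair in out]
```

For the two-person scene of this embodiment, the man's voice would be rendered with the `x_norm` of his mouth position on the left and the woman's with hers on the right.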

[0027] Next, in step 150, the synthesis unit 80 combines the image divided by the dividing unit 20 with the sound sources arranged in the sound field space by the sound source localization unit 70, and reconstructs an image with audio.

[0028] Next, in step 160, as shown in FIG. 6, the reproduction unit 90 displays and plays back the image with audio (sound information) that was combined and reconstructed by the synthesis unit 80. As the display method, the image can be displayed in a fixed rectangular area like that of a TV, or a different image display method can be selected. This completes the processing.

[0029] By performing this processing, the present method can automatically convert an image with monaural audio into an image with stereo audio.

[0030] <Embodiment 2> In this embodiment, an image shot by a camera moving sideways to follow a car traveling from right to left (FIG. 7, left) is expanded into a panoramic image, and the present technique is used to convert it into an image with stereo audio suited to the panoramic image.

[0031] The flowchart showing the processing procedure of this embodiment is the same as FIG. 2.

[0032] The processing from step 100 to step 130 is the same as in Embodiment 1: the image analysis unit 40 extracts the car in the image, and the sound source separation unit 60 separates and extracts the engine sound and other sounds emitted by the car from the sound information.

[0033] If the panoramic image is generated without considering the sound source, an unnatural state arises, as shown on the right of FIG. 7: the image of the car moves across the panoramic image, but the sound source stays in the same position as when displayed on a TV and does not move. In this embodiment, therefore, in step 140 the sound source localization unit 70 places the sound source in a sound field space that takes into account the video display method of the reproduction unit 90 (in this embodiment, a panoramic image), as shown in FIG. 8.
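The dependence of the sound source position on the display method can be sketched as follows. The sketch assumes that image analysis yields, for each frame, the object's horizontal position within the frame and the frame's horizontal offset within the panorama; these inputs, the function name `source_positions`, and the display method labels `"panorama"` and `"fixed"` are hypothetical and are not specified in the publication.

```python
def source_positions(object_xs, camera_offsets, frame_width, panorama_width,
                     display_method):
    """Return one normalized horizontal position in [0, 1] per frame.

    object_xs:      object x position inside each frame, in pixels.
    camera_offsets: x offset of each frame inside the panorama, in pixels
                    (assumed to come from the camera motion analysis).
    For "panorama" display the position is measured in the panorama; for any
    other display method it is measured inside the fixed frame.
    """
    positions = []
    for obj_x, offset in zip(object_xs, camera_offsets):
        if display_method == "panorama":
            positions.append((offset + obj_x) / panorama_width)
        else:
            positions.append(obj_x / frame_width)
    return positions
```

With a camera that tracks the car so that it stays at the frame center while the frame offset decreases, the fixed-frame positions stay constant, whereas the panorama positions move steadily from right to left, matching the behavior described for this embodiment.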

[0034] The processing from step 150 to step 160 is the same as in Embodiment 1.

[0035] By placing the sound source in a sound field space that takes into account the video display method of the reproduction unit 90 in this way, the sound source can be moved to follow the car as it moves across the panoramic image, as shown in FIG. 9. A natural image with stereo audio can therefore be generated even with a video display method that differs from one, such as a TV, that displays the image exactly as shot by the camera.

[0036] It goes without saying that some or all of the functions of the units of the apparatus shown in FIG. 1 can be implemented using a computer, and that the processing procedure shown in FIG. 2 can be executed by a computer. A program for implementing the functions of those units on a computer, or a program for causing a computer to execute that processing procedure, can be recorded on a storage medium readable by the computer, such as an FD (floppy disk), MO, ROM, memory card, CD, DVD, or removable disk, and can be provided and distributed in that form.

[0037]

[Effects of the Invention] As described above, according to the present invention, image analysis is performed to extract image information; sound sources based on the image information are separated from the sound information accompanying the image; the sound sources are rearranged in a sound field space that corresponds to the positions and movements of the objects in the image and to the image display method used at playback; and a new image with stereo audio is synthesized automatically. The user therefore does not need to place sound sources manually, and an image with stereo audio suited to the video display method can be displayed.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of an apparatus to which the present invention is applied.

FIG. 2 is a flowchart showing the processing procedure of an embodiment of the method of the present invention.

FIG. 3 is a diagram, for explaining the present invention, showing an example in which an ordinary monaural sound image is displayed.

FIG. 4 is a diagram, for explaining the present invention, showing an example in which the image in the monaural sound image of FIG. 3 is analyzed using the image knowledge database and the objects and mouth movements are recognized.

FIG. 5 is a diagram, for explaining the present invention, showing an example in which the sound sources of the man and the woman are each placed in a sound field space at the mouth positions recognized in FIG. 4.

FIG. 6 is a diagram, for explaining the present invention, showing an example in which the sound sources arranged in the sound field space are re-synthesized with the image and displayed.

FIG. 7 is a diagram, for explaining the present invention, showing examples of how sound is reproduced when an image containing camera motion is displayed on an ordinary TV or the like and when it is expanded into a panoramic image.

FIG. 8 is a diagram, for explaining the present invention, showing an example in which sound sources are rearranged in a sound field space that takes the video display method (panoramic image display) into account.

FIG. 9 is a diagram, for explaining the present invention, showing an example in which the present invention is applied to display a stereo sound image that takes the video display method (panoramic image display) into account.

FIG. 10 is a diagram showing an example of the generation and display of a panoramic image by a conventional method.

[Explanation of symbols]

10: image input unit
20: division unit
30: image knowledge database
40: image analysis unit
50: sound knowledge database
60: sound source separation unit
70: sound source localization unit
80: synthesis unit
90: reproduction unit

Claims

1. A sound source localization method using image information, comprising: a step of inputting an image with sound information; a step of dividing the input image with sound information into sound information and an image; a step of analyzing the divided image based on information in an image knowledge database storing knowledge necessary for analyzing images, and obtaining image information; a step of separating, from the divided sound information, a sound source related to the image information, based on a sound knowledge database storing knowledge relating image information and sound information; a step of arranging the separated sound source in a sound field space based on the image information and the reproduction method used at image reproduction; a step of synthesizing the arranged sound source and the divided image; and a step of reproducing and displaying the synthesized sound source and image.

2. The sound source localization method using image information according to claim 1, wherein, in the step of analyzing the image and obtaining image information, characteristic information of an object displayed in the input image, position information of the object, and camera motion information are acquired as the image information; and, in the step of arranging the separated sound source in the sound field space, the acquired characteristic information of the object, position information of the object, and camera motion information are used, the position information of the object with respect to the reproduced image is further taken into account, and the separated sound source is arranged in a sound field space corresponding to the reproduced image.

3. A sound source localization apparatus using image information, comprising: an image input unit that inputs an image with sound information; a division unit that divides the input image with sound information into sound information and an image; an image knowledge database storing knowledge necessary for analyzing images; an image analysis unit that analyzes the divided image using information in the image knowledge database to obtain image information; a sound knowledge database storing knowledge relating image information and sound information; a sound source separation unit that separates a sound source related to the image information from the divided sound information using information in the sound knowledge database; a sound source localization unit that arranges the separated sound source in a sound field space; a synthesis unit that synthesizes the arranged sound source and the divided image; and a reproduction unit that reproduces and displays the synthesized sound source and image.

4. The sound source localization apparatus using image information according to claim 3, wherein the image analysis unit acquires, as the image information, characteristic information of an object displayed in the input image, position information of the object, and camera motion information; and the sound source localization unit uses the acquired characteristic information of the object, position information of the object, and camera motion information, further takes into account the position information of the object with respect to the reproduced image that results from differences in the reproduction method used at image reproduction, and arranges the separated sound source in a sound field space corresponding to the reproduced image.

5. A storage medium storing a program for implementing a sound source localization method using image information, wherein a program for causing a computer to execute the procedure of the sound source localization method using image information according to claim 1 or 2 is recorded on a storage medium readable by the computer.