JP2011221603A

JP2011221603A - Multimedia recognition/retrieval system using dominant feature

Info

Publication number: JP2011221603A
Application number: JP2010086983A
Authority: JP
Inventors: Hiromitsu Hama; 裕光濱; Di Zwin Di; ティズインティ; Ting Pai; ティンパイ; Manichi Hama; 萬一濱
Original assignee: Individual
Current assignee: Individual
Priority date: 2010-04-05
Filing date: 2010-04-05
Publication date: 2011-11-04

Abstract

PROBLEM TO BE SOLVED: The main challenges in building a flexible multimedia recognition/retrieval system are the diversity of multimedia information and the enormous quantity of information. The invention aims to overcome them by using dominant features to provide transformation invariance, shorter computation time, compression of the information quantity, and flexibility. SOLUTION: The multimedia recognition/retrieval system comprises means for representing content using dominant features and obtaining the similarity of content, and means for obtaining the similarity of content using a feature vector derived from a co-occurrence matrix.

Description

The present invention relates to recognition and retrieval systems for a wide variety of multimedia, including images.

Many search engines, such as Google and Yahoo!, retrieve relevant web pages after a user issues a search request (query) using keywords, and present them to the user according to their respective ranking criteria. Link structure is commonly used as a ranking criterion, but attempts are now being made to raise user satisfaction by also considering other criteria such as a web page's reliability, popularity, importance, and seasonality, and these are likely to come into practical use. On the Internet, the retrieval and use of so-called multimedia information such as images, video, and music is spreading. However, text-based keywords are still used as search keys. Content-based retrieval technology has been proposed and is being developed to realize user-friendly interfaces, but it has not yet reached widespread practical use.

Methods exist for determining dominant colors automatically, for example by learning, and in some cases the dominant colors can be obtained simply from a histogram. The real problem is how to use the dominant colors once they are obtained, how to realize an efficient recognition/retrieval system with them, and how to generalize them into a higher-level concept that can be applied to content in general. Beyond retrieval, other representative applications include the ITS (Intelligent Transport Systems) field, for example pedestrian detection under poor lighting and weather conditions for driver assistance, and surveillance and monitoring systems.

A histogram-based approach for object-based query-by-shape-and-color in image and video database, Image and Vision Computing, Nov. 2005, vol. 23, no. 13, pp. 1170-1180.
A Novel Regions-of-Interest Based Image Retrieval Using Multiple Features, 12th Intl. Multimedia Modeling Conf., Beijing, China, Jan. 2006, pp. 377-380.
Dominant Color Region Based Indexing Technique for CBIR, In Proc. of the Intl. Conf. on Image Analysis and Processing (ICIAP '99), Venice, Italy, Sept. 1999, pp. 887-892.

One of the problems in building a multimedia recognition/retrieval system is the diversity of multimedia and the sheer volume of information. The major issues are how to represent content, how to compress the amount of information, how to shorten computation time through transformation invariance, and how to ensure flexibility. The point is to find things that are "similar" rather than "the same," which makes similarity a key concept. Realizing such flexible recognition/retrieval systems will become increasingly important. The problem is how to use dominant features for multimedia recognition and retrieval and how to realize an efficient system with them; at present their use is limited to images, and the techniques for using them are still under development.

At the same time, support for multimedia content, for which demand is expected to grow, is also insufficient. Labeling is often performed as preprocessing for robust and flexible recognition and retrieval of objects, but if each label is fixed to a single value, errors increase under poor conditions and robustness is lost. Flexible matching, which is indispensable for classifying and identifying objects that are not uniform, such as human postures and movements, is difficult to achieve with conventional methods in terms of both computation time and accuracy. Beyond retrieval, robust detection methods are also needed in the ITS (Intelligent Transport Systems) field, for example pedestrian detection under poor lighting and weather conditions for driver assistance, and in surveillance and monitoring systems.

The system comprises: extraction means for extracting dominant features from content; representation means for representing content using the dominant features extracted by the extraction means and the relationships among those features; input means for inputting the content to be processed; recognition means for recognizing the input content by obtaining the similarity between contents using the representation means; and retrieval means for retrieving content similar to the input content from a multimedia database.

The representation means provides the parameters from which the similarity between contents is obtained, by determining the regions in the original space that correspond to the dominant features extracted by the extraction means, or a space in which the dominant features are separately arranged (hereinafter, the dominant feature region), together with the relationships among them.

The representation means obtains parameters that reflect the presence of the dominant feature regions and uses them to provide the parameters from which the similarity between contents is obtained.

The representation means casts a vote into the ballot box located at a predetermined relative position from the position in the original space corresponding to each dominant feature extracted by the extraction means. The system includes recognition means that, if some ballot box has collected votes from all the dominant features needed to make up an object, recognizes that the object is located at that ballot box's position.

The system includes recognition means that recognizes the input content by assigning discrete states to the dominant features in the dominant feature region, obtaining a co-occurrence matrix, converting it into a probability matrix, obtaining an initial probability vector and a stationary probability vector, and computing the similarity or distance between contents using the feature vector obtained from these two vectors. The similarity obtained here can also be used for retrieval.

The system is characterized by its use of dominant features and comprises means for representing content by assigning discrete states to the dominant features, and means for obtaining the similarity of content using the features of the regions on the content that correspond to the dominant feature regions and the relationships between those regions (for example, the relative positions of their centroids).

Regardless of the dimensionality of the input data, a one-dimensional feature vector F of reduced order is ultimately obtained, and recognition and retrieval can be performed using F. Because the order is reduced, computation is far faster than block matching and runs in a small amount of memory; the approach also has various transformation invariances, gains tolerance, and becomes more flexible.
Here, block matching is a method that takes a region of a given size as a template block, searches the entire image exhaustively, and takes as the corresponding point the position that minimizes a difference evaluation function with respect to the block of interest. If no block scores at or below a predetermined threshold, it is judged that there is no corresponding block.
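For comparison, the block matching baseline described above can be sketched as follows. This is a minimal illustration, not part of the specification: the function name `block_match`, the use of the sum of absolute differences as the difference evaluation function, and the list-of-lists image format are all assumptions made for the example.

```python
def block_match(image, template, threshold):
    """Exhaustive block matching using a sum-of-absolute-differences score.

    `image` and `template` are 2D lists of numbers (an assumed format).
    Returns ((row, col), score) for the best-matching position, or None when
    even the best score is above `threshold` (no corresponding block).
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    # Full search: try every position where the template fits in the image.
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            score = sum(
                abs(image[r + dr][c + dc] - template[dr][dc])
                for dr in range(th)
                for dc in range(tw)
            )
            if best is None or score < best[1]:
                best = ((r, c), score)
    if best is None or best[1] > threshold:
        return None
    return best
```

The nested loops visit every candidate position and every template pixel, which is the cost that the reduced-order feature vector F is meant to avoid.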

FIG. 1: Configuration diagram and processing flow diagram of the entire system, including the processing means, in this embodiment.
FIG. 2: Explanatory diagram of the means for representing content (an image) using dominant features in this embodiment.
FIG. 3: Diagram illustrating several ways of choosing dominant features, with examples of dominant features, in this embodiment.
FIG. 4: Diagram illustrating how one-dimensional dominant features are chosen, with examples, in this embodiment.
FIG. 5: Example of choosing dominant features from horizontal HOG for pedestrian detection in this embodiment.
FIG. 6: Explanatory diagram of the voting-based object recognition method in this embodiment, taking pedestrian detection as an example.
FIG. 7: Diagram showing the co-occurrence matrix obtained from dominant feature regions.
FIG. 8: Diagram showing the procedure for obtaining a dominant feature vector from dominant feature regions in this embodiment.
FIG. 9: Diagram illustrating the procedure for obtaining a feature vector from two dominant feature regions in this embodiment.
FIG. 10: An example of how the length of a boundary line is obtained in this embodiment.
FIG. 11: Diagram illustrating image database retrieval results using dominant features (color) in this embodiment.

Embodiments of the present invention are described below with reference to the drawings and equations. FIG. 1(a) is a configuration diagram of the entire system, including the processing means, in this embodiment. The system comprises input means 101 for inputting the content to be processed, a multimedia database 102 to be searched, and processing means 103 for recognizing and retrieving the input content.

The processing means 103 includes means for analyzing the content (images, sound, signals, etc.) acquired from the input means 101, extracting features, and performing recognition and retrieval. Concretely, these means are implemented, for example, by a computer equipped with a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), a keyboard (operation unit), a display (display unit), an HDD (Hard Disk Drive; storage unit), and so on; each means functions when the CPU executes a program stored in the ROM, HDD, or the like. Here, "content" means multimedia content.

Before each means is described, the overall flow of the multimedia recognition/retrieval system using dominant features in this embodiment is outlined briefly. FIG. 1(b) is a processing flow diagram of the entire system in this embodiment. As shown in the figure, content is first input from the input means 200A (S201A), converted into a feature space (S202A), and analyzed to extract dominant features (S203A).

Next, the corresponding regions in the original space, or a space in which the dominant features are separately arranged, are constructed (S204A); the relationships among the extracted dominant features are obtained (S205A); and the substance and attributes of the content are represented by the combination of "content, dominant features, and relationships among dominant features" (S206A). The same processing is applied to the content of the multimedia database (200B, S201B to S204B). The similarity is then computed to compare the results of S206A and S206B obtained above (S207). If the similarity is higher than a predetermined threshold, the contents are judged "similar" and the retrieval result is accepted (S208); if it is lower, they are judged "not similar" and the retrieval result is rejected (S209). This completes one pass of processing (S210); the processing is repeated for as long as multimedia recognition/retrieval continues.

The input device used as the input means is selected from, for example: "0-dimensional: GPS position measurement, temperature, humidity"; "1-dimensional: accelerometer (3-axis), angular accelerometer (3-axis), inclinometer (3 directions; may be included in the angular accelerometer), microphone (sound)"; "2-dimensional: ordinary visible-light camera, near-infrared camera, far-infrared camera, thermography"; and "3-dimensional: 3D data (range sensor, 3D position sensor, 3D motion sensor, 3D distance-measuring sensor)". The description here mainly uses images because they are intuitive, but the approach applies equally to other media (3D data and so on).

As for the multimedia database to be searched, the Internet itself can be regarded as a huge database and can be a search target of this system, so the system can be applied regardless of the means of access.

FIGS. 2 to 9 illustrate the dominant feature extraction means and representation means, as well as the recognition and retrieval means, in one embodiment. FIG. 2 shows by example a method of extracting dominant features and the resulting representation of an image by dominant feature regions. In this example, luminance values 1 and 2 are selected as dominant features on the basis of their overall counts, and the pixels having luminance value 1 or 2 in the original image form the dominant feature regions. Values other than 1 and 2 occur too rarely and are judged not to be dominant features. The threshold is set according to how much detail needs to be detected, how much computation time is acceptable, and so on.
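The count-based selection described for FIG. 2 can be sketched as follows. This is only an illustration under stated assumptions: the names `dominant_values` and `dominant_region_mask`, the expression of the threshold as a fraction of all pixels (`min_fraction`), and the use of `None` for pixels outside any dominant feature region are choices made for the example, not details given in the specification.

```python
from collections import Counter


def dominant_values(image, min_fraction):
    """Return the pixel values whose share of all pixels is at least `min_fraction`."""
    counts = Counter(v for row in image for v in row)
    total = sum(counts.values())
    return sorted(v for v, n in counts.items() if n / total >= min_fraction)


def dominant_region_mask(image, dominant):
    """Keep dominant values in place; mark every other pixel as None (blank)."""
    keep = set(dominant)
    return [[v if v in keep else None for v in row] for row in image]
```

Lowering `min_fraction` admits more values as dominant features, which gives finer detail at the cost of more computation, in line with the trade-off noted above.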

This example does not use a region origin (OC: Operating Center; hereinafter OC); if anything, each pixel position plays the role of the region origin, and the co-occurrence matrix can be computed directly from this representation. A region origin is a point that represents a region. For a circle, for example, it can be placed at the center, but it may equally be set to a point on the circumference or to any other point. In the same figure, the centroids of the two regions DFR1 and DFR2 can also be computed and used as region origins, with the vector connecting them taken as the relationship between the dominant feature regions. These two centroids can be regarded as parameters reflecting where the dominant feature regions are located; alternatively, the positions of the individual pixels having value 1 or 2 can be regarded as such parameters.
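The centroid-based variant mentioned above can be illustrated with a short sketch. The helper names `centroid` and `relation_vector` and the (row, column) coordinate convention are assumptions for the example only.

```python
def centroid(image, value):
    """Centroid (row, col) of all pixels equal to `value`, or None if there are none."""
    points = [
        (r, c)
        for r, row in enumerate(image)
        for c, v in enumerate(row)
        if v == value
    ]
    if not points:
        return None
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def relation_vector(image, value_a, value_b):
    """Vector from the centroid of region `value_a` to that of region `value_b`."""
    a, b = centroid(image, value_a), centroid(image, value_b)
    if a is None or b is None:
        return None
    return (b[0] - a[0], b[1] - a[1])
```

Here each centroid serves as the region origin of its dominant feature region, and the returned vector is one possible encoding of the relationship between DFR1 and DFR2.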

Among the features that describe content, a prevailing feature is called a dominant feature (DF: Dominant Feature) in the sense that it represents the content; in general it is a feature that prevails in count or in numerical value. For example, if the red region of the original image is large, red can be chosen as a dominant feature. A dominant feature region (DFR: Dominant Feature Region) is the region in the original space that corresponds to a dominant feature, or a space in which dominant features are separately arranged, and is used to compute the similarity between contents. Taking the region on the original content as an example, if red is the dominant feature, the DFR is the red part of the original image. Thus, if several colors account for most of an image, those colors are called dominant colors, and the pixels having those colors can be treated as dominant feature regions. Texture, shape, edge direction and gradient strength, range images, and the like can also be used as features. Because content here means images, audio, 3D data, and so on, a wide variety of dominant features exist. A statistical aggregation of local dominant features can itself be called a dominant feature. An example that uses the region origin explicitly is given next.

FIG. 3 illustrates ways of choosing dominant features, with examples, in this embodiment. As the examples show, there are several ways to choose dominant features, so the choice is made according to the target content, so as to give high discrimination and generalization ability. The white areas are treated as background and excluded. In FIG. 3(a), triangles and an ellipse are the dominant features: the three triangle OCs and the one ellipse OC are detected from the original image, and recognition uses their positional relationship. The dominant features can be detected, for example, by morphological operations that use the triangle and the ellipse as structuring elements (SE: Structuring Element).

Morphological operations are basic operations used in a variety of image processing tasks such as feature extraction and noise removal. The basic morphological operators are dilation and erosion. They are used for isolated-point removal, dilation, erosion, opening, closing, exclusive dilation, shrinking, thinning, hole filling, skeletonization, contour extraction, reconstruction, circular separation, and so on. These operations require a structuring element (SE: Structuring Element) that specifies the range over which the operation is applied; it may be a circular disk or any other shape. The choice of shape can secure invariance: a circular disk gives rotation invariance, and a square gives invariance under 90° rotation. A structuring element needs a designated origin, which is here called, in a broad sense, the region origin (OC: Operating Center). The image and the structuring element need not be restricted to a two-dimensional plane. The difference between the dilation and erosion results represents the strength of contrast. Morphological operations are used here as an example, but other operations that reflect location can be used as well; for example, the maximum (indicating presence in the region) and the minimum (indicating absence from the region) can be used in combination. Optical flow refers to the apparent motion of the brightness pattern of an image.
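As a rough illustration of the two basic operators and of the dilation-minus-erosion difference, the following sketch works on a grayscale image stored as a list of lists. It is an example under assumptions, not the implementation of the embodiment: the structuring element is given as a list of (row, column) offsets relative to its origin (OC), it is assumed to be symmetric and to contain the origin itself, and neighbors falling outside the image are simply ignored.

```python
def _apply(image, offsets, pick):
    """Apply `pick` (max or min) over the in-bounds neighbors given by `offsets`."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            vals = [
                image[r + dr][c + dc]
                for dr, dc in offsets
                if 0 <= r + dr < h and 0 <= c + dc < w
            ]
            row.append(pick(vals))  # offsets include (0, 0), so vals is never empty
        out.append(row)
    return out


def dilate(image, offsets):
    """Grayscale dilation: maximum over the structuring element."""
    return _apply(image, offsets, max)


def erode(image, offsets):
    """Grayscale erosion: minimum over the structuring element."""
    return _apply(image, offsets, min)


def gradient(image, offsets):
    """Dilation minus erosion, a measure of local contrast strength."""
    d, e = dilate(image, offsets), erode(image, offsets)
    return [[dv - ev for dv, ev in zip(d_row, e_row)] for d_row, e_row in zip(d, e)]


# A small cross-shaped structuring element whose origin (OC) is its center.
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
```

The cross is unchanged by 90° rotation, so results computed with it share that invariance; a disk-shaped offset list would approximate rotation invariance more closely.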

In the example of FIG. 3(b), texture is chosen as the dominant feature, and the small region at the upper right is removed as noise, although it may be kept if it is important. Which to choose depends on the application.

In the example of FIG. 3(c), when the parts of the original image corresponding to the two dominant features (black and gray) are located, each connected region is small, but many regions with the same dominant feature lie close together. A common OC is therefore set so that the two dominant feature regions can be detected easily. In this example they can be detected simply by morphological operations using a disk SE and a ring-shaped SE, and the detection is rotation invariant. As this example shows, dominant feature regions may overlap.

FIG. 4 is a one-dimensional example. Dominant feature regions can be detected easily by morphological operations (dilation and erosion). Noise can be handled in two ways here: small regions are either kept or removed. In this example each dominant feature is expressed by an upper and a lower threshold. Regions can also be separated according to the magnitude of a statistic such as the variance within the region.

FIG. 5 is an example of choosing dominant features from horizontal HOG for pedestrian detection in this embodiment. In the figure, the magnitude of the horizontal vector of the histogram obtained from horizontal HOG is shown by the size of the arrows. Using the histogram directly as a dominant feature is inefficient, so it may be converted into discrete labels. The discretization takes the mean, maximum, median, or the like and classifies those values into a number of bins (classes). Once a label has been assigned at each position, the co-occurrence matrix can be computed and processing can move to the next step. With omnidirectional HOG, the direction vector with the largest value may be used as the dominant feature. HOG (Histograms of Oriented Gradients) is a gradient-based feature for object recognition.
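The conversion from histogram values to discrete labels can be sketched as below. The function names, the representation of each cell as a plain list of histogram values, the bin boundaries in `edges`, and the default choice of the maximum as the summary statistic are all assumptions for illustration; the specification only states that the mean, maximum, median, or the like is classified into bins.

```python
def to_label(value, edges):
    """Return the index of the bin that `value` falls into.

    `edges` is an ascending list of upper bin boundaries; values at or above
    the last boundary receive the label len(edges).
    """
    for label, edge in enumerate(edges):
        if value < edge:
            return label
    return len(edges)


def label_cells(cell_histograms, edges, summary=max):
    """Summarize each cell's histogram with `summary`, then bin it into a label."""
    return [
        [to_label(summary(hist), edges) for hist in row]
        for row in cell_histograms
    ]
```

Passing a different `summary` function, for example `statistics.median` from the Python standard library, switches the statistic without changing the binning.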

FIG. 6 illustrates the voting scheme, taking pedestrian detection in this embodiment as an example, and shows each feature voting for a central point. The arrows indicate the predetermined relative position from the position in the original space corresponding to each dominant feature. If, after voting, some ballot box has collected votes from all the dominant features needed to make up the object, the object is judged to be at that ballot box's position. For example, if features corresponding to the head, torso, and legs come together, a pedestrian is judged to have been detected. Restricting the features to head and torso, or to torso and legs, allows the upper and lower body to be detected separately. A feature need not vote for a single location; its vote may be spread over a range covering several ballot boxes.
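The voting scheme described above can be sketched as follows. The feature kinds ("head", "torso", "legs"), the offsets, and the function name `detect_by_voting` are hypothetical values chosen for the example; each feature casts a single vote here, whereas the text also allows votes spread over several ballot boxes.

```python
from collections import defaultdict


def detect_by_voting(features, offsets, required):
    """Return positions whose ballot box received votes from every required kind.

    features: list of (kind, (row, col)) detections in the original space.
    offsets:  kind -> (d_row, d_col), the predetermined relative position of
              the ballot box as seen from a feature of that kind.
    required: set of kinds that together make up the object.
    """
    boxes = defaultdict(set)
    for kind, (r, c) in features:
        if kind in offsets:
            dr, dc = offsets[kind]
            boxes[(r + dr, c + dc)].add(kind)
    return sorted(pos for pos, kinds in boxes.items() if required <= kinds)
```

Calling the function with `required={"head", "torso"}` corresponds to the upper-body-only detection mentioned in the text.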

FIGS. 7 and 8 are examples, each in one embodiment, of the procedure for obtaining a co-occurrence matrix from dominant features and then obtaining a dominant feature vector. Discrete states are assigned to dominant features (color, HOG features, edge direction and gradient strength, range images, etc.) to represent the object, and the similarity (or distance) between the dominant feature vectors derived from the co-occurrence matrices of those features is used as the similarity (or distance) between contents. Images are used here for intuitive explanation, but the input need not be an image as long as dominant features can be obtained; the approach also applies to content such as one-dimensional signals from music or acceleration sensors and to 3D data. It is an integrated method for handling in a unified way not only pixel values but also block information, such as the intensity gradient of each block as typified by HOG, and optical flow, and it is intended for use in flexible recognition/retrieval systems. The label assigned to a discrete state may be fixed to a single value, but a set of several labels with assigned probabilities or weights may instead be set as one state of a dominant feature region, or the labels may be grouped and a separate co-occurrence matrix obtained for each label.

Discrete states are assigned to the dominant features, and a co-occurrence matrix is obtained from tuples of dominant features separated by a distance r at an angle θ. In the simplest and most common case the co-occurrence matrix is built from pairs, but tuples of three or more may also be used. There are several kinds of co-occurrence matrix. In FIG. 2, if pairs are taken in both directions (for example, θ = 0° and 180°), the co-occurrence matrix is symmetric, and invariance is preserved under scaling, vertical and horizontal flipping, and so on. This property is very important for building recognition systems. The invariances hold exactly in continuous space; in discrete space they are affected by sampling error, but they still hold approximately when the number of regions is sufficiently large. Taking r large makes directionality more pronounced.
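A pairwise co-occurrence count of this kind can be sketched as follows. The grid-of-labels input, the use of `None` for cells without a dominant feature (which are skipped here rather than counted as a separate state), and the encoding of (r, θ) as integer (row, column) offsets are assumptions made for the example.

```python
def cooccurrence(labels, n_states, offsets):
    """Count how often state j appears at one of `offsets` from state i.

    labels:  2D list holding ints in range(n_states), or None for cells
             that carry no dominant feature.
    offsets: list of (d_row, d_col) displacements, one per (r, theta).
    """
    h, w = len(labels), len(labels[0])
    m = [[0] * n_states for _ in range(n_states)]
    for r in range(h):
        for c in range(w):
            a = labels[r][c]
            if a is None:
                continue
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] is not None:
                    m[a][labels[rr][cc]] += 1
    return m


# r = 1 with theta = 0 degrees and 180 degrees (both horizontal directions).
BOTH_HORIZONTAL = [(0, 1), (0, -1)]
```

Because every pair is counted once in each direction, the resulting matrix is symmetric, and mirroring the grid left to right leaves it unchanged, which is the flipping invariance noted above.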

In image processing, a two-dimensional intensity histogram generally means the statistics of the pixel values at two points in the same relative positional relationship (a histogram of pixel values a distance r apart in a direction θ). Here the concept is extended from pixel values to dominant feature regions. It can also be extended from statistics over two regions to larger tuples (in general, n-tuples with n ≥ 2). For example, when triples are used as in Equation 1 below, the frequencies can be counted with r₂, θ₂, r₃, and θ₃ as parameters and converted into probabilities. The region features may take various forms, such as scalars, vectors, and matrices.

FIG. 9 is a concrete example illustrating the procedure for obtaining the dominant feature vector in one embodiment. A co-occurrence matrix is computed from the dominant regions to which discrete states have been assigned and is converted into a probability matrix; an initial probability vector and a stationary probability vector are then obtained, and the similarity (or distance) between the feature vectors derived from these two vectors is calculated. Figures X₁ and X₂ are similar to each other, and neither is similar to figure X₃. In the top row, D₁ and D₂ denote the dominant region features obtained from the original image, and blank cells are portions with no dominant region feature. The second row shows the co-occurrence matrices obtained from the pair frequencies (0°, 180°), and the third row shows them converted into probabilities row by row. The fourth row shows the dominant feature vectors for figures X₁ to X₃, obtained from the third row. The steps are described in order below.

In the following expression, c_i (i = 1, ..., r) denotes the frequency of the dominant feature label D_i, and c₀ denotes the frequency of regions that have no dominant feature. Here r is the number of dominant feature labels, not the distance r used above.
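The frequencies c₀ and c_i can be tallied directly from a grid of region labels. The sketch below is a minimal illustration that assumes labels are the integers 1 to r and that cells without a dominant feature are stored as `None`; the function name `label_counts` and the sample grid are hypothetical.

```python
from collections import Counter


def label_counts(grid, num_labels):
    """Return [c_0, c_1, ..., c_r].

    c_0 counts cells with no dominant feature (stored as None, an assumption);
    c_i counts cells carrying label i, for i = 1..num_labels.
    """
    counts = Counter(
        cell if cell is not None else 0 for row in grid for cell in row
    )
    return [counts.get(i, 0) for i in range(num_labels + 1)]


# Hypothetical 2 x 3 grid: one blank cell, two cells of D_1, three of D_2.
grid = [
    [1, 1, None],
    [2, 2, 2],
]
c = label_counts(grid, num_labels=2)
```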

The next expression converts these frequencies into probabilities (of two kinds).

Next, the co-occurrence matrix is computed.
c_ij is the number of times D_j occurs at distance r and angle θ from the position of D_i. Because c_ii equals the number of grid points in the region labeled D_i, excluding the end points, c_ii is approximately the size of that region. Likewise, for θ = 0°, 90°, 180°, and 270°, c_ij (i ≠ j) represents the length of the boundary and is therefore approximately rotation invariant. Given these properties, c_ii may instead be defined as the size of the region containing D_i and c_ij as the length of that region's boundary, in which case the co-occurrence matrix is rotation invariant apart from sampling error. At the same time, by the inherent nature of the feature vector, it is also invariant under scaling and under vertical and horizontal flipping, which makes retrieval more flexible. It is difficult to determine the boundary length exactly in discrete space, but a simple approximation such as the one shown in FIG. 10 may be used for diagonal lines.
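The counting of c_ij can be sketched as follows for a label grid and a set of pixel offsets. This is an illustrative sketch, not the patent's implementation: the function name `cooccurrence`, the representation of (r, θ) as integer (row, column) offsets, and the sample grid are all assumptions.

```python
def cooccurrence(grid, num_labels, offsets):
    """Count c[i][j]: label j found at one of `offsets` from a cell with label i.

    Labels are the integers 1..num_labels; None marks a cell without a
    dominant feature and is skipped. Index 0 of the result corresponds to
    label 1.
    """
    rows, cols = len(grid), len(grid[0])
    c = [[0] * num_labels for _ in range(num_labels)]
    for y in range(rows):
        for x in range(cols):
            a = grid[y][x]
            if a is None:
                continue
            for dy, dx in offsets:
                yy, xx = y + dy, x + dx
                inside = 0 <= yy < rows and 0 <= xx < cols
                if inside and grid[yy][xx] is not None:
                    c[a - 1][grid[yy][xx] - 1] += 1
    return c


# Hypothetical 2 x 4 grid with two dominant features and one blank cell.
grid = [
    [1, 1, 2, 2],
    [1, 1, 2, None],
]
# theta = 0 and 180 degrees at distance r = 1: one step right, one step left.
c = cooccurrence(grid, num_labels=2, offsets=[(0, 1), (0, -1)])
```

Because every pair is counted once in each direction, the resulting matrix is symmetric, as described for the bidirectional choice of θ.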

Next, the co-occurrence matrix obtained above is converted into a probability matrix.
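Following the row-by-row conversion described for FIG. 9, this step can be sketched as a row normalization. The function name `to_probability_matrix`, the handling of all-zero rows, and the sample counts are assumptions for illustration.

```python
def to_probability_matrix(c):
    """Divide each row of a count matrix by its row sum.

    Rows that sum to zero are left as zeros (an assumption; the source does
    not say how such rows are treated).
    """
    p = []
    for row in c:
        total = sum(row)
        p.append([v / total if total else 0.0 for v in row])
    return p


# Hypothetical symmetric co-occurrence counts for two dominant features.
c = [[4, 2], [2, 2]]
p = to_probability_matrix(c)
```

Each nonzero row of the result sums to 1, so the matrix can be read as the transition matrix of a Markov chain over the dominant feature labels.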

Once the probability matrix P has been obtained, the following processing reduces the number of dimensions and increases flexibility.
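One way to obtain a stationary probability vector from P is power iteration, sketched below. It assumes that s is the stationary distribution of the Markov chain whose transition matrix is P, that is, the vector satisfying s = sP; the source's own expression for d and s is not reproduced in this text, so the function name `stationary_vector`, the iteration limits, and the sample matrix are hypothetical.

```python
def stationary_vector(p, iterations=1000, tol=1e-12):
    """Approximate s with s = s P by repeated multiplication (power iteration).

    Starts from the uniform distribution. Convergence is assumed; it can fail
    for periodic or reducible chains.
    """
    n = len(p)
    s = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(s[i] * p[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, s)) < tol:
            return nxt
        s = nxt
    return s


# Hypothetical row-stochastic matrix (row-normalized [[4, 2], [2, 2]]).
p = [[2 / 3, 1 / 3], [0.5, 0.5]]
s = stationary_vector(p)
```

When P comes from row-normalizing a symmetric count matrix, the stationary vector is proportional to the row sums of that count matrix, which gives a quick check on the result.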

Using d and s from the above expression, the feature vector F is defined as follows.

F is then used as the feature of a content item (image), and a distance (or similarity) between feature vectors is defined: the smaller the distance (or the closer the similarity is to 1), the more similar the content items (images) are. Because s is inherently invariant under rotation, horizontal and vertical flipping, and scaling, and F can be chosen to have the same invariances, the method tolerates deformation and yields a robust pattern recognition system. The distances ED and CD and the similarity CC between two feature vectors F₁ and F₂ are defined by Equation 9. Content items (images) are judged more similar the smaller ED and CD are and the closer CC is to 1 (that is, the larger it is). The three measures do not always give identical results, but they generally show much the same tendency.
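The abbreviations ED, CD, and CC are not expanded in this text. The sketch below assumes, purely for illustration, that they stand for Euclidean distance, city-block distance, and correlation coefficient; these are common choices consistent with the description (smaller is more similar for the distances, closer to 1 for the similarity), but the definitions in Equation 9 may differ. The feature vectors are made-up values, not those of FIG. 9.

```python
import math


def euclidean_distance(f1, f2):
    """Assumed reading of ED: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def city_block_distance(f1, f2):
    """Assumed reading of CD: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(f1, f2))


def correlation_coefficient(f1, f2):
    """Assumed reading of CC: Pearson correlation of the two vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(f1)
    m1, m2 = sum(f1) / n, sum(f2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(f1, f2))
    v1 = sum((a - m1) ** 2 for a in f1)
    v2 = sum((b - m2) ** 2 for b in f2)
    return cov / math.sqrt(v1 * v2)


# Made-up feature vectors: f1 and f2 are close, f3 is different.
f1 = [0.6, 0.4, 0.5, 0.5]
f2 = [0.58, 0.42, 0.5, 0.5]
f3 = [0.1, 0.9, 0.2, 0.8]

ed_12, ed_13 = euclidean_distance(f1, f2), euclidean_distance(f1, f3)
cd_12, cd_13 = city_block_distance(f1, f2), city_block_distance(f1, f3)
cc_12, cc_13 = correlation_coefficient(f1, f2), correlation_coefficient(f1, f3)
```

With these sample vectors all three measures rank f2 as closer to f1 than f3 is, which mirrors the statement that the measures tend to agree.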

The results computed according to Equation 9 are shown in Equation 10. They show that, whichever measure is used, F₁ and F₂ are similar to each other and neither is similar to F₃.

In addition to the feature vector F described above, the relative positions of the centroids of the dominant feature regions, or shape features such as moments, may be used as similarity measures. In that case, each measure may be weighted.

FIG. 11 shows retrieval results from an image database using dominant color as the dominant feature in one embodiment. Part (a) of the figure shows the result of determining the dominant colors from a histogram, extracting the pixels having those colors, and extracting the corresponding dominant feature regions. Because color is used here, the regions are called DCRs (Dominant Color Regions) rather than DFRs (Dominant Feature Regions). Part (b) shows the results retrieved from the image database. A photograph taken with a mobile phone is used as the query (a processing request, or inquiry, to the database management system).

A property of the feature vector is that F is the same whenever the shape is the same; it does not depend on the ID numbers of the dominant features, so swapping ID numbers gives the same result. In other words, because the method does not depend on labeling, it is well suited to finding objects of the same shape but different color and to finding similar images. For example, the same result is obtained if D₁ and D₂ (the dark and light colors) in FIG. 9, or DCR₁ and DCR₂ (yellow and red) in FIG. 11, are swapped. On the other hand, since the color of each dominant feature is known, swapped versions can also be distinguished, and either behavior can be chosen to suit the requirement at hand. Furthermore, because F is derived from probabilities, it can be set up so that enlarged or rotated versions of a figure give the same result. It is also tolerant of deformation and works robustly even when small parts are missing, which makes it effective for recognizing objects and retrieving similar ones.

The feature may be anything, such as color, shape, gray-level gradient, or optical flow, provided that a dominant feature exists. In general this is not a problem, because some dominant feature is usually present.

101 Input means
102 Multimedia database
103 Processing means
200A Input means
200B Multimedia database

Claims

1. A multimedia recognition/retrieval system using dominant features, comprising:
extraction means for extracting dominant features from multimedia content;
expression means for expressing the multimedia content using the dominant features extracted by the extraction means and the relationships among those features;
input means for inputting multimedia content to be processed (hereinafter referred to as input content);
recognition means for recognizing the input content by obtaining the similarity between multimedia contents using the expression means; and
search means for searching a multimedia database for multimedia content similar to the input content.

2. The multimedia recognition/retrieval system using dominant features according to claim 1, wherein the expression means obtains the regions, in the original space corresponding to the dominant features extracted by the extraction means or in a space in which the dominant features are separately arranged (hereinafter referred to as dominant feature regions), together with the relationships among them, and provides parameters serving as a basis for obtaining the similarity between multimedia contents.

3. The multimedia recognition/retrieval system using dominant features according to claim 1 or 2, wherein the expression means obtains parameters reflecting the existence of the dominant feature regions and uses them to provide parameters serving as a basis for obtaining the similarity between multimedia contents.

4. The multimedia recognition/retrieval system using dominant features according to claim 1 or 2, wherein the recognition means casts a vote into a ballot box located at a predetermined relative position from the position in the original space corresponding to each dominant feature extracted by the extraction means, and, if there is a ballot box in which votes from the features required to constitute an object have been collected, determines that the object is located at the position of that ballot box.

5. The multimedia recognition/retrieval system using dominant features according to claim 1 or 2, wherein the recognition means assigns discrete states to the dominant features in the dominant feature regions, obtains a co-occurrence matrix, converts it into a probability matrix, obtains an initial probability vector and a stationary probability vector, and recognizes the input content by obtaining the similarity or distance between contents using the feature vector obtained from the two vectors.