JP2004046612A

JP2004046612A - Data matching method and device, data matching program, and computer readable recording medium

Info

Publication number: JP2004046612A
Application number: JP2002204306A
Authority: JP
Inventors: Kenji Kita; 北　研二; Masamiki Shishibori; 獅々堀　正幹
Original assignee: Techno Network Shikoku Co Ltd
Current assignee: Techno Network Shikoku Co Ltd
Priority date: 2002-07-12
Filing date: 2002-07-12
Publication date: 2004-02-12

Abstract

<P>PROBLEM TO BE SOLVED: To achieve highly efficient and fast data matching by using a one-dimensional self-organizing map as an approximate nearest neighbor search, and increasing the efficiency for accessing a secondary storage medium. <P>SOLUTION: A one-dimensional self-organizing map is created based on a group of multidimensional data for data matching when conditions are designated to extract desired data from multidimensional data having a plurality of feature quantities and capable of expressing the feature quantities in the form of vectors. Then multidimensional indexing is performed to retain the multidimensional data in consecutive areas on a secondary storage part. A unit with the reference vector closest to matching condition vectors is retrieved from the one-dimensional self-organizing map and another unit located in the vicinity of the former is selected from the one-dimensional self-organizing map. The multidimensional data belonging to the latter is referred to from the secondary storage part and the distance between the vectors is calculated. The calculation result is output after being subjected to a predetermined process. By limiting the range of search using information obtained from the one-dimensional self-organizing map, a very fast nearest neighbor search is made possible. <P>COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、１次元自己組織化マップを用いた高次元データの高速近傍検索によるデータマッチング方法、データマッチング装置、データマッチングプログラムおよびコンピュータで読み取り可能な記録媒体に関する。
【０００２】
【従来の技術】
近年のコンピュータなど電子計算機の高性能化、記憶容量の大容量化および低価格化により、情報の電子化、ＩＴ化の急速な普及が進み、電子化されたデータが利用される機会が飛躍的に増大した。電子化されたデータは紙のデータに比較して複製や加工、共有が容易であり、しかも検索の面などで優れている。特に最近ではインターネットの普及により、文書のみならず画像や映像、音声データなどのマルチメディアデータが頻繁に扱われるようになった。このような背景において、所望のデータやこれに類似するデータの検索や分類、整理などの技術が重要となっている。ここでは、マルチメディアデータの検索、データマイニング、パターン認識、機械学習、コンピュータ・ビジョン、統計データ解析などを含めてデータマッチングと呼ぶ。
【０００３】
データマッチングをコンピュータで行う際、マルチメディアデータはコンピュータ内部では特徴量のベクトルで表現することができる。特徴量のベクトルは、指定された検索条件に類似するデータをデータベース中から検索する際にも利用できる。図１に特徴量ベクトルを利用したマルチメディア・コンテンツ検索の一例を示す。類似検索を行う検索条件として特徴量のベクトルを指定すると、検索の実行は検索条件のベクトルとデータベース中のベクトルの距離を計算し、距離の小さいものを検索結果として出力する。このように、条件として与えられたベクトルとの距離が小さいベクトルをデータベース中から検索することを、最近傍検索（ｎｅａｒｅｓｔ　ｎｅｉｇｈｂｏｒ　ｓｅａｒｃｈ）と呼ぶ。最近傍検索では、複数の特徴量を多次元ベクトルで表現し、ベクトル間の距離によりデータ同士の類似性を判定している。例えば、文書検索の場合には、索引語の重みベクトルで文書や検索条件を表現することができる。また画像の類似検索の場合には、カラーヒストグラム、テクスチャ特徴量、形状特徴量などから成る特徴量ベクトルにより画像データを表現する。
【０００４】
このような特徴量ベクトルに基づくコンテンツの類似検索として、線形探索が知られている。線形探索では、データベース中のすべてのデータに関する特徴量ベクトルを検索条件のベクトルと逐次的に比較するため、データベースの規模に比例した計算量が必要となる。計算量の増大は計算機側の処理の負担の増大、要処理時間の増大に繋がる。このため、データベースが大規模化した際の検索システムの処理効率に深刻な影響を及ぼすことになる。したがって、最近傍検索を効率的に行なうための多次元インデキシング技術の開発が、重要な課題として従来より活発に研究されてきた。
【０００５】
【発明が解決しようとする課題】
しかしながら、このような研究にも拘わらず決定的な方式は開発されていない。特徴量ベクトルの次元数は非常に大きいため、高次元空間における効率的な多次元インデキシング技術の開発は容易でない。
【０００６】
例えば、ユークリッド空間における多次元インデキシング手法には、Ｒ−ｔｒｅｅ、ＳＳ−ｔｒｅｅ、ＳＲ−ｔｒｅｅなどが提案されている。また、より一般的な距離空間を対象にしたインデキシング手法としては、ＶＰ−ｔｒｅｅ、ＭＶＰ−ｔｒｅｅ、Ｍ−ｔｒｅｅなどが提案されている。これらのインデキシング手法は、多次元空間を階層的に分割することにより、探索範囲を限定することを基本としている。探索範囲を限定すれば、その分だけ計算量も少なくて済む。しかしながら、高次元空間では、ある点の最近点と最遠点との間に距離的な差が生じなくなるという現象が起こる。この「次元の呪い（ｃｕｒｓｅ　ｏｆ　ｄｉｍｅｎｓｉｏｎａｌｉｔｙ）」として知られる現象によって、探索する領域を限定することができなくなり、結果として線形探索に近い計算量が必要になってしまうという問題点がある。
【０００７】
高次元空間における上記の問題点に対処するために、近似的な最近傍検索についても研究が進められている。たとえば、ハッシュ法に基づく近似検索手法や空間充填曲線（ｓｐａｃｅ−ｆｉｌｌｉｎｇ　ｃｕｒｖｅ）を用いて、高次元空間の点を索引付けする手法なども提案されているが、実用化には至っていない。
【０００８】
一方、文書データ（テキスト）と画像データが混在するクロスメディア情報検索では、一度の検索で所望の検索結果を得ることが困難であり、ユーザは複数回のやりとりによって所望の検索結果を得ることが多々ある。したがって、クロスメディア情報検索では特徴量ベクトルに基づく最近傍検索の実行回数が必然的に多くなってしまう。このような場合、完全な最近傍検索は必要ではなく、むしろ高速な近似的な最近傍検索のほうが望ましいということができる。
【０００９】
さらに、近年の肥大化したデータベース中からの検索においては、高速なＲＡＭなどの半導体メモリのみで処理することができず、より大容量であるがアクセス速度の劣るハードディスクなどの２次記憶媒体を使って行われることが一般的である。この場合、ハードディスクへのアクセス速度は半導体メモリに比べて極めて遅いため、ハードディスクへのアクセス速度がボトルネックとなって検索処理速度を上げることができないという問題もあった。
【００１０】
本発明は、このような問題を解決するためになされたものである。本発明の主な目的は、近似的な最近傍検索として１次元自己組織化マップを利用し、さらに２次記憶媒体へのアクセスを効率化することによって高効率かつ高速なデータマッチングを実現可能なデータ検索装置、データ検索方法、データ検索プログラムおよびコンピュータで読み取り可能な記録媒体を提供することにある。
【００１１】
【課題を解決するための手段】
上記の目的を達成するために、本発明の請求項１に記載されるデータマッチング方法は、複数の特徴量を持ちこれらの特徴量をベクトルで表現可能な多次元のデータに対し、条件を指定して所望のデータを抽出するデータマッチング方法に関する。このデータマッチング方法は、データマッチングの対象となる多次元のデータ群に基づいて、同じ次元数の参照ベクトルを備えるユニットに分割した１次元自己組織化マップを作成するステップと、前記作成された１次元自己組織化マップに基づいて多次元インデキシングを行い、２次記憶部上の連続した領域に多次元データを保持するステップと、前記所望のデータを抽出するマッチング条件をベクトルとして表現し、マッチング条件ベクトルに最も近い参照ベクトルを持つユニットを１次元自己組織化マップから検索するステップと、前記１次元自己組織化マップから検索されたユニットの近傍に位置するユニットを１次元自己組織化マップから選択するステップと、前記選択されたユニットに属する多次元データを前記２次記憶部から参照し、ベクトル間の距離計算を行うステップと、上記ベクトル距離計算の結果に対し所定の処理を行って出力するステップとを備えることを特徴とする。
【００１２】
また、本発明の請求項２に記載されるデータマッチング方法は、前記請求項１に記載される特徴に加えて、前記１次元自己組織化マップの作成ステップが、入力層がすべてのユニットと結合されており、各ユニットには入力層に入力されるデータと同じ次元数の参照ベクトルが対応している自己組織化ネットワークにおいて、入力層に入力されたベクトルと最も近い参照ベクトルを持つユニットを検索するステップと、前記検索されたユニットとその近傍にあるユニットの参照ベクトルを入力ベクトルに近づけるステップとを備えることを特徴とする。
【００１３】
さらに、本発明の請求項３に記載されるデータマッチング方法は、上記請求項２に記載される特徴に加えて、前記１次元自己組織化マップの学習アルゴリズムが、参照ベクトルをランダムな値で初期化するステップと、入力ベクトルに最も近い参照ベクトルを持つユニットを検出するステップと、前記検出されたユニットおよびその近傍領域の参照ベクトルを更新するステップと、参照ベクトルが収束するまで上記ステップを繰り返すステップとを備えることを特徴とする。
【００１４】
さらにまた、本発明の請求項４に記載されるデータマッチング方法は、上記請求項１から３のいずれかに記載される特徴に加えて、前記データマッチングが、指定された検索条件に適合するデータの検索であることを特徴とする。
【００１５】
さらにまた、本発明の請求項５に記載されるデータマッチング方法は、上記請求項４に記載される特徴に加えて、前記検索されるデータが文書データ、静止画または動画である画像データ、音声データのいずれかもしくはこれらの組み合わせであることを特徴とする。
【００１６】
さらにまた、本発明の請求項６に記載されるデータマッチング方法は、上記請求項１から３のいずれかに記載される特徴に加えて、前記データマッチングが、画像のパターン認識であることを特徴とする。
【００１７】
さらにまた、本発明の請求項７に記載されるデータマッチング方法は、上記請求項１から３のいずれかに記載される特徴に加えて、前記データマッチングが、データマイニングであることを特徴とする。
【００１８】
また、本発明の請求項８に記載されるデータマッチング装置は、複数の特徴量を持ちこれらの特徴量をベクトルで表現可能な多次元のデータに対し、条件を指定して所望のデータを抽出するものである。このデータマッチング装置は、データマッチングの対象となる多次元のデータ群を保持するデータベースと、前記データベースに保持された多次元データ群に基づいて、同じ次元数の参照ベクトルを備えるユニットに分割した１次元自己組織化マップを作成するための演算部と、前記演算部で作成された１次元自己組織化マップに基づいて多次元インデキシングを行うための主記憶部と、前記主記憶部で実行された多次元インデキシングに基づいて、多次元データを連続した領域に保持する２次記憶部とを備える。さらに、前記所望のデータを抽出するマッチング条件をベクトルとして表現し、マッチング条件ベクトルに最も近い参照ベクトルを持つユニットを１次元自己組織化マップから検索し、前記１次元自己組織化マップから検索されたユニットの近傍に位置するユニットを１次元自己組織化マップから選択し、前記選択されたユニットに属する多次元データを前記２次記憶部から参照し、ベクトル間の距離計算を行い、上記ベクトル距離計算の結果に対し所定の処理を行って出力することを特徴とする。
【００１９】
さらに、本発明の請求項９に記載されるデータマッチングプログラムは、複数の特徴量を持ちこれらの特徴量をベクトルで表現可能な多次元のデータに対し、条件を指定して所望のデータを抽出するデータマッチングプログラムに関する。このデータマッチングプログラムは、データマッチングの対象となる多次元のデータ群に基づいて、同じ次元数の参照ベクトルを備えるユニットに分割した１次元自己組織化マップを作成する処理と、前記作成された１次元自己組織化マップに基づいて多次元インデキシングを行い、２次記憶部上の連続した領域に多次元データを保持する処理と、前記所望のデータを抽出するマッチング条件をベクトルとして表現し、マッチング条件ベクトルに最も近い参照ベクトルを持つユニットを１次元自己組織化マップから検索する処理と、前記１次元自己組織化マップから検索されたユニットの近傍に位置するユニットを１次元自己組織化マップから選択する処理と、前記選択されたユニットに属する多次元データを前記２次記憶部から参照し、ベクトル間の距離計算を行う処理と、上記ベクトル距離計算の結果に対し所定の処理を行って出力する処理とをコンピュータに実行させる。
【００２０】
さらに、本発明の請求項１０に記載されるコンピュータで読み取り可能な記録媒体は、前記請求項９に記載されるデータマッチングプログラムを格納するものである。記録媒体には、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷやフレキシブルディスク、磁気テープ、ＭＯ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、ＤＶＤ−Ｒ、ＤＶＤ−ＲＷ、ＤＶＤ＋ＲＷ、ＤＶＤ＋Ｒなどの磁気ディスク、光ディスク、光磁気ディスク、半導体メモリその他のプログラムを格納可能な媒体が含まれる。
【００２１】
【発明の実施の形態】
以下、本発明の実施の形態を図面に基づいて説明する。ただし、以下に示す実施の形態は、本発明の技術思想を具体化するためのデータマッチング方法、データマッチング装置、データマッチングプログラムおよびコンピュータで読み取り可能な記録媒体を例示するものであって、本発明はデータマッチング方法、データマッチング装置、データマッチングプログラムおよびコンピュータで読み取り可能な記録媒体を以下のものに特定しない。
【００２２】
さらに、この明細書は、特許請求の範囲を理解し易いように、実施の形態に示される部材に対応する番号を、「特許請求の範囲の欄」、および「課題を解決するための手段の欄」に示される部材に付記している。ただ、特許請求の範囲に示される部材を、実施の形態の部材に特定するものでは決してない。なお、各図面が示す部材の大きさや位置関係などは、説明を明確にするため誇張していることがある。
【００２３】
本明細書において、データマッチングの対象となるデータには、テキストなどの文書データ、静止画または動画である画像データ、音楽や演奏、公演、スピーチなどの音声データといったマルチメディアデータが含まれる。またデータマッチングには、文書や映像といった単独種類のデータもしくは複数種類のデータが混在するデータベース中からのマルチメディアデータの検索、データマイニング、パターン認識、機械学習、コンピュータ・ビジョン、統計データ解析などが含まれる。ここでデータマイニングとは、統計的・数理的な手法により、多様で大量のデータから自動的に有用な情報を発見する処理を指す。有用な情報には、データの傾向、パターン、相関関係、規則などが含まれる。データマイニングで使われる手法には、統計的データ解析、決定木、ニューラルネットワークなどがあり、これらの手法ではデータを多次元のベクトルで表現することが多い。このような場合において、あるデータに類似したデータを探すための処理に、本発明の最近傍探索を利用することができる。
【００２４】
［特徴量ベクトル］
本明細書において特徴量ベクトルには、電子データ（メディア・コンテンツ）の種類に応じて様々なものが設定できる。様々なメディア・コンテンツに従った検索を行う場合、データベース中に含まれるメディア全体の内容すなわちデータそのものを用いると、非常に大規模なデータを取り扱わなければならない。そのため、データコンテンツの内容を顕著に表す特徴量を使用する。特徴量は多次元ベクトル形式の特徴量ベクトルとして表す。ここで多次元について説明すると、データがｎ個の属性の性質を持ち、ｎ個の属性値の並びにより表現されるとき、このデータをｎ次元データと呼び、各データはｎ次元空間内に配置される。一般にｎが大きい場合を多次元データと呼び、各データを検索する際には多次元空間内を検索することとなる。
【００２５】
文書コンテンツを表す特徴量としては、文書内に出現する単語の内、文書内容を顕著に表す単語を索引語として抽出し、その索引語の頻度を特徴量として用いる。
【００２６】
画像コンテンツを表す特徴量としては、色情報、形状情報、テクスチャ情報が用いられる。色情報は、ＲＧＢ表色系やＣＩＥ　Ｌａｂ表色系などに従って画像内の色分布をヒストグラムに変換したものを多次元ベクトルとする。形状情報、テクスチャ情報は、Ｗａｖｅｌｅｔ変換などで周波数分解した値を多次元ベクトルとする。
【００２７】
映像コンテンツを表す特徴量としては、各画像間の動きベクトルの大きさを多次元ベクトルで表現し、映像の特徴量とする。
【００２８】
音楽コンテンツを表す特徴量としては、音楽内に現れる各音の音高を基に、音高の時間的推移や音高差の分布を多次元ベクトルで表現したものを特徴量として用いる。
【００２９】
また、多次元ベクトルでコンテンツ特徴量を表現し、コンテンツが類似したデータを検索する技術は、上記のマルチメディア情報検索分野だけに限らず、データマイニング、パターン認識、機械学習、コンピュータ・ビジョン、統計データ解析のような幅広い分野で利用されている。これらの分野では、データが有する様々な属性の値が、そのデータの特徴量として多次元ベクトルで表現されている。
【００３０】
本明細書においてデータマッチング方法、データマッチング装置、データマッチングプログラムおよびコンピュータで読み取り可能な記録媒体は、データマッチングを行うシステムそのもの、ならびにデータマッチングに関連する入出力、表示、演算、通信その他の処理をハードウェア的に行う装置や方法に限定するものではない。ソフトウェア的に処理を実現する装置や方法も本発明の範囲内に包含する。例えば汎用の回路やコンピュータにソフトウェアやプログラム、プラグイン、オブジェクト、ライブラリ、アプレット、コンパイラなどを組み込んでデータマッチングそのものあるいはこれに関連する処理を可能とした汎用あるいは専用のコンピュータ、ワークステーション、端末、携帯型電子機器、ＰＤＣやＣＤＭＡ、Ｗ−ＣＤＭＡ、ＦＯＲＭＡなどの携帯電話、ＰＨＳ、ＰＤＡ、ページャ、スマートフォンその他の電子デバイスも、本発明のデータマッチング方法、データマッチング装置、データマッチングプログラムおよびコンピュータで読み取り可能な記録媒体の少なくともいずれかに含まれる。また本明細書においては、プログラム自体もデータマッチング装置に含むものとする。
【００３１】
本発明の実施例において使用されるコンピュータなどの端末同士、およびサーバやこれらに接続される操作、制御、表示、各種処理その他のためのコンピュータ、あるいはプリンタなどその他の周辺機器との接続は、例えばＩＥＥＥ１３９４、ＲＳ−２３２ＣやＲＳ−４２２、ＵＳＢなどのシリアル接続、パラレル接続、あるいは１０ＢＡＳＥ−Ｔ、１００ＢＡＳＥ−ＴＸなどのネットワークを介して電気的に接続して通信を行う。接続は有線を使った物理的な接続に限られず、無線ＬＡＮやＢｌｕｅｔｏｏｔｈなどの電波、赤外線、光通信などを利用した無線接続などでもよい。さらに観察像のデータ保存などを行うための記録媒体には、メモリカードや磁気ディスク、光ディスク、光磁気ディスク、半導体メモリなどが利用できる。
【００３２】
［データマッチング装置］
以下、本発明に係るデータマッチングの一実施例として、マルチメディアデータの検索に利用した例を図２に基づいて説明する。この図に示すデータマッチング装置１は汎用あるいは専用のコンピュータなどが利用でき、演算部２と主記憶部３と２次記憶部４を備える。演算部２はＣＰＵやＭＰＵ、システムＬＳＩやＩＣなどで構成され、自己組織化マップの作成や特徴量ベクトルの距離計算その他の必要な演算を実行する。演算部２はこれらの処理が可能なようにハードウェア的に構成することもでき、または演算プログラムを実行させてソフトウェア的に実現させてもよい。主記憶部３は、高速な汎用もしくは組み込みメモリで構成され、ＳＤＲＡＭ、ＤＤＲＲＡＭ、ＲＤＲＡＭ、ＥＤＯＲＡＭ、ファーストページＲＡＭといったＲＡＭ等の半導体メモリが利用できる。２次記憶部４はハードディスク（固定ディスク）などの二次記憶媒体で構成され、主記憶部３に比べ大容量のものを使用する。ディスクアクセス速度は速いほど好ましいが、一般に主記憶部３に比して遅い。さらにデータマッチング装置１には必要に応じてマウスやキーボードなどの入力部５が接続される。
【００３３】
データベース６は、検索対象のデータを保存する記憶媒体であり、大容量のハードディスクなどが利用される。一般にはサーバ側のホストコンピュータに内蔵あるいは接続されており、データマッチング装置１と通信可能に接続されている。またデータベース６は、データマッチング装置１内部に備えることもでき、さらに２次記憶部４と兼用することもできる。
【００３４】
データベース６中から所望のデータを検索するための検索条件の入力は、直接特徴量のベクトルを指定する他、予めキーワードを設定しておき、入力されたキーワードに応じた特徴量ベクトルに変換して検索条件とすることもできる。この変換はデータマッチング装置１内部で行われるため、ユーザは特徴量ベクトルを意識することなくキーワードで検索することができる。
【００３５】
検索条件の入力は、データマッチング装置１がスタンドアロンで運用される場合は、入力部５から入力される。またネットワークで運用される場合は、さらにネットワーク接続されたクライアント側のコンピュータや携帯電話などの端末７から入力する。ネットワーク接続としてはＬＡＮやＷＡＮ、インターネットなどが利用できる。この形態では、データマッチング装置１がサーチエンジンとして機能し、各端末から入力された検索条件に対して検索を行った結果をそれぞれの端末に出力する。
【００３６】
［自己組織化マップ］
本発明の実施例では、マルチメディアデータの検索に１次元自己組織化マップを用いて、高速な近似的最近傍検索を行った。自己組織化マップは、高次元空間での近傍関係を可能な限り維持しつつ、低次元空間に高次元のデータを配置するという特性を持っている。一般には、高次元データを２次元上に配置し、人間が相互の配置関係を視覚的に把握できるようにしている。例えば特開平１１−１２０１８０号公報では、自己組織化マップを用いて２次元平面上にデータを配置することにより、類似関係を人間の目で視覚的に把握しやすい位置関係に表す例が示されている。このように自己組織化マップを用いる例では、２次元平面上への展開が一般的である。
【００３７】
これに対し本発明においては、２次元自己組織化マップではなく１次元自己組織化マップにデータを写像した。この理由は、近傍検索の検索時間の効率を向上させるためである。仮に２次元自己組織化マップ上にデータを写像した場合、あるデータの近傍に存在するデータを検索するためには、隣接する８方向についてユニットにアクセスしなければならない。これだと単にアクセス数が増加するだけでなく、不要なデータが多く検索対象となってしまい、時間効率が悪化する。
【００３８】
これに対し１次元自己組織化マップに写像した場合、左右２方向のユニットにだけアクセスすればよく、アクセス数が軽減する。また後述するように、各ユニットのデータを順次２次記憶上に配置することにより、２次記憶上をランダムにアクセスすることなく、ユニットに対応したデータを一括して取得することができるようになり、近傍検索の検索時間効率の向上が望める。また写像処理（マッピング）に要する時間についても、２次元マップを１次元マップとすることでユニット数が減る場合には使用する参照ベクトルも少なくなるためマッピング時間も軽減する。以上のように、本発明では２次元上でなく１次元上に配置することで、より簡素かつ高速な処理を可能としている。さらに後述するように、１次元上に写像する近似でも、精度の低下は殆ど問題とならず、むしろ精度が向上しているとも言えることが実験により確認された。
【００３９】
自己組織化マップ（ｓｅｌｆ−ｏｒｇａｎｉｚｉｎｇ　ｍａｐ：ＳＯＭ）は、教師なし競合学習により、高次元データを低次元データに写像する２階層型のニューラルネットワークである。自己組織化マップでは、高次元空間での近傍関係をできる限り保ちつつ、低次元空間へデータを配置するという位相的整列性と呼ばれる特徴を持っている。自己組織化マップの典型的な適用例は、多次元データの可視化であり、この場合には高次元データを２次元平面上に配置することを行なう。
【００４０】
図３は、ｎ次元の入力データを２次元平面上に配置する自己組織化ネットワークの例を示している。ネットワークの入力層は、２次元平面上に格子状に配置されたすべてのユニットと結合されており、各ユニットには、入力層に入力されるデータと同じ次元数の参照ベクトル（ｒｅｆｅｒｅｎｃｅ　ｖｅｃｔｏｒ）ｍ_１〜６が対応している。学習の過程では、入力層に入力されたベクトルと最も近い参照ベクトルを持つユニットを探し、このユニットとその近傍にあるユニットの参照ベクトルを入力ベクトルに近づけるという操作を繰り返す。このようにして、同じような位相的特徴を持ったユニットが近傍領域に集まり、結果的に入力データの位相的特徴を反映した自己組織化マップが作られることになる。
【００４１】
［自己組織化マップの学習アルゴリズム］
上記の自己組織化マップの学習アルゴリズムを図４のフローチャートに基づいて説明する。
（ステップＳ１）参照ベクトルｍ_ｉをランダムな値で初期化する。
（ステップＳ２）入力ベクトルｘに最も近い参照ベクトルｍ_ｃを持つユニットｃを見つける。その際の計算式は、以下の数１のようになる。
【００４２】
【数１】

【００４３】
（ステップＳ３）ユニットｃおよびｃの近傍領域の参照ベクトルｍ_ｉを以下の数２により更新する。ここで、ｈ_ｃｉはユニットｃから離れるにつれ、小さな値になるように設定する。また、ｈ_ｃｉは学習が進むにつれ、単調に減少するようにする。
【００４４】
【数２】
ｍ_ｉ＝ｍ_ｉ＋ｈ_ｃｉ（ｘ−ｍ_ｉ）
【００４５】
（ステップＳ４）参照ベクトルが収束するまでステップ２に戻って処理を繰り返す。
【００４６】
［自己組織化マップを用いた最近傍検索手法］
以上のようにして作成された自己組織化マップは、高次元空間での近傍関係をできる限り保ちつつ、入力データを低次元空間へ配置することができるという特長を備える。この特長を用いると、高次元空間での最近傍検索を低次元空間での最近傍検索問題に置き換えることができると考えられる。ただ、自己組織化マップの学習には誤差がともなう上、低次元のマップ上では高次元空間での距離が保存されていないため、低次元マップだけを用いて最近傍検索を行うには問題がある。
【００４７】
そこで本発明の実施例では、自己組織化マップにより得られた低次元空間での近傍関係から、最近傍検索の探索範囲を限定し、限定されたデータに関してだけ、元の高次元空間上で距離を計算するという方法を採用した。また、探索範囲の限定を効率的に行なうために１次元の自己組織化マップを用いている。図５に、１次元自己組織化マップを用いた最近傍検索手法の概略を示す。
【００４８】
［多次元インデキシングの作成］
次に図５に基づき、データマッチング装置が１次元自己組織化マップを用いて多次元インデキシングを作成する過程を説明する。
【００４９】
（１）自己組織化マップの学習アルゴリズムにより、多次元データを１次元上に配置する。データマッチング装置１がデータベース６にアクセスし、演算部２はデータベース６に格納されているデータを読み込み、主記憶部３に１次元自己組織化マップを展開する。ユニット数をｋとすると、データはｋ個のクラスタに分割されることになる。
【００５０】
（２）各クラスタに属するデータを、２次記憶部４上の連続した領域に格納する。またこの際、１次元自己組織化マップ上の各ユニットに２次記憶部４上の格納領域を示すポインタを持たせる。なお２次記憶部４には、１次元に変換したデータでなく元の多次元データを格納する。
【００５１】
多次元データをハードディスクなどの２次記憶部４上の連続した領域に格納することで、効率的なアクセスが可能となる。アクセス速度や回数、効率等は、使用するディスクアクセスのアルゴリズムに依存するものの、一般にはディスクへのアクセス要求が生じると、目標となるデータのみならず、その近傍のデータも一括してメモリに取り込まれる。したがって必要なデータがまとまって存在しているほど、一回のアクセスで取り込まれたデータ中に必要なデータが含まれているヒット率は高くなり、結果的にアクセス回数は少なくて済むことになる。これによって効率的なアクセスが実現され、高速化が見込まれる。特にハードディスクのアクセス速度はメモリに比べて極めて遅く、処理速度向上のボトルネックとなっていたため、この部分での効率化は処理速度の劇的な改善をもたらす。
【００５２】
［最近傍検索］
以上のようにして作成された多次元インデキシングを用いて、データマッチング装置は以下の手順で最近傍検索を実行する。
【００５３】
（１）入力された検索条件のベクトルに最も近い参照ベクトルを持つユニットｃを、演算部２は主記憶部３の１次元自己組織化マップから探し出す。
【００５４】
（２）１次元自己組織化マップ中から探し出されたユニットｃの、近傍に位置するユニットを１次元自己組織化マップから選択する。選択されたユニットが保持するポインタに従って、各ユニットに属する多次元データを２次記憶部４から参照し、ベクトル間の距離計算を行う。このようにして、参照ベクトルを持つユニットおよびその近傍に位置するユニットに属するデータに対してのみ、検索条件ベクトルとの距離計算を行う。
【００５５】
（３）上記で計算された結果を、距離の小さい順にソートし、これを最近傍検索の結果として出力する。
【００５６】
以上のようにして、１次元自己組織化マップを用いた最近傍検索の結果が得られる。検索条件ベクトルと比較して距離計算が行われるデータは、２次記憶部４上の連続した領域に格納されているため、２次記憶部４へのアクセスはきわめて効率的に行うことが可能となる。実験の結果、１次元自己組織化マップ上の各ユニットに割り当てられる多次元データの数は大きく偏ることなく、概ね平均化している。したがって、２次記憶部４へのアクセス回数は数回程度に抑えられる。
【００５７】
［距離尺度］
多次元ベクトル間の距離尺度には様々なものがある。本発明の実施例においてはユークリッド距離を用いているが、本発明はこれに限定されるものでない。
【００５８】
一般的に用いられている距離尺度には、Ｌｐノルムあるいはミンコフスキー距離と呼ばれているものがある。これは、数３のベクトルに対して、数４のように定義される。
【００５９】
【数３】

【００６０】
【数４】

【００６１】
数４において、ｐ＝１の場合を特にマンハッタン距離、またｐ＝２の場合をユークリッド距離と呼ぶ。Ｌｐノルム以外にも、マハラノビス距離などがあるが、自己組織化マップでは通常、ユークリッド距離が用いられる。ただこれら以外にも、ｅａｒｔｈ　ｍｏｖｅｒ’ｓ　ｄｉｓｔａｎｃｅ、ｔａｎｇｅｎｔ　ｄｉｓｔａｎｃｅ、Ｈａｕｓｄｏｒｆｆ　ｄｉｓｔａｎｃｅ、ｕｎｆｏｌｄｅｄｄｉｓｔａｎｃｅなど、様々な距離尺度があり、本発明においても適宜利用できる。
【００６２】
［実施例１］
１次元自己組織化マップを用いた最近傍検索手法の有効性を調べるために、類似画像検索実験と文書検索実験を行った。まず、類似画像検索実験について説明する。類似画像検索実験では、Ｃｏｒｅｌデータベースから抽出した４２，３８１件のカラー写真画像を用いた。また、このうち、４２４件（全体の１％）の画像データをランダムに抽出し、検索画像とした。これらの画像データから、図６に示すような、次元数の異なる４種類の特徴量ベクトルを作成した。
【００６３】
自己組織化マップを用いた最近傍検索の精度を調べるためには、検索された画像のうち、どれが正解であるかという情報が必要である。このため、各検索画像と全画像データとの間のユークリッド距離を線形探索により求め、距離の小さい４００件を正解データとした。与えられた検索画像から、自己組織化マップを用いた最近傍検索により、上位４００件の検索結果を出力し、これを正解データと比較することにより適合率を算出した。
【００６４】
自己組織化マップを用いた最近傍検索では、検索条件によって適合率は変化する。適合率が変化する主な要因は、１次元マップ上の総ユニット数、および、検索の際に用いる近傍数である。ここで、近傍数とは、探索候補の絞り込みの際にいくつのユニットを参照したかを意味しており、具体的には、検索質問の属するユニットに加え、その近傍のユニットをいくつ参照したかを示す。以下では、検索質問の属するユニットのみを参照したときは近傍数１、検索質問の属するユニットに加え、その左右両側のユニットを参照したときは近傍数３というように表すことにする。なお、近傍数３の際、検索質問が１次元マップ上の左端（あるいは右端）のユニットに属している場合には、そのユニットの右側（あるいは左側）しか参照しない。
【００６５】
図７は、ユニット数が１０、近傍数３のときの適合率曲線を、また図８は、ユニット数が２０、近傍数３のときの適合率曲線をそれぞれ示している。横軸方向は検索結果数を、縦軸方向は平均適合率を表しており、グラフは上位ｎ件の結果が検索された時点での平均適合率をプロットしたものである。また図９は、さまざまな条件のもとでの平均Ｒ適合率と検索質問１件当たりの平均距離計算回数を示している。
【００６６】
ここでＲ適合率（Ｒ−ｐｒｅｃｉｓｉｏｎ）とは、検索質問に適合する結果の総数をＲとするとき、上位からＲ番目までの検索結果を出力した時点での適合率を意味する。Ｒ適合率は、上位に順位付けされた検索結果の有効性を示す評価尺度である。
【００６７】
図９から分かるように、同じ検索条件のもとでは、適合率は次元数が大きくなるにつれ低下する傾向にあるが、距離計算回数は次元数によらずにほぼ一定である。したがってこの方法は、次元数が増大した場合にも高速性が失われることはない。なお、距離計算回数が次元数によらず一定である理由は、各ユニットに割り当てられるデータ数が概ね平均化しているためである。図１０に、ユニット数２０の際に各ユニットに割り当てられたデータ数を示すが、１次元マップ上の各ユニットに割り当てられるデータ数が極端にばらついていないことを読み取ることができる。
【００６８】
［実施例２］
以上の実施例１では、類似画像検索を対象にした実験結果を示した。画像特徴量の次元数は、数１０〜数１００次元程度であるが、これよりも次元数が大きい場合の手法の有効性を調べるために、実施例２としてベクトル空間モデル（ｖｅｃｔｏｒ　ｓｐａｃｅ　ｍｏｄｅｌ：ＶＳＭ）に基づく文書検索を対象とした実験を行なった。
【００６９】
文書検索のベクトル空間モデルでは、文書中から索引語を抽出し、文書を索引語の出現頻度に基づくベクトルで表現する。文書ベクトルの次元数は、文書集合全体にわたる索引語の総数と等しいため、次元数はきわめて大きくなる。
【００７０】
この実施例２では、検索対象のデータベースとして、情報検索評価用のテストコレクションであるＭＥＤＬＩＮＥを用いた。ＭＥＤＬＩＮＥは、英文の検索対象文書１，０３３文書、検索質問３０文書から成る小規模なコレクションであり、各検索質問にはどの文書が適合しているかという適合情報が用意されている。なお、各検索質問に対する平均適合文書数は２３．２文書である。
【００７１】
まず前処理として、ＭＥＤＬＩＮＥコレクションから「ａ」や「ａｂｏｕｔ」などの不要語４３９単語、および全文書中に１回しか出現しなかった単語を削除した。その後、ポーター・アルゴリズム（Ｐｏｒｔｅｒ　ａｌｇｏｒｉｔｈｍ）によるステミングを行なった結果、４，３２９個の索引語が得られた。以上の処理により得られた索引語から４，３２９次元の文書ベクトルを構成した。この際、索引語の重み付けとして、局所的重み付けには対数化索引語頻度を、大域的重み付けにはエントロピーを、文書正規化にはコサイン正規化を用いた。
【００７２】
文書検索の評価では、通常のベクトル空間モデルに基づく最近傍検索（線形探索）と自己組織化マップを用いた最近傍検索の両者とも３０件の検索結果を出力し、出力結果をＭＥＤＬＩＮＥの適合情報と比較することにより適合率を求めた。この際、自己組織化マップを用いた最近傍検索では、ユニット数２０、近傍数３の条件で検索を行なった。図１１に、文書検索の適合率曲線を示すが、自己組織化マップを用いた検索のほうがわずかながら良い結果を与えている。なお、ベクトル空間モデルに基づく検索のＲ適合率は０．５３であり、自己組織化マップを用いた検索のＲ適合率は０．５８であった。また、自己組織化マップを用いた最近傍検索の平均距離計算回数は１検索質問当たり１４１回であり、これは線形探索の約１／７に相当する。従って、条件が同一であれば処理時間も約１／７に短縮することができ、高速化が実現される。
【００７３】
以上はＭＥＤＬＩＮＥコレクションの適合情報に対する評価であるが、次に、自己組織化マップによる最近傍検索の近似誤差について述べる。ベクトル空間モデルの検索結果を正解とみなした場合、自己組織化マップを用いた検索結果のＲ適合率は０．６８であった。したがって、上位３０件までの検索では３２％の近似誤差が生じていることになる。しかし、近似誤差があるにもかかわらず、ＭＥＤＬＩＮＥの適合情報に対する評価では、通常のベクトル空間モデルよりも適合率が高くなっている。このことから、最近傍検索を近似するという意味からは結果に若干の誤差があるものの、検索そのものの絶対的な評価として考えれば、通常の最近傍検索よりも精度が高いということができ、処理速度の短縮に加えて精度の面でも優れていることが明らかとなった。
【００７４】
【発明の効果】
以上のように、本発明のデータマッチング方法、データマッチング装置、データマッチングプログラムおよびコンピュータで読み取り可能な記録媒体は、従来の線形探索方式に比べて、極めて効率的なデータマッチングを実行可能であり、処理に要する計算量を極減し、処理時間を大幅に短縮できるという特長を実現する。それは、本発明のデータマッチング方法、データマッチング装置、データマッチングプログラムおよびコンピュータで読み取り可能な記録媒体が、自己組織化マップを用いて高次元データを従来のような２次元空間でなく、１次元空間上に配置することにより、最近傍検索の検索範囲を限定し、限定された範囲に存在するデータに対してのみベクトルの距離計算を行っているからである。これによって、最近傍検索の検索範囲を大きく削減することができ、膨大なデータベース中のデータすべてに対し逐次的に距離計算を行う線形探索に比べて、計算量を飛躍的に低減できる。
【００７５】
さらに、本発明は１次元空間上に配置された１次元自己組織化マップに基づいて、１次元に変換されたデータでなく高次元のデータを２次記憶媒体上の連続した領域に格納する構成としている。このため、２次記憶媒体から主記憶素子上に読み込まれるデータには連続した一連のデータが含まれることとなり、少ないアクセス回数で必要なデータを取り込むことが可能となり、この結果ディスクアクセスの回数が極めて少なくなり、アクセス速度の遅さがボトルネックとなっていた問題を解消し、高速かつ高効率のデータマッチング処理を可能としている。このため、大規模なデータ集合に対しても、きわめて高速な最近傍検索を行なうことが可能である。
【図面の簡単な説明】
【図１】特徴量ベクトルを利用したマルチメディア・コンテンツ検索の一例を示す概念図である。
【図２】本発明の一実施例に係るデータマッチング装置の一例を示す概略ブロック図である。
【図３】ｎ次元の入力データを２次元平面上に配置する自己組織化ネットワークの一例を示す概念図である。
【図４】自己組織化マップの学習アルゴリズムの一例を示すフローチャートである。
【図５】本発明の一実施例に係る１次元自己組織化マップを用いた最近傍検索を示す概念図である。
【図６】本発明の実施例１に係る画像検索に使用した特徴量ベクトルの概略を示す表である。
【図７】本発明の実施例１に係る画像検索の結果としてユニット数が１０、近傍数３のときの適合率曲線を示すグラフである。
【図８】本発明の実施例１に係る画像検索の結果としてユニット数が２０、近傍数３のときの適合率曲線を示すグラフである。
【図９】本発明の実施例１に係る画像検索の結果として平均Ｒ適合率および平均距離計算回数を示す表である。
【図１０】ユニット数２０において各ユニットに割り当てられたデータ数を示すグラフである。
【図１１】本発明の実施例２に係る文書検索の結果として適合率曲線を示すグラフである。
【符号の説明】
１・・・データマッチング装置
２・・・演算部
３・・・主記憶部
４・・・２次記憶部
５・・・入力部
６・・・データベース
７・・・端末[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a data matching method, a data matching device, a data matching program, and a computer-readable recording medium by high-speed neighborhood search of high-dimensional data using a one-dimensional self-organizing map.
[0002]
[Prior art]
With the recent increase in the performance of computers and other electronic computers, the increase in storage capacity, and the price reduction, the rapid spread of computerization of information and IT has dramatically increased the opportunity for the use of computerized data. Has increased. The digitized data is easier to copy, process, and share than paper data, and is superior in terms of retrieval and the like. Particularly, recently, with the spread of the Internet, multimedia data such as images, videos, and audio data as well as documents has been frequently used. Against this background, techniques for searching, classifying, and organizing desired data and similar data have become important. Here, data matching includes multimedia data search, data mining, pattern recognition, machine learning, computer vision, statistical data analysis, and the like.
[0003]
When performing data matching with a computer, the multimedia data can be represented by a vector of a feature amount inside the computer. The feature vector can also be used when searching data similar to the specified search condition from the database. FIG. 1 shows an example of a multimedia content search using a feature vector. When a vector of a feature amount is designated as a search condition for performing a similarity search, the search execution calculates a distance between the vector of the search condition and a vector in the database, and outputs a search result having a small distance as a search result. Retrieving a vector having a small distance from a vector given as a condition from the database in this manner is called a nearest neighbor search. In the nearest neighbor search, a plurality of feature amounts are represented by a multidimensional vector, and the similarity between data is determined based on the distance between the vectors. For example, in the case of a document search, a document or a search condition can be expressed by a weight vector of an index word. In the case of an image similarity search, image data is represented by a feature vector including a color histogram, a texture feature, a shape feature, and the like.
[0004]
As a similarity search for content based on such a feature amount vector, a linear search is known. In the linear search, since the feature amount vectors of all data in the database are sequentially compared with the search condition vector, a calculation amount proportional to the size of the database is required. An increase in the amount of calculation leads to an increase in the processing load on the computer side and an increase in the required processing time. For this reason, the processing efficiency of the search system when the size of the database increases is seriously affected. Therefore, the development of a multidimensional indexing technique for efficiently performing the nearest neighbor search has been actively studied as an important issue.
[0005]
[Problems to be solved by the invention]
However, despite such research, no definitive method has been developed. Since the number of dimensions of the feature vector is very large, it is not easy to develop an efficient multidimensional indexing technique in a high-dimensional space.
[0006]
For example, R-tree, SS-tree, SR-tree, and the like have been proposed as a multidimensional indexing method in the Euclidean space. In addition, VP-tree, MVP-tree, M-tree, and the like have been proposed as indexing techniques for a more general metric space. These indexing techniques are based on limiting a search range by dividing a multidimensional space hierarchically. If the search range is limited, the amount of calculation can be reduced accordingly. However, in a high-dimensional space, a phenomenon occurs in which a distance difference does not occur between the nearest point and the farthest point of a certain point. This phenomenon known as "curse of dimensionality" makes it impossible to limit the area to be searched, resulting in a problem that a calculation amount close to a linear search is required.
[0007]
In order to address the above-mentioned problems in a high-dimensional space, research on approximate nearest neighbor search has been advanced. For example, a method of indexing points in a high-dimensional space using an approximate search method based on a hash method or a space-filling curve has been proposed, but has not been put to practical use.
[0008]
On the other hand, in a cross-media information search in which document data (text) and image data are mixed, it is difficult to obtain a desired search result by a single search, and a user may obtain a desired search result by performing a plurality of exchanges. There are many. Therefore, in the cross-media information search, the number of executions of the nearest neighbor search based on the feature amount vector inevitably increases. In such a case, a complete nearest neighbor search is not necessary, but rather a fast approximate nearest neighbor search is more desirable.
[0009]
Furthermore, in searching from an enlarged database in recent years, processing cannot be performed only by a semiconductor memory such as a high-speed RAM, and a secondary storage medium such as a hard disk having a larger capacity but an inferior access speed is used. It is common to be performed. In this case, since the access speed to the hard disk is much lower than that of the semiconductor memory, there is a problem that the access speed to the hard disk becomes a bottleneck and the search processing speed cannot be increased.
[0010]
The present invention has been made to solve such a problem. A main object of the present invention is to use a one-dimensional self-organizing map as an approximate nearest neighbor search, and to realize efficient and high-speed data matching by making access to a secondary storage medium more efficient. A data search device, a data search method, a data search program, and a computer-readable recording medium are provided.
[0011]
[Means for Solving the Problems]
In order to achieve the above object, a data matching method according to claim 1 of the present invention specifies a condition for multidimensional data having a plurality of feature amounts and capable of expressing these feature amounts by a vector. And a data matching method for extracting desired data. This data matching method includes the steps of creating a one-dimensional self-organizing map divided into units having reference vectors of the same dimension based on a multidimensional data group to be subjected to data matching; Performing multidimensional indexing based on the dimensional self-organizing map, storing multidimensional data in a continuous area on a secondary storage unit, and expressing a matching condition for extracting the desired data as a vector, Retrieving a unit having a reference vector closest to the vector from the one-dimensional self-organizing map, and selecting a unit located near the unit retrieved from the one-dimensional self-organizing map from the one-dimensional self-organizing map. And storing the multi-dimensional data belonging to the selected unit from the secondary storage unit. Irradiation and characterized by comprising a step of performing distance calculation between vectors, and outputting to the result of the vector distance calculation performs a predetermined process.
[0012]
In the data matching method according to a second aspect of the present invention, in addition to the features described in the first aspect, the step of creating the one-dimensional self-organizing map includes a step in which the input layer is connected to all units. Search for the unit with the closest reference vector to the vector input to the input layer in the self-organizing network where each unit has a reference vector of the same dimension as the data input to the input layer. And a step of bringing reference vectors of the searched unit and units in the vicinity thereof to an input vector.
[0013]
Further, in the data matching method according to the third aspect of the present invention, in addition to the features described in the second aspect, the learning algorithm for the one-dimensional self-organizing map may be configured such that a reference vector is initialized with a random value. Transforming, detecting a unit having a reference vector closest to the input vector, updating the reference vector of the detected unit and its neighboring area, and repeating the above steps until the reference vector converges And characterized in that:
[0014]
Furthermore, in the data matching method according to a fourth aspect of the present invention, in addition to the features according to any one of the first to third aspects, the data matching may be performed by a data matching a specified search condition. Search.
[0015]
Furthermore, in the data matching method according to claim 5 of the present invention, in addition to the features described in claim 4, the data to be searched is document data, still image or moving image, It is characterized by being any of the data or a combination thereof.
[0016]
Furthermore, a data matching method according to claim 6 of the present invention is characterized in that, in addition to the features described in any one of claims 1 to 3, the data matching is image pattern recognition. And
[0017]
Furthermore, a data matching method according to a seventh aspect of the present invention is characterized in that, in addition to the features described in any one of the first to third aspects, the data matching is data mining. .
[0018]
The data matching apparatus according to claim 8 of the present invention extracts desired data by designating conditions for multidimensional data having a plurality of feature amounts and capable of expressing these feature amounts by vectors. Is what you do. This data matching device is divided into a unit having a reference vector having the same number of dimensions based on a database holding a multidimensional data group to be subjected to data matching and a multidimensional data group held in the database. An operation unit for creating a one-dimensional self-organizing map; a main storage unit for performing multidimensional indexing based on the one-dimensional self-organizing map created by the operation unit; A secondary storage unit that stores the multidimensional data in a continuous area based on the multidimensional indexing. Further, the matching condition for extracting the desired data is expressed as a vector, a unit having a reference vector closest to the matching condition vector is searched from the one-dimensional self-organizing map, and the unit is searched from the one-dimensional self-organizing map. A unit located in the vicinity of the unit is selected from the one-dimensional self-organizing map, multidimensional data belonging to the selected unit is referred from the secondary storage unit, a distance between vectors is calculated, and the vector distance calculation is performed. Is subjected to predetermined processing and output.
[0019]
Further, the data matching program according to the ninth aspect of the present invention extracts desired data by specifying conditions for multidimensional data having a plurality of feature amounts and capable of expressing these feature amounts by vectors. Data matching program. The data matching program creates a one-dimensional self-organizing map divided into units having the same number of reference vectors based on a multidimensional data group to be subjected to data matching. A process of performing multidimensional indexing based on the dimensional self-organizing map and storing multidimensional data in a continuous area on a secondary storage unit, and expressing a matching condition for extracting the desired data as a vector, A process of searching the one-dimensional self-organizing map for a unit having a reference vector closest to the vector, and selecting a unit located near the unit searched from the one-dimensional self-organizing map from the one-dimensional self-organizing map. Processing, referring to the secondary storage unit for multidimensional data belonging to the selected unit, A process of performing a distance calculation between Torr, to execute a process of outputting performing predetermined processing on the result of the vector distance calculation computer.
[0020]
Further, a computer-readable recording medium according to a tenth aspect of the present invention stores the data matching program according to the ninth aspect. Recording media include magnetic disks such as CD-ROM, CD-R, CD-RW, flexible disk, magnetic tape, MO, DVD-ROM, DVD-RAM, DVD-R, DVD-RW, DVD + RW, and DVD + R, and optical disks. , A magneto-optical disk, a semiconductor memory, and other media capable of storing programs.
[0021]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings. However, the embodiments described below exemplify a data matching method, a data matching device, a data matching program, and a computer-readable recording medium for embodying the technical idea of the present invention. Does not specify a data matching method, a data matching device, a data matching program, and a computer-readable recording medium as follows.
[0022]
Further, in this specification, in order to make it easy to understand the claims, the numbers corresponding to the members described in the embodiments will be referred to as “claims” and “means for solving the problems”. Column). However, the members described in the claims are not limited to the members of the embodiments. In addition, the size, the positional relationship, and the like of the members illustrated in each drawing may be exaggerated for clarity of description.
[0023]
In the present specification, data to be subjected to data matching includes document data such as text, image data that is a still image or a moving image, and multimedia data such as audio data such as music, performances, performances, and speeches. Data matching includes multimedia data search, data mining, pattern recognition, machine learning, computer vision, statistical data analysis, etc. from a single type of data such as documents and videos, or a mixed database of multiple types of data. included. Here, the data mining refers to a process of automatically finding useful information from various and large amounts of data by a statistical and mathematical method. Useful information includes data trends, patterns, correlations, rules, and the like. Methods used in data mining include statistical data analysis, decision trees, neural networks, and the like. In these methods, data is often represented by multidimensional vectors. In such a case, the nearest neighbor search of the present invention can be used for a process for searching for data similar to certain data.
[0024]
[Feature vector]
In this specification, various types of feature amount vectors can be set according to the type of electronic data (media content). When performing a search according to various media contents, if the contents of the entire medium included in the database, that is, the data itself is used, very large-scale data must be handled. For this reason, a feature amount that prominently represents the content of the data content is used. The feature amount is represented as a feature amount vector in a multidimensional vector format. Here, the multi-dimension will be described. When data has a property of n attributes and is expressed by a sequence of n attribute values, this data is called n-dimensional data, and each data is arranged in an n-dimensional space. Is done. In general, a case where n is large is called multidimensional data, and when searching for each data, a search is made in a multidimensional space.
[0025]
As the feature amount representing the document content, a word that remarkably represents the document content is extracted as an index word from words appearing in the document, and the frequency of the index word is used as the feature amount.
[0026]
Color information, shape information, and texture information are used as feature amounts representing image contents. The color information is obtained by converting a color distribution in an image into a histogram in accordance with an RGB color system, a CIE Lab color system, or the like, to obtain a multidimensional vector. As the shape information and the texture information, a value obtained by frequency decomposition by Wavelet transform or the like is used as a multidimensional vector.
[0027]
As the feature amount representing the video content, the magnitude of the motion vector between the images is represented by a multidimensional vector, and the feature amount of the video is used.
[0028]
As the feature amount representing the music content, a temporal change of the pitch and a distribution of the pitch difference represented by a multidimensional vector based on the pitch of each sound appearing in the music are used as the feature amount.
[0029]
In addition, the technology for expressing content features using multidimensional vectors and searching for data with similar contents is not limited to the multimedia information search field described above, but also includes data mining, pattern recognition, machine learning, computer vision, statistics It is used in a wide range of fields such as data analysis. In these fields, values of various attributes of data are represented by multidimensional vectors as feature amounts of the data.
[0030]
In this specification, a data matching method, a data matching apparatus, a data matching program, and a computer-readable recording medium are used for a data matching system itself, and input / output, display, calculation, communication, and other processes related to data matching. The present invention is not limited to a hardware-based device or method. An apparatus and a method for realizing the processing by software are also included in the scope of the present invention. For example, a general-purpose circuit or computer incorporating software, programs, plug-ins, objects, libraries, applets, compilers, etc., to enable data matching itself or related processing, general-purpose or special-purpose computers, workstations, terminals, mobile phones Mobile devices such as PDC, CDMA, W-CDMA, FORMA, PHS, PDAs, pagers, smartphones and other electronic devices can also be read by the data matching method, data matching device, data matching program and computer of the present invention. Included in at least one of various recording media. In this specification, the program itself is also included in the data matching device.
[0031]
Terminals such as computers used in the embodiments of the present invention, and servers and operations and controls connected to them, control, display, various processing and other computers, or connection with other peripherals such as printers, for example, Communication is performed by serial connection such as IEEE 1394, RS-232C, RS-422, or USB, parallel connection, or electrical connection via a network such as 10BASE-T or 100BASE-TX. The connection is not limited to a physical connection using a wire, but may be a wireless connection using radio waves such as a wireless LAN or Bluetooth, infrared rays, optical communication, or the like. Further, as a recording medium for storing data of an observed image, a memory card, a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like can be used.
[0032]
[Data matching device]
Hereinafter, as an embodiment of data matching according to the present invention, an example used for searching multimedia data will be described with reference to FIG. A data matching device 1 shown in FIG. 1 can use a general-purpose computer or a dedicated computer, and includes a calculation unit 2, a main storage unit 3, and a secondary storage unit 4. The operation unit 2 is configured by a CPU, an MPU, a system LSI, an IC, and the like, and executes a self-organizing map, a distance calculation of a feature amount vector, and other necessary operations. The arithmetic unit 2 may be configured by hardware so that these processes can be performed, or may be realized by software by executing an arithmetic program. The main storage unit 3 is configured by a high-speed general-purpose or embedded memory, and a semiconductor memory such as a RAM such as an SDRAM, a DDRRAM, an RDRAM, an EDORAM, and a first page RAM can be used. The secondary storage unit 4 is composed of a secondary storage medium such as a hard disk (fixed disk) and has a larger capacity than the main storage unit 3. The faster the disk access speed, the better, but it is generally slower than the main storage unit 3. Further, an input unit 5 such as a mouse or a keyboard is connected to the data matching device 1 as needed.
[0033]
The database 6 is a storage medium for storing data to be searched, and uses a large-capacity hard disk or the like. Generally, it is built in or connected to a host computer on the server side, and is communicably connected to the data matching device 1. Further, the database 6 can be provided inside the data matching device 1, and can also be used as the secondary storage unit 4.
[0034]
In order to input search conditions for searching for desired data from the database 6, in addition to directly specifying a feature amount vector, a keyword is set in advance and converted into a feature amount vector corresponding to the input keyword. Search conditions can also be used. Since this conversion is performed inside the data matching apparatus 1, the user can search by keyword without being conscious of the feature amount vector.
[0035]
The search condition is input from the input unit 5 when the data matching device 1 is operated stand-alone. In the case of operating on a network, the data is further input from a terminal 7 such as a client computer or a mobile phone connected to the network. As a network connection, a LAN, a WAN, the Internet, or the like can be used. In this embodiment, the data matching device 1 functions as a search engine, and outputs a result of a search performed on search conditions input from each terminal to each terminal.
[0036]
[Self-organizing map]
In the embodiment of the present invention, a high-speed approximate nearest neighbor search is performed using a one-dimensional self-organizing map for multimedia data search. The self-organizing map has the property of arranging high-dimensional data in a low-dimensional space while maintaining the neighborhood relation in a high-dimensional space as much as possible. In general, high-dimensional data is arranged two-dimensionally so that humans can visually grasp mutual arrangement relations. For example, Japanese Patent Application Laid-Open No. 11-120180 discloses an example in which data is arranged on a two-dimensional plane using a self-organizing map, and a similar relationship is represented as a positional relationship that can be easily visually recognized by human eyes. ing. In such an example using the self-organizing map, development on a two-dimensional plane is common.
[0037]
On the other hand, in the present invention, data is mapped to a one-dimensional self-organizing map instead of a two-dimensional self-organizing map. The reason for this is to improve the efficiency of the search time of the neighborhood search. If data is mapped on a two-dimensional self-organizing map, the unit must be accessed in eight adjacent directions in order to search for data existing near certain data. This not only increases the number of accesses, but also causes a large amount of unnecessary data to be searched, resulting in poor time efficiency.
[0038]
On the other hand, when the image is mapped onto the one-dimensional self-organizing map, it is only necessary to access the units in the two right and left directions, and the number of accesses is reduced. In addition, as described later, by sequentially arranging the data of each unit on the secondary storage, data corresponding to the unit can be collectively obtained without randomly accessing the secondary storage. Thus, the search time efficiency of the neighborhood search can be improved. Also, regarding the time required for the mapping process (mapping), when the number of units is reduced by converting the two-dimensional map into a one-dimensional map, the number of reference vectors to be used is reduced, so that the mapping time is also reduced. As described above, according to the present invention, simpler and faster processing is possible by arranging one-dimensional instead of two-dimensional. Further, as will be described later, it has been confirmed by an experiment that even in the approximation in which mapping is performed on one dimension, a decrease in accuracy hardly causes a problem, and it can be said that the accuracy is rather improved.
[0039]
The self-organizing map (SOM) is a two-layer neural network that maps high-dimensional data to low-dimensional data by unsupervised competitive learning. The self-organizing map has a feature called topological alignment in which data is arranged in a low-dimensional space while maintaining a neighborhood relation in a high-dimensional space as much as possible. A typical application of the self-organizing map is visualization of multi-dimensional data, in which high-dimensional data is arranged on a two-dimensional plane.
[0040]
FIG. 3 shows an example of a self-organizing network that arranges n-dimensional input data on a two-dimensional plane. The input layer of the network is connected to all units arranged in a lattice on a two-dimensional plane, and each unit has a reference vector m having the same dimension as data input to the input layer. _1-6 Is supported. In the learning process, an operation of searching for a unit having a reference vector closest to the vector input to the input layer, and bringing the reference vectors of this unit and its neighboring units close to the input vector is repeated. In this way, units having similar topological characteristics are gathered in the neighboring area, and as a result, a self-organizing map reflecting the topological characteristics of the input data is created.
[0041]
[Self-organizing map learning algorithm]
The learning algorithm of the self-organizing map will be described based on the flowchart of FIG.
(Step S1) Reference vector m _i Is initialized with a random value.
(Step S2) Reference vector m closest to input vector x _c Find unit c with The calculation formula at that time is as shown in Equation 1 below.
[0042]
(Equation 1)

[0043]
(Step S3) Units c and reference vector m in the neighborhood of c _i Is updated by the following equation (2). Where h _ci Is set so as to become smaller as the distance from the unit c increases. Also, h _ci Is to decrease monotonically as learning progresses.
[0044]
(Equation 2)
m _i = M _i + H _ci (X-m _i )
[0045]
(Step S4) Return to step 2 and repeat the process until the reference vector converges.
[0046]
[Nearest Neighbor Search Method Using Self-Organizing Map]
The self-organizing map created as described above has a feature that input data can be arranged in a low-dimensional space while maintaining a neighborhood relation in a high-dimensional space as much as possible. By using this feature, it is considered that the nearest neighbor search in a high-dimensional space can be replaced with the nearest neighbor search problem in a low-dimensional space. However, learning a self-organizing map involves errors, and distances in a high-dimensional space are not preserved on a low-dimensional map. is there.
[0047]
Therefore, in the embodiment of the present invention, the search range of the nearest neighbor search is limited based on the neighborhood relation in the low-dimensional space obtained by the self-organizing map, and only the limited data has a distance in the original high-dimensional space. Was calculated. In addition, a one-dimensional self-organizing map is used to efficiently limit the search range. FIG. 5 shows an outline of a nearest neighbor search method using a one-dimensional self-organizing map.
[0048]
[Create multidimensional indexing]
Next, a process in which the data matching device creates multidimensional indexing using the one-dimensional self-organizing map will be described with reference to FIG.
[0049]
(1) Multidimensional data is arranged on one dimension by a learning algorithm of a self-organizing map. The data matching device 1 accesses the database 6, the operation unit 2 reads data stored in the database 6, and develops a one-dimensional self-organizing map in the main storage unit 3. Assuming that the number of units is k, the data is divided into k clusters.
[0050]
(2) Data belonging to each cluster is stored in a continuous area on the secondary storage unit 4. At this time, each unit on the one-dimensional self-organizing map is provided with a pointer indicating a storage area on the secondary storage unit 4. The secondary storage unit 4 stores the original multidimensional data instead of the one-dimensionally converted data.
[0051]
Storing multidimensional data in a continuous area on the secondary storage unit 4 such as a hard disk enables efficient access. The access speed, number of times, efficiency, etc. depend on the disk access algorithm used, but in general, when a disk access request occurs, not only the target data, but also the data in the vicinity of the target data is loaded into the memory at once. It is. Therefore, the more the required data is present, the higher the hit rate in which the required data is included in the data fetched by one access, and consequently the number of accesses is reduced. . As a result, efficient access is realized, and higher speed is expected. In particular, the access speed of the hard disk is extremely slower than that of the memory, and has become a bottleneck for improving the processing speed. Therefore, the efficiency improvement in this portion leads to a dramatic improvement in the processing speed.
[0052]
[Nearest Neighbor Search]
Using the multidimensional indexing created as described above, the data matching apparatus executes the nearest neighbor search in the following procedure.
[0053]
(1) The arithmetic unit 2 searches the one-dimensional self-organizing map of the main storage unit 3 for a unit c having a reference vector closest to the vector of the input search condition.
[0054]
(2) A unit located close to the unit c found in the one-dimensional self-organizing map is selected from the one-dimensional self-organizing map. According to the pointer held by the selected unit, the multidimensional data belonging to each unit is referred from the secondary storage unit 4 to calculate the distance between the vectors. In this way, only the data belonging to the unit having the reference vector and the unit located in the vicinity of the unit have the distance calculation with the search condition vector.
[0055]
(3) The results calculated above are sorted in ascending order of the distance, and are output as the results of the nearest neighbor search.
[0056]
As described above, the result of the nearest neighbor search using the one-dimensional self-organizing map is obtained. Since data for which distance calculation is performed in comparison with the search condition vector is stored in a continuous area on the secondary storage unit 4, access to the secondary storage unit 4 can be performed extremely efficiently. Become. As a result of the experiment, the number of multidimensional data assigned to each unit on the one-dimensional self-organizing map is substantially averaged without being largely biased. Therefore, the number of accesses to the secondary storage unit 4 can be suppressed to about several times.
[0057]
[Distance scale]
There are various distance measures between multidimensional vectors. Although the Euclidean distance is used in the embodiment of the present invention, the present invention is not limited to this.
[0058]
Some commonly used distance scales are called Lp norm or Minkowski distance. This is defined as Equation 4 for the vector of Equation 3.
[0059]
[Equation 3]

[0060]
(Equation 4)

[0061]
In Equation 4, the case where p = 1 is particularly called the Manhattan distance, and the case where p = 2 is called the Euclidean distance. In addition to the Lp norm, there is a Mahalanobis distance and the like, but the Euclidean distance is usually used in the self-organizing map. However, in addition to these, there are various distance scales such as an earth mover's distance, a tangent distance, a Hausdorff distance, and an unfolded distance, which can be appropriately used in the present invention.
[0062]
[Example 1]
Similar image search experiments and document search experiments were performed to examine the effectiveness of the nearest neighbor search method using a one-dimensional self-organizing map. First, a similar image search experiment will be described. In the similar image search experiment, 42,381 color photographic images extracted from the Corel database were used. Of these, 424 (1% of the total) image data were randomly extracted and used as search images. From these image data, four types of feature amount vectors having different numbers of dimensions as shown in FIG. 6 were created.
[0063]
In order to check the accuracy of the nearest neighbor search using the self-organizing map, information on which of the searched images is the correct answer is required. For this reason, the Euclidean distance between each search image and all image data was obtained by a linear search, and 400 cases with a small distance were determined as correct data. From the given search image, the top 400 search results were output by nearest neighbor search using a self-organizing map, and the relevance was calculated by comparing this with the correct answer data.
[0064]
In the nearest neighbor search using the self-organizing map, the precision changes depending on the search condition. The main factors that change the precision are the total number of units on the one-dimensional map and the number of neighbors used for searching. Here, the number of neighbors means how many units have been referred to when narrowing down the search candidates. Specifically, in addition to the unit to which the search query belongs, how many units in the vicinity have been referred to Is shown. In the following description, the number of neighbors is expressed as 1 when referring only to the unit to which the search query belongs, and the number of neighbors as 3 when referring to the left and right units in addition to the unit to which the search query belongs. When the number of neighbors is 3, if the search query belongs to the leftmost (or rightmost) unit on the one-dimensional map, only the right (or left) side of that unit is referred to.
[0065]
FIG. 7 shows a precision curve when the number of units is 10 and the number of neighbors is 3, and FIG. 8 shows a precision curve when the number of units is 20 and the number of neighbors is 3, respectively. The horizontal axis represents the number of search results and the vertical axis represents the average relevance. The graph plots the average relevance at the time when the top n results are retrieved. FIG. 9 shows the average R matching rate under various conditions and the average number of distance calculations per search query.
[0066]
Here, the R relevance rate (R-precision) means the relevance rate at the time of outputting the top to R-th search results, where R is the total number of results that match the search query. The R relevance ratio is an evaluation scale indicating the validity of a search result ranked at the top.
[0067]
As can be seen from FIG. 9, under the same search condition, the precision rate tends to decrease as the number of dimensions increases, but the number of distance calculations is almost constant regardless of the number of dimensions. Therefore, this method does not lose speed even when the number of dimensions increases. The reason why the number of distance calculations is constant irrespective of the number of dimensions is that the number of data allocated to each unit is generally averaged. FIG. 10 shows the number of data allocated to each unit when the number of units is 20, and it can be read that the number of data allocated to each unit on the one-dimensional map does not vary extremely.
[0068]
[Example 2]
In the first embodiment described above, the experimental results for similar image retrieval are shown. Although the number of dimensions of the image feature amount is about several tens to several hundreds of dimensions, in order to examine the effectiveness of the technique when the number of dimensions is larger than this, as a second embodiment, a vector space model (VSM) is used. An experiment for document retrieval based on) was conducted.
[0069]
In the vector space model for document search, an index word is extracted from a document, and the document is represented by a vector based on the frequency of occurrence of the index word. Since the number of dimensions of the document vector is equal to the total number of index words over the entire document set, the number of dimensions is extremely large.
[0070]
In the second embodiment, MEDLINE, which is a test collection for information search evaluation, was used as a search target database. MEDLINE is a small collection of 1,033 documents to be searched in English and 30 documents to be searched, and relevance information indicating which document is suitable for each search query is prepared. The average number of matching documents for each search query is 23.2 documents.
[0071]
First, as preprocessing, 439 unnecessary words, such as "a" and "about", and words that appeared only once in all documents were deleted from the MEDLINE collection. After that, stemming was performed by the Porter algorithm, and as a result, 4,329 index words were obtained. A 4,329-dimensional document vector was constructed from the index words obtained by the above processing. At this time, logarithmic index word frequency was used for local weighting, entropy was used for global weighting, and cosine normalization was used for document normalization.
[0072]
In the evaluation of document search, both the nearest neighbor search (linear search) based on a normal vector space model and the nearest neighbor search using a self-organizing map output 30 search results, and the output results are referred to as MEDLINE conformity information. The precision was determined by comparing with. At this time, in the nearest neighbor search using the self-organizing map, the search was performed under the conditions of 20 units and 3 neighbors. FIG. 11 shows the relevance curve of the document search. The search using the self-organizing map gives slightly better results. Note that the R precision of the search based on the vector space model was 0.53, and the R precision of the search using the self-organizing map was 0.58. The number of average distance calculations in the nearest neighbor search using the self-organizing map is 141 for each search query, which is about 1/7 of the linear search. Therefore, if the conditions are the same, the processing time can be reduced to about 1/7, and high speed can be realized.
[0073]
The above is the evaluation of the matching information of the MEDLINE collection. Next, an approximation error of the nearest neighbor search using the self-organizing map will be described. When the search result of the vector space model was regarded as a correct answer, the R precision of the search result using the self-organizing map was 0.68. Therefore, an approximation error of 32% occurs in the search for the top 30 cases. However, in spite of the approximation error, in the evaluation of the MEDLINE matching information, the matching rate is higher than that of a normal vector space model. From this, although there is a slight error in the result in the sense of approximating the nearest neighbor search, it can be said that the accuracy is higher than that of the normal nearest neighbor search when considered as an absolute evaluation of the search itself. It became clear that in addition to the reduction in speed, the precision was also excellent.
[0074]
【The invention's effect】
As described above, the data matching method, the data matching device, the data matching program, and the computer-readable recording medium of the present invention can perform extremely efficient data matching as compared with the conventional linear search method. The feature of minimizing the amount of calculation required for processing and greatly reducing the processing time is realized. That is, the data matching method, data matching apparatus, data matching program and computer-readable recording medium of the present invention use a self-organizing map to store high-dimensional data in a one-dimensional space instead of a conventional two-dimensional space. This is because, by arranging them above, the search range of the nearest neighbor search is limited, and the distance calculation of the vector is performed only for the data existing in the limited range. As a result, the search range of the nearest neighbor search can be greatly reduced, and the amount of calculation can be drastically reduced as compared with the linear search in which distance calculation is sequentially performed on all data in a huge database.
[0075]
Further, the present invention is configured to store not a one-dimensionally converted data but a high-dimensional data in a continuous area on a secondary storage medium based on a one-dimensional self-organizing map arranged in a one-dimensional space. And Therefore, the data read from the secondary storage medium onto the main storage element includes a continuous series of data, and it is possible to capture necessary data with a small number of access times. As a result, the number of disk accesses is reduced. The problem that the number has become extremely small and the slow access speed has become a bottleneck is eliminated, and high-speed and high-efficiency data matching processing is enabled. Therefore, it is possible to perform a very high-speed nearest neighbor search even for a large-scale data set.
[Brief description of the drawings]
FIG. 1 is a conceptual diagram illustrating an example of a multimedia content search using a feature vector.
FIG. 2 is a schematic block diagram illustrating an example of a data matching device according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram showing an example of a self-organizing network in which n-dimensional input data is arranged on a two-dimensional plane.
FIG. 4 is a flowchart illustrating an example of a learning algorithm for a self-organizing map.
FIG. 5 is a conceptual diagram showing a nearest neighbor search using a one-dimensional self-organizing map according to an embodiment of the present invention.
FIG. 6 is a table showing an outline of a feature amount vector used for an image search according to the first embodiment of the present invention.
FIG. 7 is a graph showing a matching rate curve when the number of units is 10 and the number of neighbors is 3 as a result of image search according to the first embodiment of the present invention.
FIG. 8 is a graph showing a precision curve when the number of units is 20 and the number of neighbors is 3 as a result of the image search according to the first embodiment of the present invention.
FIG. 9 is a table showing an average R matching rate and an average distance calculation count as a result of an image search according to the first embodiment of the present invention.
FIG. 10 is a graph showing the number of data allocated to each unit when the number of units is 20.
FIG. 11 is a graph showing a matching rate curve as a result of a document search according to the second embodiment of the present invention.
[Explanation of symbols]
1 ... Data matching device
2 ... Calculation unit
3. Main storage unit
4 Secondary storage unit
5 ... input unit
6 ... Database
7 Terminal

Claims

In a data matching method of extracting desired data by specifying conditions for multidimensional data having a plurality of feature amounts and capable of expressing these feature amounts by vectors,
Creating a one-dimensional self-organizing map divided into units having the same number of reference vectors based on a multidimensional data group to be subjected to data matching;
Performing multi-dimensional indexing based on the created one-dimensional self-organizing map, and storing multi-dimensional data in a continuous area on a secondary storage unit;
Expressing the matching condition for extracting the desired data as a vector, and searching a unit having a reference vector closest to the matching condition vector from the one-dimensional self-organizing map;
Selecting a unit located near the unit retrieved from the one-dimensional self-organizing map from the one-dimensional self-organizing map;
Referring to multidimensional data belonging to the selected unit from the secondary storage unit, and calculating a distance between vectors;
Performing a predetermined process on the result of the vector distance calculation and outputting the result.

In the step of creating the one-dimensional self-organizing map, the input layer is connected to all units, and each unit corresponds to a reference vector having the same number of dimensions as data input to the input layer. In the network,
Searching for a unit having a reference vector closest to the vector input to the input layer;
Bringing the reference unit of the searched unit and the unit in the vicinity thereof closer to the input vector;
The data matching method according to claim 1, further comprising:

A learning algorithm for the one-dimensional self-organizing map,
Initializing the reference vector with a random value;
Detecting a unit having a reference vector closest to the input vector; and updating a reference vector of the detected unit and a neighboring area thereof,
Repeating the above steps until the reference vector converges;
3. The data matching method according to claim 2, comprising:

4. The data matching method according to claim 1, wherein the data matching is a search for data that meets a specified search condition.

5. The data search method according to claim 4, wherein the searched data is any one of document data, image data that is a still image or a moving image, and audio data, or a combination thereof.

4. The data matching method according to claim 1, wherein the data matching is image pattern recognition.

4. The data matching method according to claim 1, wherein the data matching is data mining.

For a multi-dimensional data having a plurality of feature amounts and capable of expressing these feature amounts by vectors, a data matching device that extracts desired data by designating a condition,
A database holding a multidimensional data group to be subjected to data matching, and a one-dimensional self-organizing map divided into units having the same number of reference vectors based on the multidimensional data group held in the database. An operation unit for creating;
A main storage unit for performing multidimensional indexing based on the one-dimensional self-organizing map created by the arithmetic unit;
A secondary storage unit that holds multidimensional data in a continuous area based on the multidimensional indexing performed in the main storage unit;
The matching condition for extracting the desired data is expressed as a vector, a unit having a reference vector closest to the matching condition vector is searched from the one-dimensional self-organizing map, and the unit is searched from the one-dimensional self-organizing map. The unit located in the vicinity of the selected unit is selected from the one-dimensional self-organizing map, the multidimensional data belonging to the selected unit is referred from the secondary storage unit, the distance between vectors is calculated, and the vector distance is calculated. A data matching device for performing predetermined processing on a result of calculation and outputting the result.

In a data matching program that specifies conditions and extracts desired data for multidimensional data that has a plurality of feature amounts and can express these feature amounts by vectors,
A process of creating a one-dimensional self-organizing map divided into units having the same number of reference vectors based on a multidimensional data group to be subjected to data matching;
A process of performing multidimensional indexing based on the created one-dimensional self-organizing map and holding multidimensional data in a continuous area on a secondary storage unit;
A process of expressing the matching condition for extracting the desired data as a vector, and searching a unit having a reference vector closest to the matching condition vector from the one-dimensional self-organizing map;
A process of selecting a unit located near the unit retrieved from the one-dimensional self-organizing map from the one-dimensional self-organizing map;
A process of referring to multidimensional data belonging to the selected unit from the secondary storage unit and calculating a distance between vectors;
Processing of performing predetermined processing on the result of the vector distance calculation and outputting the result;
Data matching program for causing a computer to execute.

A computer-readable recording medium storing the data matching program according to claim 9.