JP3810575B2 - Association rule extraction apparatus and recording medium

Info

Publication number: JP3810575B2
Application number: JP05737599A
Authority: JP
Inventors: 康小幡; 彰純三石
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-03-04
Filing date: 1999-03-04
Publication date: 2006-08-16
Anticipated expiration: 2019-03-04
Also published as: JP2000259611A

Description

【０００１】
【発明の属する技術分野】
この発明は、データベースに記録された複数のレコードから、そのデータベースの品目セット間の相関ルールを抽出するための相関ルール抽出装置、相関ルール抽出方法および記録媒体に関するものである。
【０００２】
【従来の技術】
図１８は従来の相関ルール抽出装置の構成例を示すブロック図である。この従来の相関ルール抽出装置では、例えば「Fast Algorithms for Mining Association Rules」（Ａｇｒａｗａｌら著、Proc. of the 20th VLDB Conference, Santiago, Chile、１９９４年）や特開平８−２８７１０６号公報に記載のアプリオリ（Apriori）法に基づいて、データベースに記録された複数のレコードから、そのデータベースの品目セット間の相関ルールが抽出される。アプリオリ法では、すべてのレコードのうち品目セット「Ａ，Ｂ，・・・，Ｘ，Ｙ」の含まれるレコードの割合を支持度とし、ルール「Ａ，Ｂ，・・・，Ｘ→Ｙ」の条件部の品目セット「Ａ，Ｂ，・・・，Ｘ」の含まれるレコードのうち帰結部の品目「Ｙ」の含まれるレコードの割合を確信度として、これらの支持度および確信度がそれぞれ所定の基準値より高い場合に、ルール「Ａ，Ｂ，・・・，Ｘ→Ｙ」が相関ルールとして抽出される。
【０００３】
図１８において、１は所定の対象についてのデータを、複数のレコードとして保持するデータベースである。５は生成された相関ルールを保持する相関ルール集合ファイルである。１０２はデータベース１のレコードを参照して、品目数がｋの大品目セットから品目数が（ｋ＋１）の大品目セットを生成する大品目セット生成手段である。なお、大品目セットとは、所定の基準値より高い支持度を有する品目セットをいう。１０４は各大品目セットからルール（すなわち相関ルールの候補）を生成し、そのルールの確信度に基づいて相関ルールを抽出する仮説生成検証手段である。
【０００４】
大品目セット生成手段１０２において、１２１はデータベース１のレコードを参照して、候補品目セット生成手段１２２により生成された候補品目セットの支持度を計算し、所定の基準値より高い支持度を有する候補品目セットを大品目セットとする候補品目セット検証手段であり、１２２は（ｋ−２）個の品目が同一であり、かつ品目数が（ｋ−１）である複数の大品目セットのうちの各々２つの大品目セットから、（ｋ−２）個の同一の品目、およびその大品目セットにおけるそれぞれ異なる２個の品目で構成される品目数がｋである候補品目セットを生成する候補品目セット生成手段である。
【０００５】
仮説生成検証手段１０４において、３１は品目数がｋである大品目セットから、（ｋ−１）個の品目の条件部と１個の品目の帰結部で構成されるすべてのルールを生成するルール候補生成手段であり、３３はルール候補生成手段３１により生成されたルールの確信度を、大品目セットを参照して計算する確信度計算手段である。
【０００６】
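Next, the operation will be described.
First, the candidate item set verification means 121 generates the large item sets having one item by referring to the items of each record in the database 1. The candidate item set generation means 122 then takes every pair of large item sets having one item and generates, from each pair, a candidate item set having two items that consists of the two distinct items of that pair.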
【０００７】
次に、候補品目セット検証手段１２１は、データベース１のレコードを参照して、候補品目セット生成手段１２２により生成された候補品目セットの支持度を計算し、所定の基準値より高い支持度を有する候補品目セットを大品目セットとする。
【０００８】
そして、候補品目セット生成手段１２２は、（ｋ−２）個の品目が共通し、かつ品目数が（ｋ−１）である複数の大品目セットのうちの各々２つの大品目セットから、（ｋ−２）個の共通の品目、およびその大品目セットにおけるそれぞれ異なる２個の品目で構成される品目数がｋである候補品目セットを生成する。以下同様に、大品目セット生成手段１０２は、候補品目セットがなくなるまで品目数を１ずつ増加させながら、大品目セットを生成していく。
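The join step described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `apriori_join` is hypothetical, item sets are assumed to be sorted tuples, and the (k-2) shared items are assumed to form the leading part of each item set, as in the Apriori paper cited earlier.

```python
from itertools import combinations


def apriori_join(large_prev):
    """Build candidate item sets of k items from large item sets of k-1 items.

    Assumption: every item set is a sorted tuple, and two item sets are
    joined when they agree on everything except their last item.
    """
    last_items_by_prefix = {}
    for item_set in sorted(large_prev):
        last_items_by_prefix.setdefault(item_set[:-1], []).append(item_set[-1])

    candidates = []
    for prefix, last_items in last_items_by_prefix.items():
        # Every pair of differing last items extends the shared part by two items.
        for a, b in combinations(last_items, 2):
            candidates.append(prefix + (a, b))
    return candidates


print(apriori_join([(2, 3, 5), (2, 3, 7), (1, 2, 3)]))
```

For the large item sets [2, 3, 5] and [2, 3, 7], the sketch yields the single candidate [2, 3, 5, 7]; [1, 2, 3] has no partner with the same leading items and produces nothing.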
【０００９】
一方、仮説生成検証手段１０４においては、ルール候補生成手段３１が、候補品目セット検証手段１２１による品目数がｋである大品目セットから、（ｋ−１）個の品目の条件部と１個の品目の帰結部で構成されるすべてのルールを生成する。そして、確信度計算手段３３が、大品目セットを参照して、ルール候補生成手段３１により生成されたルールの確信度を計算し、所定の基準値より高い確信度を有するルールを相関ルールとして相関ルール集合ファイル５に保存する。
【００１０】
このようにして、各品目数についての大品目セットが生成され、その大品目セットに基づいて相関ルールが抽出される。
【００１１】
なお、大品目セットを保持する場合、データ構造をハッシュ木としてメモリなどに保持することが多い。図１９は大品目セットを示すハッシュ木の一例を示す図である。このハッシュ木においては、各品目がノードを構成し、１つの品目セットが、ルート（ｒｏｏｔ）から下位の所定のノードまでの１つの枝を構成する。例えば図１９において、ルートからノード１およびノード３を経てノード５までの枝は、品目１，３，５により構成される品目セット［１，３，５］により構成されている。なお、以下、品目ｘ１，ｘ２，・・・，ｘｎにより構成される品目セットを［ｘ１，ｘ２，・・・，ｘｎ］と表記する。
【００１２】
すなわち、上述の大品目セットの生成の繰り返しにより、大品目セットが増加する毎にハッシュ木に枝が追加されていき、品目数が増加する毎に枝を構成するノードを増加して枝伸ばしをしていく。ハッシュ木において、すべての枝はいずれかの大品目セットを構成する。
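The tree layout described in the two paragraphs above, in which each item is a node and each item set is one branch from the root, can be sketched as below. This is only an illustration under stated assumptions: a plain prefix tree with one dictionary per node stands in for the hash tree, and the names `Node`, `ItemSetTree`, `add`, `contains`, and `node_count` are hypothetical.

```python
class Node:
    """One item on a branch; children are keyed by item number."""

    def __init__(self):
        self.children = {}


class ItemSetTree:
    """Stores item sets as branches that share their leading items."""

    def __init__(self):
        self.root = Node()

    def add(self, item_set):
        node = self.root
        for item in sorted(item_set):
            node = node.children.setdefault(item, Node())

    def contains(self, item_set):
        node = self.root
        for item in sorted(item_set):
            node = node.children.get(item)
            if node is None:
                return False
        return True

    def node_count(self):
        def count(node):
            return sum(1 + count(child) for child in node.children.values())

        return count(self.root)


tree = ItemSetTree()
tree.add([1, 3, 5])
tree.add([1, 3, 7])
print(tree.node_count())
```

Because [1, 3, 5] and [1, 3, 7] share the nodes for items 1 and 3, the two branches need four nodes instead of six, which is the storage saving the text attributes to the tree representation.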
【００１３】
このようにして木構造で複数の品目数の異なる大品目セットを表現することができるのは、大品目セットにおける部分集合は、必ず大品目セットになるという性質を利用したものである。なお、大品目セットにおける部分集合とは、大品目セットを構成するｋ個の品目のうちの（ｋ−１）個以下の品目で構成される品目セットのことである。
【００１４】
このように、ハッシュ木として複数の大品目セットを保持することにより、大品目セットを単に順次記憶していく場合に比較して、大品目セットの保持に必要な記憶容量を低減させている。
【００１５】
しかしながら、ハッシュ木を利用しても品目数が多い場合には、ハッシュ木自体が大きくなってしまい、大品目セットを生成するために、多くの記憶容量を有するメモリが必要になる。また、メモリの記憶容量が小さく仮想記憶方式を採用している場合には、ページングが頻繁に発生し、処理に要する時間が長くなってしまうなどの課題があった。
【００１６】
上述のものの他の従来の技術としては、特開平８−１６１２８７号公報、特開平８−２６３３４６号公報、特開平８−２７２８２５号公報、特開平８−３１４９８１号公報、特開平９−３４７２１号公報、特開平１０−２６９２４８号公報にそれぞれ記載のものなどがある。
【００１７】
そこで、本出願人は、このような課題を解決するために、ページングを抑制するための手法を特願平１０−１８２４１６号において先に提案した。図２０は先に提案した第１の手法について説明する図であり、図２１は先に提案した第２の手法について説明する図である。
【００１８】
先に提案した第１の手法においては、大品目セットがファイルに格納され、そのファイルから１つずつ大品目セットが読み出されてハッシュ木に追加され、さらにその追加された大品目セットおよび既に追加されている大品目セットから生成されるすべての候補品目セットがハッシュ木に追加される。
【００１９】
ここで、大品目セットに対応して候補品目セットを生成する際には、その大品目セットの最後の品目より大きな番号の品目のノードを、大品目セットの下位にそれぞれ並行して追加して候補品目セットの枝が生成される。例えば、１から１０までの１０種類の品目が存在し、大品目セット［１，３，５］に対応して候補品目セットを生成する場合、大品目セット［１，３，５］の枝の下位に、ノード６，７，８，９，１０がそれぞれ追加され、候補品目セット［１，３，５，６］，［１，３，５，７］，［１，３，５，８］，［１，３，５，９］，［１，３，５，１０］が生成される。
【００２０】
上述の一連の処理が品目数の同一な大品目セットのそれぞれについて順番に実行されていく。そして、この一連の処理毎に、ハッシュ木により占有されている記憶容量が計算され、その値が所定の基準値を超えた場合に、ハッシュ木に追加された候補品目セットについてレコードとのマッチングにより支持度が計算され、ファイルに保存された大品目セットより品目数が１だけ大きい大品目セットが図２０に示すように生成され、すべての大品目セットが生成された後に一括して更新される。従ってファイルには同一の品目数の大品目セットが常に保存されていることになる。なお、大品目セットとならなかった候補品目セットは削除される。
【００２１】
そして、ルールの検証のために必要な品目セット（ルール検証用品目セット）、すなわち大品目セットより１だけ品目数が少ない、その大品目セットの部分集合である品目セットもハッシュ木に追加される。これらの部分木に基づいて相関ルールを生成した後、追加された大品目セットおよびそれに関して追加された部分木が削除される。そして、次の大品目セットが同様にして追加され同様に処理される。
【００２２】
先に提案した第２の手法においては、アプリオリ法と同様の枝伸ばしが実行される。この手法においては、ハッシュ木のために予め割り当てられた記憶容量の数分の１である所定の記憶容量に、ハッシュ木のデータ量が達するまで、最後の品目以外の品目がすべて同一である大品目セットが一括してファイルから読み出されてハッシュ木に順次追加される。ここで、予め割り当てられた記憶容量の数分の１である所定の記憶容量分だけ大品目セットを読み込むのは、その後に、候補品目セットやルール検証用品目セットの追加に必要な記憶容量を残しておくためである。
【００２３】
そして、最後の品目以外の品目がすべて同一である大品目セットから候補品目セットが生成される。例えば大品目セット［２，３，５］，［２，３，７］がファイルに保存されている場合、図２１に示すように、これらの大品目セット［２，３，５］，［２，３，７］は一括して読み出され、ハッシュ木に追加される。そして、互いの異なる最後の品目「５」，「７」に基づいて、大品目セット［２，３，５］に品目「７」を追加して候補品目セット「２，３，５，７」が追加される。
【００２４】
以上のように、先に提案した第１および第２の手法によれば、予め割り当てられた記憶容量の範囲内で効率良く大品目セットの生成ができ、ひいては相関ルールを効率良く抽出することができる。
【００２５】
【発明が解決しようとする課題】
先に提案した手法は以上のように構成されており、上記の課題を解決することはできるものの第１の手法においては候補品目セットの生成の際にアプリオリ法と同様の枝伸ばしとは異なり多くの枝を生成するため、支持度を計算する候補品目セットの数が多くなり、処理時間がその分だけ長くなるという課題があった。また、第２の手法においては、予め割り当てられた記憶容量の数分の１に、大品目セットのデータ量が達した場合に、残りの記憶容量を使用して、それらの大品目セットについての候補品目セットなどをハッシュ木に追加するようにしているため、予め割り当てられた記憶容量がすべて有効に使用されない可能性があるという課題があった。
【００２６】
また、上述の従来の技術においては、表形式のデータから相関ルールを抽出する場合、前処理として、表における各属性値に番号１，２，３などを割り当て、属性値をその番号（品目）に変換した後に、上述の処理を実行するようにしているため、自明な、同じ属性における属性値の相関ルール（例えば属性「性別」における属性値「男」，「女」に対する「性別が『男』であるならば性別は『女』ではない」といういわゆる負の相関を有するルール）も抽出され、後処理として、そのような自明な相関ルールを除去する必要があるという課題があった。
【００２７】
さらに、上述の従来の技術においては、帰結部の品目数が１である相関ルールのみが抽出されているため、帰結部の品目数が２以上である相関ルールを抽出することが困難であるという課題があった。
【００２８】
この発明は上記のような課題を解決するためになされたもので、大品目セットを、最後の品目以外の品目が共通するもの毎に読み出し、ハッシュ木に大品目セットの共通部分である品目セットを追加し、その大品目セットの最後の品目の集合のうちのいずれか２つを順番に選択し、その２つの品目をそれぞれ共通部分に追加して候補品目セットを生成するとともに、ルール検証用品目セットを生成してハッシュ木に追加した後にハッシュ木のための記憶容量を計算し、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、各候補品目セットおよび各ルール検証用品目セットの支持度を計算し、候補品目セットのうち、支持度が所定の基準値より大きいものを大品目セットとして保存し、ルール検証用品目セットおよび大品目セットの支持度より計算される確信度またはχ２乗値に基づいて大品目セットから相関ルールを生成し、その後にハッシュ木を消去するようにして、予め割り当てられた記憶容量を有効に使用するようにする相関ルール抽出装置、相関ルール抽出方法および記録媒体を得ることを目的とする。
【００２９】
また、各品目を、属性番号と属性値で構成される２次元インデックスとしてデータベースに保存し、ハッシュ木のノードである品目の下位に追加できる品目を、同一の属性番号を有しない品目だけに制限して大品目セットを生成するようにして、同じ属性における属性値の相関ルールの生成を抑制するための相関ルール抽出装置、相関ルール抽出方法および記録媒体を得ることを目的とする。
【００３０】
さらに、品目数がｋである大品目セットから相関ルールの候補を作成する際に、条件部の品目数が１〜（ｋ−１）であり、帰結部の品目数が（ｋ−１）〜１であるルールをそれぞれ生成するようにして、帰結部が２以上の相関ルールを生成するための相関ルール抽出装置、相関ルール抽出方法および記録媒体を得ることを目的とする。
【００３１】
【課題を解決するための手段】
この発明に係る相関ルール抽出装置は、ハッシュ木として複数の品目セットを記憶する記憶手段と、大品目セットを保存する保存手段と、保存手段から大品目セットを、最後の品目以外の品目が共通するもの毎に読み出し、その共通部分である品目セットをハッシュ木に追加する共通品目セット追加手段と、読み出された大品目セットの最後の品目の集合のうちのいずれか２つを順番に選択し、その２つの品目をそれぞれ共通部分に追加して候補品目セットを生成する候補品目セット生成手段と、各候補品目セットの一部であってその候補品目セットより品目数が１だけ少ない品目セットをルール検証用品目セットとしてハッシュ木に追加するルール検証用品目セット生成手段と、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加後のハッシュ木のための記憶容量を計算する記憶容量計算手段と、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、各候補品目セットおよび各ルール検証用品目セットの支持度を計算し、候補品目セットのうち、支持度が所定の基準値より大きいものを大品目セットとして保存手段に保存する支持度計算手段と、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、ルール検証用品目セットおよび大品目セットの支持度より計算される確信度またはχ２乗値に基づいて大品目セットから相関ルールを生成し、その後にハッシュ木を記憶手段から消去する相関ルール生成手段とを備えるものである。
【００３２】
この発明に係る相関ルール抽出装置は、大品目セットとともにその支持度を保存手段に保存し、各ルール検証用品目セットと同一の大品目セットが保存手段に保存されている場合には、その大品目セットの支持度をそのルール検証用品目セットの支持度とするようにしたものである。
【００３３】
この発明に係る相関ルール抽出装置は、データベースからレコードを読み出し、候補品目セットの出現数およびルール検証用品目セットの出現数を並行してカウントして候補品目セットの支持度およびルール検証用品目セットの支持度を計算するようにしたものである。
【００３４】
この発明に係る相関ルール抽出装置は、ハッシュ木からルール検証用品目セットに対応する部分木を切り離した後、各候補品目セットの支持度を計算し、支持度の計算後、ルール検証用品目セットに対応する部分木をハッシュ木の元の位置に戻すようにしたものである。
【００３５】
この発明に係る相関ルール抽出装置は、切り離した部分木が接続されていたハッシュ木の接続ノードとその部分木のルートとの対応関係を示すデータをスタックによるデータ構造に従って記憶し、そのデータに基づいて、切り離した部分木をハッシュ木の元の位置に戻すようにしたものである。
【００３６】
この発明に係る相関ルール抽出装置は、切り離した部分木が接続されていたハッシュ木の接続ノードとその部分木のルートとの対応関係を示すデータをリストによるデータ構造に従って記憶し、そのデータに基づいて、切り離した部分木をハッシュ木の元の位置に戻すようにしたものである。
【００３７】
この発明に係る相関ルール抽出装置は、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、直前に追加された大品目セットの共通部分、候補品目セットおよびルール検証用品目セットをハッシュ木から削除するようにしたものである。
【００３８】
この発明に係る相関ルール抽出装置は、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量のうちの最大値が、所定の基準値とハッシュ木のための記憶容量との差より大きくなった場合に、支持度を計算し、相関ルールを生成し、その後にハッシュ木を消去するようにしたものである。
【００３９】
この発明に係る相関ルール抽出装置は、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量の平均値が、所定の基準値とハッシュ木のための記憶容量との差より大きくなった場合に、支持度を計算し、相関ルールを生成し、その後にハッシュ木を消去するようにしたものである。
【００４１】
この発明に係る記録媒体は、コンピュータに、所定の保存手段から大品目セットを、最後の品目以外の品目が共通するもの毎に読み出し、所定の記憶手段に記憶されたハッシュ木に、その大品目セットの共通部分である品目セットを追加する手順、読み出した大品目セットの最後の品目の集合のうちのいずれか２つを順番に選択し、その２つの品目をそれぞれ共通部分に追加して候補品目セットを生成する手順、各候補品目セットの一部であって候補品目セットより品目数が１だけ少ない品目セットをルール検証用品目セットとしてハッシュ木に追加する手順、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加後のハッシュ木のための記憶容量を計算する手順、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、各候補品目セットおよび各ルール検証用品目セットの支持度を計算し、候補品目セットのうち、支持度が所定の基準値より大きいものを大品目セットとして所定の保存手段に保存する手順、ルール検証用品目セットおよび大品目セットの支持度より計算される確信度またはχ２乗値に基づいて大品目セットから相関ルールを生成し、その後にハッシュ木を所定の記憶手段から消去する手順を実行させるためのプログラムを記録したものである。
【００４２】
この発明に係る相関ルール抽出装置は、属性番号と属性値で構成される２次元インデックスとして各品目を保存するデータベースと、ハッシュ木のノードである品目の下位に追加できる品目を、同一の属性番号を有しない品目だけに制限して大品目セットを生成し、その大品目セットから相関ルールを抽出する相関ルール抽出部とを備えるものである。
【００４３】
この発明に係る相関ルール抽出装置は、データベースに、各品目の値を、属性番号と属性値で構成される２次元インデックスとして、または、そのデータベースの複数の属性で構成される属性グループについての属性グループ番号とそれらのうちのいずれかの属性値で構成される２次元インデックスとして保存し、ハッシュ木のノードである品目の下位に追加できる品目を、同一の属性番号または属性グループ番号を有しない品目だけに制限して大品目セットを生成し、その大品目セットから相関ルールを抽出するようにしたものである。
【００４４】
この発明に係る相関ルール抽出装置は、所定の表形式でデータベースに予め保存されたレコードの各属性値を２次元インデックスに変換し、その２次元インデックスをデータベースに保存するデータ変換手段を備えるものである。
【００４６】
この発明に係る記録媒体は、コンピュータに、属性番号と属性値で構成される２次元インデックスとして各品目を保存するデータベースから品目を読み出す手順、ハッシュ木のノードである品目の下位に追加できる品目を、同一の属性番号を有しない品目だけに制限して大品目セットを生成し、その大品目セットから相関ルールを抽出する手順を実行させるためのプログラムを記録したものである。
【００４７】
この発明に係る相関ルール抽出装置は、大品目セットを保存する保存手段と、保存手段に保存された大品目セットのうち、最後の品目以外の品目が共通する大品目セットから候補品目セットを生成する候補品目セット生成手段と、候補品目セットの支持度を計算し、その支持度に基づいて候補品目セットと同一品目数の大品目セットを生成して保存手段に保存する支持度計算手段と、大品目セットの品目数をｋとして、大品目セットを、品目数が１から（ｋ−１）までの条件部と品目数が（ｋ−１）から１までの帰結部とに分割してそれぞれ生成されるルールのうち、確信度またはχ２乗値が所定の基準値より大きいものを相関ルールとする相関ルール生成手段とを備えるものである。
【００４８】
この発明に係る相関ルール抽出装置は、大品目セットを保存手段に保存する際に、その大品目セットの支持度も保存し、ルールの条件部と同一の大品目セットが保存手段に保存されている場合には、その大品目セットの支持度をそのルールの条件部の支持度としてルールの確信度またはχ２乗値を計算するようにしたものである。
【００５０】
この発明に係る記録媒体は、コンピュータに、所定の保存手段に保存された大品目セットのうち、最後の品目以外の品目が共通する大品目セットから、その大品目セットより品目数が１だけ多い候補品目セットを生成する手順、候補品目セットの支持度を計算し、その支持度に基づいて候補品目セットと同一品目数の大品目セットを生成して所定の保存手段に保存する手順、大品目セットの品目数をｋとして、大品目セットを、品目数が１から（ｋ−１）までの条件部と品目数が（ｋ−１）から１までの帰結部とに分割してそれぞれ生成されるルールのうち、確信度またはχ２乗値が所定の基準値より大きいものを相関ルールとする手順を実行させるためのプログラムを記録したものである。
【００５１】
【発明の実施の形態】
以下、この発明の実施の一形態を説明する。
実施の形態１．
図１はこの発明の実施の形態１による相関ルール抽出装置の構成を示すブロック図である。図において、１は所定の対象についてのデータを、複数のレコードとして保持するデータベースである。２はメモリ６にハッシュ木として大品目セットを記憶させ、そのハッシュ木を操作して、品目数がｋの大品目セットから品目数が（ｋ＋１）の大品目セットを生成し、大品目セットファイル（保存手段）３に保存する大品目セット生成手段である。４は各大品目セットからルール（すなわち相関ルールの候補）を生成し、そのルールの確信度またはχ２乗値に基づいて相関ルールを抽出する仮説生成検証手段（相関ルール生成手段）である。５は生成された相関ルールを保持する相関ルール集合ファイルである。６は大品目セット、候補品目セットおよびルール検証用品目セットをハッシュ木として記憶するメモリ（記憶手段）である。なお、ハッシュ木を構成する品目はルートに近い程、その番号が低くなるように配置される。
【００５２】
なお、以下、品目数がｋである大品目セットをＬ（ｋ）と、品目数がｋである候補品目セットをＣ（ｋ）と、品目数がｋであるルール検証用品目セットをＬＬ（ｋ）と表記する。
【００５３】
大品目セット生成手段２において、２１はデータベース１のレコードを参照して、候補品目セット生成手段２２により生成された候補品目セットＣ（ｋ）の支持度を計算し、所定の基準値（最小支持度）より高い支持度を有する候補品目セットＣ（ｋ）を大品目セットＬ（ｋ）とする候補品目セット検証手段（支持度計算手段）である。
【００５４】
２２は最後の品目以外が共通な大品目セットＬ（ｋ−１）のうちの各々２つの大品目セットＬ（ｋ−１）から、それらの共通部分（第１品目〜第（ｋ−２）品目）、および、残り２個の最後の品目（各大品目セットの第（ｋ−１）品目）で構成される候補品目セットＣ（ｋ）を生成し、メモリ６のハッシュ木に追加する候補品目セット生成手段（候補品目セット生成手段、ルール検証用品目セット生成手段）である。
【００５５】
２３は最後の品目以外が共通な大品目セットＬ（ｋ）を大品目セットファイル３から読み出し、その共通部分である品目数（ｋ−１）の品目セットをハッシュ木に追加するとともに、メモリ６にハッシュ木として記憶されている品目セット（大品目セット、候補品目セットなど）の占有する記憶容量を調べるハッシュ木操作手段（共通品目セット追加手段、記憶容量計算手段）である。
【００５６】
仮説生成検証手段４において、３１は大品目セットＬ（ｋ）から、（ｋ−１）個の品目の条件部と１個の品目の帰結部で構成されるすべてのルールを生成するルール候補生成手段であり、３２はルール候補生成手段３１により生成されたルールの確信度またはχ２乗値を計算する検証値計算手段である。
【００５７】
なお、χ２乗値はχ２乗検定のためのもので、ここでは、ルールの条件部と帰結部との関連する度合を示すものとして次式で計算される。
【数１】

ここでｎはデータベース１におけるレコードの総数であり、Ａはルールの条件部の支持度であり、Ｂはルールの帰結部の支持度であり、Ｃはルール全体の支持度である。
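The equation referred to as 【数１】 is not reproduced in this text. As a hedged reconstruction only (an assumption, not necessarily the patent's own expression): if A, B, and C are taken to be relative supports between 0 and 1, the standard Pearson χ² statistic for the 2×2 table of "condition part present or absent" against "consequent part present or absent" reduces to the following form, because each of the four cells deviates from its expected count by n(C − AB) in absolute value.

```latex
\chi^2 \;=\; \frac{n\,(C - A B)^2}{A\,B\,(1 - A)\,(1 - B)}
```

The value is 0 when C = AB, that is, when the condition part and the consequent part occur independently, and it grows as the two become more strongly related.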
【００５８】
次に動作について説明する。
図２は実施の形態１による相関ルール抽出装置の動作について説明するフローチャートである。図３は大品目セットファイル３の内容の一例を示す図であり、図４は、図３の大品目セット［１，２，３］，［１，２，４］，［１，３，５］，［１，４，５］で構成されるハッシュ木を示す図である。図５は、図２のステップＳＴ３における動作の一例について説明する図であり、図６は、図２のステップＳＴ４における動作の一例について説明する図であり、図７は、図２のステップＳＴ５における動作の一例について説明する図である。
【００５９】
まず図２のステップＳＴ１において、大品目セット生成手段２の候補品目セット検証手段２１がデータベース１のレコードに出現する各品目の支持度を計算し、所定の基準値より高い支持度を有する品目を大品目セットＬ（１）として大品目セットファイル３に、その支持度とともに保存する。
【００６０】
次に、品目数の指標ｋの初期値を２として（ステップＳＴ２）、以下、大品目セットＬ（ｋ−１）から大品目セットＬ（ｋ）を生成する処理、および生成した大品目セットＬ（ｋ）から相関ルールを生成する処理について説明する。
【００６１】
最初にステップＳＴ３において、ハッシュ木操作手段２３が大品目セットファイル３から大品目セットＬ（ｋ−１）を、最後の品目以外の品目が共通する大品目セット毎に一括して読み込み、その共通部分である品目セットをハッシュ木に追加し、各大品目セットＬ（ｋ−１）の最後の品目を、その共通部分に関連づけてハッシュ木とは別にメモリ６に記憶させる。
【００６２】
例えば大品目セットＬ（３）について（すなわちｋ＝４である場合）、図３に示す大品目セットファイル３の［１，２，３］，［１，２，４］，［１，３，５］，［１，４，５］についてステップＳＴ３〜ステップＳＴ５の処理が終了した状態のハッシュ木は図４に示すようになるが、この次、ステップＳＴ３においてこの大品目セットファイル３から大品目セット［２，４，５］，［２，４，６］，［２，４，７］が一括して読み込まれる。そして、図４に示すハッシュ木に、それらの共通部分である品目セット［２，４］で構成される枝が追加される。なお、このとき、読み込まれた各大品目セットの残りの品目「５」，「６」，「７」が例えば１つの配列としてハッシュ木とは別にメモリ６に記憶され、図５に示すように、ポインタなどによる、その配列と共通部分の枝との対応関係も合わせて記憶される。
【００６３】
次にステップＳＴ４において、候補品目セット生成手段２２は、大品目セットＬ（ｋ−１）の最後の品目の集合の各品目について、その品目と、その集合内でその品目より番号の大きい各品目により、上述の共通部分の枝をそれぞれ伸ばし、候補品目セットＣ（ｋ）を生成する。
【００６４】
例えば大品目セットＬ（３）を読み込んだ後、図５に示す状態になった場合では、まず、配列「５，６，７」のうちの品目５が読み出され、図６に示すように、その品目「５」およびその品目より番号の大きい品目「６」，「７」により、共通部分「２，４」の下位に「５，６」，「５，７」が追加され、候補品目セット［２，４，５，６］，［２，４，５，７］が生成される。なお、候補品目セット［２，４，５，６］，［２，４，５，７］について「２，４，５」は共通しているので、このときのハッシュ木は図６に示すように「２，４，５」から「６」と「７」が派生する形態になっている。
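Step ST4 as illustrated above can be sketched as follows. The function name `candidate_batches` is hypothetical, and returning one batch per last item is an assumption made so that the caller could check memory use between batches, as the later steps do.

```python
def candidate_batches(common_part, last_items):
    """Yield, for each last item, the candidates formed with every larger last item.

    common_part: the items shared by the large item sets that were read together.
    last_items:  the differing last items stored separately from the tree.
    """
    ordered = sorted(last_items)
    for index, item in enumerate(ordered):
        batch = [tuple(common_part) + (item, larger) for larger in ordered[index + 1:]]
        if batch:
            yield batch


for batch in candidate_batches((2, 4), [5, 6, 7]):
    print(batch)
```

With the common part [2, 4] and the last items 5, 6, 7, the first batch is [2, 4, 5, 6] and [2, 4, 5, 7], matching FIG. 6 as described; the batch for item 6 then adds [2, 4, 6, 7].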
【００６５】
そして候補品目セットＣ（ｋ）の生成が終了すると、ステップＳＴ５において仮説生成検証手段４のルール候補生成手段３１が、生成された候補品目セットＣ（ｋ）の一部であって品目数が１だけ少ない品目セットをルール検証用品目セットＬＬ（ｋ−１）としてハッシュ木に追加する。なお、ルール検証用品目セットＬＬ（ｋ−１）に対応する枝が既に存在している場合には、そのルール検証用品目セットＬＬ（ｋ−１）については特に何も追加しない。
【００６６】
ここでルール検証用品目セットＬＬ（ｋ−１）は、候補品目セットＣ（ｋ）が大品目セットＬ（ｋ）になった場合に生成される候補ルールの条件部に対応するものである。例えば図６に示すように候補品目セットＣ（４）が［２，４，５，６］，［２，４，５，７］の２つである場合には、ルール検証用品目セットＬＬ（３）は、［２，４，５］，［２，４，６］，［２，５，６］，［４，５，６］，［２，４，７］，［２，５，７］，［４，５，７］の７つになる。なお、［２，４，５］は重複している。すなわち候補品目セット［２，４，５，６］が大品目セットになった場合には、ルール「２，４，５→６」，「２，４，６→５」，「２，５，６→４」，「４，５，６→２」が生成され、候補品目セット［２，４，５，７］が大品目セットになった場合には、ルール「２，４，５→７」，「２，４，７→５」，「２，５，７→４」，「４，５，７→２」が生成されるため、その条件部であるルール検証用品目セットが予め生成される。
【００６７】
そして、図６に示すハッシュ木に、これらのルール検証用品目セットＬＬ（３）のうち、ハッシュ木に存在しない枝「２，４，６」，「２，５，６」，「４，５，６」，「２，４，７」，「２，５，７」，「４，５，７」が図７に示すように追加される。
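The derivation of the rule-verification item sets LL(k-1) in step ST5 can be sketched as below. The function name `rule_verification_sets` is hypothetical; the sketch simply lists every subset that has one item fewer than a candidate and skips those already produced, mirroring the remark that an existing branch is not added again.

```python
from itertools import combinations


def rule_verification_sets(candidates):
    """Return the distinct (k-1)-item subsets of the given k-item candidates."""
    seen = set()
    ordered = []
    for candidate in candidates:
        for subset in combinations(candidate, len(candidate) - 1):
            if subset not in seen:  # e.g. (2, 4, 5) arises from both candidates
                seen.add(subset)
                ordered.append(subset)
    return ordered


print(rule_verification_sets([(2, 4, 5, 6), (2, 4, 5, 7)]))
```

For the two candidates of the example this gives seven item sets, with [2, 4, 5] appearing only once.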
【００６８】
以上のステップＳＴ３〜ステップＳＴ５の処理が終了した時点でステップＳＴ６において、ハッシュ木操作手段２３は、メモリ６においてハッシュ木が占有する記憶容量を調べ、その記憶容量が所定の基準値より大きいか否かを判断する。そして、ハッシュ木が占有する記憶容量が所定の基準値以下である場合には、ハッシュ木操作手段２３は、候補品目セット生成手段２２に、ステップＳＴ３で読み込んだすべての大品目セットＬ（ｋ−１）の最後の品目について候補品目セットＣ（ｋ）の生成が終了しているか否かを問い合わせる。
【００６９】
そして、すべての大品目セットＬ（ｋ−１）の最後の品目について候補品目セットＣ（ｋ）の生成が終了していない場合には、ステップＳＴ４に戻り、次の、大品目セットＬ（ｋ−１）の最後の品目についての処理が同様に実行される。例えば、図７に示すハッシュ木が占有する記憶容量が所定の基準値以下である場合には、次にステップＳＴ４において、配列「５，６，７」のうちの品目「６」について候補品目セットＣ（ｋ）の生成およびルール検証用品目セットＬＬ（ｋ−１）の生成が実行される。
【００７０】
一方、すべての大品目セットＬ（ｋ−１）の最後の品目について候補品目セットＣ（ｋ）の生成が終了している場合、ハッシュ木操作手段２３は、ステップＳＴ８において、すべての大品目セットＬ（ｋ−１）が読み込まれて処理されたか否かを判断する。そして、すべての大品目セットＬ（ｋ−１）が処理されていないと判断された場合には、ステップＳＴ３に戻り、次の一群の大品目セットＬ（ｋ−１）が読み込まれ、同様の処理が実行される。一方、すべての大品目セットＬ（ｋ−１）が処理されたと判断された場合、ステップＳＴ９に進み、大品目セットＬ（ｋ）の生成などが実行される。
【００７１】
また、ステップＳＴ６においてハッシュ木が占有する記憶容量が所定の基準値より高い場合には、ステップＳＴ９に進み、その時点までにメモリ６のハッシュ木に追加された各候補品目セットＣ（ｋ）の支持度が候補品目セット検証手段２１によりデータベース１のレコードを参照して計算され、候補品目セットＣ（ｋ）のうち、所定の基準値より高い支持度を有する候補品目セットＣ（ｋ）が大品目セットＬ（ｋ）とされ、その大品目セットＬ（ｋ）はその支持度とともに大品目セットファイル３に保存される。なお、大品目セットファイル３において、大品目セットがその品目数毎に保存される。すなわち、生成された大品目セットＬ（ｋ）は大品目セットＬ（ｋ−１）とは別に保存される。
【００７２】
そして大品目セットＬ（ｋ）が生成されると、次にステップＳＴ１０において、ハッシュ木操作手段２３が、その時点までにメモリ６のハッシュ木に追加された各ルール検証用品目セットＬＬ（ｋ−１）と同一の大品目セットＬ（ｋ−１）を大品目セットファイル３で検索し、該当する大品目セットＬ（ｋ−１）を発見した場合には、その大品目セットＬ（ｋ−１）の支持度を読み出し、そのルール検証用品目セットＬＬ（ｋ−１）の支持度として、そのルール検証用品目セットＬＬ（ｋ−１）に関連づけてメモリ６に記憶させる。該当する大品目セットＬ（ｋ−１）が発見されなかった場合、特に何もしない。そのルール検証用品目セットＬＬ（ｋ−１）を包含する品目セットの支持度は所定の基準値以下であり、そのような品目セットは大品目セットＬ（ｋ）にならず、そのルール検証用品目セットＬＬ（ｋ−１）の支持度は特に必要ないからである。
【００７３】
次に、ステップＳＴ１１において、まずルール候補生成手段３１は、大品目セットＬ（ｋ）のうちの各（ｋ−１）の品目を条件部とし、残りの１品目を帰結部としたルールを生成する。そして、検証値計算手段３２は、各ルールについて、条件部の支持度（すなわちルール検証用品目セットＬＬ（ｋ−１）の支持度）および大品目セットＬ（ｋ）の支持度から確信度を計算するか、あるいは、条件部の支持度、帰結部の支持度、大品目セットＬ（ｋ）の支持度およびデータベースのレコード数からχ２乗値を計算し、所定の基準値より確信度またはχ２乗値が高いルールを相関ルールとして相関ルール集合ファイル５に保存する。
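The confidence branch of step ST11 can be sketched as follows. This is an illustration under assumptions: the function name `single_consequent_rules` is hypothetical, supports are held as occurrence counts in a dictionary keyed by sorted tuples, and the χ² alternative is left out.

```python
def single_consequent_rules(large, count, min_confidence):
    """Generate rules with (k-1) condition items and one consequent item.

    large: a large item set as a sorted tuple.
    count: dict mapping item sets (tuples) to occurrence counts.
    Returns (condition part, consequent item, confidence) for rules whose
    confidence exceeds min_confidence.
    """
    rules = []
    for index, consequent in enumerate(large):
        condition = large[:index] + large[index + 1:]
        if condition not in count:
            continue  # no support available for this condition part
        confidence = count[large] / count[condition]
        if confidence > min_confidence:
            rules.append((condition, consequent, confidence))
    return rules


counts = {(2, 4, 5, 6): 20, (2, 4, 5): 25, (2, 4, 6): 50, (2, 5, 6): 40, (4, 5, 6): 20}
for condition, consequent, confidence in single_consequent_rules((2, 4, 5, 6), counts, 0.7):
    print(condition, "->", consequent, confidence)
```

The confidence is the support of the whole large item set divided by the support of the condition part, so only the rule-verification item sets and the large item set itself need to be looked up.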
【００７４】
このように相関ルールを生成し保存した後、ステップＳＴ１２において、仮説生成検証手段４は、メモリ６のハッシュ木を消去して、ハッシュ木に占有された記憶領域を解放する。
【００７５】
そしてステップＳＴ１３において、ハッシュ木操作手段２３は、大品目セットファイル３における大品目セットＬ（ｋ−１）をすべて読み込み、かつ、読み込んだ大品目セットＬ（ｋ−１）からすべての候補品目セットＣ（ｋ）が生成されているか否かを判断する。読み込んだ大品目セットＬ（ｋ−１）からすべての候補品目セットＣ（ｋ）が生成されていない場合には、ステップＳＴ７を介してステップＳＴ４に戻り、残りの候補品目セットＣ（ｋ）の生成が実行され、読み込んだ大品目セットＬ（ｋ−１）からすべての候補品目セットＣ（ｋ）が生成されているが、読み込んでいない大品目セットＬ（ｋ−１）がある場合には、ステップＳＴ７およびステップＳＴ８を介してステップＳＴ３に戻り、残りの大品目セットＬ（ｋ−１）が読み込まれる。
【００７６】
一方、大品目セットファイル３における大品目セットＬ（ｋ−１）をすべて読み込み、かつ、読み込んだ大品目セットＬ（ｋ−１）からすべての候補品目セットＣ（ｋ）が生成されていると判断された場合には、ステップＳＴ１４に進み、ハッシュ木操作手段２３は、大品目セットＬ（ｋ）が少なくとも１つは生成されたか否かを判断する。大品目セットＬ（ｋ）が少なくとも１つは生成されている場合には、品目数の指標ｋの値を１だけ増加して（ステップＳＴ１５）、ステップＳＴ３に戻り、品目数ｋが１だけ多い大品目セットＬ（ｋ＋１）の生成が実行される。一方、大品目セットＬ（ｋ）が１つも生成されなかった場合には、処理を終了する。
【００７７】
以上のように、この実施の形態１によれば、大品目セットＬ（ｋ−１）を最後の品目以外の品目が共通するもの毎に読み出し、ハッシュ木に大品目セットＬ（ｋ−１）の共通部分である品目セットを追加し、その大品目セットＬ（ｋ−１）の最後の品目の集合のうちのいずれか２つを順番に選択し、その２つの品目をそれぞれ共通部分に追加して候補品目セットＣ（ｋ）を生成するとともに、ルール検証用品目セットＬＬ（ｋ−１）を生成してハッシュ木に追加した後にハッシュ木のための記憶容量を計算し、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、各候補品目セットＣ（ｋ）および各ルール検証用品目セットＬＬ（ｋ−１）の支持度を計算し、候補品目セットＣ（ｋ）のうち、支持度が所定の基準値より大きいものを大品目セットＬ（ｋ）として保存し、ルール検証用品目セットＬＬ（ｋ−１）および大品目セットＬ（ｋ）の支持度より計算される確信度またはχ２乗値に基づいて大品目セットＬ（ｋ）から相関ルールを生成し、その後にハッシュ木を消去するようにしたので、各大品目セットの処理毎にハッシュ木の使用する記憶容量を調べることになり、予め割り当てられた記憶容量を有効に使用することができるという効果が得られる。
【００７８】
実施の形態２．
この発明の実施の形態２による相関ルール抽出装置は、ルール検証用品目セットの支持度の計算についての処理を実施の形態１のものから変更したものである。ここでは、実施の形態２におけるルール検証用品目セットの支持度の計算についてのみ説明し、その他の処理については実施の形態１と同様であるのでその説明を省略する。
【００７９】
実施の形態２による相関ルール抽出装置においては、ルール検証用品目セットの支持度は、候補品目セット検証手段２１によりデータベース１のレコードを参照して行われる。そのとき候補品目セット検証手段２１は、データベース１からレコードを順次読み出し、各レコードと候補品目セットとのマッチングをするとともに、各レコードとルール検証用品目セットとのマッチングも並行して行う。
【００８０】
そして、候補品目セット検証手段２１は、データベース１のレコードにおける各候補品目セットの出現数および各ルール検証用品目セットの出現数をカウントして支持度を計算する。
【００８１】
以上のように、この実施の形態２によれば、データベース１から読み出したレコードに対して、候補品目セットおよびルール検証用品目セットのマッチングを並行して実行するようにしたので、大品目セットが多い場合には、大品目セットファイル３に保存された大品目セットを検索するより短い時間で各ルール検証用品目セットの支持度を計算することができるという効果が得られる。
【００８２】
実施の形態３．
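The association rule extraction apparatus according to Embodiment 3 of the present invention changes the processing for calculating the support of candidate item sets from that of Embodiment 1. Only the calculation of the support of candidate item sets in Embodiment 3 is described here; the other processing is the same as in Embodiment 1, and its description is omitted.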
【００８３】
実施の形態３においては、候補品目セット検証手段２１は、候補品目セットの支持度を計算する際に、この計算とは関係ないルール検証用品目セットの部分木をハッシュ木から一時的に切り離し、この計算の終了後にその部分木をハッシュ木の元の位置に戻す。このとき候補品目セット検証手段２１は、ハッシュ木の切り離し位置に隣接する接続ノードと、その部分木のルートであるノードとの対応関係を記憶しておき、支持度の計算終了後に、その対応関係に基づいて、その部分木をハッシュ木の元の位置に戻す。
【００８４】
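FIG. 8 shows an example of the positions at which subtrees are detached from the hash tree. FIG. 9 shows an example of the correspondence between the hash tree and the detached subtrees. In the hash tree of FIG. 8, for example, the candidate item sets are [2, 4, 5, 6] and [2, 4, 5, 7], and all the other branches are branches of rule-verification item sets. The nodes marked with a circle are therefore detached from the hash tree, and the number of detachment positions is kept as small as possible. For example, the nodes are searched downward from the root, and when the item of a node does not appear at the corresponding position of any candidate item set, the link to that node is set as a detachment position.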
【００８５】
すなわち、図８のハッシュ木においては、品目「４」が候補品目セットの第１品目に出現しないので、ルートから品目「４」のノードへのリンクが切り離し位置とされる。同様に、品目「５」が候補品目セットの第２品目に出現しないので、品目「２」のノードから品目「５」のノードへのリンクが切り離し位置とされ、品目「６」，「７」が候補品目セットの第３品目に出現しないので、品目「４」のノードから品目「６」，「７」のノードへのリンクがそれぞれ切り離し位置とされる。
【００８６】
そして、図９（ａ）に示すように切り離し位置に隣接するルートおよび品目「４」のノードへのポインタがａ，ｄ、品目「２」のノードおよび品目「５」のノードへのポインタがｂ，ｃ、品目「４」のノードおよび品目「６」，「７」のノードへのポインタがｅ，ｆ，ｇである場合には、図９（ｂ）に示すように、切り離し位置に隣接する２つのノードへのポインタが関連づけられて記憶される。このように、ハッシュ木の切り離し位置に隣接する接続ノードと、部分木のルートであるノードとの対応関係を記憶した後、ルートおよび品目「２」，「４」，「５」ノードの下位ノードへのリンク情報を書き換えて部分木を切り離し、支持度の計算終了後に、その対応関係（図９（ｂ））に基づいて、ルートおよび品目「２」，「４」，「５」ノードの下位ノードへのリンク情報を元に戻してその部分木をハッシュ木の元の位置に戻す。
【００８７】
以上のように、この実施の形態３によれば、候補品目セットの支持度の計算に関係のないルール検証用品目セットの部分木をハッシュ木から切り離すようにしたので、候補品目セットの支持度の計算に関係のないルール検証用品目セットについてのマッチングが行われず、効率的に候補品目セットのマッチングを実行することができるという効果が得られる。
【００８８】
実施の形態４．
この発明の実施の形態４による相関ルール抽出装置は、ハッシュ木の接続ノードと、ルール検証用品目セットの部分木のルートとの対応関係を示すデータをスタックによるデータ構造に従って記憶し、そのデータに基づいて、切り離した部分木をハッシュ木の元の位置に戻すようにしたものである。その他の処理については実施の形態３と同様であるのでその説明を省略する。
【００８９】
実施の形態４においては、候補品目セット検証手段２１は、ハッシュ木の切り離し位置に隣接する接続ノードと、その部分木のルートであるノードとの対応関係を、一定の記憶容量のスタック領域において、スタックによるデータ構造に従って順次記憶していき、支持度の計算終了後に、その対応関係をスタックから順次読み出して、その部分木をハッシュ木の元の位置に戻す。
【００９０】
図１０はハッシュ木と、切り離した部分木との対応関係を記憶するスタックの一例を示す図である。例えば図８のハッシュ木からルール検証用品目セットの部分木を切り離す場合、図１０に示すように、切り離し位置に隣接する２つのノードへのポインタが関連づけられてスタックに記憶されていき、支持度の計算終了後に、逆順にスタックから読み出され、部分木が元の位置に戻される。
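The detach-and-restore procedure of Embodiments 3 and 4 can be sketched as below. This is a minimal sketch under stated assumptions: a dictionary-based prefix tree stands in for the hash tree, a Python list stands in for the fixed-capacity stack area, and the names `Node`, `add_branch`, `count_nodes`, `detach_unrelated`, and `reattach` are hypothetical.

```python
class Node:
    def __init__(self, item=None):
        self.item = item
        self.children = {}  # item -> Node


def add_branch(root, item_set):
    node = root
    for item in item_set:
        node = node.children.setdefault(item, Node(item))


def count_nodes(node):
    return sum(1 + count_nodes(child) for child in node.children.values())


def detach_unrelated(root, candidates):
    """Cut every link to a node whose item appears at that depth in no candidate.

    Returns a stack of (connection node, subtree root) pairs.
    """
    allowed = [{c[depth] for c in candidates} for depth in range(len(candidates[0]))]
    stack = []

    def walk(node, depth):
        for item in list(node.children):
            child = node.children[item]
            if depth >= len(allowed) or item not in allowed[depth]:
                stack.append((node, child))  # remember where the subtree was attached
                del node.children[item]
            else:
                walk(child, depth + 1)

    walk(root, 0)
    return stack


def reattach(stack):
    """Pop the pairs in reverse order and restore each subtree."""
    while stack:
        parent, child = stack.pop()
        parent.children[child.item] = child


root = Node()
candidates = [(2, 4, 5, 6), (2, 4, 5, 7)]
verification = [(2, 4, 6), (2, 5, 6), (4, 5, 6), (2, 4, 7), (2, 5, 7), (4, 5, 7)]
for item_set in candidates + verification:
    add_branch(root, item_set)
saved = detach_unrelated(root, candidates)
print(len(saved), count_nodes(root))
reattach(saved)
print(count_nodes(root))
```

On the tree of the FIG. 8 example, the sketch cuts four links (root to 4, 2 to 5, 4 to 6, and 4 to 7), which agrees with the detachment positions listed for Embodiment 3, and restoring them returns the tree to its original size.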
【００９１】
以上のように、この実施の形態４によれば、実施の形態３による効果の他、ハッシュ木の接続ノードと、ルール検証用品目セットの部分木のルートとの対応関係を示すデータをスタックによるデータ構造に従って記憶するようにしたので、メモリ管理を簡単化することができるという効果が得られる。
【００９２】
実施の形態５．
この発明の実施の形態５による相関ルール抽出装置は、ハッシュ木の接続ノードと、ルール検証用品目セットの部分木のルートとの対応関係を示すデータをリストによるデータ構造に従って記憶し、そのデータに基づいて、切り離した部分木をハッシュ木の元の位置に戻すようにしたものである。その他の処理については実施の形態３と同様であるのでその説明を省略する。
【００９３】
実施の形態５においては、候補品目セット検証手段２１は、ハッシュ木の切り離し位置に隣接する接続ノードと、その部分木のルートであるノードとの対応関係をリストによるデータ構造に従って順次記憶していき、支持度の計算終了後に、その対応関係をリストから順次読み出して、その部分木をハッシュ木の元の位置に戻す。
【００９４】
図１１はハッシュ木と、切り離した部分木との対応関係を記憶するリストの一例を示す図である。例えば図８のハッシュ木からルール検証用品目セットの部分木を切り離す場合、図１１（ａ）に示すように、切り離された部分木のルートを一連のリストとして、図１１（ｂ）に示すように、切り離し位置に隣接する２つのノードへのポインタが関連づけられたレコードがリストとして記憶されていき、支持度の計算終了後に、各レコードがリストから順次読み出され、部分木が元の位置に戻される。
【００９５】
以上のように、この実施の形態５によれば、実施の形態３による効果の他、ハッシュ木の接続ノードと、ルール検証用品目セットの部分木のルートとの対応関係を示すデータをリストによるデータ構造に従って記憶するようにしたので、切り離し位置の数がいくつでもよく、また不要な記憶領域を予め確保する必要がなくメモリを効率良く使用することができるという効果が得られる。
【００９６】
実施の形態６．
この発明の実施の形態６による相関ルール抽出装置は、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、直前に追加された大品目セットの共通部分、候補品目セットおよびルール検証用品目セットをハッシュ木から削除するようにしたものである。その他の処理については実施の形態１と同様であるのでその説明を省略する。
【００９７】
実施の形態６においては、ハッシュ木操作手段２３は、読み出した大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加後のハッシュ木のための記憶容量を計算するが、ハッシュ木のための記憶容量が所定の基準値より大きい場合には、ハッシュ木から、これらの大品目セットの共通部分、候補品目セットおよびルール検証用品目セットに対応する部分木を削除する。
【００９８】
この後に候補品目セット検証手段２１は残りの候補品目セットおよびルール検証用品目セットの支持度を計算する。なお、このとき削除された大品目セットは、ハッシュ木が消去された後に、再度読み込まれて処理される。
【００９９】
以上のように、この実施の形態６によれば、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、直前に追加した大品目セットの共通部分、候補品目セットおよびルール検証用品目セットをハッシュ木から削除し、ハッシュ木のための記憶容量が所定の基準値以下である状態に戻した後、支持度の計算などを実行するようにしたので、使用記憶容量をより正確に制限することができるという効果が得られる。
【０１００】
実施の形態７．
この発明の実施の形態７による相関ルール抽出装置は、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量のうちの最大値が所定の記憶領域の残量より大きくなった場合に、大品目セットおよび相関ルールを生成し、その後にハッシュ木を消去する処理を実行するようにしたものである。その他の処理については実施の形態１と同様であるのでその説明を省略する。
【０１０１】
実施の形態７においては、ハッシュ木操作手段２３は、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットＬＬ（ｋ−１）がハッシュ木に追加された後に、ハッシュ木のための記憶容量を計算し、その計算結果と前回の計算結果との差を計算してハッシュ木のための記憶容量の増加量を計算する。次にハッシュ木操作手段２３は、その増加量と、前回までの増加量の最大値とを比較し、その増加量の値がその最大値より大きい場合には、その増加量の値でその最大値を更新する。そしてハッシュ木操作手段２３は、所定の基準値と今回計算したハッシュ木のための記憶容量との差である記憶容量の残量を計算し、その残量より上述の最大値が大きい場合には、大品目セットおよび相関ルールを生成し、その後にハッシュ木を消去する処理を実行させ、そうでない場合には、次の大品目セットを読み込み処理する。
【０１０２】
以上のように、この実施の形態７によれば、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量のうちの最大値が所定の記憶領域の残量より大きくなった場合に、大品目セットおよび相関ルールを生成し、その後にハッシュ木を消去する処理を実行するようにしたので、ハッシュ木のための記憶容量が所定の基準値を超える手前まで有効にメモリが使用されるという効果が得られる。
【０１０３】
実施の形態８．
この発明の実施の形態８による相関ルール抽出装置は、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量の平均値が所定の記憶領域の残量より大きくなった場合に、大品目セットおよび相関ルールを生成し、その後にハッシュ木を消去する処理を実行するようにしたものである。その他の処理については実施の形態１と同様であるのでその説明を省略する。
【０１０４】
実施の形態８においては、ハッシュ木操作手段２３は、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットＬＬ（ｋ−１）がハッシュ木に追加された後に、ハッシュ木のための記憶容量を計算し、その計算結果と前回の計算結果との差を計算してハッシュ木のための記憶容量の増加量を計算する。次にハッシュ木操作手段２３は、その増加量、前回までの増加量の平均値および追加の回数に基づいて今回までの増加量の平均値を計算する。そしてハッシュ木操作手段２３は、所定の基準値と今回計算したハッシュ木のための記憶容量との差である記憶容量の残量を計算し、その残量より上述の平均値が大きい場合には、大品目セットおよび相関ルールを生成し、その後にハッシュ木を消去する処理を実行させ、そうでない場合には、次の大品目セットを読み込み処理する。
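The decision rules of Embodiments 7 and 8 differ only in the statistic compared against the remaining capacity, so both can be sketched with one small tracker. The class name `GrowthTracker`, its method names, and the `tree_erased` reset are assumptions made for illustration; the text does not say how the statistics are treated after the hash tree is erased.

```python
class GrowthTracker:
    """Tracks how much the tree grows per addition and decides when to flush."""

    def __init__(self, limit):
        self.limit = limit        # predetermined reference value for the tree size
        self.previous_size = 0
        self.additions = 0
        self.max_growth = 0       # Embodiment 7: largest increase seen so far
        self.mean_growth = 0.0    # Embodiment 8: running average of the increases

    def should_flush(self, current_size, policy="max"):
        growth = current_size - self.previous_size
        self.previous_size = current_size
        self.additions += 1
        self.max_growth = max(self.max_growth, growth)
        # Running mean from the previous mean, the new increase, and the count.
        self.mean_growth += (growth - self.mean_growth) / self.additions
        remaining = self.limit - current_size
        estimate = self.max_growth if policy == "max" else self.mean_growth
        return estimate > remaining

    def tree_erased(self):
        # Assumption: only the size baseline is reset after the tree is erased.
        self.previous_size = 0


tracker = GrowthTracker(limit=100)
print([tracker.should_flush(size, policy="max") for size in (30, 50, 75)])
```

With a limit of 100 and tree sizes 30, 50, and 75, the maximum-based rule flushes at the third addition (largest increase 30, remaining 25), while the average-based rule would not (average increase 25, remaining 25), which shows that the average is the less conservative of the two.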
【０１０５】
以上のように、この実施の形態８によれば、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量の平均値が所定の記憶領域の残量より大きくなった場合に、大品目セットおよび相関ルールを生成し、その後にハッシュ木を消去する処理を実行するようにしたので、ハッシュ木のための記憶容量が所定の基準値を超える手前まで有効にメモリが使用されるという効果が得られる。
【０１０６】
実施の形態９．
図１２はこの発明の実施の形態９による相関ルール抽出装置の構成を示すブロック図である。図において、１Ａは複数のレコードを表形式で保存し、レシート形式のデータを保存するデータベースである。ここでレシート形式とは、データベース１Ａのレコードにおける各属性の値に対する品目を連続させて表したものである。なお、このときの品目は例えば（属性番号｜属性値）なる２次元インデックスとされる。この場合、属性「性別」の値として「男」，「女」があり、属性「身長」の値として「大」，「中」，「小」があった場合、これらの属性の値は（１｜１），（１｜２）および（２｜１），（２｜２），（２｜３）なる２次元インデックスの品目でそれぞれ表される。
【０１０７】
５１はデータベース１Ａのレシート形式のレコードを読み出し、レコードに含まれる品目に基づいて各品目数の大品目セットをハッシュ木として展開していくとともに、その大品目セットから、確信度またはχ２乗値に基づいて相関ルールを生成する相関ルール抽出部である。５２はデータベース１Ａの表形式のレコードを読み出してレシート形式に変換して、変換後のレコードをデータベース１Ａに保存するレシートファイル生成手段（データ変換手段）である。
【０１０８】
相関ルール抽出部５１において、２Ａはデータベース１Ａに保存された２次元インデックス形式の品目に対応して処理を実行する候補品目セット検証手段２１Ａ、候補品目セット生成手段２２Ａおよびハッシュ木操作手段２３Ａを有する大品目セット生成手段である。
【０１０９】
なお、相関ルール抽出部５１におけるその他の構成要素については、実施の形態１（図１）におけるものと同様であるのでその説明を省略する。
【０１１０】
次に動作について説明する。
図１３は表形式の一例を示す図であり、図１４はレシート形式でデータベース１Ａに保存されたレコードの一例を示す図であり、図１５はレシート形式の品目により構成されるハッシュ木の一例を示す図である。
【０１１１】
まず、レシートファイル生成手段５２は、例えば図１３に示すような表形式で保存されたレコードを読み出し、表の各属性に対応して付される属性番号と、読み出したレコードにおけるその属性番号に対応する属性に対応して付される番号（属性値）とで構成される２次元インデックス形式（属性番号｜属性値）のデータをレコードの各属性値毎に生成してデータベース１Ａに保存させる。
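The conversion performed by the receipt file generation means can be sketched as follows. The function name `to_receipt`, the column names, and the `value_codes` table are hypothetical; attribute numbers and attribute values are assumed to be assigned from 1 in the order given, which matches the "sex" and "height" example of this embodiment.

```python
def to_receipt(row, columns, value_codes):
    """Convert one table row into (attribute number, attribute value) items."""
    items = []
    for attribute_number, column in enumerate(columns, start=1):
        # Position of the value in the column's value list, numbered from 1.
        value_number = value_codes[column].index(row[column]) + 1
        items.append((attribute_number, value_number))
    return items


columns = ["sex", "height"]
value_codes = {"sex": ["male", "female"], "height": ["large", "medium", "small"]}
print(to_receipt({"sex": "female", "height": "small"}, columns, value_codes))
```

A row with sex "female" and height "small" becomes the receipt-format record (1|2), (2|3).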
【０１１２】
次に、相関ルール抽出部５１の大品目セット生成手段２Ａは、データベース１Ａからレシート形式のデータを適宜読み出して、メモリ６にハッシュ木を展開していき、実施の形態１と同様に各品目数の大品目セットを順次生成していく。
【０１１３】
ただし、ハッシュ木操作手段２３Ａおよび候補品目セット生成手段２２Ａがハッシュ木の枝伸ばしを実行する際に下記の規則に従って実行し、下記の規則に従わない品目はハッシュ木に追加されない。
規則１．属性番号が大きい品目ほど、その品目は「大」であるとする。例えば（４｜１）＞（２｜３）である。
規則２．属性番号が同じ場合、属性値が大きいほど、その品目は「大」であるとする。例えば（３｜５）＞（３｜２）である。
規則３．あるノードの品目の属性番号は、その上位のノードの品目の属性番号よりも大きくなければならない。例えば図１５のハッシュ木において、枝「（１｜１），（２｜１）」の下位に、品目「（２｜３）」のノードは追加されない。
【０１１４】
このようにすることにより、ハッシュ木のノードである品目の下位に追加できる品目が、同一の属性番号を有しない品目だけに制限されるため、同じ属性における属性値で構成される品目セットは生成されない。
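Rules 1 to 3 can be sketched with items held as (attribute number, attribute value) tuples. The function names `is_larger`, `can_append`, and `allowed_extensions` are hypothetical; the ordering of rules 1 and 2 happens to coincide with Python's built-in tuple comparison.

```python
def is_larger(a, b):
    """Rules 1 and 2: compare attribute numbers first, then attribute values."""
    return a > b  # tuple comparison is lexicographic


def can_append(branch, item):
    """Rule 3: the attribute number must exceed that of the node above."""
    if not branch:
        return True
    return item[0] > branch[-1][0]


def allowed_extensions(branch, items):
    """Items that may be added below the last node of the branch."""
    return [item for item in sorted(items) if can_append(branch, item)]


print(can_append([(1, 1), (2, 1)], (2, 3)))
print(allowed_extensions([(1, 1)], [(1, 2), (2, 1), (2, 3)]))
```

Since an item with the same attribute number as the node above is never appended, no branch, and therefore no item set, can contain two values of the same attribute.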
【０１１５】
そして、相関ルール抽出部５１の仮説生成検証手段４は実施の形態１と同様にして相関ルールを生成する。
【０１１６】
以上のように、この実施の形態９によれば、属性番号と属性値で構成される２次元インデックスとしてデータベースに各品目を保存し、ハッシュ木のノードである品目の下位に追加できる品目を、同一の属性番号を有しない品目だけに制限して大品目セットを生成するようにしたので、同じ属性における属性値の相関ルールの生成を抑制することができ、負の相関ルールの生成を抑制することができるという効果が得られる。
【０１１７】
なお、上記実施の形態９においては、レシートファイル生成手段５２によりデータベース１Ａに記録されているデータの形式をレシート形式に変換するようにしているが、最初からレシート形式のデータをデータベース１Ａに記録するようにしてもよい。その場合には、レシートファイル生成手段５２は特に必要ない。
【０１１８】
また、上記実施の形態９においては、相関ルール抽出部５１が実施の形態１によるものとほぼ同様に構成されているが、相関ルール抽出部５１はこのように限定されるものではなく、ハッシュ木に品目セットを展開しながら大品目セットを生成していくものであれば他のものでもよい。
【０１１９】
実施の形態１０．
この発明の実施の形態１０による相関ルール抽出装置は、実施の形態９による相関ルール抽出装置において互いに相関ルールを生成しない複数の属性を指定可能にしたものである。その他の処理については実施の形態９と同様であるのでその説明を省略する。
【０１２０】
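In Embodiment 10, a plurality of attributes among which no association rules are to be generated are treated as an attribute group, and the values of the attributes in the group are converted into items of the form (attribute group number | attribute value). For example, when the attributes “Interview 1” and “Interview 2” are set as an attribute group, the values of “Interview 1” are “a” and “b”, and the values of “Interview 2” are “a”, “b”, and “c”, these values are converted into the items (1|1), (1|2) and (1|3), (1|4), (1|5), respectively.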
【０１２１】
そしてハッシュ木の枝伸ばしを実行する際に実施の形態９と同様に下記の規則に従い、下記の規則に従わない品目はハッシュ木に追加されない。
規則１．属性グループ番号が大きい品目ほど、その品目は「大」であるとする。例えば（４｜１）＞（２｜３）である。
規則２．属性グループ番号が同じ場合、属性値が大きいほど、その品目は「大」であるとする。例えば（３｜５）＞（３｜２）である。
規則３．あるノードの品目の属性グループ番号は、その上位のノードの品目の属性グループ番号よりも大きくなければならない。例えば図１５のハッシュ木において、枝「（１｜１），（２｜１）」の下位に、品目「（２｜３）」のノードは追加されない。
【０１２２】
以上のように、この実施の形態１０によれば、実施の形態９による効果の他、互いに相関ルールを生成しない属性に対応する品目については同一の属性グループ番号を使用するようにしたので、不要な相関ルールの生成を抑制することができ、そのような相関ルールを除去する後処理を省略することができるという効果が得られる。
【０１２３】
実施の形態１１．
この発明の実施の形態１１による相関ルール抽出装置は、大品目セットの品目数がｋであるとき、大品目セットを品目数が１から（ｋ−１）までの条件部と品目数が（ｋ−１）から１までの帰結部とに分割してそれぞれ生成されるルールのうち、確信度またはχ２乗値が所定の基準値より大きいものを相関ルールとするものである。
【０１２４】
実施の形態１１においては、（ｋ−１）≧Ｎｃ≧１であるすべての整数Ｎｃについて、条件部の品目数がＮｃであり、帰結部の品目数が（ｋ−Ｎｃ）であるルールが大品目セットＬ（ｋ）から生成される。
【０１２５】
例えば、大品目セット「２，４，５，６」から、ルール「２，４，５→６」，「２，４，６→５」，「２，５，６→４」，「４，５，６→２」、ルール「２，４→５，６」，「２，５→４，６」，「２，６→４，５」，「４，５→２，６」、「４，６→２，５」，「５，６→２，４」およびルール「２→４，５，６」，「４→２，５，６」，「５→２，４，６」，「６→２，４，５」が生成される。同様に、大品目セット「２，４，５，７」から、ルール「２，４，５→７」，「２，４，７→５」，「２，５，７→４」，「４，５，７→２」、ルール「２，４→５，７」，「２，５→４，７」，「２，７→４，５」，「４，５→２，７」、「４，７→２，５」，「５，７→２，４」およびルール「２→４，５，７」，「４→２，５，７」，「５→２，４，７」，「７→２，４，５」が生成される。
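The enumeration of Embodiment 11 can be sketched as below. The function name `all_rules` is hypothetical; the sketch lists every split of a large item set into a non-empty condition part and a non-empty consequent part, starting with the largest condition parts as in the example above.

```python
from itertools import combinations


def all_rules(large):
    """Return (condition part, consequent part) pairs for a large item set."""
    rules = []
    k = len(large)
    for condition_size in range(k - 1, 0, -1):
        for condition in combinations(large, condition_size):
            consequent = tuple(item for item in large if item not in condition)
            rules.append((condition, consequent))
    return rules


rules = all_rules((2, 4, 5, 6))
print(len(rules))
```

For a large item set of four items this gives 4 + 6 + 4 = 14 candidate rules, the same count as the rules listed for [2, 4, 5, 6] in the example.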
【０１２６】
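FIG. 16 shows an example of a hash tree to which the rule-verification item sets LL(1) to LL(k-1), generated when the number of items in the condition part of a rule is 1 to (k-1), have been added. For example, when large item sets are generated as in Embodiment 1, the rule-verification item sets LL(1) to LL(k-1) corresponding to the condition parts of these rules are added to the hash tree as shown in FIG. 16.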
【０１２７】
そして、生成された各ルールについて、確信度またはχ２乗値が計算され、所定の基準値より高い確信度またはχ２乗値を有するルールが相関ルールとして保存される。
【０１２８】
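When the confidence or the χ² value is calculated, the support of the condition part of each rule, whose number of items is 1 to (k-1), is required. When large item sets are generated as in Embodiment 1, for example, the generated large item sets L(1) to L(k-1) are kept in the large item set file 3 instead of being deleted, so that the support of the condition part of a rule can be obtained by searching the large item set file 3 for the large item set identical to that condition part.
【０１２９】
The support obtained in this way is then stored, for example in the hash tree shown in FIG. 16, in association with the rule-verification item set LL(i) (i = 1 to (k-1)) that corresponds to the condition part of the rule.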
【０１３０】
以上のように、この実施の形態１１によれば、品目数がｋである大品目セットから相関ルールの候補を作成する際に、条件部の品目数が１〜（ｋ−１）であり、帰結部の品目数が（ｋ−１）〜１である候補のルールをそれぞれ生成するようにしたので、帰結部が２以上の相関ルールを生成することができるという効果が得られる。
【０１３１】
さらに、生成した大品目セットＬ（１）〜Ｌ（ｋ−１）を保存しておき、ルールの条件部の支持度として使用するようにしたので、帰結部が２以上の相関ルールを生成する際にも、簡単に確信度またはχ２乗値を計算することができるという効果が得られる。
【０１３２】
なお、この実施の形態１１による相関ルール抽出装置における大品目セットの生成手順は、上記実施の形態１〜実施の形態１０によるものに限定されるものではなく、大品目セットをファイルに格納して管理する任意のアルゴリズムで有効である。
【０１３３】
実施の形態１２．
図１７はこの発明の実施の形態１２による相関ルール抽出装置の構成を示すブロック図である。実施の形態１２による相関ルール抽出装置は、上述の実施の形態１〜実施の形態１１による処理手順を記述したプログラムに従って動作するいわゆるコンピュータによるものである。
【０１３４】
図において、７１はＲＯＭ７２、ハードディスクドライブ装置７４、記録媒体９１などに予め記録されたプログラム８１，８１Ａに従って各種処理を実行するＣＰＵ（相関ルール生成手段、支持度計算手段、候補品目セット生成手段、ルール検証用品目セット生成手段、共通品目セット追加手段、記憶容量計算手段、相関ルール抽出部、データ変換手段）であり、７２は起動時に実行されるプログラムや各種処理に必要なデータなどを予め記憶するＲＯＭであり、７３はハードディスクドライブ装置７４や記録媒体９１に予め記録されたプログラム８１，８１Ａをロードされるとともに、各種処理において各種データを一時的に記憶するＲＡＭ（記憶手段）である。
【０１３５】
７４は上述の処理を記述したプログラム８１、データベース１，１Ａ、大品目セットファイル３、相関ルール集合ファイル５などを保存するハードディスクドライブ装置であり、７５は記録媒体９１に対してデータの読み書きを実行する記録媒体駆動装置である。なお、ＣＰＵ７１、ＲＯＭ７２、ＲＡＭ７３、ハードディスクドライブ装置７４および記録媒体駆動装置７５は互いにデータバスやアドレスバスで接続されている。なお、このバス構成は一例であり他の構成でもよい。９１は上述の処理を記述したプログラム８１Ａを記録されたフレキシブルディスク、ＣＤ−ＲＯＭ（Compact Disc-Read Only Memory）などのコンピュータ読み取り可能な記録媒体である。
【０１３６】
次に動作について説明する。
まず、ユーザによる操作などに対応してＣＰＵ７１が、ハードディスクドライブ装置７４に記録されたプログラム８１または記録媒体９１に記憶されたプログラム８１ＡをＲＡＭ７３にロードして、そのプログラムに従って上述の実施の形態１〜実施の形態１１に記載の各種処理を実行する。すなわち、ＣＰＵ７１が上述の大品目セット生成手段２，２Ａおよび仮説生成検証手段４として動作し、ＲＡＭ７３がメモリ６として機能する。
【０１３７】
以上のように、この実施の形態１２によれば、上述の実施の形態１〜実施の形態１１による処理手順を記述したプログラムをコンピュータで実行して上述の各種処理を実行するようにしたので、上述の実施の形態１〜実施の形態１１による効果と同様の効果が得られる。
【０１３８】
【発明の効果】
以上のように、この発明によれば、所定の保存手段から大品目セットを最後の品目以外の品目が共通するもの毎に読み出し、所定の記憶手段に記憶されたハッシュ木に、大品目セットの共通部分である品目セットを追加し、読み出した大品目セットの最後の品目の集合のうちのいずれか２つを順番に選択し、その２つの品目をそれぞれ共通部分に追加して候補品目セットを生成し、各候補品目セットの一部であって候補品目セットより品目数が１だけ少ない品目セットをルール検証用品目セットとしてハッシュ木に追加し、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加後のハッシュ木のための記憶容量を計算し、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、各候補品目セットおよび各ルール検証用品目セットの支持度を計算し、候補品目セットのうち、支持度が所定の基準値より大きいものを大品目セットとして所定の保存手段に保存し、ルール検証用品目セットおよび大品目セットの支持度より計算される確信度またはχ２乗値に基づいて大品目セットから相関ルールを生成し、その後にハッシュ木を所定の記憶手段から消去するように構成したので、各大品目セットの処理毎にハッシュ木の使用する記憶容量を調べることになり、予め割り当てられた記憶容量を有効に使用することができるという効果がある。
【０１３９】
この発明によれば、大品目セットとともにその支持度を保存手段に保存し、各ルール検証用品目セットと同一の大品目セットが保存手段に保存されている場合には、その大品目セットの支持度をそのルール検証用品目セットの支持度とするように構成したので、大品目セットが少ない場合には、短い時間で各ルール検証用品目セットの支持度を取得することができるという効果がある。
【０１４０】
この発明によれば、データベースからレコードを読み出し、候補品目セットの出現数およびルール検証用品目セットの出現数を並行してカウントして候補品目セットの支持度およびルール検証用品目セットの支持度を計算するように構成したので、大品目セットが多い場合には、保存手段に保存された大品目セットを検索するより短い時間で各ルール検証用品目セットの支持度を計算することができるという効果がある。
【０１４１】
この発明によれば、ハッシュ木からルール検証用品目セットに対応する部分木を切り離した後、各候補品目セットの支持度を計算し、支持度の計算後、ルール検証用品目セットに対応する部分木をハッシュ木の元の位置に戻すように構成したので、候補品目セットの支持度の計算に関係のないルール検証用品目セットについてのマッチングが行われず、効率的に候補品目セットのマッチングを実行することができるという効果がある。
【０１４２】
この発明によれば、切り離した部分木が接続されていたハッシュ木の接続ノードとその部分木のルートとの対応関係を示すデータをスタックによるデータ構造に従って記憶し、そのデータに基づいて、切り離した部分木をハッシュ木の元の位置に戻すように構成したので、メモリ管理を簡単化することができるという効果がある。
【０１４３】
この発明によれば、切り離した部分木が接続されていたハッシュ木の接続ノードとその部分木のルートとの対応関係を示すデータをリストによるデータ構造に従って記憶し、そのデータに基づいて、切り離した部分木をハッシュ木の元の位置に戻すように構成したので、切り離し位置の数がいくつでもよく、また不要な記憶領域を予め確保する必要がなくメモリを効率良く使用することができるという効果がある。
【０１４４】
この発明によれば、ハッシュ木のための記憶容量が所定の基準値より大きくなった場合に、直前に追加された大品目セットの共通部分、候補品目セットおよびルール検証用品目セットをハッシュ木から削除するように構成したので、使用記憶容量をより正確に制限することができるという効果がある。
【０１４５】
この発明によれば、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量のうちの最大値が、所定の基準値とハッシュ木のための記憶容量との差より大きくなった場合に、支持度を計算し、相関ルールを生成し、その後にハッシュ木を消去するように構成したので、ハッシュ木のための記憶容量が所定の基準値を超える手前まで有効にメモリが使用されるという効果がある。
【０１４６】
この発明によれば、大品目セットの共通部分、候補品目セットおよびルール検証用品目セットの追加によるハッシュ木のための記憶容量の増加量の平均値が、所定の基準値とハッシュ木のための記憶容量との差より大きくなった場合に、支持度を計算し、相関ルールを生成し、その後にハッシュ木を消去するように構成したので、ハッシュ木のための記憶容量が所定の基準値を超える手前まで有効にメモリが使用されるという効果がある。
【０１４７】
この発明によれば、属性番号と属性値で構成される２次元インデックスとして各品目をデータベースに保存し、ハッシュ木のノードである品目の下位に追加できる品目を、同一の属性番号を有しない品目だけに制限して大品目セットを生成し、その大品目セットから相関ルールを抽出するように構成したので、同じ属性における属性値の相関ルールの生成を抑制することができ、負の相関ルールの生成を抑制することができるという効果がある。
【０１４８】
この発明によれば、データベースに、各品目の値を、属性番号と属性値で構成される２次元インデックスとして、または、そのデータベースの複数の属性で構成される属性グループについての属性グループ番号とそれらのうちのいずれかの属性値で構成される２次元インデックスとして保存し、ハッシュ木のノードである品目の下位に追加できる品目を、同一の属性番号または属性グループ番号を有しない品目だけに制限して大品目セットを生成し、その大品目セットから相関ルールを抽出するように構成したので、不要な相関ルールの生成を抑制することができ、そのような相関ルールを除去する後処理を省略することができるという効果がある。
【０１４９】
この発明によれば、所定の表形式でデータベースに予め保存されたレコードの各属性値を２次元インデックスに変換し、その２次元インデックスをデータベースに保存するデータ変換手段を備えるようにしたので、データベースに表形式でレコードが記録されていても２次元インデックスに変換され、同様に、同じ属性における属性値の相関ルールの生成を抑制することができ、負の相関ルールの生成を抑制することができるという効果がある。
【０１５０】
この発明によれば、所定の保存手段に保存された大品目セットのうち、最後の品目以外の品目が共通する大品目セットから、その大品目セットより品目数が１だけ多い候補品目セットを生成し、候補品目セットの支持度を計算し、その支持度に基づいて候補品目セットと同一品目数の大品目セットを生成して保存手段に保存するステップと、大品目セットの品目数をｋとして、大品目セットを、品目数が１から（ｋ−１）までの条件部と品目数が（ｋ−１）から１までの帰結部とに分割してそれぞれ生成されるルールのうち、確信度またはχ２乗値が所定の基準値より大きいものを相関ルールとするように構成したので、帰結部が２以上の相関ルールを生成することができるという効果がある。
【０１５１】
この発明によれば、大品目セットを保存手段に保存する際に、その大品目セットの支持度も保存し、ルールの条件部と同一の大品目セットが保存手段に保存されている場合には、その大品目セットの支持度をそのルールの条件部の支持度としてルールの確信度またはχ２乗値を計算するように構成したので、帰結部が２以上の相関ルールを生成する際に、簡単に確信度またはχ２乗値を計算することができるという効果がある。
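[Brief description of the drawings]
[FIG. 1] A block diagram showing the configuration of the association rule extraction apparatus according to Embodiment 1 of the present invention.
[FIG. 2] A flowchart explaining the operation of the association rule extraction apparatus according to Embodiment 1.
[FIG. 3] A diagram showing an example of the contents of the large item set file.
[FIG. 4] A diagram showing the hash tree made up of the large item sets [1, 2, 3], [1, 2, 4], [1, 3, 5], and [1, 4, 5] of FIG. 3.
[FIG. 5] A diagram explaining an example of the operation in step ST3 of FIG. 2.
[FIG. 6] A diagram explaining an example of the operation in step ST4 of FIG. 2.
[FIG. 7] A diagram explaining an example of the operation in step ST5 of FIG. 2.
[FIG. 8] A diagram showing an example of the positions at which subtrees are detached from the hash tree.
[FIG. 9] A diagram showing an example of the correspondence between the hash tree and the detached subtrees.
[FIG. 10] A diagram showing an example of a stack that stores the correspondence between the hash tree and the detached subtrees.
[FIG. 11] A diagram showing an example of a list that stores the correspondence between the hash tree and the detached subtrees.
[FIG. 12] A block diagram showing the configuration of the association rule extraction apparatus according to Embodiment 9 of the present invention.
[FIG. 13] A diagram showing an example of the table format.
[FIG. 14] A diagram showing an example of records stored in the database in receipt format.
[FIG. 15] A diagram showing an example of a hash tree made up of items in receipt format.
[FIG. 16] A diagram showing an example of a hash tree to which the rule-verification item sets LL(1) to LL(k-1), generated when the number of items in the condition part of a rule is 1 to (k-1), have been added.
[FIG. 17] A block diagram showing the configuration of the association rule extraction apparatus according to Embodiment 12 of the present invention.
[FIG. 18] A block diagram showing a configuration example of a conventional association rule extraction apparatus.
[FIG. 19] A diagram showing an example of a hash tree representing large item sets.
[FIG. 20] A diagram explaining the first previously proposed method.
[FIG. 21] A diagram explaining the second previously proposed method.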
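[Explanation of symbols]
1, 1A: database; 3: large item set file (saving means); 4: hypothesis generation verification means (correlation rule generation means); 6: memory (storage means); 21: candidate item set verification means (support level calculation means); 22: candidate item set generation means (candidate item set generation means, rule verification item set generation means); 23: hash tree operation means (common item set addition means, storage capacity calculation means); 51: correlation rule extraction unit; 52: receipt file generation means (data conversion means); 71: CPU (correlation rule generation means, support level calculation means, candidate item set generation means, rule verification item set generation means, common item set addition means, storage capacity calculation means, correlation rule extraction unit, data conversion means); 73: RAM (storage means); 81A: program; 91: recording medium.
[0001]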
BACKGROUND OF THE INVENTION
The present invention relates to a correlation rule extraction apparatus, a correlation rule extraction method, and a recording medium for extracting a correlation rule between item sets in a database from a plurality of records recorded in the database.
[0002]
[Prior art]
FIG. 18 is a block diagram showing a configuration example of a conventional correlation rule extraction apparatus. In this conventional apparatus, correlation rules between item sets in a database are extracted from a plurality of records recorded in the database, based on the Apriori method described in, for example, “Fast Algorithms for Mining Association Rules” (Agrawal et al., Proc. of the 20th VLDB Conference, Santiago, Chile, 1994) and Japanese Patent Application Laid-Open No. 8-287106. In the Apriori method, the support level is the proportion of all records that contain the item set “A, B, ..., X, Y”, and the certainty factor is the proportion of the records containing the condition-part item set “A, B, ..., X” of the rule “A, B, ..., X → Y” that also contain the result-part item “Y”. When the support level and the certainty factor are each higher than a predetermined reference value, the rule “A, B, ..., X → Y” is extracted as a correlation rule.
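The two measures defined above can be illustrated with a minimal sketch. The function names and the sample records are hypothetical and are not taken from the cited paper or from this specification; records are assumed to be sets of item numbers.

```python
def support(records, items):
    """Fraction of all records that contain every item in `items`."""
    items = set(items)
    return sum(1 for r in records if items <= set(r)) / len(records)


def certainty(records, condition, result_item):
    """Among records containing `condition`, the fraction that also
    contain `result_item` (0.0 if no record contains the condition)."""
    cond = set(condition)
    both = cond | {result_item}
    n_cond = sum(1 for r in records if cond <= set(r))
    n_both = sum(1 for r in records if both <= set(r))
    return n_both / n_cond if n_cond else 0.0


# Hypothetical records, each a set of item numbers.
records = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}]
print(support(records, [1, 2]))     # support level of the item set [1, 2]
print(certainty(records, [1], 2))   # certainty factor of the rule 1 -> 2
```

A rule would be kept only when both values exceed their respective reference values.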
[0003]
In FIG. 18, reference numeral 1 denotes a database that holds data about a predetermined target as a plurality of records. Reference numeral 5 denotes a correlation rule set file that holds the generated correlation rules. Reference numeral 102 denotes a large item set generation unit that refers to a record in the database 1 and generates a large item set with the number of items (k + 1) from a large item set with the number of items k. The large item set refers to an item set having a support level higher than a predetermined reference value. Reference numeral 104 denotes hypothesis generation / verification means for generating a rule (that is, a candidate for a correlation rule) from each large item set and extracting the correlation rule based on the certainty of the rule.
[0004]
In the large item set generation means 102, reference numeral 121 denotes candidate item set verification means that refers to the records of the database 1, calculates the support level of each candidate item set generated by the candidate item set generation means 122, and takes the candidate item sets whose support level is higher than a predetermined reference value as large item sets. Reference numeral 122 denotes candidate item set generation means that, from each pair of large item sets that have (k-1) items and share (k-2) of them, generates a candidate item set with k items composed of the (k-2) shared items and the two differing items of those large item sets.
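The pairing performed by the candidate item set generation means 122 can be sketched as follows. This is an illustrative sketch only: it assumes each large item set is a sorted tuple of item numbers and that the (k-2) shared items are the leading items, as in the join step of the Apriori method; the function name is hypothetical.

```python
def apriori_join(large_sets):
    """Pair (k-1)-item sets that differ only in their last item and
    return the resulting k-item candidate sets."""
    sets = sorted(large_sets)
    candidates = []
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            if a[:-1] == b[:-1]:          # the first (k-2) items are shared
                candidates.append(a + (b[-1],))
    return candidates


print(apriori_join([(1, 2), (1, 3), (2, 3)]))
```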
[0005]
In the hypothesis generation verification means 104, reference numeral 31 denotes rule candidate generation means that generates, from a large item set with k items, all rules composed of a condition part of (k-1) items and a result part of one item. Reference numeral 33 denotes certainty factor calculation means that calculates the certainty factor of each rule generated by the rule candidate generation means 31 with reference to the large item sets.
[0006]
Next, the operation will be described.
First, the candidate item set verification means 121 generates the large item sets with one item by referring to the items of each record in the database 1. The candidate item set generation means 122 then generates, from each pair of large item sets with one item, a candidate item set with two items composed of the two different items of those large item sets.
[0007]
Next, the candidate item set verification unit 121 refers to the record in the database 1 to calculate the support level of the candidate item set generated by the candidate item set generation unit 122 and has a support level higher than a predetermined reference value. Let the candidate item set be a large item set.
[0008]
Then, from each pair of large item sets that have (k-1) items and share (k-2) of them, the candidate item set generation means 122 generates a candidate item set with k items composed of the (k-2) shared items and the two differing items of those large item sets. In the same way, the large item set generation means 102 generates large item sets while increasing the number of items by one, until no candidate item set remains.
[0009]
Meanwhile, in the hypothesis generation verification means 104, the rule candidate generation means 31 generates, from each large item set with k items produced by the candidate item set verification means 121, all rules composed of a condition part of (k-1) items and a result part of one item. The certainty factor calculation means 33 then refers to the large item sets, calculates the certainty factor of each rule generated by the rule candidate generation means 31, and saves the rules whose certainty factor is higher than a predetermined reference value in the correlation rule set file 5 as correlation rules.
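A minimal sketch of this rule generation and verification step follows. It assumes the large item sets and their support levels are available as a dictionary keyed by sorted tuples, and that every subset of a large item set is itself present in the dictionary; the names are hypothetical.

```python
def rules_single_result(supports, min_certainty):
    """For every large item set with two or more items, emit the rules
    with a (k-1)-item condition part and a one-item result part whose
    certainty factor exceeds `min_certainty`."""
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for y in itemset:
            condition = tuple(i for i in itemset if i != y)
            certainty = sup / supports[condition]
            if certainty > min_certainty:
                rules.append((condition, y, certainty))
    return rules


# Hypothetical support levels of large item sets.
supports = {(1,): 0.75, (2,): 0.75, (1, 2): 0.5}
for condition, y, c in rules_single_result(supports, 0.6):
    print(condition, "->", y, round(c, 3))
```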
[0010]
In this way, a large item set for each number of items is generated, and a correlation rule is extracted based on the large item set.
[0011]
When large item sets are held, they are often held in a memory or the like with a hash tree as the data structure. FIG. 19 is a diagram illustrating an example of a hash tree representing large item sets. In this hash tree, each item constitutes a node, and one item set constitutes one branch from the root to a given lower node. For example, in FIG. 19, the branch from the root through node 1 and node 3 to node 5 represents the item set [1, 3, 5] consisting of items 1, 3, and 5. Hereinafter, an item set consisting of items x1, x2, ..., xn is written as [x1, x2, ..., xn].
[0012]
That is, as the generation of large item sets described above is repeated, a branch is added to the hash tree each time a large item set is added, and each time the number of items increases, the branch is extended by increasing the number of nodes that constitute it. In the hash tree, every branch constitutes one large item set.
[0013]
In this way, a large item set having a different number of items can be expressed in a tree structure by utilizing the property that a subset in the large item set always becomes a large item set. The subset in the large item set is an item set including (k−1) or less items of k items constituting the large item set.
[0014]
In this way, by holding a plurality of large item sets as a hash tree, the storage capacity required for holding the large item sets is reduced as compared with the case where the large item sets are simply stored sequentially.
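The sharing of common leading items among branches can be illustrated with a small sketch. The class and function names are hypothetical, and a plain dictionary per node stands in for whatever hashing the actual hash tree uses.

```python
class Node:
    """One item of a branch; children are keyed by item number."""
    def __init__(self):
        self.children = {}


def add_item_set(root, items):
    """Add one item set as a branch, reusing nodes that already exist."""
    node = root
    for item in sorted(items):
        node = node.children.setdefault(item, Node())
    return node


def count_nodes(node):
    """Number of item nodes below `node`."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root = Node()
for item_set in ([1, 2, 3], [1, 2, 4], [1, 3, 5], [1, 4, 5]):
    add_item_set(root, item_set)
print(count_nodes(root))
```

Stored one after another, these four item sets would hold 12 items; the tree needs fewer nodes because shared leading items are stored once.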
[0015]
However, even when a hash tree is used, the hash tree itself becomes large if the number of items is large, and a memory with a large storage capacity is required to generate the large item sets. Moreover, when the storage capacity of the memory is small and a virtual storage scheme is adopted, paging occurs frequently and the time required for processing becomes long.
[0016]
Other conventional techniques related to the above include those described in JP-A-8-161287, JP-A-8-263346, JP-A-8-272825, JP-A-8-314981, JP-A-9-34721, and JP-A-10-269248.
[0017]
Therefore, in order to solve such a problem, the present applicant previously proposed a method for suppressing paging in Japanese Patent Application No. 10-182416. FIG. 20 is a diagram for explaining the previously proposed first method, and FIG. 21 is a diagram for explaining the previously proposed second method.
[0018]
In the first method proposed previously, a large item set is stored in a file, and the large item set is read from the file one by one and added to the hash tree. All candidate item sets generated from the large item set being added are added to the hash tree.
[0019]
Here, when candidate item sets corresponding to a large item set are generated, nodes for the items with numbers higher than the last item of the large item set are added in parallel below the last item of that large item set, which generates the branches of the candidate item sets. For example, when there are 10 items from 1 to 10 and candidate item sets are generated for the large item set [1, 3, 5], nodes 6, 7, 8, 9, and 10 are added below the branch of the large item set [1, 3, 5], and the candidate item sets [1, 3, 5, 6], [1, 3, 5, 7], [1, 3, 5, 8], [1, 3, 5, 9], and [1, 3, 5, 10] are generated.
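The branch extension of the first method can be sketched in a few lines. The function name is hypothetical, and items are assumed to be numbered consecutively from 1 to `max_item`.

```python
def extend_with_higher_items(large_set, max_item):
    """Append every item numbered higher than the last item of
    `large_set`, giving one candidate item set per appended item."""
    last = large_set[-1]
    return [tuple(large_set) + (i,) for i in range(last + 1, max_item + 1)]


print(extend_with_higher_items((1, 3, 5), 10))
```

Because every higher-numbered item is appended regardless of whether it occurs in another large item set, this produces more candidates than the pairing used by the Apriori method.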
[0020]
The above-described series of processing is executed in turn for each large item set having the same number of items. After each series of processing, the storage capacity occupied by the hash tree is calculated. When the value exceeds a predetermined reference value, the candidate item sets added to the hash tree are matched against the records to calculate their support levels, and, as shown in FIG. 20, large item sets having one more item than the large item sets stored in the file are generated. Once all such large item sets have been generated, the file is updated in a batch. Therefore, the file always stores large item sets having the same number of items. Candidate item sets that did not become large item sets are deleted.
[0021]
Then, the item sets needed for rule verification (rule verification item sets), that is, the item sets that are subsets of a large item set and have one fewer item than it, are also added to the hash tree. After the correlation rules have been generated based on these subtrees, the added large item set and the subtrees added for it are deleted. The next large item set is then added and processed in the same manner.
[0022]
In the previously proposed second method, branch extension similar to that of the Apriori method is performed. In this method, large item sets in which all items other than the last item are the same are read from the file together and added to the hash tree in turn, until the data amount of the hash tree reaches a predetermined storage capacity that is a fraction of the storage capacity allocated in advance for the hash tree. The large item sets are read only up to this fraction of the pre-allocated storage capacity in order to reserve the storage capacity needed for adding the candidate item sets and rule verification item sets afterward.
[0023]
Candidate item sets are generated from large item sets in which all items other than the last item are the same. For example, when the large item sets [2, 3, 5] and [2, 3, 7] are stored in the file, these large item sets are read together and added to the hash tree, as shown in FIG. 21. Then, based on the differing last items “5” and “7”, the item “7” is added to the large item set [2, 3, 5], and the candidate item set [2, 3, 5, 7] is added to the hash tree.
[0024]
As described above, according to the previously proposed first and second methods, large item sets can be generated efficiently within the range of the storage capacity allocated in advance, and as a result correlation rules can be extracted efficiently.
[0025]
[Problems to be solved by the invention]
Since the previously proposed methods are configured as described above, they can solve the above-mentioned problem. However, because the first method generates candidate item sets by a branch extension different from that of the Apriori method, the number of candidate item sets whose support level must be calculated increases, and the processing time increases accordingly. In the second method, once the data amount of the large item sets reaches a fraction of the pre-allocated storage capacity, the candidate item sets and the like for those large item sets are added to the hash tree using the remaining storage capacity, so the pre-allocated storage capacity may not be fully used.
[0026]
In the above-described conventional techniques, when correlation rules are extracted from tabular data, numbers 1, 2, 3, and so on are assigned to the attribute values in the table as preprocessing, and the above processing is executed after the attribute values have been converted into these numbers (items). As a result, obvious correlation rules between attribute values of the same attribute are generated (for example, for the attribute values “male” and “female” of the attribute “gender”, the negatively correlated rule that if the gender is “male” then the gender is not “female”), and such obvious correlation rules must be removed as post-processing.
[0027]
Furthermore, in the above-described conventional techniques, only correlation rules whose result part has one item are extracted, so it is difficult to extract correlation rules whose result part has two or more items.
[0028]
The present invention has been made to solve the above-described problems. Its object is to obtain a correlation rule extraction apparatus, a correlation rule extraction method, and a recording medium that use the pre-allocated storage capacity effectively by: reading the large item sets in groups that share all items other than the last item; adding the item set that is the common part of those large item sets to the hash tree; selecting, in turn, any two items from the set of last items of the read large item sets and adding the two items to the common part to generate a candidate item set; generating rule verification item sets and adding them to the hash tree; then calculating the storage capacity for the hash tree; and, when that storage capacity becomes larger than a predetermined reference value, calculating the support level of each candidate item set and each rule verification item set, saving the candidate item sets whose support level is larger than a predetermined reference value as large item sets, generating correlation rules from the large item sets based on the certainty factor or chi-square value calculated from the support levels of the rule verification item sets and the large item sets, and then deleting the hash tree.
[0029]
In addition, each item is stored in the database as a two-dimensional index composed of attribute numbers and attribute values, and items that can be added to the lower level of items that are nodes of the hash tree are limited to items that do not have the same attribute number. It is an object of the present invention to obtain a correlation rule extraction device, a correlation rule extraction method, and a recording medium for suppressing generation of a correlation rule for attribute values in the same attribute so as to generate a large item set.
[0030]
A further object is to obtain a correlation rule extraction apparatus, a correlation rule extraction method, and a recording medium that generate correlation rules whose result part has two or more items, by generating, when correlation rule candidates are created from a large item set with k items, each rule in which the condition part has 1 to (k-1) items and the result part has (k-1) to 1 items.
[0031]
[Means for Solving the Problems]
The correlation rule extraction apparatus according to the present invention comprises: storage means for storing a plurality of item sets as a hash tree; saving means for saving large item sets; common item set addition means for reading the large item sets from the saving means in groups that share all items other than the last item, and adding the item set that is their common part to the hash tree; candidate item set generation means for selecting, in turn, any two items from the set of last items of the read large item sets and adding the two items to the common part to generate a candidate item set; rule verification item set generation means for adding to the hash tree, as rule verification item sets, the item sets that are subsets of each candidate item set and have one fewer item than the candidate item set; storage capacity calculation means for calculating the storage capacity for the hash tree after the addition of the common part of the large item sets, the candidate item sets, and the rule verification item sets; support level calculation means for, when the storage capacity for the hash tree exceeds a predetermined reference value, calculating the support level of each candidate item set and each rule verification item set and saving in the saving means, as large item sets, the candidate item sets whose support level is greater than a predetermined reference value; and correlation rule generation means for, when the storage capacity for the hash tree exceeds the predetermined reference value, generating correlation rules from the large item sets based on the certainty factor or chi-square value calculated from the support levels of the rule verification item sets and the large item sets, and then deleting the hash tree from the storage means.
[0032]
The correlation rule extraction apparatus according to the present invention saves the support level together with each large item set in the saving means and, when a large item set identical to a rule verification item set is saved in the saving means, uses the support level of that large item set as the support level of the rule verification item set.
[0033]
The correlation rule extraction apparatus according to the present invention reads records from a database, counts the number of appearances of the candidate item sets and the number of appearances of the rule verification item sets in parallel, and calculates the support level of the candidate item sets and the support level of the rule verification item sets.
[0034]
The correlation rule extraction apparatus according to the present invention detaches the subtrees corresponding to the rule verification item sets from the hash tree, calculates the support level of each candidate item set, and, after the support levels have been calculated, returns the subtrees corresponding to the rule verification item sets to their original positions in the hash tree.
[0035]
The correlation rule extraction apparatus according to the present invention stores, in a stack data structure, data indicating the correspondence between the connection node of the hash tree to which a detached subtree was connected and the root of that subtree, and returns the detached subtree to its original position in the hash tree based on that data.
[0036]
The correlation rule extraction apparatus according to the present invention stores, in a list data structure, data indicating the correspondence between the connection node of the hash tree to which a detached subtree was connected and the root of that subtree, and returns the detached subtree to its original position in the hash tree based on that data.
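The bookkeeping for detached subtrees can be sketched as follows. This is only an illustration: the hash tree is modelled as nested dictionaries, a Python list plays the role of the stack (or list) of correspondence data, and the function names and the sample tree are hypothetical.

```python
def detach(parent, item, log):
    """Cut the subtree under parent[item] and record the connection node
    together with the root of the subtree so it can be put back later."""
    subtree = parent.pop(item)
    log.append((parent, item, subtree))
    return subtree


def restore_all(log):
    """Reattach every recorded subtree at its original position."""
    while log:
        parent, item, subtree = log.pop()   # last detached, first restored
        parent[item] = subtree


tree = {2: {4: {5: {6: {}, 7: {}}}, 3: {}}}
log = []
detach(tree[2], 3, log)
detach(tree[2][4], 5, log)
print(tree)   # the tree while the two subtrees are detached
```

Because each entry is created only when a subtree is actually detached, the number of detachment positions does not have to be fixed in advance.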
[0037]
When the storage capacity for the hash tree becomes larger than a predetermined reference value, the correlation rule extraction apparatus according to the present invention deletes from the hash tree the common part of the large item sets, the candidate item sets, and the rule verification item sets that were added immediately before.
[0038]
In the correlation rule extraction apparatus according to the present invention, when the maximum value of the increases in storage capacity for the hash tree caused by adding the common part of the large item sets, the candidate item sets, and the rule verification item sets becomes larger than the difference between a predetermined reference value and the storage capacity for the hash tree, the support level is calculated, correlation rules are generated, and the hash tree is then deleted.
[0039]
In the correlation rule extraction apparatus according to the present invention, when the average value of the increases in storage capacity for the hash tree caused by adding the common part of the large item sets, the candidate item sets, and the rule verification item sets becomes larger than the difference between a predetermined reference value and the storage capacity for the hash tree, the support level is calculated, correlation rules are generated, and the hash tree is then deleted.
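The two trigger conditions (maximum increase and average increase) can be compared with a small sketch. The function name, the `mode` parameter, and the byte figures are hypothetical; `increases` is assumed to hold the growth in hash tree size observed for each group added so far.

```python
def should_process_now(increases, used, limit, mode="max"):
    """Return True when the expected growth from one more addition
    (estimated as the maximum or the average of past increases)
    no longer fits in the space left below the reference value."""
    if not increases:
        return False
    if mode == "max":
        estimate = max(increases)
    else:
        estimate = sum(increases) / len(increases)
    return estimate > limit - used


past = [10, 30, 20]          # hypothetical increases, in bytes
print(should_process_now(past, used=80, limit=100, mode="max"))
print(should_process_now(past, used=80, limit=100, mode="average"))
```

The maximum is the more cautious estimate; the average allows the tree to grow closer to the reference value at the risk of occasionally exceeding it.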
[0041]
The recording medium according to the present invention records a program for causing a computer to execute: a procedure for reading large item sets from predetermined saving means in groups that share all items other than the last item, and adding the item set that is the common part of those large item sets to a hash tree stored in predetermined storage means; a procedure for selecting, in turn, any two items from the set of last items of the read large item sets and adding the two items to the common part to generate a candidate item set; a procedure for adding to the hash tree, as rule verification item sets, the item sets that are subsets of each candidate item set and have one fewer item than the candidate item set; a procedure for calculating the storage capacity for the hash tree after the addition of the common part of the large item sets, the candidate item sets, and the rule verification item sets; a procedure for, when the storage capacity for the hash tree becomes larger than a predetermined reference value, calculating the support level of each candidate item set and each rule verification item set and saving in the predetermined saving means, as large item sets, the candidate item sets whose support level is greater than a predetermined reference value; and a procedure for generating correlation rules from the large item sets based on the certainty factor or chi-square value calculated from the support levels of the rule verification item sets and the large item sets, and then deleting the hash tree from the predetermined storage means.
[0042]
The correlation rule extraction apparatus according to the present invention comprises: a database that stores each item as a two-dimensional index composed of an attribute number and an attribute value; and a correlation rule extraction unit that generates large item sets while restricting the items that can be added below an item that is a node of a hash tree to items that do not have the same attribute number, and extracts correlation rules from the large item sets.
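The restriction on which items may be added below a node can be sketched as follows. The representation is an assumption made for illustration: each item is a tuple (attribute number, attribute value), tuples are ordered lexicographically, and the function name and sample attributes are hypothetical.

```python
def allowed_extensions(branch, items):
    """Items that may be added below the non-empty `branch`: they must
    come after the last item of the branch and must not reuse an
    attribute number already present in the branch."""
    used_attributes = {attribute for attribute, _ in branch}
    return [item for item in items
            if item > branch[-1] and item[0] not in used_attributes]


# Hypothetical table: attribute 1 = gender (1: male, 2: female),
# attribute 2 = age band (1: under 40, 2: 40 or over).
items = [(1, 1), (1, 2), (2, 1), (2, 2)]
print(allowed_extensions([(1, 1)], items))
```

The item (1, 2) is never placed below (1, 1), so no item set, and therefore no rule, combines two values of the same attribute.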
[0043]
The correlation rule extraction apparatus according to the present invention stores in the database the value of each item either as a two-dimensional index composed of an attribute number and an attribute value or, for an attribute group composed of a plurality of attributes of the database, as a two-dimensional index composed of an attribute group number and one of the attribute values of those attributes; generates large item sets while restricting the items that can be added below an item that is a node of the hash tree to items that do not have the same attribute number or attribute group number; and extracts correlation rules from the large item sets.
[0044]
The correlation rule extraction apparatus according to the present invention comprises data conversion means for converting each attribute value of records stored in the database in a predetermined table format into a two-dimensional index and storing the two-dimensional index in the database.
[0046]
The recording medium according to the present invention records a program for causing a computer to execute: a procedure for reading items from a database that stores each item as a two-dimensional index composed of an attribute number and an attribute value; and a procedure for generating large item sets while restricting the items that can be added below an item that is a node of a hash tree to items that do not have the same attribute number, and extracting correlation rules from the large item sets.
[0047]
The correlation rule extraction apparatus according to the present invention comprises: saving means for saving large item sets; candidate item set generation means for generating candidate item sets from those large item sets saved in the saving means that share all items other than the last item; support level calculation means for calculating the support level of the candidate item sets, generating large item sets with the same number of items as the candidate item sets based on the support level, and saving them in the saving means; and correlation rule generation means that, where k is the number of items in a large item set, divides the large item set into a condition part of 1 to (k-1) items and a result part of (k-1) to 1 items, and takes as correlation rules those of the resulting rules whose certainty factor or chi-square value is larger than a predetermined reference value.
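The division of a large item set into condition and result parts of every size can be sketched as follows. The sketch filters on the certainty factor only, assumes that the support level of every subset is available in a dictionary keyed by sorted tuples, and uses hypothetical names and support values.

```python
from itertools import combinations


def split_rules(itemset, supports, min_certainty):
    """Divide `itemset` (k items) into every condition part of 1 to
    (k-1) items with the remaining items as the result part, and keep
    the rules whose certainty factor exceeds `min_certainty`."""
    rules = []
    for m in range(1, len(itemset)):
        for condition in combinations(itemset, m):
            result = tuple(i for i in itemset if i not in condition)
            certainty = supports[itemset] / supports[condition]
            if certainty > min_certainty:
                rules.append((condition, result, certainty))
    return rules


# Hypothetical support levels saved together with the large item sets.
supports = {(1,): 0.6, (2,): 0.5, (3,): 0.5,
            (1, 2): 0.4, (1, 3): 0.4, (2, 3): 0.3, (1, 2, 3): 0.3}
for condition, result, c in split_rules((1, 2, 3), supports, 0.7):
    print(condition, "->", result, round(c, 2))
```

Looking up the support level of the condition part in the saved large item sets avoids another pass over the database.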
[0048]
The correlation rule extraction apparatus according to the present invention saves the support level of each large item set when the large item set is saved in the saving means and, when a large item set identical to the condition part of a rule is saved in the saving means, calculates the certainty factor or chi-square value of the rule using the support level of that large item set as the support level of the condition part of the rule.
[0050]
The recording medium according to the present invention records a program for causing a computer to execute: a procedure for generating, from those large item sets saved in predetermined saving means that share all items other than the last item, candidate item sets having one more item than those large item sets, calculating the support level of the candidate item sets, generating large item sets with the same number of items as the candidate item sets based on the support level, and saving them in the predetermined saving means; and a procedure for, where k is the number of items in a large item set, dividing the large item set into a condition part of 1 to (k-1) items and a result part of (k-1) to 1 items, and taking as correlation rules those of the resulting rules whose certainty factor or chi-square value is larger than a predetermined reference value.
[0051]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the present invention will be described below.
Embodiment 1
FIG. 1 is a block diagram showing the configuration of a correlation rule extraction apparatus according to Embodiment 1 of the present invention. In the figure, reference numeral 1 denotes a database that holds data on a predetermined target as a plurality of records. Reference numeral 2 denotes large item set generation means that stores large item sets as a hash tree in the memory 6, operates on the hash tree to generate large item sets with (k + 1) items from large item sets with k items, and saves them in the large item set file (saving means) 3. Reference numeral 4 denotes hypothesis generation verification means (correlation rule generation means) that generates rules (that is, candidates for correlation rules) from each large item set and extracts correlation rules based on the certainty factor or chi-square value of each rule. Reference numeral 5 denotes a correlation rule set file that holds the generated correlation rules. Reference numeral 6 denotes a memory that stores large item sets, candidate item sets, and rule verification item sets as a hash tree. The items constituting the hash tree are arranged so that items closer to the root have lower numbers.
[0052]
Hereinafter, a large item set with k items is written L(k), a candidate item set with k items is written C(k), and a rule verification item set with k items is written LL(k).
[0053]
In the large item set generation means 2, reference numeral 21 denotes candidate item set verification means (support level calculation means) that refers to the records of the database 1, calculates the support level of each candidate item set C(k) generated by the candidate item set generation means 22, and takes the candidate item sets C(k) whose support level is higher than a predetermined reference value (minimum support) as large item sets L(k).
[0054]
Reference numeral 22 denotes candidate item set generation means (candidate item set generation means, rule verification item set generation means) that generates, from two large item sets L(k-1) that share all items other than the last item, a candidate item set C(k) composed of the common part (the first to (k-2)th items) and the two last items (the (k-1)th item of each large item set), and adds it to the hash tree in the memory 6.
[0055]
Reference numeral 23 denotes hash tree operation means (common item set addition means, storage capacity calculation means) that reads from the large item set file 3 the large item sets L(k) that share all items other than the last item, adds the item set with (k-1) items that is their common part to the hash tree, and checks the storage capacity occupied by the item sets (large item sets, candidate item sets, and so on) stored in the memory 6 as a hash tree.
[0056]
In the hypothesis generation verification means 4, reference numeral 31 denotes rule candidate generation means that generates, from a large item set L(k), all rules composed of a condition part of (k-1) items and a result part of one item, and reference numeral 32 denotes verification value calculation means that calculates the certainty factor or chi-square value of each rule generated by the rule candidate generation means 31.
[0057]
The chi-square value is used for the chi-square test. Here, the chi-square value is calculated by the following equation as indicating the degree of relation between the rule condition part and the result part.
[Expression 1]

Here, n is the total number of records in the database 1, A is the support level of the rule condition part, B is the support level of the rule result part, and C is the support level of the entire rule.
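The equation referred to as Expression 1 is not reproduced in this text. As a hedged illustration only, the sketch below uses the standard Pearson chi-square statistic for a 2×2 contingency table, rewritten in terms of n, A, B, and C under the assumption that the support levels are fractions between 0 and 1: χ² = n(C − AB)² / (AB(1 − A)(1 − B)). This may differ from the exact form of Expression 1, and the function name is hypothetical.

```python
def chi_square(n, a, b, c):
    """Standard 2x2 chi-square statistic for a rule.

    n: total number of records
    a: support level of the condition part (fraction, 0 < a < 1)
    b: support level of the result part (fraction, 0 < b < 1)
    c: support level of the whole rule (fraction)
    Assumed form; not confirmed to be the specification's Expression 1.
    """
    return n * (c - a * b) ** 2 / (a * b * (1 - a) * (1 - b))


print(chi_square(100, 0.5, 0.4, 0.2))   # condition and result independent
print(chi_square(100, 0.5, 0.5, 0.5))   # condition and result always co-occur
```

The value is zero when the condition part and the result part occur independently (C = AB) and grows as their co-occurrence departs from independence.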
[0058]
Next, the operation will be described.
FIG. 2 is a flowchart for explaining the operation of the correlation rule extraction apparatus according to Embodiment 1. FIG. 3 is a diagram showing an example of the contents of the large item set file 3. FIG. 4 is a diagram showing the hash tree composed of the large item sets [1, 2, 3], [1, 2, 4], [1, 3, 5], and [1, 4, 5] of FIG. 3. FIG. 5 is a diagram for explaining an example of the operation in step ST3 of FIG. 2, FIG. 6 is a diagram for explaining an example of the operation in step ST4 of FIG. 2, and FIG. 7 is a diagram for explaining an example of the operation in step ST5 of FIG. 2.
[0059]
First, in step ST1 of FIG. 2, the candidate item set verification means 21 of the large item set generation means 2 calculates the support level of each item appearing in the records of the database 1, and saves the items whose support level is higher than a predetermined reference value in the large item set file 3, together with their support levels, as large item sets L(1).
[0060]
Next, the initial value of the index k of the number of items is set to 2 (step ST2). The processing that generates the large item sets L(k) from the large item sets L(k-1) and the processing that generates correlation rules from the generated large item sets L(k) are described below.
[0061]
First, in step ST3, the hash tree operation means 23 reads from the large item set file 3, as one batch, the large item sets L (k−1) that share all items other than the last item, adds the item set of the common part to the hash tree, and stores the last item of each of these large item sets L (k−1) in the memory 6, separately from the hash tree, in association with the common part.
[0062]
For example, for the large item sets L (3) (that is, when k = 4), once the processing of steps ST3 to ST5 has been completed for [1, 2, 3], [1, 2, 4], [1, 3, 5], [1, 4, 5] of the large item set file 3 shown in FIG. 3, the hash tree is as shown in FIG. 4. Next, in step ST3, the large item sets [2, 4, 5], [2, 4, 6], [2, 4, 7] are read from the large item set file 3 as one batch, and the branch formed by the item set [2, 4], which is their common part, is added to the hash tree shown in FIG. 4. At this time, the remaining items “5”, “6”, and “7” of the large item sets read are stored in the memory 6, separately from the hash tree, as one array, for example as shown in FIG. 5. The correspondence between this array and the branch of the common part, for example a pointer, is also stored.
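The batch reading of step ST3 can be pictured with a short sketch. It is only an illustration under stated assumptions: the function name `group_by_common_part` is hypothetical, items inside each set are assumed to be in ascending order, and a plain Python list stands in for both the hash tree branch and the separately stored array.

```python
from itertools import groupby

def group_by_common_part(large_item_sets):
    """Group (k-1)-item sets by the items they share apart from the last item.

    Returns (common_part, last_items) pairs. last_items plays the role of the
    array that is kept in memory separately from the hash tree.
    """
    ordered = sorted(tuple(s) for s in large_item_sets)
    groups = []
    for common, members in groupby(ordered, key=lambda s: s[:-1]):
        groups.append((common, [s[-1] for s in members]))
    return groups

# Hypothetical input mixing two batches of large item sets L(3).
groups = group_by_common_part(
    [(2, 4, 5), (2, 4, 6), (2, 4, 7), (1, 2, 3), (1, 2, 4)]
)
print(groups)
```

Each pair corresponds to one batch: the first element is the branch added to the tree, the second is the array of last items associated with it.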
[0063]
Next, in step ST4, the candidate item set generation means 22 takes each item in the set of last items of the large item sets L (k−1) together with each item in that set having a larger number than it, and extends the branch of the above-mentioned common part with each such pair to generate the candidate item sets C (k).
[0064]
For example, when the state shown in FIG. 5 is obtained after reading the large item sets L (3), the item “5” in the array “5, 6, 7” is read out first. As shown in FIG. 6, the branches “5, 6” and “5, 7”, formed from the item “5” and the items “6” and “7” having larger numbers than that item, are added below the common part “2, 4”, and the candidate item sets [2, 4, 5, 6] and [2, 4, 5, 7] are generated. Since “2, 4, 5” is common to the candidate item sets [2, 4, 5, 6] and [2, 4, 5, 7], in the hash tree at this time “6” and “7” branch from “2, 4, 5”, as shown in FIG. 6.
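As a minimal sketch of the extension in step ST4, the following function pairs each last item with every larger last item and appends the pair to the common part. The name `extend_common_part` is hypothetical, and the sketch generates all pairs at once, whereas the apparatus described here handles one item at a time and checks the memory in between.

```python
def extend_common_part(common, last_items):
    """Build candidate k-item sets: common part + two distinct last items.

    The smaller of the two last items always comes first, so each candidate
    is produced exactly once.
    """
    items = sorted(last_items)
    candidates = []
    for i, first in enumerate(items):
        for second in items[i + 1:]:
            candidates.append(tuple(common) + (first, second))
    return candidates

candidates = extend_common_part((2, 4), [5, 6, 7])
print(candidates)
```

The first two results are the candidates obtained for the item “5” in the example; the third is the one obtained later for the item “6”.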
[0065]
When the generation of the candidate item sets C (k) is completed, in step ST5 the rule candidate generation means 31 of the hypothesis generation verification means 4 adds to the hash tree, as rule verification item sets LL (k−1), the item sets that are part of each generated candidate item set C (k) and have one fewer item. If a branch corresponding to a rule verification item set LL (k−1) already exists, nothing is added for that rule verification item set LL (k−1).
[0066]
Here, the rule verification item set LL (k−1) corresponds to the condition part of a candidate rule that is generated when the candidate item set C (k) becomes a large item set L (k). For example, as shown in FIG. 6, when there are two candidate item sets C (4), [2, 4, 5, 6] and [2, 4, 5, 7], the rule verification item sets LL (3) are [2, 4, 5], [2, 4, 6], [2, 5, 6], [4, 5, 6], [2, 4, 7], [2, 5, 7], and [4, 5, 7]. Note that [2, 4, 5] is obtained from both candidates. That is, when the candidate item set [2, 4, 5, 6] becomes a large item set, the rules “2, 4, 5 → 6”, “2, 4, 6 → 5”, “2, 5, 6 → 4”, and “4, 5, 6 → 2” are generated, and when the candidate item set [2, 4, 5, 7] becomes a large item set, the rules “2, 4, 5 → 7”, “2, 4, 7 → 5”, “2, 5, 7 → 4”, and “4, 5, 7 → 2” are generated, so the rule verification item sets that form their condition parts are generated in advance.
[0067]
The branches “2, 4, 6”, “2, 5, 6”, “4, 5, 6”, “2, 4, 7”, “2, 5, 7”, and “4, 5, 7”, which belong to the rule verification item sets LL (3) but do not yet exist in the hash tree, are added to the hash tree shown in FIG. 6, as shown in FIG. 7.
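The rule verification item sets of step ST5 are simply the (k−1)-item subsets of each candidate. The sketch below, with the hypothetical name `rule_verification_item_sets`, uses `itertools.combinations` and skips subsets that were already produced, mirroring the rule that an existing branch is not added again.

```python
from itertools import combinations

def rule_verification_item_sets(candidates):
    """Collect every (k-1)-item subset of each k-item candidate, once each."""
    seen = []
    for cand in candidates:
        for subset in combinations(cand, len(cand) - 1):
            if subset not in seen:  # an existing branch is not added again
                seen.append(subset)
    return seen

ll3 = rule_verification_item_sets([(2, 4, 5, 6), (2, 4, 5, 7)])
print(ll3)
```

For the two candidates of the example this yields the seven sets listed above, with [2, 4, 5] appearing only once.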
[0068]
When the processes in steps ST3 to ST5 are completed, in step ST6 the hash tree operation means 23 checks the storage capacity occupied by the hash tree in the memory 6 and determines whether it is larger than a predetermined reference value. If the storage capacity occupied by the hash tree is less than or equal to the predetermined reference value, the hash tree operation means 23 inquires of the candidate item set generation means 22 whether generation of the candidate item sets C (k) has been completed for the last items of all the large item sets L (k−1) read in step ST3.
[0069]
If generation of the candidate item sets C (k) has not been completed for the last items of all the large item sets L (k−1), the process returns to step ST4, and the processing for the last item of the next large item set L (k−1) is executed in the same manner. For example, if the storage capacity occupied by the hash tree shown in FIG. 7 is less than or equal to the predetermined reference value, then in step ST4 the generation of the candidate item sets C (k) for the item “6” in the array “5, 6, 7” and the generation of the rule verification item sets LL (k−1) are executed.
[0070]
On the other hand, when generation of the candidate item sets C (k) has been completed for the last items of all the large item sets L (k−1) read, the hash tree operation means 23 determines in step ST8 whether all the large item sets L (k−1) have been read and processed. If it is determined that not all the large item sets L (k−1) have been processed, the process returns to step ST3, and the next group of large item sets L (k−1) is read and processed in the same manner. If it is determined that all the large item sets L (k−1) have been processed, the process proceeds to step ST9, and the large item sets L (k) are generated.
[0071]
If the storage capacity occupied by the hash tree is larger than the predetermined reference value in step ST6, the process proceeds to step ST9. There, the candidate item set verification means 21 calculates, with reference to the records in the database 1, the support level of each candidate item set C (k) added to the hash tree in the memory 6 up to that point, takes each candidate item set C (k) whose support level is higher than a predetermined reference value as a large item set L (k), and stores the large item sets L (k) in the large item set file 3 together with their support levels. In the large item set file 3, large item sets are stored separately for each number of items. That is, the generated large item sets L (k) are stored separately from the large item sets L (k−1).
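The selection in step ST9 can be sketched as follows. This is an illustrative outline only: `support` and `select_large_item_sets` are hypothetical names, records are modeled as tuples of item numbers, and a linear scan replaces the hash tree matching used by the apparatus.

```python
def support(item_set, records):
    """Fraction of records that contain every item of item_set."""
    wanted = set(item_set)
    return sum(1 for r in records if wanted <= set(r)) / len(records)

def select_large_item_sets(candidates, records, min_support):
    """Keep candidates whose support is strictly higher than min_support."""
    result = {}
    for cand in candidates:
        s = support(cand, records)
        if s > min_support:
            result[cand] = s  # stored together with its support level
    return result

# Hypothetical records and reference value.
records = [(2, 4, 5, 6), (2, 4, 5, 6), (2, 4, 5, 7), (1, 2, 4), (2, 4, 5, 6)]
large = select_large_item_sets([(2, 4, 5, 6), (2, 4, 5, 7)], records, min_support=0.5)
print(large)
```

With these made-up records, [2, 4, 5, 6] appears in three of five records and is kept, while [2, 4, 5, 7] appears in only one and is dropped.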
[0072]
When the large item sets L (k) have been generated, in step ST10 the hash tree operation means 23 searches the large item set file 3 for a large item set L (k−1) identical to each rule verification item set LL (k−1) added to the hash tree in the memory 6 up to that point. If the corresponding large item set L (k−1) is found, its support level is read and stored in the memory 6, in association with the rule verification item set LL (k−1), as the support level of that rule verification item set LL (k−1). If the corresponding large item set L (k−1) is not found, nothing is done. This is because, in that case, the support level of any item set containing the rule verification item set LL (k−1) is equal to or lower than the predetermined reference value, such an item set does not become a large item set L (k), and the support level of the rule verification item set LL (k−1) is therefore not needed.
[0073]
Next, in step ST11, the rule candidate generation means 31 first generates, from each large item set L (k), the rules in which (k−1) of its items form the condition part and the remaining one item forms the consequent part. Then, for each rule, the verification value calculation means 32 calculates the certainty factor from the support level of the condition part (that is, the support level of the rule verification item set LL (k−1)) and the support level of the large item set L (k), or calculates the chi-square value from the support level of the condition part, the support level of the consequent part, the support level of the large item set L (k), and the number of records in the database. A rule whose certainty factor or chi-square value is higher than a predetermined reference value is stored in the correlation rule set file 5 as a correlation rule.
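The two verification values can be sketched as below. The certainty factor follows the definition given for the Apriori method (support of the whole rule divided by support of the condition part). The chi-square function is an assumption: Expression 1 is not reproduced in this text, so the sketch uses the textbook chi-square statistic for a 2×2 contingency table, rewritten in terms of n, A, B, and C, and may differ from the expression actually used. The function names are hypothetical.

```python
def certainty_factor(support_condition, support_rule):
    """Share of records with the condition part that also hold the whole rule."""
    return support_rule / support_condition

def chi_square(n, a, b, c):
    """Textbook 2x2 chi-square statistic expressed with supports.

    n: number of records, a: support of the condition part,
    b: support of the consequent part, c: support of the whole rule.
    ASSUMPTION: standard form, not necessarily the patent's Expression 1.
    Undefined when a or b equals 0 or 1 (division by zero).
    """
    return n * (c - a * b) ** 2 / (a * b * (1 - a) * (1 - b))

print(certainty_factor(0.5, 0.4))
print(chi_square(100, 0.5, 0.5, 0.25))  # condition and consequent independent
print(chi_square(100, 0.5, 0.5, 0.5))   # condition and consequent always co-occur
```

Under this assumed form the value is zero when C = A × B, that is, when the condition part and the consequent part occur independently, and grows as C departs from A × B.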
[0074]
After generating and storing the association rule in this way, in step ST12, the hypothesis generation verification unit 4 deletes the hash tree in the memory 6 and releases the storage area occupied by the hash tree.
[0075]
In step ST13, the hash tree operation means 23 determines whether all the large item sets L (k−1) in the large item set file 3 have been read and all the candidate item sets C (k) have been generated from the large item sets L (k−1) read. If not all the candidate item sets C (k) have been generated from the large item sets L (k−1) read, the process returns to step ST4 via step ST7, and the remaining candidate item sets C (k) are generated. If all the candidate item sets C (k) have been generated from the large item sets L (k−1) read but there are large item sets L (k−1) that have not yet been read, the process returns to step ST3 via steps ST7 and ST8, and the remaining large item sets L (k−1) are read.
[0076]
On the other hand, when it is determined that all the large item sets L (k−1) in the large item set file 3 have been read and all the candidate item sets C (k) have been generated from them, the process proceeds to step ST14, where the hash tree operation means 23 determines whether at least one large item set L (k) has been generated. If at least one large item set L (k) has been generated, the value of the item number index k is increased by 1 (step ST15), the process returns to step ST3, and the large item sets with one more item are generated. If no large item set L (k) has been generated, the process is terminated.
[0077]
As described above, according to the first embodiment, the large item sets L (k−1) that share all items other than the last item are read as a batch, and the item set that is their common part is added to the hash tree. Two items are then selected in turn from the set of last items of these large item sets L (k−1) and added to the common part to generate the candidate item sets C (k), and the rule verification item sets LL (k−1) are generated and added to the hash tree. The storage capacity for the hash tree is then calculated, and when it becomes larger than a predetermined reference value, the support levels of the candidate item sets C (k) and of the rule verification item sets LL (k−1) are calculated, the candidate item sets C (k) whose support level is higher than a predetermined reference value are stored as large item sets L (k), association rules are generated from the large item sets L (k) on the basis of the certainty factor or chi-square value calculated from the support levels of the rule verification item sets LL (k−1) and of the large item sets L (k), and the hash tree is then deleted. Since the storage capacity used by the hash tree is thus checked each time a batch of large item sets is processed, the effect is obtained that the memory capacity allocated in advance can be used effectively.
[0078]
Embodiment 2.
The correlation rule extracting apparatus according to the second embodiment of the present invention is obtained by changing the processing for calculating the support level of the rule verification item set from that of the first embodiment. Here, only the calculation of the support level of the rule verification item set in the second embodiment will be described, and the other processes are the same as those in the first embodiment, and the description thereof will be omitted.
[0079]
In the correlation rule extracting apparatus according to the second embodiment, the support level of each rule verification item set is calculated by the candidate item set verification means 21 with reference to the records in the database 1. At that time, the candidate item set verification means 21 sequentially reads out records from the database 1 and, in parallel with matching each record against the candidate item sets, also matches each record against the rule verification item sets.
[0080]
Then, the candidate item set verification means 21 counts the number of appearances of each candidate item set and the number of appearances of each rule verification item set in the record of the database 1 to calculate the support level.
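A minimal sketch of this single pass is shown below, assuming records are tuples of item numbers and using a flat dictionary of counters in place of the hash tree. The name `supports_in_one_pass` is hypothetical.

```python
def supports_in_one_pass(records, candidates, verification_sets):
    """Count candidates and rule verification item sets while reading each record once."""
    counts = {s: 0 for s in list(candidates) + list(verification_sets)}
    total = 0
    for record in records:  # each record is read from the database only once
        total += 1
        present = set(record)
        for item_set in counts:
            if present.issuperset(item_set):
                counts[item_set] += 1
    return {s: c / total for s, c in counts.items()}

# Hypothetical records.
records = [(2, 4, 5, 6), (2, 4, 5, 7), (2, 4, 6), (1, 3)]
result = supports_in_one_pass(records, [(2, 4, 5, 6)], [(2, 4, 5), (2, 4, 6)])
print(result)
```

Because both kinds of item set are counted in the same scan, no separate lookup in the large item set file is needed for the rule verification item sets.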
[0081]
As described above, according to the second embodiment, matching against the candidate item sets and matching against the rule verification item sets are executed in parallel for each record read from the database 1. Therefore, when there are many large item sets, the effect is obtained that the support level of each rule verification item set can be calculated in a shorter time than by searching the large item sets stored in the large item set file 3.
[0082]
Embodiment 3.
The correlation rule extracting device according to the third embodiment of the present invention is obtained by changing the processing for calculating the support level of the candidate item set from that according to the first embodiment. Here, only the calculation of the support level of the candidate item set in the third embodiment will be described, and the other processes are the same as those in the first embodiment, and the description thereof will be omitted.
[0083]
In the third embodiment, when calculating the support level of the candidate item sets, the candidate item set verification means 21 temporarily separates from the hash tree the subtrees of the rule verification item sets not related to this calculation, and returns the subtrees to their original positions in the hash tree after the calculation is completed. At this time, the candidate item set verification means 21 stores the correspondence between the connection node adjacent to each separation position of the hash tree and the node that is the root of the separated subtree, and after the calculation of the support level, returns each subtree to its original position in the hash tree on the basis of the stored correspondence.
[0084]
FIG. 8 is a diagram illustrating an example of the positions at which subtrees are separated from a hash tree. FIG. 9 is a diagram illustrating an example of the correspondence between a hash tree and the separated subtrees. For example, in the hash tree shown in FIG. 8, the candidate item sets are [2, 4, 5, 6] and [2, 4, 5, 7], and all the other branches are branches of rule verification item sets. Therefore, the nodes marked with a circle are separated from the hash tree. At this time, the number of separation positions is kept as small as possible. For example, the nodes are searched in order starting from the root, and when the item of a node does not appear at the corresponding position of any candidate item set, the link to that node is set as a separation position.
[0085]
That is, in the hash tree of FIG. 8, since the item “4” does not appear as the first item of any candidate item set, the link from the root to the node of the item “4” is set as a separation position. Similarly, since the item “5” does not appear as the second item of any candidate item set, the link from the node of the item “2” to the node of the item “5” is set as a separation position, and since the items “6” and “7” do not appear as the third item of any candidate item set, the links from the node of the item “4” to the nodes of the items “6” and “7” are set as separation positions.
[0086]
Then, as shown in FIG. 9A, let a and d be the pointers to the root and to the node of the item “4” adjacent to a separation position, b and c the pointers to the node of the item “2” and the node of the item “5”, and e, f, and g the pointers to the node of the item “4” and the nodes of the items “6” and “7”. As shown in FIG. 9B, the pointers to the two nodes adjacent to each separation position are stored in association with each other. After the correspondence between the connection node adjacent to each separation position of the hash tree and the node that is the root of the subtree has been stored in this way, the link information from the root and from the nodes of the items “2”, “4”, and “5” to their lower nodes is rewritten to separate the subtrees. After the calculation of the support level, the link information from the root and from the nodes of the items “2”, “4”, and “5” to their lower nodes is restored on the basis of the correspondence (FIG. 9B), and the subtrees are returned to their original positions in the hash tree.
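The separation and restoration can be illustrated with a small sketch. It rests on several assumptions: the hash tree is modeled as nested dictionaries (item to child node), a Python list of (parent, item, subtree) triples stands in for the pointer correspondence of FIG. 9B, and the names `build_tree`, `detach`, and `restore` are hypothetical.

```python
def build_tree(item_sets):
    """Build a prefix tree as nested dicts: each node maps an item to a child node."""
    root = {}
    for s in item_sets:
        node = root
        for item in s:
            node = node.setdefault(item, {})
    return root

def detach(root, candidates):
    """Cut the link to any node whose item never occurs at that depth in a candidate."""
    saved = []  # correspondence: (connection node, item, root of the cut subtree)

    def walk(node, depth):
        allowed = {c[depth] for c in candidates if depth < len(c)}
        for item in list(node):
            if item in allowed:
                walk(node[item], depth + 1)
            else:
                saved.append((node, item, node.pop(item)))

    walk(root, 0)
    return saved

def restore(saved):
    """Reattach every cut subtree at its original connection node."""
    for parent, item, subtree in saved:
        parent[item] = subtree

candidates = [(2, 4, 5, 6), (2, 4, 5, 7)]
verification = [(2, 4, 6), (2, 5, 6), (4, 5, 6), (2, 4, 7), (2, 5, 7), (4, 5, 7)]
tree = build_tree(candidates + verification)
saved = detach(tree, candidates)
cut_count = len(saved)                       # number of separation positions
pruned = tree == build_tree(candidates)      # only candidate branches remain
restore(saved)
```

For the example tree, the sketch cuts four links, matching the four separation positions described above, and the tree compares equal to the original after `restore`.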
[0087]
As described above, according to the third embodiment, the subtrees of the rule verification item sets not related to the calculation of the support level of the candidate item sets are separated from the hash tree. Therefore, matching is not performed for the rule verification item sets that are not related to this calculation, and the effect is obtained that the matching of the candidate item sets can be performed efficiently.
[0088]
Embodiment 4.
The correlation rule extracting device according to the fourth embodiment of the present invention stores data indicating the correspondence between the connection node of the hash tree and the root of the subtree of the rule verification item set according to the data structure of a stack, and returns the separated subtree to its original position in the hash tree on the basis of that data. Since other processes are the same as those in the third embodiment, description thereof is omitted.
[0089]
In the fourth embodiment, the candidate item set verification means 21 sequentially stores the correspondence between the connection node adjacent to each separation position of the hash tree and the node that is the root of the subtree, according to the data structure of a stack, in a stack area having a fixed storage capacity. After the calculation of the support level is completed, it sequentially reads the correspondence from the stack and returns each subtree to its original position in the hash tree.
[0090]
FIG. 10 is a diagram illustrating an example of a stack that stores the correspondence between the hash tree and the separated subtrees. For example, when the subtrees of the rule verification item sets are separated from the hash tree of FIG. 8, the pointers to the two nodes adjacent to each separation position are associated with each other and stored in the stack as shown in FIG. 10. After the calculation of the support level is finished, the correspondence is read from the stack in the reverse order of storage, and each subtree is returned to its original position.
[0091]
As described above, according to the fourth embodiment, in addition to the effects of the third embodiment, the data indicating the correspondence between the connection node of the hash tree and the root of the subtree of the rule verification item set is stored according to the data structure of a stack, so that the effect is obtained that memory management can be simplified.
[0092]
Embodiment 5.
The correlation rule extracting device according to the fifth embodiment of the present invention stores data indicating the correspondence between the connection node of the hash tree and the root of the subtree of the rule verification item set according to the data structure of a list, and returns the separated subtree to its original position in the hash tree on the basis of that data. Since other processes are the same as those in the third embodiment, description thereof is omitted.
[0093]
In the fifth embodiment, the candidate item set verification unit 21 sequentially stores the correspondence between the connection node adjacent to the separation position of the hash tree and the node that is the root of the subtree according to the data structure of the list. After the calculation of the support degree, the correspondence is sequentially read out from the list, and the subtree is returned to the original position of the hash tree.
[0094]
FIG. 11 is a diagram illustrating an example of a list that stores the correspondence between the hash tree and the separated subtrees. For example, when the subtrees of the rule verification item sets are separated from the hash tree of FIG. 8, records that associate the pointers to the two nodes adjacent to each separation position, including the root of each separated subtree, are stored as a linked list as shown in FIG. 11 (b). After the calculation of the support level is completed, each record is sequentially read from the list, and each subtree is returned to its original position.
[0095]
As described above, according to the fifth embodiment, in addition to the effects of the third embodiment, the data indicating the correspondence between the connection node of the hash tree and the root of the subtree of the rule verification item set is stored according to the data structure of a list. Therefore, there may be any number of separation positions, and it is not necessary to secure an unneeded storage area in advance, so that the effect is obtained that the memory can be used efficiently.
[0096]
Embodiment 6.
When the storage capacity for the hash tree becomes larger than a predetermined reference value, the correlation rule extraction device according to Embodiment 6 of the present invention deletes from the hash tree the common part of the large item sets, the candidate item sets, and the rule verification item sets that were added immediately before. Since other processes are the same as those in the first embodiment, description thereof is omitted.
[0097]
In the sixth embodiment, the hash tree operation means 23 calculates the storage capacity for the hash tree after adding the common part of the large item sets read, the candidate item sets, and the rule verification item sets. If the storage capacity for the hash tree is larger than a predetermined reference value, the subtrees corresponding to the common part of these large item sets, the candidate item sets, and the rule verification item sets are deleted from the hash tree.
[0098]
Thereafter, the candidate item set verification means 21 calculates the support level of the remaining candidate item sets and rule verification item sets. The large item set deleted at this time is read and processed again after the hash tree is deleted.
[0099]
As described above, according to the sixth embodiment, when the storage capacity for the hash tree becomes larger than the predetermined reference value, the common part of the large item sets, the candidate item sets, and the rule verification item sets added immediately before are deleted from the hash tree, and the support levels are calculated after the storage capacity for the hash tree has been returned to a state in which it is less than or equal to the predetermined reference value. Therefore, the effect is obtained that the storage capacity used can be limited more accurately.
[0100]
Embodiment 7.
In the correlation rule extracting device according to the seventh embodiment of the present invention, when the maximum value of the increase in the storage capacity for the hash tree caused by adding the common part of large item sets, the candidate item sets, and the rule verification item sets becomes larger than the remaining capacity of a predetermined storage area, the large item sets and the correlation rules are generated, and the hash tree is then deleted. Since other processes are the same as those in the first embodiment, description thereof is omitted.
[0101]
In the seventh embodiment, after adding the common part of the large item sets, the candidate item sets, and the rule verification item sets LL (k−1) to the hash tree, the hash tree operation means 23 calculates the storage capacity for the hash tree and takes the difference between this result and the previous result as the increase in the storage capacity for the hash tree. Next, the hash tree operation means 23 compares this increase with the maximum increase up to the previous time, and when this increase is larger than the maximum value, updates the maximum value to this increase. The hash tree operation means 23 then calculates the remaining capacity, which is the difference between the predetermined reference value and the storage capacity for the hash tree calculated this time. If the above maximum value is larger than the remaining capacity, the large item sets and the correlation rules are generated, and the hash tree is then deleted. If not, the next large item sets are read and processed.
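The decision rule of the seventh embodiment can be sketched as a small monitor object. The class name `MaxIncreaseMonitor` and the use of abstract size units are assumptions made for illustration; how the state is reset after the hash tree is deleted is not specified here and is left out.

```python
class MaxIncreaseMonitor:
    """Tracks the hash tree size after each addition and signals when to flush."""

    def __init__(self, limit):
        self.limit = limit      # predetermined reference value
        self.previous = 0       # size measured after the previous addition
        self.max_increase = 0   # largest increase seen so far

    def should_flush(self, current_size):
        increase = current_size - self.previous
        self.previous = current_size
        self.max_increase = max(self.max_increase, increase)
        remaining = self.limit - current_size
        # Flush when another addition as large as the worst one might not fit.
        return self.max_increase > remaining

monitor = MaxIncreaseMonitor(limit=100)
decisions = [monitor.should_flush(size) for size in (30, 50, 80)]
print(decisions)
```

In this made-up run the third addition leaves 20 units free while the largest increase so far is 30, so the monitor asks for the large item sets and rules to be generated and the tree deleted.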
[0102]
As described above, according to the seventh embodiment, when the maximum value of the increase in the storage capacity for the hash tree caused by adding the common part of large item sets, the candidate item sets, and the rule verification item sets exceeds the remaining capacity of the predetermined storage area, the large item sets and the correlation rules are generated, and the hash tree is then deleted. Therefore, the effect is obtained that the memory is used effectively up to just before the storage capacity for the hash tree exceeds the predetermined reference value.
[0103]
Embodiment 8.
In the correlation rule extraction device according to the eighth embodiment of the present invention, when the average value of the increase in the storage capacity for the hash tree caused by adding the common part of large item sets, the candidate item sets, and the rule verification item sets becomes larger than the remaining capacity of a predetermined storage area, the large item sets and the correlation rules are generated, and the hash tree is then deleted. Since other processes are the same as those in the first embodiment, description thereof is omitted.
[0104]
In the eighth embodiment, the hash tree operation means 23 adds the common part of the large item set, the candidate item set, and the rule verification item set LL (k−1) to the hash tree, The storage capacity is calculated, and the difference between the calculation result and the previous calculation result is calculated to calculate the increase in the storage capacity for the hash tree. Next, the hash tree operation means 23 calculates the average value of the increase amount up to this time based on the increase amount, the average value of the increase amount up to the previous time, and the number of additions. The hash tree operation means 23 calculates the remaining capacity of the storage capacity, which is the difference between the predetermined reference value and the storage capacity for the hash tree calculated this time, and if the above average value is larger than the remaining capacity, Then, a large item set and a correlation rule are generated, and then a process of deleting the hash tree is executed. If not, the next large item set is read and processed.
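The running average used here can be updated from the previous average, the new increase, and the number of additions, without keeping the history. The sketch below shows one way to do this. The class name `AverageIncreaseMonitor` and the size units are hypothetical.

```python
class AverageIncreaseMonitor:
    """Signals a flush when the average increase exceeds the remaining capacity."""

    def __init__(self, limit):
        self.limit = limit    # predetermined reference value
        self.previous = 0     # size measured after the previous addition
        self.count = 0        # number of additions so far
        self.average = 0.0    # average increase so far

    def should_flush(self, current_size):
        increase = current_size - self.previous
        self.previous = current_size
        self.count += 1
        # New average from the old average, the new increase, and the count.
        self.average += (increase - self.average) / self.count
        return self.average > self.limit - current_size

monitor = AverageIncreaseMonitor(limit=100)
decisions = [monitor.should_flush(size) for size in (10, 60, 70, 85)]
print(decisions)
```

Compared with tracking the maximum increase, the average reacts less strongly to a single large addition (the jump from 10 to 60 in this made-up run), so the tree is allowed to grow closer to the limit before it is flushed.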
[0105]
As described above, according to the eighth embodiment, when the average value of the increase in the storage capacity for the hash tree caused by adding the common part of large item sets, the candidate item sets, and the rule verification item sets becomes larger than the remaining capacity of the predetermined storage area, the large item sets and the correlation rules are generated, and the hash tree is then deleted. Therefore, the effect is obtained that the memory is used effectively up to just before the storage capacity for the hash tree exceeds the predetermined reference value.
[0106]
Embodiment 9.
FIG. 12 is a block diagram showing a configuration of an association rule extraction apparatus according to Embodiment 9 of the present invention. In the figure, 1A is a database that stores a plurality of records in a table format and also stores receipt format data. Here, the receipt format is a format in which the attribute values of each record of the database 1A are represented as a sequence of items. Each item in this case is, for example, a two-dimensional index (attribute number | attribute value). For example, if the attribute “sex” has the values “male” and “female”, and the attribute “height” has the values “large”, “medium”, and “small”, the values of these attributes are represented by the two-dimensional index items (1 | 1), (1 | 2) and (2 | 1), (2 | 2), (2 | 3), respectively.
[0107]
51 is a correlation rule extraction unit that reads the records in the receipt format from the database 1A, expands the large item sets for each number of items as a hash tree on the basis of the items included in the records, and generates correlation rules from the large item sets on the basis of the certainty factor or the chi-square value. 52 is receipt file generation means (data conversion means) that reads the table-format records of the database 1A, converts them into the receipt format, and saves the converted records in the database 1A.
[0108]
In the correlation rule extraction unit 51, 2A includes a candidate item set verification unit 21A, a candidate item set generation unit 22A, and a hash tree operation unit 23A that execute processing corresponding to the items in the two-dimensional index format stored in the database 1A. It is a large item set generation means.
[0109]
Since the other components in the correlation rule extraction unit 51 are the same as those in the first embodiment (FIG. 1), description thereof is omitted.
[0110]
Next, the operation will be described.
FIG. 13 is a diagram illustrating an example of a table format, FIG. 14 is a diagram illustrating an example of records stored in the database 1A in the receipt format, and FIG. 15 is a diagram illustrating an example of a hash tree composed of receipt format items.
[0111]
First, the receipt file generation means 52 reads the records stored in a table format, for example as shown in FIG. 13. For each attribute value of a record read, it generates data in the two-dimensional index format (attribute number | attribute value), composed of the attribute number assigned to each attribute of the table and the number (attribute value) assigned to the value of that attribute in the record, and stores the data in the database 1A.
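The conversion into the receipt format might look like the following sketch. The function name `to_receipt_format`, the dictionary-based table rows, and the value numbering tables are illustrative assumptions. The numbering itself follows the “sex” and “height” example given for FIG. 12.

```python
def to_receipt_format(rows, attributes, value_numbers):
    """Convert table rows into lists of (attribute number, attribute value number)."""
    receipts = []
    for row in rows:
        receipt = []
        for number, name in enumerate(attributes, start=1):
            receipt.append((number, value_numbers[name][row[name]]))
        receipts.append(receipt)
    return receipts

attributes = ["sex", "height"]  # attribute numbers 1 and 2
value_numbers = {
    "sex": {"male": 1, "female": 2},
    "height": {"large": 1, "medium": 2, "small": 3},
}
# Hypothetical table rows.
rows = [
    {"sex": "female", "height": "small"},
    {"sex": "male", "height": "medium"},
]
receipts = to_receipt_format(rows, attributes, value_numbers)
print(receipts)
```

Each tuple corresponds to one two-dimensional index item, so the first row becomes the items (1 | 2) and (2 | 3).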
[0112]
Next, the large item set generation means 2A of the correlation rule extraction unit 51 reads the receipt format data from the database 1A as appropriate, expands the hash tree in the memory 6, and sequentially generates the large item sets for each number of items as in the first embodiment.
[0113]
However, when extending the branches of the hash tree, the hash tree operation means 23A and the candidate item set generation means 22A follow the rules below, and items that do not comply with these rules are not added to the hash tree.
Rule 1. It is assumed that an item having a larger attribute number is “large”. For example, (4 | 1)> (2 | 3).
Rule 2. When the attribute number is the same, the larger the attribute value, the larger the item is. For example, (3 | 5)> (3 | 2).
Rule 3. The attribute number of the item of a certain node must be larger than the attribute number of the item of the upper node. For example, in the hash tree of FIG. 15, the node of the item “(2 | 3)” is not added below the branches “(1 | 1), (2 | 1)”.
[0114]
By doing so, the items that can be added below an item that is a node of the hash tree are limited to items that do not have the same attribute number, so an item set consisting of attribute values of the same attribute is not generated.
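Rules 1 to 3 can be expressed compactly if each item is modeled as an (attribute number, attribute value) tuple, since Python compares tuples element by element from the left. This is a sketch under that modeling assumption, and the function names `is_larger` and `may_extend` are hypothetical.

```python
def is_larger(item, other):
    """Rules 1 and 2: compare the attribute number first, then the attribute value."""
    return item > other  # tuples compare lexicographically

def may_extend(branch, item):
    """Rule 3: the new item's attribute number must exceed that of the node above."""
    return not branch or item[0] > branch[-1][0]

print(is_larger((4, 1), (2, 3)))             # Rule 1 example
print(is_larger((3, 5), (3, 2)))             # Rule 2 example
print(may_extend([(1, 1), (2, 1)], (2, 3)))  # Rule 3 example: same attribute, rejected
print(may_extend([(1, 1), (2, 1)], (3, 1)))  # different, larger attribute: accepted
```

Rejecting an item whose attribute number equals that of the node above is what keeps two values of the same attribute out of one item set.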
[0115]
Then, the hypothesis generation verification unit 4 of the correlation rule extraction unit 51 generates a correlation rule in the same manner as in the first embodiment.
[0116]
As described above, according to the ninth embodiment, each item is stored in the database as a two-dimensional index composed of an attribute number and an attribute value, and large item sets are generated while limiting the items that can be added below an item that is a node of the hash tree to items that do not have the same attribute number. Therefore, the effect is obtained that the generation of correlation rules between attribute values of the same attribute can be suppressed, and the generation of negative correlation rules can be suppressed.
[0117]
In the ninth embodiment, the receipt file generation means 52 converts the format of the data recorded in the database 1A into the receipt format. However, the data may instead be recorded in the database 1A in the receipt format from the beginning. In that case, the receipt file generation means 52 is not needed.
[0118]
In the ninth embodiment, the correlation rule extraction unit 51 is configured in substantially the same manner as in the first embodiment. However, the correlation rule extraction unit 51 is not limited to this configuration; any other configuration may be used as long as it generates large item sets while extending a hash tree.
[0119]
Embodiment 10.
The correlation rule extracting apparatus according to the tenth embodiment of the present invention is the apparatus of the ninth embodiment modified so that a plurality of attributes between which no correlation rule is to be generated can be designated. The other processes are the same as those in the ninth embodiment, and their description is omitted.
[0120]
In the tenth embodiment, a plurality of attributes between which no correlation rule is to be generated are defined as an attribute group, and the attribute values of the attribute group are converted into items of the form (attribute group number | attribute value). For example, when the attribute “Question 1” and the attribute “Question 2” form one attribute group, the values of the attribute “Question 1” are “a” and “b”, and the values of the attribute “Question 2” are “a”, “b”, and “c”, the values of these attributes are converted into the items (1 | 1), (1 | 2) and (1 | 3), (1 | 4), (1 | 5), respectively.
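The conversion into (attribute group number | attribute value) items might be sketched as follows. The function name `build_group_items` and the assumption that the values are numbered consecutively across the attributes of a group, in the order in which the attributes are listed, are illustrative only and are not stated in the embodiment.

```python
from typing import Dict, List, Tuple


def build_group_items(
    group_no: int, attributes: Dict[str, List[str]]
) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Map every (attribute name, raw value) pair of one attribute group
    to an item (group number, value index).

    Assumption: value indices run consecutively from 1 over all
    attributes of the group, in listing order.
    """
    mapping: Dict[Tuple[str, str], Tuple[int, int]] = {}
    next_index = 1
    for attr_name, values in attributes.items():  # dicts keep insertion order
        for value in values:
            mapping[(attr_name, value)] = (group_no, next_index)
            next_index += 1
    return mapping
```

Because every item of the group carries the same first component, the branch extension rule of the ninth embodiment then prevents any two of them from appearing in the same item set.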
[0121]
Then, when executing branch extension of the hash tree, the following rules are followed as in the ninth embodiment, and items that do not follow the following rules are not added to the hash tree.
Rule 1. An item with a larger attribute group number is regarded as the “larger” item. For example, (4 | 1) > (2 | 3).
Rule 2. When the attribute group numbers are the same, the item with the larger attribute value is regarded as the larger item. For example, (3 | 5) > (3 | 2).
Rule 3. The attribute group number of the item of a certain node must be larger than the attribute group number of the item of the upper node. For example, in the hash tree of FIG. 15, the node of the item “(2 | 3)” is not added below the branches “(1 | 1), (2 | 1)”.
[0122]
As described above, according to the tenth embodiment, in addition to the effects of the ninth embodiment, the same attribute group number is given to items corresponding to attributes between which no correlation rule is to be generated. The generation of such correlation rules is therefore suppressed, and post-processing for removing them can be omitted.
[0123]
Embodiment 11.
In the correlation rule extracting device according to the eleventh embodiment of the present invention, a large item set with k items is divided into a condition part with 1 to (k−1) items and a consequent part with (k−1) to 1 items, and among the rules generated in this way, those whose certainty factor or chi-square value is larger than a predetermined reference value are taken as correlation rules.
[0124]
In the eleventh embodiment, for every integer Nc satisfying (k−1) ≧ Nc ≧ 1, the rules whose condition part has Nc items and whose consequent part has (k−Nc) items are generated from the large item set L(k).
[0125]
For example, from the large item set “2, 4, 5, 6”, the rules “2, 4, 5 → 6”, “2, 4, 6 → 5”, “2, 5, 6 → 4”, and “4, 5, 6 → 2”, the rules “2, 4 → 5, 6”, “2, 5 → 4, 6”, “2, 6 → 4, 5”, “4, 5 → 2, 6”, “4, 6 → 2, 5”, and “5, 6 → 2, 4”, and the rules “2 → 4, 5, 6”, “4 → 2, 5, 6”, “5 → 2, 4, 6”, and “6 → 2, 4, 5” are generated. Similarly, from the large item set “2, 4, 5, 7”, the rules “2, 4, 5 → 7”, “2, 4, 7 → 5”, “2, 5, 7 → 4”, and “4, 5, 7 → 2”, the rules “2, 4 → 5, 7”, “2, 5 → 4, 7”, “2, 7 → 4, 5”, “4, 5 → 2, 7”, “4, 7 → 2, 5”, and “5, 7 → 2, 4”, and the rules “2 → 4, 5, 7”, “4 → 2, 5, 7”, “5 → 2, 4, 7”, and “7 → 2, 4, 5” are generated.
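The enumeration in this example can be reproduced with a short sketch. The function name `split_rules` and the representation of a rule as a pair of tuples are assumptions for illustration; the embodiment itself does not prescribe how the rules are enumerated.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

Rule = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (condition part, consequent part)


def split_rules(large_item_set: Iterable[int]) -> List[Rule]:
    """Generate every rule whose condition part has Nc items and whose
    consequent part has (k - Nc) items, for (k - 1) >= Nc >= 1."""
    items = sorted(large_item_set)
    k = len(items)
    rules: List[Rule] = []
    for nc in range(k - 1, 0, -1):
        for condition in combinations(items, nc):
            consequent = tuple(i for i in items if i not in condition)
            rules.append((condition, consequent))
    return rules
```

For the large item set “2, 4, 5, 6” this yields 4 + 6 + 4 = 14 rules, the same rules as listed above.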
[0126]
FIG. 16 is a diagram showing an example of a hash tree to which the rule verification item sets LL(1) to LL(k−1), generated when the number of items in the condition part of a rule is 1 to (k−1), have been added. For example, when large item sets are generated as in the first embodiment, the rule verification item sets LL(1) to LL(k−1) corresponding to the condition parts of these rules are added to the hash tree as shown in FIG. 16.
[0127]
Then, for each generated rule, a certainty factor or chi-square value is calculated, and a rule having a certainty factor or chi-square value higher than a predetermined reference value is stored as an association rule.
[0128]
When the certainty factor or the chi-square value is calculated, the support level of the condition part of each rule, whose number of items is 1 to (k−1), is required. For example, when large item sets are generated as in the first embodiment, the generated large item sets L(1) to L(k−1) are kept in the large item set file 3 without being deleted. The large item set identical to the condition part of a rule, whose number of items is 1 to (k−1), can then be searched for in the large item set file 3 to obtain the support level of the condition part of the rule.
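As a rough illustration of this lookup, the certainty factor of a rule can be computed from stored support levels as the support of the whole large item set divided by the support of the condition part. The dictionary `supports` below stands in for the large item set file 3, and its contents are hypothetical values chosen only for the example; the chi-square calculation is not shown.

```python
from typing import Dict, FrozenSet, Iterable


def confidence(
    supports: Dict[FrozenSet[int], float],
    condition: Iterable[int],
    consequent: Iterable[int],
) -> float:
    """Certainty factor of the rule `condition -> consequent`.

    Both the whole item set and the condition part are looked up among
    the retained large item sets L(1) to L(k).
    """
    condition_set = frozenset(condition)
    whole_set = condition_set | frozenset(consequent)
    return supports[whole_set] / supports[condition_set]


# Hypothetical support levels of retained large item sets.
example_supports = {
    frozenset({2, 4}): 0.4,
    frozenset({2, 4, 5, 6}): 0.1,
}
example_confidence = confidence(example_supports, [2, 4], [5, 6])
```

The lookup succeeds for every condition part because any subset of a large item set is itself a large item set.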
[0129]
For example, in the hash tree shown in FIG. 16, the rule verification item set LL (i) (i = 1 to (k−1)) corresponding to the condition part of the rule is stored in association with it.
[0130]
As described above, according to the eleventh embodiment, when correlation rule candidates are created from a large item set with k items, candidate rules whose condition part has 1 to (k−1) items and whose consequent part has (k−1) to 1 items are generated, so correlation rules whose consequent part has two or more items can be generated.
[0131]
Furthermore, since the generated large item sets L(1) to L(k−1) are stored and their support levels are used as the support levels of the condition parts of the rules, the certainty factor or the chi-square value can be calculated easily even when correlation rules whose consequent part has two or more items are generated.
[0132]
The procedure for generating large item sets in the correlation rule extracting apparatus according to the eleventh embodiment is not limited to those of the first to tenth embodiments; the eleventh embodiment is effective for any algorithm that manages large item sets in a file.
[0133]
Embodiment 12.
FIG. 17 is a block diagram showing a configuration of an association rule extraction apparatus according to Embodiment 12 of the present invention. The correlation rule extraction apparatus according to the twelfth embodiment is a so-called computer that operates according to a program that describes the processing procedure according to the first to eleventh embodiments.
[0134]
In the figure, reference numeral 71 denotes a CPU (correlation rule generation means, support degree calculation means, candidate item set generation means, rule verification item set generation means, common item set addition means, storage capacity calculation means, correlation rule extraction unit, and data conversion means) that executes various processes in accordance with the programs 81 and 81A recorded in advance in a ROM 72, a hard disk drive device 74, a recording medium 91, and the like. Reference numeral 72 denotes a ROM that stores in advance a program executed at startup, data necessary for various processes, and the like. Reference numeral 73 denotes a RAM (storage means) into which the programs 81 and 81A recorded in advance in the hard disk drive device 74 and the recording medium 91 are loaded, and which temporarily stores various data during the various processes.
[0135]
Reference numeral 74 denotes a hard disk drive device that stores the program 81 describing the above-described processing, the databases 1 and 1A, the large item set file 3, the correlation rule set file 5, and the like, and reference numeral 75 denotes a recording medium driving device that reads and writes data from and to the recording medium 91. The CPU 71, ROM 72, RAM 73, hard disk drive device 74, and recording medium driving device 75 are connected to each other by a data bus or an address bus. This bus configuration is an example, and other configurations may be used. Reference numeral 91 denotes a computer-readable recording medium such as a flexible disk or a CD-ROM (Compact Disc-Read Only Memory) on which the program 81A describing the above-described processing is recorded.
[0136]
Next, the operation will be described.
First, in response to a user operation or the like, the CPU 71 loads the program 81 recorded in the hard disk drive device 74 or the program 81A stored in the recording medium 91 into the RAM 73, and executes the various processes described in the first to eleventh embodiments. That is, the CPU 71 operates as the above-described large item set generation means 2, 2A and the hypothesis generation verification means 4, and the RAM 73 functions as the memory 6.
[0137]
As described above, according to the twelfth embodiment, a program describing the processing procedures of the first to eleventh embodiments is executed on a computer, so the various processes described above are carried out and the same effects as those of the first to eleventh embodiments are obtained.
[0138]
[Effects of the invention]
As described above, according to the present invention, large item sets are read from predetermined storage means in groups that share all items other than the last item, and the item set that is their common part is added to a hash tree stored in predetermined storage means. Any two of the last items of the read large item sets are selected in turn and added to the common part to generate candidate item sets having one more item than the large item sets. Item sets that are part of each candidate item set and have one item fewer than the candidate item set are added to the hash tree as rule verification item sets. The storage capacity for the hash tree after the common part of the large item sets, the candidate item sets, and the rule verification item sets have been added is then calculated. When this storage capacity becomes larger than a predetermined reference value, the support level of each candidate item set and each rule verification item set is calculated, the candidate item sets whose support level is larger than a predetermined reference value are stored in the predetermined storage means as large item sets, correlation rules are generated from the large item sets based on the certainty factor or chi-square value calculated from the support levels of the rule verification item sets and the large item sets, and the hash tree is then deleted from the predetermined storage means. Because the storage capacity used by the hash tree is checked each time large item sets are processed, the storage capacity allocated in advance can be used effectively.
[0139]
According to the present invention, the support level is stored in the storage means together with each large item set, and when a large item set identical to a rule verification item set is stored in the storage means, the support level of that large item set is used as the support level of the rule verification item set. Therefore, when there are few large item sets, the support level of each rule verification item set can be obtained in a short time.
[0140]
According to the present invention, records are read from the database, and the number of occurrences of the candidate item sets and the number of occurrences of the rule verification item sets are counted in parallel to calculate the support levels of both. Therefore, when there are many large item sets, the support level of each rule verification item set can be calculated in a shorter time than by searching the large item sets stored in the storage means.
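The parallel counting described here might be sketched as a single pass over the records that updates the counts of both kinds of item set. The sketch uses a plain subset test in place of the hash tree matching of the embodiments, and the function name `count_supports` is an assumption for illustration.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List


def count_supports(
    records: Iterable[Iterable[int]],
    candidates: List[Iterable[int]],
    rule_sets: List[Iterable[int]],
) -> Dict[FrozenSet[int], float]:
    """Count candidate item sets and rule verification item sets in one
    scan of the records and return their support levels."""
    targets = {frozenset(s) for s in candidates} | {frozenset(s) for s in rule_sets}
    counts: Counter = Counter()
    total = 0
    for record in records:
        total += 1
        items = set(record)
        for target in targets:
            if target <= items:  # every item of the target occurs in the record
                counts[target] += 1
    if total == 0:
        return {}
    return {target: counts[target] / total for target in targets}
```

Only one scan of the database is needed, regardless of how many large item sets have already been stored.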
[0141]
According to the present invention, the subtrees corresponding to the rule verification item sets are separated from the hash tree, the support level of each candidate item set is then calculated, and after this calculation the subtrees corresponding to the rule verification item sets are returned to their original positions in the hash tree. Matching is therefore not performed against the rule verification item sets, which are unrelated to the calculation of the support levels of the candidate item sets, and the candidate item sets can be matched efficiently.
[0142]
According to the present invention, data indicating the correspondence between the connection node of the hash tree to which a separated subtree was connected and the root of that subtree is stored in a stack data structure, and the separated subtree is returned to its original position in the hash tree based on this data. Memory management can therefore be simplified.
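A minimal sketch of the stack-based bookkeeping is given below. The `Node` class and the helper names `detach` and `reattach_all` are hypothetical and are used only to show how the pairs of connection node and subtree root can be pushed and popped.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """A hash tree node; children are keyed by item."""

    def __init__(self, item: Optional[int] = None) -> None:
        self.item = item
        self.children: Dict[int, "Node"] = {}


def detach(parent: Node, item: int, stack: List[Tuple[Node, Node]]) -> None:
    """Cut the subtree rooted at `item` off `parent` and remember the
    (connection node, subtree root) pair on the stack."""
    subtree = parent.children.pop(item)
    stack.append((parent, subtree))


def reattach_all(stack: List[Tuple[Node, Node]]) -> None:
    """Return every separated subtree to its original position."""
    while stack:
        parent, subtree = stack.pop()
        parent.children[subtree.item] = subtree
```

A Python list serves as the stack here; with a fixed-size array the maximum number of separation positions would have to be decided in advance.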
[0143]
According to the present invention, data indicating the correspondence between the connection node of the hash tree to which a separated subtree was connected and the root of that subtree is stored in a list data structure, and the separated subtree is returned to its original position in the hash tree based on this data. The number of separation positions is therefore not limited, no unnecessary storage area needs to be reserved in advance, and memory can be used efficiently.
[0144]
According to the present invention, when the storage capacity for the hash tree becomes larger than a predetermined reference value, the common part of the large item sets, the candidate item sets, and the rule verification item sets that were added immediately before are deleted from the hash tree. The storage capacity used can therefore be limited more accurately.
[0145]
According to the present invention, when the maximum value of the increase in the storage capacity for the hash tree caused by adding the common part of the large item sets, the candidate item sets, and the rule verification item sets becomes larger than the difference between the predetermined reference value and the storage capacity for the hash tree, the support levels are calculated, the correlation rules are generated, and the hash tree is then deleted. Memory can therefore be used effectively up to the point just before the storage capacity for the hash tree would exceed the predetermined reference value.
[0146]
According to the present invention, when the average value of the increase in the storage capacity for the hash tree caused by adding the common part of the large item sets, the candidate item sets, and the rule verification item sets becomes larger than the difference between the predetermined reference value and the storage capacity for the hash tree, the support levels are calculated, the correlation rules are generated, and the hash tree is then deleted. Memory can therefore be used effectively up to a point just short of the predetermined reference value.
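The two trigger conditions, based on the maximum and on the average increase, can be compared with the following sketch. The function name `should_flush`, the `mode` parameter, and the idea of keeping the past increases in a list are assumptions for illustration; the units of storage capacity are left abstract.

```python
from typing import Sequence


def should_flush(
    reference_value: float,
    used: float,
    past_increases: Sequence[float],
    mode: str = "max",
) -> bool:
    """Decide whether to calculate support levels, generate rules and
    delete the hash tree before the next addition.

    The estimated increase (maximum or average of the increases seen so
    far) is compared with the remaining room, i.e. the difference
    between the reference value and the capacity currently used.
    """
    if not past_increases:
        return False
    if mode == "max":
        estimate = max(past_increases)
    else:
        estimate = sum(past_increases) / len(past_increases)
    return estimate > reference_value - used
```

With a reference value of 100, a used capacity of 90, and past increases of 5, 12, and 8, the maximum-based condition triggers (12 > 10) while the average-based condition does not, so the average-based variant fills the memory further at the cost of a less conservative estimate.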
[0147]
According to the present invention, each item is stored in the database as a two-dimensional index composed of an attribute number and an attribute value, large item sets are generated while the items that can be added below an item that is a node of a hash tree are limited to items that do not have the same attribute number, and correlation rules are extracted from these large item sets. The generation of correlation rules between attribute values of the same attribute can therefore be suppressed, and the generation of negative correlation rules can be suppressed.
[0148]
According to the present invention, the value of each item is stored in the database either as a two-dimensional index composed of an attribute number and an attribute value, or as a two-dimensional index composed of the attribute group number of an attribute group consisting of a plurality of attributes of the database and the attribute value of one of those attributes. Large item sets are generated while the items that can be added below an item that is a node of the hash tree are limited to items that do not have the same attribute number or attribute group number, and correlation rules are extracted from these large item sets. The generation of unnecessary correlation rules can therefore be suppressed, and post-processing for removing such correlation rules can be omitted.
[0149]
According to the present invention, data conversion means is provided that converts each attribute value of the records stored in advance in the database in a predetermined table format into a two-dimensional index and stores the two-dimensional index in the database. Even when the records are recorded in a table format, they are converted into two-dimensional indexes, so the generation of correlation rules between attribute values of the same attribute can likewise be suppressed, and the generation of negative correlation rules can be suppressed.
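One way such a conversion could look is sketched below. It assumes that the attribute number is the 1-based column position in the table, that attribute values are already integer codes, and that an empty cell is represented by `None`; none of these details are taken from the embodiments, and the function name `to_receipt_format` is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Item = Tuple[int, int]  # (attribute number, attribute value)


def to_receipt_format(rows: Sequence[Sequence[Optional[int]]]) -> List[List[Item]]:
    """Convert table-format rows into records of two-dimensional indexes.

    Each non-empty cell becomes one item whose attribute number is the
    column position and whose attribute value is the cell content.
    """
    return [
        [(attr_no, value) for attr_no, value in enumerate(row, start=1) if value is not None]
        for row in rows
    ]
```

Each resulting record is already sorted by attribute number, which is the order assumed by the branch extension rules of the ninth embodiment.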
[0150]
According to the present invention, candidate item sets having one more item than the large item sets are generated from those large item sets, among the large item sets stored in predetermined storage means, that share all items other than the last item. The support levels of the candidate item sets are calculated, and large item sets having the same number of items as the candidate item sets are generated based on the support levels and stored in the storage means. With k denoting the number of items of a large item set, the large item set is divided into a condition part with 1 to (k−1) items and a consequent part with (k−1) to 1 items, and among the rules generated in this way, those whose certainty factor or chi-square value is larger than a predetermined reference value are taken as correlation rules. Correlation rules whose consequent part has two or more items can therefore be generated.
[0151]
According to the present invention, when a large item set is stored in the storage means, its support level is stored as well, and when a large item set identical to the condition part of a rule is stored in the storage means, the certainty factor or chi-square value of the rule is calculated using the support level of that large item set as the support level of the condition part of the rule. The certainty factor or chi-square value can therefore be calculated easily even when correlation rules whose consequent part has two or more items are generated.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an association rule extraction device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart for explaining the operation of the correlation rule extraction device according to the first embodiment.
FIG. 3 is a diagram showing an example of contents of a large item set file.
FIG. 4 is a diagram showing a hash tree composed of the large item sets [1, 2, 3], [1, 2, 4], [1, 3, 5], [1, 4, 5] in FIG. 3.
FIG. 5 is a diagram illustrating an example of an operation in step ST3 of FIG.
FIG. 6 is a diagram illustrating an example of an operation in step ST4 of FIG.
FIG. 7 is a diagram for explaining an example of an operation in step ST5 of FIG.
FIG. 8 is a diagram illustrating an example of a position where a partial tree is cut from a hash tree.
FIG. 9 is a diagram illustrating an example of a correspondence relationship between a hash tree and a separated subtree.
FIG. 10 is a diagram illustrating an example of a stack that stores a correspondence relationship between a hash tree and a separated subtree.
FIG. 11 is a diagram illustrating an example of a list storing a correspondence relationship between a hash tree and a separated subtree.
FIG. 12 is a block diagram showing a configuration of an association rule extraction device according to Embodiment 9 of the present invention.
FIG. 13 is a diagram showing an example of a table format.
FIG. 14 is a diagram showing an example of a record stored in a database in a receipt format.
FIG. 15 is a diagram illustrating an example of a hash tree including receipt format items.
FIG. 16 is a diagram showing an example of a hash tree to which the rule verification item sets LL(1) to LL(k−1), generated when the number of items in the condition part of a rule is 1 to (k−1), have been added.
FIG. 17 is a block diagram showing a configuration of an association rule extraction device according to Embodiment 12 of the present invention.
FIG. 18 is a block diagram showing a configuration example of a conventional correlation rule extraction device.
FIG. 19 is a diagram illustrating an example of a hash tree indicating a large item set.
FIG. 20 is a diagram for explaining the previously proposed first technique.
FIG. 21 is a diagram for explaining the previously proposed second technique.
[Explanation of symbols]
1, 1A database, 3 large item set file (storage means), 4 hypothesis generation verification means (correlation rule generation means), 6 memory (storage means), 21 candidate item set verification means (support level calculation means), 22 candidate item set generation means (candidate item set generation means, rule verification item set generation means), 23 hash tree operation means (common item set addition means, storage capacity calculation means), 51 correlation rule extraction unit, 52 receipt file generation means (data conversion means), 71 CPU (correlation rule generation means, support level calculation means, candidate item set generation means, rule verification item set generation means, common item set addition means, storage capacity calculation means, correlation rule extraction unit, data conversion means), 73 RAM (storage means), 81A program, 91 recording medium.

Claims

A correlation rule extraction device that extracts correlation rules between item sets of a database from a plurality of records recorded in the database, the device comprising:
storage means for storing a plurality of item sets as a hash tree with each item as a node;
storage means for storing large item sets;
common item set adding means for reading the large item sets from the storage means in groups that share all items other than the last item, and adding the item set that is their common part to the hash tree;
candidate item set generation means for selecting, in turn, any two of the last items of the read large item sets and adding the two items to the common part to generate candidate item sets having one more item than the large item sets;
rule verification item set generation means for adding, to the hash tree as rule verification item sets, item sets that are part of each candidate item set and have one item fewer than the candidate item set;
storage capacity calculating means for calculating the storage capacity for the hash tree after the common part of the large item sets, the candidate item sets, and the rule verification item sets have been added;
support degree calculation means for, when the storage capacity for the hash tree calculated by the storage capacity calculating means becomes larger than a predetermined reference value, calculating the support level of each candidate item set and each rule verification item set, and storing in the storage means, as large item sets, those candidate item sets whose support level is larger than a predetermined reference value; and
correlation rule generation means for, when the storage capacity for the hash tree calculated by the storage capacity calculating means becomes larger than the predetermined reference value, generating correlation rules from the large item sets based on a certainty factor or chi-square value calculated from the support levels of the rule verification item sets and the large item sets, and then deleting the hash tree from the storage means.

The correlation rule extraction device according to claim 1, wherein the support degree calculation means stores the support level in the storage means together with each large item set and, when a large item set identical to a rule verification item set is stored in the storage means, uses the support level of that large item set as the support level of the rule verification item set.

The correlation rule extraction device according to claim 1, wherein the support degree calculation means reads records from the database, counts the number of occurrences of the candidate item sets and the number of occurrences of the rule verification item sets in parallel, and calculates the support levels of the candidate item sets and the rule verification item sets.

The correlation rule extraction device according to claim 1, wherein the support degree calculation means separates the subtrees corresponding to the rule verification item sets from the hash tree, then calculates the support level of each candidate item set, and after the calculation returns the subtrees corresponding to the rule verification item sets to their original positions in the hash tree.

The correlation rule extraction device according to claim 4, wherein the support degree calculation means stores, in a stack data structure, data indicating the correspondence between the connection node of the hash tree to which a separated subtree was connected and the root of that subtree, and returns the separated subtree to its original position in the hash tree based on the data.

The correlation rule extraction device according to claim 4, wherein the support degree calculation means stores, in a list data structure, data indicating the correspondence between the connection node of the hash tree to which a separated subtree was connected and the root of that subtree, and returns the separated subtree to its original position in the hash tree based on the data.

The correlation rule extraction device according to claim 1, wherein, when the storage capacity for the hash tree becomes larger than the predetermined reference value, the storage capacity calculating means deletes from the hash tree the common part of the large item sets, the candidate item sets, and the rule verification item sets that were added immediately before.

The correlation rule extraction device according to claim 1, wherein, instead of when the storage capacity for the hash tree becomes larger than the predetermined reference value, the support degree calculation means and the correlation rule generation means calculate the support levels, generate the correlation rules, and then delete the hash tree when the maximum value of the increase in the storage capacity for the hash tree caused by adding the common part of the large item sets, the candidate item sets, and the rule verification item sets becomes larger than the difference between the predetermined reference value and the storage capacity for the hash tree.

The correlation rule extraction device according to claim 1, wherein, instead of when the storage capacity for the hash tree becomes larger than the predetermined reference value, the support degree calculation means and the correlation rule generation means calculate the support levels, generate the correlation rules, and then delete the hash tree when the average value of the increase in the storage capacity for the hash tree caused by adding the common part of the large item sets, the candidate item sets, and the rule verification item sets becomes larger than the difference between the predetermined reference value and the storage capacity for the hash tree.

A computer-readable recording medium on which is recorded a program for causing a computer to extract correlation rules between item sets of a database from a plurality of records recorded in the database, the program causing the computer to execute:
a procedure for reading large item sets from predetermined storage means in groups that share all items other than the last item, and adding the item set that is their common part to a hash tree of item sets that is stored in predetermined storage means and has each item as a node;
a procedure for selecting, in turn, any two of the last items of the read large item sets and adding the two items to the common part to generate candidate item sets having one more item than the large item sets;
a procedure for adding, to the hash tree as rule verification item sets, item sets that are part of each candidate item set and have one item fewer than the candidate item set;
a procedure for calculating the storage capacity for the hash tree after the common part of the large item sets, the candidate item sets, and the rule verification item sets have been added;
a procedure for, when the storage capacity for the hash tree becomes larger than a predetermined reference value, calculating the support level of each candidate item set and each rule verification item set, and storing in the predetermined storage means, as large item sets, those candidate item sets whose support level is larger than a predetermined reference value; and
a procedure for generating correlation rules from the large item sets based on a certainty factor or chi-square value calculated from the support levels of the rule verification item sets and the large item sets, and then deleting the hash tree from the predetermined storage means.

A correlation rule extraction device that stores item sets as a hash tree having, as nodes, items corresponding to the attributes of a database, generates large item sets from a plurality of records recorded in the database by adding nodes corresponding to items to the hash tree according to a predetermined rule, and extracts correlation rules between attributes of the database, the device comprising:
a database that stores each item as a two-dimensional index composed of an attribute number and an attribute value; and
a correlation rule extraction unit that generates large item sets while limiting the items that can be added below an item that is a node of the hash tree to items that do not have the same attribute number, and extracts correlation rules from the large item sets.

The correlation rule extraction device according to claim 11, wherein the database stores the value of each item either as a two-dimensional index composed of an attribute number and an attribute value, or as a two-dimensional index composed of the attribute group number of an attribute group consisting of a plurality of attributes of the database and the attribute value of one of those attributes, and
the correlation rule extraction unit generates large item sets while limiting the items that can be added below an item that is a node of the hash tree to items that do not have the same attribute number or attribute group number, and extracts correlation rules from the large item sets.

The correlation rule extraction device according to claim 11, further comprising data conversion means for converting each attribute value of records stored in advance in the database in a predetermined table format into a two-dimensional index and storing the two-dimensional index in the database.

A computer-readable recording medium on which is recorded a program for causing a computer to store item sets as a hash tree having, as nodes, items corresponding to the attributes of a database, to generate large item sets from a plurality of records recorded in the database by adding nodes corresponding to items to the hash tree according to a predetermined rule, and to extract correlation rules between attributes of the database, the program causing the computer to execute:
a procedure for reading items from the database, which stores each item as a two-dimensional index composed of an attribute number and an attribute value; and
a procedure for generating large item sets while limiting the items that can be added below an item that is a node of the hash tree to items that do not have the same attribute number, and extracting correlation rules from the large item sets.

A correlation rule extraction device that extracts correlation rules between item sets of a database from a plurality of records recorded in the database, the device comprising:
storage means for storing large item sets;
candidate item set generation means for generating, from large item sets stored in the storage means that share all items other than the last item, a candidate item set having one more item than those large item sets;
support level calculating means for calculating the support level of the candidate item set, generating, based on the support level, a large item set having the same number of items as the candidate item set, and storing the large item set in the storage means; and
correlation rule generation means that, where k is the number of items in a large item set, divides the large item set into a condition part having from 1 to (k-1) items and a consequent part having from (k-1) to 1 items, and adopts as a correlation rule any resulting rule whose certainty factor or chi-square value is greater than a predetermined reference value.
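
The candidate generation and rule splitting steps in the claim above can be sketched in Python as follows. This is an illustrative reading, not the claimed implementation: item sets are assumed to be sorted tuples, and `generate_candidates` and `split_into_rules` are hypothetical names.

```python
from itertools import combinations


def generate_candidates(large_sets):
    """Join large item sets that agree on every item except the last one.

    Each result has one more item than the inputs and stays sorted.
    """
    ordered = sorted(large_sets)
    candidates = []
    for a, b in combinations(ordered, 2):
        if a[:-1] == b[:-1]:
            candidates.append(a + (b[-1],))
    return candidates


def split_into_rules(itemset):
    """Yield every (condition part, consequent part) split of a k-item set.

    The condition part has 1 to k-1 items and the consequent part holds the
    remaining k-1 to 1 items.
    """
    k = len(itemset)
    for size in range(1, k):
        for condition in combinations(itemset, size):
            consequent = tuple(i for i in itemset if i not in condition)
            yield condition, consequent
```

A k-item set yields 2^k - 2 splits in total, so a 3-item set produces six candidate rules; each one would then be kept or discarded by comparing its certainty factor or chi-square value with the reference value.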

The correlation rule extraction device according to claim 15, wherein the storage means stores the support level together with each large item set,
the support level calculating means, when storing a large item set in the storage means, also stores the support level of that large item set, and, when a large item set identical to the condition part of a rule is stored in the storage means, the certainty factor or chi-square value of the rule is calculated using the support level of that large item set as the support level of the condition part of the rule.
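
The reuse of stored support levels can be sketched as below, using the certainty factor defined in the background section as the support level of the whole rule divided by the support level of its condition part. The dictionary `stored_support` (sorted tuple to support level), the function name, and the fallback scan over `records` are assumptions for illustration; the chi-square variant is not shown.

```python
def certainty_factor(condition, consequent, stored_support, records=None):
    """Certainty factor of the rule `condition -> consequent`.

    Uses the stored support level of the condition part when that item set is
    in storage; otherwise falls back to counting over `records`.
    """
    whole = tuple(sorted(set(condition) | set(consequent)))
    key = tuple(sorted(condition))
    if key in stored_support:
        # No database scan needed: the support level was saved when the
        # large item set was stored.
        condition_support = stored_support[key]
    else:
        condition_support = sum(1 for r in records if set(key) <= r) / len(records)
    return stored_support[whole] / condition_support
```

Because every subset of a large item set is itself large, the lookup normally succeeds, and the certainty factor is obtained without rereading the database.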

A computer-readable recording medium on which is recorded a program for causing a computer to extract correlation rules between item sets of a database from a plurality of records recorded in the database, the program causing the computer to execute:
a procedure for generating, from large item sets stored in a predetermined storage unit that share all items other than the last item, a candidate item set having one more item than those large item sets;
a procedure for calculating the support level of the candidate item set, generating, based on the support level, a large item set having the same number of items as the candidate item set, and storing the large item set in the predetermined storage unit; and
a procedure for, where k is the number of items in a large item set, dividing the large item set into a condition part having from 1 to (k-1) items and a consequent part having from (k-1) to 1 items, and adopting as a correlation rule any resulting rule whose certainty factor or chi-square value is greater than a predetermined reference value.