JP2001167098A

JP2001167098A - Distributed parallel analysis of large amounts of data

Info

Publication number: JP2001167098A
Application number: JP34702699A
Authority: JP
Inventors: Hideyuki Maki; 牧　　秀行; Toyohisa Morita; 豊久森田; Yukiyasu Ito; 幸康伊藤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1999-12-07
Filing date: 1999-12-07
Publication date: 2001-06-22

Abstract

(57)【要約】【課題】大量のデータから知識を発見するデータマイニ
ングの並列分散処理方法に関しては、従来、データを分
割して複数の処理装置に分配するので、処理装置間で分
析対象データの転送が必要となったり、並列分散処理方
法を実行するためには、前処理としてデータベースから
各処理装置へデータを分配する必要があるという問題が
ある。また、処理装置に大容量の主記憶が必要であると
いう問題がある。【解決手段】単数、または複数のデータ格納手段、分析
結果集計手段、複数のデータ分析手段を用いる。データ
格納手段は分析対象データを一回送信し、複数のデータ
分析手段がこれを受信する。各々のデータ分析手段は受
信したデータを対象として分析を行い、その後、それぞ
れのデータ分析手段において得られた分析結果を分析結
果集計手段が集計し、全体の分析結果とする。 (57) [Summary] [Problem] Regarding a parallel distributed processing method of data mining that discovers knowledge from a large amount of data, conventionally, data is divided and distributed to a plurality of processing devices, so that data to be analyzed among the processing devices is analyzed. In order to execute the parallel distributed processing method, it is necessary to distribute data from a database to each processing device as preprocessing. Another problem is that the processing device requires a large-capacity main memory. One or a plurality of data storage means, an analysis result totaling means, and a plurality of data analysis means are used. The data storage means transmits the data to be analyzed once, and the plurality of data analysis means receives the data. Each data analysis unit analyzes the received data, and thereafter, the analysis result obtained by each data analysis unit is totaled by the analysis result totaling unit, and the total analysis result is obtained.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、大量のデータを対
象とするデータ分析技術に関する。[0001] 1. Field of the Invention [0002] The present invention relates to a data analysis technique for a large amount of data.

【０００２】[0002]

【従来の技術】大量のデータから知識を発見する技術は
データマイニングと呼ばれている。発見される知識の具
体例として、相関ルール（Association Rule）がよく知
られている。相関ルールの基本的概念は文献「Mining A
ssociation Rules between Setof Items in Large Data
bases」（proceeding of ACM SIGMOD、1993）に説明さ
れている。それによれば、 I_1 から I_m までの m 個
の二値属性（アイテムと呼ばれ、0 か 1 の一方の値を
持つ）と、アイテムに対応する m 個の要素からなる二
値ベクトル（トランザクションと呼ばれる）と、このト
ランザクションの集合 T を考えた時、相関ルールは「X
→ I_j」と記述される。ここで、X は I_1から I_m ま
での m 個のアイテムのうちのいくつかからなるアイテ
ムの集合（アイテムセットと呼ばれる）、 I_j は X に
含まれない単一のアイテムである。１つのトランザクシ
ョン t と、１つのアイテム i を考えた時、 t の要素
のうち、i に対応するものの値が 1 であれば、トラン
ザクション t はアイテム i を満足すると言い、 t が
アイテムセット X に含まれる全てのアイテムを満足す
る時、トランザクション t はアイテムセット X を満足
すると言う。トランザクション集合 T において、アイ
テムセット X を満足するトランザクションの数を K、
アイテムセット X を満足し、かつアイテム I_j をも満
足するトランザクションの数を J とした時、割合 J/K
を相関ルール「X → I_j」の「コンフィデンス」と呼
ぶ。また、トランザクション集合 T の全体に対する上
記 J の割合を相関ルール「X → I_j」の「サポート」
と呼ぶ。また、トランザクション集合 T の全体に対す
る上記 K の割合をアイテムセット X の「サポート」と
呼ぶ。2. Description of the Related Art A technique for finding knowledge from a large amount of data is called data mining. As a specific example of the discovered knowledge, an association rule (Association Rule) is well known. The basic concept of association rules is described in the document "Mining A
ssociation Rules between Setof Items in Large Data
bases "(procededing of ACM SIGMOD, 1993). According to it, m binary attributes from I_1 to I_m (called an item and having one of 0 or 1) and a binary vector consisting of m elements corresponding to the item (called a transaction) ) And this set of transactions T, the association rule is "X
→ I_j ”. Here, X is a set of items (called an item set) consisting of some of the m items I_1 to I_m, and I_j is a single item not included in X. When one transaction t and one item i are considered, if the value of the element corresponding to i among the elements of t is 1, it is said that the transaction t satisfies the item i, and t is included in the item set X. A transaction t satisfies item set X when all items satisfied are satisfied. Let K be the number of transactions that satisfy itemset X in transaction set T.
Assuming that the number of transactions that satisfy item set X and also satisfy item I_j is J, the ratio J / K
Is referred to as “confidence” of the association rule “X → I_j”. In addition, the ratio of the above J to the entire transaction set T is determined by “support” of the correlation rule “X → I_j”.
Call. Also, the ratio of the above K to the entire transaction set T is called “support” of the item set X.

【０００３】アイテムセット X、アイテム I_j の組合
せは多数ある得るが、その中から、与えられた最小コン
フィデンス c、および最小サポート s 以上のコンフィ
デンスとサポートを持つ相関ルールを発見するための基
本的な手法について文献「Fast Algorithms for Mining
Association Rules」（Proceedings of VLDB、1994）
に述べられている。この文献では、 n 個のアイテムか
らなるアイテムセットのうち、最小サポート s 以上の
サポートを持つものの集合をラージアイテム集合 L_n
と呼び、 L_(k-1) を元に L_k を得る処理を１つのパス
とし、 k の値を1 ずつ増加させながら、新たなラージ
アイテム集合が得られなくなるまでパスを繰り返すこと
によって最小サポート s を満たすアイテムセットの集
合を求める。There can be many combinations of the item set X and the item I_j, and a basic method for finding an association rule having a given minimum confidence c and a minimum confidence s and higher than s is provided. Refer to the "Fast Algorithms for Mining
Association Rules "(Proceedings of VLDB, 1994)
It is described in. In this document, among itemsets consisting of n items, a set of itemsets with support greater than or equal to the minimum support s is defined as a large item set L_n
The process of obtaining L_k based on L_ (k-1) is considered as one pass, and the value of k is incremented by 1 while repeating the pass until a new large item set cannot be obtained. Find a set of item sets that satisfy.

【０００４】L_(k-1) を元に L_k を得るには、まず、
L_(k-1) に含まれるうちの k 個のアイテムからなる可
能な全てのアイテムセットを作成し、これらアイテムセ
ットの集合を候補アイテム集合 C_k とし、次にトラン
ザクション集合 T を走査し、C_k のそれぞれのアイテ
ムセットについてアイテムセットを満足するトランザク
ションの数を数え上げ、それによってサポートの値を算
出する。候補アイテム集合 C_k のうちで s 以上のサポ
ートを持つアイテムセットの集合を新たなラージアイテ
ム集合 L_k とする。この処理を、新たなラージアイテ
ム集合が得られなくなるまでパスを繰り返す。ラージア
イテム集合が得られた後、これに含まれるアイテムセッ
トのそれぞれにおいて、含まれるアイテムを用いて作成
可能な相関ルールについてそのコンフィデンスを算出
し、最小コンフィデンス c を満たす相関ルールを選び
出す。こうして得られた相関ルールが最終的な結果とな
る。[0004] To obtain L_k based on L_ (k-1), first,
Create all possible itemsets of k items included in L_ (k-1), set these itemsets as the candidate itemset C_k, then scan the transaction set T, and For each itemset, count the number of transactions that satisfy the itemset, and calculate the support value accordingly. A set of item sets having support of s or more in the candidate item set C_k is defined as a new large item set L_k. This process is repeated until a new large item set cannot be obtained. After the large item set is obtained, in each of the item sets included in the large item set, the confidence is calculated for the association rule that can be created by using the included item, and the correlation rule that satisfies the minimum confidence c is selected. The association rule thus obtained is the final result.

【０００５】上記の基本的アルゴリズムを計算機で実行
する際には、トランザクションの集合を２次記憶に保持
し、候補アイテム集合を満たすトランザクションの数を
数え上げるカウンタを主記憶に保持することになる。こ
の時、２つの問題点がある。１つは、１回のパスを実行
するごとにトランザクション集合全体を走査するため、
２次記憶からのデータの読み出しに多くの処理時間を費
やしてしまうという点である。もう１つは、与えられた
トランザクション集合に現れるアイテムの数によって
は、候補アイテム集合が非常に大きくなり、カウンタを
保持するために大容量の主記憶が必要となる点である。
これらの問題を解決するための並列分散処理方法が文献
「Parallel Mining of Association Rules」（IEEE Tra
nsactions on Knowledge and Data Engineering、199
6）に述べられている。この文献には３つの並列アルゴ
リズムが説明されている。これら３つのアルゴリズムは
いずれも、複数の処理装置を有し、トランザクション集
合は分割されて各処理装置に局所的な２次記憶に分配さ
れて保持されることを前提としている。第１の並列アル
ゴリズムは「Count Distribution」と呼ばれ、候補アイ
テム集合のサポートの算出の際のトランザクション集合
の走査を複数の処理装置で並列に行うことによってデー
タの読み出しに要する時間を短縮することが可能であ
る。（しかし、候補アイテム集合のサポートを算出する
際のカウンタを各々の処理装置で全て保持するので、大
容量の主記憶が要求されると言う点は解決されない。）
第２の並列アルゴリズムは「Data Distribution」と呼
ばれ、カウンタを複数の処理装置に分配して保持するこ
とによって、各処理装置において必要となる主記憶量を
削減することが可能である。（しかし、各処理装置に分
配されたトランザクションを処理装置間で転送する必要
があるという別の問題が生じる。）また、 Count Distr
ibution、Data Distribution ともに、各々の処理装置
で得られた候補アイテム集合のサポートを１回のパスご
とに集計するので、処理装置ごとの処理時間に差がある
場合、１回のパスごとに最も遅い処理装置の終了を待た
ねばならず、処理時間の無駄が生じるという問題もあ
る。第３の並列アルゴリズムは「Candidate Distributi
on」と呼ばれ、この問題を解決することを目的としてい
る。あるアイテムセットを満たすトランザクションの集
合をサポートトランザクション集合と呼ぶことにする。When the above basic algorithm is executed by a computer, a set of transactions is held in a secondary storage, and a counter for counting the number of transactions satisfying the candidate item set is held in a main storage. At this time, there are two problems. One is to scan the entire transaction set each time one pass is performed,
The point is that a lot of processing time is spent reading data from the secondary storage. The other is that the candidate item set becomes very large depending on the number of items appearing in a given transaction set, and a large-capacity main memory is required to hold a counter.
A parallel distributed processing method for solving these problems is described in the document "Parallel Mining of Association Rules" (IEEE Tra
nsactions on Knowledge and Data Engineering, 199
6). This document describes three parallel algorithms. Each of these three algorithms has a plurality of processing units, and it is premised that a transaction set is divided and distributed and held in a secondary storage local to each processing unit. The first parallel algorithm is called “Count Distribution”, and it is possible to reduce the time required for reading data by scanning a transaction set in parallel with a plurality of processing devices when calculating support for a candidate item set. It is possible. (However, since the counters for calculating the support of the candidate item set are all held in each processing device, the point that a large-capacity main memory is required cannot be solved.)
The second parallel algorithm is called “Data Distribution”. By distributing and holding counters to a plurality of processing devices, it is possible to reduce the amount of main memory required in each processing device. (However, another problem arises in that transactions distributed to each processing unit need to be transferred between the processing units.)
For both ibution and Data Distribution, the support of the candidate item set obtained by each processing device is totaled for each pass, so if there is a difference in the processing time for each processing device, the slowest for each pass It is necessary to wait for the end of the processing device, and there is a problem that processing time is wasted. The third parallel algorithm is "Candidate Distributi
It is called "on" and aims to solve this problem. A set of transactions that satisfy a certain item set is called a support transaction set.

【０００６】Candidate Distribution アルゴリズムで
は途中のパスm（m はヒューリスティックに決定する）
において、候補アイテム集合 C_m に含まれるアイテム
セットをグループ分けする。この時、異なるグループに
属するアイテムセットの間でサポートトランザクション
集合がなるべく重ならないようにする。このグループ分
けにしたがって候補アイテム集合と、そのサポートトラ
ンザクション集合を複数の処理装置へ分配し直す。これ
により、各々の処理装置へ分配されたトランザクション
のみの走査で以降のパスにおけるラージアイテム集合を
得ることができ、全ての処理装置がパスごとに同期を取
る必要がなくなる。In the Candidate Distribution algorithm, an intermediate path m (m is determined heuristically)
In, the item sets included in the candidate item set C_m are grouped. At this time, the support transaction sets should not overlap as much as possible between the item sets belonging to different groups. According to the grouping, the candidate item set and its support transaction set are redistributed to a plurality of processing devices. Thus, a large item set in a subsequent pass can be obtained by scanning only the transactions distributed to the respective processing devices, and it is not necessary for all the processing devices to synchronize for each pass.

【０００７】相関ルール以外のデータマイニング手法と
して、特徴ルール（CharacteristicRule）と、その発見
法が文献「Characteristic Rule Induction Algorithm
forData Mining」（proceedings of PAKDD、1998）に述
べられている。この文献では、複数のフィールドからな
るレコードの集合を分析対象データとしている。特徴ル
ールは「if A then B」と記述される。 A は１個以上の
条件の組合せ、B は単一の条件である。ここで言う「条
件」とは、フィールドとその値の組であり、例えば、
「X_i」をフィールド、「v_ij」を値とすれば、これら
を組にした物は条件であり、「X_i = v_ij」と記述され
る。また、分析対象のレコードにおいて、フィールド X
_i に値 v_ij を持つ場合、そのレコードは条件「X_i =
v_ij」を満足すると言う。また、特徴ルールは評価値
を持ち、特徴ルール「if A thenB」の評価値はAs a data mining method other than the association rule, a characteristic rule (CharacteristicRule) and a method for finding the characteristic rule are described in the document "Characteristic Rule Induction Algorithm"
forData Mining "(proceedings of PAKDD, 1998). In this document, a set of records including a plurality of fields is used as analysis target data. The feature rule is described as “if A then B”. A is a combination of one or more conditions, and B is a single condition. The "condition" here is a pair of a field and its value. For example,
If “X_i” is a field and “v_ij” is a value, a combination of these is a condition, and is described as “X_i = v_ij”. In the record to be analyzed, field X
If _i has the value v_ij, then that record is the condition "X_i =
v_ij ”. The feature rule has an evaluation value, and the evaluation value of the feature rule “if A thenB” is

【０００８】[0008]

【数１】P(A)^a log{P(B|A)／P(B)} と定義される。ここで、P(A)、P(B) はそれぞれ、分析
対象レコード全体のうちで条件 A および条件 B を満足
するレコードの割合であり、 P(B|A) は条件 Aを満足す
るレコードのうち、条件 A と条件 B の両方を満足す
るレコードの割合である。また、指数 a はヒューリス
ティックに定められる正の定数である。## EQU1 ## It is defined as P (A) ^ a log {P (B | A) / P (B)}. Here, P (A) and P (B) are the proportions of records that satisfy conditions A and B in the entire record to be analyzed, respectively, and P (B | A) is the percentage of records that satisfy condition A. Of these, the percentage of records that satisfy both conditions A and B. The index a is a heuristically determined positive constant.

【０００９】この文献で述べられている特徴ルールの発
見法は、分析対象データと then 部の条件、if 部に含
まれる条件数の上限、ルール数 M が与えられた時に、
特徴ルールの生成と評価値の算出を繰り返し、評価値の
最も大きい M 個の特徴ルールを発見する方法である。[0009] The feature rule discovery method described in this document is based on the condition that the data to be analyzed and the then part, the upper limit of the number of conditions included in the if part, and the number of rules M are given.
This is a method of repeatedly generating feature rules and calculating evaluation values to find the M feature rules with the highest evaluation values.

【００１０】上記の文献に述べられている特徴ルール発
見法では、１つの特徴ルールを生成し、その評価値を算
出するごとに分析対象データの全体、または一部を走査
する必要があり、相関ルールの場合と同じように、２次
記憶からのデータの読み出しに時間がかかるという問題
がある。これを解決するために、分析対象データの走査
を一回のみ行う方法が特開平11−3360号「大規模データ
分析方法」である。この方法は、可能な全ての条件につ
いて、これを満足するレコードの数を数え上げるための
カウンタを用意し、一回のデータ走査で全ての条件につ
いてのレコードの数え上げを行うというものである。こ
れによれば、データの読み出しにかかる時間を削減する
ことができる。しかし一方、カウンタを保持するために
大容量の主記憶が必要になるという問題がある。In the feature rule finding method described in the above-mentioned document, it is necessary to generate one feature rule and scan the whole or a part of the data to be analyzed every time an evaluation value is calculated. As in the case of the rule, there is a problem that it takes time to read data from the secondary storage. To solve this problem, Japanese Patent Laid-Open No. Hei 11-3360 discloses a method of performing a scan of data to be analyzed only once. This method prepares a counter for counting the number of records satisfying the condition for all possible conditions, and counts the records for all conditions in one data scan. According to this, the time required for reading data can be reduced. However, on the other hand, there is a problem that a large-capacity main memory is required to hold the counter.

【００１１】[0011]

【発明が解決しようとする課題】相関ルール、特徴ルー
ルのいずれの場合も、基本的な発見方法においては、分
析対象データの走査にかかる処理時間とカウンタ保持に
必要な主記憶の容量の問題があり、一方を小さくしよう
とすると、他方が大きくなるという問題がある。これを
解決するための並列分散処理方法がいくつかあるが、相
関ルール発見の並列分散処理方法に関しては、データを
分割して複数の処理装置に分配するので、分析対象デー
タの走査にかかる処理時間は小さくできるものの、処理
装置間で分析対象データの転送が必要となったり、カウ
ンタを保持するために、依然として各々の処理装置にお
いて大容量の主記憶が必要となる。また、データマイニ
ング機能を持たない通常のデータベースシステムに分析
対象データが保持されている場合、並列分散処理方法を
実行するためには、前処理としてデータベースから各処
理装置へデータを分配する必要があり、このための処理
時間が余計にかかるという問題がある。In both cases of the association rule and the feature rule, the basic discovery method has problems of processing time required for scanning the data to be analyzed and capacity of main memory required for holding the counter. There is a problem that if one is to be made smaller, the other becomes larger. There are several parallel distributed processing methods for solving this. However, regarding the parallel distributed processing method for finding association rules, since data is divided and distributed to a plurality of processing devices, the processing time required for scanning the analysis target data is reduced. Although it is possible to reduce the data size, it is necessary to transfer the data to be analyzed between the processing devices, and each processing device still needs a large-capacity main memory to hold the counter. In addition, when data to be analyzed is stored in a normal database system without a data mining function, it is necessary to distribute data from the database to each processing device as preprocessing in order to execute the parallel distributed processing method. However, there is a problem that processing time for this is extra.

【００１２】本発明の目的は、複数の処理装置を用いた
並列分散処理において、分析対象データの転送量、転送
回数を少なく抑え、かつ、各々の処理装置において必要
となる主記憶の量を少なく抑え、かつ、通常のデータベ
ースシステムに接続して実行可能なデータマイニングの
方法を提供することである。An object of the present invention is to reduce the transfer amount and the number of transfers of data to be analyzed in parallel distributed processing using a plurality of processing devices, and to reduce the amount of main memory required in each processing device. An object of the present invention is to provide a data mining method that can be executed while being suppressed and connected to an ordinary database system.

【００１３】[0013]

【課題を解決するための手段】本発明の並列分散分析方
法では、単数、または複数のデータ格納手段と複数のデ
ータ分析手段を用いる。単数のデータ格納手段を用いる
場合、全ての分析対象データは単数のデータ格納手段に
格納される。複数のデータ格納手段を用いる場合、分析
対象データは分割され、複数のデータ格納手段に分配さ
れて格納される。The parallel variance analysis method of the present invention uses one or more data storage means and a plurality of data analysis means. When a single data storage unit is used, all the data to be analyzed is stored in the single data storage unit. When a plurality of data storage units are used, the data to be analyzed is divided and distributed and stored in the plurality of data storage units.

【００１４】データ格納手段は分析対象データを複数の
データ分析手段に対して送信する。複数のデータ分析手
段は同一のデータを受信し、受信したデータを対象とし
てそれぞれのデータ分析手段において分析を行う。その
後、それぞれのデータ分析手段において得られた分析結
果をまとめて、全体の分析結果とする。The data storage means transmits the data to be analyzed to a plurality of data analysis means. The plurality of data analysis units receive the same data, and each of the data analysis units analyzes the received data. After that, the analysis results obtained by the respective data analysis means are put together to make an overall analysis result.

【００１５】データ格納手段は分析対象データを複数の
データ分析手段に対してまとめて一回送信し、複数のデ
ータ分析手段はこれを共有する。すなわち、データ格納
手段は複数のデータ分析手段の各々に対して個別に分析
対象データを送信することはしない。The data storage means collectively transmits the data to be analyzed to a plurality of data analysis means once, and the plurality of data analysis means share the data. That is, the data storage unit does not individually transmit the analysis target data to each of the plurality of data analysis units.

【００１６】また、本発明の並列分散分析方法では、複
数のデータ分析手段が共有する共有記憶手段を用いる場
合がある。複数のデータ分析手段は、それぞれの分析結
果を共有記憶手段上に保持する。したがって、共有記憶
手段に保持された分析結果を読み出すことによって全体
の分析結果を得ることができる。In the parallel variance analysis method of the present invention, a common storage means shared by a plurality of data analysis means may be used. The plurality of data analysis units hold respective analysis results on the shared storage unit. Therefore, the entire analysis result can be obtained by reading the analysis result held in the shared storage means.

【００１７】[0017]

【発明の実施の形態】本発明の第一の実施の形態を説明
する。図１に本実施形態の構成を示す。本実施形態は、
データ格納装置１０１、複数のデータ分析装置１０２、
分析結果集計装置１０３がバス型通信路１０４によって
接続されている。分析対象データはデータ格納装置に格
納されている。DESCRIPTION OF THE PREFERRED EMBODIMENTS A first embodiment of the present invention will be described. FIG. 1 shows the configuration of the present embodiment. In this embodiment,
A data storage device 101, a plurality of data analysis devices 102,
The analysis result totaling device 103 is connected by a bus-type communication path 104. The data to be analyzed is stored in the data storage device.

【００１８】図２にデータ分析の手順をしめす。データ
分析装置の準備処理２０１では、データ分析装置のそれ
ぞれにおいてデータ分析の準備を行い、分析対象データ
を受信する準備が完了したら、準備完了の信号をデータ
格納装置へ送信する。データ格納装置では、全てのデー
タ分析装置からの準備完了信号を受信するのを待つ（処
理２０２）。全てのデータ分析装置からの準備完了信号
を受信後、処理２０３において、データ格納装置は分析
対象データを走査し、まだ送信していないデータが残っ
ている場合は処理２０４へ、残っていない場合（全ての
分析対象データを送信し終わった場合）は処理２０６へ
進む。レコード送信処理２０４では、データ格納装置
は、まだ送信していないデータのうちの１レコードを送
信する。FIG. 2 shows the procedure of data analysis. In the preparation process 201 of the data analysis device, each of the data analysis devices prepares for data analysis, and when preparation for receiving data to be analyzed is completed, a preparation completion signal is transmitted to the data storage device. The data storage device waits for reception of a preparation completion signal from all data analyzers (process 202). After receiving the ready signals from all the data analyzers, in step 203, the data storage device scans the data to be analyzed, and if there is any data that has not been transmitted, the process proceeds to step 204; If all the data to be analyzed have been transmitted), the process proceeds to step 206. In the record transmission process 204, the data storage device transmits one record of data that has not been transmitted yet.

【００１９】一度送信されたレコードは送信済みとして
扱われる。送信されたレコードはバス型通信路を介して
複数のデータ分析装置へ伝送される。通信路がバス型で
あるため、レコードはデータ格納装置からの１回の送信
で、通信路に接続された全てのデータ分析装置へ伝送さ
れる。データ受信、分析処理２０５では、データ格納装
置から送信されたレコードをそれぞれのデータ分析装置
において受信する。レコードの受信後、データ分析装置
では次のレコードを受信する準備を行い、受信準備が完
了したら、準備完了の信号をデータ格納装置へ送信す
る。また、既に受信済みのレコードとともに受信したレ
コードを対象としてデータ分析を行う。データ受信、分
析処理２０５の後は処理２０２へ戻る。分析結果集計処
理２０６では、分析結果集計装置はデータ分析装置から
分析結果を受け取り、これを集計して全体の分析結果を
得る。以上が、分析方法の概要である。このように、複
数の装置が互いに協調しながら、データ分析を行う。以
下に、特徴ルールの発見を例にとり、各装置において行
われる処理を詳細に説明する。A record transmitted once is treated as transmitted. The transmitted record is transmitted to a plurality of data analyzers via a bus-type communication path. Since the communication path is a bus type, the record is transmitted to all the data analyzers connected to the communication path in one transmission from the data storage device. In the data reception / analysis processing 205, the records transmitted from the data storage devices are received by the respective data analysis devices. After receiving the record, the data analyzer prepares to receive the next record, and when the preparation for reception is completed, transmits a signal indicating completion of preparation to the data storage device. In addition, data analysis is performed on records received together with records already received. After the data receiving and analyzing process 205, the process returns to the process 202. In the analysis result tallying process 206, the analysis result tallying device receives the analysis results from the data analysis device and totals them to obtain the entire analysis result. The above is the outline of the analysis method. In this way, a plurality of devices perform data analysis while cooperating with each other. Hereinafter, the processing performed in each device will be described in detail, taking discovery of a feature rule as an example.

【００２０】分析対象データは複数のフィールドからな
るレコードの集合であり、全てのレコードは同数のフィ
ールドを持つ。レコードは分析の対象となる対象物、フ
ィールドは対象物の持つ属性に対応する。商店の顧客情
報を例にとると、１つのレコードは１人の顧客、各フィ
ールドは性別、年齢などの、顧客の属性に対応する。特
徴ルール発見では前処理として、各属性値を少数のカテ
ゴリに変換する。例えば、「年齢」は通常１０〜１００
程度の範囲の値を取り得るが、これを「３５歳以下」、
「３６歳以上５５歳以下」、「５６歳以上」のような少
数のカテゴリに変換する。また、「性別」は、もともと
「男」「女」の２つの値しか取り得ないので、このまま
２つのカテゴリとして用いることが多い。このようにカ
テゴリへの変換を施した分析対象データの例を図３に示
す。The data to be analyzed is a set of records composed of a plurality of fields, and all records have the same number of fields. A record corresponds to an object to be analyzed, and a field corresponds to an attribute of the object. Taking shop customer information as an example, one record corresponds to one customer, and each field corresponds to a customer attribute such as gender or age. In feature rule discovery, each attribute value is converted into a small number of categories as preprocessing. For example, “age” is usually 10 to 100
The value can be in the range of about
It is converted into a small number of categories such as "36 to 55" and "56 and over". In addition, since "sex" can originally take only two values of "male" and "female", it is often used as it is as two categories. FIG. 3 shows an example of the analysis target data which has been converted into the category as described above.

【００２１】特徴ルールは、次のように書き表される。The feature rules are written as follows.

【００２２】「if 性別=男 and 年齢=56以上 then 購入
額=大」すなわち、対象物の属性とカテゴリ化された値
の組からなる if-then ルールである。特徴ルールの if
部に現れる属性を「条件項目」、 then 部に現れる属
性を「結論項目」と呼ぶ。１つの属性が同時に条件項目
と結論項目の両方になることはない。また、特徴ルール
は評価値を持つ。一般に特徴ルールを「if A then B」
と表した時、その評価値は次式で定義される。"If gender = male and age = 56 or more then purchase amount = large", that is, an if-then rule composed of a set of a target attribute and a categorized value. If of feature rule
The attribute appearing in the part is called "condition item", and the attribute appearing in the then part is called "conclusion item". An attribute cannot be both a condition item and a conclusion item at the same time. The feature rule has an evaluation value. In general, set the feature rule to "if A then B"
, The evaluation value is defined by the following equation.

【００２３】[0023]

【数２】P(A)^a log{P(B|A)／P(B)} ここで、P(A)、P(B) はそれぞれ、分析対象データ全体
のうちで条件 A および条件 B を満足するレコードの割
合であり、 P(B|A) は条件 A を満足するレコードのう
ち、条件 A と条件 B の両方を満足するレコードの割
合である。また、指数 a はヒューリスティックに定め
られる正の定数である。また、評価値の別の定義とし
て、次の式を用いる場合もある。P (A) ^ a log {P (B | A) / P (B)} where P (A) and P (B) are the conditions A and the conditions in the whole data to be analyzed, respectively. P (B | A) is the ratio of records that satisfy both conditions A and B among records that satisfy condition A. The index a is a heuristically determined positive constant. Also, the following expression may be used as another definition of the evaluation value.

【００２４】[0024]

【数３】P(A)^a P(B|A)log{P(B|A)／P(B)} 数１、数２のいずれにおいても、ルールに現れる条件を
満たすレコード、および分析対象データ全体のレコード
の数を知ることによって評価値を算出することができ
る。[Equation 3] P (A) ^ a P (B | A) log {P (B | A) / P (B)} In each of Equations (1) and (2), a record satisfying a condition appearing in a rule, and analysis The evaluation value can be calculated by knowing the number of records of the entire target data.

【００２５】特徴ルール発見とは、上記で定義したルー
ル評価値に基づき、評価値の大きい特徴ルールを発見す
る処理である。この時、発見すべき特徴ルールの数の上
限、結論項目となるフィールドとその値、条件項目の候
補となる複数のフィールド、１つの特徴ルールに現れる
条件項目の数の上限が分析者によって与えられているも
のとする。図４、５、６にルール発見の手順を示す。図
４はデータ格納装置における処理手順、図５はデータ分
析装置における処理手順、図６は分析結果集計装置にお
ける処理手順である。分析対象データとして図３に示し
たデータを例にとり、結論項目を「購入額」、その値を
「大」、条件項目の候補を「性別」、「年齢」、「職
業」、条件項目の条件数の上限を２とする。また、「性
別」は「男」、「女」の２値、「年齢」は「３５歳以
下」、「３６歳から５５歳」、「５６歳以上」の３値、
「職業」は「有」、「無」の２値を取り得るものとす
る。The feature rule discovery is a process for finding a feature rule having a large evaluation value based on the rule evaluation value defined above. At this time, the upper limit of the number of feature rules to be discovered, the fields and their values as conclusion items, the plurality of fields as condition item candidates, and the upper limit of the number of condition items appearing in one feature rule are given by the analyst. It is assumed that FIGS. 4, 5, and 6 show the procedure for finding rules. 4 shows a processing procedure in the data storage device, FIG. 5 shows a processing procedure in the data analysis device, and FIG. 6 shows a processing procedure in the analysis result totaling device. Taking the data shown in FIG. 3 as an example of the analysis target data, the conclusion item is “purchase amount”, the value is “large”, and the candidate condition items are “sex”, “age”, “occupation”, and the condition of the condition item. The upper limit of the number is 2. In addition, "sex" is a binary value of "male" and "female", "age" is a ternary value of "35 and under", "36 to 55", "56 and over",
“Occupation” can take two values, “Yes” and “No”.

【００２６】まず、データ格納装置、データ分析装置、
分析結果集計装置の全てにおいて、割り当て設定処理４
０１、５０１、６０１を行う。割り当て設定処理におい
ては、指定された条件項目の候補と、条件数の上限にし
たがって、可能な全ての特徴ルールに対応するカウンタ
を用意する。上記の条件項目候補と条件数の上限を用い
た場合、条件項目の可能な組合せは２３通りであり、図
７に示す２３通りの特徴ルールが可能である。特徴ルー
ルの評価値算出には、前述の P(A)、P(B|A)、P(B) が必
要である。ここで、結論項目となるフィールドとその値
は１つに指定されているため、 P(B) は全ての特徴ルー
ルについて同じである。したがって、各特徴ルールにつ
いて、 P(A)、P(B|A) を知るための２つのカウンタを用
意することになる。これらのカウンタは複数のデータ分
析装置に分配して割り当てられる。First, a data storage device, a data analysis device,
Assignment setting processing 4 in all of the analysis result totaling devices
01, 501 and 601 are performed. In the assignment setting process, counters corresponding to all possible feature rules are prepared according to the designated condition item candidates and the upper limit of the number of conditions. When the above condition item candidates and the upper limit of the number of conditions are used, there are 23 possible combinations of condition items, and the 23 feature rules shown in FIG. 7 are possible. The above-described P (A), P (B | A), and P (B) are required for calculating the evaluation value of the feature rule. Here, since the field and the value which are the conclusion items are specified as one, P (B) is the same for all the feature rules. Therefore, two counters for knowing P (A) and P (B | A) are prepared for each feature rule. These counters are distributed and assigned to a plurality of data analyzers.

【００２７】どのデータ分析装置にどの特徴ルールのカ
ウンタを割り当てるかはデータ分析装置のうちの１つ、
または、分析結果集計装置、またはデータ格納装置のい
ずれかによって決定される。そして、カウンタを割り当
てられたデータ分析装置のそれぞれの識別名、または、
カウンタを割り当てられたデータ分析装置の数がデータ
格納装置へ通知される。各データ分析装置では割り当て
にしたがってカウンタを用意する。Which feature analysis counter is assigned to which data analyzer is one of the data analyzers.
Alternatively, it is determined by either the analysis result totaling device or the data storage device. Then, each identifier of the data analyzer to which the counter is assigned, or
The number of data analyzers to which the counter has been assigned is notified to the data storage device. Each data analyzer prepares a counter according to the assignment.

【００２８】データ格納装置では割り当て設定処理４０
１の後、カウンタを割り当てられた全てのデータ分析装
置と分析結果集計装置から、準備完了の信号を受信する
のを待ち（処理４０２）、準備完了の信号を全て受信し
たら、処理４０３へ進む。処理４０３では、未送信のデ
ータがあるかどうかを確認し、未送信のデータがある場
合はレコード送信処理４０４へ、全てのデータが送信済
みである場合は送信終了処理４０５へ進む。レコード送
信処理４０４では、まだ送信していないデータのうちの
１個のレコードを通信路へ送信し、処理４０２へ戻る。
送信終了処理４０５では、全てのデータを送信し終わっ
たことを示す信号を通信路へ送信する。In the data storage device, the assignment setting processing 40
After 1, the process waits for reception of a preparation completion signal from all data analyzers and analysis result totalization devices to which the counters are assigned (processing 402). In the process 403, it is confirmed whether or not there is untransmitted data. If there is untransmitted data, the process proceeds to the record transmission process 404. If all the data has been transmitted, the process proceeds to the transmission end process 405. In the record transmission process 404, one record of the data not yet transmitted is transmitted to the communication path, and the process returns to the process 402.
In the transmission end process 405, a signal indicating that all data has been transmitted is transmitted to the communication path.

【００２９】以上で、データ格納装置における処理は終
了する。With the above, the processing in the data storage device ends.

【００３０】データ分析装置では割り当て設定処理５０
１の後、分析対象データを受信する準備が完了したこと
を示す信号をデータ格納装置へ送信する（処理５０
２）。データ受信待ち処理５０３では、データ格納装置
からデータを受信するのを待ち、データを受信したら、
処理５０４へ進む。処理５０４においては、受信したデ
ータがデータの終了を示す信号であるかどうかを判定
し、データ終了信号であればルール評価処理５０７へ、
そうでなければカウンタ更新処理５０５へ進む。カウン
タ更新処理５０５では、受信したデータは分析対象デー
タのレコードであるとして、そのフィールドの値を評価
する。そして、フィールドの値が、データ分析装置に割
り当てられた特徴ルールの条件と一致していれば、該当
するカウンタの値を更新する。一般に、１個のレコード
は、複数の特徴ルールの条件と一致し得る。すなわち、
１個のレコードの処理において、複数の特徴ルールのカ
ウンタ更新が行われる。データ受信準備処理５０６で
は、処理５０３で受信したレコードを破棄し、次のレコ
ードを受信する準備をし、処理５０２へ戻る。ルール評
価処理５０７では、データ分析装置に割り当てられたル
ールの評価値をカウンタの値に基づいて算出し、評価値
の大きい順に、発見すべき特徴ルールの数の上限として
指定された数の特徴ルールを取り出す。ただし、評価値
が 0 よりも小さい特徴ルールは取り出されない。した
がって、指定された数よりも少数の特徴ルールしか取り
出されない場合がある。この後、分析結果集計装置から
の指示を待つ処理５０８へ進む。分析結果集計装置から
の指示が終了指示である場合（処理５０９）は、データ
分析装置における処理を終了する。分析結果集計装置か
らの指示が分析結果送信指示である場合（処理５１０）
は、取り出しておいた特徴ルールとその評価値を分析結
果集計装置へ送信し（処理５１１）、指示を待つ処理５
０８へ戻る。In the data analyzer, an assignment setting process 50 is performed.
After step 1, a signal indicating that preparation for receiving the data to be analyzed is completed is transmitted to the data storage device (step 50).
2). In the data reception waiting process 503, the process waits for data to be received from the data storage device.
Proceed to process 504. In the process 504, it is determined whether or not the received data is a signal indicating the end of the data.
Otherwise, the process proceeds to the counter update processing 505. In the counter update processing 505, it is assumed that the received data is a record of the data to be analyzed, and the value of the field is evaluated. If the value of the field matches the condition of the feature rule assigned to the data analyzer, the value of the corresponding counter is updated. In general, one record may match the conditions of multiple feature rules. That is,
In processing one record, the counters of a plurality of feature rules are updated. In the data reception preparation process 506, the record received in the process 503 is discarded, preparations are made for receiving the next record, and the process returns to the process 502. In the rule evaluation process 507, the evaluation values of the rules assigned to the data analyzer are calculated based on the value of the counter, and the number of feature rules specified as the upper limit of the number of feature rules to be discovered is determined in descending order of the evaluation value. Take out. However, feature rules with an evaluation value smaller than 0 are not extracted. Therefore, only a small number of feature rules may be extracted from the specified number. Thereafter, the process proceeds to a process 508 of waiting for an instruction from the analysis result totaling device. If the instruction from the analysis result totaling device is a termination instruction (process 509), the process in the data analysis device is terminated. When the instruction from the analysis result totaling device is an analysis result transmission instruction (process 510)
Transmits the extracted feature rule and its evaluation value to the analysis result totaling device (process 511), and waits for an instruction 5
Return to 08.

【００３１】分析結果集計装置では割り当て設定処理６
０１の後、分析対象データを受信する準備が完了したこ
とを示す信号をデータ格納装置へ送信する（処理６０
２）。データ受信待ち処理６０３では、データ格納装置
からデータを受信するのを待ち、データを受信したら、
処理６０４へ進む。処理６０４においては、受信したデ
ータがデータの終了を示す信号であるかどうかを判定
し、データ終了信号であれば分析結果収集処理６０６
へ、そうでなければ受信準備処理６０５へ進む。受信準
備処理６０５では、データ格納装置から受信したデータ
を破棄し、次のデータを受信する準備をし、処理６０２
へ戻る。分析結果収集処理６０６では、全てのデータ分
析装置へ順に分析結果送信指示を送り、それぞれのデー
タ分析装置で取り出された特徴ルールとその評価値を収
集する。分析結果集計処理６０７では、収集した特徴ル
ールの中から、評価値の大きい順に、発見すべき特徴ル
ールの数の上限として指定された数の特徴ルールを取り
出し、これを特徴ルール発見の結果とする。終了指示処
理６０８では、全てのデータ分析装置へ終了指示を送信
する。以上で、分析結果集計装置における処理を終了す
る。In the analysis result totaling device, assignment setting processing 6
01, a signal indicating that preparation for receiving the analysis target data is completed is transmitted to the data storage device (processing 60).
2). In the data reception waiting process 603, the process waits for data to be received from the data storage device.
Proceed to process 604. In the process 604, it is determined whether or not the received data is a signal indicating the end of the data.
Otherwise, the process proceeds to the reception preparation processing 605. In the reception preparation processing 605, the data received from the data storage device is discarded, and preparations for receiving the next data are made.
Return to In the analysis result collection processing 606, an analysis result transmission instruction is sequentially sent to all the data analyzers, and the feature rules extracted by the respective data analyzers and their evaluation values are collected. In the analysis result totaling process 607, the number of feature rules specified as the upper limit of the number of feature rules to be found are taken out of the collected feature rules in descending order of the evaluation value, and this is set as a feature rule discovery result. . In the end instruction processing 608, an end instruction is transmitted to all data analyzers. Thus, the processing in the analysis result totaling device is completed.

【００３２】図８に本発明の第二の実施の形態の構成を
示す。第一の形態ではバス型の通信路を使用していた
が、第二の形態ではリング型の通信路を使用している。
リング型通信路では、全ての装置は送信端子と受信端子
を持ち、装置の送信端子は別の装置の受信端子と単方向
の通信路を介して接続されている。データ格納装置、デ
ータ分析装置、分析結果集計装置はいずれも、データを
発信する場合は、発信する装置の識別子をデータに付加
して、送信端子からデータを送出する。データを受信す
る場合は、受信端子からデータを受け取る。そして、そ
のデータに付加された識別子が自分のものであれば、そ
のデータを破棄し、識別子が自分のものでなければ、必
要に応じてそのデータを装置内に取り込むとともに、デ
ータと識別子を自分の送信端子から送出する。このよう
に、ある装置から送信されたデータは全ての装置の間を
順に転送されて送信元の装置へ戻り、そこで転送が終了
する。FIG. 8 shows the configuration of the second embodiment of the present invention. In the first embodiment, a bus-type communication path is used, whereas in the second embodiment, a ring-type communication path is used.
In a ring communication channel, all devices have a transmission terminal and a reception terminal, and the transmission terminal of the device is connected to the reception terminal of another device via a unidirectional communication channel. When transmitting data, any of the data storage device, the data analysis device, and the analysis result totaling device adds an identifier of the transmitting device to the data and transmits the data from the transmission terminal. When receiving data, the data is received from the receiving terminal. If the identifier added to the data is your own, the data is discarded. If the identifier is not your own, the data is imported into the device as needed, and the data and the identifier are From the transmission terminal. As described above, data transmitted from a certain device is sequentially transferred between all devices and returns to the transmission source device, where the transfer ends.

【００３３】したがって、バス型の通信路と同様に、送
信元の装置からの１回の送信で、他の全ての装置にデー
タが伝送される。Therefore, as in the case of the bus-type communication path, data is transmitted to all the other devices by one transmission from the transmission source device.

【００３４】図９に本発明の第三の実施の形態の構成を
示す。第二の形態ではスター型の通信路を使用してい
る。スター型通信路では、全ての装置は双方向の通信路
を介して集線装置９０１と接続されている。集線装置は
複数の接続端子を持ち、１つの接続端子において受信し
た信号を他の全ての接続端子から送信する機能を持つ。
したがって、バス型の通信路と同様に、送信元の装置か
らの１回の送信で、他の全ての装置にデータが伝送され
る。FIG. 9 shows the configuration of the third embodiment of the present invention. In the second embodiment, a star-type communication path is used. In the star communication path, all devices are connected to the line concentrator 901 via a bidirectional communication path. The concentrator has a plurality of connection terminals and has a function of transmitting a signal received at one connection terminal from all other connection terminals.
Therefore, as in the case of the bus-type communication path, data is transmitted to all other devices by one transmission from the transmission source device.

【００３５】図１０に本発明の第四の実施の形態の構成
を示す。データ管理装置１００４、複数のデータ分析装
置１００２、分析結果集計装置１００３が１つの共有メ
モリ１００５に接続され、データ管理装置１００４は通
信路を介してデータ格納装置１００１に接続されてい
る。分析対象データはデータ格納装置に格納されてい
る。FIG. 10 shows the configuration of the fourth embodiment of the present invention. A data management device 1004, a plurality of data analysis devices 1002, and an analysis result totaling device 1003 are connected to one shared memory 1005, and the data management device 1004 is connected to a data storage device 1001 via a communication path. The data to be analyzed is stored in the data storage device.

【００３６】共有メモリは、データ区画１００６、カウ
ンタ区画１００７、分析結果区画１００８に分けられて
いる。The shared memory is divided into a data section 1006, a counter section 1007, and an analysis result section 1008.

【００３７】図１１、１２、１３に本実施形態における
特徴ルール発見の手順を示す。図１１はデータ管理装置
における手順、図１２はデータ分析装置における手順、
図１３は分析結果集計装置における手順である。まず、
データ管理装置、データ分析装置、分析結果集計装置の
全てにおいて、割り当て設定処理１１０１、１２０１、
１３０１を行う。割り当て設定処理においては、指定さ
れた条件項目の候補と、条件数の上限にしたがって、可
能な全ての特徴ルールに対応するカウンタを共有メモリ
のカウンタ区画内に用意する。また、特徴ルールのそれ
ぞれについて、そのカウンタ更新処理を担当するデータ
分析装置を決める。どのデータ分析装置にどの特徴ルー
ルのカウンタ更新処理を割り当てるかはデータ分析装置
のうちの１つ、または、分析結果集計装置のいずれかに
よって決定される。また、データ分析装置のそれぞれの
分析結果を書き込む領域を分析結果区画に用意する。ま
た、データ管理装置において、共有メモリのデータ区画
内に複数のレコードバッファ１００９を用意する。１つ
のレコードバッファは１つのレコードを格納できるレコ
ード領域と、カウンタ更新処理を割り当てられた複数の
データ分析装置の１つ１つと対応するフラグ領域からな
る。フラグの１つ１つは「データ無効」、「データ有
効」のどちらかの状態を取る。１つのレコードバッファ
のフラグが全て「データ無効」である場合、そのレコー
ドバッファは「空いている」と言う。初期状態では、全
てのレコードバッファは空いている。また、レコードバ
ッファとは別に、データ区画内にデータ終了フラグ１０
１０を用意する。データ終了フラグは「真」、「偽」の
２つの状態の一方を取ることができ、初期状態は「偽」
である。FIGS. 11, 12, and 13 show the procedure for finding a feature rule in this embodiment. 11 is a procedure in the data management device, FIG. 12 is a procedure in the data analysis device,
FIG. 13 shows a procedure in the analysis result totaling device. First,
In all of the data management device, the data analysis device, and the analysis result totaling device, the assignment setting processes 1101, 1201,
Step 1301 is performed. In the assignment setting process, counters corresponding to all possible feature rules are prepared in the counter section of the shared memory in accordance with the designated condition item candidates and the upper limit of the number of conditions. In addition, for each of the feature rules, a data analyzer that is in charge of the counter update processing is determined. Which data analyzer is assigned the counter update process of which feature rule is determined by one of the data analyzers or the analysis result totaling device. Also, an area for writing each analysis result of the data analyzer is prepared in the analysis result section. In the data management device, a plurality of record buffers 1009 are prepared in the data section of the shared memory. One record buffer includes a record area in which one record can be stored, and a flag area corresponding to each of the plurality of data analyzers to which the counter update processing is assigned. Each of the flags takes a state of "data invalid" or "data valid". If all the flags of one record buffer are “data invalid”, the record buffer is said to be “empty”. In the initial state, all record buffers are empty. In addition, separately from the record buffer, the data end flag 10
Prepare 10 The data end flag can take one of two states of "true" and "false", and the initial state is "false".
It is.

【００３８】データ管理装置では、データ格納装置から
分析対象データのレコードを１個入力する（処理１１０
２）。この時、データ格納装置からデータの終了を示す
信号を受信した場合は処理１１０６へ、そうでなければ
処理１１０４へ進む（処理１１０３）。処理１１０４で
は、共有メモリのレコードバッファを走査し、空いてい
るレコードバッファを１つ探し出す。これが見つからな
かった場合は、見つかるまで走査を繰り返す。空いてい
るレコードバッファが見つかった場合は処理１１０５へ
進む。処理１１０５では、処理１１０４で見つかったレ
コードバッファのレコード領域にデータ格納装置から入
力したレコードを格納し、そのレコードバッファの全て
のフラグに「データ有効」を示す値を設定する。その
後、処理１１０２へ戻る。処理１１０６では、データ区
画のデータ終了フラグを「真」の状態にする。以上でデ
ータ管理装置における処理を終了する。In the data management device, one record of the data to be analyzed is input from the data storage device (step 110).
2). At this time, if a signal indicating the end of the data has been received from the data storage device, the process proceeds to process 1106, otherwise proceeds to process 1104 (process 1103). In step 1104, the record buffer of the shared memory is scanned to find one empty record buffer. If this is not found, the scan is repeated until found. If an empty record buffer is found, the process proceeds to processing 1105. In the process 1105, the record input from the data storage device is stored in the record area of the record buffer found in the process 1104, and all flags of the record buffer are set to values indicating "data valid". Then, the process returns to the process 1102. In process 1106, the data end flag of the data section is set to “true”. Thus, the processing in the data management device ends.

【００３９】データ分析装置では、共有メモリのレコー
ドバッファを走査し、自データ分析装置に対応するフラ
グが「データ有効」であるレコードバッファを１つ探す
（処理１２０２）。これが見つかった場合は処理１２０
３へ、見つからなかった場合は処理１２０５へ進む。処
理１２０３では、処理１２０２で見つかったレコードバ
ッファからレコードを読み込み、そのフィールドの値
が、データ分析装置に割り当てられた特徴ルールの条件
と一致していれば、該当する特徴ルールのカウンタの値
を更新する。処理１２０４では、処理１２０２で見つか
ったレコードバッファ内の、自データ分析装置に対応す
るフラグを「データ無効」に更新する。その後、処理１
２０２へ戻る。処理１２０５では、共有メモリ内のデー
タ終了フラグを調べ、状態が「真」であれば処理１２０
６へ進み、状態が「偽」であれば処理１２０２へ戻る。
処理１２０６では、データ分析装置に割り当てられたル
ールの評価値をカウンタの値に基づいて算出し、評価値
の大きい順に、発見すべき特徴ルールの数の上限として
指定された数の特徴ルールを取り出す。ただし、評価値
が 0 よりも小さい特徴ルールは取り出されない。処理
１２０７では、取り出した特徴ルールとその評価値を共
有メモリの分析結果領域に書き込む。以上でデータ分析
装置における処理を終了する。The data analyzer scans the record buffer of the shared memory and searches for one record buffer whose flag corresponding to its own data analyzer is "data valid" (process 1202). If this is found, process 120
If no, the process proceeds to step 1205. In process 1203, a record is read from the record buffer found in process 1202, and if the value of the field matches the condition of the feature rule assigned to the data analyzer, the value of the counter of the feature rule is updated. I do. In the process 1204, the flag corresponding to the own data analyzer in the record buffer found in the process 1202 is updated to “data invalid”. Then, processing 1
Return to 202. In processing 1205, the data end flag in the shared memory is checked, and if the state is “true”, processing 1205 is executed.
Then, if the status is “false”, the process returns to step 1202.
In processing 1206, the evaluation values of the rules assigned to the data analysis device are calculated based on the values of the counters, and the number of feature rules specified as the upper limit of the number of feature rules to be discovered are extracted in descending order of the evaluation values. . However, feature rules with an evaluation value smaller than 0 are not extracted. In processing 1207, the extracted feature rule and its evaluation value are written in the analysis result area of the shared memory. Thus, the processing in the data analyzer ends.

【００４０】分析結果集計装置では、共有メモリの分析
結果領域を調べ、全てのデータ分析装置の分析結果が分
析結果領域に書き込まれるのを待つ（処理１３０２）。
全てのデータ分析装置の分析結果が揃ったら、分析結果
領域に書き込まれた特徴ルールの中から、評価値の大き
い順に、発見すべき特徴ルールの数の上限として指定さ
れた数の特徴ルールを取り出し、これを特徴ルール発見
の結果とする（処理１３０３）。以上で分析結果集計装
置における処理を終了する。The analysis result totaling apparatus checks the analysis result area of the shared memory, and waits until the analysis results of all the data analyzers are written in the analysis result area (step 1302).
When the analysis results of all data analyzers are completed, the number of feature rules specified as the upper limit of the number of feature rules to be discovered are extracted from the feature rules written in the analysis result area in descending order of evaluation value. This is set as the result of feature rule discovery (process 1303). Thus, the processing in the analysis result totaling device is completed.

【００４１】なお、以上で説明した第四の実施の形態は
カウンタを共有メモリ内に置く形態であったが、データ
分析装置のそれぞれが局所メモリ１４０１を持っている
場合は、カウンタを局所メモリ内に置くこともできる
（図１４）。In the fourth embodiment described above, the counter is stored in the shared memory. However, when each of the data analyzers has the local memory 1401, the counter is stored in the local memory. (FIG. 14).

【００４２】[0042]

【発明の効果】本発明によれば、大量のデータを多面的
に分析するデータマイニングなどのデータ分析処理を複
数の処理装置を並列に動作させて実行する場合に、分析
対象データの転送量、転送回数を少なく抑え、かつ、各
々の処理装置において必要となる主記憶の量を少なく抑
えることができる。また、大量データ分析方法に適する
ように設計されたデータ格納方法を特に必要としないの
で、一般のデータベースシステムに格納されたデータを
分析対象とすることができる。According to the present invention, when a plurality of processing devices are operated in parallel to execute data analysis processing such as data mining for analyzing a large amount of data from various aspects, the amount of data to be analyzed is reduced. The number of transfers can be reduced, and the amount of main memory required in each processing device can be reduced. Further, since a data storage method designed so as to be suitable for a mass data analysis method is not particularly required, data stored in a general database system can be analyzed.

[Brief description of the drawings]

【図１】第１の実施形態の構成図。FIG. 1 is a configuration diagram of a first embodiment.

【図２】分析方法の概要を示す流れ図。FIG. 2 is a flowchart showing an outline of an analysis method.

【図３】分析対象データの例を示す図。FIG. 3 is a diagram showing an example of analysis target data.

【図４】第１の実施形態のデータ格納装置における処理
を示す流れ図。FIG. 4 is a flowchart showing processing in the data storage device of the first embodiment.

【図５】第１の実施形態のデータ分析装置における処理
を示す流れ図。FIG. 5 is a flowchart showing processing in the data analyzer of the first embodiment.

【図６】第１の実施形態の分析結果集計装置における処
理を示す流れ図。FIG. 6 is a flowchart showing processing in the analysis result totaling device of the first embodiment.

【図７】全ての可能な特徴ルールを列挙した図。FIG. 7 is a diagram listing all possible feature rules.

【図８】第２の実施形態の構成図。FIG. 8 is a configuration diagram of a second embodiment.

【図９】第３の実施形態の構成図。FIG. 9 is a configuration diagram of a third embodiment.

【図１０】第４の実施形態の構成図。FIG. 10 is a configuration diagram of a fourth embodiment.

【図１１】第４の実施形態のデータ管理装置における処
理を示す流れ図。FIG. 11 is a flowchart showing processing in the data management device of the fourth embodiment.

【図１２】第４の実施形態のデータ分析装置における処
理を示す流れ図。FIG. 12 is a flowchart showing processing in the data analyzer of the fourth embodiment.

【図１３】第４の実施形態の分析結果集計装置における
処理を示す流れ図。FIG. 13 is a flowchart showing processing in the analysis result totaling apparatus of the fourth embodiment.

【図１４】第４の実施形態において、局所メモリにカウ
ンタを置く構成図。FIG. 14 is a configuration diagram in which a counter is placed in a local memory in the fourth embodiment.

[Explanation of symbols]

１０１…データ格納装置、１０２…データ分析装置、１
０３…分析結果集計装置、１０４…バス型通信路、１０
０４…データ管理装置、１００５…共有メモリ、１０
０６…共有メモリ内のデータ区画、１００７…共有メモ
リ内のカウンタ区画、１００８…共有メモリ内の分析結
果区画、１００９…レコードバッファ、１０１０…終了
フラグ、１４０１…局所メモリ内のカウンタ。101: data storage device, 102: data analysis device, 1
03: Analysis result totaling device, 104: Bus type communication path, 10
04: data management device, 1005: shared memory, 10
06: data section in shared memory, 1007: counter section in shared memory, 1008: analysis result section in shared memory, 1009: record buffer, 1010: end flag, 1401: counter in local memory.

フロントページの続き (72)発明者伊藤幸康神奈川県横浜市戸塚区戸塚町5030番地株式会社日立製作所ソフトウェア事業部内Ｆターム(参考） 5B075 KK02 PQ05 QS05 Continued on the front page (72) Inventor Yukiyasu Ito 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa Prefecture F-term in the Software Division, Hitachi, Ltd. F-term (reference) 5B075 KK02 PQ05 QS05

Claims

[Claims]

In data analysis using a plurality of data analysis means for data stored in one or a plurality of data storage means, the plurality of data analysis means may store the same data from the data storage means. And performing analysis in each of the plurality of data analysis means, and combining the analysis results in each of the plurality of data analysis means to obtain an overall analysis result.

2. The data analysis method according to claim 1, wherein the plurality of data analysis units share the analysis target data output once by the data storage unit.

3. The method according to claim 1, wherein a plurality of data analysis units use a shared storage unit, and the plurality of data analysis units output respective analysis results to the shared storage unit. 3. The data analysis method according to 2.

4. A storage medium storing a computer program for executing the parallel analysis of variance method according to claim 1, 2 or 3 on a computer.