CN103258049A

CN103258049A - Association rule mining method based on mass data

Info

Publication number: CN103258049A
Application number: CN2013102012225A
Authority: CN
Inventors: 杨勇; 王伟
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2013-05-27
Filing date: 2013-05-27
Publication date: 2013-08-21

Abstract

The invention discloses an association rule mining method based on massive data, and relates to the technical field of information processing. Specifically, the input data is divided according to the records, the frequent 1-itemset table is constructed, and then the data is projected on the frequent 1-itemset, the projected data is stored in the established grouping table, and the local frequent pattern tree is constructed, and then Mining data association rules for frequent pattern trees. In the process of mining frequent pattern trees, the pruning strategy is adopted to reduce the number of iterations of frequent pattern tree mining, and each computing node only processes a part of data records, which solves the problem of Massive data cannot be read into memory by a single machine for processing; in addition, multiple nodes participate in parallel processing, effectively improving processing efficiency.

Description

A Method of Mining Association Rules Based on Massive Data

技术领域technical field

本发明涉及数据挖掘领域，尤其涉及数据关联规则挖掘实现方法。The invention relates to the field of data mining, in particular to a method for realizing data association rule mining.

背景技术Background technique

随着信息技术的迅猛发展，要从日益庞大和复杂的数据中发现有价值的信息和知识，达到为决策服务的目的，己成为非常艰巨的任务。数据挖掘技术在此背景下应运而生。关联规则挖掘是数据挖掘中的一个重要分支，也是目前应用最广的一种数据挖掘类型。With the rapid development of information technology, it has become a very difficult task to discover valuable information and knowledge from increasingly large and complex data to serve the purpose of decision-making. Data mining technology came into being in this context. Association rule mining is an important branch of data mining, and it is also the most widely used type of data mining.

数据关联规则的挖掘目的是发现在大量数据项之间存在的值得关注的关联或相关关系，典型应用是零售业的购物篮分析。所谓购物篮分析指对数据进行关联规则研究，发现交易数据库中不同商品之间的联系，有助于找出顾客购买行为的模式。例如，如果面包和牛奶经常被顾客同时购买，则把它们摆放在一起有助于增加两种商品的销售量。为了衡量规则的重要程度，关联规则通常采用支持度和可信度作为度量标准。支持度可以表示商品在超市销售中的重要程度，可信度反映了商品之间的关联程度。如果在购买面包的交易中，有60%的交易既购买了面包又购买了牛奶，则称关联规则“面包牛奶”（表示如果购买面包则购买牛奶）的可信度为60%。The purpose of mining data association rules is to discover the noteworthy associations or correlations among a large number of data items, and the typical application is shopping basket analysis in the retail industry. The so-called shopping basket analysis refers to the study of the association rules of the data, and discovers the connection between different commodities in the transaction database, which helps to find out the pattern of customer purchasing behavior. For example, if bread and milk are often purchased by customers at the same time, placing them together can help increase sales of both items. In order to measure the importance of rules, association rules usually use support and credibility as metrics. The degree of support can indicate the importance of commodities in supermarket sales, and the degree of credibility reflects the degree of correlation between commodities. If 60% of the transactions that buy bread buy both bread and milk, then the association rule "bread and milk" (meaning that if you buy bread, you buy milk) has 60% confidence.

关联规则（表示A与B同时存在）在事务数据库D中的支持度，可用概率表示，而关联规则在事务数据库D中的可信度，是在事务数据库D中的那些包含A的事务中，B也同时出现的概率，即条件概率。The support of association rules (meaning that A and B exist simultaneously) in transaction database D can be represented by probability, and the credibility of association rules in transaction database D is in those transactions containing A in transaction database D, The probability that B also occurs at the same time is the conditional probability.

一个项集X在事务数据库D中的支持度，是事务数据库D中包含X的事务占事务总数的百分比，即概率P(X)。对于一个项集X，如果其支持度大于或等于预先给定的支持度阈值min_s，则称X为频繁项集。The support degree of an itemset X in the transaction database D is the percentage of transactions containing X in the transaction database D to the total number of transactions, that is, the probability P(X). For an itemset X, if its support is greater than or equal to the pre-specified support threshold min_s, X is called a frequent itemset.

FP-growth（频繁模式增长算法）算法的基本思想是利用树结构对事务进行压缩，同时保留事务中属性之间的关系。该算法的重要步骤是频繁模式树的建立和挖掘，需要扫描两次事务集：第一次扫描事务集D，找出频繁1-项集集合，并且将集合按照支持度计数降序排列；第二次扫描事务集D，以“Null”为根节点，构建频繁模式树。为了快速遍历频繁模式树，创建了项头表T，该表中每一行表示一个频繁项及其出现的频率，并且设置指针指向它在频繁模式树中对应的节点，最后对频繁模式树进行递归挖掘找出所有的频繁项集。具体算法流程如下：The basic idea of the FP-growth (Frequent Pattern Growth Algorithm) algorithm is to use the tree structure to compress transactions while retaining the relationship between attributes in the transaction. The important step of the algorithm is the establishment and mining of frequent pattern trees, which need to scan transaction sets twice: the first scan transaction set D, find frequent 1-itemset sets, and arrange the sets in descending order of support count; the second The transaction set D is scanned twice, and the frequent pattern tree is constructed with "Null" as the root node. In order to quickly traverse the frequent pattern tree, an item header table T is created, each row in the table represents a frequent item and its frequency of occurrence, and the pointer is set to point to its corresponding node in the frequent pattern tree, and finally the frequent pattern tree is recursed Mining to find all frequent itemsets. The specific algorithm flow is as follows:

(1)创建原始频繁模式树(1) Create the original frequent pattern tree

1)扫描事务集合T，找出支持度计数满足条件的项，并将这些项组合成频繁1-项集合L，基于支持度计数将L降序排列得到1-项集列表。1) Scan the transaction set T, find out the items whose support count meets the condition, and combine these items into a frequent 1-item set L, and sort L in descending order based on the support count to get the 1-item set list.

2)创建原始频繁模式树，并以“Null”为根节点。2) Create an original frequent pattern tree with "Null" as the root node.

3)创建项头表，为了方便遍历原始频繁模式树，项头表中每一行表示一个频繁项，并且有对应指针指向它在频繁模式树中的节点。3) Create an item header table. For the convenience of traversing the original frequent pattern tree, each row in the item header table represents a frequent item, and has a corresponding pointer pointing to its node in the frequent pattern tree.

4)遍历一次事务集合T，根据1-项集列表将T中所有事务的项次序进行调整，为每个调整后项次序的事务创建一个事务分支。如果分支可以共享路径则共享，并且在各个节点上记录共享事务数目。4) Traverse the transaction set T once, adjust the item order of all transactions in T according to the 1-itemset list, and create a transaction branch for each transaction with the adjusted item order. Branches share paths if they can, and record the number of shared transactions on each node.

(2)在原始频繁模式树上递归地找出所有的频繁项集(2) Recursively find all frequent itemsets on the original frequent pattern tree

1)赋后缀模式a的初始值为根节点Null，即a=Null。在频繁模式树上采用递归的方法搜索频繁项集，如果频繁模式树只有一个分支，那么分支路径上的节点的一个组合就是一个前缀模式b。剔除分支路径上的不满足最小支持度的节点，剩余的节点值所组成的任何集合b与后缀模式a取并，可获得所有对应的频繁项集。其中b_i为一前缀模式，否则增长a，a=a∪{E_i}(其中E_i为1-项集表中的最后一项，即支持度计数的最小项)。然后构造后缀模式的条件模式基与条件频繁模式树。(其中a的条件模式基，指的是频繁模式树中以E_i为叶子节点的所有分支。A的条件频繁模式树，指得是以a的条件模式基为事务按照步骤1中的方式所创建的新的频繁模式树)1) The initial value of the suffix pattern a is the root node Null, that is, a=Null. A recursive method is used to search for frequent itemsets on the frequent pattern tree. If the frequent pattern tree has only one branch, then a combination of nodes on the branch path is a prefix pattern b. Eliminate the nodes on the branch path that do not meet the minimum support, and merge any set b composed of the remaining node values with the suffix pattern a to obtain all corresponding frequent itemsets. Where b _i is a prefix pattern, otherwise increase a, a=a∪{E _i } (where E _i is the last item in the 1-itemset table, which is the minimum item of support count). Then construct the conditional pattern base and conditional frequent pattern tree of the suffix pattern. (The conditional pattern base of a refers to all branches with E _i as the leaf node in the frequent pattern tree. The conditional frequent pattern tree of A refers to the conditional pattern base of a as the transaction according to the method in step 1. The new frequent pattern tree created)

2)在条件频繁模式树上采用相同方法递归地搜索频繁项集(后缀模式a此时为{E_i})。2) Use the same method to recursively search for frequent itemsets on the conditional frequent pattern tree (the suffix pattern a is {E _i } at this time).

3)对每个最大频繁项集，取其所有子集，其中每个子集为一频繁项集。3) For each maximum frequent itemset, take all its subsets, and each subset is a frequent itemset.

上述FP-growth（频繁模式增长算法）数据挖掘方法，对于少量数据记录，可以方便地在单机上实现。但对于海量的数据记录进行数据关联规则的挖掘时，一方面由于单机内存容量有限，不可能读入海量的数据；另一方面，由于数据挖掘过程中需要处理的数据记录太多，处理时间很长，在实际的数据业务应用中，效率很低。The above-mentioned FP-growth (Frequent Pattern Growth Algorithm) data mining method can be conveniently implemented on a single computer for a small amount of data records. However, when mining data association rules for massive data records, on the one hand, due to the limited memory capacity of a single machine, it is impossible to read massive data; Long, in the actual data business application, the efficiency is very low.

因此，对于实际应用中海量数据的关联规则挖掘，如何有效地提升处理效率是数据关联规则挖掘中需要加以解决的一个主要问题。Therefore, for association rule mining of massive data in practical applications, how to effectively improve processing efficiency is a major problem that needs to be solved in data association rule mining.

云计算具有海量的存储能力和可弹性变化的计算能力，因此基于云计算进行海量数据挖掘和获取关联规则，成为解决传统集中式数据挖掘对海量数据动态增长的不适应性的一种高效、可信途径。Cloud computing has massive storage capacity and elastically variable computing power. Therefore, massive data mining and association rule acquisition based on cloud computing has become an efficient and reliable solution to the inadaptability of traditional centralized data mining to the dynamic growth of massive data. letter path.

发明内容Contents of the invention

本发明针对现有技术存在的海量数据关联规则挖掘问题，对经典FP-growth算法的频繁模式树的结构和挖掘过程进行了深入分析和研究，分析了频繁模式树单路径和多路径的挖掘方法，提出了一个剪枝方法，并提出了剪枝频繁模式树算法；可以减少部分分支的迭代次数。然后利用云计算的MapReduce编程技术，对剪枝频繁模式树算法的各个步骤并行化，提高了剪枝频繁模式树算法对数据的处理能力和效率。Aiming at the problem of mining association rules of massive data existing in the prior art, the present invention conducts in-depth analysis and research on the structure and mining process of the frequent pattern tree of the classic FP-growth algorithm, and analyzes the single-path and multi-path mining methods of the frequent pattern tree , a pruning method is proposed, and a pruning frequent pattern tree algorithm is proposed; the number of iterations of some branches can be reduced. Then, using the MapReduce programming technology of cloud computing, the steps of the pruning frequent pattern tree algorithm are parallelized, which improves the data processing ability and efficiency of the pruning frequent pattern tree algorithm.

本发明解决上述技术问题的技术方案是：提出一种海量数据的数据关联规则挖掘实现方法，包括如下步骤：The technical solution of the present invention to solve the above-mentioned technical problems is: to propose a data association rule mining implementation method for massive data, including the following steps:

1）将输入数据按照记录进行分割，并将分割后的数据分发给多个计算节点，由各计算节点并行统计1-项集的频率，按照1-项集频率排序的结果，构建频繁1-项集表，删除小于给定阈值的项集，生成最终的频繁1-项集表；1) Divide the input data according to the records, and distribute the divided data to multiple computing nodes, each computing node counts the frequency of 1-itemsets in parallel, sorts the results according to the frequency of 1-itemsets, and constructs frequent 1-itemsets Itemset table, delete itemsets smaller than a given threshold, and generate the final frequent 1-itemset table;

2）根据频繁1-项集表对频繁1-项集进行分组，然后对频繁1-项集进行数据投影，把得到投影后的数据存储在建立的分组表中；2) Group frequent 1-itemsets according to the frequent 1-itemset table, then perform data projection on frequent 1-itemsets, and store the projected data in the established grouping table;

3）由各节点对分组表中的数据进行统计，构建本地频繁模式树，然后对频繁模式树进行数据关联规则挖掘，其中在对频繁模式树进行遍历挖掘时，如果项集i在频繁模式树中存在前缀路径集合A，B并且满足

，那么前缀路径集合中长路径集合B中的项集i就可以剪枝和短路径集合A合并。3) Each node counts the data in the grouping table, builds a local frequent pattern tree, and then mines the data association rules of the frequent pattern tree. When traversing the frequent pattern tree, if the item set i is in the frequent pattern tree There exists a set of prefix paths A, B in and satisfies

, then the item set i in the long path set B in the prefix path set can be pruned and merged with the short path set A.

假设集合A中i出现的频率为m，集合B中i出现的频率为n，最小支持度为q，且m≥q,n<q。未剪枝的频繁模式树得到i的条件模式基为{{A:m},{B:n}}，产生的频繁模式为{Ai:m+n}。剪枝以后的频繁模式树得到i的条件模式基为{A:m+n}，产生的频繁模式为{Ai:m+n}。剪枝后包含i的频繁模式和未剪枝包含i的频繁模式相等。因此，集合B中的i就可以剪枝和短路径集合A合并。Suppose the frequency of i in set A is m, the frequency of i in set B is n, the minimum support is q, and m≥q, n<q. The conditional pattern base of i obtained from the unpruned frequent pattern tree is {{A:m},{B:n}}, and the resulting frequent pattern is {Ai:m+n}. After pruning the frequent pattern tree, the conditional pattern base of i is {A:m+n}, and the resulting frequent pattern is {Ai:m+n}. The frequent pattern containing i after pruning is equal to the frequent pattern containing i without pruning. Therefore, i in set B can be pruned and merged with short path set A.

4）对各个子树进行频繁模式的挖掘，然后合并由各子树挖掘得到的频繁模式，得到最终的频繁模式。4) Mining the frequent patterns of each subtree, and then merging the frequent patterns mined from each subtree to obtain the final frequent pattern.

进一步，所述步骤1）具体包括如下步骤：Further, the step 1) specifically includes the following steps:

11）将数据对象按记录均匀划分，使每组数据包含平衡的数据对象；11) Divide data objects evenly by records, so that each group of data contains balanced data objects;

12）在每组数据中，统计每一个项集出现的次数；12) In each set of data, count the number of occurrences of each item set;

13）根据项集出现的频率进行排序，然后和给定的阈值（绝对支持度的计数）进行对比，小于给定阈值的项集删除，最后建立一个Hash表来存储频繁1-项集，该表命名为频繁1-项集表。13) Sort according to the frequency of item sets, and then compare with a given threshold (absolute support count), delete itemsets smaller than the given threshold, and finally create a Hash table to store frequent 1-itemsets, the The table is named frequent 1-itemset table.

进一步，所述步骤2）具体包括如下步骤：Further, the step 2) specifically includes the following steps:

21）根据频繁1-项集表，对频繁1-项集进行数据分组，分组后的数据进行源数据的投影，建立Hash表把投影后的数据存储起来，存储的表命名分组表；21) According to the frequent 1-itemset table, group the frequent 1-itemset data, project the source data on the grouped data, create a Hash table to store the projected data, and name the stored table as the grouping table;

22）根据集群的节点的个数设置分组表大小，分组表大小等于MapReduce中设置的Map个数。22) Set the size of the grouping table according to the number of nodes in the cluster, and the size of the grouping table is equal to the number of Maps set in MapReduce.

进一步，所述步骤4）具体包括如下步骤：Further, the step 4) specifically includes the following steps:

41）对剪枝之后的频繁模式树进行频繁模式的挖掘；41) Mining frequent patterns on the frequent pattern tree after pruning;

42）合并所有的频繁模式。42) Merge all frequent patterns.

进一步，步骤13）中，生成频繁1-项集，这里的频繁1-项集的数据结构是Hash形式，每个Hash数据由q个数据对象进行单向Hash计算获得，每个数据对象参与t+1次Hash计算，获得q(t+1)个Hash数据。Further, in step 13), frequent 1-itemsets are generated, where the data structure of frequent 1-itemsets is in the form of Hash, and each Hash data is obtained by one-way Hash calculation of q data objects, and each data object participates in t +1 Hash calculation, get q(t+1) Hash data.

进一步，所述步骤21）中，分组表的数据结构是Hash形式，分组表存储分组后的项集，以及项集所在数据的行号；记录数据所在记录，去投影源数据，把数据分到对应的组中。Further, in the step 21), the data structure of the grouping table is in Hash form, and the grouping table stores the grouped item sets and the row number of the data where the item sets are located; record the records where the data is located, project the source data, and divide the data into in the corresponding group.

本发明提供的数据关联规则挖掘方法中，采用了多个计算节点并行地参与数据项统计过程再合并统计结果，加快了生成频繁1-项集的速度；并通过数据分组，然后进行数据的分发，最后通过分发的数据集到计算节点上挖掘出关联规则，其中，每个节点在对频繁模式树挖掘过程采用了剪枝策略，减少了数据挖掘的迭代次数，提高了处理效率。同时每个计算节点仅处理一部分数据记录，解决了海量数据无法由单机读入内存进行处理的问题。本发明可用于数据挖掘的关联规则提取中；适用于数据处理在商业、企业、政府部门及科学研究等领域中；也适用于其他需要提取数据的关联性的各种场合。In the data association rule mining method provided by the present invention, multiple computing nodes are used to participate in the data item statistical process in parallel and then merge the statistical results, which speeds up the generation of frequent 1-item sets; and through data grouping, and then data distribution , and finally mine the association rules through the distributed data sets on the computing nodes, in which each node adopts the pruning strategy in the process of frequent pattern tree mining, which reduces the number of iterations of data mining and improves the processing efficiency. At the same time, each computing node only processes a part of data records, which solves the problem that massive data cannot be read into memory by a single machine for processing. The invention can be used in the extraction of association rules of data mining; it is suitable for data processing in the fields of commerce, enterprises, government departments and scientific research; it is also applicable to various occasions that need to extract the association of data.

本发明的其他优点、目标，和特征在某种程度上将在随后的说明书中进行阐述，并且在某种程度上，基于对下文的考察研究对本领域技术人员而言将是显而易见的，或者可以从本发明的实践中得到教导。Other advantages, objects, and features of the present invention will be set forth in the ensuing description to some extent, and to some extent, will be obvious to those skilled in the art based on the investigation and research below, or can be Learn from the practice of the invention.

附图说明Description of drawings

为了使本发明的目的、技术方案和优点更加清楚，下面将结合附图对本明作进一步的详细描述：In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings:

图1为表一数据集建立的频繁模式树结构示意图；Figure 1 is a schematic diagram of the frequent pattern tree structure established by the data set in Table 1;

图2为本发明流程图；Fig. 2 is a flowchart of the present invention;

图3为并行的剪枝频繁模式树算法框架图；Fig. 3 is the frame diagram of parallel pruning frequent pattern tree algorithm;

图4为云计算平台结构示意图。Fig. 4 is a schematic structural diagram of a cloud computing platform.

具体实施方式Detailed ways

下面结合附图，对本发明实施例提供的数据关联规则挖掘实现方法进行详细阐述。The method for implementing data association rule mining provided by the embodiment of the present invention will be described in detail below in conjunction with the accompanying drawings.

图1本发明实施例提供的频繁模式树进行剪枝的结构示意图，包括：Fig. 1 is a schematic structural diagram of frequent pattern tree pruning provided by an embodiment of the present invention, including:

对源数据进行频繁1-项集的统计，得到条件模式基，构建好频繁模式树，挖掘频繁模式树的过程采用一个优化方法，具体步骤参见图2，包括以下步骤：Perform frequent 1-itemset statistics on the source data to obtain the conditional pattern base, construct the frequent pattern tree, and adopt an optimization method in the process of mining the frequent pattern tree. See Figure 2 for specific steps, including the following steps:

将输入数据进行分割，并分发给多个计算节点，由各计算节点并行统计1-项集的频率，根据1-项集频率排序，并删除小于给定阈值的1-项集，构建频繁1-项集表；根据频繁1-项集表对频繁1-项集进行分组建立的分组表，然对频繁1-项集进行数据投影，把投影后的数据存储在分组表中；由各计算节点对分组表中的数据进行统计，构建本地频繁模式树，然后对频繁模式树各子树进行数据关联规则挖掘，如果项集i在频繁模式树中存在前缀路径集合A，B，并且满足，则集合B中的项集i就可以剪枝，并和集合A合并；对各频繁模式子树进行数据关联规则挖掘，合并由各频繁模式子树挖掘得到的频繁模式。Divide the input data and distribute it to multiple computing nodes. Each computing node counts the frequency of 1-itemsets in parallel, sorts them according to the frequency of 1-itemsets, and deletes 1-itemsets smaller than a given threshold to construct frequent 1-itemsets. -itemset table; according to the frequent 1-itemset table to group frequent 1-itemsets to establish a grouping table, then perform data projection on frequent 1-itemsets, and store the projected data in the grouping table; calculated by each The node counts the data in the grouping table, builds a local frequent pattern tree, and then mines data association rules for each subtree of the frequent pattern tree. If item set i has prefix path sets A and B in the frequent pattern tree, and satisfies , then item set i in set B can be pruned and merged with set A; carry out data association rule mining on each frequent pattern subtree, and merge frequent patterns mined from each frequent pattern subtree.

将数据对象按记录均匀划分，使每组数据包含平衡的数据对象；在每组数据中，统计每一个项集出现的次数；根据项集出现的频率进行排序，给定阈值为绝对支持度的计数，频繁1-项集表为一个存储频繁1-项集的Hash表。其中，Hash表中每个Hash数据由q个数据对象进行单向Hash计算获得，每个数据对象参与t+1次Hash计算，获得q(t+1)个Hash数据。Divide the data objects evenly according to the records, so that each group of data contains balanced data objects; in each group of data, count the number of occurrences of each item set; sort according to the frequency of occurrence of the item set, and the given threshold is absolute support Counting, frequent 1-itemset table is a Hash table that stores frequent 1-itemsets. Wherein, each Hash data in the Hash table is obtained by performing one-way Hash calculation on q data objects, and each data object participates in t+1 Hash calculations to obtain q(t+1) Hash data.

把分组后的数据进行源数据的投影，建立Hash表把投影后的数据存储起来，该表命名为分组表；根据计算节点个数设置分组表大小，分组表大小等于并行编程模式MapReduce设置的函数Map个数。分组表的数据结构是Hash形式，存储分组后的项集，以及项集所在数据的行号；记录数据所在的行号，去投影源数据，把数据分到对应的组中。Project the grouped data to the source data, create a Hash table to store the projected data, and name the table as a grouping table; set the size of the grouping table according to the number of computing nodes, and the size of the grouping table is equal to the function set by the parallel programming mode MapReduce Number of maps. The data structure of the grouping table is in Hash form, which stores the grouped itemset and the row number of the data where the itemset is located; records the row number where the data is located, projects the source data, and divides the data into corresponding groups.

对构建好的频繁模式树，遍历其中的每一个项集，如果项集i在某一个路径上是非频繁的，假设项集i在频繁模式树中存在前缀路径集合A、B，并且满足

，那么前缀路径集合中长路径集合B中的项集i就可以剪枝和短路径集合A合并，For the constructed frequent pattern tree, traverse each item set in it, if the item set i is infrequent on a certain path, assuming that the item set i has prefix path sets A and B in the frequent pattern tree, and satisfies

, then the item set i in the long path set B in the prefix path set can be pruned and merged with the short path set A,

具体为：假设集合A中i出现的频率为m，集合B中i出现的频率为n，最小支持度为q，且m≥q,n<q。未剪枝的频繁模式树得到i的条件模式基为{{A:m},{B:n}}，产生的频繁模式为{Ai:m+n}。剪枝以后的频繁模式树得到i的条件模式基为{A:m+n}，产生的频繁模式为{Ai:m+n}。剪枝后包含i的频繁模式和未剪枝包含i的频繁模式相等。因此，集合B中的i就可以剪枝和短路径集合A合并。Specifically: Assume that the frequency of i in set A is m, the frequency of i in set B is n, the minimum support is q, and m≥q, n<q. The conditional pattern base of i obtained from the unpruned frequent pattern tree is {{A:m},{B:n}}, and the resulting frequent pattern is {Ai:m+n}. After pruning the frequent pattern tree, the conditional pattern base of i is {A:m+n}, and the resulting frequent pattern is {Ai:m+n}. The frequent pattern containing i after pruning is equal to the frequent pattern containing i without pruning. Therefore, i in set B can be pruned and merged with short path set A.

对于多路径的项集来说，有的项集存在某个路径上的支持度小于给定的阈值，因此可以采用上述方法剪枝。剪枝后如果是单路径的就可以归为单路径的情况下去考虑。这样可以减少项集挖掘的递归次数，提高挖掘效率。For multi-path itemsets, some itemsets have a support degree on a certain path that is less than a given threshold, so the above method can be used to prune. After pruning, if it is a single path, it can be considered as a single path. This can reduce the recursion times of itemset mining and improve mining efficiency.

以表1事务数据集D为例来说明剪枝方法的过程。假如最小支持度为2，在建立图1的频繁模式树后，可以得到该频繁模式树对应的HeaderTable（项头表）信息，通过用HeaderTable中节点链的信息，查找各个项的各个节点前缀路径。Take the transaction data set D in Table 1 as an example to illustrate the process of the pruning method. If the minimum support is 2, after the frequent pattern tree in Figure 1 is established, the HeaderTable (item header table) information corresponding to the frequent pattern tree can be obtained, and the prefix path of each node of each item can be found by using the information of the node chain in the HeaderTable .

表1数据集DTable 1 Dataset D

从图1可以看出，I5项的一个前缀路径是A={I2，I1}，I5项的另外一个前缀路径是B={I2，I1，I3}，A是B的真子集，满足上面提出的剪枝方法，所以剪枝I5后的频繁模式树，I5就从原来的多路径变为单路径，直接进行频繁项的挖掘，减少了挖掘的递归次数。It can be seen from Figure 1 that one prefix path of I5 item is A={I2, I1}, another prefix path of I5 item is B={I2, I1, I3}, and A is a proper subset of B, satisfying the above mentioned Therefore, after pruning the frequent pattern tree of I5, I5 changes from the original multi-path to single-path, and directly mines frequent items, reducing the number of recursive mining.

图3为本发明实施例提供的并行的剪枝频繁模式树算法步骤流程图，包括如下步骤：Fig. 3 is the flow chart of parallel pruning frequent pattern tree algorithm steps provided by the embodiment of the present invention, including the following steps:

剪枝频繁模式树算法，在串行算法上表现了一定的优势。通过构造频繁模式树来寻找频繁模式，然而在数据集达到一定规模时，构造基于内存的频繁模式树是不现实的。因此，提出了基于MapReduce并行的剪枝频繁模式树算法解决上述问题。该算法在剪枝频繁模式树算法的基础上，利用MapReduce编程模型，对剪枝频繁模式树算法的各个步骤并行化。总体分成四个步骤：Pruning frequent pattern tree algorithm shows certain advantages in serial algorithm. Find frequent patterns by constructing frequent pattern trees. However, when the data set reaches a certain scale, it is unrealistic to construct memory-based frequent pattern trees. Therefore, a parallel pruning frequent pattern tree algorithm based on MapReduce is proposed to solve the above problems. Based on the pruning frequent pattern tree algorithm, the algorithm uses the MapReduce programming model to parallelize each step of the pruning frequent pattern tree algorithm. Overall divided into four steps:

频繁1-项集的频率统计；利用频繁1-项集的频率统计结果建立一个频繁1-项集表，按照频繁1-项集表对数据进行分组，这样把数据分成了n个部分；对分解后的数据进行挖掘关联规则；汇总最后的频繁模式，即M1，M2，M3，M4。Frequency statistics of frequent 1-itemsets; use the frequency statistics of frequent 1-itemsets to create a frequent 1-itemset table, and group the data according to the frequent 1-itemset table, thus dividing the data into n parts; The decomposed data is used to mine association rules; the final frequent patterns are summarized, namely M1, M2, M3, M4.

M1阶段是使用一个MapReduce任务来进行所有项集的频率统计。当输入一个分片后，首先调用Map函数，按照项集的记录逐行将项集分割成pair<key=num，String>；接着对String按照项集的间隔进行分割，形成pair<key，value=Ti>，其中，key为单个的项，value为项所在的事务；然后对Map分解后的结果进行Reduce（MapReduce模式的一种函数）操作，把key所在的事务数全部查找出来，并进行频率统计；最后将频率统计的结果按照设定的最小支持度进行筛选，把筛选的结果按照项集的频度从高到低的顺序存入频繁1-项集表中。The M1 stage uses a MapReduce task to perform frequency statistics of all itemsets. When a fragment is input, the Map function is first called to divide the item set into pair<key=num, String> line by line according to the records of the item set; then the String is divided according to the interval of the item set to form a pair<key, value= Ti>, where the key is a single item, and the value is the transaction where the item is located; then the Reduce (a function of the MapReduce mode) operation is performed on the result after the Map is decomposed, and the number of transactions where the key is located is found out, and the frequency Statistics; finally, the results of frequency statistics are screened according to the set minimum support, and the screened results are stored in the frequent 1-itemset table in order of the frequency of itemsets from high to low.

M2阶段是为了达到负载均衡的效果，并且保证每组相对独立，以便后续处理更方便并且保证挖掘结果的完备性，要对数据进行平衡分组。该步骤首先根据F-list表中的事务投影数据来分组，然后创建一个分组表，最后将分组后的数据存放该表中。The M2 stage is to achieve the effect of load balancing and to ensure that each group is relatively independent, so that the subsequent processing is more convenient and the integrity of the mining results is ensured. The data should be balanced and grouped. In this step, first, group the transaction projection data in the F-list table, then create a group table, and finally store the grouped data in the table.

M3阶段按照平衡分组的策略把数据分成n组，使得当数据输入时，一个Map函数对应一个分组，分别对各分组进行挖掘。这样就可以达到数据分解的效果，最后利用Reduce函数把Map处理的结果汇总起来。每个Map任务的运行是基于在平衡分组的过程中产生的分组表。建立的分组表是Hash结构，每个项映射到该项所在的事务数。每个Reduce任务归约Map任务产生的频繁模式。In the M3 stage, the data is divided into n groups according to the strategy of balanced grouping, so that when the data is input, one Map function corresponds to one group, and each group is mined separately. In this way, the effect of data decomposition can be achieved, and finally the Reduce function is used to summarize the results of Map processing. The operation of each map task is based on the grouping table generated during the process of balancing grouping. The established grouping table is a Hash structure, and each item is mapped to the transaction number of the item. Each Reduce task reduces the frequent patterns generated by the Map task.

M4阶段是汇总最后的频繁模式，也就是Reduce把所有的结果汇总起来，进行后面的处理。The M4 stage is to summarize the final frequent patterns, that is, Reduce collects all the results for subsequent processing.

参见图4，为本发明实施例提供的云计算平台结构示意图，包括：Referring to Fig. 4, it is a schematic structural diagram of a cloud computing platform provided by an embodiment of the present invention, including:

系统平台由三大部分组成:分布式文件系统服务，其中，主控服务器（Namenode）用于维护和管理整个文件系统，数据服务器（Datanode）用来存储数据块；MapReduce计算服务，其中，任务服务器(TaskTracker)用来完成Map和Reduce任务，而工作服务器（JobTracker）主要用来调度任务的执行、处理机器故障(容错)理、TaskTracker通讯以及对Job和Task执行情况的监控等；此外，还有一个工作(Job)的提交者，即客户端。The system platform consists of three parts: Distributed file system service, in which the main control server (Namenode) is used to maintain and manage the entire file system, and the data server (Datanode) is used to store data blocks; MapReduce computing service, in which the task server (TaskTracker) is used to complete Map and Reduce tasks, while the job server (JobTracker) is mainly used to schedule task execution, handle machine failure (fault tolerance), TaskTracker communication, and monitor the execution of Job and Task; in addition, there are The submitter of a job (Job), that is, the client.

本发明提供的数据关联规则挖掘实现方法中，提出了一个剪枝方法，并提出了剪枝频繁模式树算法，减少部分分支的迭代次数。然后本发明利用云计算的MapReduce编程技术，对剪枝频繁模式树算法的各个步骤并行化，提高了剪枝频繁模式树算法对数据的处理能力和效率，同时使得剪枝频繁模式树算法能适应云计算环境。将待处理的数据记录分块后分配给不同的计算节点处理，每个计算节点仅处理一部分数据记录，解决了海量数据无法全部由单机读入内存进行挖掘处理的问题；本发明提供的数据关联规则挖掘实现方法中，采用了多个计算节点并行地参与数据项统计过程再合并统计结果，有效解决了现有技术无法实现对海量数据的关联规则挖掘及处理效率低的问题。本发明不仅对传统的FP-growth算法进行了优化，提高了挖掘效率，还提出了改进后的并行挖掘算法，解决海量数据挖掘问题。本发明可用于数据挖掘的关联规则提取中；适用于数据处理在商业、企业、政府部门及科学研究等领域中；也适用于其他需要提取数据的关联性的各种场合。In the data association rule mining implementation method provided by the present invention, a pruning method is proposed, and a pruning frequent pattern tree algorithm is proposed to reduce the number of iterations of some branches. Then the present invention utilizes the MapReduce programming technology of cloud computing to parallelize each step of the pruning frequent pattern tree algorithm, which improves the data processing capability and efficiency of the pruning frequent pattern tree algorithm, and simultaneously enables the pruning frequent pattern tree algorithm to adapt to cloud computing environment. The data records to be processed are divided into blocks and distributed to different computing nodes for processing. Each computing node only processes a part of the data records, which solves the problem that the massive data cannot be read into the memory by a single machine for mining and processing; the data association provided by the present invention In the implementation method of rule mining, multiple computing nodes are used to participate in the statistical process of data items in parallel and then merge the statistical results, which effectively solves the problem that the existing technology cannot realize the mining of association rules for massive data and the problem of low processing efficiency. The invention not only optimizes the traditional FP-growth algorithm to improve the mining efficiency, but also proposes an improved parallel mining algorithm to solve the massive data mining problem. The invention can be used in the extraction of association rules of data mining; it is suitable for data processing in the fields of commerce, enterprises, government departments and scientific research; it is also applicable to various occasions that need to extract the association of data.

以上所述仅为本发明的优选实施例，并不用于限制本发明，显然，本领域的技术人员可以采用本发明中的剪枝频繁模式树算法进行优化，并进行并行化处理对本发明进行应用，不脱离本发明的精神和范围。The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention. Obviously, those skilled in the art can use the pruning frequent pattern tree algorithm in the present invention to optimize and perform parallel processing to apply the present invention , without departing from the spirit and scope of the present invention.

Claims

1. data association rule mining implementation method, it is characterized in that, may further comprise the steps: (1) will be imported data and cut apart, and be distributed to a plurality of computing nodes, frequency by the parallel statistics of each computing node 1-item collection, according to the ordering of 1-item collection frequency, and deletion makes up frequent 1-item collection table less than the 1-item collection of given threshold value; (2) grouping sheet that frequent 1-item collection is divided into groups to set up according to frequent 1-item collection table so carries out data projection to frequent 1-item collection, and the data after the projection are stored in the grouping sheet; (3) by each computing node the data in the grouping sheet are added up, make up local frequent pattern tree (fp tree), then each subtree of frequent pattern tree (fp tree) is carried out the data association rule mining, if there is the prefix path set A in a collection i in frequent pattern tree (fp tree), B, and satisfy

, then the item collection i in the set B just can beta pruning, and and the set A merging; (4) each frequent mode subtree is carried out the data association rule mining, merge by each frequent mode subtree and excavate the frequent mode that obtains.

2. data association rule mining implementation method as claimed in claim 1 is characterized in that, described step (1) further comprises: data object is evenly divided by record, made every group of data comprise the data object of balance; In every group of data, add up the number of times that each collection occurs; The frequency that occurs according to the item collection sorts, and given threshold value is the counting of absolute support, and frequent 1-item collection table is the Hash table of a frequent 1-item collection of storage.

3. data association rule mining implementation method as claimed in claim 1, it is characterized in that, described step (2) further comprises: the data after the grouping are carried out the projection of source data, set up the Hash table data after the projection are stored, this table called after grouping sheet; According to the computing node number grouping sheet size is set, the grouping sheet size equals the function Map number that Parallel Programming Models MapReduce arranges.

4. data association rule mining implementation method as claimed in claim 2, it is characterized in that, each Hash data is carried out unidirectional Hash calculating acquisition by q data object in the Hash table, and each data object participates in t+1 Hash and calculates, and obtains the individual Hash data of q (t+1).

5. data association rule mining implementation method as claimed in claim 3 is characterized in that the data structure of grouping sheet is the Hash form, the item collection after the stores packets, and the row of a collection place data number; The row at record data place number goes the projection source data, and data are assigned in the corresponding group.