[go: up one dir, main page]

CN103258049A - Association rule mining method based on mass data - Google Patents

Association rule mining method based on mass data Download PDF

Info

Publication number
CN103258049A
CN103258049A CN2013102012225A CN201310201222A CN103258049A CN 103258049 A CN103258049 A CN 103258049A CN 2013102012225 A CN2013102012225 A CN 2013102012225A CN 201310201222 A CN201310201222 A CN 201310201222A CN 103258049 A CN103258049 A CN 103258049A
Authority
CN
China
Prior art keywords
data
frequent
item collection
association rule
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102012225A
Other languages
Chinese (zh)
Inventor
杨勇
王伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN2013102012225A priority Critical patent/CN103258049A/en
Publication of CN103258049A publication Critical patent/CN103258049A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开一种基于海量数据的关联规则挖掘方法,涉及信息处理技术领域。具体包括,将输入数据按照记录进行分割,构建频繁1-项集表,然后对频繁1-项集进行数据投影,把投影后的数据存储在建立的分组表中,构建本地频繁模式树,然后对频繁模式树进行数据关联规则挖掘,在对频繁模式树进行挖掘的过程中,采用了剪枝策略,减少了频繁模式树挖掘的迭代次数,并且每个计算节点仅处理一部分数据记录,解决了海量数据无法由单机读入内存进行处理的问题;除此以外,多个节点并行参与处理,有效地提高了处理效率。

Figure 201310201222

The invention discloses an association rule mining method based on massive data, and relates to the technical field of information processing. Specifically, the input data is divided according to the records, the frequent 1-itemset table is constructed, and then the data is projected on the frequent 1-itemset, the projected data is stored in the established grouping table, and the local frequent pattern tree is constructed, and then Mining data association rules for frequent pattern trees. In the process of mining frequent pattern trees, the pruning strategy is adopted to reduce the number of iterations of frequent pattern tree mining, and each computing node only processes a part of data records, which solves the problem of Massive data cannot be read into memory by a single machine for processing; in addition, multiple nodes participate in parallel processing, effectively improving processing efficiency.

Figure 201310201222

Description

一种基于海量数据的关联规则挖掘方法A Method of Mining Association Rules Based on Massive Data

技术领域technical field

本发明涉及数据挖掘领域,尤其涉及数据关联规则挖掘实现方法。The invention relates to the field of data mining, in particular to a method for realizing data association rule mining.

背景技术Background technique

随着信息技术的迅猛发展,要从日益庞大和复杂的数据中发现有价值的信息和知识,达到为决策服务的目的,己成为非常艰巨的任务。数据挖掘技术在此背景下应运而生。关联规则挖掘是数据挖掘中的一个重要分支,也是目前应用最广的一种数据挖掘类型。With the rapid development of information technology, it has become a very difficult task to discover valuable information and knowledge from increasingly large and complex data to serve the purpose of decision-making. Data mining technology came into being in this context. Association rule mining is an important branch of data mining, and it is also the most widely used type of data mining.

数据关联规则的挖掘目的是发现在大量数据项之间存在的值得关注的关联或相关关系,典型应用是零售业的购物篮分析。所谓购物篮分析指对数据进行关联规则研究,发现交易数据库中不同商品之间的联系,有助于找出顾客购买行为的模式。例如,如果面包和牛奶经常被顾客同时购买,则把它们摆放在一起有助于增加两种商品的销售量。为了衡量规则的重要程度,关联规则通常采用支持度和可信度作为度量标准。支持度可以表示商品在超市销售中的重要程度,可信度反映了商品之间的关联程度。如果在购买面包的交易中,有60%的交易既购买了面包又购买了牛奶,则称关联规则“面包牛奶”(表示如果购买面包则购买牛奶)的可信度为60%。The purpose of mining data association rules is to discover the noteworthy associations or correlations among a large number of data items, and the typical application is shopping basket analysis in the retail industry. The so-called shopping basket analysis refers to the study of the association rules of the data, and discovers the connection between different commodities in the transaction database, which helps to find out the pattern of customer purchasing behavior. For example, if bread and milk are often purchased by customers at the same time, placing them together can help increase sales of both items. In order to measure the importance of rules, association rules usually use support and credibility as metrics. The degree of support can indicate the importance of commodities in supermarket sales, and the degree of credibility reflects the degree of correlation between commodities. If 60% of the transactions that buy bread buy both bread and milk, then the association rule "bread and milk" (meaning that if you buy bread, you buy milk) has 60% confidence.

关联规则(表示A与B同时存在)在事务数据库D中的支持度,可用概率表示,而关联规则在事务数据库D中的可信度,是在事务数据库D中的那些包含A的事务中,B也同时出现的概率,即条件概率。The support of association rules (meaning that A and B exist simultaneously) in transaction database D can be represented by probability, and the credibility of association rules in transaction database D is in those transactions containing A in transaction database D, The probability that B also occurs at the same time is the conditional probability.

一个项集X在事务数据库D中的支持度,是事务数据库D中包含X的事务占事务总数的百分比,即概率P(X)。对于一个项集X,如果其支持度大于或等于预先给定的支持度阈值min_s,则称X为频繁项集。The support degree of an itemset X in the transaction database D is the percentage of transactions containing X in the transaction database D to the total number of transactions, that is, the probability P(X). For an itemset X, if its support is greater than or equal to the pre-specified support threshold min_s, X is called a frequent itemset.

FP-growth(频繁模式增长算法)算法的基本思想是利用树结构对事务进行压缩,同时保留事务中属性之间的关系。该算法的重要步骤是频繁模式树的建立和挖掘,需要扫描两次事务集:第一次扫描事务集D,找出频繁1-项集集合,并且将集合按照支持度计数降序排列;第二次扫描事务集D,以“Null”为根节点,构建频繁模式树。为了快速遍历频繁模式树,创建了项头表T,该表中每一行表示一个频繁项及其出现的频率,并且设置指针指向它在频繁模式树中对应的节点,最后对频繁模式树进行递归挖掘找出所有的频繁项集。具体算法流程如下:The basic idea of the FP-growth (Frequent Pattern Growth Algorithm) algorithm is to use the tree structure to compress transactions while retaining the relationship between attributes in the transaction. The important step of the algorithm is the establishment and mining of frequent pattern trees, which need to scan transaction sets twice: the first scan transaction set D, find frequent 1-itemset sets, and arrange the sets in descending order of support count; the second The transaction set D is scanned twice, and the frequent pattern tree is constructed with "Null" as the root node. In order to quickly traverse the frequent pattern tree, an item header table T is created, each row in the table represents a frequent item and its frequency of occurrence, and the pointer is set to point to its corresponding node in the frequent pattern tree, and finally the frequent pattern tree is recursed Mining to find all frequent itemsets. The specific algorithm flow is as follows:

(1)创建原始频繁模式树(1) Create the original frequent pattern tree

1)扫描事务集合T,找出支持度计数满足条件的项,并将这些项组合成频繁1-项集合L,基于支持度计数将L降序排列得到1-项集列表。1) Scan the transaction set T, find out the items whose support count meets the condition, and combine these items into a frequent 1-item set L, and sort L in descending order based on the support count to get the 1-item set list.

2)创建原始频繁模式树,并以“Null”为根节点。2) Create an original frequent pattern tree with "Null" as the root node.

3)创建项头表,为了方便遍历原始频繁模式树,项头表中每一行表示一个频繁项,并且有对应指针指向它在频繁模式树中的节点。3) Create an item header table. For the convenience of traversing the original frequent pattern tree, each row in the item header table represents a frequent item, and has a corresponding pointer pointing to its node in the frequent pattern tree.

4)遍历一次事务集合T,根据1-项集列表将T中所有事务的项次序进行调整,为每个调整后项次序的事务创建一个事务分支。如果分支可以共享路径则共享,并且在各个节点上记录共享事务数目。4) Traverse the transaction set T once, adjust the item order of all transactions in T according to the 1-itemset list, and create a transaction branch for each transaction with the adjusted item order. Branches share paths if they can, and record the number of shared transactions on each node.

(2)在原始频繁模式树上递归地找出所有的频繁项集(2) Recursively find all frequent itemsets on the original frequent pattern tree

1)赋后缀模式a的初始值为根节点Null,即a=Null。在频繁模式树上采用递归的方法搜索频繁项集,如果频繁模式树只有一个分支,那么分支路径上的节点的一个组合就是一个前缀模式b。剔除分支路径上的不满足最小支持度的节点,剩余的节点值所组成的任何集合b与后缀模式a取并,可获得所有对应的频繁项集。其中bi为一前缀模式,否则增长a,a=a∪{Ei}(其中Ei为1-项集表中的最后一项,即支持度计数的最小项)。然后构造后缀模式的条件模式基与条件频繁模式树。(其中a的条件模式基,指的是频繁模式树中以Ei为叶子节点的所有分支。A的条件频繁模式树,指得是以a的条件模式基为事务按照步骤1中的方式所创建的新的频繁模式树)1) The initial value of the suffix pattern a is the root node Null, that is, a=Null. A recursive method is used to search for frequent itemsets on the frequent pattern tree. If the frequent pattern tree has only one branch, then a combination of nodes on the branch path is a prefix pattern b. Eliminate the nodes on the branch path that do not meet the minimum support, and merge any set b composed of the remaining node values with the suffix pattern a to obtain all corresponding frequent itemsets. Where b i is a prefix pattern, otherwise increase a, a=a∪{E i } (where E i is the last item in the 1-itemset table, which is the minimum item of support count). Then construct the conditional pattern base and conditional frequent pattern tree of the suffix pattern. (The conditional pattern base of a refers to all branches with E i as the leaf node in the frequent pattern tree. The conditional frequent pattern tree of A refers to the conditional pattern base of a as the transaction according to the method in step 1. The new frequent pattern tree created)

2)在条件频繁模式树上采用相同方法递归地搜索频繁项集(后缀模式a此时为{Ei})。2) Use the same method to recursively search for frequent itemsets on the conditional frequent pattern tree (the suffix pattern a is {E i } at this time).

3)对每个最大频繁项集,取其所有子集,其中每个子集为一频繁项集。3) For each maximum frequent itemset, take all its subsets, and each subset is a frequent itemset.

上述FP-growth(频繁模式增长算法)数据挖掘方法,对于少量数据记录,可以方便地在单机上实现。但对于海量的数据记录进行数据关联规则的挖掘时,一方面由于单机内存容量有限,不可能读入海量的数据;另一方面,由于数据挖掘过程中需要处理的数据记录太多,处理时间很长,在实际的数据业务应用中,效率很低。The above-mentioned FP-growth (Frequent Pattern Growth Algorithm) data mining method can be conveniently implemented on a single computer for a small amount of data records. However, when mining data association rules for massive data records, on the one hand, due to the limited memory capacity of a single machine, it is impossible to read massive data; Long, in the actual data business application, the efficiency is very low.

因此,对于实际应用中海量数据的关联规则挖掘,如何有效地提升处理效率是数据关联规则挖掘中需要加以解决的一个主要问题。Therefore, for association rule mining of massive data in practical applications, how to effectively improve processing efficiency is a major problem that needs to be solved in data association rule mining.

云计算具有海量的存储能力和可弹性变化的计算能力,因此基于云计算进行海量数据挖掘和获取关联规则,成为解决传统集中式数据挖掘对海量数据动态增长的不适应性的一种高效、可信途径。Cloud computing has massive storage capacity and elastically variable computing power. Therefore, massive data mining and association rule acquisition based on cloud computing has become an efficient and reliable solution to the inadaptability of traditional centralized data mining to the dynamic growth of massive data. letter path.

发明内容Contents of the invention

本发明针对现有技术存在的海量数据关联规则挖掘问题,对经典FP-growth算法的频繁模式树的结构和挖掘过程进行了深入分析和研究,分析了频繁模式树单路径和多路径的挖掘方法,提出了一个剪枝方法,并提出了剪枝频繁模式树算法;可以减少部分分支的迭代次数。然后利用云计算的MapReduce编程技术,对剪枝频繁模式树算法的各个步骤并行化,提高了剪枝频繁模式树算法对数据的处理能力和效率。Aiming at the problem of mining association rules of massive data existing in the prior art, the present invention conducts in-depth analysis and research on the structure and mining process of the frequent pattern tree of the classic FP-growth algorithm, and analyzes the single-path and multi-path mining methods of the frequent pattern tree , a pruning method is proposed, and a pruning frequent pattern tree algorithm is proposed; the number of iterations of some branches can be reduced. Then, using the MapReduce programming technology of cloud computing, the steps of the pruning frequent pattern tree algorithm are parallelized, which improves the data processing ability and efficiency of the pruning frequent pattern tree algorithm.

本发明解决上述技术问题的技术方案是:提出一种海量数据的数据关联规则挖掘实现方法,包括如下步骤:The technical solution of the present invention to solve the above-mentioned technical problems is: to propose a data association rule mining implementation method for massive data, including the following steps:

1)将输入数据按照记录进行分割,并将分割后的数据分发给多个计算节点,由各计算节点并行统计1-项集的频率,按照1-项集频率排序的结果,构建频繁1-项集表,删除小于给定阈值的项集,生成最终的频繁1-项集表;1) Divide the input data according to the records, and distribute the divided data to multiple computing nodes, each computing node counts the frequency of 1-itemsets in parallel, sorts the results according to the frequency of 1-itemsets, and constructs frequent 1-itemsets Itemset table, delete itemsets smaller than a given threshold, and generate the final frequent 1-itemset table;

2)根据频繁1-项集表对频繁1-项集进行分组,然后对频繁1-项集进行数据投影,把得到投影后的数据存储在建立的分组表中;2) Group frequent 1-itemsets according to the frequent 1-itemset table, then perform data projection on frequent 1-itemsets, and store the projected data in the established grouping table;

3)由各节点对分组表中的数据进行统计,构建本地频繁模式树,然后对频繁模式树进行数据关联规则挖掘,其中在对频繁模式树进行遍历挖掘时,如果项集i在频繁模式树中存在前缀路径集合A,B并且满足

Figure BDA00003255809800031
,那么前缀路径集合中长路径集合B中的项集i就可以剪枝和短路径集合A合并。3) Each node counts the data in the grouping table, builds a local frequent pattern tree, and then mines the data association rules of the frequent pattern tree. When traversing the frequent pattern tree, if the item set i is in the frequent pattern tree There exists a set of prefix paths A, B in and satisfies
Figure BDA00003255809800031
, then the item set i in the long path set B in the prefix path set can be pruned and merged with the short path set A.

假设集合A中i出现的频率为m,集合B中i出现的频率为n,最小支持度为q,且m≥q,n<q。未剪枝的频繁模式树得到i的条件模式基为{{A:m},{B:n}},产生的频繁模式为{Ai:m+n}。剪枝以后的频繁模式树得到i的条件模式基为{A:m+n},产生的频繁模式为{Ai:m+n}。剪枝后包含i的频繁模式和未剪枝包含i的频繁模式相等。因此,集合B中的i就可以剪枝和短路径集合A合并。Suppose the frequency of i in set A is m, the frequency of i in set B is n, the minimum support is q, and m≥q, n<q. The conditional pattern base of i obtained from the unpruned frequent pattern tree is {{A:m},{B:n}}, and the resulting frequent pattern is {Ai:m+n}. After pruning the frequent pattern tree, the conditional pattern base of i is {A:m+n}, and the resulting frequent pattern is {Ai:m+n}. The frequent pattern containing i after pruning is equal to the frequent pattern containing i without pruning. Therefore, i in set B can be pruned and merged with short path set A.

4)对各个子树进行频繁模式的挖掘,然后合并由各子树挖掘得到的频繁模式,得到最终的频繁模式。4) Mining the frequent patterns of each subtree, and then merging the frequent patterns mined from each subtree to obtain the final frequent pattern.

进一步,所述步骤1)具体包括如下步骤:Further, the step 1) specifically includes the following steps:

11)将数据对象按记录均匀划分,使每组数据包含平衡的数据对象;11) Divide data objects evenly by records, so that each group of data contains balanced data objects;

12)在每组数据中,统计每一个项集出现的次数;12) In each set of data, count the number of occurrences of each item set;

13)根据项集出现的频率进行排序,然后和给定的阈值(绝对支持度的计数)进行对比,小于给定阈值的项集删除,最后建立一个Hash表来存储频繁1-项集,该表命名为频繁1-项集表。13) Sort according to the frequency of item sets, and then compare with a given threshold (absolute support count), delete itemsets smaller than the given threshold, and finally create a Hash table to store frequent 1-itemsets, the The table is named frequent 1-itemset table.

进一步,所述步骤2)具体包括如下步骤:Further, the step 2) specifically includes the following steps:

21)根据频繁1-项集表,对频繁1-项集进行数据分组,分组后的数据进行源数据的投影,建立Hash表把投影后的数据存储起来,存储的表命名分组表;21) According to the frequent 1-itemset table, group the frequent 1-itemset data, project the source data on the grouped data, create a Hash table to store the projected data, and name the stored table as the grouping table;

22)根据集群的节点的个数设置分组表大小,分组表大小等于MapReduce中设置的Map个数。22) Set the size of the grouping table according to the number of nodes in the cluster, and the size of the grouping table is equal to the number of Maps set in MapReduce.

进一步,所述步骤4)具体包括如下步骤:Further, the step 4) specifically includes the following steps:

41)对剪枝之后的频繁模式树进行频繁模式的挖掘;41) Mining frequent patterns on the frequent pattern tree after pruning;

42)合并所有的频繁模式。42) Merge all frequent patterns.

进一步,步骤13)中,生成频繁1-项集,这里的频繁1-项集的数据结构是Hash形式,每个Hash数据由q个数据对象进行单向Hash计算获得,每个数据对象参与t+1次Hash计算,获得q(t+1)个Hash数据。Further, in step 13), frequent 1-itemsets are generated, where the data structure of frequent 1-itemsets is in the form of Hash, and each Hash data is obtained by one-way Hash calculation of q data objects, and each data object participates in t +1 Hash calculation, get q(t+1) Hash data.

进一步,所述步骤21)中,分组表的数据结构是Hash形式,分组表存储分组后的项集,以及项集所在数据的行号;记录数据所在记录,去投影源数据,把数据分到对应的组中。Further, in the step 21), the data structure of the grouping table is in Hash form, and the grouping table stores the grouped item sets and the row number of the data where the item sets are located; record the records where the data is located, project the source data, and divide the data into in the corresponding group.

本发明提供的数据关联规则挖掘方法中,采用了多个计算节点并行地参与数据项统计过程再合并统计结果,加快了生成频繁1-项集的速度;并通过数据分组,然后进行数据的分发,最后通过分发的数据集到计算节点上挖掘出关联规则,其中,每个节点在对频繁模式树挖掘过程采用了剪枝策略,减少了数据挖掘的迭代次数,提高了处理效率。同时每个计算节点仅处理一部分数据记录,解决了海量数据无法由单机读入内存进行处理的问题。本发明可用于数据挖掘的关联规则提取中;适用于数据处理在商业、企业、政府部门及科学研究等领域中;也适用于其他需要提取数据的关联性的各种场合。In the data association rule mining method provided by the present invention, multiple computing nodes are used to participate in the data item statistical process in parallel and then merge the statistical results, which speeds up the generation of frequent 1-item sets; and through data grouping, and then data distribution , and finally mine the association rules through the distributed data sets on the computing nodes, in which each node adopts the pruning strategy in the process of frequent pattern tree mining, which reduces the number of iterations of data mining and improves the processing efficiency. At the same time, each computing node only processes a part of data records, which solves the problem that massive data cannot be read into memory by a single machine for processing. The invention can be used in the extraction of association rules of data mining; it is suitable for data processing in the fields of commerce, enterprises, government departments and scientific research; it is also applicable to various occasions that need to extract the association of data.

本发明的其他优点、目标,和特征在某种程度上将在随后的说明书中进行阐述,并且在某种程度上,基于对下文的考察研究对本领域技术人员而言将是显而易见的,或者可以从本发明的实践中得到教导。Other advantages, objects, and features of the present invention will be set forth in the ensuing description to some extent, and to some extent, will be obvious to those skilled in the art based on the investigation and research below, or can be Learn from the practice of the invention.

附图说明Description of drawings

为了使本发明的目的、技术方案和优点更加清楚,下面将结合附图对本明作进一步的详细描述:In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings:

图1为表一数据集建立的频繁模式树结构示意图;Figure 1 is a schematic diagram of the frequent pattern tree structure established by the data set in Table 1;

图2为本发明流程图;Fig. 2 is a flowchart of the present invention;

图3为并行的剪枝频繁模式树算法框架图;Fig. 3 is the frame diagram of parallel pruning frequent pattern tree algorithm;

图4为云计算平台结构示意图。Fig. 4 is a schematic structural diagram of a cloud computing platform.

具体实施方式Detailed ways

下面结合附图,对本发明实施例提供的数据关联规则挖掘实现方法进行详细阐述。The method for implementing data association rule mining provided by the embodiment of the present invention will be described in detail below in conjunction with the accompanying drawings.

图1本发明实施例提供的频繁模式树进行剪枝的结构示意图,包括:Fig. 1 is a schematic structural diagram of frequent pattern tree pruning provided by an embodiment of the present invention, including:

对源数据进行频繁1-项集的统计,得到条件模式基,构建好频繁模式树,挖掘频繁模式树的过程采用一个优化方法,具体步骤参见图2,包括以下步骤:Perform frequent 1-itemset statistics on the source data to obtain the conditional pattern base, construct the frequent pattern tree, and adopt an optimization method in the process of mining the frequent pattern tree. See Figure 2 for specific steps, including the following steps:

将输入数据进行分割,并分发给多个计算节点,由各计算节点并行统计1-项集的频率,根据1-项集频率排序,并删除小于给定阈值的1-项集,构建频繁1-项集表;根据频繁1-项集表对频繁1-项集进行分组建立的分组表,然对频繁1-项集进行数据投影,把投影后的数据存储在分组表中;由各计算节点对分组表中的数据进行统计,构建本地频繁模式树,然后对频繁模式树各子树进行数据关联规则挖掘,如果项集i在频繁模式树中存在前缀路径集合A,B,并且满足,则集合B中的项集i就可以剪枝,并和集合A合并;对各频繁模式子树进行数据关联规则挖掘,合并由各频繁模式子树挖掘得到的频繁模式。Divide the input data and distribute it to multiple computing nodes. Each computing node counts the frequency of 1-itemsets in parallel, sorts them according to the frequency of 1-itemsets, and deletes 1-itemsets smaller than a given threshold to construct frequent 1-itemsets. -itemset table; according to the frequent 1-itemset table to group frequent 1-itemsets to establish a grouping table, then perform data projection on frequent 1-itemsets, and store the projected data in the grouping table; calculated by each The node counts the data in the grouping table, builds a local frequent pattern tree, and then mines data association rules for each subtree of the frequent pattern tree. If item set i has prefix path sets A and B in the frequent pattern tree, and satisfies , then item set i in set B can be pruned and merged with set A; carry out data association rule mining on each frequent pattern subtree, and merge frequent patterns mined from each frequent pattern subtree.

将数据对象按记录均匀划分,使每组数据包含平衡的数据对象;在每组数据中,统计每一个项集出现的次数;根据项集出现的频率进行排序,给定阈值为绝对支持度的计数,频繁1-项集表为一个存储频繁1-项集的Hash表。其中,Hash表中每个Hash数据由q个数据对象进行单向Hash计算获得,每个数据对象参与t+1次Hash计算,获得q(t+1)个Hash数据。Divide the data objects evenly according to the records, so that each group of data contains balanced data objects; in each group of data, count the number of occurrences of each item set; sort according to the frequency of occurrence of the item set, and the given threshold is absolute support Counting, frequent 1-itemset table is a Hash table that stores frequent 1-itemsets. Wherein, each Hash data in the Hash table is obtained by performing one-way Hash calculation on q data objects, and each data object participates in t+1 Hash calculations to obtain q(t+1) Hash data.

把分组后的数据进行源数据的投影,建立Hash表把投影后的数据存储起来,该表命名为分组表;根据计算节点个数设置分组表大小,分组表大小等于并行编程模式MapReduce设置的函数Map个数。分组表的数据结构是Hash形式,存储分组后的项集,以及项集所在数据的行号;记录数据所在的行号,去投影源数据,把数据分到对应的组中。Project the grouped data to the source data, create a Hash table to store the projected data, and name the table as a grouping table; set the size of the grouping table according to the number of computing nodes, and the size of the grouping table is equal to the function set by the parallel programming mode MapReduce Number of maps. The data structure of the grouping table is in Hash form, which stores the grouped itemset and the row number of the data where the itemset is located; records the row number where the data is located, projects the source data, and divides the data into corresponding groups.

对构建好的频繁模式树,遍历其中的每一个项集,如果项集i在某一个路径上是非频繁的,假设项集i在频繁模式树中存在前缀路径集合A、B,并且满足

Figure BDA00003255809800062
,那么前缀路径集合中长路径集合B中的项集i就可以剪枝和短路径集合A合并,For the constructed frequent pattern tree, traverse each item set in it, if the item set i is infrequent on a certain path, assuming that the item set i has prefix path sets A and B in the frequent pattern tree, and satisfies
Figure BDA00003255809800062
, then the item set i in the long path set B in the prefix path set can be pruned and merged with the short path set A,

具体为:假设集合A中i出现的频率为m,集合B中i出现的频率为n,最小支持度为q,且m≥q,n<q。未剪枝的频繁模式树得到i的条件模式基为{{A:m},{B:n}},产生的频繁模式为{Ai:m+n}。剪枝以后的频繁模式树得到i的条件模式基为{A:m+n},产生的频繁模式为{Ai:m+n}。剪枝后包含i的频繁模式和未剪枝包含i的频繁模式相等。因此,集合B中的i就可以剪枝和短路径集合A合并。Specifically: Assume that the frequency of i in set A is m, the frequency of i in set B is n, the minimum support is q, and m≥q, n<q. The conditional pattern base of i obtained from the unpruned frequent pattern tree is {{A:m},{B:n}}, and the resulting frequent pattern is {Ai:m+n}. After pruning the frequent pattern tree, the conditional pattern base of i is {A:m+n}, and the resulting frequent pattern is {Ai:m+n}. The frequent pattern containing i after pruning is equal to the frequent pattern containing i without pruning. Therefore, i in set B can be pruned and merged with short path set A.

对于多路径的项集来说,有的项集存在某个路径上的支持度小于给定的阈值,因此可以采用上述方法剪枝。剪枝后如果是单路径的就可以归为单路径的情况下去考虑。这样可以减少项集挖掘的递归次数,提高挖掘效率。For multi-path itemsets, some itemsets have a support degree on a certain path that is less than a given threshold, so the above method can be used to prune. After pruning, if it is a single path, it can be considered as a single path. This can reduce the recursion times of itemset mining and improve mining efficiency.

以表1事务数据集D为例来说明剪枝方法的过程。假如最小支持度为2,在建立图1的频繁模式树后,可以得到该频繁模式树对应的HeaderTable(项头表)信息,通过用HeaderTable中节点链的信息,查找各个项的各个节点前缀路径。Take the transaction data set D in Table 1 as an example to illustrate the process of the pruning method. If the minimum support is 2, after the frequent pattern tree in Figure 1 is established, the HeaderTable (item header table) information corresponding to the frequent pattern tree can be obtained, and the prefix path of each node of each item can be found by using the information of the node chain in the HeaderTable .

表1数据集DTable 1 Dataset D

Figure BDA00003255809800071
Figure BDA00003255809800071

从图1可以看出,I5项的一个前缀路径是A={I2,I1},I5项的另外一个前缀路径是B={I2,I1,I3},A是B的真子集,满足上面提出的剪枝方法,所以剪枝I5后的频繁模式树,I5就从原来的多路径变为单路径,直接进行频繁项的挖掘,减少了挖掘的递归次数。It can be seen from Figure 1 that one prefix path of I5 item is A={I2, I1}, another prefix path of I5 item is B={I2, I1, I3}, and A is a proper subset of B, satisfying the above mentioned Therefore, after pruning the frequent pattern tree of I5, I5 changes from the original multi-path to single-path, and directly mines frequent items, reducing the number of recursive mining.

图3为本发明实施例提供的并行的剪枝频繁模式树算法步骤流程图,包括如下步骤:Fig. 3 is the flow chart of parallel pruning frequent pattern tree algorithm steps provided by the embodiment of the present invention, including the following steps:

剪枝频繁模式树算法,在串行算法上表现了一定的优势。通过构造频繁模式树来寻找频繁模式,然而在数据集达到一定规模时,构造基于内存的频繁模式树是不现实的。因此,提出了基于MapReduce并行的剪枝频繁模式树算法解决上述问题。该算法在剪枝频繁模式树算法的基础上,利用MapReduce编程模型,对剪枝频繁模式树算法的各个步骤并行化。总体分成四个步骤:Pruning frequent pattern tree algorithm shows certain advantages in serial algorithm. Find frequent patterns by constructing frequent pattern trees. However, when the data set reaches a certain scale, it is unrealistic to construct memory-based frequent pattern trees. Therefore, a parallel pruning frequent pattern tree algorithm based on MapReduce is proposed to solve the above problems. Based on the pruning frequent pattern tree algorithm, the algorithm uses the MapReduce programming model to parallelize each step of the pruning frequent pattern tree algorithm. Overall divided into four steps:

频繁1-项集的频率统计;利用频繁1-项集的频率统计结果建立一个频繁1-项集表,按照频繁1-项集表对数据进行分组,这样把数据分成了n个部分;对分解后的数据进行挖掘关联规则;汇总最后的频繁模式,即M1,M2,M3,M4。Frequency statistics of frequent 1-itemsets; use the frequency statistics of frequent 1-itemsets to create a frequent 1-itemset table, and group the data according to the frequent 1-itemset table, thus dividing the data into n parts; The decomposed data is used to mine association rules; the final frequent patterns are summarized, namely M1, M2, M3, M4.

M1阶段是使用一个MapReduce任务来进行所有项集的频率统计。当输入一个分片后,首先调用Map函数,按照项集的记录逐行将项集分割成pair<key=num,String>;接着对String按照项集的间隔进行分割,形成pair<key,value=Ti>,其中,key为单个的项,value为项所在的事务;然后对Map分解后的结果进行Reduce(MapReduce模式的一种函数)操作,把key所在的事务数全部查找出来,并进行频率统计;最后将频率统计的结果按照设定的最小支持度进行筛选,把筛选的结果按照项集的频度从高到低的顺序存入频繁1-项集表中。The M1 stage uses a MapReduce task to perform frequency statistics of all itemsets. When a fragment is input, the Map function is first called to divide the item set into pair<key=num, String> line by line according to the records of the item set; then the String is divided according to the interval of the item set to form a pair<key, value= Ti>, where the key is a single item, and the value is the transaction where the item is located; then the Reduce (a function of the MapReduce mode) operation is performed on the result after the Map is decomposed, and the number of transactions where the key is located is found out, and the frequency Statistics; finally, the results of frequency statistics are screened according to the set minimum support, and the screened results are stored in the frequent 1-itemset table in order of the frequency of itemsets from high to low.

M2阶段是为了达到负载均衡的效果,并且保证每组相对独立,以便后续处理更方便并且保证挖掘结果的完备性,要对数据进行平衡分组。该步骤首先根据F-list表中的事务投影数据来分组,然后创建一个分组表,最后将分组后的数据存放该表中。The M2 stage is to achieve the effect of load balancing and to ensure that each group is relatively independent, so that the subsequent processing is more convenient and the integrity of the mining results is ensured. The data should be balanced and grouped. In this step, first, group the transaction projection data in the F-list table, then create a group table, and finally store the grouped data in the table.

M3阶段按照平衡分组的策略把数据分成n组,使得当数据输入时,一个Map函数对应一个分组,分别对各分组进行挖掘。这样就可以达到数据分解的效果,最后利用Reduce函数把Map处理的结果汇总起来。每个Map任务的运行是基于在平衡分组的过程中产生的分组表。建立的分组表是Hash结构,每个项映射到该项所在的事务数。每个Reduce任务归约Map任务产生的频繁模式。In the M3 stage, the data is divided into n groups according to the strategy of balanced grouping, so that when the data is input, one Map function corresponds to one group, and each group is mined separately. In this way, the effect of data decomposition can be achieved, and finally the Reduce function is used to summarize the results of Map processing. The operation of each map task is based on the grouping table generated during the process of balancing grouping. The established grouping table is a Hash structure, and each item is mapped to the transaction number of the item. Each Reduce task reduces the frequent patterns generated by the Map task.

M4阶段是汇总最后的频繁模式,也就是Reduce把所有的结果汇总起来,进行后面的处理。The M4 stage is to summarize the final frequent patterns, that is, Reduce collects all the results for subsequent processing.

参见图4,为本发明实施例提供的云计算平台结构示意图,包括:Referring to Fig. 4, it is a schematic structural diagram of a cloud computing platform provided by an embodiment of the present invention, including:

系统平台由三大部分组成:分布式文件系统服务,其中,主控服务器(Namenode)用于维护和管理整个文件系统,数据服务器(Datanode)用来存储数据块;MapReduce计算服务,其中,任务服务器(TaskTracker)用来完成Map和Reduce任务,而工作服务器(JobTracker)主要用来调度任务的执行、处理机器故障(容错)理、TaskTracker通讯以及对Job和Task执行情况的监控等;此外,还有一个工作(Job)的提交者,即客户端。The system platform consists of three parts: Distributed file system service, in which the main control server (Namenode) is used to maintain and manage the entire file system, and the data server (Datanode) is used to store data blocks; MapReduce computing service, in which the task server (TaskTracker) is used to complete Map and Reduce tasks, while the job server (JobTracker) is mainly used to schedule task execution, handle machine failure (fault tolerance), TaskTracker communication, and monitor the execution of Job and Task; in addition, there are The submitter of a job (Job), that is, the client.

本发明提供的数据关联规则挖掘实现方法中,提出了一个剪枝方法,并提出了剪枝频繁模式树算法,减少部分分支的迭代次数。然后本发明利用云计算的MapReduce编程技术,对剪枝频繁模式树算法的各个步骤并行化,提高了剪枝频繁模式树算法对数据的处理能力和效率,同时使得剪枝频繁模式树算法能适应云计算环境。将待处理的数据记录分块后分配给不同的计算节点处理,每个计算节点仅处理一部分数据记录,解决了海量数据无法全部由单机读入内存进行挖掘处理的问题;本发明提供的数据关联规则挖掘实现方法中,采用了多个计算节点并行地参与数据项统计过程再合并统计结果,有效解决了现有技术无法实现对海量数据的关联规则挖掘及处理效率低的问题。本发明不仅对传统的FP-growth算法进行了优化,提高了挖掘效率,还提出了改进后的并行挖掘算法,解决海量数据挖掘问题。本发明可用于数据挖掘的关联规则提取中;适用于数据处理在商业、企业、政府部门及科学研究等领域中;也适用于其他需要提取数据的关联性的各种场合。In the data association rule mining implementation method provided by the present invention, a pruning method is proposed, and a pruning frequent pattern tree algorithm is proposed to reduce the number of iterations of some branches. Then the present invention utilizes the MapReduce programming technology of cloud computing to parallelize each step of the pruning frequent pattern tree algorithm, which improves the data processing capability and efficiency of the pruning frequent pattern tree algorithm, and simultaneously enables the pruning frequent pattern tree algorithm to adapt to cloud computing environment. The data records to be processed are divided into blocks and distributed to different computing nodes for processing. Each computing node only processes a part of the data records, which solves the problem that the massive data cannot be read into the memory by a single machine for mining and processing; the data association provided by the present invention In the implementation method of rule mining, multiple computing nodes are used to participate in the statistical process of data items in parallel and then merge the statistical results, which effectively solves the problem that the existing technology cannot realize the mining of association rules for massive data and the problem of low processing efficiency. The invention not only optimizes the traditional FP-growth algorithm to improve the mining efficiency, but also proposes an improved parallel mining algorithm to solve the massive data mining problem. The invention can be used in the extraction of association rules of data mining; it is suitable for data processing in the fields of commerce, enterprises, government departments and scientific research; it is also applicable to various occasions that need to extract the association of data.

以上所述仅为本发明的优选实施例,并不用于限制本发明,显然,本领域的技术人员可以采用本发明中的剪枝频繁模式树算法进行优化,并进行并行化处理对本发明进行应用,不脱离本发明的精神和范围。The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention. Obviously, those skilled in the art can use the pruning frequent pattern tree algorithm in the present invention to optimize and perform parallel processing to apply the present invention , without departing from the spirit and scope of the present invention.

Claims (5)

1. data association rule mining implementation method, it is characterized in that, may further comprise the steps: (1) will be imported data and cut apart, and be distributed to a plurality of computing nodes, frequency by the parallel statistics of each computing node 1-item collection, according to the ordering of 1-item collection frequency, and deletion makes up frequent 1-item collection table less than the 1-item collection of given threshold value; (2) grouping sheet that frequent 1-item collection is divided into groups to set up according to frequent 1-item collection table so carries out data projection to frequent 1-item collection, and the data after the projection are stored in the grouping sheet; (3) by each computing node the data in the grouping sheet are added up, make up local frequent pattern tree (fp tree), then each subtree of frequent pattern tree (fp tree) is carried out the data association rule mining, if there is the prefix path set A in a collection i in frequent pattern tree (fp tree), B, and satisfy
Figure FDA00003255809700011
, then the item collection i in the set B just can beta pruning, and and the set A merging; (4) each frequent mode subtree is carried out the data association rule mining, merge by each frequent mode subtree and excavate the frequent mode that obtains.
2. data association rule mining implementation method as claimed in claim 1 is characterized in that, described step (1) further comprises: data object is evenly divided by record, made every group of data comprise the data object of balance; In every group of data, add up the number of times that each collection occurs; The frequency that occurs according to the item collection sorts, and given threshold value is the counting of absolute support, and frequent 1-item collection table is the Hash table of a frequent 1-item collection of storage.
3. data association rule mining implementation method as claimed in claim 1, it is characterized in that, described step (2) further comprises: the data after the grouping are carried out the projection of source data, set up the Hash table data after the projection are stored, this table called after grouping sheet; According to the computing node number grouping sheet size is set, the grouping sheet size equals the function Map number that Parallel Programming Models MapReduce arranges.
4. data association rule mining implementation method as claimed in claim 2, it is characterized in that, each Hash data is carried out unidirectional Hash calculating acquisition by q data object in the Hash table, and each data object participates in t+1 Hash and calculates, and obtains the individual Hash data of q (t+1).
5. data association rule mining implementation method as claimed in claim 3 is characterized in that the data structure of grouping sheet is the Hash form, the item collection after the stores packets, and the row of a collection place data number; The row at record data place number goes the projection source data, and data are assigned in the corresponding group.
CN2013102012225A 2013-05-27 2013-05-27 Association rule mining method based on mass data Pending CN103258049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013102012225A CN103258049A (en) 2013-05-27 2013-05-27 Association rule mining method based on mass data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013102012225A CN103258049A (en) 2013-05-27 2013-05-27 Association rule mining method based on mass data

Publications (1)

Publication Number Publication Date
CN103258049A true CN103258049A (en) 2013-08-21

Family

ID=48961966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013102012225A Pending CN103258049A (en) 2013-05-27 2013-05-27 Association rule mining method based on mass data

Country Status (1)

Country Link
CN (1) CN103258049A (en)

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678695A (en) * 2013-12-27 2014-03-26 中国科学院深圳先进技术研究院 Concurrent processing method and device
CN103761236A (en) * 2013-11-20 2014-04-30 同济大学 Incremental frequent pattern increase data mining method
CN104298778A (en) * 2014-11-04 2015-01-21 北京科技大学 Method and system for predicting quality of rolled steel product based on association rule tree
CN104598992A (en) * 2015-01-12 2015-05-06 北京中交兴路车联网科技有限公司 Common route analysis method for vehicle
CN104636956A (en) * 2015-01-26 2015-05-20 沈阳师范大学 Method for gain mining of user behavior pattern based on mobile transaction sequence pattern
CN104731852A (en) * 2014-12-16 2015-06-24 芜湖乐锐思信息咨询有限公司 Big data system
CN104834734A (en) * 2015-05-18 2015-08-12 成都博元科技有限公司 Efficient data analysis and processing method
CN104834733A (en) * 2015-05-18 2015-08-12 成都博元科技有限公司 Big data mining and analyzing method
CN105512322A (en) * 2015-12-18 2016-04-20 中国农业银行股份有限公司 Frequent item set generating method and device
CN105589908A (en) * 2014-12-31 2016-05-18 中国银联股份有限公司 Association rule computing method for transaction set
CN105760279A (en) * 2016-03-09 2016-07-13 北京国电通网络技术有限公司 Method and system for generating fault early warning relevance tree of distributed database cluster
CN105930457A (en) * 2016-04-21 2016-09-07 南开大学 Distributed architecture-based data flow frequent item mining method
CN106327323A (en) * 2016-08-19 2017-01-11 清华大学 Bank frequent item mode mining method and bank frequent item mode mining system
CN106503218A (en) * 2016-10-27 2017-03-15 北京邮电大学 A kind of parallelization Workflow association data find method
CN106598494A (en) * 2016-12-05 2017-04-26 东软集团股份有限公司 Data statistical method and apparatus
CN106790468A (en) * 2016-12-10 2017-05-31 武汉白虹软件科技有限公司 A kind of distributed implementation method for analyzing user's WiFi event trace rules
CN106874396A (en) * 2017-01-16 2017-06-20 重庆大学 A kind of Frequent Pattern Mining method based on nonvolatile memory
CN107291734A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of method for digging of frequent item set, apparatus and system
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
CN107609110A (en) * 2017-09-13 2018-01-19 深圳大学 The method for digging and device of maximum various frequent mode based on classification tree
CN107766442A (en) * 2017-09-21 2018-03-06 深圳金融电子结算中心有限公司 A kind of mass data association rule mining method and system
CN108182294A (en) * 2018-01-31 2018-06-19 湖北工业大学 A kind of film based on frequent item set growth algorithm recommends method and system
CN108200084A (en) * 2018-01-31 2018-06-22 湖北工业大学 A kind of network security daily record based on grey wolf algorithm determines method and system
CN108280210A (en) * 2018-01-31 2018-07-13 湖北工业大学 A kind of traffic route based on fireworks algorithm determines method and system
CN109669967A (en) * 2018-12-13 2019-04-23 深圳市信义科技有限公司 A kind of space-time data association analysis method based on big data technology
CN110188131A (en) * 2019-06-03 2019-08-30 西北工业大学 Method and device for frequent pattern mining
CN110309200A (en) * 2019-06-26 2019-10-08 复旦大学 Improved FP-Growth correlation analysis method for abnormal product quality data
CN110334796A (en) * 2019-06-28 2019-10-15 北京科技大学 A method and device for mining association rules of social security events
CN110909238A (en) * 2019-10-25 2020-03-24 北京比财数据科技有限公司 Association mining algorithm considering competition mode
CN110990434A (en) * 2019-11-29 2020-04-10 国网四川省电力公司信息通信公司 Spark platform grouping and Fp-Growth association rule mining method
CN113360471A (en) * 2021-05-31 2021-09-07 浙大宁波理工学院 High-utility frequent pattern mining method based on distributed computation
CN113396409A (en) * 2019-03-01 2021-09-14 国际商业机器公司 Association rule mining system
CN113449868A (en) * 2021-07-09 2021-09-28 西安交通大学 Machining process parameter decision support knowledge mining method and system
CN114138860A (en) * 2021-11-23 2022-03-04 江西理工大学 Parallel frequent itemset mining method based on Spark
CN114328491A (en) * 2021-12-29 2022-04-12 中国工商银行股份有限公司 Data processing method and device
CN114547153A (en) * 2022-03-03 2022-05-27 浙江大学 Customized product implicit demand mining method and system based on data timeliness updating
CN115473933A (en) * 2022-10-10 2022-12-13 国网江苏省电力有限公司南通供电分公司 A Discovery Method of Association Service in Network System Based on Frequent Subgraph Mining
CN116383272A (en) * 2023-03-31 2023-07-04 厦门城市职业学院(厦门开放大学) Method and device for excavating regular paths in driving records

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665669B2 (en) * 2000-01-03 2003-12-16 Db Miner Technology Inc. Methods and system for mining frequent patterns
CN102289507A (en) * 2011-08-30 2011-12-21 王洁 Method for mining data flow weighted frequent mode based on sliding window

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665669B2 (en) * 2000-01-03 2003-12-16 Db Miner Technology Inc. Methods and system for mining frequent patterns
CN102289507A (en) * 2011-08-30 2011-12-21 王洁 Method for mining data flow weighted frequent mode based on sliding window

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘娜: ""基于MapReduce的数据挖掘算法在全国人口系统中的应用"", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 4, 15 April 2012 (2012-04-15) *
马丽生: ""最大频繁项目集挖掘算法研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 4, 15 October 2007 (2007-10-15) *

Cited By (57)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103761236B (en) * 2013-11-20 2017-02-08 同济大学 Incremental frequent pattern increase data mining method
CN103761236A (en) * 2013-11-20 2014-04-30 同济大学 Incremental frequent pattern increase data mining method
CN103678695A (en) * 2013-12-27 2014-03-26 中国科学院深圳先进技术研究院 Concurrent processing method and device
CN103678695B (en) * 2013-12-27 2018-05-11 中国科学院深圳先进技术研究院 A kind of method for parallel processing and device
CN104298778A (en) * 2014-11-04 2015-01-21 北京科技大学 Method and system for predicting quality of rolled steel product based on association rule tree
CN104298778B (en) * 2014-11-04 2017-07-04 北京科技大学 A kind of Forecasting Methodology and system of the steel rolling product quality based on correlation rule tree
CN104731852A (en) * 2014-12-16 2015-06-24 芜湖乐锐思信息咨询有限公司 Big data system
CN105589908A (en) * 2014-12-31 2016-05-18 中国银联股份有限公司 Association rule computing method for transaction set
CN104598992A (en) * 2015-01-12 2015-05-06 北京中交兴路车联网科技有限公司 Common route analysis method for vehicle
CN104636956B (en) * 2015-01-26 2018-02-13 沈阳师范大学 User behavior pattern gain method for digging based on move transaction sequence pattern
CN104636956A (en) * 2015-01-26 2015-05-20 沈阳师范大学 Method for gain mining of user behavior pattern based on mobile transaction sequence pattern
CN104834733A (en) * 2015-05-18 2015-08-12 成都博元科技有限公司 Big data mining and analyzing method
CN104834734A (en) * 2015-05-18 2015-08-12 成都博元科技有限公司 Efficient data analysis and processing method
CN105512322A (en) * 2015-12-18 2016-04-20 中国农业银行股份有限公司 Frequent item set generating method and device
CN105512322B (en) * 2015-12-18 2019-02-15 中国农业银行股份有限公司 The generation method and device of frequent item set
CN105760279A (en) * 2016-03-09 2016-07-13 北京国电通网络技术有限公司 Method and system for generating fault early warning relevance tree of distributed database cluster
CN105760279B (en) * 2016-03-09 2018-09-07 北京国电通网络技术有限公司 Distributed experiment & measurement system fault pre-alarming relevance tree generation method and system
CN107291734A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of method for digging of frequent item set, apparatus and system
CN107292186B (en) * 2016-03-31 2021-01-12 阿里巴巴集团控股有限公司 Model training method and device based on random forest
CN107292186A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 A kind of model training method and device based on random forest
US11276013B2 (en) 2016-03-31 2022-03-15 Alibaba Group Holding Limited Method and apparatus for training model based on random forest
CN105930457A (en) * 2016-04-21 2016-09-07 南开大学 Distributed architecture-based data flow frequent item mining method
CN106327323A (en) * 2016-08-19 2017-01-11 清华大学 Bank frequent item mode mining method and bank frequent item mode mining system
CN106503218A (en) * 2016-10-27 2017-03-15 北京邮电大学 A kind of parallelization Workflow association data find method
CN106598494B (en) * 2016-12-05 2022-07-29 东软集团股份有限公司 Data statistical method and device
CN106598494A (en) * 2016-12-05 2017-04-26 东软集团股份有限公司 Data statistical method and apparatus
CN106790468A (en) * 2016-12-10 2017-05-31 武汉白虹软件科技有限公司 A kind of distributed implementation method for analyzing user's WiFi event trace rules
CN106790468B (en) * 2016-12-10 2020-06-02 武汉白虹软件科技有限公司 Distributed implementation method for analyzing WiFi (Wireless Fidelity) activity track rule of user
CN106874396A (en) * 2017-01-16 2017-06-20 重庆大学 A kind of Frequent Pattern Mining method based on nonvolatile memory
CN107609110A (en) * 2017-09-13 2018-01-19 深圳大学 The method for digging and device of maximum various frequent mode based on classification tree
CN107609110B (en) * 2017-09-13 2020-12-04 深圳大学 Method and Device for Mining Maximum Diverse Frequent Patterns Based on Classification Trees
CN107766442B (en) * 2017-09-21 2019-02-01 深圳金融电子结算中心有限公司 A kind of mass data association rule mining method and system
CN107766442A (en) * 2017-09-21 2018-03-06 深圳金融电子结算中心有限公司 A kind of mass data association rule mining method and system
CN108280210B (en) * 2018-01-31 2020-10-09 湖北工业大学 Traffic route determination method and system based on firework algorithm
CN108182294A (en) * 2018-01-31 2018-06-19 湖北工业大学 A kind of film based on frequent item set growth algorithm recommends method and system
CN108280210A (en) * 2018-01-31 2018-07-13 湖北工业大学 A kind of traffic route based on fireworks algorithm determines method and system
CN108200084A (en) * 2018-01-31 2018-06-22 湖北工业大学 A kind of network security daily record based on grey wolf algorithm determines method and system
CN109669967A (en) * 2018-12-13 2019-04-23 深圳市信义科技有限公司 A kind of space-time data association analysis method based on big data technology
CN109669967B (en) * 2018-12-13 2022-04-15 深圳市信义科技有限公司 Big data technology-based spatio-temporal data correlation analysis method
CN113396409A (en) * 2019-03-01 2021-09-14 国际商业机器公司 Association rule mining system
CN110188131A (en) * 2019-06-03 2019-08-30 西北工业大学 Method and device for frequent pattern mining
CN110188131B (en) * 2019-06-03 2022-10-11 西北工业大学 Frequent pattern mining method and device
CN110309200A (en) * 2019-06-26 2019-10-08 复旦大学 Improved FP-Growth correlation analysis method for abnormal product quality data
CN110334796A (en) * 2019-06-28 2019-10-15 北京科技大学 A method and device for mining association rules of social security events
CN110909238A (en) * 2019-10-25 2020-03-24 北京比财数据科技有限公司 Association mining algorithm considering competition mode
CN110909238B (en) * 2019-10-25 2022-06-07 北京比财数据科技有限公司 Association mining algorithm considering competition mode
CN110990434B (en) * 2019-11-29 2023-04-18 国网四川省电力公司信息通信公司 Spark platform grouping and Fp-Growth association rule mining method
CN110990434A (en) * 2019-11-29 2020-04-10 国网四川省电力公司信息通信公司 Spark platform grouping and Fp-Growth association rule mining method
CN113360471A (en) * 2021-05-31 2021-09-07 浙大宁波理工学院 High-utility frequent pattern mining method based on distributed computation
CN113449868A (en) * 2021-07-09 2021-09-28 西安交通大学 Machining process parameter decision support knowledge mining method and system
CN114138860B (en) * 2021-11-23 2025-09-19 韶关学院 Spark-based parallel frequent item set mining method
CN114138860A (en) * 2021-11-23 2022-03-04 江西理工大学 Parallel frequent itemset mining method based on Spark
CN114328491A (en) * 2021-12-29 2022-04-12 中国工商银行股份有限公司 Data processing method and device
CN114547153B (en) * 2022-03-03 2024-12-24 浙江大学 Hidden demand mining method and system for customized products based on data timeliness update
CN114547153A (en) * 2022-03-03 2022-05-27 浙江大学 Customized product implicit demand mining method and system based on data timeliness updating
CN115473933A (en) * 2022-10-10 2022-12-13 国网江苏省电力有限公司南通供电分公司 A Discovery Method of Association Service in Network System Based on Frequent Subgraph Mining
CN116383272A (en) * 2023-03-31 2023-07-04 厦门城市职业学院(厦门开放大学) Method and device for excavating regular paths in driving records

Similar Documents

Publication Publication Date Title
CN103258049A (en) Association rule mining method based on mass data
CN101996102B (en) Method and system for mining data association rule
CN103425772B (en) A kind of mass data inquiry method with multidimensional information
Yan et al. Quegel: A general-purpose query-centric framework for querying big graphs
CN104767813A (en) Public bank big data service platform based on openstack
US20160253366A1 (en) Analyzing a parallel data stream using a sliding frequent pattern tree
CN105608135B (en) Data mining method and system based on Apriori algorithm
CN103761236A (en) Incremental frequent pattern increase data mining method
CN110309200A (en) Improved FP-Growth correlation analysis method for abnormal product quality data
Ahuja et al. State of big data analysis in the cloud
CN107330098B (en) Query method, computing node and query system for custom report
CN105302803A (en) Product BOM difference analyzing and synchronous updating method
CN108520035A (en) Query Processing Method of SPARQL Basic Graph Pattern Based on Star Decomposition
CN104731925A (en) MapReduce-based FP-Growth load balance parallel computing method
CN111723089A (en) Method and device for processing data based on columnar storage format
CN108875015A (en) A kind of Pruning strategy based on FpGrowth algorithm
Theeten et al. Chive: Bandwidth optimized continuous querying in distributed clouds
Lin et al. Mining high-utility sequential patterns from big datasets
dite Gassama et al. S-FPG: A parallel version of FP-Growth algorithm under Apache Spark™
CN106599122A (en) Parallel frequent closed sequence mining method based on vertical resolution
Raj et al. PartEclat: an improved Eclat-based frequent itemset mining algorithm on spark clusters using partition technique
Sanchez et al. Evaluating the scalability of big data frameworks
CN115221157A (en) Data processing method and apparatus, computer readable storage medium and electronic device
Huang et al. FIFA: A Forest-based Sliding Window Aggregation Scheme for Out-of-Order Data Streams
Zhang et al. A optimization algorithm for association rule based on spark platform

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130821

RJ01 Rejection of invention patent application after publication