CN103995869B - Data-caching method based on Apriori algorithm

Info

Publication number: CN103995869B
Application number: CN201410214776.3A
Authority: CN
Inventors: 张莉; 郭昆; 杨乐游
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2014-05-20
Filing date: 2014-05-20
Publication date: 2017-02-22
Anticipated expiration: 2034-05-20
Also published as: CN103995869A

Abstract

The present invention provides a data caching method based on the Apriori algorithm. A query log of condition attributes is established on disk, the query frequency of each data block is calculated, and the data blocks with high query frequency form a frequent data block set. The query frequency of each condition attribute in the frequent data block set is then calculated, and the condition attributes with high query frequency form a frequent condition attribute set. The Apriori algorithm is applied to this set, with query frequency mapped to the support of the Apriori algorithm, to obtain the frequent condition attribute group set. The data corresponding to the frequent condition attribute group set is cached in memory, and an index is built on the frequent condition attributes. The method significantly improves query efficiency in frequently queried regions, and caching condition attribute groups yields higher query efficiency than caching single condition attributes, thereby reducing the retrieval load on the database.

Description

A Data Cache Method Based on Apriori Algorithm

Technical Field

The invention belongs to the technical field of data query, and in particular relates to a data caching method based on the Apriori algorithm.

Background

In recent years, with the rapid development of the Internet and especially the rise of social applications such as Weibo and WeChat, data volumes have grown explosively; in 2011 the world formally entered the zettabyte (ZB) era. We are now living in the era of big data. Big data, however, is characterized by low value density and a wide variety of types, so querying massive data raises many problems. When the data volume is modest, a traditional relational database offers good performance and high stability and has stood the test of time. Once the data volume reaches a certain scale, however, the efficiency of a relational database becomes unacceptably low. In short, relational databases cannot meet the big data era's requirements for highly concurrent reads and writes, for efficient storage of and access to massive data, and for high scalability and availability.

These problems gave rise to a new technology, NoSQL. NoSQL, meaning "Not Only SQL", is a broad term for non-relational data stores. It ended the long-standing dominance of relational databases and ACID theory. NoSQL data stores do not require a fixed table schema and usually have no join operations, and for big data access they offer performance advantages that relational databases cannot match. However, current mainstream NoSQL databases mostly use the LIRS algorithm for their data caching mechanism, and LIRS cannot effectively gather statistics on data that is queried frequently over a longer period, so it cannot apply a targeted strategy for caching the data to be queried.

Summary of the Invention

To address the deficiencies of the prior art, the present invention provides a data caching method based on the Apriori algorithm.

The technical solution of the present invention is as follows.

A data caching method based on the Apriori algorithm comprises the following steps:

Step 1: Record on disk, day by day, the condition attributes that appear in user query statements over T days, creating T query logs (the user query content).

Step 2: Calculate the query frequency of each data block in the query logs and, according to the resulting frequencies, select the data blocks with high query frequency to form the frequent data block set.

Step 2.1: Determine the query count of the data in each data block in the T query logs.

Step 2.2: Normalize the query counts of the data in each data block. A recent-log ratio is set to separate recent log data from historical log data. When the query count of historical log data in a data block exceeds the upper threshold for historical log data, the count is set to that threshold; when the query count of recent log data in a data block exceeds the upper threshold for recent log data, the count is set to that threshold.

Step 2.3: Weight the normalized query counts. For each data block, compute the weighted sum of its normalized query counts over the T query logs and take the average; the result is the query frequency of the data block.

Step 2.4: According to the query frequency of each data block, select the data blocks with high query frequency, i.e. the frequent data blocks, which together form the frequent data block set.

Step 3: The condition attributes of the frequent data blocks form the condition attribute set.

Step 4: Calculate the query frequency of each condition attribute in the condition attribute set and, according to the resulting frequencies, select the condition attributes with high query frequency to form the frequent condition attribute set.

Step 4.1: Determine the query count of each condition attribute in the frequent data blocks in the T query logs.

Step 4.2: Normalize the query counts of each condition attribute. The recent-log ratio separates recent log condition attributes from historical log condition attributes. When the query count of a historical log condition attribute exceeds the upper threshold for historical log condition attributes, the count is set to that threshold; when the query count of a recent log condition attribute exceeds the upper threshold for recent log condition attributes, the count is set to that threshold.

Step 4.3: Weight the normalized condition attribute query counts. For each condition attribute, compute the weighted sum of its normalized query counts over the T days and take the average; the result is the query frequency of the condition attribute.

Step 4.4: According to the query frequency of each condition attribute, select the condition attributes with high frequency, i.e. the frequent condition attributes, which together form the frequent condition attribute set.

Step 5: Apply the Apriori algorithm to the frequent condition attribute set to obtain the frequent condition attribute group set. The query frequency of condition attributes is mapped to the support in the Apriori algorithm, and the frequent itemsets produced by the Apriori algorithm are the frequent condition attribute group set.

Step 6: Cache the data corresponding to the frequent condition attribute group set in memory and build an index on the frequent condition attributes in the frequent condition attribute set, which completes the data caching.

Step 7: When a client issues a data query, the query is handled according to its condition attributes. If all of the query's condition attributes are frequent condition attributes cached in memory, the query result is obtained directly. If only some of them are cached frequent condition attributes, the index on those attributes is used to query the disk database for the data that satisfies them, completing the query. If none of them is in the cached frequent condition attribute set, the data block is loaded from disk and queried.

The beneficial effects of the present invention are as follows. A new data caching method is proposed that, in combination with a NoSQL database, sets aside a cache area for frequent data columns in the memory of the data node. The method significantly improves query efficiency in frequently queried regions; queries on data in other regions are not processed in any way and are therefore unaffected. Caching condition attribute groups yields higher query efficiency than caching single condition attributes. A cache whose condition attribute groups contain a moderate number of condition attributes sacrifices part of the full-hit rate, but it is better at trimming intermediate records: it shrinks the intermediate result sets produced in memory by partial condition attribute hits and locates data quickly through the frequent condition attribute index, which reduces the retrieval load on the database and achieves higher query efficiency.

Description of Drawings

Fig. 1 shows how HBase, the runtime environment, splits a data table in an embodiment of the present invention;

Fig. 2 shows the improved data query process in an embodiment of the present invention;

Fig. 3 is a flow chart of the data caching method based on the Apriori algorithm in an embodiment of the present invention;

Fig. 4 is a flow chart of the handling of the different hit cases of query condition attributes in an embodiment of the present invention;

Fig. 5 compares the query efficiency of different caching schemes in an embodiment of the present invention;

Fig. 6 compares the condition attribute hit behavior of different caching schemes in an embodiment of the present invention.

Detailed Description

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

In this embodiment, query data and user query behavior are simulated in a Hadoop-HBase environment using Sina Weibo user data. T = 7, and the simulation data is divided into 7 equal parts to simulate query logs from different days.

HBase is a column-oriented NoSQL database that runs on top of the HDFS file system as part of the Hadoop project. For reads, HBase stores data by column; compared with row-oriented storage, this reduces the amount of redundant data read, improves read efficiency, and makes retrieval faster. For storage, HBase splits a large data table into several data regions, i.e. data blocks. Each region stores a number of the table's records in order, and merging the related regions yields the complete table. The HBase table-splitting process is shown in Fig. 1.

A region in HBase corresponds to the concept of a data block. With the data caching method of this embodiment, frequently queried data regions, i.e. frequent data blocks, are selected according to query activity, and the frequent condition attribute data in the regions with the highest frequency is cached in an in-memory buffer. When data in a region is accessed, different query operations are performed depending on how the query's condition attributes hit the in-memory cache. In the HBase environment the query process is as shown in Fig. 2: the client sends a query request to the region server, and the region server either returns the query result or queries further, depending on the case. If all of the query's condition attributes are frequent condition attributes cached in memory, the query result is obtained directly. If only some of them are cached frequent condition attributes, the index on those attributes is used to query the disk database for the data that satisfies them, completing the query. If none of them is in the cached frequent condition attribute set, the data block is loaded from disk and queried. The storage nodes at the Hadoop layer are responsible for loading data from disk and executing query operations.

The data caching method based on the Apriori algorithm of this embodiment is shown in Fig. 3 and comprises the following steps:

Step 1: Record on disk, day by day, the condition attributes that appear in user query statements over T days, creating T query logs (the user query content).

In this embodiment the condition attributes are items of personal information of Sina Weibo users: age, gender, location, registration date, and online time. Seven query logs are created, representing the user query records of the last 7 days.
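As an illustration of Step 1, the per-day query logs can be thought of as lists of the condition attribute sets used by each query. The sketch below is only a minimal model: the log layout, the sample queries, and the attribute identifiers such as `registration_date` are hypothetical and do not come from the patent.

```python
from collections import Counter

# Hypothetical log layout: one list per day; each entry is the set of
# condition attributes that appeared in one user query.
query_logs = [
    [{"age", "gender"}, {"location"}],                       # day t = 0
    [{"location", "age"}, {"registration_date"}, {"age"}],   # day t = 1
]

def count_attributes(day_log):
    """Count how many queries of one day mention each condition attribute."""
    counts = Counter()
    for attrs in day_log:
        counts.update(attrs)
    return counts

# daily_counts[t][attribute] plays the role of Count(t) for that attribute.
daily_counts = [count_attributes(day) for day in query_logs]
print(daily_counts[1]["age"])
```

A `Counter` returns 0 for attributes that never appear on a given day, which keeps later per-day arithmetic simple.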

Step 2: Calculate the query frequency of each data block in the query logs and, according to the resulting frequencies, select the data blocks with high query frequency to form the frequent data block set block_fd. Suppose there are 3 data blocks, block₁, block₂, and block₃. The query frequency of block₁ is calculated as follows.

Step 2.1: Determine the query count of the data in each data block in the 7 query logs.

The query count Count(t) of block₁ on each of the 7 days is obtained by counting entries in the query logs. For t = 0, 1, 2, 3, 4, 5, 6, the query counts Count(t) for block₁ are 1350, 1433, 1236, 1546, 1354, 1029, and 1175, respectively.

Step 2.2: Normalize the query counts Count(t) of the data in each data block. A recent-log ratio q_rec separates recent log data from historical log data; q_rec is set according to the user's needs, with 0 < q_rec < 1, and this embodiment uses q_rec = 0.3. With T = 7, the query logs of the first 5 days (t = 0 to 4) are historical log data and those of the last 2 days (t = 5, 6) are recent log data. The original text states the boundary as t < q_rec × T for historical data and t ≥ q_rec × T for recent data, but the 5/2 split and the example below correspond to a boundary of (1 − q_rec) × T = 4.9.

For historical log data, an upper threshold Max_his on the query count is set; Max_his should normally be 1.5 times the average query count of all records, and here Max_his = 1400. When the query count of the data block in a historical log exceeds Max_his, the count is set to Max_his. For recent log data, an upper threshold Max_rec is set; Max_rec should normally be 2 times the average query count of all records, and here Max_rec = 1700. When the query count of the data block in a recent log exceeds Max_rec, the count is set to Max_rec. The query counts Count(t) are normalized according to formula (1):

$$
Count_{std}(t) =
\begin{cases}
\min\bigl(Count(t),\ Max_{his}\bigr), & \text{if day } t \text{ is historical log data} \\
\min\bigl(Count(t),\ Max_{rec}\bigr), & \text{if day } t \text{ is recent log data}
\end{cases}
\tag{1}
$$

In Step 2.1 the query counts for t = 1 and t = 3 exceed the upper threshold for historical log data, so Count(1) = Count(3) = Max_his = 1400, and Count_std1(t) is 1350, 1400, 1236, 1400, 1354, 1029, 1175.
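The clamping in Step 2.2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name is hypothetical, and it assumes that the number of recent days is q_rec × T rounded to the nearest integer (2 days here), which matches the 5/2 split used in the example.

```python
def normalize_counts(counts, q_rec, max_his, max_rec):
    """Clamp daily query counts to the historical or recent upper threshold.

    Assumption: the last round(q_rec * T) days are treated as recent logs.
    """
    T = len(counts)
    n_recent = round(q_rec * T)      # 0.3 * 7 = 2.1 -> 2 recent days
    first_recent = T - n_recent      # days t >= first_recent are recent
    return [
        min(c, max_rec if t >= first_recent else max_his)
        for t, c in enumerate(counts)
    ]

block1_counts = [1350, 1433, 1236, 1546, 1354, 1029, 1175]
count_std1 = normalize_counts(block1_counts, q_rec=0.3, max_his=1400, max_rec=1700)
print(count_std1)  # [1350, 1400, 1236, 1400, 1354, 1029, 1175]
```

Only the counts for t = 1 and t = 3 are changed, which agrees with the normalized values listed in the text.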

Normalizing the query counts avoids, to some extent, an inflated query frequency caused by an unusually high number of queries on individual days.

Step 2.3: Weight the normalized data block query counts Count_std1(t). Compute the weighted sum of the normalized query counts of the data block over the 7 query logs and take the average, which gives the query frequency FD_block of the data block:

$$
FD_{block} = \frac{1}{T} \sum_{t=0}^{T-1} W(t)\, Count_{std1}(t)
$$

where Count_std1(t) is the normalized query count of the data in the data block and W(t) is the weighting function, which is an increasing function of t.

This embodiment uses as the weighting function the first-quadrant part of a monotonically increasing linear function, W(t) = t + 1, where 0 ≤ t ≤ 6.

Calculate the frequency of block₁:

$$
FD_{block_1} = \frac{1350 \cdot 1 + 1400 \cdot 2 + 1236 \cdot 3 + 1400 \cdot 4 + 1354 \cdot 5 + 1029 \cdot 6 + 1175 \cdot 7}{7} = \frac{34627}{7} \approx 4946.71
$$
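The weighted average of Step 2.3 can be checked numerically with a short sketch. The function name is hypothetical. The reading "weighted sum divided by T" is an interpretation of the text; it is supported by the fact that the same computation applied to the normalized age counts of Step 4 (130, 135, 125, 140, 110, 115, 120) reproduces the age frequency 487.8571 reported in Step 4.4.

```python
def query_frequency(count_std, weight=lambda t: t + 1):
    """Average of the weighted normalized counts, with W(t) = t + 1 by default."""
    T = len(count_std)
    return sum(weight(t) * c for t, c in enumerate(count_std)) / T

# Normalized counts of block 1 from Step 2.2.
fd_block1 = query_frequency([1350, 1400, 1236, 1400, 1354, 1029, 1175])

# Normalized counts of the age attribute from Step 4.2, used as a cross-check.
fd_age = query_frequency([130, 135, 125, 140, 110, 115, 120])

print(round(fd_block1, 2), round(fd_age, 4))
```

Because W(t) grows with t, a query on the most recent day contributes seven times as much as a query on the oldest day.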

Step 2.4: According to the query frequency of each data block, select the data blocks with high query frequency, i.e. the frequent data blocks, which together form the frequent data block set. Here the query frequency of block₂ is 5973.13648 and that of block₃ is 5294.65. The data blocks are sorted by the magnitude of their query frequency, and the highest-frequency blocks whose cumulative size stays within 1 GB of memory are selected; block₂ is a frequent data block.
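One possible reading of the selection in Step 2.4 is a greedy pass over the blocks in descending frequency order that stops once the memory budget would be exceeded. The sketch below follows that reading. The block sizes are hypothetical, since the patent does not give them, and the block₁ frequency is the value obtained from the Step 2.3 weighting rather than a number stated in the text.

```python
def select_frequent_blocks(blocks, budget_mb):
    """Pick blocks in descending frequency order while the total size fits."""
    chosen, used = [], 0
    for name, fd, size_mb in sorted(blocks, key=lambda b: b[1], reverse=True):
        if used + size_mb > budget_mb:
            break                    # cumulative size would exceed the budget
        chosen.append(name)
        used += size_mb
    return chosen

blocks = [
    # (name, query frequency, size in MB); sizes are hypothetical
    ("block1", 4946.71, 600),
    ("block2", 5973.13648, 700),
    ("block3", 5294.65, 500),
]
frequent_blocks = select_frequent_blocks(blocks, budget_mb=1024)
print(frequent_blocks)  # ['block2']
```

With these assumed sizes only block₂ fits within 1 GB, which is consistent with the text naming block₂ as a frequent data block; other sizes would give a different set.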

Step 3: The condition attributes of the frequent data blocks form the condition attribute set. In this embodiment the condition attribute set consists of the user's age, gender, location, registration date, and online time.

Step 4: Calculate the query frequency of each condition attribute in the condition attribute set and, according to the resulting frequencies, select the condition attributes with high query frequency to form the frequent condition attribute set. Taking the age condition attribute as an example, the query frequency is calculated as follows.

Step 4.1: Determine the query count of each condition attribute in the frequent data blocks in the 7 query logs. In this embodiment, for t = 0, 1, 2, 3, 4, 5, 6, the counts of queries involving the age condition attribute are 130, 135, 125, 160, 110, 115, and 120, respectively.

Step 4.2: Normalize the condition attribute query counts. With the recent-log ratio q_rec = 0.3, the condition attributes in the query logs of the first 5 days are historical log condition attributes and those in the query logs of the last 2 days are recent log condition attributes. For historical log condition attributes, an upper threshold Max_his on the query count is set, with Max_his = 140; when the query count of a condition attribute exceeds this threshold, the count is set to the threshold. For recent log condition attributes, an upper threshold Max_rec is set, with Max_rec = 150; when the query count of a condition attribute exceeds this threshold, the count is set to the threshold. The condition attribute query counts Count(t) are normalized according to formula (1). Since Count(3) = 160 from Step 4.1 exceeds the upper threshold for historical logs, Count(3) is set to Max_his = 140. Count_std2(t) is then 130, 135, 125, 140, 110, 115, 120.

Step 4.3: Weight the normalized condition attribute query counts. For each condition attribute, compute the weighted sum of its normalized query counts over the 7 days and take the average, which gives the query frequency FD_sa of the condition attribute:

$$
FD_{sa} = \frac{1}{T} \sum_{t=0}^{T-1} W(t)\, Count_{std2}(t)
$$

The weighting function is again the first-quadrant part of a monotonically increasing linear function, W(t) = t + 1, where 0 ≤ t ≤ 6. Calculate the frequency of the age condition attribute:

$$
FD_{age} = \frac{130 \cdot 1 + 135 \cdot 2 + 125 \cdot 3 + 140 \cdot 4 + 110 \cdot 5 + 115 \cdot 6 + 120 \cdot 7}{7} = \frac{3415}{7} \approx 487.8571
$$

Step 4.4: According to the query frequency of each condition attribute, select the condition attributes with high frequency, i.e. the frequent condition attributes, which together form the frequent condition attribute set. The query frequency of age is 487.8571, of gender 539.2857143, of location 632.1428571, of registration date 217.1429, and of online time 103.4923.

Step 5: Apply the Apriori algorithm to the frequent condition attribute set to obtain the frequent condition attribute group set, with the query frequency of condition attributes serving as the support in the Apriori algorithm.

Step 5.1: Let A₁ = ∅ and let k be the length of the longest frequent condition attribute groups found so far. For k = 1, A₁ denotes the set of frequent condition attribute groups of length 1.

Step 5.2: Collect the query frequency of each condition attribute in the frequent condition attribute set. The frequencies of age, gender, location, registration date, and online time are 487.8571, 539.2857, 632.1428, 217.1429, and 103.4923, respectively. Set the minimum frequency threshold min_fd = 175 and put every condition attribute whose frequency is greater than or equal to the threshold into A₁. Age, gender, location, and registration date meet the threshold; online time (103.4923) does not and is excluded. This yields A₁, the set of frequent condition attribute groups of length 1.

Step 5.3: Sort the elements of A₁ lexicographically by condition attribute name and join A₁ with itself (natural join) to obtain C₂, the candidate set of frequent condition attribute groups of length 2. C₂ contains {location, registration date}, {location, age}, {location, gender}, {age, gender}, {age, registration date}, and {gender, registration date}.

Step 5.4: Let A₂ = ∅. Look up each condition attribute group in C₂, scan the frequent condition attribute sets, and compute the query frequency of each group in C₂. The frequencies of {location, registration date}, {location, age}, {location, gender}, {age, gender}, {age, registration date}, and {gender, registration date} are 202.14285, 339.2857, 401.4285, 321.4285, 98.4957, and 135.671, respectively. The groups whose frequency is greater than or equal to the minimum frequency threshold, namely {location, registration date}, {location, age}, {location, gender}, and {age, gender}, are put into A₂, the set of frequent condition attribute groups of length 2.

Step 5.5: Sort the elements of A₂ lexicographically by condition attribute name and join A₂ with itself to obtain C₃, the candidate set of frequent condition attribute groups of length 3. C₃ contains the group {location, gender, age}.

Step 5.6: Let A₃ = ∅. Look up each condition attribute group in C₃, scan the frequent condition attribute sets, and compute the query frequency of each group in C₃. The frequency of {location, gender, age} is 183.5714286. Since this is greater than or equal to the minimum frequency threshold, {location, gender, age} is put into A₃, the set of frequent condition attribute groups of length 3.

Step 5.7: Sort the elements of A₃ lexicographically by condition attribute name and join A₃ with itself to obtain C₄, the candidate set of frequent condition attribute groups of length 4. Here C₄ = ∅.

Step 5.8: Obtain the set A of frequent query condition attribute groups of all lengths in the query logs, where A = ∪_k A_k = A₁ ∪ A₂ ∪ … ∪ A_k:

The frequent condition attributes of length 1 are age, gender, location, and registration date, with frequencies 487.8571, 539.2857, 632.1428, and 217.1429, respectively.

The frequent condition attribute groups of length 2 are {location, registration date}, {location, age}, {location, gender}, and {age, gender}, with frequencies 202.14285, 339.2857, 401.4285, and 321.4285, respectively.

The frequent condition attribute group of length 3 is {location, gender, age}, with frequency 183.5714286.
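The level-wise search of Steps 5.1 to 5.8 can be reproduced with a short sketch that uses the frequencies listed in the text as a lookup table. Several points are assumptions made for illustration: the attribute identifiers are hypothetical, a candidate group missing from the table is treated as infrequent, and candidates are generated by a union-based join followed by the usual Apriori pruning instead of the sorted join described in the text (after pruning, both produce the same candidates).

```python
from itertools import combinations

MIN_FD = 175  # minimum frequency threshold min_fd from Step 5.2

# Query frequencies reported in Steps 5.2, 5.4 and 5.6.
FD = {frozenset(group): fd for group, fd in {
    ("age",): 487.8571,
    ("gender",): 539.2857,
    ("location",): 632.1428,
    ("registration_date",): 217.1429,
    ("online_time",): 103.4923,
    ("location", "registration_date"): 202.14285,
    ("location", "age"): 339.2857,
    ("location", "gender"): 401.4285,
    ("age", "gender"): 321.4285,
    ("age", "registration_date"): 98.4957,
    ("gender", "registration_date"): 135.671,
    ("location", "gender", "age"): 183.5714286,
}.items()}

def apriori_levels(fd, min_fd):
    """Return [A1, A2, ...], the frequent attribute groups of each length."""
    level = [g for g in fd if len(g) == 1 and fd[g] >= min_fd]
    levels, k = [], 1
    while level:
        levels.append(sorted(level, key=sorted))
        prev = set(level)
        # Join: unions of two frequent k-groups that have k + 1 attributes.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        level = [c for c in candidates if fd.get(c, 0.0) >= min_fd]
        k += 1
    return levels

levels = apriori_levels(FD, MIN_FD)
for k, groups in enumerate(levels, start=1):
    print(k, [sorted(g) for g in groups])
```

The search ends after length 3 because a single frequent group of length 3 cannot be joined into any candidate of length 4, which corresponds to C₄ = ∅ in Step 5.7.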

Step 6: Cache the data corresponding to the frequent condition attribute group set in memory and build an index on the frequent condition attributes. In this embodiment only 3 columns of condition attribute data are cached in memory, and three caching schemes are considered. Scheme 1 caches the age, gender, and location data. Scheme 2 caches the {location, age} and {location, gender} data; because the location condition attribute is repeated, it takes no additional memory. Scheme 3 caches the {location, gender, age} data.
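The caching and indexing of frequent condition attributes can be pictured as keeping only the selected columns in memory together with a value-to-row-key index per column. The sketch below is a toy model under stated assumptions: the rows, row keys, and function name are hypothetical, and it does not use or imitate any HBase API.

```python
from collections import defaultdict

# Hypothetical user rows keyed by row key.
rows = {
    "u1": {"age": 25, "gender": "F", "location": "Beijing", "online_time": 120},
    "u2": {"age": 31, "gender": "M", "location": "Shenyang", "online_time": 45},
    "u3": {"age": 25, "gender": "M", "location": "Beijing", "online_time": 300},
}
cached_attrs = ("location", "gender", "age")   # the length-3 frequent group

def build_cache(rows, attrs):
    """Cache only the chosen columns and index each one by value."""
    cache = {key: {a: row[a] for a in attrs} for key, row in rows.items()}
    index = {a: defaultdict(set) for a in attrs}
    for key, cols in cache.items():
        for attr, value in cols.items():
            index[attr][value].add(key)
    return cache, index

cache, index = build_cache(rows, cached_attrs)

# A query on cached attributes is answered by intersecting index entries.
beijing_aged_25 = index["location"]["Beijing"] & index["age"][25]
print(sorted(beijing_aged_25))  # ['u1', 'u3']
```

Columns outside the frequent group, such as `online_time` here, stay on disk only, which is why a query that touches them cannot be answered from the cache alone.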

Step 7: When a client needs to query data, the query is executed according to the condition attributes of the data being queried. If all of the query's condition attributes are frequent condition attributes cached in memory, the query result is obtained directly. If only some of the query's condition attributes are frequent condition attributes cached in memory, the indexes of those frequent condition attributes are used to query the disk database for the data satisfying that part of the conditions, completing the query. If none of the query's condition attributes is in the frequent condition attribute set cached in memory, that is, a cache miss, the data blocks are loaded from disk to perform the query.

For an actual data query, there are three possible hit outcomes, as shown in Figure 4.

When a user queries on the date-of-birth condition attribute, that attribute is not cached in memory. This is the case in which none of the query's condition attributes is in the memory cache, so the data blocks are loaded from disk to perform the query.

When a user queries on the region-date of birth condition attribute group, only part of the query's condition attributes is in the memory cache. The index on the region condition attribute is then used to query the disk database for the data satisfying the region condition, completing the query.

When a user queries on the region condition attribute, all of the query's condition attributes are in the memory cache. The relevant data is retrieved directly from memory and the result is returned.
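The three hit outcomes described above can be sketched as a simple classification. This is an illustrative sketch only: it assumes the cache holds the age, gender, and region columns, and the function name `classify_query`, the constant `CACHED`, and the attribute spellings are hypothetical.

```python
CACHED = {"age", "gender", "region"}  # assumed cached frequent condition attributes


def classify_query(query_attrs, cached_attrs=CACHED):
    """Classify a query by how many of its condition attributes are cached.

    "full"    -> answer directly from memory
    "partial" -> use the cached attributes' indexes, then query the disk database
    "miss"    -> load data blocks from disk
    """
    query = set(query_attrs)
    if not query:
        raise ValueError("a query needs at least one condition attribute")
    hit = query & set(cached_attrs)
    if hit == query:
        return "full"
    if hit:
        return "partial"
    return "miss"


print(classify_query(["birth_date"]))            # miss
print(classify_query(["region", "birth_date"]))  # partial
print(classify_query(["region"]))                # full
```

The three calls correspond to the date-of-birth, region-date of birth, and region examples in the text.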

Figure 5 compares the average query efficiency under the different caching schemes. Before this method was applied, an ordinary SQL SELECT statement took about 1500 milliseconds on average.

The data caching method of this embodiment significantly improves query efficiency in the frequent regions. Queries on data in other regions are unaffected, because no processing is applied there. Caching two- and three-attribute groups yields higher query efficiency than caching single condition attributes. In actual querying, queries conditioned on a single attribute are relatively infrequent, so the full hit rate of such a cache is unsatisfactory. In addition, compared with multi-attribute caching, a single-attribute cache cannot eliminate irrelevant records well, so the filtered record set is large, which imposes a large time cost on the subsequent index verification in the database.

Figure 6 compares the query hit outcomes. Although the full hit rate of the two-attribute group cache differs considerably from that of the three-attribute group cache, its partial hit rate is as high as 63.93%. A cache whose attribute groups contain an intermediate number of condition attributes sacrifices part of the full hit rate, but it is better at trimming intermediate records: it shrinks the intermediate result set produced in memory by partial attribute hits and uses the frequent condition attribute indexes to locate data quickly. This reduces the retrieval load on the database and yields higher query efficiency. The two-attribute group cache, whose average query efficiency is slightly higher than that of the three-attribute group cache, is exactly such a case.

Claims

1. A data caching method based on the Apriori algorithm, characterized in that it comprises the following steps:

Step 1: Record on disk the condition attributes in users' query statements over T days, and establish T query logs, one per day, i.e., the content of the users' queries;

Step 2: Calculate the query frequency of each data block in the query logs, and, according to the magnitude of the data block query frequencies obtained, obtain multiple data blocks with high query frequency, which form the frequent data block set;

Step 3: The condition attributes of each frequent data block form the condition attribute set;

Step 4: Calculate the query frequency of each condition attribute in the condition attribute set, and, according to the magnitude of the condition attribute query frequencies obtained, obtain multiple condition attributes with high query frequency, which form the frequent condition attribute set;

Step 5: Obtain the frequent condition attribute group set using the Apriori algorithm and the frequent condition attribute set, where the query frequency of a condition attribute is mapped to the support in the Apriori algorithm, and the frequent itemsets obtained by the Apriori algorithm constitute the frequent condition attribute group set;

Step 6: Cache the data corresponding to the frequent condition attribute group set in memory, and build indexes on the frequent condition attributes in the frequent condition attribute set, completing the data caching;

Step 7: When a client needs to query data, perform the query according to the condition attributes of the data to be queried: if all of the condition attributes of the data to be queried are frequent condition attributes cached in memory, the query result is obtained directly; if some of the condition attributes of the data to be queried are frequent condition attributes cached in memory, the indexes of those frequent condition attributes are used to query the disk database for the data satisfying that part of the condition attributes, completing the query; if none of the condition attributes of the data to be queried is in the frequent condition attribute set cached in memory, the data blocks are loaded from disk to perform the query.

2. The data caching method based on the Apriori algorithm according to claim 1, characterized in that step 2 is carried out as follows:

Step 2.1: Determine the number of queries on the data in each data block in the T query logs;

Step 2.2: Normalize the number of queries on the data in each data block: set a recent-log ratio to distinguish recent log data from historical log data; when the number of queries on historical log data in a data block exceeds the upper-limit threshold for the number of queries on historical log data, the number of queries on that historical log data is set to that upper-limit threshold; when the number of queries on recent log data in a data block exceeds the upper-limit threshold for the number of queries on recent log data, the number of queries on that recent log data is set to that upper-limit threshold;

Step 2.3: Weight the normalized numbers of queries on the data in the data blocks: for each data block, take the weighted sum of the normalized numbers of queries on its data across the T query logs and then average it, which gives the query frequency of each data block;

Step 2.4: Select multiple data blocks with high query frequency according to the magnitude of each data block's query frequency, i.e., the frequent data blocks; the frequent data blocks form the frequent data block set.

3. The data caching method based on the Apriori algorithm according to claim 1, characterized in that step 4 is carried out as follows:

Step 4.1: Determine the number of queries on each condition attribute of the frequent data blocks in the T query logs;

Step 4.2: Normalize the number of queries on each condition attribute: distinguish recent-log condition attributes from historical-log condition attributes according to the recent-log ratio; when the number of queries on a historical-log condition attribute exceeds the upper-limit threshold for the number of queries on historical-log condition attributes, the number of queries on that historical-log condition attribute is set to that upper-limit threshold; when the number of queries on a recent-log condition attribute exceeds the upper-limit threshold for the number of queries on recent-log condition attributes, the number of queries on that recent-log condition attribute is set to that upper-limit threshold;

Step 4.3: Weight the normalized numbers of queries on the condition attributes: for each condition attribute, take the weighted sum of its normalized numbers of queries over the T days and then average it, which gives the query frequency of each condition attribute;

Step 4.4: According to the query frequencies obtained for the condition attributes, select multiple condition attributes with high frequency, i.e., the frequent condition attributes; the frequent condition attributes form the frequent condition attribute set.